US11321092B1 - Tensor-based memory access - Google Patents

Tensor-based memory access

Info

Publication number
US11321092B1
US11321092B1
Authority
US
United States
Prior art keywords
instruction
coordinates
tensor
dimensional
dimensional array
Prior art date
Legal status
Active, expires
Application number
US16/170,069
Inventor
Shlomo Raikin
Sergei Gofman
Ran Halutz
Evgeny Spektor
Amos Goldman
Ron Shalev
Current Assignee
Habana Labs Ltd
Original Assignee
Habana Labs Ltd
Priority date
Filing date
Publication date
Application filed by Habana Labs Ltd filed Critical Habana Labs Ltd
Priority to US16/170,069
Assigned to Habana Labs Ltd. reassignment Habana Labs Ltd. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: GOLDMAN, AMOS, RAIKIN, SHLOMO, SPEKTOR, EVGENY, HALUTZ, RAN, GOFMAN, SERGEI, SHALEV, RON
Application granted granted Critical
Publication of US11321092B1


Classifications

    • G PHYSICS
        • G06 COMPUTING OR CALCULATING; COUNTING
            • G06F ELECTRIC DIGITAL DATA PROCESSING
                • G06F 9/00 Arrangements for program control, e.g. control units
                    • G06F 9/06 Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
                        • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
                            • G06F 9/30003 Arrangements for executing specific machine instructions
                                • G06F 9/30007 Arrangements for executing specific machine instructions to perform operations on data operands
                                    • G06F 9/3001 Arithmetic instructions
                                    • G06F 9/30025 Format conversion instructions, e.g. Floating-Point to Integer, decimal conversion
                                    • G06F 9/30032 Movement instructions, e.g. MOVE, SHIFT, ROTATE, SHUFFLE
                                    • G06F 9/30036 Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
                                        • G06F 9/30038 Instructions to perform operations on packed data using a mask
                                • G06F 9/3004 Arrangements for executing specific machine instructions to perform operations on memory
                                    • G06F 9/30043 LOAD or STORE instructions; Clear instruction
                            • G06F 9/30098 Register arrangements
                                • G06F 9/3012 Organisation of register space, e.g. banked or distributed register file
                                    • G06F 9/3013 Organisation of register space according to data content, e.g. floating-point registers, address registers
                            • G06F 9/30145 Instruction analysis, e.g. decoding, instruction word fields
                            • G06F 9/3017 Runtime instruction translation, e.g. macros
                                • G06F 9/30174 Runtime instruction translation for non-native instruction set, e.g. Javabyte, legacy code
                            • G06F 9/30181 Instruction operation extension or modification
                                • G06F 9/30196 Instruction operation extension or modification using decoder, e.g. decoder per instruction set, adaptable or programmable decoders
                            • G06F 9/34 Addressing or accessing the instruction operand or the result; Formation of operand address; Addressing modes
                                • G06F 9/345 Addressing or accessing multiple operands or results
                                    • G06F 9/3455 Addressing of multiple operands or results using stride
                                • G06F 9/35 Indirect addressing

Definitions

  • the present invention relates generally to processor architectures, and particularly to tensor-based memory access in a processor.
  • Vector processing is a common operation for many applications such as deep learning.
  • Vector Processors may read some or all input data from memory, and, likewise, may store output data in memory. Consequently, vector processing may involve accessing memory for input and/or output.
  • U.S. Pat. No. 7,543,119 describes a vector processing system using a System-On-a-Chip (SOC) implementation technique.
  • One or more scalar processors (or cores) operate in conjunction with a vector processor, and the processors collectively share access to a plurality of memory interfaces coupled to Dynamic Random Access read/write Memories (DRAMs).
  • U.S. Pat. No. 9,262,165 describes a vector processor including an instruction fetching unit configured to acquire an instruction, a decoding/issuing unit configured to decode and issue the instruction, an operation group configured to include a plurality of operation units, and a register configured to store the element data column.
  • An embodiment of the present invention that is described herein provides a processor including an internal memory and a processing circuitry.
  • the internal memory is configured to store a definition of a multi-dimensional array stored in an external memory and indices that specify elements of the multi-dimensional array in terms of multi-dimensional coordinates of the elements within the array.
  • the processing circuitry is configured to execute instructions in accordance with an Instruction Set Architecture (ISA) defined for the processor. At least some of the instructions in the ISA access the multi-dimensional array by operating on the multi-dimensional coordinates specified in the indices.
  • the processing circuitry is configured to execute at least one instruction that translates between the multi-dimensional coordinates of an element of the array and an address in which the element is stored in the external memory.
  • the processing circuitry is configured to execute an instruction that accesses an element of the array based on the multi-dimensional coordinates of the element.
  • the processing circuitry is configured to execute an instruction that performs a mathematical operation between sets of multi-dimensional coordinates.
  • the instructions sum corresponding coordinates of two sets.
  • the processing circuitry is configured to execute an instruction that performs a permutation among the multi-dimensional coordinates of an element of the array. In other embodiments, the processing circuitry is configured to identify that an executed tensor-access instruction exceeds a bound of the multi-dimensional array. In an embodiment, the processing circuitry is configured to return a padding value as a result of the tensor-access instruction in response to identifying that the tensor-access instruction exceeds the bound.
  • a method including storing in an internal memory of a processor a definition of a multi-dimensional array stored in an external memory, and indices that specify elements of the multi-dimensional array in terms of multi-dimensional coordinates of the elements within the array.
  • instructions are executed in accordance with an Instruction Set Architecture (ISA) defined for the processor. At least some of the instructions in the ISA access the multi-dimensional array by operating on the multi-dimensional coordinates specified in the indices.
  • FIG. 1 is a block diagram that schematically illustrates the architecture of a processor, in accordance with embodiments of the present invention
  • FIG. 2 is an illustration that schematically illustrates an example of a 2D tensor and Index Register File (IRF) entries that point to locations in the tensor, in accordance with embodiments of the present invention
  • FIG. 3 is a block diagram that schematically illustrates an ALU that implements dedicated instructions for manipulation and processing of tensor indexes, in accordance with embodiments of the present invention.
  • FIG. 4 is an illustration that schematically illustrates example out-of-bound tensor accesses, in accordance with embodiments of the present invention.
  • Embodiments of the present invention that are described herein provide improved methods and apparatus for processing multi-dimensional arrays in processors.
  • Multi-dimensional arrays are also referred to as tensors, and both terms are used interchangeably herein.
  • Tensors may comprise, for example, two-dimensional arrays such as digital images, as well as data structures having more than two dimensions.
  • a processor stores one or more multi-dimensional arrays in an external memory.
  • the processor supports an Instruction Set Architecture (ISA), which specifies various tensor-access instructions.
  • the tensor-access instructions manipulate elements of multi-dimensional arrays by operating directly on the multi-dimensional coordinate values of the elements (as opposed to operating on the addresses in which the elements are stored in the external memory).
  • the processor manipulates tensor elements by dedicated hardware, achieving higher speed than software-based solutions.
  • the processor stores in its internal memory a definition of each array that specifies, for example, the number of dimensions of the tensor, and the address of the first tensor element in external memory.
  • the processor further stores in an internal memory indices that specify elements of the multi-dimensional arrays in terms of multi-dimensional coordinates of the elements.
  • the tensor-access instructions operate on the indices.
  • Example tensor-access instructions include instructions that calculate the address in the external memory of a tensor element, and instructions that convert an address in the external memory to tensor coordinates.
  • the number of dimensions is limited to five; in other embodiments the number of dimensions may be limited to any other number; and in yet other embodiments the number of dimensions may not be limited.
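The coordinate-to-address and address-to-coordinate translations mentioned above can be sketched in software with a stride vector derived from the tensor's dimension sizes. This is an illustrative model, not the patent's hardware implementation; the helper names, the row-major layout, and the `elem_size` parameter are assumptions.

```python
# Illustrative sketch of coordinate<->address translation for a
# row-major tensor; 'base' and 'strides' would come from a tensor
# descriptor. Names and layout are hypothetical.

def coords_to_address(coords, base, strides, elem_size=1):
    """Map multi-dimensional coordinates to a flat memory address."""
    offset = sum(c * s for c, s in zip(coords, strides))
    return base + offset * elem_size

def address_to_coords(addr, base, strides, elem_size=1):
    """Invert the mapping: recover coordinates from a flat address."""
    offset = (addr - base) // elem_size
    coords = []
    for s in strides:  # strides ordered slowest-changing dimension first
        coords.append(offset // s)
        offset %= s
    return coords

# Strides for a 4x8x16 tensor, slowest dimension first.
strides = [8 * 16, 16, 1]
addr = coords_to_address([2, 3, 5], base=1000, strides=strides)
# addr = 1000 + 2*128 + 3*16 + 5 = 1309
```

The two functions are exact inverses for in-bound coordinates, which is what lets the ISA offer both translation directions over the same descriptor data.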
  • a vector processor may comprise a Scalar Register File (SRF), and an Index Register File (IRF).
  • the IRF comprises multiple entries, each having a number of fields equal to the number of tensor dimensions (DIMS) supported by the processor (e.g., if the processor supports 5-dimensional tensors, the number of fields is five, numbered DIM4, DIM3, DIM2, DIM1 and DIM0).
  • each field of an IRF entry holds the index of the tensor element in the relevant dimension that will be used by a load or store operation.
  • a 5-field IRF entry of {2, 3, 5, 7, 16} points to an element with coordinates (2, 3, 5, 7, 16) in a 5-dimensional tensor space.
  • the lowest field (DIM0) is the fastest-changing dimension (with reference to tensor processing order), while the highest field (DIM4) is the slowest-changing dimension.
  • the x axis (DIM0) may be the fastest-changing dimension and y axis (DIM1) the slowest-changing dimension.
  • the dimensions can be swapped (as will be described hereinbelow). In this case, a faster-changing dimension can become slower, and a slower-changing dimension can become faster.
  • the vector processor further comprises a Tensor Descriptor Memory (TDM), which stores tensor information (for example tensor base address, padding value, dimension offsets, strides and sizes).
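The descriptor fields listed above can be pictured as a small record. The dataclass below is an illustrative model only; the field names and types are assumptions, not the patent's TDM layout.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TensorDescriptor:
    # Fields named in the text: tensor base address, padding value,
    # per-dimension offsets, strides and sizes. Layout is illustrative.
    base_address: int
    padding_value: int
    offsets: List[int]
    strides: List[int]
    sizes: List[int]

# A 2-D (image-like) tensor of 480 rows x 640 columns, row-major.
desc = TensorDescriptor(base_address=0x1000, padding_value=0,
                        offsets=[0, 0], strides=[640, 1], sizes=[480, 640])
```

One descriptor per tensor is enough for every access instruction described later, since addresses, bounds checks and padding values can all be derived from it.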
  • the vector processor engine processes a vector that comprises a preset number of elements (the number of elements will be referred to hereinunder as processor granularity).
  • elements of the vector correspond to elements of a tensor that share the same n−1 dimension indexes (wherein n is the number of dimensions of the tensor, referred to hereinunder as the Tensor Dimensionality).
  • a vector processor with granularity 256 reads or writes 256 elements of the cube; for example, elements {19, 205, 1}, {19, 205, 2} . . . {19, 205, 256}.
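The 256-element access above can be sketched as follows: one vector access covers the elements that share all indexes except the fastest-changing one. The function name and argument layout are illustrative, not from the patent.

```python
# Illustrative sketch: a granularity-256 vector access touches the
# 256 elements that share the same indices in every dimension except
# the fastest-changing one (listed last here).
def vector_element_coords(shared, start, granularity=256):
    """Coordinates covered by one vector access."""
    return [shared + [start + i] for i in range(granularity)]

coords = vector_element_coords([19, 205], start=1)
# first element {19, 205, 1}, last element {19, 205, 256}
```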
  • the vector processor further comprises a special ALU that supports a dedicated Instruction Set Architecture (ISA), which comprises instructions that manipulate the Index register file.
  • the ALU is configured to execute the dedicated ISA instructions, i.e. execute instructions that manipulate the IRF.
  • Such instructions read inputs from the IRF or from the SRF, and store execution results in the IRF or SRF.
  • the instructions comprise, for example, converting a multi-dimensional IRF entry to a scalar or from a scalar to a multi-dimensional IRF entry, permuting coordinates of a tensor entry in the IRF, performing mathematical operations between IRF entries, and others (the full instruction set of an example embodiment of the present invention will be described hereinbelow).
  • tensor access instructions may sometimes access memory locations that are not part of the tensors (Out-Of-Bound accesses); for example, when the value of a pixel is calculated from the values of neighboring pixels, and the pixel is at or near the tensor boundary (and, hence, some of the pixel's neighbors are missing).
  • tensor read instructions that access an out-of-bound location return a predefined padding value; tensor write instructions to out-of-bound locations are skipped.
  • tensor access, including out-of-bound handling, is done by hardware (typically transparently to the programmer), at higher speed.
  • FIG. 1 is a block diagram that schematically describes the architecture of a processor 100 according to embodiments of the present invention.
  • the processor comprises a scalar processor 102 , a vector processor engine (VPE) 104 , and an external memory 106 .
  • Vector Processor 104 processes vectors, as described, for example, in U.S. patent application Ser. No. 16/150,299, filed Oct. 3, 2018, which is assigned to the assignee of the present patent application and whose disclosure is incorporated herein by reference.
  • Scalar Processor 102 comprises a Scalar Engine 108, which is configured to decode and execute instructions; a Sequencer 110, which is configured to govern the operation of Scalar Engine 108 (using start-PC, Halt and other control signals); an Instruction Cache 112, which is configured to prefetch instructions; a Load-Store Address Generation Unit (AGU) 116, which is configured to generate the addresses and the tags of the vector-processor data; a Scalar Register File (SRF) 114; a Multiplexer 118, which is configured to initiate memory access requests, through a read-request channel, in response to requests from Instruction Cache 112 and from AGU 116; an Index Register File (IRF) 122, which is configured to store n-dimensional indexes of tensors of n dimensions; and a Tensor Descriptor Memory (TDM) 124, which is configured to store tensor information (tensor base address, padding value, dimension offsets, strides, sizes, etc.).
  • Scalar Engine 108 comprises an Index Processing ALU 120 , which is configured to support dedicated ISA instructions that allow manipulation of IRF 122 .
  • Vector Processor 100 is configured to store and manipulate data that pertains to tensors that the vector processor processes.
  • vector processor 100 is an example configuration that is depicted purely for the sake of conceptual clarity. Other suitable configurations may be used in alternative embodiments of the present invention.
  • sequencer 110 may use other signals to govern the operation of scalar engine 108
  • memory 106 may comprise two or more separately accessible memories (e.g., one for instructions and one for data).
  • the elements of processor 100 that carry out the disclosed techniques, e.g., scalar engine 108, ALU 120, sequencer 110, AGU 116, multiplexer 118 and VPE 104, are referred to collectively as the "processing circuitry" of the processor.
  • SRF 114, IRF 122, TDM 124 and cache 112 are referred to collectively as the "internal memory" of the processor.
  • the processing circuitry and/or internal memory may have any other suitable structure.
  • scalar engine 108 comprises a general-purpose processor, which is programmed in software to carry out the functions described herein.
  • the software may be downloaded to the processor in electronic form, over a network or from a host, for example, or it may, alternatively or additionally, be provided and/or stored on non-transitory tangible media, such as magnetic, optical, or electronic memory.
  • IRF 122 comprises n-field entries, where n is the number of dimensions that the vector processor supports (e.g., 5).
  • each field holds the index of a tensor element in the relevant dimension that will be used by load or store operations that the vector processor will execute.
  • the 5-field index {2, 3, 5, 7, 16} may point to an element with coordinates (2, 3, 5, 7, 16) in a 5-dimensional tensor space.
  • when the vector processor processes a tensor, the lowest field (DIM0) of an IRF entry is the fastest-changing dimension, while the highest field (e.g., DIM4) is the slowest-changing dimension.
  • for example, the x axis (DIM0) may be the fastest-changing dimension and the y axis (DIM1) the slowest-changing dimension.
  • the dimensions can be swapped (as will be described hereinbelow). In this case, a faster-changing dimension can become slower, and a slower-changing dimension can become faster.
  • when a 2-dimensional tensor (a 2D image) is processed in a horizontal raster, the x axis is the fast-changing dimension and the y axis is the slow-changing dimension; when the 2D image is processed in a vertical raster, the y axis is the fast-changing dimension and the x axis is the slow-changing dimension.
  • FIG. 2 is an illustration 200 that schematically describes an example of a 2D tensor 202 and Index Register File (IRF) entries that point to locations in the tensor, in accordance with embodiments of the present invention.
  • Tensor 202 has a first dimension and a second dimension (horizontal and vertical axes, respectively, in the example illustration of FIG. 2 ).
  • the Tensor has a start-dimension-1 location 204 , an end-dimension-1 location 206 , a start-dimension-2 location 208 , and an end-dimension-2 location 210 .
  • IRF 122 (FIG. 1) comprises an entry 212 and an entry 214, which point to the start and end locations, respectively, in both dimensions.
  • FIG. 3 is a block diagram that schematically describes index-processing ALU 120 ( FIG. 1 ), according to some embodiments of the present invention.
  • the ALU reads input sources from IRF 122 or from SRF 114 , and stores execution results into the IRF or the SRF.
  • the ALU comprises logic units that execute dedicated ISA instructions:
  • a Get Index Unit 316, which takes tensor indexes from IRF 122 as the source, selects an index based on the DIM field of the instruction, and stores the index into SRF 114 as the destination;
  • a Set Index Unit 318, which takes a scalar value from SRF 114 as the source and stores the value into IRF 122 as the destination, using destination indexes, according to the DIM_MASK field of the instruction (as defined below);
  • An Index Permutation Unit 320 which takes tensor indexes from IRF 122 as the source, performs permutation between coordinates and indexes to switch between tensor dimensions (as described below), and stores updated indexes into IRF 122 as the destination;
  • an Arithmetic Unit 322, which further comprises circuits for the execution of Multiply ("×"), SUB/ADD ("+/−"), Max/Min, and OR/AND/XOR operations; and a MUX 324, which selects the output that the index-processing ALU writes into IRF 122.
  • Set Index Unit 318 is configured to execute SET_INDEX (DST, DIM_MASK, SRC) instructions, which initialize an entry in the index register file with element coordinates.
  • the DST field defines which entry in the IRF will be initialized; the DIM_MASK field specifies which indexes (coordinates) are initialized; and, the SRC field defines an initialization value, which can be an immediate value, or a value that is read from a Scalar Register File 114 ( FIG. 1 ).
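A sketch of the masked initialization semantics just described, in software. The bit convention (bit i of DIM_MASK selecting DIMi) is an assumption for illustration; the patent does not spell out the encoding here.

```python
# Illustrative SET_INDEX semantics: initialize only the IRF-entry
# fields whose bit is set in DIM_MASK (assumed: bit i selects DIMi).
def set_index(irf_entry, dim_mask, src_value):
    return [src_value if (dim_mask >> i) & 1 else v
            for i, v in enumerate(irf_entry)]

# Entry holds {DIM0..DIM4}; set DIM0 and DIM2 to 7 (mask 0b00101).
entry = set_index([1, 2, 3, 4, 5], dim_mask=0b00101, src_value=7)
# entry == [7, 2, 7, 4, 5]
```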
  • Index Permutation Unit 320 is configured to execute PRMT_INDX (DST, PRMT, SRC) instructions, which perform permutation between indexes (coordinates) to switch between tensor dimensions (for example, making a slow changing dimension faster).
  • the DST field defines the destination entry in the IRF after dimensions permutation
  • the SRC field defines the source entry in the IRF for dimensions permutation.
  • the PRMT field comprises one bit for every dimension of the tensor and specifies how dimensions are permuted. For example, for a 5-dimensional tensor:
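A hedged sketch of the PRMT_INDX semantics: rather than reproducing the patent's PRMT bit encoding (the example table is not shown in this text), the permutation is modeled here as an explicit destination-to-source mapping, which is an illustrative assumption.

```python
# Illustrative PRMT_INDX semantics: reorder an IRF entry's coordinates
# so that tensor dimensions are switched. The explicit permutation list
# (dest[i] = src[perm[i]]) stands in for the patent's PRMT field.
def permute_index(src_entry, perm):
    return [src_entry[p] for p in perm]

# Swap DIM0 and DIM1 of a 5-field entry, e.g. an x/y swap that makes a
# slow-changing dimension the fast-changing one, as described above.
swapped = permute_index([2, 3, 5, 7, 16], [1, 0, 2, 3, 4])
# swapped == [3, 2, 5, 7, 16]
```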
  • Get Index Unit 316 is configured to execute GET_INDX (DST, DIM_MASK, SRC) instructions, which write an index value in SRF 114 .
  • the DST field defines the destination entry in the SRF into which the index will be written; the SRC field defines a source entry in IRF 122 from which the index is read; and the DIM field specifies which dimension indexes to get.
  • Arithmetic Unit 322 is configured to execute arithmetic (and logic) instructions on tensor indexes.
  • the Arithmetic Unit performs operations between two sets of tensor indexes (sources) that are stored in IRF 122, and stores the results of the operations into IRF 122 (destination).
  • the operations are done simultaneously on the pairs of indexes of the two tensor sources that are indicated by the DIM_MASK parameter of the instruction.
  • the DIM_MASK parameter comprises one bit for each dimension (for example, five bits for 5-D tensors).
  • the arithmetic operation will be done on the dimensions for which the corresponding bit in the DIM_MASK field is set. For example, for an ADD operation:
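The ADD example referenced above can be sketched as a masked elementwise sum between two IRF entries. The mask convention (bit i selects DIMi) and the choice to pass through SRC0's value on unselected dimensions are assumptions for illustration.

```python
# Illustrative masked ADD between two IRF entries: only dimensions
# whose DIM_MASK bit is set are summed; other dimensions keep SRC0's
# value (an assumed convention, not stated in the text).
def irf_add(src0, src1, dim_mask):
    return [a + b if (dim_mask >> i) & 1 else a
            for i, (a, b) in enumerate(zip(src0, src1))]

dst = irf_add([2, 3, 5, 7, 16], [1, 1, 1, 1, 1], dim_mask=0b00011)
# dst == [3, 4, 5, 7, 16]  (only DIM0 and DIM1 summed)
```

The same skeleton covers the other listed operations (SUB, Max/Min, OR/AND/XOR) by swapping the `a + b` expression.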
  • the mathematical instructions that ALU 300 supports comprise the following:
  • the DST field defines the destination entry in IRF 122 ; SRC0/SRC1 fields define the source entries in IRF 122 ; and the DIM_MASK field specifies what indexes (coordinates) participate in the operation.
  • the ISA instructions that index-processing ALU 120 is configured to execute are examples that are depicted purely for the sake of conceptual clarity. Other suitable ISA instructions may be added, and some ISA instructions may be removed; for example, an instruction that calculates the intersection tensor of two tensors may be added.
  • index-processing ALU 120 as presented in block diagram 300 of FIG. 3 , is an example that is depicted purely for the sake of conceptual clarity. Different structures could be used in alternative embodiments; for example, arithmetic-logic units 322 may be replaced by a single unit with a configurable ALU.
  • vector processor 100 is further configured to execute ISA Tensor Access instructions, which access tensors in external memory 106 , using IRF 122 indexes.
  • the tensor addresses of the Tensor Access ISA instructions are calculated by AGU 116 .
  • Tensor Access ISA instructions comprise an Address Generation instruction (GEN_ADDR), a Load Tensor (LD_TNSR) instruction, and a Store Tensor (ST_TNSR) instruction.
  • the Address Generation instruction generates memory addresses (in external memory 106 ) for tensor load/store operation.
  • the ISA definition of the Address Generation instruction is:
  • ADDR GEN_ADDR (IRF_SRC, Tensor ID).
  • Tensor ID is the tensor number (in TDM 124 );
  • IRF_SRC represents the coordinates in the tensor to be accessed in the tensor load/store operations; and, ADDR is the start memory address for the tensor load/store operations.
  • the Load Tensor instruction loads tensor elements from external memory.
  • V_DST is the destination entry in VPE 104
  • Tensor ID is the tensor number (in TDM 124 )
  • IRF_SRC represents the coordinates of an element of the tensor.
  • the tensor base address, padding value, dimension offsets, strides, and sizes are extracted from the tensor descriptor in TDM 124 that is indicated by the Tensor ID.
  • the Store Tensor instruction stores a tensor in external memory 106 .
  • the ISA definition of the Store Tensor instruction is: ST_TNSR (IRF_SRC, Tensor ID, V_SRC).
  • V_SRC is the source entry in VPE 104
  • Tensor ID is the tensor number (in TDM 124) for storing
  • IRF_SRC represents the coordinates in the tensor where data will be stored.
  • the tensor base address, padding value and dimension offsets, strides, and sizes are extracted from the tensor descriptor in TDM 124 that is indicated by the Tensor ID.
  • the tensor access instructions described above are example instructions, which are chosen purely for the sake of conceptual clarity.
  • the ISA of vector processor 100 may support any other suitable instructions that access or otherwise manipulate tensors or tensor elements.
  • tensor accessing instructions may result in an out-of-bound tensor access, i.e., may access memory locations that are beyond the addresses allocated for the tensor.
  • Out-of-bound tensor accesses may occur, for example, when a value of a pixel is calculated as a function of the values of neighbor pixels (e.g. a 5 ⁇ 5 low-pass kernel in a 2-D image).
  • neighbor pixels e.g. a 5 ⁇ 5 low-pass kernel in a 2-D image.
  • FIG. 4 is an illustration that schematically describes an example out-of-bounds tensor accesses, in accordance with an embodiment of the present invention.
  • Rectangles in 2-D space 200 represent x-y memory access zones: A zone 202 represents accesses to the tensor (not out of bound).
  • Zones 216 represent tensor accesses wherein dimension 1 index is less than dimension-1 start location 204 , or more than dimension-1 end location 206 .
  • Zones 218 represent tensor accesses wherein dimension-2 index is less than dimension-2 start location 208 , or more than dimension-2 end location 210 .
  • Zones 220 represent tensor accesses wherein both dimensions are out-of-bound.
  • the AGU detects reading from an out-of-bound zone; such reading will return a pre-defined padding value (as will be described hereinbelow).
  • FIG. 4 describes out-of-bound accesses of a two-dimensional tensor.
  • the tensor may have more than two dimensions, and accesses may be out-of-bound in any dimension, or in a plurality of dimensions.
  • vector processor 102 handles out of bound tensor accesses, according to some embodiments of the present invention.
  • the addresses for the accesses are calculated by AGU 116 ( FIG. 1 ).
  • the vector processor when the vector processor executes an LD_TNSR instruction, and one or more tensor dimensions in the IRF_SRC field are out of bounds, the vector processor will get a padding value for the locations which are out-of-bound.
  • the tensor descriptor defines tensor size in each dimension.
  • out-of-bound padding can be on any part of the vector up to the granularity of the vector engine processing size.
  • the vector processing size is N pixels
  • 2D image 2-dimensional tensor
  • the AGU pads pixels up to a granularity of N pixels for the horizontal dimension X (fastest changing dimension) and up to Width pixels for the vertical dimension Y.
  • AGU 116 when the vector processor executes a ST_TNSR instruction, and one or more tensor dimensions in the IRF_SRC field are out of bounds, AGU 116 does not write data to out-of-bounds addresses, and only the valid elements in the tensor are written to memory.
  • indexes of valid dimensions from IRF_SRC are compared to dimension sizes to identify what dimensions are out of bound.
  • out-of-bound addresses can be on part of the vector up to the granularity of the vector engine processing size. For example, if the vector processing size is N pixels, for a 2-dimensional tensor (2D image) of size Width ⁇ Height, out-of-bound addresses can be up to a granularity of N pixels for the horizontal dimension X (fastest changing dimension) and up to Width pixels for the vertical dimension Y.
  • 2D image 2-dimensional tensor
  • the address returned to the address register file is the tensor base address.
  • Vector Processor 100 Index Processing ALU 300 and index manipulating ISA instructions, which are described hereinabove, are example configurations that are shown purely for the sake of conceptual clarity. Any other suitable configurations can be used in alternative embodiments.
  • the different elements of Vector Processor 100 and index processing ALU 300 may be implemented using suitable hardware, such as in one or more Application-Specific Integrated Circuits (ASICs) or Field-Programmable Gate Arrays (FPGAs), using software, using hardware, or using a combination of hardware and software elements.
  • ASICs Application-Specific Integrated Circuits
  • FPGAs Field-Programmable Gate Arrays


Abstract

A processor includes an internal memory and processing circuitry. The internal memory is configured to store a definition of a multi-dimensional array stored in an external memory, and indices that specify elements of the multi-dimensional array in terms of multi-dimensional coordinates of the elements within the array. The processing circuitry is configured to execute instructions in accordance with an Instruction Set Architecture (ISA) defined for the processor. At least some of the instructions in the ISA access the multi-dimensional array by operating on the multi-dimensional coordinates specified in the indices.

Description

CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims the benefit of U.S. Provisional Patent Application 62/582,990, filed Nov. 8, 2017, whose disclosure is incorporated herein by reference.
FIELD OF THE INVENTION
The present invention relates generally to processor architectures, and particularly to tensor-based memory access in a processor.
BACKGROUND OF THE INVENTION
Vector processing is a common operation for many applications such as deep learning. Vector Processors may read some or all input data from memory, and, likewise, may store output data in memory. Consequently, vector processing may involve accessing memory for input and/or output.
U.S. Pat. No. 7,543,119, for example, describes a vector processing system using a System-On-a-Chip (SOC) implementation technique. One or more scalar processors (or cores) operate in conjunction with a vector processor, and the processors collectively share access to a plurality of memory interfaces coupled to Dynamic Random Access read/write Memories (DRAMs).
As another example, U.S. Pat. No. 9,262,165 describes a vector processor including an instruction fetching unit configured to acquire an instruction, a decoding/issuing unit configured to decode and issue the instruction, an operation group comprising a plurality of operation units, and a register configured to store the element data column.
SUMMARY OF THE INVENTION
An embodiment of the present invention that is described herein provides a processor including an internal memory and processing circuitry. The internal memory is configured to store a definition of a multi-dimensional array stored in an external memory, and indices that specify elements of the multi-dimensional array in terms of multi-dimensional coordinates of the elements within the array. The processing circuitry is configured to execute instructions in accordance with an Instruction Set Architecture (ISA) defined for the processor. At least some of the instructions in the ISA access the multi-dimensional array by operating on the multi-dimensional coordinates specified in the indices.
In an embodiment, in accordance with the ISA, the processing circuitry is configured to execute at least one instruction that translates between the multi-dimensional coordinates of an element of the array and an address in which the element is stored in the external memory.
In yet other embodiments, in accordance with the ISA, the processing circuitry is configured to execute an instruction that accesses an element of the array based on the multi-dimensional coordinates of the element.
In alternative embodiments, in accordance with the ISA, the processing circuitry is configured to execute an instruction that performs a mathematical operation between sets of multi-dimensional coordinates. In an embodiment, the instructions sum corresponding coordinates of two sets.
In some embodiments, in accordance with the ISA, the processing circuitry is configured to execute an instruction that performs a permutation among the multi-dimensional coordinates of an element of the array. In other embodiments, the processing circuitry is configured to identify that an executed tensor-access instruction exceeds a bound of the multi-dimensional array. In an embodiment, the processing circuitry is configured to return a padding value as a result of the tensor-access instruction in response to identifying that the tensor-access instruction exceeds the bound.
There is additionally provided, in accordance with an embodiment of the present invention, a method including storing in an internal memory of a processor a definition of a multi-dimensional array stored in an external memory, and indices that specify elements of the multi-dimensional array in terms of multi-dimensional coordinates of the elements within the array. Using processing circuitry of the processor, instructions are executed in accordance with an Instruction Set Architecture (ISA) defined for the processor. At least some of the instructions in the ISA access the multi-dimensional array by operating on the multi-dimensional coordinates specified in the indices.
The present invention will be more fully understood from the following detailed description of the embodiments thereof, taken together with the drawings in which:
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block diagram that schematically illustrates the architecture of a processor, in accordance with embodiments of the present invention;
FIG. 2 is an illustration that schematically shows an example of a 2D tensor and Index Reference File (IRF) entries that point to locations in the tensor, in accordance with embodiments of the present invention;
FIG. 3 is a block diagram that schematically illustrates an ALU that implements dedicated instructions for manipulation and processing of tensor indexes, in accordance with embodiments of the present invention; and
FIG. 4 is an illustration that schematically shows example out-of-bound tensor accesses, in accordance with embodiments of the present invention.
DETAILED DESCRIPTION OF EMBODIMENTS Overview
Embodiments of the present invention that are described herein provide improved methods and apparatus for processing multi-dimensional arrays in processors. Multi-dimensional arrays are also referred to as tensors, and both terms are used interchangeably herein. Tensors may comprise, for example, two-dimensional arrays such as digital images, as well as data structures having more than two dimensions.
In the disclosed embodiments, a processor stores one or more multi-dimensional arrays in an external memory. The processor supports an Instruction Set Architecture (ISA), which specifies various tensor-access instructions. The tensor-access instructions manipulate elements of multi-dimensional arrays by operating directly on the multi-dimensional coordinate values of the elements (as opposed to operating on the addresses in which the elements are stored in the external memory). In this manner, the processor manipulates tensor elements by dedicated hardware, achieving higher speed than software-based solutions.
In some embodiments, the processor stores in its internal memory a definition of each array that specifies, for example, the number of dimensions of the tensor, and the address of the first tensor element in external memory. The processor further stores in an internal memory indices that specify elements of the multi-dimensional arrays in terms of multi-dimensional coordinates of the elements. The tensor-access instructions operate on the indices. Example tensor-access instructions, which will be described in detail below, include instructions that calculate the address in the external memory of a tensor element, and instructions that convert an address in the external memory to tensor coordinates.
In some example embodiments of the present invention, the number of dimensions is limited to five; in other embodiments the number of dimensions may be limited to any other number; and in yet other embodiments the number of dimensions may not be limited.
According to embodiments, a vector processor may comprise a Scalar Register File (SRF) and an Index Register File (IRF). The IRF comprises multiple entries, each having a number of fields equal to the number of tensor dimensions supported by the processor (e.g., if the processor supports 5-dimensional tensors, the number of fields is five, numbered DIM4, DIM3, DIM2, DIM1 and DIM0).
According to an embodiment, each field of an IRF entry holds the index of the tensor element in the relevant dimension that will be used by a load or store operation. For example, a 5-field IRF entry of {2, 3, 5, 7, 16} points to an element with coordinates (2, 3, 5, 7, 16) in a 5-dimensional tensor space. In an embodiment, the lowest field (DIM0) is the fastest-changing dimension (with reference to tensor processing order), while the highest field (DIM4) is the slowest-changing dimension. For example, for a 2-dimension tensor (a 2D array, e.g., an image), the x axis (DIM0) may be the fastest-changing dimension and y axis (DIM1) the slowest-changing dimension.
In some embodiments of the present invention, the dimensions can be swapped (as will be described hereinbelow). In this case, a faster-changing dimension can become slower, and a slower-changing dimension can become faster.
According to some embodiments of the present invention, the vector processor further comprises a Tensor Descriptor Memory (TDM), which stores tensor information (for example tensor base address, padding value, dimension offsets, strides and sizes).
The vector processor engine processes a vector that comprises a preset number of elements (referred to hereinunder as the processor granularity). According to an embodiment of the present invention, the elements of the vector correspond to elements of a tensor that share the same n−1 dimension indices (wherein n is the number of dimensions of the tensor, referred to hereinunder as the tensor dimensionality). For example, a 1024×1024×1024 cube can be represented by a tensor of dimensionality 3, with size 1024 in each of the three dimensions. A vector processor with granularity 256 reads or writes 256 elements of the cube at a time; for example, elements {19,205,1}, {19,205,2}, . . . , {19,205,256}.
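As an illustrative sketch of this grouping (plain Python; the function and variable names are our own, not taken from the patent), one vector access covers `granularity` consecutive elements that share all indices except the fastest-changing one, written last here to match the {19,205,1} . . . {19,205,256} notation:

```python
def vector_elements(coords, granularity):
    """Model one vector access: it covers `granularity` consecutive tensor
    elements that share all indices except the fastest-changing one
    (written last here, matching the patent's brace notation)."""
    *fixed, fast = coords
    return [tuple(fixed) + (fast + i,) for i in range(granularity)]

# A 1024x1024x1024 cube, granularity 256: one access starting at {19,205,1}
# covers elements {19,205,1}, {19,205,2}, ..., {19,205,256}.
elems = vector_elements((19, 205, 1), 256)
```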
According to embodiments, the vector processor further comprises a special ALU that supports a dedicated Instruction Set Architecture (ISA), which comprises instructions that manipulate the Index register file. The ALU is configured to execute the dedicated ISA instructions, i.e. execute instructions that manipulate the IRF. Such instructions read inputs from the IRF or from the SRF, and store execution results in the IRF or SRF. According to an embodiment, the instructions comprise, for example, converting a multi-dimensional IRF entry to a scalar or from a scalar to a multi-dimensional IRF entry, permuting coordinates of a tensor entry in the IRF, performing mathematical operations between IRF entries, and others (the full instruction set of an example embodiment of the present invention will be described hereinbelow).
According to some embodiments of the present invention, tensor access instructions may sometimes access memory locations that are not part of the tensors (out-of-bound accesses). This may occur, for example, when the value of a pixel is calculated from the values of neighboring pixels, and the pixel is at or near the tensor boundary (and, hence, some of the pixel's neighbors are missing). In embodiments, tensor read instructions that access out-of-bound locations return a predefined padding value, and tensor write instructions to out-of-bound locations are skipped.
Thus, in embodiments according to the present invention, tensor access, including out-of-bound handling, is done by hardware (typically transparently to the programmer), at a higher speed.
System Description
FIG. 1 is a block diagram that schematically describes the architecture of a processor 100, according to embodiments of the present invention. The processor comprises a scalar processor 102, a vector processor engine (VPE) 104, and an external memory 106. VPE 104 processes vectors, as described, for example, in U.S. patent application Ser. No. 16/150,299, filed Oct. 3, 2018, which is assigned to the assignee of the present patent application and whose disclosure is incorporated herein by reference.
Scalar Processor 102 comprises a Scalar Engine 108, which is configured to decode and execute instructions; a Sequencer 110, which is configured to govern the operation of Scalar Engine 108 (using start-PC, Halt and other control signals); an Instruction Cache 112, which is configured to prefetch instructions; a Load-Store Address Generation Unit (AGU) 116, which is configured to generate the addresses and the tags of the vector-processor data; a Scalar Register File (SRF) 114; a Multiplexer 118, which is configured to initiate memory access requests, through a read-request channel, in response to requests from Instruction Cache 112 and from AGU 116; an Index Register File (IRF) 122, which is configured to store n-dimensional indexes of n-dimensional tensors; and a Tensor Descriptor Memory (TDM) 124, which is configured to store tensor information (tensor base address, padding value, dimension offsets, strides, sizes, etc.) for a predefined number of tensors that is supported by the processor.
According to an embodiment of the present invention, Scalar Engine 108 comprises an Index Processing ALU 120, which is configured to support dedicated ISA instructions that allow manipulation of IRF 122.
Thus, according to the embodiment described in FIG. 1, Vector Processor 100 is configured to store and manipulate data that pertains to tensors that the vector processor processes.
The configuration of vector processor 100 is an example configuration that is depicted purely for the sake of conceptual clarity. Other suitable configurations may be used in alternative embodiments of the present invention. For example, sequencer 110 may use other signals to govern the operation of scalar engine 108, there may be more than one vector processor engine 104, and memory 106 may comprise two or more separately accessible memories (e.g., one for instructions and one for data).
In the present context, the various elements of processor 100 that carry out the disclosed techniques, e.g., scalar engine 108, ALU 120, sequencer 110, AGU 116, multiplexer 118 and VPE 104, are referred to collectively as the “processing circuitry” of the processor. Similarly, SRF 114, IRF 122, TDM 124 and cache 112 are referred to collectively as the “internal memory” of the processor. In alternative embodiments, the processing circuitry and/or internal memory may have any other suitable structure.
In some embodiments, scalar engine 108 comprises a general-purpose processor, which is programmed in software to carry out the functions described herein. The software may be downloaded to the processor in electronic form, over a network or from a host, for example, or it may, alternatively or additionally, be provided and/or stored on non-transitory tangible media, such as magnetic, optical, or electronic memory.
According to embodiments of the present invention, IRF 122 comprises n-field entries, where n is the number of dimensions that the vector processor supports (e.g. 5).
Each field holds the index of a tensor element in the relevant dimension that will be used by a load or store operation that the vector processor will execute. For example, the 5-field index {2, 3, 5, 7, 16} may point to an element with coordinates (2, 3, 5, 7, 16) in a 5-dimensional tensor space.
According to an embodiment, when the vector processor processes a tensor, the lowest field of an IRF entry (DIM0) is the fastest-changing dimension, whereas the highest field (e.g. DIM4) is the slowest-changing dimension. For example, for a 2-dimension tensor (2D image), the x axis (DIM0) may be the fastest-changing dimension, and the y axis (DIM1) the slowest-changing dimension. According to embodiments of the present invention, the dimensions can be swapped (as will be described hereinbelow). In this case, a faster-changing dimension can become slower, and a slower-changing dimension can become faster. For example, if a 2-dimensional tensor (2D image) is processed in a horizontal raster, then the x axis is the fast-changing dimension and the y axis is the slow-changing dimension. If the 2D image is processed in a vertical raster, then the y axis is the fast-changing dimension and the x axis is the slow-changing dimension.
FIG. 2 is an illustration 200 that schematically describes an example of a 2D tensor 202 and Index Reference File (IRF) entries that point to locations in the tensor, in accordance with embodiments of the present invention.
Tensor 202 has a first dimension and a second dimension (horizontal and vertical axes, respectively, in the example illustration of FIG. 2). The Tensor has a start-dimension-1 location 204, an end-dimension-1 location 206, a start-dimension-2 location 208, and an end-dimension-2 location 210. According to an embodiment, IRF 122 (FIG. 1) comprises an entry 212, and an entry 214, which point to the start and end locations, respectively, in both dimensions.
The illustration of tensor 202 is an example that is depicted purely for the sake of conceptual clarity. In embodiments of the present invention, tensors typically have more than two dimensions. While the expansion of the example to more dimensions is trivial, illustration of examples with three dimensions is more complex, and illustrations with four or more dimensions may be illegible.
FIG. 3 is a block diagram that schematically describes index-processing ALU 120 (FIG. 1), according to some embodiments of the present invention. The ALU reads input sources from IRF 122 or from SRF 114, and stores execution results into the IRF or the SRF. The ALU comprises logic units that execute dedicated ISA instructions:
A Get Index Unit 316, which takes tensor indexes from IRF 122 as the source, selects an index according to the DIM_MASK field of the instruction, and stores the index into SRF 114 as the destination;
A Set Index Unit 318, which takes a scalar value from SRF 114 as the source and stores the value into IRF 122 as the destination, using destination indexes according to the DIM_MASK field of the instruction (as defined below);
An Index Permutation Unit 320, which takes tensor indexes from IRF 122 as the source, performs permutation between coordinates and indexes to switch between tensor dimensions (as described below), and stores updated indexes into IRF 122 as the destination;
An Arithmetic Unit 322, which further comprises circuits for the execution of Multiply (“X”), SUB/ADD (“+/−”), Max/Min, and OR/AND/XOR operations; and a MUX 324, which selects an output that the index-processing ALU writes into IRF 122.
Set Index Unit 318 is configured to execute SET INDEX (DST, DIM_MASK, SRC) instructions, which initialize an entry in the index register file with element coordinates. The DST field defines which entry in the IRF will be initialized; the DIM_MASK field specifies which indexes (coordinates) are initialized; and, the SRC field defines an initialization value, which can be an immediate value, or a value that is read from a Scalar Register File 114 (FIG. 1).
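A minimal Python model of the SET_INDEX semantics may look as follows; the bit ordering of DIM_MASK (bit 0 selects DIM0) and the dict-of-lists register-file representation are illustrative choices of ours, not taken from the patent:

```python
def set_index(irf, dst, dim_mask, src_value):
    """Model of SET_INDEX: initialize the coordinates of IRF entry `dst`
    selected by dim_mask (bit d selects DIMd) with the scalar src_value."""
    for dim in range(len(irf[dst])):
        if (dim_mask >> dim) & 1:
            irf[dst][dim] = src_value

irf = {0: [9, 9, 9, 9, 9]}                      # IRF modeled as dict of entries
set_index(irf, dst=0, dim_mask=0b00101, src_value=0)  # initialize DIM0 and DIM2
# irf[0] is now [0, 9, 0, 9, 9]
```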
Index Permutation Unit 320 is configured to execute PRMT_INDX (DST, PRMT, SRC) instructions, which perform permutation between indexes (coordinates) to switch between tensor dimensions (for example, making a slow changing dimension faster). The DST field defines the destination entry in the IRF after dimensions permutation, and the SRC field defines the source entry in the IRF for dimensions permutation. The PRMT field comprises one bit for every dimension of the tensor and specifies how dimensions are permuted. For example, for a 5-dimensional tensor:
IRF[DST][DIM0]=IRF[Src][PRMT[0]]
IRF[DST][DIM1]=IRF[Src][PRMT[1]]
IRF[DST][DIM2]=IRF[Src][PRMT[2]]
IRF[DST][DIM3]=IRF[Src][PRMT[3]]
IRF[DST][DIM4]=IRF[Src][PRMT[4]]
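The five assignments above amount to a gather governed by PRMT, which can be condensed into a short sketch (illustrative Python, not the hardware implementation; the register file is modeled as a dict of coordinate lists):

```python
def prmt_indx(irf, dst, prmt, src):
    """Model of PRMT_INDX: IRF[dst][d] = IRF[src][prmt[d]] for every dimension d."""
    irf[dst] = [irf[src][p] for p in prmt]

irf = {1: [10, 20, 30, 40, 50]}
prmt_indx(irf, dst=2, prmt=[1, 0, 2, 3, 4], src=1)  # swap DIM0 and DIM1
# irf[2] == [20, 10, 30, 40, 50]; the source entry is unchanged
```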
Get Index Unit 316 is configured to execute GET_INDX (DST, DIM_MASK, SRC) instructions, which write an index value into SRF 114. The DST field defines the destination entry in the SRF into which the index will be written; the SRC field defines a source entry in IRF 122 from which the index is read; and the DIM_MASK field specifies which dimension indexes to get.
Arithmetic Unit 322 is configured to execute arithmetic (and logic) instructions on tensor indexes. The Arithmetic Unit performs operations between two sets of tensor indexes (sources) that are stored in IRF 122, and stores the results of the operations into IRF 122 (destination). The operations are done simultaneously for all pairs of indexes of the two tensor sources that are indicated by a DIM_MASK parameter of the instruction. According to embodiments of the present invention, the DIM_MASK parameter comprises one bit for each dimension (for example, five bits for 5-D tensors). The arithmetic operation is performed on the dimensions for which the corresponding bit in the DIM_MASK field is set. For example, for an ADD operation:
IRF[DST][DIM0] = IRF[Src0][DIM0] + IRF[Src1][DIM0] if DIM_MASK[0] = 1;
IRF[DST][DIM1] = IRF[Src0][DIM1] + IRF[Src1][DIM1] if DIM_MASK[1] = 1;
IRF[DST][DIM2] = IRF[Src0][DIM2] + IRF[Src1][DIM2] if DIM_MASK[2] = 1;
IRF[DST][DIM3] = IRF[Src0][DIM3] + IRF[Src1][DIM3] if DIM_MASK[3] = 1;
IRF[DST][DIM4] = IRF[Src0][DIM4] + IRF[Src1][DIM4] if DIM_MASK[4] = 1;
If DIM_MASK[x]=0, IRF[DST][DIMx] is not updated in IRF.
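The masked-update rule above can be modeled in a few lines (illustrative Python; the register file is a plain dict of coordinate lists, and bit d of dim_mask selects DIMd):

```python
def masked_add(irf, dst, dim_mask, src0, src1):
    """Model of the masked ADD: sum corresponding coordinates of two IRF
    entries; dimensions whose dim_mask bit is clear keep dst's old value."""
    for dim in range(len(irf[dst])):
        if (dim_mask >> dim) & 1:
            irf[dst][dim] = irf[src0][dim] + irf[src1][dim]

irf = {0: [1, 2, 3, 4, 5], 1: [10, 20, 30, 40, 50], 2: [0, 0, 0, 0, 0]}
masked_add(irf, dst=2, dim_mask=0b00011, src0=0, src1=1)  # update DIM0, DIM1 only
# irf[2] == [11, 22, 0, 0, 0]
```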
According to the example embodiment of FIG. 3, the mathematical instructions that index-processing ALU 120 supports comprise the following:
MUL (DST, DIM_MASK, SRC0, SRC1);
ADD (DST, DIM_MASK, SRC0, SRC1);
SUB (DST, DIM_MASK, SRC0, SRC1);
MAX (DST, DIM_MASK, SRC0, SRC1);
MIN (DST, DIM_MASK, SRC0, SRC1);
OR (DST, DIM_MASK, SRC0, SRC1);
AND (DST, DIM_MASK, SRC0, SRC1); and
XOR (DST, DIM_MASK, SRC0, SRC1).
The DST field defines the destination entry in IRF 122; SRC0/SRC1 fields define the source entries in IRF 122; and the DIM_MASK field specifies what indexes (coordinates) participate in the operation.
The ISA instructions that index-processing ALU 120 is configured to execute are examples that are depicted purely for the sake of conceptual clarity. Other suitable ISA instructions may be added, and some ISA instructions may be removed; for example, an instruction that calculates the intersection tensor of two tensors may be added.
The structure of index-processing ALU 120, as presented in block diagram 300 of FIG. 3, is an example that is depicted purely for the sake of conceptual clarity. Different structures could be used in alternative embodiments; for example, arithmetic-logic units 322 may be replaced by a single unit with a configurable ALU.
Tensor-Access Instructions
According to embodiments of the present invention, vector processor 100 is further configured to execute ISA Tensor Access instructions, which access tensors in external memory 106, using IRF 122 indexes. The tensor addresses of the Tensor Access ISA instructions are calculated by AGU 116. In some embodiments, Tensor Access ISA instructions comprise an Address Generation instruction (GEN_ADDR), a Load Tensor (LD_TNSR) instruction, and a Store Tensor (ST_TNSR) instruction.
The Address Generation instruction generates memory addresses (in external memory 106) for tensor load/store operation. The ISA definition of the Address Generation instruction is:
ADDR=GEN_ADDR (IRF_SRC, Tensor ID). Tensor ID is the tensor number (in TDM 124); IRF_SRC represents the coordinates in the tensor to be accessed in the tensor load/store operations; and, ADDR is the start memory address for the tensor load/store operations.
The Load Tensor instruction loads tensor elements from external memory. The ISA definition of the Load Tensor instruction is: V_DST=LD_TNSR (IRF_SRC, Tensor ID). V_DST is the destination entry in VPE 104, Tensor ID is the tensor number (in TDM 124), and IRF_SRC represents the coordinates of an element of the tensor. The tensor base address, padding value, dimension offsets, strides, and sizes are extracted from the tensor descriptor in TDM 124 that is indicated by the Tensor ID.
The Store Tensor instruction stores a tensor in external memory 106. The ISA definition of the Store Tensor instruction is: ST_TNSR (IRF_SRC, Tensor ID, V_SRC). V_SRC is the source entry in VPE 104, Tensor ID is the tensor number for storing, and IRF_SRC represents the coordinates in the tensor where data will be stored. The tensor base address, padding value, dimension offsets, strides, and sizes are extracted from the tensor descriptor in TDM 124 that is indicated by the Tensor ID.
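The patent does not fix a particular memory layout; assuming a conventional strided layout, the address computation performed by GEN_ADDR from a tensor descriptor can be sketched as below. The descriptor fields and the stride convention are illustrative assumptions; the base-address fallback for out-of-bound coordinates follows the GEN_ADDR behavior described in the out-of-bound section:

```python
def gen_addr(coords, desc):
    """Sketch of GEN_ADDR: map tensor coordinates to a flat memory address
    using the descriptor's base address and per-dimension strides.
    Out-of-bound coordinates return the tensor base address."""
    if any(c < 0 or c >= size for c, size in zip(coords, desc["sizes"])):
        return desc["base"]
    return desc["base"] + sum(c * s for c, s in zip(coords, desc["strides"]))

# Hypothetical descriptor for an 8x8x8 tensor with DIM0 contiguous.
desc = {"base": 0x1000, "strides": [1, 8, 64], "sizes": [8, 8, 8]}
addr = gen_addr((2, 3, 1), desc)   # 0x1000 + 2*1 + 3*8 + 1*64
```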
The tensor access instructions described above are example instructions, which are chosen purely for the sake of conceptual clarity. In alternative embodiments, the ISA of vector processor 100 may support any other suitable instructions that access or otherwise manipulate tensors or tensor elements.
Out-of-Bounds Memory Access Support Mechanism
In practice, tensor accessing instructions may result in an out-of-bound tensor access, i.e., they may access memory locations that are beyond the addresses allocated for the tensor. Out-of-bound tensor accesses may occur, for example, when the value of a pixel is calculated as a function of the values of neighboring pixels (e.g., a 5×5 low-pass kernel in a 2-D image). When the first or the last pixels in any dimension are calculated, some of the neighbor pixels extend beyond the tensor dimensions; as a result, the vector processor executes an out-of-bound access.
FIG. 4 is an illustration that schematically describes example out-of-bound tensor accesses, in accordance with an embodiment of the present invention. Rectangles in 2-D space 200 represent x-y memory access zones: a zone 202 represents accesses to the tensor (not out of bound). Zones 216 represent tensor accesses wherein the dimension-1 index is less than dimension-1 start location 204, or more than dimension-1 end location 206. Zones 218 represent tensor accesses wherein the dimension-2 index is less than dimension-2 start location 208, or more than dimension-2 end location 210. Zones 220 represent tensor accesses wherein both dimensions are out of bound.
According to embodiments of the present invention, the AGU detects reading from an out-of-bound zone; such reading will return a pre-defined padding value (as will be described hereinbelow).
The example of FIG. 4 describes out-of-bound accesses of a two-dimensional tensor. In other embodiments of the present invention, the tensor may have more than two dimensions, and accesses may be out-of-bound in any dimension, or in a plurality of dimensions.
In the following, we describe how processor 100 handles out-of-bound tensor accesses, according to some embodiments of the present invention. The addresses for the accesses are calculated by AGU 116 (FIG. 1).
Load Tensor
According to an embodiment, when the vector processor executes an LD_TNSR instruction, and one or more tensor dimensions in the IRF_SRC field are out of bounds, the vector processor will get a padding value for the locations that are out of bound. The tensor descriptor defines the tensor size in each dimension. When AGU 116 executes an LD_TNSR instruction, the AGU compares the indexes of valid dimensions from IRF_SRC to the dimension sizes to identify which dimensions are out of bound.
For the fastest changing dimension (DIM0), out-of-bound padding can be on any part of the vector up to the granularity of the vector engine processing size. For example, if the vector processing size (vector granularity) is N pixels, for a 2-dimensional tensor (2D image) of size Width×Height, the AGU pads pixels up to a granularity of N pixels for the horizontal dimension X (fastest changing dimension) and up to Width pixels for the vertical dimension Y.
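This padding behavior can be sketched as follows (illustrative Python; modeling the tensor as a coordinate-keyed dict is an assumption of ours, not the patent's memory layout):

```python
def ld_tnsr_row(coords, desc, tensor, n):
    """Sketch of LD_TNSR padding along the fastest-changing dimension DIM0:
    read n consecutive elements, substituting the descriptor's padding
    value for positions that fall outside the tensor sizes."""
    row = []
    for i in range(n):
        c = (coords[0] + i,) + tuple(coords[1:])
        in_bounds = all(0 <= x < s for x, s in zip(c, desc["sizes"]))
        row.append(tensor[c] if in_bounds else desc["pad"])
    return row

# A 4x4 image stored as a dict keyed by (x, y); padding value -1.
tensor = {(x, y): 10 * y + x for x in range(4) for y in range(4)}
desc = {"sizes": [4, 4], "pad": -1}
row = ld_tnsr_row((2, 1), desc, tensor, 4)   # [12, 13, -1, -1]
```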
Store Tensor
According to some embodiments, when the vector processor executes an ST_TNSR instruction, and one or more tensor dimensions in the IRF_SRC field are out of bounds, AGU 116 does not write data to out-of-bound addresses; only the valid elements in the tensor are written to memory. During ST_TNSR instruction execution by the Load/Store AGU, the indexes of valid dimensions from IRF_SRC are compared to the dimension sizes to identify which dimensions are out of bound.
For the fastest-changing dimension (DIM0), out-of-bound addresses can be on any part of the vector, up to the granularity of the vector engine processing size. For example, if the vector processing size is N pixels, for a 2-dimensional tensor (2D image) of size Width×Height, out-of-bound addresses can be up to a granularity of N pixels for the horizontal dimension X (the fastest-changing dimension) and up to Width pixels for the vertical dimension Y.
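The store-side behavior can be sketched analogously. Again this is an illustrative model, not the actual Load/Store AGU; the function name and dict-based memory model are assumptions. The key point it demonstrates is that out-of-bound addresses are silently skipped, so only valid elements reach memory:

```python
def store_tensor_row(tensor, dim_sizes, start_coords, values):
    """Model of an ST_TNSR-style store along DIM0 starting at start_coords.
    Elements whose coordinates are out of bounds are not written."""
    x0 = start_coords[0]
    rest = tuple(start_coords[1:])
    # If a slower dimension is out of bounds, the entire vector is dropped.
    if not all(0 <= c < s for c, s in zip(rest, dim_sizes[1:])):
        return
    for i, v in enumerate(values):
        x = x0 + i
        if 0 <= x < dim_sizes[0]:
            tensor[(x,) + rest] = v
```

Storing a 4-element vector starting at X=2 into a 4-wide image writes only the two in-bound pixels; the remaining elements are discarded rather than corrupting adjacent memory.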
GEN_ADDR
When a GEN_ADDR instruction is issued, and the address generated is out of bounds, the address returned to the address register file is the tensor base address.
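The GEN_ADDR behavior can be sketched as a coordinate-to-address translation using the base address and per-dimension strides from the tensor descriptor (strides and dimension sizes in a descriptor are consistent with the claims below; the dict-based descriptor layout and function name here are illustrative assumptions, not the actual hardware):

```python
def gen_addr(descriptor, coords):
    """Model of a GEN_ADDR-style translation from multi-dimensional
    coordinates to a linear memory address. If any coordinate is out of
    bounds, the tensor base address is returned instead."""
    base = descriptor["base"]
    sizes = descriptor["dim_sizes"]
    strides = descriptor["strides"]
    if any(not (0 <= c < s) for c, s in zip(coords, sizes)):
        return base  # out-of-bound access maps to the tensor base address
    # In-bound: base address plus the stride-weighted sum of coordinates.
    return base + sum(c * st for c, st in zip(coords, strides))
```

For a 4×3 tensor at base 0x1000 with unit stride along DIM0 and a row stride of 4, the element at (2, 1) maps to 0x1006, while any out-of-bound coordinate pair maps back to 0x1000.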
The configurations of Vector Processor 100, Index Processing ALU 300 and index manipulating ISA instructions, which are described hereinabove, are example configurations that are shown purely for the sake of conceptual clarity. Any other suitable configurations can be used in alternative embodiments. The different elements of Vector Processor 100 and Index Processing ALU 300 may be implemented using suitable hardware, such as one or more Application-Specific Integrated Circuits (ASICs) or Field-Programmable Gate Arrays (FPGAs), using software, or using a combination of hardware and software elements.
It will thus be appreciated that the embodiments described above are cited by way of example, and that the present invention is not limited to what has been particularly shown and described hereinabove. Rather, the scope of the present invention includes both combinations and sub-combinations of the various features described hereinabove, as well as variations and modifications thereof which would occur to persons skilled in the art upon reading the foregoing description and which are not disclosed in the prior art. Documents incorporated by reference in the present patent application are to be considered an integral part of the application except that to the extent any terms are defined in these incorporated documents in a manner that conflicts with the definitions made explicitly or implicitly in the present specification, only the definitions in the present specification should be considered.

Claims (22)

The invention claimed is:
1. A processor, comprising:
an internal memory, configured to store at least one definition of a multi-dimensional array stored in an external memory;
processing circuitry implementing a scalar engine, configured to execute instructions in accordance with an Instruction Set Architecture (ISA) defined for the processor, wherein at least some of the instructions in the ISA indicate an address of a single element in the multi-dimensional array, by an ID of the multi-dimensional array and an indication of a plurality of coordinates corresponding respectively to different dimensions of the multi-dimensional array; and
a load-store address generation unit configured to calculate a memory address corresponding to multi-dimensional coordinates of a single element indicated by an instruction executed by the processing circuitry, using a base address and dimension sizes of the multi-dimensional array from the at least one definition of the multi-dimensional array indicated by the ID and the plurality of coordinates indicated by the instruction.
2. The processor according to claim 1, wherein, in accordance with the ISA, the processing circuitry is configured to execute at least one instruction that translates between an input of multi-dimensional coordinates of an element of the array and provides as a result of the at least one instruction, an address in which the element is stored in the external memory.
3. The processor according to claim 1, wherein, in accordance with the ISA, the processing circuitry is configured to execute an instruction that accesses an element of the array based on the multi-dimensional coordinates of the element.
4. The processor according to claim 1, wherein the internal memory is further configured to store vectors of indices that specify elements of the multi-dimensional array in terms of multi-dimensional coordinates of the elements within the array, wherein the at least some of the instructions indicate the plurality of coordinates by referring to one of the vectors of indices in the internal memory and wherein the load-store address generation unit extracts the plurality of coordinates from the one of the vectors of indices in the internal memory indicated by the reference in the instruction.
5. The processor according to claim 4, wherein, in accordance with the ISA, the processing circuitry is configured to execute an instruction that performs a mathematical operation between vectors of indices in the internal memory.
6. The processor according to claim 5, wherein the processing circuitry is configured to execute an instruction that sums corresponding coordinates of two vectors of indices.
7. The processor according to claim 4, wherein, in accordance with the ISA, the processing circuitry is configured to execute an instruction that performs a permutation among the coordinates of one of the vectors of indices.
8. The processor according to claim 1, wherein the load-store address generation unit is configured to compare the plurality of coordinates indicated by the instruction to dimension sizes of the multi-dimensional array, to identify when multi-dimensional coordinates in an executed tensor-access instruction exceed a bound of the multi-dimensional array.
9. The processor according to claim 8, wherein, in response to identifying that the multi-dimensional coordinates in the tensor-access instruction exceed the bound, the load-store address generation unit is configured to return a padding value as a result of the tensor-access instruction.
10. The processor according to claim 1, wherein the load-store address generation unit comprises dedicated hardware.
11. The processor according to claim 1, wherein the load-store address generation unit is configured to extract the base address from the internal memory using the ID indicated by the instruction.
12. The processor according to claim 1, wherein the load-store address generation unit is configured to extract strides of the multi-dimensional array from the internal memory using the ID indicated by the instruction, and to calculate the memory address using the extracted strides.
13. The processor according to claim 1, wherein the load-store address generation unit is configured to extract dimension offsets of the multi-dimensional array from the internal memory using the ID indicated by the instruction, and to calculate the memory address using the extracted dimension offsets.
14. The processor according to claim 1, wherein the multi-dimensional array has more than two dimensions and wherein the plurality of coordinates include a coordinate for each of the dimensions.
15. A method, comprising:
storing in an internal memory of a processor at least one definition of a multi-dimensional array stored in an external memory; and
using processing circuitry of the processor, executing instructions in accordance with an Instruction Set Architecture (ISA) defined for the processor, wherein at least some of the instructions in the ISA indicate an address of a single element in the multi-dimensional array by an ID of the multi-dimensional array and an indication of a plurality of coordinates corresponding respectively to different dimensions of the multi-dimensional array; and
calculating a memory address corresponding to multi-dimensional coordinates of a single element indicated by an instruction executed by the processing circuitry, using a base address and dimension sizes of the multi-dimensional array from the at least one definition of the multi-dimensional array indicated by the ID and the plurality of coordinates indicated by the instruction, by a load-store address generation unit of the processor.
16. The method according to claim 15, wherein, in accordance with the ISA, executing the instructions comprises executing at least one instruction that translates between an input of multi-dimensional coordinates of an element of the array and provides as a result of the at least one instruction, an address in which the element is stored in the external memory.
17. The method according to claim 15, wherein, in accordance with the ISA, executing the instructions comprises executing an instruction that accesses an element of the array, based on the multi-dimensional coordinates of the element.
18. The method according to claim 15, wherein, in accordance with the ISA, executing the instructions comprises executing an instruction that performs a mathematical operation between sets of multi-dimensional coordinates.
19. The method according to claim 18, wherein, in accordance with the ISA, executing the instructions comprises executing an instruction that sums corresponding coordinates of two sets.
20. The method according to claim 15, wherein, in accordance with the ISA, executing the instructions comprises executing an instruction that performs a permutation among the multi-dimensional coordinates of an element of the array.
21. The method according to claim 15, wherein executing the instructions comprises comparing the plurality of coordinates indicated by the instruction to dimension sizes of the multi-dimensional array, to identify when multi-dimensional coordinates in an executed tensor-access instruction exceed a bound of the multi-dimensional array.
22. The method according to claim 21, wherein executing the instructions comprises, in response to identifying that the multi-dimensional coordinates in an executed tensor-access instruction exceed a bound of the multi-dimensional array, returning a padding value by the load-store address generation unit.
US16/170,069 2017-11-08 2018-10-25 Tensor-based memory access Active 2039-02-13 US11321092B1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/170,069 US11321092B1 (en) 2017-11-08 2018-10-25 Tensor-based memory access

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201762582990P 2017-11-08 2017-11-08
US16/170,069 US11321092B1 (en) 2017-11-08 2018-10-25 Tensor-based memory access

Publications (1)

Publication Number Publication Date
US11321092B1 true US11321092B1 (en) 2022-05-03

Family

ID=81385202

Family Applications (1)

Application Number Title Priority Date Filing Date
US16/170,069 Active 2039-02-13 US11321092B1 (en) 2017-11-08 2018-10-25 Tensor-based memory access

Country Status (1)

Country Link
US (1) US11321092B1 (en)



Patent Citations (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5226171A (en) 1984-12-03 1993-07-06 Cray Research, Inc. Parallel vector processing system for individual and broadcast distribution of operands and control information
US4760518A (en) * 1986-02-28 1988-07-26 Scientific Computer Systems Corporation Bi-directional databus system for supporting superposition of vector and scalar operations in a computer
US5471627A (en) 1989-10-10 1995-11-28 Hnc, Inc. Systolic array image processing system and method
US5099447A (en) 1990-01-22 1992-03-24 Alliant Computer Systems Corporation Blocked matrix multiplication for computers with hierarchical memory
EP0517241A2 (en) 1991-06-06 1992-12-09 Lsi Logic Corporation Interleaved multiplier accumulator
US6675187B1 (en) 1999-06-10 2004-01-06 Agere Systems Inc. Pipelined linear array of processor elements for performing matrix computations
US20030225805A1 (en) 2002-05-14 2003-12-04 Nash James G. Digital systolic array architecture and method for computing the discrete fourier transform
US20060095258A1 (en) 2004-08-21 2006-05-04 Postech Foundation Apparatus for separating blind source signals having systolic array structure
US7543119B2 (en) 2005-02-10 2009-06-02 Richard Edward Hessel Vector processor
US20070143574A1 (en) 2005-12-19 2007-06-21 Bonebakker Jan L Method and apparatus for supporting vector operations on a multi-threaded microprocessor
US20080052693A1 (en) * 2006-08-08 2008-02-28 International Business Machines Corporation Method of simd-ization through data reshaping, padding, and alignment
US20110307459A1 (en) * 2010-06-10 2011-12-15 Jacob Yaakov Jeffrey Allan Alon System, data structure, and method for simultaneously retrieving multi-dimensional data with zero contention
US9262165B2 (en) 2012-02-23 2016-02-16 Socionext Inc. Vector processor and vector processor processing method
US20140181171A1 (en) 2012-12-24 2014-06-26 Pavel Dourbal Method and system for fast tensor-vector multiplication
US20150269122A1 (en) 2014-03-21 2015-09-24 Yahoo! Inc. Computation through array decomposition
US9647667B1 (en) 2014-04-30 2017-05-09 Altera Corporation Hybrid architecture for signal processing and signal processing accelerator
US20170004089A1 (en) * 2015-06-30 2017-01-05 Nvidia Corporation Patch memory system
US20170147531A1 (en) 2015-11-24 2017-05-25 International Business Machines Corporation Sparse matrix multiplication using a single field programmable gate array module
US10685082B2 (en) 2015-11-24 2020-06-16 International Business Machines Corporation Sparse matrix multiplication using a single field programmable gate array module
US20170255572A1 (en) * 2016-03-07 2017-09-07 Ceva D.S.P. Ltd. System and method for preventing cache contention
US20170344514A1 (en) 2016-05-31 2017-11-30 Palo Alto Research Center Incorporated System and method for speeding up general matrix-matrix multiplication on the gpu
US20190303743A1 (en) 2016-08-13 2019-10-03 Intel Corporation Apparatuses, methods, and systems for neural networks
US20180074962A1 (en) * 2016-09-09 2018-03-15 International Business Machines Corporation Index based memory access
US20180074996A1 (en) 2016-09-15 2018-03-15 Altera Corporation Dot product based processing elements
US20180336163A1 (en) 2017-05-17 2018-11-22 Google Llc Low latency matrix multiply unit
US9946539B1 (en) * 2017-05-23 2018-04-17 Google Llc Accessing data in multi-dimensional tensors using adders
US20180365561A1 (en) * 2017-06-19 2018-12-20 Google Inc. Alternative loop limits
US20190012295A1 (en) 2017-07-07 2019-01-10 Intel Corporation Memory-Size- and Bandwidth-Efficient Method for Feeding Systolic Array Matrix Multipliers
US10387122B1 (en) 2018-05-04 2019-08-20 Olsen Ip Reserve, Llc Residue number matrix multiplier

Non-Patent Citations (18)

* Cited by examiner, † Cited by third party
Title
Abuzaid et al., "Caffe con Troll: Shallow Ideas to Speed Up Deep Learning", DanaC'15 Proceedings of the Fourth Workshop on Data analytics in the Cloud, Article No. 2, 6 pages, May 31-Jun. 4, 2015.
Chetlur et al., "cuDNN: Efficient Primitives for Deep Learning", NVIDIA, Santa Clara, CA, arXiv:1410.0759v3 [cs.NE], pp. 1-9, Dec. 18, 2014.
Halutz et al., U.S. Appl. No. 16/186,620, filed Nov. 12, 2018.
Kang et al., "A Systematic Approach to Blocking Convolutional Neural Networks",Stanford University, arXiv:1606.04209v1 [cs.DC], pp. 1-12, Jun. 14, 2016.
Keller, "Computational Foundation of Cognitive Science—Lecture 15: Convolutions and Kernels", School of Informatics, University of Edinburgh, pp. 1-21, Feb. 23, 2010 downloaded from http://www.inf.ed.ac.uk/teaching/courses/cfcs1/lectures/cfcs_l5.pdf.
Lim et al., "Multidimensional systolic arrays for the implementation of discrete Fourier transforms", IEEE Transactions on Signal Processing, vol. 47, issue 5, pp. 1359-1370, May 1999.
Mellott et al., "The Gauss machine: A Galois-enhanced quadratic residue number system systolic array", Proceedings of IEEE 11th Symposium on Computer Arithmetic, pp. 156-162, Jun. 1993.
Scheiman et al., "A processor-time-minimal schedule for the standard tensor product algorithm", IEEE computer Society, pp. 176-187, year 1994.
Shalev et al., U.S. Appl. No. 15/700,207, filed Sep. 11, 2017.
Shalev et al., U.S. Appl. No. 15/700,213, filed Sep. 11, 2017.
Shalev et al., U.S. Appl. No. 16/136,294, filed Sep. 20, 2018.
Shalev et al., U.S. Appl. No. 16/150,299, filed Oct. 3, 2018.
Suda et al., "Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks", FPGA '16 Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 16-25, Feb. 21-23, 2016.
U.S. Appl. No. 15/700,213 Office Action dated Jun. 25, 2020.
U.S. Appl. No. 16/136,294 office action dated Mar. 3, 2020.
U.S. Appl. No. 16/186,620 Office Action dated Jul. 22, 2020.
Wikipedia, "Matrix Multiplication", pp. 1-14, Jun. 2, 2020 downloaded from https://en.wikipedia.org/w/index.php?title=Matrix_multiplication&oldid=96041746.
Wikipedia, "Outer Product", pp. 1-6, May 28, 2020 downloaded from https://en.wikipedia.org/w/index.php?title=Outer_product&oldid=959358295.

Cited By (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190370631A1 (en) * 2019-08-14 2019-12-05 Intel Corporation Methods and apparatus to tile walk a tensor for convolution operations
US11494608B2 (en) * 2019-08-14 2022-11-08 Intel Corporation Methods and apparatus to tile walk a tensor for convolution operations
US12353988B2 (en) 2020-07-09 2025-07-08 Celestial Ai Inc. Neuromorphic photonics with coherent linear neurons
US12353006B2 (en) 2021-06-18 2025-07-08 Celestial Ai Inc. Electro-photonic network for machine learning
US12339490B2 (en) 2021-06-18 2025-06-24 Celestial Ai Inc. Clock signal distribution using photonic fabric
US12259575B2 (en) 2021-06-18 2025-03-25 Celestial Ai Inc. Clock signal distribution using photonic fabric
US20210406164A1 (en) * 2021-06-25 2021-12-30 Intel Corporation Methods and apparatus for sparse tensor storage for neural network accelerators
US11940907B2 (en) * 2021-06-25 2024-03-26 Intel Corporation Methods and apparatus for sparse tensor storage for neural network accelerators
US12242122B2 (en) 2022-03-18 2025-03-04 Celestial Ai Inc. Multicomponent photonically intra-die bridged assembly
US12399333B2 (en) 2022-03-18 2025-08-26 Celestial AI, Inc. Optical multi-die interconnect bridge with electrical and optical interfaces
US12436346B2 (en) 2022-03-18 2025-10-07 Celestial Ai Inc. Optically bridged multicomponent package with extended temperature range
US12164161B1 (en) 2022-03-18 2024-12-10 Celestial Ai Inc. Stacked-dies optically bridged multicomponent package
US12164162B2 (en) 2022-03-18 2024-12-10 Celestial Ai Inc. Multicomponent photonically bridged assembly
US12298608B1 (en) 2022-03-18 2025-05-13 Celestial Ai Inc. Optically bridged multicomponent package with extended temperature range
US12216318B2 (en) 2022-03-18 2025-02-04 Celestial Ai Inc. Optical bridging element for separately stacked electrical ICs
US12271595B2 (en) 2022-03-18 2025-04-08 Celestial Ai Inc. Photonic memory fabric for system memory interconnection
US12283584B2 (en) 2022-07-26 2025-04-22 Celestial Ai Inc. Electrical bridge package with integrated off-bridge photonic channel interface
US12191257B2 (en) 2022-07-26 2025-01-07 Celestial Ai Inc. Electrical bridge package with integrated off-bridge photonic channel interface
US20240069914A1 (en) * 2022-08-23 2024-02-29 Intel Corporation Hardware enhancements for matrix load/store instructions
WO2024065860A1 (en) * 2022-10-01 2024-04-04 Intel Corporation Hardware support for n-dimensional matrix load and store instructions
US20240168765A1 (en) * 2022-11-16 2024-05-23 Nvidia Corporation Storage of tensor in a cache
US20240161222A1 (en) * 2022-11-16 2024-05-16 Nvidia Corporation Application programming interface to indicate image-to-column transformation
US20240176663A1 (en) * 2022-11-28 2024-05-30 Nvidia Corporation Tensor map cache storage
US12217056B2 (en) * 2023-01-27 2025-02-04 Celestial Ai Inc. Load/store unit for a tensor engine and methods for loading or storing a tensor
GB2630328A (en) * 2023-05-24 2024-11-27 Advanced Risc Mach Ltd Methods and systems for data transfer
GB2630328B (en) * 2023-05-24 2025-09-03 Advanced Risc Mach Ltd Methods and systems for data transfer
US20240394061A1 (en) * 2023-05-24 2024-11-28 Arm Limited Methods and systems for data transfer
WO2025139521A1 (en) * 2023-12-29 2025-07-03 摩尔线程智能科技(北京)股份有限公司 Data processing method and apparatus, electronic device, and storage medium
RU2843497C1 (en) * 2025-03-31 2025-07-14 Акционерное Общество "Софит" Tensor processor
US12442997B2 (en) 2025-04-04 2025-10-14 Celestial AI, Inc. Optically bridged multicomponent package with extended temperature range
US12442999B2 (en) 2025-04-04 2025-10-14 Celestial Ai Inc. Optically bridged multicomponent package with extended temperature range
US12443000B2 (en) 2025-04-04 2025-10-14 Celestial Ai Inc. Optically bridged multicomponent package with extended temperature range
US12442998B2 (en) 2025-04-04 2025-10-14 Celestial AI, Inc. Optically bridged multicomponent package with extended temperature range

Similar Documents

Publication Publication Date Title
US11321092B1 (en) Tensor-based memory access
US12406526B2 (en) Indirectly accessing sample data to perform multi-convolution operations in a parallel processing system
US11775313B2 (en) Hardware accelerator for convolutional neural networks and method of operation thereof
US20190179635A1 (en) Method and apparatus for tensor and convolution operations
US20100115233A1 (en) Dynamically-selectable vector register partitioning
JP2020527778A (en) Register-based matrix multiplication
US20110078415A1 (en) Efficient Predicated Execution For Parallel Processors
CN100489829C (en) System and method for indexed load and store operations in a dual-mode computer processor
JP2009116854A (en) System, method, and computer program product for performing scan operation
US9798543B2 (en) Fast mapping table register file allocation algorithm for SIMT processors
US9207919B2 (en) System, method, and computer program product for bulk synchronous binary program translation and optimization
CN106537330A (en) Parallelization of scalar operations by vector processors using data-indexed accumulators in vector register files, and related circuits, methods, and computer-readable media
US11636569B1 (en) Matrix transpose hardware acceleration
Kasagi et al. Parallel algorithms for the summed area table on the asynchronous hierarchical memory machine, with GPU implementations
WO2021118857A1 (en) Hardware accelerator having reconfigurable instruction set
CN115552371A (en) Variable position shifting for matrix processing
US20170255572A1 (en) System and method for preventing cache contention
US9570125B1 (en) Apparatuses and methods for shifting data during a masked write to a buffer
WO2015094721A2 (en) Apparatuses and methods for writing masked data to a buffer
JP2023542835A (en) Vertical and horizontal broadcast of shared operands
JP2018173956A (en) Semiconductor device
US20230195651A1 (en) Host device performing near data processing function and accelerator system including the same
US20230161626A1 (en) Point cloud adjacency-map and hash-map accelerator
CN114692844B (en) Data processing device, data processing method and related products
US8417735B1 (en) Instruction-efficient algorithm for parallel scan using initialized memory regions to replace conditional statements

Legal Events

Date Code Title Description
FEPP Fee payment procedure

Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

FEPP Fee payment procedure

Free format text: ENTITY STATUS SET TO SMALL (ORIGINAL EVENT CODE: SMAL); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

FEPP Fee payment procedure

Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

STCF Information on status: patent grant

Free format text: PATENTED CASE

MAFP Maintenance fee payment

Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

Year of fee payment: 4