WO2023015560A1 - Systems and methods for sparsity-aware vector processing in general purpose cpus - Google Patents

Info

Publication number
WO2023015560A1
Authority
WO
WIPO (PCT)
Prior art keywords
vector
stream
lane
instructions
instruction
Prior art date
Application number
PCT/CN2021/112508
Other languages
French (fr)
Inventor
Mostafa MAHMOUD
Reza AZIMI
Dawei Li
Wenbo SUN
Original Assignee
Huawei Technologies Co.,Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co.,Ltd. filed Critical Huawei Technologies Co.,Ltd.
Priority to PCT/CN2021/112508 priority Critical patent/WO2023015560A1/en
Publication of WO2023015560A1 publication Critical patent/WO2023015560A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3836Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G06F9/3853Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution of compound instructions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30036Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30036Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
    • G06F9/30038Instructions to perform operations on packed data, e.g. vector, tile or matrix operations using a mask
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30076Arrangements for executing specific machine instructions to perform miscellaneous control operations, e.g. NOP
    • G06F9/3009Thread control instructions

Definitions

  • the present invention pertains to the field of computing, and in particular to systems and methods for sparsity-aware vector processing in general purpose CPUs.
  • HPC high-performance computing
  • AI artificial intelligence
  • Sparsity in vector operations presents challenges during processing, such as unnecessary power consumption and wasted execution time.
  • Existing techniques for dealing with sparsity in data have limitations and deficiencies that render their implementations infeasible or unjustified. For example, existing techniques employ complex hardware requirements that may limit the operating frequency of a vector unit, leading to increased power consumption and chip area. Further, existing techniques rely on hardware dependency checking modules for checking and resolving dependencies among vector instructions, which adds a further layer of complexity to the hardware requirements.
  • An aspect of the disclosure provides for a method.
  • the method includes receiving a stream of vector instructions for processing, the stream of vector instructions including a plurality of vector instructions.
  • the method further includes determining an ineffectual computation corresponding to a first lane of a first vector instruction of the stream of vector instructions.
  • the method further includes determining an effectual computation corresponding to a second lane of a second vector instruction of the stream of vector instructions.
  • the second vector instruction is subsequent to the first vector instruction according to a processing order of the stream of vector instructions, and the second lane of the second vector instruction and the first lane of the first vector instruction correspond to a same lane.
  • the method further includes coalescing the second lane of the second vector instruction with the first lane of the first vector instruction.
  • the method further includes processing the stream of vector instructions.
  • the method may provide for reduced hardware complexity and reduced cost due to same-lane coalescing.
  • the method may leverage instruction set architecture (ISA) support for simpler implementations.
  • the determining an effectual computation is based on a combination window size indicating a number of vector instructions of the stream of vector instructions subsequent to the first vector instruction.
  • the combination window size may indicate an upper limit or a peak for the performance gain that the method may achieve.
  • the stream of vector instructions accumulates to one output register.
  • the method may further provide for simplified hardware support due to coalescing that is based on a stream of instructions that accumulate to the same output register, which may obviate the need for the hardware dependency checking modules that are needed for cross-lane coalescing.
  • the coalescing includes replacing the ineffectual computation with the effectual computation. In some embodiments, the processing the stream of vector instructions includes processing the first vector instruction including the effectual computation.
  • the method may provide for condensing a stream of instructions into a reduced form.
  • the ineffectual computation includes a vector operand having a zero value.
  • the method may provide for extracting sparsity in finer granularity in addition to algorithmic level.
  • the stream of vector instructions is indicated by a start-stream indicator and an end-stream indicator.
  • the method may provide for enhanced ISA extensions to mark the beginning and end of a stream of target instructions.
  • the method further includes determining a second effectual computation corresponding to a third lane of a third vector instruction of the stream of vector instructions.
  • the third vector instruction is subsequent to the second vector instruction according to the processing order of the stream of vector instructions.
  • the third lane of the third vector instruction and the second lane of the second vector instruction correspond to a same lane.
  • the method further includes coalescing the third lane of the third vector instruction with the second lane of the second vector instruction. The method may provide for reducing a stream of instructions to a condensed form.
  • the processing the stream of vector instructions includes processing the second vector instruction including the second effectual computation.
  • the method further includes receiving a second stream of vector instructions for processing.
  • the second stream of vector instructions accumulates to a second output register.
  • the second stream of vector instructions is indicated by a second start-stream indicator and a second end-stream indicator.
  • the method further includes processing the second stream of vector instructions.
  • the processing the second stream of vector instructions is performed after processing the stream of vector instructions.
  • the apparatus includes one or more mask generation units.
  • the apparatus further includes one or more lane processing units.
  • the apparatus further includes one or more lane coalescing units.
  • the apparatus further includes at least one processor.
  • the apparatus further includes at least one machine readable medium storing executable instructions which when executed by the at least one processor configure the apparatus to perform the methods described herein.
  • the apparatus is configured for receiving a stream of vector instructions for processing, the stream of vector instructions including a plurality of vector instructions.
  • the apparatus is further configured for determining, via the one or more mask generation units, an ineffectual computation corresponding to a first lane of a first vector instruction of the stream of vector instructions.
  • the apparatus is further configured for determining, via the one or more mask generation units, an effectual computation corresponding to a second lane of a second vector instruction of the stream of vector instructions.
  • the second vector instruction is subsequent to the first vector instruction according to a processing order of the stream of vector instructions, and the second lane of the second vector instruction and the first lane of the first vector instruction correspond to a same lane.
  • the apparatus is further configured for coalescing, via the one or more lane coalescing units, the second lane of the second vector instruction with the first lane of the first vector instruction.
  • the apparatus is further configured for processing, via one or more lane processing units, the stream of vector instructions, wherein each lane processing unit corresponds to a respective lane in the stream of vector instructions.
  • the apparatus may provide for reduced hardware complexity and reduced cost due to same-lane coalescing.
  • the apparatus may leverage instruction set architecture (ISA) support for simpler implementations.
  • the configuration for determining an effectual computation is based on a combination window size indicating a number of vector instructions of the stream of vector instructions subsequent to the first vector instruction.
  • the combination window size may indicate an upper limit or a peak for performance gain that the apparatus may achieve.
  • the stream of vector instructions accumulates to one output register.
  • the apparatus may further provide for simplified hardware support due to coalescing that is based on a stream of instructions that accumulate to the same output register, which may obviate the need for the hardware dependency checking modules that are needed for cross-lane coalescing.
  • the coalescing includes replacing the ineffectual computation with the effectual computation. In some embodiments, the processing the stream of vector instructions includes processing the first vector instruction including the effectual computation.
  • the apparatus may provide for condensing a stream of instructions into a reduced form.
  • the ineffectual computation includes a vector operand having a zero value.
  • the apparatus may provide for extracting sparsity in finer granularity in addition to algorithmic level.
  • the stream of vector instructions is indicated by a start-stream indicator and an end-stream indicator.
  • the apparatus may provide for enhanced ISA extensions to mark the beginning and end of a stream of target instructions.
  • the executable instructions which when executed by the at least one processor further configure the apparatus for determining, via the one or more mask generation units, a second effectual computation corresponding to a third lane of a third vector instruction of the stream of vector instructions.
  • the third vector instruction is subsequent to the second vector instruction according to the processing order of the stream of vector instructions.
  • the third lane of the third vector instruction and the second lane of the second vector instruction correspond to a same lane.
  • the apparatus is further configured for coalescing, via the one or more lane coalescing units, the third lane of the third vector instruction with the second lane of the second vector instruction.
  • the apparatus may provide for a sparsity-aware vector processing unit (sVPU) for general purpose CPUs that may address the challenge posed by high sparsity ratios.
  • the apparatus may provide for reducing a stream of instructions to a condensed form.
  • the processing the stream of vector instructions includes processing the second vector instruction including the second effectual computation.
  • the executable instructions which when executed by the at least one processor further configure the apparatus for receiving a second stream of vector instructions for processing.
  • the second stream of vector instructions accumulates to a second output register.
  • the second stream of vector instructions is indicated by a second start-stream indicator and a second end-stream indicator.
  • the apparatus is further configured for processing, via the one or more lane processing units, the second stream of vector instructions.
  • the processing, via the one or more lane processing units, the second stream of vector instructions is performed after processing the stream of vector instructions.
  • an electronic device can be configured with machine readable memory containing instructions, which when executed by the processors of the device, configure the device to perform the methods disclosed herein.
  • Embodiments have been described above in conjunction with aspects of the present invention upon which they can be implemented. Those skilled in the art will appreciate that embodiments may be implemented in conjunction with the aspect with which they are described but may also be implemented with other embodiments of that aspect. When embodiments are mutually exclusive, or are incompatible with each other, it will be apparent to those skilled in the art. Some embodiments may be described in relation to one aspect, but may also be applicable to other aspects, as will be apparent to those of skill in the art.
  • FIG. 1 illustrates vector lane coalescing, according to an embodiment of the present disclosure.
  • FIG. 2 illustrates an example of vector dot (vdot) instruction, according to an embodiment of the present disclosure.
  • FIG. 3A and 3B illustrate a matrix-vector multiplication, according to an embodiment of the present disclosure.
  • FIG. 4A and 4B illustrate a matrix-vector multiplication applying block compressed sparse row (BCSR) optimization, according to an embodiment of the present disclosure.
  • FIG. 5A and 5B illustrate a matrix-vector multiplication applying lane coalescing, according to an embodiment of the present disclosure.
  • FIG. 6A illustrates sVPU performance gain over BCSR based on SuiteSparse benchmark, according to an embodiment of the present disclosure.
  • FIG. 6B illustrates sVPU performance gain as a function of combination window size based on SuiteSparse benchmark, according to an embodiment of the present disclosure.
  • FIG. 6C illustrates sVPU performance gain as a function of combination window size based on high performance conjugate gradient (HPCG) workload, according to an embodiment of the present disclosure.
  • FIG. 6D illustrates the average coalescing distance based on HPCG workload, according to an embodiment of the present disclosure.
  • FIG. 7 and 8 illustrate use of stream guards for indicating different streams of instructions, according to an embodiment of the present disclosure.
  • FIG. 9 illustrates a block diagram of an sVPU u-architecture, according to an embodiment of the present disclosure.
  • FIG. 10 illustrates a mask generation unit according to an embodiment of the present disclosure.
  • FIG. 11 illustrates a coalescing method, according to an embodiment of the present disclosure.
  • FIG. 12 illustrates a schematic diagram of an electronic device that may perform any or all of operations of the methods and features explicitly or implicitly described herein, according to different embodiments of the present disclosure.
  • Existing works have a number of deficiencies, for which embodiments described herein may provide solutions.
  • existing works are limited due to their high hardware complexity requirements.
  • High hardware complexity may limit the frequency at which the vector unit can run, leading to unjustifiably high power consumption and extra chip area.
  • the high hardware complexity in existing works may be due to cross-lane coalescing where zero values in a vector lane can be replaced with non-zero values from other lanes of subsequent vector instructions.
  • a lane may refer to a position in a vector register.
  • When a vector is being processed, a lane may refer to a position in the sub-vector that is being processed at a time.
  • For example, when N elements are processed at a time, there may be N lanes, numbered 0 to N-1.
  • a vector processing unit may comprise N processing lanes for processing N elements at a time, such that each processing lane may correspond to an element or position in the sub-vector that is being processed at a time. Embodiments described herein further describe lane definition.
  • Embodiments described herein may provide for reduced hardware complexity and reduced cost due to same-lane coalescing.
  • Same-lane coalescing may refer to, for example, replacing an ineffectual value with a value from the same lane of a subsequent instruction.
  • Same-lane coalescing may provide for a reduced hardware cost in the u-Arch.
  • Hardware dependency checking modules may be needed for checking and resolving dependencies between instructions that are candidates for coalescing. Since candidates are expected to be writing to arbitrary destination vector registers and coalescing operations are based on cross-lane coalescing, existing methods require such dependency checking modules.
  • Embodiments described herein may provide for simplified hardware support due to coalescing that is based on a stream of instructions that accumulate to the same output register. As may be appreciated by a person skilled in the art, accumulating to the same output register may obviate the need for the hardware dependency checking modules that are needed for cross-lane coalescing.
  • Embodiments described herein may leverage ISA support to enable same-lane stream-based approach of coalescing candidate instructions.
  • Embodiments may provide for enhanced ISA extensions to mark the beginning and end of a stream of target instructions (or using writes to control registers) to mark them as "eligible for coalescing without further dependency checks" as they all will be accumulating to the same output.
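  • The stream marking described above can be sketched in software as follows. This is only an illustrative model, not the disclosed hardware: the marker mnemonics `sstart` and `send`, the tuple layout `(op, dest, src1, src2)`, and the function name `split_streams` are all assumptions made for the example, since the text only states that a start-stream and an end-stream indicator exist.

```python
from typing import List, Tuple

# Hypothetical marker mnemonics; the encoding of the indicators is an assumption.
START, END = "sstart", "send"

Inst = Tuple[str, ...]  # (op, dest, src1, src2) for vector ops, (marker,) for guards


def split_streams(program: List[Inst]) -> List[List[Inst]]:
    """Group the instructions found between each START/END pair into one stream."""
    streams: List[List[Inst]] = []
    current = None  # None means "not inside a guarded stream"
    for inst in program:
        op = inst[0]
        if op == START:
            if current is not None:
                raise ValueError("nested start-stream indicator")
            current = []
        elif op == END:
            if current is None:
                raise ValueError("end-stream indicator without a start")
            streams.append(current)
            current = None
        elif current is not None:
            # Inside the guards, every instruction must accumulate to one register.
            if current and current[0][1] != inst[1]:
                raise ValueError("stream must accumulate to a single output register")
            current.append(inst)
        # Instructions outside any guard are not coalescing candidates; skip them.
    if current is not None:
        raise ValueError("missing end-stream indicator")
    return streams


program = [
    (START,),
    ("vdot", "vAcc0", "v1", "v2"),
    ("vdot", "vAcc0", "v3", "v4"),
    (END,),
    ("vadd", "v9", "v7", "v8"),   # unguarded, left alone
    (START,),
    ("vdot", "vAcc1", "v5", "v6"),
    (END,),
]
streams = split_streams(program)
print(len(streams), [len(s) for s in streams])
```

  • In this sketch, the same-destination check stands in for the property that makes further dependency checks unnecessary: every instruction inside one pair of guards writes the same accumulator.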
  • Embodiments described herein may be applied to a wider set of instructions, thereby broadening the target instructions to vector dot, vector add, etc.
  • Embodiments described herein may extend to one or more operations and primitives that perform a stream of computations and reduction on sparse operands.
  • HPCG high performance conjugate gradient
  • DNNs deep neural networks
  • embodiments may provide for a sparsity-aware vector processing unit (sVPU) for general purpose CPUs that may address the challenge posed by high sparsity ratios.
  • an sVPU may skip processing ineffectual computational operations involving zero operands.
  • an sVPU may fill the one or more lanes having zero operands with effectual values from subsequent instructions. Effectively, an sVPU may coalesce multiple vector instructions with sparse vector operands into a single denser vector instruction with fewer (or no) zero values in its operands.
  • Embodiments described herein may apply to any operations or compute primitives that perform a stream of computations and reduction on sparse operands.
  • Such operations or primitives may include multiply-accumulate (MAC), sparse matrix-matrix (SpMM) multiplication, sparse matrix-vector (SpMV) multiplication and embedding operators in recommendation systems (e.g., Sparse Length Sum).
  • an sVPU may apply to one or more sets of applications including machine learning applications (e.g., convolution, multilayer perceptron (MLP), recommendation systems) and HPC.
  • vector instructions that are ready to be executed may be allocated to reservation stations (RS) waiting for execution.
  • An sVPU may operate on top of an existing vector unit, as follows.
  • an sVPU may search through the operands of instructions pending in reservation stations (RS) .
  • An sVPU may further perform lane coalescing operations.
  • Lane coalescing operations may comprise the sVPU finding and dynamically scheduling effectual lanes from subsequent instructions to vacant ineffectual VPU lanes in the current instruction. As a result, fewer instructions may be executed.
  • coalescing e.g., lane coalescing
  • FIG. 1 illustrates vector lane coalescing, according to an embodiment of the present disclosure.
  • a VPU e.g., sVPU 100
  • Each of the one or more lane processing units 105 may process vector operands in the corresponding instruction lane of the stream of instructions.
  • lane 106 processing unit 102 may process vector operands in the corresponding lane (e.g., 106) of the stream of vector instructions.
  • lane 108 processing unit 104 may process vector operands in the corresponding lane (e.g., 108) of the stream of vector instructions.
  • sVPU 100 may process vector instructions (e.g., inst 110, inst 120, and inst 130) where operands are vector registers and the corresponding lanes, or vector elements, across the input vector operands are processed through the same processing lane as illustrated.
  • the operands that the instructions operate on may be vector registers of, for example, N lanes.
  • the stream of vector instructions may reside in reservation stations ready to be executed.
  • Inst 110 may comprise operation A0 X B0 in lane 106 and C0 X D0 in lane 108.
  • Inst 120 may comprise operation A1 X B1 in lane 106 and C1 X D1 in lane 108.
  • Inst 130 may comprise operation A2 X B2 in lane 106 and C2 X D2 in lane 108.
  • One or more vector operands in the stream of vector instructions may have a value of zero.
  • instruction 110 may have vector operand A0 in lane 106 as 0, instruction 120 may have vector operands C1 and D1 in lane 108 as 0, and instruction 130 may have vector operand B2 in lane 106 and vector operand C2 in lane 108 as 0.
  • a vector element with the value ‘zero’ may indicate that the corresponding lane operation is not effectual (ineffectual) and does not affect the final output.
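  • The zero test described above can be sketched as a per-lane mask. This is a minimal software model, assuming a multiply-style lane operation in which a zero in either operand makes the lane ineffectual; the function name `effectual_mask` and the numeric operand values are hypothetical and are chosen only to reproduce the zero pattern of inst 110, inst 120 and inst 130.

```python
from typing import List, Sequence


def effectual_mask(op_a: Sequence[float], op_b: Sequence[float]) -> List[bool]:
    """Return True for each lane whose two operands are both non-zero."""
    if len(op_a) != len(op_b):
        raise ValueError("operands must have the same number of lanes")
    return [a != 0 and b != 0 for a, b in zip(op_a, op_b)]


# Lane order is [lane 106, lane 108]; values are illustrative only.
mask_110 = effectual_mask([0, 2], [3, 5])   # A0 = 0
mask_120 = effectual_mask([4, 0], [6, 0])   # C1 = D1 = 0
mask_130 = effectual_mask([7, 0], [0, 8])   # B2 = 0 and C2 = 0

print(mask_110)  # [False, True]
print(mask_120)  # [True, False]
print(mask_130)  # [False, False]
```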
  • sVPU 100 may determine one or more ineffectual lane operations in a stream of vector instructions. sVPU 100 may further determine one or more subsequent effectual lane operations corresponding to the one or more ineffectual lane operations in the stream of vector instructions. sVPU 100 may coalesce the one or more subsequent effectual lane operations with the corresponding ineffectual lane operations. sVPU 100 may then execute the coalesced instructions.
  • sVPU 100 may replace an ineffectual lane operation with an effectual operation from a subsequent instruction in the same lane. For example, referring to lane 106 of inst 110, vector operation A0 X B0 is ineffectual since the value of A0 is zero. However, lane 106 of inst 120 may have an effectual operation since no vector operand (e.g., A1 or B1) has a zero value.
  • sVPU 100 may coalesce lane 106 of inst 120 with lane 106 of inst 110, thereby replacing an ineffectual operation (e.g., vector operation A0 X B0 in lane 106 of inst 110) with a subsequent effectual operation (e.g., vector operation A1 X B1 in lane 106 of inst 120).
  • sVPU 100 may fill the “bubble” in lane 106 of inst 110 with a subsequent vector operation in the same lane 106, which in the embodiment of FIG. 1 happens to be that of lane 106 of inst 120.
  • sVPU 100 may limit its search scope for determining a subsequent effectual lane to a combination window (CW) 140 of some size N instructions, where N may be a design time parameter.
  • a size N combination window may indicate that sVPU may look into the operands of up to N ready instructions residing in the reservation stations.
  • In FIG. 1, a combination window 140 of size 3 (e.g., 3 instructions: 110, 120 and 130) is illustrated, which includes the current instruction (e.g., inst 110).
  • Embodiments described in reference to FIG. 1, including operations performed via sVPU 100, may be applied to one or more vector instructions including vector fused multiply-add, vector dot operations, and vector reduction instructions.
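  • The FIG. 1 walk-through can be reproduced with a small software model of same-lane coalescing. This is a sketch under stated assumptions rather than the disclosed micro-architecture: each instruction is modelled as a list of per-lane operand pairs, `None` marks a lane whose work has been moved to an earlier instruction, and the names `coalesce`, `is_effectual` and `total`, as well as the numeric operand values, are hypothetical.

```python
from typing import List, Optional, Tuple

Lane = Optional[Tuple[float, float]]  # None marks a lane whose work was moved away


def is_effectual(lane: Lane) -> bool:
    """A lane is effectual when it still holds work and neither operand is zero."""
    return lane is not None and lane[0] != 0 and lane[1] != 0


def coalesce(stream: List[List[Lane]], window: int) -> List[List[Lane]]:
    """Fill ineffectual lanes from the same lane of later instructions in the window."""
    pending = [list(inst) for inst in stream]  # work on a copy of the stream
    issued = []
    for i, cur in enumerate(pending):
        for lane in range(len(cur)):
            if is_effectual(cur[lane]):
                continue
            # The window counts the current instruction, so look at window - 1 others.
            for j in range(i + 1, min(i + window, len(pending))):
                if is_effectual(pending[j][lane]):
                    cur[lane] = pending[j][lane]
                    pending[j][lane] = None
                    break
        if any(is_effectual(l) for l in cur):
            issued.append(cur)  # instructions left with no effectual lane are dropped
    return issued


def total(stream: List[List[Lane]]) -> float:
    """Sum of all lane products, i.e. what a single accumulator would receive."""
    return sum(l[0] * l[1] for inst in stream for l in inst if l is not None)


# Lane order is [lane 106, lane 108]; zeros follow the pattern described for FIG. 1.
inst_110 = [(0, 3), (2, 5)]   # A0 = 0
inst_120 = [(4, 6), (0, 0)]   # C1 = D1 = 0
inst_130 = [(7, 0), (0, 8)]   # B2 = 0, C2 = 0
stream = [inst_110, inst_120, inst_130]

issued = coalesce(stream, window=3)
print(issued)                         # [[(4, 6), (2, 5)]]
print(total(stream), total(issued))   # 34 34
```

  • With these values the three instructions collapse into one: lane 106 of inst 110 takes A1 X B1 from inst 120, and inst 120 and inst 130 are left with no effectual lane.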
  • FIG. 2 illustrates an example of vector dot (vdot) instruction, according to an embodiment of the present disclosure.
  • the vdot instruction of FIG. 2 may be similar to that implemented in RISC-V Divided Element Extension (EDIV) .
  • Similar vector dot instructions may also be available in other architectures with extensions such as ARM SVE and Intel x86 AVX.
  • the example vector dot instruction 200 may take two input vectors (e.g., vOp 210 and vOp 220), each of length 16 elements (a 16 x 32b vector register for each operand).
  • the dot product between the two sub-vectors of 4 elements (indicated by matching hash pattern) in the two operands (e.g., vOp 210 and vOp 220) may be performed and the results accumulated to the corresponding accumulation register (as indicated by matching hash pattern) in the accumulator operand vAcc 230.
  • vAcc 230 may comprise a 4 x 128b vector register for accumulation, for which only 64b may be used (indicated by hash pattern).
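  • The arithmetic of the vdot instruction described above can be sketched as follows. The sketch models only the grouping (16 elements, four sub-vectors of 4, four accumulators); element and accumulator bit widths are not modelled, and the function name `vdot` and its `group` parameter are assumptions made for the example.

```python
from typing import List, Sequence


def vdot(v_acc: Sequence[int], v_op1: Sequence[int], v_op2: Sequence[int],
         group: int = 4) -> List[int]:
    """Add the dot product of each pair of `group`-element sub-vectors to its accumulator."""
    if len(v_op1) != len(v_op2) or len(v_op1) != group * len(v_acc):
        raise ValueError("operand lengths must equal group * number of accumulators")
    out = list(v_acc)
    for g in range(len(v_acc)):
        lo = g * group
        out[g] += sum(a * b for a, b in zip(v_op1[lo:lo + group], v_op2[lo:lo + group]))
    return out


a = list(range(1, 17))      # elements 1..16
b = [1] * 16
acc = vdot([0, 0, 0, 0], a, b)
print(acc)                  # [10, 26, 42, 58]
acc = vdot(acc, a, b)       # a second instruction accumulates into the same register
print(acc)                  # [20, 52, 84, 116]
```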
  • FIG. 3A and 3B illustrate a matrix-vector multiplication, according to an embodiment of the present disclosure.
  • the matrix-vector multiplication algorithm (e.g., multiplication of matrix A 302 with vector B 304) illustrated in FIG. 3A and 3B may be implemented using the vector dot instruction as described, for example, in reference to FIG. 2, without loss of generality with respect to other implementations, for example, those using vector multiply-add instructions.
  • the rows of the input matrix A 302 may be grouped such that each group may comprise as many rows as the number of sub-groups in the vdot instruction, e.g., vdot instruction 200.
  • the number of subgroups is 4, so, each 4 rows of the matrix A may be grouped to be processed simultaneously.
  • each vdot instruction may process a block of, for example, 4x4 elements, shown as b1 311, b2 312, ..., b7 317, against the corresponding sub-vector of B 304, shown as v1 321, v2 322, ..., v7 327.
  • the illustrated 4x4 block in embodiments described herein is for illustration purposes only, and thus any block dimensions may be used according to the embodiments of the present disclosure.
  • the corresponding sub-vector, for illustration purposes, may be 4x1.
  • processing each group of 4 rows against the input vector B may be implemented as a stream of vdot instructions, referring to FIG. 3B, wherein each instruction may take a block (e.g., b1 311, b2 312, ..., b7 317) as its first vector operand along with a sub-vector of B (e.g., v1 321, v2 322, ..., v7 327) broadcast to fill the second vector operand.
  • Each instruction, e.g., vdot 330 involving b1 and v1, may be referred to as one timestep. Accordingly, the multiplication of matrix A 302 with vector B 304 may involve seven timesteps, one for each instruction of the instructions 340 as illustrated.
  • the registers indicated in the instructions 340 may refer to the corresponding block of matrix A 302 and sub-vector B 304.
  • vReg15 may refer to the register (e.g., vector register 1 (vReg1) ) that comprises block 1
  • vReg2 refers to the register (e.g., vReg2) that comprises sub-vector v1 (repeated 4 times to correspond with block 1) .
  • the results of all the instructions belonging to the same stream (i.e., which in this embodiment may be defined by the 4-row group) may be accumulated to the same output register.
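  • The mapping of a 4-row group onto a stream of vdot instructions can be sketched as follows. This is an illustrative model only: the helper names `vdot` and `matvec_group`, the row-major flattening of each 4x4 block, and the example matrix values are assumptions, and the accumulator list stands in for the single output register of the stream.

```python
from typing import List, Sequence, Tuple

SUB = 4  # block dimension used for illustration (4x4 blocks, 4x1 sub-vectors)


def vdot(acc: Sequence[int], op1: Sequence[int], op2: Sequence[int]) -> List[int]:
    """Accumulate the dot product of each 4-element sub-vector pair."""
    return [acc[g] + sum(op1[g * SUB + k] * op2[g * SUB + k] for k in range(SUB))
            for g in range(SUB)]


def matvec_group(rows: Sequence[Sequence[int]], vec: Sequence[int]) -> Tuple[List[int], int]:
    """Multiply one 4-row group by vec, issuing one vdot (one timestep) per block."""
    acc = [0] * SUB
    timesteps = 0
    for base in range(0, len(vec), SUB):
        # First operand: the 4x4 block flattened row by row into 16 lanes.
        block = [rows[r][base + c] for r in range(SUB) for c in range(SUB)]
        # Second operand: the 4x1 sub-vector repeated 4 times to line up with the block.
        bcast = list(vec[base:base + SUB]) * SUB
        acc = vdot(acc, block, bcast)
        timesteps += 1
    return acc, timesteps


rows = [[(r + c) % 3 for c in range(28)] for r in range(SUB)]   # 4 x 28, seven blocks
vec = list(range(1, 29))
result, timesteps = matvec_group(rows, vec)
reference = [sum(rows[r][k] * vec[k] for k in range(28)) for r in range(SUB)]
print(timesteps, result == reference)
```

  • Seven blocks yield seven timesteps here, matching the seven-instruction stream described for FIG. 3B.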
  • FIG. 4A and 4B illustrate a matrix-vector multiplication applying block compressed sparse row (BCSR) optimization, according to an embodiment of the present disclosure.
  • input matrix A 402 may be a sparse matrix wherein the 4x4 blocks b2 412, b5 415 and b7 417 may be all zeros (i.e., having zero values for all elements in the block) .
  • the all-zero blocks, b2 412, b5 415 and b7 417, are illustrated as empty (no hash patterns) .
  • a typical algorithmic optimization such as BCSR, may be used to eliminate those 4x4 blocks that are all-zeros.
  • the corresponding vdot instructions may be avoided (illustrated as crossed out) altogether.
  • processing the 4-row group (which may refer to the part of the multiplication of matrix A 402 with vector B 404 that includes only a group of 4 rows) may only involve the subset of instructions corresponding to blocks b1 411, b3 413, b4 414, and b6 416 as illustrated (in FIG. 4B).
  • the seven-timestep process may be reduced to a four-timestep process 460 by avoiding the all-zero blocks.
  • the BCSR optimization may avoid only the all-zeros blocks (all elements of the block having zero values) .
  • Embodiments described herein may provide for extracting sparsity at block level and within blocks.
  • sVPU 100 may extract sparsity at block level, if not implemented at the algorithmic level (e.g., using BCSR representation of the matrix), as well as fine-grain sparsity within blocks as described herein.
  • sVPU 100 may detect instructions with all-zero input operands and eliminate them altogether from the reservation station.
  • Sparsity may be partially extracted at the algorithmic level using BCSR representation of the matrix as described herein. As described, BCSR may eliminate blocks that are entirely zeros. Embodiments described herein may provide for extracting sparsity both at the block level (like BCSR) and at finer granularity within blocks.
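  • The block-level elimination described above can be sketched as follows. This is a simplified stand-in for BCSR, assuming a single 4-row group and keeping only a list of (block index, block values) pairs; a full BCSR layout also stores block column indices and row pointers for every block row. The names `nonzero_blocks` and `matvec_group_bcsr` and the example matrix are hypothetical.

```python
from typing import List, Sequence, Tuple


def nonzero_blocks(rows: Sequence[Sequence[int]], sub: int = 4) -> List[Tuple[int, List[int]]]:
    """Return (block_index, flattened block) for every block with a non-zero element."""
    kept = []
    for idx, base in enumerate(range(0, len(rows[0]), sub)):
        block = [rows[r][base + c] for r in range(sub) for c in range(sub)]
        if any(block):          # all-zero blocks are dropped and never issued
            kept.append((idx, block))
    return kept


def matvec_group_bcsr(rows: Sequence[Sequence[int]], vec: Sequence[int],
                      sub: int = 4) -> Tuple[List[int], int]:
    """Multiply one row group by vec, spending one timestep per stored block."""
    acc = [0] * sub
    kept = nonzero_blocks(rows, sub)
    for idx, block in kept:
        sub_vec = vec[idx * sub:(idx + 1) * sub]
        for r in range(sub):
            acc[r] += sum(block[r * sub + c] * sub_vec[c] for c in range(sub))
    return acc, len(kept)


# Seven 4x4 blocks; the second, fifth and seventh are all-zero, as in FIG. 4A.
rows = [[1 if (c // 4) in (0, 2, 3, 5) and (r + c) % 3 == 0 else 0
         for c in range(28)] for r in range(4)]
vec = list(range(1, 29))
result, timesteps = matvec_group_bcsr(rows, vec)
reference = [sum(rows[r][k] * vec[k] for k in range(28)) for r in range(4)]
print(timesteps, result == reference)
```

  • Zeros inside the four remaining blocks are still multiplied in this model, which is the fine-grain sparsity that block-level elimination alone does not remove.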
  • FIG. 5A and 5B illustrate a matrix-vector multiplication applying lane coalescing, according to an embodiment of the present disclosure.
  • input matrix A 502 may be a sparse matrix wherein the 4x4 blocks b2 512, b5 515 and b7 517 may be all zeros (similar to matrix A 402) .
  • the remaining blocks, b1 511, b3 513, b4 514, and b6 516 may comprise zero and non-zero values.
  • the elements within the 4x4 blocks b1 511 and b3 513 are shown, in which zero values are indicated as empty (no hash pattern) and non-zero values are indicated via hash patterns.
  • sVPU 100 may look at or examine the instructions to be executed next. sVPU 100 may determine one or more lanes having ineffectual computations due to one of the two corresponding input values being zero. For example, sVPU 100 may begin processing the vdot instructions from block b1 511, which is illustrated as one row.
  • sVPU 100 may determine one or more lanes having ineffectual computations, e.g., lanes 530, 532, 534, 536, 538 and 540, due to zero values in these lanes (zero values indicated as empty boxes –no hash pattern) .
  • sVPU 100 may search, in future or subsequent instructions, for an effectual computation (both operand values are non-zero) corresponding to the same lane (e.g., lane 530) .
  • the search may be based on a combination window of size N.
  • the combination window size N may be, for example, 4 as illustrated.
  • sVPU may search in future instructions, e.g., vdot instructions based on b3 513 and v3 523 and determine an effectual computation for the corresponding lane 530.
  • sVPU 100 may replace 550 the ineffectual computation in the current instruction with the determined effectual computation in the same lane 530, brought forward from a future instruction within the combination window scope as illustrated.
  • This replacement mechanism may be referred to as coalescing and is indicated via arrows pointing from an effectual lane in a future instruction to the corresponding ineffectual lane in the current instruction.
  • a computation may be ineffectual due to a zero value of either operand (i.e., operand from matrix A or vector B) .
  • although embodiments described herein may refer to an ineffectual computation based on an operand of matrix A having a zero value, a person skilled in the art may appreciate that a computation may also be determined to be ineffectual due to a zero value of the vector B operand despite the matrix A operand being non-zero.
  • sVPU 100 may take the same approach, as taken with respect to the ineffectual lane 530 of the current instruction, to the ineffectual lanes 532, 534, 536, 538, 540 of the current instruction. As illustrated, ineffectual lanes 532, 534 and 540 corresponding to the current instructions may be replaced with effectual lanes 532, 534, and 540 corresponding to the subsequent instruction (based on block b3 513) . For ineffectual lanes 536 and 538 in the current instruction, sVPU 100 may look into further subsequent instructions (based on combination window size N) to determine effectual lanes.
  • combination window may refer to the scope of coalescing.
  • Coalescing may be applied to two or more instructions that belong to the same row of blocks and accumulate to the same output vector register (e.g., vReg 15) .
  • embodiments may further enhance processing via coalescing to further reduce the instructions to be processed (i.e., avoiding zeros within BCSR blocks), for example, into a coalesced stream 560 as described herein.
  • the stream of instructions corresponding to the row of blocks may be coalesced into a smaller number of mixed instructions 560 where effectual lanes from different instructions may be packed and processed together.
  • a matrix-vector multiplication which may require seven timesteps (as discussed in reference to FIG. 3A and 3B) may be reduced to four timesteps (as discussed in reference to FIG. 4A and 4B) , and may further be reduced to two timesteps 560 as illustrated.
  • a combination window may determine the scope of coalescing wherein the window size may indicate the depth or how far in future instructions the sVPU may search for effectual lanes.
  • the window size may also determine the upper limit on the speedup that sVPU may achieve. For example, for a combination window size of N, N instructions may be coalesced into 1, which may lead to a speed up of N times.
  • the combination window may be a moving window, i.e., as an instruction is executed, the combination window may slide or move and may be rebased or repositioned to start at the next instruction to execute.
  • sVPU 100 may coalesce instructions that belong to the same stream, i.e., the same row of blocks and accumulate to the same output vector register as further described in reference to FIG. 7 and 8. Coalescing instructions that belong to the same stream may be a key factor in reducing the hardware overhead and complexity of the lane coalescing mechanism. Limiting coalescing to the same stream may avoid the need for any data-dependency checking hardware since all instructions considered for coalescing may accumulate to the same accumulator vector register.
  • sVPU 100 may coalesce instructions based on same-lane coalescing, as described, for example, in reference to FIG. 5B. Same-lane coalescing may replace an ineffectual lane x, for example, with an effectual lane x from a future instruction. Same-lane coalescing may avoid the complex hardware needed to support cross-lane coalescing.
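  • The same-lane coalescing over a moving combination window can be sketched as a small software model. This is a hedged illustration, not the hardware mechanism: each lane is modeled as an (a, b) operand pair feeding a lane-wise multiply-accumulate, the window is assumed to count the current instruction plus the following ones, and the names `coalesce_stream` and `accumulate` are hypothetical.

```python
def coalesce_stream(stream, window):
    """Coalesce a stream of vector instructions using same-lane coalescing.

    stream: list of instructions, each a list of (a, b) operand pairs per lane.
    window: assumed combination window size, counting the current instruction.
    Returns the list of mixed instructions that would actually be issued.
    """
    lanes = len(stream[0])
    # pending[i][x] is True while lane x of instruction i holds unissued effectual work
    pending = [[a != 0 and b != 0 for a, b in ins] for ins in stream]
    issued = []
    for cur in range(len(stream)):
        if not any(pending[cur]):
            continue  # nothing effectual left in this instruction: drop it
        mixed = []
        for x in range(lanes):
            src = None
            # search lane x in program order, starting at the current instruction
            for i in range(cur, min(cur + window, len(stream))):
                if pending[i][x]:
                    src = i
                    break
            if src is None:
                mixed.append((0, 0))  # lane stays idle in this mixed instruction
            else:
                mixed.append(stream[src][x])
                pending[src][x] = False  # never pick this lane again
        issued.append(mixed)
    return issued


def accumulate(instructions):
    """Lane-wise multiply-accumulate into one output register (modeling assumption)."""
    acc = [0] * len(instructions[0])
    for ins in instructions:
        for x, (a, b) in enumerate(ins):
            acc[x] += a * b
    return acc
```

  • Because a lane is only ever moved into the same lane position of an earlier instruction in the same stream, the accumulated result is unchanged while the number of issued instructions shrinks. With a window of 1 the model degenerates to issuing every instruction that has at least one effectual lane.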
  • embodiments described herein may be applied for compute or storage purposes.
  • data to be computed is available at their corresponding registers of a processor, such data may be examined and combined (e.g., coalesced) according to embodiments described herein to produce a reduced number of instructions for compute purposes.
  • sVPU 100 was tested with high performance conjugate gradient (HPCG) workload and the SuiteSparse benchmark (1000 matrices were randomly selected from the benchmark suite) . Different combination window sizes were attempted and the results observed. Based on the tests, the following observations may be drawn.
  • as the combination window size increases, the potential performance gain may also increase.
  • the increased window size may also lead to increased hardware complexity. Accordingly, at a certain combination window size, the performance gain may peak.
  • for the SuiteSparse benchmark, the performance gain may peak at a combination window size of 16, where the speedup may saturate at 2.2x on average over the 1000 matrices.
  • for the HPCG workload, the performance gain may peak at a combination window size of 7, where the speedup may saturate at around 3x.
  • FIG. 6A illustrates sVPU performance gain over BCSR based on SuiteSparse benchmark, according to an embodiment of the present disclosure.
  • the horizontal axis of FIG. 6A refers to different matrices of the SuiteSparse benchmark (only 29 matrices are shown for illustrative purposes) .
  • the vertical axis of FIG. 6A refers to performance gain (i.e., speedup) of the sVPU over BCSR.
  • a random sample of 1000 matrices from the SuiteSparse benchmark was tested. Across the sample, speedups of up to 9x (not shown) and an average speedup of 2.2x (illustrated via line 602) were observed. The speedup of sVPU over BCSR for a portion of the random sample is illustrated.
  • FIG. 6B illustrates sVPU performance gain as a function of combination window size based on SuiteSparse benchmark, according to an embodiment of the present disclosure.
  • the horizontal axis of FIG. 6B refers to different combination window sizes.
  • the vertical axis of FIG. 6B refers to performance gain (i.e., speedup) over a combination window size of 2.
  • the performance gain saturates as combination window size increases.
  • the performance gain peaks at combination window size 16 (illustrated via hash lines), having approximately 60% performance gain over combination window size 2 as illustrated via line 604.
  • FIG. 6C illustrates sVPU performance gain as a function of combination window size based on high performance conjugate gradient (HPCG) workload, according to an embodiment of the present disclosure.
  • the horizontal axis of FIG. 6C refers to different combination window sizes.
  • the vertical axis of FIG. 6C refers to performance gain (i.e., speedup) over BCSR. As illustrated, the performance gain saturates as combination window size increases.
  • the performance gain peaks at combination window size 7 (illustrated via hash lines) reaching a performance gain of just below 3 times that of BCSR (illustrated via line 606) .
  • FIG. 6D illustrates the average coalescing distance based on HPCG workload, according to an embodiment of the present disclosure.
  • the horizontal axis of FIG. 6D refers to different combination window sizes.
  • the vertical axis of FIG. 6D refers to the average coalescing distance associated with the corresponding combination window size.
  • the average distance illustrated indicates the average distance between the current instruction and the further coalesced lane (i.e., in future instruction) .
  • the average coalescing distance based on applying sVPU 100 to the HPCG workload was determined to be 2.25 instructions (illustrated via line 608) at a combination window size of 6 (illustrated via hash lines), indicating that an instruction, on average, became sufficiently dense by coalescing lanes from the subsequent 3 (rounding up 2.25) instructions.
  • the performance gain (i.e., speedups) achieved may be significant considering the small hardware complexity associated with sVPU, coalescing conditions and combination window size.
  • Coalescing conditions may be based on same-lane coalescing within a stream of instructions accumulating to the same accumulation registers.
  • the combination window size may be based on the peak performance gain as described herein.
  • Embodiments will now describe indicators for a stream of instructions that accumulate to the same accumulation registers. Embodiments may provide for start and end indicators for a stream of instructions.
  • sVPU 100 may use a stream-guards technique to mark or indicate the beginning and end of a stream of vector instructions that accumulate to the same output vector register. Such a stream of instructions may be safely coalesced without the need for data-dependency checking hardware. Otherwise, if instructions accumulating to different output vector registers are considered for coalescing, special, expensive hardware may be needed to ensure the lane-wise data-independence of the outputs of these candidate instructions.
  • sVPU 100 may use stream guards surrounding the stream of instructions accumulating to the same output vector register. These stream guards may indicate to the backend sVPU hardware that the instructions within the stream guards are functionally correct to coalesce.
  • sVPU 100 may insert stream guards surrounding each stream of vdot instructions where a stream corresponds to a complete row of 4x4 blocks.
  • sVPU 100 may insert stream guards before the beginning and after the end of matrix A, for example, to indicate that the row of blocks comprising b1 to b7 is one stream of vdot instructions to which coalescing operations, as described herein, may be applied.
  • the stream guards may limit sVPU coalescing to vdot instructions from the same row of blocks and avoid coalescing vdot instructions from the next row of blocks as further described herein.
  • FIG. 7 and 8 illustrate use of stream guards for indicating different streams of instructions, according to an embodiment of the present disclosure.
  • a matrix-vector multiplication may comprise an input matrix A which may include two rows of 4x4 blocks 740 and 742 as illustrated.
  • the first row of blocks 740 may correspond to vector instructions that accumulate to a first output vector register.
  • the second row of 4x4 blocks 742 may correspond to vector instructions that accumulate to a second output vector register.
  • sVPU 100 may use stream guards to indicate the beginning (via, for example, sVPU_stream_start) and the end (via, for example, sVPU_stream_end) of a stream of instructions that accumulate to the same output vector register.
  • the first row of 4x4 blocks 740 may be indicated as a first stream of instructions 750 using stream guards as illustrated.
  • the second row of 4x4 blocks 742 may be indicated as a second stream of instructions 752 using stream guards as illustrated.
  • vReg15 is illustrated as the output register for accumulating both streams of instructions 750 and 752.
  • the outputs in the register may be flushed and stored in a memory and the register may then be initialized to zero in preparation for the second stream of instructions 752.
  • a different output vector register may be used for each stream of instructions.
  • sVPU 100 may apply coalescing mechanisms as described herein to the remaining blocks (e.g., blocks b1 711, b3 713, b4 714, and b6 716 for the first row of 4x4 blocks 740 corresponding to the first stream of instructions 750, and blocks b9 719, b10 720, b12 722 and b14 724 for the second row of 4x4 blocks 742 corresponding to the second stream of instructions 752), since the remaining blocks may comprise zero value elements.
  • stream guards may be implemented either as an extension to the instruction set architecture (ISA) or through introducing a new control/status register (CSR) .
  • sVPU 100 may introduce a new CSR named SVPU_CR.
  • a stream start may be marked by writing “1” and a stream end may be marked by writing “0” to SVPU_CR to enable (when, for example, “1” is written) and disable (when, for example, “0” is written) coalescing by the sVPU backend.
  • the SVPU_CR may be implemented as a 1-bit register; accordingly, there may be no support for nesting of streams. No support for nesting of streams may mean that sVPU 100 may keep track of a single stream at a time. So, if a new stream (i.e., a second stream) needs to be started, the current one (i.e., first stream) needs to be terminated (with the stream guard closure) before beginning the new stream. Then, after finishing the new stream (i.e., the second stream), the remainder of the old stream that was not processed due to termination may then be processed as a new shorter stream (i.e., third stream). When terminating the first stream, no state needs to be retained.
  • a CSR may be written to using an atomic CSR read and write instruction which may be available in every ISA, such as CSRRW for the RISC-V ISA.
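  • The effect of the 1-bit SVPU_CR stream guards on an instruction sequence can be sketched with a small software model. This is an assumption-laden illustration: the tuple encoding of instructions, the `"csrw"` tag, and the function name `split_streams` are hypothetical; only the register name SVPU_CR and the write-1/write-0 convention come from the description above.

```python
def split_streams(program):
    """Group vector instructions into coalescable streams using stream guards.

    program: list of tuples. ("csrw", value) models a write to the 1-bit
    SVPU_CR (1 marks a stream start, 0 marks a stream end); any other tuple
    is treated as a vector instruction.
    Returns a list of streams, each a list of the instructions enclosed by guards.
    """
    svpu_cr = 0          # 1-bit register: only one stream can be open at a time
    streams = []
    current = []
    for op in program:
        if op[0] == "csrw":
            new_value = op[1] & 1
            if svpu_cr == 1 and new_value == 0:
                streams.append(current)  # closing guard: stream is complete
                current = []
            svpu_cr = new_value
        elif svpu_cr == 1:
            current.append(op)  # inside guards: candidate for coalescing
        # instructions seen while SVPU_CR == 0 execute without coalescing
    return streams
```

  • Because the model holds a single bit, it keeps no state about a previous stream once the closing guard is seen, which is consistent with the lack of nesting support.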
  • Embodiments described herein introduce the concept of “instructions stream” .
  • Embodiments described herein may limit coalescing to a stream of instructions accumulating to the same output vector register as determined according to stream guards. Coalescing based on a stream of instructions, as described herein, may allow for reduced costs in terms of hardware by obviating the need for data-dependency checking hardware that is otherwise necessary to resolve dependency between candidate instructions. By limiting the coalescing candidate instructions to the same stream of instructions, coalescing may be performed without dependency checks.
  • FIG. 9 illustrates a block diagram of an sVPU u-architecture, according to an embodiment of the present disclosure.
  • the execution backend 900 may comprise one or more of a reorder buffer (ROB) 902, a lane coalescing unit (LCU) 904, a mask generation unit (MGU) 906, a reservation station (RS) 908, a vector register file 910, and an sVPU 100.
  • FIG. 10 illustrates a mask generation unit according to an embodiment of the present disclosure.
  • the MGU 906 may generate effectual lane masks (ELM) for each vdot instruction that resides in RS 908 and is ready to be executed.
  • the corresponding bit in the generated mask may indicate whether the input value is effectual or not.
  • the ELM may indicate which lanes have effectual values.
  • the generated ELM masks may be kept either in a newly added field “ELM 1002” in the RS table or in some mask physical register file if available in the CPU architecture.
  • FIG. 10 illustrates one or more MGU units 906 including comparators, NOR gates, and a new field “ELM 1002” added to the RS table to keep the generated masks.
  • each vector operand 1004 and 1006 may be evaluated to determine whether either operand is zero and thus ineffectual.
  • the results i.e., ineffectual or effectual
  • the MGU 906 may be instantiated N times where N is the size of the combination window so that up to N instructions may be investigated in parallel at a time.
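  • The mask generation described for the MGU can be sketched in software as follows. This is a minimal model under stated assumptions: lane 0 is mapped to the least significant bit of the mask, and the function name `effectual_lane_mask` is hypothetical. The per-lane zero comparisons and the NOR combination follow the description of comparators and NOR gates.

```python
def effectual_lane_mask(operand_a, operand_b):
    """Build an effectual lane mask (ELM) for one binary vector instruction.

    Bit x of the result is 1 when lane x is effectual, i.e., neither operand
    in that lane is zero. Lane 0 maps to the least significant bit (assumption).
    """
    mask = 0
    for lane, (a, b) in enumerate(zip(operand_a, operand_b)):
        a_is_zero = a == 0  # comparator on the first operand
        b_is_zero = b == 0  # comparator on the second operand
        effectual = not (a_is_zero or b_is_zero)  # NOR of the two zero flags
        if effectual:
            mask |= 1 << lane
    return mask
```

  • Calling the function once per ready instruction in the combination window yields the set of masks that a coalescing unit could consume, analogous to instantiating the MGU N times.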
  • the LCU 904 may use the masks (e.g., ELM 1002) generated by the MGUs 906 as input and decide which lanes to coalesce accordingly.
  • the coalescing mechanism e.g., LCU 904 may employ the method described in FIG 11 to perform the coalescing operations as described herein.
  • FIG. 11 illustrates a coalescing method, according to an embodiment of the present disclosure.
  • one or more ineffectual lanes in the current vector instruction are identified or determined according to embodiments described herein.
  • LCU 904 may inspect the corresponding lane position in each ready instruction within the combination window in program order.
  • the corresponding effectual lane in the earliest ready instruction is coalesced to fill the bubble of the ineffectual lane and the corresponding input operands are brought forward and packed into the current instruction vector inputs.
  • the corresponding bit in the ELM mask may be zeroed, thereby marking or indicating that the effectual lane has been successfully coalesced and will be executed. This ensures the lane will not be considered again while performing future coalescing.
  • the LCU 904 may keep a record of the original instruction from which a lane is coalesced so that, in case of an interruption, the machine state is maintained by squashing the lanes coalesced from subsequent instructions into the current one.
  • the mixed instruction may be issued for execution by the vector processing engine. Any subsequent instruction for which all the effectual lanes have been coalesced and executed, is removed from the reservation station and marked as “done” .
  • LCU may prioritize an earlier instruction ahead of a later instruction according to the program order, i.e., for an ineffectual lane in the current instruction, an effectual corresponding lane from an earlier instruction (e.g., a first subsequent instruction (I 1 ) ) may be given a higher priority over the corresponding lane of a later instruction (e.g., a second subsequent instruction (I 2 ) ) if I 1 is older than I 2 according to the program order.
  • Prioritizing instructions may help simplify the interrupt-handling mechanisms in case an interruption occurs and the coalesced lanes from subsequent instructions need to be squashed.
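  • A single coalescing step driven by ELM masks, with program-order priority and a record of where each lane came from, can be sketched as below. This is an illustrative model rather than the LCU design: masks are plain integers with lane 0 at the least significant bit, and the name `coalesce_step` and the returned tuple layout are hypothetical.

```python
def coalesce_step(elms, num_lanes):
    """Perform one coalescing step over the combination window.

    elms[0] is the ELM of the current instruction; elms[1:] are the ELMs of
    the ready instructions that follow it, in program order.
    Returns (sources, updated_elms, done):
      sources[x]   index of the instruction supplying lane x, or None if idle;
      updated_elms masks with every selected lane bit cleared;
      done         indices of subsequent instructions with no effectual lanes left.
    """
    elms = list(elms)  # work on a copy so the caller's masks are untouched
    sources = []
    for lane in range(num_lanes):
        bit = 1 << lane
        src = None
        # priority selection: the earliest instruction in program order wins
        for idx, mask in enumerate(elms):
            if mask & bit:
                src = idx
                break
        if src is not None:
            elms[src] &= ~bit  # zero the bit so the lane is not coalesced twice
        sources.append(src)    # provenance record, usable for squashing on interrupt
    done = [i for i in range(1, len(elms)) if elms[i] == 0]
    return sources, elms, done
```

  • The `sources` list plays the role of the record of original instructions, and `done` lists the subsequent instructions that could be removed from the reservation station after the mixed instruction is issued.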
  • Embodiments described in reference to the coalescing mechanism may be simple and cost effective since such mechanism does not need expensive hardware support.
  • the coalescing mechanism embodiments described herein may offer a low-cost approach based on one or more of the following.
  • the coalescing mechanism embodiments described herein may offer a low-cost approach by limiting the combination window to a practical size rather than including all reservation station entries for performing the search for effectual lanes to coalesce.
  • the coalescing mechanism embodiments described herein may offer a low-cost approach since the hardware support for implementation may be based on an MGU (e.g., MGU 906) comprising comparators to detect zeros and NOR gates to take both operands into account for binary operators.
  • the coalescing mechanism embodiments described herein may offer a low-cost approach since the mechanism (used in the LCU 904) may be implemented using priority-based selection hardware (to prioritize instructions in program order) that may have optimized and known implementations.
  • Embodiments described herein may provide for a vector processing unit (e.g., a sparsity-aware vector processing unit (sVPU) ) that coalesces vector instructions into fewer instructions by packing only effectual lanes from the original instructions.
  • the sVPU 100 may consider only same-lane coalescing and avoid cross-lane coalescing, which requires expensive hardware components in the micro-architecture.
  • the sVPU 100 may consider only instructions that accumulate to the same output vector register for coalescing. Limiting the instructions in such a way may simplify the implementation and avoid the otherwise necessary data-dependency checking hardware.
  • an instruction stream may be marked using “stream guards” that may be implemented using ISA extension instructions or using a new control/status register (CSR) which may be written to indicate the start and end of the stream.
  • Embodiments described herein may target a wide set of operations or primitives that perform stream of computations and reduction on sparse operands.
  • a wide set of CPU instructions may be targeted for coalescing by sVPU 100 including: vector Dot, vector add, and vector multiply-accumulate.
  • although embodiments are described in the context of a CPU vector unit, embodiments may be equivalently applicable to other commodity architectures such as graphics processing units (GPUs) and digital signal processors (DSPs).
  • These commodity architectures typically feature vector processing engines and their typical workloads are expected to have similar levels of sparsity in their input data. As such, embodiments described here may be applicable to such other commodity architectures.
  • FIG. 12 is a schematic diagram of an electronic device 1200 that may perform any or all of operations of the methods and features explicitly or implicitly described herein, according to different embodiments of the present disclosure.
  • the electronic device 1200 may include a processor 1210, such as a Central Processing Unit (CPU) or specialized processors such as a Graphics Processing Unit (GPU) or other such processor unit, memory 1220, non-transitory mass storage 1230, input-output interface 1240, network interface 1250, and a transceiver 1260, all of which are communicatively coupled via bi-directional bus 1270.
  • the memory 1220 may include any type of non-transitory memory such as static random-access memory (SRAM) , dynamic random-access memory (DRAM) , synchronous DRAM (SDRAM) , read-only memory (ROM) , any combination of such, or the like.
  • the mass storage element 1230 may include any type of non-transitory storage device, such as a solid-state drive, hard disk drive, a magnetic disk drive, an optical disk drive, USB drive, or any computer program product configured to store data and machine executable program code. According to certain embodiments, the memory 1220 or mass storage 1230 may have recorded thereon statements and instructions executable by the processor 1210 for performing any of the aforementioned method operations described above.
  • Embodiments of the present invention can be implemented using electronics hardware, software, or a combination thereof.
  • the invention is implemented by one or multiple computer processors executing program instructions stored in memory.
  • the invention is implemented partially or fully in hardware, for example using one or more field programmable gate arrays (FPGAs) or application specific integrated circuits (ASICs) to rapidly perform processing operations.
  • Acts associated with the method described herein can be implemented as coded instructions in a computer program product.
  • the computer program product is a computer-readable medium upon which software code is recorded to execute the method when the computer program product is loaded into memory and executed on the microprocessor of the wireless communication device.
  • each operation of the method may be executed on any computing device, such as a personal computer, server, PDA, or the like and pursuant to one or more, or a part of one or more, program elements, modules or objects generated from any programming language, such as C++, Java, or the like.
  • each operation, or a file or object or the like implementing each said operation may be executed by special purpose hardware or a circuit module designed for that purpose.
  • the present invention may be implemented by using hardware only or by using software and a necessary universal hardware platform. Based on such understandings, the technical solution of the present invention may be embodied in the form of a software product.
  • the software product may be stored in a non-volatile or non-transitory storage medium, which can be a compact disc read-only memory (CD-ROM) , USB flash disk, or a removable hard disk.
  • the software product includes a number of instructions that enable a computer device (personal computer, server, or network device) to execute the methods provided in the embodiments of the present invention. For example, such an execution may correspond to a simulation of the logical operations as described herein.
  • the software product may additionally or alternatively include a number of instructions that enable a computer device to execute operations for configuring or programming a digital logic apparatus in accordance with embodiments of the present invention.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Advance Control (AREA)

Abstract

Systems and methods for sparsity-aware vector processing in general purpose CPUs are described. An aspect of the disclosure provides for a method including receiving a stream of vector instructions for processing. The method further includes determining an ineffectual computation corresponding to a first lane of a first vector instruction of the stream of vector instructions and determining an effectual computation corresponding to a second lane of a second vector instruction of the stream of vector instructions, where the second vector instruction is subsequent to the first vector instruction according to a processing order of the stream of vector instructions, and the second lane of the second vector instruction and the first lane of the first vector instruction correspond to a same lane. The method further includes coalescing the second lane with the first lane and processing said stream of vector instructions.

Description

SYSTEMS AND METHODS FOR SPARSITY-AWARE VECTOR PROCESSING IN GENERAL PURPOSE CPUS
CROSS-REFERENCE TO RELATED APPLICATIONS
This is the first application filed for the present invention.
TECHNICAL FIELD
The present invention pertains to the field of computing, and in particular to systems and methods for sparsity-aware vector processing in general purpose CPUs.
BACKGROUND
Many high-performance computing (HPC) and artificial intelligence (AI) applications involve sparse data. Sparsity in vector operations presents challenges during processing such as unnecessary power consumption and wasting execution time. Existing techniques for dealing with sparsity in data have limitations and deficiencies that render their implementations infeasible or unjustified. For example, existing techniques employ complex hardware requirements that may limit the operating frequency of a vector unit, leading to increased power consumption and chip area. Further, existing techniques rely on hardware dependency checking modules for checking and resolving dependencies among vector instructions, which adds a further layer of complexity to the hardware requirements.
Therefore, there is a need for systems and methods to obviate or mitigate one or more limitations of the prior art.
This background information is provided to reveal information believed by the applicant to be of possible relevance to the present invention. No admission is necessarily intended, nor should be construed, that any of the preceding information constitutes prior art against the present invention.
SUMMARY
An aspect of the disclosure provides for a method. The method includes receiving a stream of vector instructions for processing, the stream of vector instructions including a plurality of vector instructions. The method further includes determining an ineffectual computation corresponding to a first lane of a first vector instruction of the stream of vector instructions. The method further includes determining an effectual computation corresponding to a second lane of a second vector instruction of the stream of vector instructions. In the method, the second vector instruction is subsequent to the first vector instruction according to a processing order of the stream of vector instructions, and the second lane of the second vector instruction and the first lane of the first vector instruction correspond to a same lane. The method further includes coalescing the second lane of the second vector instruction with the first lane of the first vector instruction. The method further includes processing the stream of vector instructions. The method may provide for reduced hardware complexity and reduced cost due to same-lane coalescing. The method may leverage instruction set architecture (ISA) support for simpler implementations.
In some embodiments, the determining an effectual computation is based on a combination window size indicating a number of vector instructions of the stream of vector instructions subsequent to the first vector instruction. The combination window size may indicate an upper limit or a peak for the performance gain that the method may achieve.
In some embodiments, the stream of vector instructions accumulates to one output register. The method may further provide for simplified hardware support due to coalescing that is based on a stream of instructions that accumulate to the same output register, which may obviate the need for hardware dependency checking modules that are needed for cross-lane coalescing.
In some embodiments, the coalescing includes replacing the ineffectual computation with the effectual computation. In some embodiments, the processing the stream of vector instructions includes processing the first vector instruction including the effectual computation. The method may provide for condensing a stream of instructions into a reduced form.
In some embodiments, the ineffectual computation includes a vector operand having a zero value. The method may provide for extracting sparsity in finer granularity in addition to algorithmic level.
In some embodiments, the stream of vector instructions is indicated by a start-stream indicator and an end-stream indicator. The method may provide for enhanced ISA extensions to mark the beginning and end of a stream of target instructions.
In some embodiments, the method further includes determining a second effectual computation corresponding to a third lane of a third vector instruction of the stream of vector instructions. In some embodiments, the third vector instruction is subsequent to the second vector instruction according to the processing order of the stream of vector instructions. In some embodiments, the third lane of the third vector instruction and the second lane of the second vector instruction correspond to a same lane. In some embodiments, the method further includes coalescing the third lane of the third vector instruction with the second lane of the second vector instruction. The method may provide for reducing a stream of instructions to a condensed form.
In some embodiments, the processing the stream of vector instructions includes processing the second vector instruction including the second effectual computation.
In some embodiments, the method further includes receiving a second stream of vector instructions for processing. In some embodiments, the second stream of vector instructions accumulates to a second output register. In some embodiments, the second stream of vector instructions is indicated by a second start-stream indicator and a second end-stream indicator. In some embodiments, the method further includes processing the second stream of vector instructions.
In some embodiments, the processing the second stream of vector instructions is performed after processing the stream of vector instructions.
Another aspect of the disclosure provides for an apparatus. The apparatus includes one or more mask generation units. The apparatus further includes one or more lane processing units. The apparatus further includes one or more lane coalescing units. The apparatus further includes at least one processor. The apparatus further includes at least one machine readable medium storing executable instructions which when executed by the at least one processor configure the apparatus to perform the methods described herein. For example, the apparatus is configured for receiving a stream of vector instructions for processing, the stream of vector instructions including a plurality of vector instructions. The apparatus is further configured for determining, via the one or more mask generation units, an ineffectual computation corresponding to a first lane of a first vector instruction of the stream of vector instructions. The apparatus is further configured for determining, via the one or more mask generation units, an effectual computation corresponding to a second lane of a second vector instruction of the stream of vector instructions. The second vector instruction is subsequent to the first vector instruction according to a processing order of the stream of vector instructions, and the second lane of the second vector instruction and the first lane of the first vector instruction correspond to a same lane. The apparatus is further configured for coalescing, via the one or more lane coalescing units, the second lane of the second vector instruction with the first lane of the first vector instruction. The apparatus is further configured for processing, via the one or more lane processing units, the stream of vector instructions, wherein each lane processing unit corresponds to a corresponding lane in the stream of vector instructions. The apparatus may provide for reduced hardware complexity and reduced cost due to same-lane coalescing.
The apparatus may leverage instruction set architecture (ISA) support for simpler implementations.
In some embodiments, the configuration for determining an effectual computation is based on a combination window size indicating a number of vector instructions of the stream of vector instructions subsequent to the first vector instruction. The combination window size may indicate an upper limit or a peak for performance gain that the apparatus may achieve.
In some embodiments, the stream of vector instructions accumulates to one output register. The apparatus may further provide for simplified hardware support due to coalescing that is based on a stream of instructions that accumulate to the same output register, which may obviate the need for the hardware dependency checking modules that are needed for cross-lane coalescing.
In some embodiments, the coalescing includes replacing the ineffectual computation with the effectual computation. In some embodiments, the processing the stream of vector instructions includes processing the first vector instruction including the effectual computation. The apparatus may provide for condensing a stream of instructions into a reduced form.
In some embodiments, the ineffectual computation includes a vector operand having a zero value. The apparatus may provide for extracting sparsity in finer granularity in addition to algorithmic level.
In some embodiments, the stream of vector instructions is indicated by a start-stream indicator and an end-stream indicator. The apparatus may provide for enhanced ISA extensions to mark the beginning and end of a stream of target instructions.
In some embodiments, the executable instructions which when executed by the at least one processor further configure the apparatus for determining, via the one or more mask generation units, a second effectual computation corresponding to a third lane of a third vector instruction of the stream of vector instructions. In some embodiments, the third vector instruction is subsequent to the second vector instruction according to the processing order of the stream of vector instructions. In some embodiments, the third lane of the third vector instruction and the second lane of the second vector instruction correspond to a same lane. In some embodiments, the apparatus is further configured for coalescing, via the one or more lane coalescing units, the third lane of the third vector instruction with the second lane of the second vector instruction. The apparatus may provide for a sparsity-aware vector processing unit (sVPU) for general purpose CPUs that may address the challenge posed by high sparsity ratios. The apparatus may provide for reducing a stream of instructions to a condensed form.
In some embodiments, the processing the stream of vector instructions includes processing the second vector instruction including the second effectual computation.
In some embodiments, the executable instructions which when executed by the at least one processor further configure the apparatus for receiving a second stream of vector instructions for processing. In some embodiments, the second stream of vector instructions accumulates to a second output register. In some embodiments, the second stream of vector instructions is indicated by a second start-stream indicator and a second end-stream indicator. In some embodiments, the apparatus is further configured for processing, via the one or more lane processing units, the second stream of vector instructions.
In some embodiments, the processing, via the one or more lane processing units, the second stream of vector instructions is performed after processing the stream of vector instructions.
Other aspects of the disclosure provide for machine readable mediums, apparatus and systems configured to implement the methods disclosed herein. For example, an electronic device can be configured with machine readable memory containing instructions, which when executed by the processors of these devices, configures the device to perform the methods disclosed herein.
Embodiments have been described above in conjunction with aspects of the present invention upon which they can be implemented. Those skilled in the art will appreciate that embodiments may be implemented in conjunction with the aspect with which they are described but may also be implemented with other embodiments of that aspect. When embodiments are mutually exclusive, or are incompatible with each other, it will be apparent to those skilled in the art. Some embodiments may be described in relation to one aspect, but may also be applicable to other aspects, as will be apparent to those of skill in the art.
BRIEF DESCRIPTION OF THE DRAWINGS
Further features and advantages of the present invention will become apparent from the following detailed description, taken in combination with the appended drawings, in which:
FIG. 1 illustrates vector lane coalescing, according to an embodiment of the present disclosure.
FIG. 2 illustrates an example of vector dot (vdot) instruction, according to an embodiment of the present disclosure.
FIG. 3A and 3B illustrate a matrix-vector multiplication, according to an embodiment of the present disclosure.
FIG. 4A and 4B illustrate a matrix-vector multiplication applying block compressed sparse row (BCSR) optimization, according to an embodiment of the present disclosure.
FIG. 5A and 5B illustrate a matrix-vector multiplication applying lane coalescing, according to an embodiment of the present disclosure.
FIG. 6A illustrates sVPU performance gain over BCSR based on SuiteSparse benchmark, according to an embodiment of the present disclosure.
FIG. 6B illustrates sVPU performance gain as a function of combination window size based on SuiteSparse benchmark, according to an embodiment of the present disclosure.
FIG. 6C illustrates sVPU performance gain as a function of combination window size based on high performance conjugate gradient (HPCG) workload, according to an embodiment of the present disclosure.
FIG. 6D illustrates the average coalescing distance based on HPCG workload, according to an embodiment of the present disclosure.
FIG. 7 and 8 illustrate use of stream guards for indicating different streams of instructions, according to an embodiment of the present disclosure.
FIG. 9 illustrates a block diagram of an sVPU u-architecture, according to an embodiment of the present disclosure.
FIG. 10 illustrates a mask generation unit according to an embodiment of the present disclosure.
FIG. 11 illustrates a coalescing method, according to an embodiment of the present disclosure.
FIG. 12 illustrates a schematic diagram of an electronic device that may perform any or all of operations of the methods and features explicitly or implicitly described herein, according to different embodiments of the present disclosure.
It will be noted that throughout the appended drawings, like features are identified by like reference numerals.
DETAILED DESCRIPTION
Exploiting sparsity in CPU vector processing units has been moderately explored. While existing works target sparsity in operands of vector instructions, their proposals come with a number of limitations and deficiencies. For example, the hardware complexity of existing proposals may be prohibitive and render their implementation either infeasible or unjustified.
Existing works have a number of deficiencies for which embodiments described herein may provide solutions. As mentioned herein, existing works are limited due to their high hardware complexity requirements. High hardware complexity may limit the frequency at which the vector unit can run, leading to unjustifiably high power consumption and extra chip area. The high hardware complexity in existing works may be due to cross-lane coalescing, where zero values in a vector lane can be replaced with non-zero values from other lanes of subsequent vector instructions.
A lane may refer to a position in a vector register. When a vector is being processed, a lane may refer to a position in the sub-vector that is being processed at a time. In the case of processing N elements at a time, there may be N lanes, numbered 0 to N-1. Accordingly, in an embodiment, a vector processing unit may comprise N processing lanes for processing N elements at a time such that each processing lane may correspond to an element or position in the sub-vector that is being processed at a time. Embodiments described herein further describe lane definition.
Embodiments described herein may provide for reduced hardware complexity and reduced cost due to same-lane coalescing. Same-lane coalescing may refer to, for example, replacing an ineffectual value with a value from the same lane of a subsequent instruction. Same-lane coalescing may provide for a reduced hardware cost in the u-Arch.
The high hardware complexity in existing works may be further due to the need for hardware dependency checking modules. Hardware dependency checking modules may be needed for checking and resolving dependencies between instructions that are candidates for coalescing. Since candidates are expected to be writing to arbitrary destination vector registers and coalescing operations are based on cross-lane coalescing, existing methods require such dependency checking modules.
Embodiments described herein may provide for simplified hardware support due to coalescing that is based on a stream of instructions that accumulate to the same output register. As may be appreciated by a person skilled in the art, accumulating to the same output register may obviate the need for the hardware dependency checking modules that are needed for cross-lane coalescing.
Existing works are further limited in that they do not leverage instruction set architecture (ISA) support for simpler implementation. Embodiments described herein may leverage ISA support to enable a same-lane, stream-based approach to coalescing candidate instructions. Embodiments may provide for enhanced ISA extensions that mark the beginning and end of a stream of target instructions (or use writes to control registers) so that the instructions are marked as "eligible for coalescing without further dependency checks", as they will all be accumulating to the same output.
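By way of non-limiting illustration, the stream marking described above may be sketched in Python as follows. The marker names `stream.start` and `stream.end`, the tuple encoding of instructions as (operation, destination, source 1, source 2), and the function names are assumptions made for this sketch only; they are not mnemonics of any existing ISA.

```python
STREAM_START = "stream.start"  # hypothetical start-stream indicator
STREAM_END = "stream.end"      # hypothetical end-stream indicator


def split_streams(program):
    """Return the groups of instructions enclosed by start/end markers."""
    streams, current = [], None
    for inst in program:
        if inst == STREAM_START:
            current = []
        elif inst == STREAM_END:
            if current is not None:
                streams.append(current)
            current = None
        elif current is not None:
            current.append(inst)  # instruction inside a marked stream
        # instructions outside any marked stream are not coalescing candidates
    return streams


def accumulates_to_one_register(stream):
    """True when every (op, dst, src1, src2) instruction writes the same dst."""
    return len({inst[1] for inst in stream}) == 1


program = [
    STREAM_START,
    ("vdot", "vReg15", "vReg1", "vReg2"),
    ("vdot", "vReg15", "vReg3", "vReg4"),
    STREAM_END,
    ("vadd", "vReg7", "vReg8", "vReg9"),
    STREAM_START,
    ("vdot", "vReg14", "vReg5", "vReg6"),
    STREAM_END,
]
streams = split_streams(program)
```

In this sketch, only instructions between the two markers are treated as candidates, and the check that they share one destination register stands in for the assumption that removes the need for further dependency checks.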
Existing works are further limited in terms of coverage and applicability since such works are limited to vector fused multiply-add instructions. As such, use cases that may be based on existing techniques are limited.
Embodiments described herein may be applied to a wider set of instructions, thereby broadening the target instructions to vector dot, add, and so on. Embodiments described herein may extend to one or more operations and primitives that perform a stream of computations and reduction on sparse operands.
Data that are being processed by many high-performance computing (HPC) and artificial intelligence (AI) applications are sparse, which means many data elements have a zero (0) value during execution time. For example, high performance conjugate gradient (HPCG) input matrices can be 99.99% sparse. These matrices may be very large, with dimensions of 1 million x 1 million. Also, for machine learning (ML) applications such as computer vision, natural language processing, and speech recognition, high sparsity levels have been identified by several previous research works. Sparsity in deep neural network (DNN) model parameters can reach up to 98% due to advancements in model pruning techniques. In addition, sparsity of runtime activation values can be around 60% and even higher with techniques like "dropout" being in use.
Such high sparsity ratios bring a hefty challenge to a central processing unit (CPU) running such applications. Processing a large number of operations (mostly multiplications, additions, and multiply-accumulate (MAC) operations) with zero operands wastes execution time. Since these operations (i.e., having zero operands) will not affect the output, they can be safely skipped altogether. Accordingly, embodiments may provide for a sparsity-aware vector processing unit (sVPU) for general purpose CPUs that may address the challenge posed by high sparsity ratios.
In some embodiments, an sVPU may skip processing ineffectual computational operations involving zero operands. In some embodiments, for a vector instruction having one or more vector operand elements as zeros, an sVPU may fill the one or more lanes holding the zero operands with effectual values from subsequent instructions. Effectively, an sVPU may coalesce multiple vector instructions with sparse vector operands into a single denser vector instruction with fewer (or no) zero values in its operands.
Embodiments described herein may apply to any operations or compute primitives that perform a stream of computations and reduction on sparse operands. Such operations or primitives may include multiply-accumulate (MAC), sparse matrix-matrix (SpMM) multiplication, sparse matrix-vector (SpMV) multiplication, and embedding operators in recommendation systems (e.g., Sparse Length Sum).
In some embodiments, an sVPU may apply to one or more sets of applications including machine learning applications (e.g., convolution, multilayer perceptron (MLP), and recommendation systems) and HPC.
In some embodiments, vector instructions that are ready to be executed may be allocated to reservation stations (RS) waiting for execution. An sVPU may operate on top of an existing vector unit, as follows. In some embodiments, an sVPU may search through the operands of instructions pending in reservation stations (RS). An sVPU may further perform lane coalescing operations. Lane coalescing operations may comprise the sVPU finding and dynamically scheduling effectual lanes from subsequent instructions to vacant ineffectual VPU lanes in the current instruction. As a result, fewer instructions may be executed. As may be appreciated by a person skilled in the art, coalescing (e.g., lane coalescing) may lead to skipping compute cycles and executing instructions based on already scheduled effectual lanes, thereby leading to speed-ups.
FIG. 1 illustrates vector lane coalescing, according to an embodiment of the present disclosure. A VPU (e.g., sVPU 100) may comprise one or more lane processing units 105 (e.g., lane 106 processing unit 102 and lane 108 processing unit 104), corresponding to the one or more lanes (e.g., lanes 106 and 108) of a set or a stream of vector instructions (e.g., inst 110, inst 120, and inst 130). Each of the one or more lane processing units 105 may process vector operands in the corresponding instruction lane of the stream of instructions. For example, lane 106 processing unit 102 may process vector operands in the corresponding lane (e.g., 106) of the stream of vector instructions, and lane 108 processing unit 104 may process vector operands in the corresponding lane (e.g., 108) of the stream of vector instructions. Accordingly, in an embodiment, sVPU 100 may process vector instructions (e.g., inst 110, inst 120, and inst 130) where operands are vector registers and the corresponding lanes, or vector elements, across the input vector operands are processed through the same processing lane as illustrated. In an embodiment, the operands that the instructions operate on may be vector registers of, for example, N lanes.
The stream of vector instructions (e.g., inst 110, inst 120, inst 130) may reside in reservation stations ready to be executed. Inst 110 may comprise operation A0 X B0 in lane 106 and C0 X D0 in lane 108. Inst 120 may comprise operation A1 X B1 in lane 106 and C1 X D1 in lane 108. Inst 130 may comprise operation A2 X B2 in lane 106 and C2 X D2 in lane 108. One or more vector operands in the stream of vector instructions may have a value of zero. For example, as illustrated, inst 110 may have vector operand A0 in lane 106 as 0, inst 120 may have vector operands C1 and D1 in lane 108 as 0, and inst 130 may have vector operand B2 in lane 106 and vector operand C2 in lane 108 as 0. A vector element with the value zero may indicate that the corresponding lane operation is not effectual (ineffectual) and does not affect the final output.
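As a minimal, non-limiting sketch of how effectual and ineffectual lanes may be distinguished, the following Python fragment computes one flag per lane that is true only when both operands of that lane are non-zero. The function name `effectual_mask` and the non-zero operand values are assumptions chosen for illustration; only the positions of the zeros follow the example of FIG. 1.

```python
def effectual_mask(op_a, op_b):
    """One flag per lane: True only when both lane operands are non-zero."""
    return [a != 0 and b != 0 for a, b in zip(op_a, op_b)]


# Lane order in each operand list: [lane 106, lane 108].
inst_110 = ([0, 3], [2, 4])  # A0 = 0, so lane 106 is ineffectual
inst_120 = ([5, 0], [6, 0])  # C1 = D1 = 0, so lane 108 is ineffectual
inst_130 = ([7, 0], [0, 8])  # B2 = 0 and C2 = 0, so both lanes are ineffectual

masks = [effectual_mask(a, b) for a, b in (inst_110, inst_120, inst_130)]
```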
In an embodiment, sVPU 100 may determine one or more ineffectual lane operations in a stream of vector instructions. sVPU 100 may further determine one or more subsequent effectual lane operations corresponding to the one or more ineffectual lane operations in the stream of vector instructions. sVPU 100 may coalesce the one or more subsequent effectual lane operations with the corresponding ineffectual lane operations of the one or more ineffectual lane operations. sVPU 100 may then execute the coalesced instructions.
In an embodiment, sVPU 100 may replace an ineffectual lane operation with an effectual operation from a subsequent instruction in the same lane. For example, referring to lane 106 of inst 110, vector operation A0 X B0 is ineffectual since the value of A0 is zero. However, lane 106 of inst 120 may have an effectual operation since no vector operand (e.g., A1 or B1) has a zero value. Accordingly, in an embodiment, sVPU 100 may coalesce lane 106 of inst 120 with lane 106 of inst 110, thereby replacing an ineffectual operation (e.g., vector operation A0 X B0 in lane 106 of inst 110) with a subsequent effectual operation (e.g., vector operation A1 X B1 in lane 106 of inst 120).
Therefore, sVPU 100 may fill the "bubble" in lane 106 of inst 110 with a subsequent vector operation in the same lane 106, which in the embodiment of FIG. 1 happens to be lane 106 of inst 120.
In some embodiments, for practical implementations, sVPU 100 may limit its search scope for determining a subsequent effectual lane to a combination window (CW) 140 of some size N instructions, where N may be a design time parameter. A size N combination window may indicate that sVPU 100 may look into the operands of up to N ready instructions residing in the reservation stations. In the embodiment of FIG. 1, a size 3 combination window 140 (e.g., 3 instructions: 110, 120 and 130) is illustrated, which includes the current instruction (e.g., inst 110).
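The same-lane coalescing and the combination window described with reference to FIG. 1 may be sketched in software as follows. This is a simplified behavioural model under stated assumptions, not a description of the hardware: each instruction is modelled as a list of per-lane operand pairs, the window of size N covers the current instruction and the N-1 instructions after it, and the names `coalesce` and `accumulate` as well as the non-zero operand values are hypothetical.

```python
def coalesce(stream, window):
    """Same-lane coalescing over a stream of instructions.

    stream: list of instructions, each a list of per-lane (a, b) operand pairs.
    Returns the issued instructions; a lane left vacant holds None.
    """
    # Keep only effectual lane operations; None marks a vacant lane.
    pending = [[(a, b) if a != 0 and b != 0 else None for a, b in inst]
               for inst in stream]
    issued = []
    for head, current in enumerate(pending):
        last = min(head + window, len(pending))  # window includes `head`
        for lane in range(len(current)):
            if current[lane] is not None:
                continue
            # Search the SAME lane of later instructions inside the window.
            for j in range(head + 1, last):
                if pending[j][lane] is not None:
                    current[lane] = pending[j][lane]
                    pending[j][lane] = None
                    break
        if any(op is not None for op in current):
            issued.append(current)  # instructions with no work are skipped
    return issued


def accumulate(instructions, lanes):
    """Per-lane sum of products, skipping vacant lanes."""
    acc = [0] * lanes
    for inst in instructions:
        for lane, op in enumerate(inst):
            if op is not None:
                acc[lane] += op[0] * op[1]
    return acc


# Lane order: [lane 106, lane 108]; zero positions follow FIG. 1.
stream = [
    [(0, 2), (3, 4)],  # inst 110: A0 = 0
    [(5, 6), (0, 0)],  # inst 120: C1 = D1 = 0
    [(7, 0), (0, 8)],  # inst 130: B2 = 0, C2 = 0
]
issued = coalesce(stream, window=3)
```

In this model, the three instructions collapse into one issued instruction whose lane 106 carries A1 X B1 and whose lane 108 carries C0 X D0, while the per-lane accumulated results are unchanged because only zero products were dropped.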
Embodiments described in reference to FIG. 1, including operations performed via sVPU 100 (e.g., searching, coalescing and executing mechanisms as described herein) may be applied to one or more vector instructions including vector fused multiply-add, vector dot operations, and vector reduction instruction.
FIG. 2 illustrates an example of a vector dot (vdot) instruction, according to an embodiment of the present disclosure. The vdot instruction of FIG. 2 may be similar to that implemented in RISC-V Divided Element Extension (EDIV). Similar vector dot instructions may also be available in other architectures with extensions such as ARM SVE and Intel x86 AVX.
The example vector dot instruction 200 may take two input vectors (e.g., vOp 210 and vOp 220), each of length 16 elements (a 16 x 32b vector register for each operand). The dot product between each pair of matching 4-element sub-vectors (indicated by matching hash pattern) in the two operands (e.g., vOp 210 and vOp 220) may be performed and the result accumulated to the corresponding accumulation register (as indicated by matching hash pattern) in the accumulator operand vAcc 230.
For example, the dot product between the sub-vector of 4 elements 212 of vOp 210 and the sub-vector of 4 elements 222 of vOp 220 may be performed and accumulated to the corresponding accumulation register 232 in the accumulator operand vAcc 230. As illustrated, vAcc 230 may comprise a 4 x 128b vector register for accumulation, of which only 64b may be used (indicated by hash pattern).
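The arithmetic of the example vdot instruction may be sketched in Python as follows. This is a functional model only, assuming plain integers in place of fixed-width register elements; the function name `vdot` and its `group` parameter are illustrative assumptions.

```python
def vdot(v_acc, v_op1, v_op2, group=4):
    """Accumulate one dot product per sub-vector of `group` elements.

    v_op1, v_op2: operands of len(v_acc) * group elements (16 in FIG. 2).
    v_acc: one accumulator per sub-vector (4 in FIG. 2).
    """
    assert len(v_op1) == len(v_op2) == group * len(v_acc)
    result = []
    for g, acc in enumerate(v_acc):
        lo, hi = g * group, (g + 1) * group
        result.append(acc + sum(x * y for x, y in zip(v_op1[lo:hi], v_op2[lo:hi])))
    return result


v_acc = vdot([0, 0, 0, 0], list(range(16)), [1] * 16)
```

Calling `vdot` again with the returned accumulator models a stream of instructions that accumulate to the same output register.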
FIG. 3A and 3B illustrate a matrix-vector multiplication, according to an embodiment of the present disclosure. The matrix-vector multiplication algorithm (e.g., multiplication of matrix A 302 with vector B 304) illustrated in FIG. 3A and 3B may be implemented using the vector dot instruction as described, for example, in reference to FIG. 2, but without losing the generality of other implementations, for example, vector multiply-add instructions.
Referring to FIG. 3A, the rows of the input matrix A 302 may be grouped such that each group may comprise as many rows as the number of sub-groups in the vdot instruction, e.g., vdot instruction 200. In the example vdot instruction 200, the number of subgroups is 4, so, each 4 rows of the matrix A may be grouped to be processed simultaneously.
Accordingly, referring to FIG. 3A, in an embodiment, each vdot instruction may process a block of, for example, 4x4 elements, shown as b1 311, b2 312, ..., b7 317 against the corresponding sub-vector of B 304 shown as v1 321, v2 322, …, v7 327. As may be appreciated by a person skilled in the art, the illustrated 4x4 block in embodiments described herein is for illustration purposes only, and thus any block dimensions may be used according to the embodiments of the present disclosure. Similarly, the corresponding sub-vector, for illustration purposes, may be 4x1.
Thus, processing each group of 4 rows against the input vector B may be implemented as a stream of vdot instructions, referring to FIG. 3B, wherein each instruction may take a block (e.g., b1 311, b2 312, …, b7 317) as its first vector operand along with a sub-vector of B (e.g., v1 321, v2 322, …, v7 327) broadcast to fill the second vector operand. Each instruction, e.g., vdot 330 involving b1 and v1, may be referred to as one timestep. Accordingly, the multiplication of matrix A 302 with vector B 304 may involve seven timesteps, one for each instruction of the instructions 340 as illustrated. As may be appreciated by a person skilled in the art, the registers indicated in the instructions 340 may refer to the corresponding block of matrix A 302 and sub-vector of B 304. For example, referring to the first instruction, "vdot vReg15, vReg1, vReg2", vReg1 may refer to the register (e.g., vector register 1 (vReg1)) that comprises block 1, and vReg2 may refer to the register (e.g., vReg2) that comprises sub-vector v1 (repeated 4 times to correspond with block 1). The results of all the instructions belonging to the same stream (which in this embodiment may be defined by the 4-row group) may accumulate to the same output vector register, e.g., vReg15, as illustrated.
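The mapping of one 4-row group onto a stream of vdot instructions may be sketched as follows. The sketch assumes a row-major flattening of each 4x4 block into the first operand and a four-fold repetition of the sub-vector of B in the second operand; the function names `vdot` and `matvec_row_group` and the small 4x8 example matrix are assumptions for illustration.

```python
def vdot(v_acc, v_op1, v_op2, group=4):
    """Accumulate one dot product per sub-vector of `group` elements."""
    return [acc + sum(x * y for x, y in zip(v_op1[g * group:(g + 1) * group],
                                            v_op2[g * group:(g + 1) * group]))
            for g, acc in enumerate(v_acc)]


def matvec_row_group(rows, b, block=4):
    """Multiply `block` rows of A by vector b, one vdot (timestep) per block."""
    v_acc = [0] * block  # plays the role of the shared output register
    timesteps = 0
    for start in range(0, len(b), block):
        # First operand: the block flattened row by row.
        blk = [x for row in rows for x in row[start:start + block]]
        # Second operand: the sub-vector of b repeated once per block row.
        sub = b[start:start + block] * block
        v_acc = vdot(v_acc, blk, sub, block)
        timesteps += 1
    return v_acc, timesteps


rows = [[1] * 8, [2] * 8, [0] * 8, list(range(8))]  # 4 x 8, i.e. two blocks
b = list(range(8))
result, timesteps = matvec_row_group(rows, b)
```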
FIG. 4A and 4B illustrate a matrix-vector multiplication applying block compressed sparse row (BCSR) optimization, according to an embodiment of the present disclosure. Referring to FIG. 4A, input matrix A 402 may be a sparse matrix wherein the 4x4 blocks b2 412, b5 415 and b7 417 may be all zeros (i.e., having zero values for all elements in the block). The all-zero blocks, b2 412, b5 415 and b7 417, are illustrated as empty (no hash patterns). For a sparse input matrix, e.g., matrix A 402, a typical algorithmic optimization, such as BCSR, may be used to eliminate those 4x4 blocks that are all zeros. Accordingly, referring to FIG. 4B, the corresponding vdot instructions (for the all-zero blocks) may be avoided (illustrated as crossed out) altogether. Thus, processing the 4-row group (which may refer to the part of the multiplication of matrix A 402 with vector B 404 that includes only a group of 4 rows) may only involve the subset of instructions corresponding to blocks b1 411, b3 413, b4 414, and b6 416 as illustrated (in FIG. 4B). Accordingly, the seven-timestep process may be reduced to a four-timestep process 460 by avoiding the all-zero blocks. As may be appreciated by a person skilled in the art, the BCSR optimization may avoid only the all-zero blocks (all elements of the block having zero values).
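The block-level elimination described above may be sketched as follows for a single 4-row group. The sketch is a simplification: a full BCSR representation also stores row pointers and block column indices for the whole matrix, whereas the hypothetical function `nonzero_blocks` below only keeps the blocks of one row group that contain at least one non-zero element. The matrix contents are made up so that the second, fifth and seventh blocks are all zeros, as in FIG. 4A.

```python
def nonzero_blocks(rows, block=4):
    """Return (block_index, block) pairs for blocks that are not all zeros."""
    kept = []
    for start in range(0, len(rows[0]), block):
        blk = [row[start:start + block] for row in rows]
        if any(x != 0 for r in blk for x in r):
            kept.append((start // block, blk))
    return kept


# One 4-row group with seven 4x4 blocks; blocks 1, 4 and 6 (0-based) are zero.
zero_blocks = {1, 4, 6}
rows = [[0 if c // 4 in zero_blocks else r + c + 1 for c in range(28)]
        for r in range(4)]

kept = nonzero_blocks(rows)
kept_indices = [index for index, _ in kept]
```

With one vdot instruction per kept block, the seven timesteps of the dense case reduce to four in this example.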
Embodiments described herein may provide for extracting sparsity at block level and within blocks. sVPU 100 may extract sparsity at block level, if not implemented on the algorithmic level (e.g., using a BCSR representation of the matrix), as well as fine-grain sparsity within blocks as described herein. For the former (i.e., at block level), sVPU 100 may detect instructions with all-zero input operands and eliminate them altogether from the reservation station. For the latter (i.e., fine-grain sparsity within blocks), remaining instructions corresponding to blocks with fine-grain sparsity (i.e., having zero value elements but also non-zero elements) may be coalesced into a smaller number of instructions, which may further enhance processing (e.g., speedup) in addition to the algorithmic BCSR optimization.
As may be appreciated by a person skilled in the art, sparsity may be partially extracted on the algorithmic level using a BCSR representation of the matrix as described herein. As described, BCSR may eliminate blocks that are entirely zeros. Embodiments described herein may provide for extracting sparsity both at the block level (like BCSR) and at finer granularity within blocks.
FIG. 5A and 5B illustrate a matrix-vector multiplication applying lane coalescing, according to an embodiment of the present disclosure. Referring to FIG. 5A, input matrix A 502 may be a sparse matrix wherein the 4x4 blocks b2 512, b5 515 and b7 517 may be all zeros (similar to matrix A 402). The remaining blocks, b1 511, b3 513, b4 514, and b6 516 may comprise zero and non-zero values. For illustrative purposes, only the elements within the 4x4 blocks b1 511 and b3 513 are shown, in which zero values are indicated as empty (no hash pattern) and non-zero values are indicated via hash patterns.
As discussed previously, the all-zero blocks b2 512, b5 515 and b7 517 may be avoided for processing. Accordingly, the vdot instructions may be based on blocks b1 511, b3 513, b4 514, and b6 516. Referring to FIG. 5B, in an embodiment, sVPU 100 may look at or examine the instructions to be executed next. sVPU 100 may determine one or more lanes having ineffectual computations due to one of the two corresponding input values being zero. For example, sVPU 100 may begin processing the vdot instructions from block b1 511, which is illustrated as one row. sVPU 100 may determine one or more lanes having ineffectual computations, e.g., lanes 530, 532, 534, 536, 538 and 540, due to zero values in these lanes (zero values indicated as empty boxes with no hash pattern).
For each determined lane having an ineffectual computation, sVPU 100 may search, in future or subsequent instructions, for an effectual computation (both operand values are non-zero) corresponding to the same lane (e.g., lane 530). The search may be based on a combination window of size N. In FIG. 5B, the combination window size N may be, for example, 4 as illustrated.
In an embodiment, for lane 530, sVPU 100 may search in future instructions, e.g., the vdot instruction based on b3 513 and v3 523, and determine an effectual computation for the corresponding lane 530. Upon determining the effectual computation, sVPU 100 may replace 550 the ineffectual computation in the current instruction with the determined effectual computation in the same lane 530, brought forward from a future instruction within the combination window scope as illustrated.
This replacement mechanism may be referred to as coalescing and is indicated via arrows pointing from an effectual lane in a future instruction to the corresponding ineffectual lane in the current instruction.
As may be appreciated by a person skilled in the art, a computation may be ineffectual due to a zero value of either operand (i.e., the operand from matrix A or from vector B). As such, while embodiments described herein may refer to an ineffectual computation based on an operand of matrix A having a zero value, a person skilled in the art may appreciate that a computation may be determined to be ineffectual due to a zero value of the vector B operand despite the matrix A operand being non-zero.
sVPU 100 may take the same approach, as taken with respect to the ineffectual lane 530 of the current instruction, to the ineffectual lanes 532, 534, 536, 538, 540 of the current instruction. As illustrated, ineffectual lanes 532, 534 and 540 corresponding to the current instruction may be replaced with effectual lanes 532, 534, and 540 corresponding to the subsequent instruction (based on block b3 513). For ineffectual lanes 536 and 538 in the current instruction, sVPU 100 may look into further subsequent instructions (based on combination window size N) to determine effectual lanes.
As may be appreciated by a person skilled in the art, combination window may refer to the scope of coalescing. Coalescing may be applied to two or more instructions that belong to the same row of blocks and accumulate to the same output vector register (e.g., vReg 15) .
Accordingly, in addition to the BCSR optimization (which reduced instructions by avoiding all-zero blocks 460 in FIG. 4B) , embodiments may further enhance processing via coalescing to further reduce the instructions to be processed (i.e., avoiding zeros within BCSR blocks) , for example, a coalesced stream 560) as described herein.
Accordingly, the stream of instructions corresponding to the row of blocks may be coalesced into a smaller number of mixed instructions 560 where effectual lanes from different instructions may be packed and processed together. As such, in an embodiment, a matrix-vector multiplication which may require seven timesteps (as discussed in reference to FIG. 3A and 3B) may be reduced to four timesteps (as discussed in reference to FIG. 4A and 4B) , and may further be reduced to two timesteps 560 as illustrated.
As discussed, a combination window may determine the scope of coalescing wherein the window size may indicate the depth or how far in future instructions the sVPU may search for effectual lanes. The window size may also determine the upper limit on the speedup that sVPU may achieve. For example, for a combination window size of N, N instructions may be coalesced into 1, which may lead to a speed up of N times. In an embodiment, the combination window may be a moving window, i.e., as an instruction is executed, the combination window may slide or move and may be rebased or repositioned to start at the next instruction to execute.
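The effect of the combination window on the number of issued instructions can be illustrated with a small Python model. This is a sketch under stated assumptions, not the disclosed hardware: the function name `coalesced_issue_count` and the boolean-mask representation are hypothetical, the window of size N is assumed to count the current instruction plus up to N-1 subsequent ones, and an instruction whose effectual lanes have all been consumed is assumed to be skipped. The example masks are invented for illustration and are not taken from FIG. 5B.

```python
from typing import List


def coalesced_issue_count(masks: List[List[bool]], window: int) -> int:
    """Count the mixed instructions issued for one stream.

    masks[i][x] is True when lane x of instruction i is effectual.
    window is the combination window size N (assumed to include the
    current instruction, so up to N-1 later instructions are searched).
    """
    pending = [list(m) for m in masks]  # copy; bits are cleared as lanes execute
    issued = 0
    for head in range(len(pending)):
        if not any(pending[head]):
            continue  # nothing left to execute for this instruction
        last = min(head + window, len(pending))
        for lane in range(len(pending[head])):
            if pending[head][lane]:
                pending[head][lane] = False  # own effectual lane executes now
                continue
            # Ineffectual (or already consumed) lane: same-lane search,
            # earliest subsequent instruction first.
            for later in range(head + 1, last):
                if pending[later][lane]:
                    pending[later][lane] = False
                    break
        issued += 1  # the window slides to the next instruction
    return issued


# Hypothetical 4-lane stream of four instructions.
example = [
    [True, False, True, False],
    [False, True, False, True],
    [True, True, False, False],
    [False, False, True, True],
]
print(coalesced_issue_count(example, window=1))  # no coalescing: 4
print(coalesced_issue_count(example, window=4))  # coalesced: 2
```

With a window of 1 the model degenerates to issuing every instruction; with larger windows the count can drop by at most a factor of N, which matches the upper bound on speedup stated above.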
In an embodiment, sVPU 100 may coalesce instructions that belong to the same stream, i.e., the same row of blocks, and that accumulate to the same output vector register, as further described in reference to FIG. 7 and 8. Coalescing instructions that belong to the same stream may be a key factor in reducing the hardware overhead and complexity of the lane coalescing mechanism. Limiting coalescing to the same stream may avoid the need for any data-dependency checking hardware, since all instructions considered for coalescing may accumulate to the same accumulator vector register.
In some embodiments, sVPU 100 may coalesce instructions based on same-lane coalescing, as described, for example, in reference to FIG. 5B. Same-lane coalescing may replace an ineffectual lane x, for example, with an effectual lane x from a future instruction. Same-lane coalescing may avoid the complex hardware needed to support cross-lane coalescing.
As may be appreciated by a person skilled in the art, embodiments described herein may be applied for compute or storage purposes. In an embodiment, once the data to be computed is available at their corresponding registers of a processor, such data may be examined and combined (e.g., coalesced) according to embodiments described herein to produce a reduced number of instructions for compute purposes.
In an embodiment, sVPU 100 was tested with the high performance conjugate gradient (HPCG) workload and the SuiteSparse benchmark (1000 matrices were randomly selected from the benchmark suite). Different combination window sizes were tried and the results observed. Based on the tests, the following observations may be drawn.
As the combination window size increases, the potential performance gain may also increase. On the other hand, the increased window size may also lead to increased hardware complexity. Accordingly, at a certain combination window size, the performance gain may peak.
For example, with respect to SuiteSparse, the performance gain may peak at combination window size of 16, where the speedup may saturate at 2.2x on average over the 1000 matrices. In the case of HPCG workload, the performance gain may peak at combination window size of 7, where the speedup may saturate at around 3x.
FIG. 6A illustrates sVPU performance gain over BCSR based on the SuiteSparse benchmark, according to an embodiment of the present disclosure. The horizontal axis of FIG. 6A refers to different matrices of the SuiteSparse benchmark (only 29 matrices are shown for illustrative purposes). The vertical axis of FIG. 6A refers to performance gain (i.e., speedup) of the sVPU over BCSR. As mentioned, a random sample of 1000 matrices from the SuiteSparse benchmark was tested. Across the sample, speedups of up to 9x (not shown) and an average speedup of 2.2x (illustrated via line 602) were observed. The speedups of sVPU over BCSR for a portion of the random sample are illustrated.
FIG. 6B illustrates sVPU performance gain as a function of combination window size based on the SuiteSparse benchmark, according to an embodiment of the present disclosure. The horizontal axis of FIG. 6B refers to different combination window sizes. The vertical axis of FIG. 6B refers to performance gain (i.e., speedup) over a combination window size of 2. Referring to FIG. 6B, the performance gain saturates as combination window size increases. The performance gain peaks at combination window size 16 (illustrated via hash lines), with approximately 60% performance gain over combination window size 2, as illustrated via line 604.
FIG. 6C illustrates sVPU performance gain as a function of combination window size based on the high performance conjugate gradient (HPCG) workload, according to an embodiment of the present disclosure. The horizontal axis of FIG. 6C refers to different combination window sizes. The vertical axis of FIG. 6C refers to performance gain (i.e., speedup) over BCSR. As illustrated, the performance gain saturates as combination window size increases. The performance gain peaks at combination window size 7 (illustrated via hash lines), reaching a performance gain of just below 3 times that of BCSR (illustrated via line 606).
FIG. 6D illustrates the average coalescing distance based on the HPCG workload, according to an embodiment of the present disclosure. The horizontal axis of FIG. 6D refers to different combination window sizes. The vertical axis of FIG. 6D refers to the average coalescing distance associated with the corresponding combination window size. The average distance illustrated indicates the average distance between the current instruction and the farthest coalesced lane (i.e., in a future instruction). The average coalescing distance based on applying sVPU 100 to the HPCG workload was determined to be 2.25 instructions (illustrated via line 608) at a combination window size of 6 (illustrated via hash lines), indicating that an instruction, on average, became sufficiently dense by coalescing lanes from the subsequent 3 (rounding up 2.25) instructions.
The performance gain (i.e., speedups) achieved may be significant considering the small hardware complexity associated with sVPU, coalescing conditions and combination window size. Coalescing conditions may be based on same-lane coalescing within a stream of instructions accumulating to the same accumulation registers. The combination window size may be based on the peak performance gain as described herein.
Embodiments will now describe indicators for a stream of instructions that accumulate to the same accumulation registers. Embodiments may provide for start and end indicators for a stream of instructions.
In an embodiment, sVPU 100 may use a stream-guards technique to mark or indicate the beginning and end of a stream of vector instructions that accumulate to the same output vector register. Such a stream of instructions may be safely coalesced without the need for data-dependency checking hardware. Otherwise, if instructions accumulating to different output vector registers are considered for coalescing, special, expensive hardware may be needed to ensure the lane-wise data-independence of the outputs of these candidate instructions.
Accordingly, sVPU 100 may use stream guards surrounding the stream of instructions accumulating to the same output vector register. These stream guards may indicate to the backend sVPU hardware that the instructions within the stream guards are functionally correct to coalesce.
In an embodiment, sVPU 100 may insert stream guards surrounding each stream of vdot instructions where a stream corresponds to a complete row of 4x4 blocks. For example, in reference to FIG. 3 to FIG. 5, sVPU 100 may insert stream guards before the beginning and after the end of matrix A, for example, to indicate that the row of blocks comprising b1 to b7 is one stream of vdot instructions to which coalescing operations, as described herein, may be applied. As such, the stream guards may limit sVPU coalescing to vdot instructions from the same row of blocks and avoid coalescing vdot instructions from the next row of blocks as further described herein.
FIG. 7 and 8 illustrate use of stream guards for indicating different streams of instructions, according to an embodiment of the present disclosure. In an embodiment, a matrix-vector multiplication may comprise an input matrix A which may include two rows of 4x4 blocks 740 and 742 as illustrated. The first row of blocks 740 may correspond to vector instructions that accumulate to a first output vector register, and the second row of 4x4 blocks 742 may correspond to vector instructions that accumulate to a second output vector register.
In an embodiment, referring to FIG. 8, sVPU 100 may use stream guards to indicate the beginning (via, for example, sVPU_stream_start) and the end (via, for example, sVPU_stream_end) of a stream of instructions that accumulate to the same output vector register. Accordingly, the first row of 4x4 blocks 740 may be indicated as a first stream of instructions 750 using stream guards as illustrated. Similarly, the second row of 4x4 blocks 742 may be indicated as a second stream of instructions 752 using stream guards as illustrated. While the same register (e.g., vReg15) is illustrated as the output register for accumulating both streams of instructions 750 and 752, a person skilled in the art may appreciate that, after processing the first stream of instructions 750 and before processing or accumulating the second stream of instructions 752, the outputs in the register may be flushed and stored in a memory, and the register may then be initialized to zero in preparation for the second stream of instructions 752. In another embodiment, a different output vector register may be used for each stream of instructions.
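How stream guards delimit coalescing candidates can be sketched in Python. The guard names `sVPU_stream_start` and `sVPU_stream_end` come from the description above; everything else (the function `split_streams`, the string encoding of instructions, and the choice to ignore instructions outside any guard pair) is a hypothetical modelling choice rather than the disclosed encoding.

```python
from typing import List, Optional

STREAM_START = "sVPU_stream_start"
STREAM_END = "sVPU_stream_end"


def split_streams(program: List[str]) -> List[List[str]]:
    """Group instructions into streams delimited by stream guards.

    Only instructions inside the same guard pair are candidates for
    coalescing with each other. Instructions outside any guard pair are
    ignored here (an assumption of this sketch).
    """
    streams: List[List[str]] = []
    current: Optional[List[str]] = None
    for op in program:
        if op == STREAM_START:
            if current is not None:
                raise ValueError("stream started before the previous one ended")
            current = []
        elif op == STREAM_END:
            if current is None:
                raise ValueError("stream end without a matching start")
            streams.append(current)
            current = None
        elif current is not None:
            current.append(op)
    if current is not None:
        raise ValueError("unterminated stream")
    return streams


# Two rows of blocks, using the non-zero block names of FIG. 7 as labels.
program = [
    STREAM_START, "vdot b1", "vdot b3", "vdot b4", "vdot b6", STREAM_END,
    STREAM_START, "vdot b9", "vdot b10", "vdot b12", "vdot b14", STREAM_END,
]
for stream in split_streams(program):
    print(stream)
```

In this model a lane of `vdot b9` can never be brought forward into `vdot b6`, because the two instructions fall in different streams.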
As may be appreciated by a person skilled in the art, all-zero blocks from the first row of 4x4 blocks 740 and the second row of 4x4 blocks 742 may be avoided, as indicated in the streams of instructions 750 and 752. sVPU 100 may apply coalescing mechanisms as described herein to the remaining blocks (e.g., blocks b1 711, b3 713, b4 714, and b6 716 for the first row of 4x4 blocks 740 corresponding to the first stream of instructions 750, and blocks b9 719, b10 720, b12 722 and b14 724 for the second row of 4x4 blocks 742 corresponding to the second stream of instructions 752), since the remaining blocks may comprise zero value elements.
In an embodiment, stream guards may be implemented either as an extension to the instruction set architecture (ISA) or through introducing a new control/status register (CSR).
A CSR approach to implementing stream guards may have minimal intrusion to the ISA. In an embodiment, sVPU 100 may introduce a new CSR named SVPU_CR. A stream start may be marked by writing “1” to SVPU_CR, and a stream end may be marked by writing “0” to SVPU_CR, thereby enabling (when, for example, “1” is written) and disabling (when, for example, “0” is written) coalescing by the sVPU backend.
In an embodiment, the SVPU_CR may be implemented as a 1-bit register; accordingly, there may be no support for nesting of streams. No support for nesting of streams may mean that sVPU 100 may keep track of a single stream at a time. So, if a new stream (i.e., a second stream) needs to be started, the current one (i.e., the first stream) needs to be terminated (with the stream guard closure) before beginning the new stream. Then, after finishing the new stream (i.e., the second stream), the remainder of the old stream that was not processed due to termination may then be processed as a new, shorter stream (i.e., a third stream). When terminating the first stream, no state needs to be retained, since the remainder of the old stream is handled as a new, shorter stream (i.e., the third stream).
Typically, a CSR may be written using an atomic CSR read-and-write instruction, which is commonly available in ISAs that define CSRs, such as CSRRW in the RISC-V ISA.
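The single-bit, non-nesting behaviour of SVPU_CR can be modelled in a few lines of Python. This is a minimal sketch under assumptions: the class `SvpuCr`, its `csrrw` method (modelled loosely on a read-and-write swap that returns the old value), the trace format, and the decision to raise an error when "1" is written while a stream is already open are all hypothetical and are not specified by the description.

```python
from typing import List, Tuple, Union


class SvpuCr:
    """Hypothetical 1-bit model of the SVPU_CR control/status register."""

    def __init__(self) -> None:
        self.bit = 0  # 0: coalescing disabled, 1: coalescing enabled

    def csrrw(self, value: int) -> int:
        """Write the new value and return the previous one."""
        if value not in (0, 1):
            raise ValueError("SVPU_CR holds a single bit")
        old, self.bit = self.bit, value
        return old


Event = Tuple[str, Union[int, str]]


def streams_from_trace(trace: List[Event]) -> List[List[str]]:
    """Group vector instructions by the SVPU_CR writes that surround them."""
    cr = SvpuCr()
    streams: List[List[str]] = []
    for kind, arg in trace:
        if kind == "csrrw":
            old = cr.csrrw(int(arg))
            if arg == 1:
                if old == 1:
                    # Assumption: software must close a stream before opening another.
                    raise ValueError("nested streams are not supported")
                streams.append([])
        elif cr.bit == 1:
            streams[-1].append(str(arg))
    return streams


# A first stream (a1..a4) is cut short so that a second stream (b1) can run;
# its remainder (a3, a4) is then processed as a third, shorter stream.
trace: List[Event] = [
    ("csrrw", 1), ("vdot", "a1"), ("vdot", "a2"), ("csrrw", 0),
    ("csrrw", 1), ("vdot", "b1"), ("csrrw", 0),
    ("csrrw", 1), ("vdot", "a3"), ("vdot", "a4"), ("csrrw", 0),
]
print(streams_from_trace(trace))
```

Because the register carries no stream identity, the model keeps no state across the first termination, which is consistent with treating the remainder as an independent stream.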
Embodiments described herein introduce the concept of an “instruction stream”. Embodiments described herein may limit coalescing to a stream of instructions accumulating to the same output vector register, as determined according to stream guards. Coalescing based on a stream of instructions, as described herein, may allow for reduced hardware costs by obviating the need for data-dependency checking hardware that is otherwise necessary to resolve dependencies between candidate instructions. By limiting the coalescing candidate instructions to the same stream of instructions, coalescing may be performed without dependency checks.
FIG. 9 illustrates a block diagram of an sVPU u-architecture, according to an embodiment of the present disclosure. In an embodiment, the execution backend 900 may comprise one or more of a reorder buffer (ROB) 902, a lane coalescing unit (LCU) 904, a mask generation unit (MGU) 906, a reservation station (RS) 908, a vector register file 910, and an sVPU 100.
FIG. 10 illustrates a mask generation unit according to an embodiment of the present disclosure. Referring to FIG. 9 and FIG. 10, the MGU 906 may generate an effectual lane mask (ELM) for each vdot instruction that resides in RS 908 and is ready to be executed. In an embodiment, for each lane of the vector inputs, the corresponding bit in the generated mask may indicate whether the input value is effectual or not. As such, the ELM may indicate which lanes have effectual values. The generated ELMs may be kept either in a newly added field “ELM 1002” in the RS table or in a mask physical register file, if available in the CPU architecture.
FIG. 10 illustrates one or more MGU units 906 including comparators, NOR gates, and a new field “ELM 1002” added to the RS table to keep the generated masks. For each lane, each vector operand 1004 and 1006 may be evaluated to determine whether either operand is zero and thus ineffectual. The results (i.e., ineffectual or effectual) may be indicated in the RS table accordingly. The MGU 906 may be instantiated N times, where N is the size of the combination window, so that up to N instructions may be investigated in parallel at a time.
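The per-lane logic of the MGU described above (a zero comparator on each operand followed by a NOR) can be written as a short Python function. The function name `effectual_lane_mask` and the list-of-ints mask encoding are hypothetical; the sketch only mirrors the comparator-plus-NOR structure and makes no claim about the actual circuit.

```python
from typing import List, Sequence


def effectual_lane_mask(a: Sequence[float], b: Sequence[float]) -> List[int]:
    """Return one bit per lane: 1 if the lane is effectual, 0 otherwise.

    A lane is effectual only when both operands are non-zero, i.e. the
    NOR of the two "operand is zero" comparator outputs.
    """
    if len(a) != len(b):
        raise ValueError("operand vectors must have the same number of lanes")
    mask = []
    for x, y in zip(a, b):
        a_is_zero = x == 0  # comparator on the first operand
        b_is_zero = y == 0  # comparator on the second operand
        mask.append(int(not (a_is_zero or b_is_zero)))  # NOR gate
    return mask


print(effectual_lane_mask([1, 0, 2, 3], [4, 5, 0, 6]))  # [1, 0, 0, 1]
```

Lanes 1 and 2 are marked ineffectual here for different reasons: the first because the first operand is zero, the second because the second operand is zero.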
Referring to FIG. 9, the LCU 904 may use the masks (e.g., ELM 1002) generated by the MGUs 906 as input and decide which lanes to coalesce accordingly. In an embodiment, the coalescing mechanism (e.g., LCU 904) may employ the method described in FIG. 11 to perform the coalescing operations as described herein.
FIG. 11 illustrates a coalescing method, according to an embodiment of the present disclosure. At 1102, one or more ineffectual lanes in the current vector instruction are identified or determined according to embodiments described herein.
At 1104, for each determined ineffectual lane, LCU 904 may inspect the corresponding lane position in each ready instruction within the combination window in program order.
At 1106, the corresponding effectual lane in the earliest ready instruction is coalesced to fill the bubble of the ineffectual lane and the corresponding input operands are brought forward and packed into the current instruction vector inputs.
At 1108, when a lane is coalesced from a subsequent instruction, the corresponding bit in the ELM mask may be zeroed, thereby marking or indicating that the effectual lane has been successfully coalesced and will be executed. This ensures the lane will not be considered again while performing future coalescing.
In some embodiments, at 1110, the LCU 904 may keep a record of the original instruction from which a lane is coalesced so that, in case of an interruption, the machine state is maintained by squashing the lanes coalesced from subsequent instructions into the current one.
After packing effectual lanes from current and subsequent instructions, the mixed instruction may be issued for execution by the vector processing engine. Any subsequent instruction for which all the effectual lanes have been coalesced and executed is removed from the reservation station and marked as “done”.
For any lane that the LCU 904 may need to coalesce, the LCU may prioritize an earlier instruction ahead of a later instruction according to the program order, i.e., for an ineffectual lane in the current instruction, an effectual corresponding lane from an earlier instruction (e.g., a first subsequent instruction (I 1)) may be given a higher priority over the corresponding lane of a later instruction (e.g., a second subsequent instruction (I 2)) if I 1 is older than I 2 according to the program order. Prioritizing instructions may help simplify the interrupt-handling mechanisms in case an interruption occurs and the coalesced lanes from subsequent instructions need to be squashed.
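Steps 1102 to 1110 and the program-order priority rule can be summarised as a single coalescing pass over one combination window. The following Python is a sketch under assumptions, not the LCU design: the `VecInstr` class, the `coalesce_current` function, and the use of a window-relative index as the provenance record are hypothetical, and each lane is modelled as an independent pair of scalar operands.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class VecInstr:
    a: List[float]  # first operand, one value per lane
    b: List[float]  # second operand, one value per lane
    elm: List[int] = field(init=False)  # effectual lane mask

    def __post_init__(self) -> None:
        self.elm = [int(x != 0 and y != 0) for x, y in zip(self.a, self.b)]


def coalesce_current(
    window: List[VecInstr],
) -> Tuple[List[float], List[float], List[Optional[int]]]:
    """Pack effectual lanes into window[0], the current instruction.

    window[1:] holds the ready subsequent instructions in program order.
    Returns the packed operands and, per lane, the window index of the
    instruction that supplied it (None if the lane remains a bubble).
    """
    cur = window[0]
    lanes = len(cur.elm)
    packed_a = [cur.a[x] if cur.elm[x] else 0 for x in range(lanes)]
    packed_b = [cur.b[x] if cur.elm[x] else 0 for x in range(lanes)]
    origin: List[Optional[int]] = [0 if cur.elm[x] else None for x in range(lanes)]
    for x in range(lanes):
        if cur.elm[x]:
            continue  # 1102: only ineffectual lanes need filling
        for j in range(1, len(window)):  # 1104: same lane, program order
            if window[j].elm[x]:
                packed_a[x] = window[j].a[x]  # 1106: bring operands forward
                packed_b[x] = window[j].b[x]
                window[j].elm[x] = 0  # 1108: never consider this lane again
                origin[x] = j  # 1110: provenance, usable for squashing
                break  # earliest instruction wins
    return packed_a, packed_b, origin


i0 = VecInstr([1, 0, 3, 0], [1, 1, 1, 1])
i1 = VecInstr([0, 0, 0, 7], [2, 2, 2, 2])
i2 = VecInstr([0, 5, 0, 9], [3, 3, 3, 3])
print(coalesce_current([i0, i1, i2]))
print(i1.elm, i2.elm)
```

In this example lane 3 is filled from `i1` rather than `i2` because `i1` is earlier in program order; afterwards `i1` has no effectual lanes left and could be marked as done, while `i2` still holds one.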
Embodiments described in reference to the coalescing mechanism may be simple and cost effective since such a mechanism does not need expensive hardware support. The coalescing mechanism embodiments described herein may offer a low-cost approach based on one or more of the following. First, the combination window may be limited to a practical size rather than including all reservation station entries when searching for effectual lanes to coalesce. Second, the hardware support for implementation may be based on an MGU (e.g., MGU 906) comprising comparators to detect zeros and NOR gates to take both operands into account for binary operators. Third, the mechanism used in the LCU 904 may be implemented using priority-based selection hardware (to prioritize instructions in program order), which may have optimized and known implementations.
Embodiments described herein may provide for a vector processing unit (e.g., a sparsity-aware vector processing unit (sVPU)) that coalesces vector instructions into fewer instructions by packing only effectual lanes from the original instructions.
In some embodiments, the sVPU 100 may consider only same-lane coalescing and avoid cross-lane coalescing, which requires expensive hardware components in the micro-architecture.
In some embodiments, the sVPU 100 may consider only instructions that accumulate to the same output vector register for coalescing. Limiting the instructions in such a way may simplify the implementation and avoid the otherwise necessary data-dependency checking hardware.
In some embodiments, an instruction stream may be marked using “stream guards” that may be implemented using ISA extension instructions or using a new control/status register (CSR) which may be written to indicate the start and end of the stream.
Embodiments described herein may target a wide set of operations or primitives that perform streams of computations and reductions on sparse operands. Thus, a wide set of CPU instructions may be targeted for coalescing by sVPU 100, including vector dot, vector add, and vector multiply-accumulate.
While embodiments are described in the context of a CPU vector unit, embodiments may be equivalently applicable to other commodity architectures such as graphics processing units (GPUs) and digital signal processors (DSPs). These commodity architectures typically feature vector processing engines, and their typical workloads are expected to have similar levels of sparsity in their input data. As such, embodiments described herein may be applicable to such other commodity architectures.
FIG. 12 is a schematic diagram of an electronic device 1200 that may perform any or all of operations of the methods and features explicitly or implicitly described herein, according to different embodiments of the present disclosure.
As shown, the electronic device 1200 may include a processor 1210, such as a Central Processing Unit (CPU) or specialized processors such as a Graphics Processing Unit (GPU) or other such processor unit, memory 1220, non-transitory mass storage 1230, input-output interface 1240, network interface 1250, and a transceiver 1260, all of which are communicatively coupled via bi-directional bus 1270. According to certain embodiments, any or all of the depicted elements may be utilized, or only a subset of the elements. Further, electronic device 1200 may contain multiple instances of certain elements, such as multiple processors, memories, or transceivers. Also, elements of the hardware device may be directly coupled to other elements without the bi-directional bus. Additionally, or alternatively to a processor and memory, other electronics, such as integrated circuits, may be employed for performing the required logical operations.
The memory 1220 may include any type of non-transitory memory such as static random-access memory (SRAM), dynamic random-access memory (DRAM), synchronous DRAM (SDRAM), read-only memory (ROM), any combination of such, or the like. The mass storage element 1230 may include any type of non-transitory storage device, such as a solid-state drive, hard disk drive, a magnetic disk drive, an optical disk drive, USB drive, or any computer program product configured to store data and machine executable program code. According to certain embodiments, the memory 1220 or mass storage 1230 may have recorded thereon statements and instructions executable by the processor 1210 for performing any of the method operations described above.
Embodiments of the present invention can be implemented using electronics hardware, software, or a combination thereof. In some embodiments, the invention is implemented by one or multiple computer processors executing program instructions stored in memory. In some embodiments, the invention is implemented partially or fully in hardware, for example using one or more field programmable gate arrays (FPGAs) or application specific integrated circuits (ASICs) to rapidly perform processing operations.
It will be appreciated that, although specific embodiments of the technology have been described herein for purposes of illustration, various modifications may be made without departing from the scope of the technology. The specification and drawings are, accordingly, to be regarded simply as an illustration of the invention as defined by the appended claims, and are contemplated to cover any and all modifications, variations, combinations or equivalents that fall within the scope of the present invention. In particular, it is within the scope of the technology to provide a computer program product or program element, or a program storage or memory device such as a magnetic or optical wire, tape or disc, or the like, for storing signals readable by a machine, for controlling the operation of a computer according to the method of the technology and/or to structure some or all of its components in accordance with the system of the technology.
Acts associated with the method described herein can be implemented as coded instructions in a computer program product. In other words, the computer program product is a computer-readable medium upon which software code is recorded to execute the method when the computer program product is loaded into memory and executed on the microprocessor of the wireless communication device.
Further, each operation of the method may be executed on any computing device, such as a personal computer, server, PDA, or the like and pursuant to one or more, or a part of one or more, program elements, modules or objects generated from any programming language, such as C++, Java, or the like. In addition, each operation, or a file or object or the like implementing each said operation, may be executed by special purpose hardware or a circuit module designed for that purpose.
Through the descriptions of the preceding embodiments, the present invention may be implemented by using hardware only or by using software and a necessary universal hardware platform. Based on such understandings, the technical solution of the present invention may be embodied in the form of a software product. The software product may be stored in a non-volatile or non-transitory storage medium, which can be a compact disc read-only memory (CD-ROM), USB flash disk, or a removable hard disk. The software product includes a number of instructions that enable a computer device (personal computer, server, or network device) to execute the methods provided in the embodiments of the present invention. For example, such an execution may correspond to a simulation of the logical operations as described herein. The software product may additionally or alternatively include a number of instructions that enable a computer device to execute operations for configuring or programming a digital logic apparatus in accordance with embodiments of the present invention.
Although the present invention has been described with reference to specific features and embodiments thereof, it is evident that various modifications and combinations can be made thereto without departing from the invention. The specification and drawings are, accordingly, to be regarded simply as an illustration of the invention as defined by the appended claims, and are contemplated to cover any and all modifications, variations, combinations or equivalents that fall within the scope of the present invention.

Claims (21)

  1. A method comprising:
    receiving a stream of vector instructions for processing, the stream of vector instructions comprising a plurality of vector instructions;
    determining an ineffectual computation corresponding to a first lane of a first vector instruction of the stream of vector instructions;
    determining an effectual computation corresponding to a second lane of a second vector instruction of the stream of vector instructions, wherein:
    the second vector instruction is subsequent to the first vector instruction according to a processing order of the stream of vector instructions; and
    the second lane of the second vector instruction and the first lane of the first vector instruction correspond to a same lane;
    coalescing the second lane of the second vector instruction with the first lane of the first vector instruction; and
    processing the stream of vector instructions.
  2. The method of claim 1, wherein the determining an effectual computation is based on a combination window size indicating a number of vector instructions of the stream of vector instructions subsequent to the first vector instruction.
  3. The method of claim 1 or 2, wherein the stream of vector instructions accumulates to one output register.
  4. The method of any one of claims 1 to 3, wherein
    the coalescing comprises replacing the ineffectual computation with the effectual computation; and
    the processing the stream of vector instructions comprises: processing the first vector instruction comprising the effectual computation.
  5. The method of any one of claims 1 to 4, wherein the ineffectual computation comprises a vector operand having a zero value.
  6. The method of any one of claims 1 to 5, wherein the stream of vector instructions is indicated by a start-stream indicator and an end-stream indicator.
  7. The method of any one of claims 1 to 6, further comprising:
    determining a second effectual computation corresponding to a third lane of a third vector instruction of the stream of vector instructions, wherein:
    the third vector instruction is subsequent to the second vector instruction according to the processing order of the stream of vector instructions; and
    the third lane of the third vector instruction and the second lane of the second vector instruction correspond to a same lane;
    coalescing the third lane of the third vector instruction with the second lane of the second vector instruction.
  8. The method of claim 7, wherein the processing the stream of vector instructions comprises: processing the second vector instruction comprising the second effectual computation.
  9. The method of any one of claims 1 to 8, further comprising:
    receiving a second stream of vector instructions for processing, wherein the second stream of vector instructions:
    accumulates to a second output register; and
    is indicated by a second start-stream indicator and a second end-stream indicator; and
    processing the second stream of vector instructions.
  10. The method of claim 9, wherein the processing the second stream of vector instructions is performed after processing the stream of vector instructions.
  11. An apparatus comprising:
    one or more mask generation units;
    one or more lane processing units;
    one or more lane coalescing units;
    at least one processor; and
    at least one machine-readable medium storing executable instructions which when executed by the at least one processor configure the apparatus for:
    receiving a stream of vector instructions for processing, the stream of vector instructions comprising a plurality of vector instructions;
    determining, via the one or more mask generation units, an ineffectual computation corresponding to a first lane of a first vector instruction of the stream of vector instructions;
    determining, via the one or more mask generation units, an effectual computation corresponding to a second lane of a second vector instruction of the stream of vector instructions, wherein:
    the second vector instruction is subsequent to the first vector instruction according to a processing order of the stream of vector instructions; and
    the second lane of the second vector instruction and the first lane of the first vector instruction correspond to a same lane;
    coalescing, via the one or more lane coalescing units, the second lane of the second vector instruction with the first lane of the first vector instruction; and
    processing, via one or more lane processing units, the stream of vector instructions, wherein each lane processing unit corresponds to a corresponding lane in the stream of vector instructions.
  12. The apparatus of claim 11, wherein the determining an effectual computation is based on a combination window size indicating a number of vector instructions of the stream of vector instructions subsequent to the first vector instruction.
  13. The apparatus of claim 11 or 12, wherein the stream of vector instructions accumulates to one output register.
  14. The apparatus of any one of claims 11 to 13, wherein
    the coalescing comprises replacing the ineffectual computation with the effectual computation; and
    the processing the stream of vector instructions comprises: processing the first vector instruction comprising the effectual computation.
  15. The apparatus of any one of claims 11 to 14, wherein the ineffectual computation comprises a vector operand having a zero value.
  16. The apparatus of any one of claims 11 to 15, wherein the stream of vector instructions is indicated by a start-stream indicator and an end-stream indicator.
  17. The apparatus of any one of claims 11 to 16, wherein the executable instructions, when executed by the at least one processor, further configure the apparatus for:
    determining, via the one or more mask generation units, a second effectual computation corresponding to a third lane of a third vector instruction of the stream of vector instructions, wherein:
    the third vector instruction is subsequent to the second vector instruction according to the processing order of the stream of vector instructions; and
    the third lane of the third vector instruction and the second lane of the second vector instruction correspond to a same lane; and
    coalescing, via the one or more lane coalescing units, the third lane of the third vector instruction with the second lane of the second vector instruction.
  18. The apparatus of claim 17, wherein the processing comprises: processing the second vector instruction comprising the second effectual computation.
  19. The apparatus of any one of claims 11 to 18, wherein the executable instructions, when executed by the at least one processor, further configure the apparatus for:
    receiving a second stream of vector instructions for processing, wherein the second stream of vector instructions:
    accumulates to a second output register; and
    is indicated by a second start-stream indicator and a second end-stream indicator; and
    processing, via the one or more lane processing units, the second stream of vector instructions.
  20. The apparatus of claim 19, wherein the processing, via the one or more lane processing units, the second stream of vector instructions is performed after processing the stream of vector instructions.
  21. A machine-readable medium storing executable instructions which when executed by a processor configure the processor to perform a method according to any one of claims 1-10.
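
The mask generation, lane coalescing, and lane processing steps recited in claims 11 to 18 can be illustrated with a minimal software sketch. It is not the claimed hardware: it assumes each vector instruction is a multiply-accumulate into a single output register (claim 13), treats a lane as ineffectual when either operand is zero (claim 15), and models the combination window of claim 12 as a look-ahead count. The names `effectual_mask`, `coalesce`, and `process`, and the choice to drop instructions left with no effectual lanes, are hypothetical and introduced only for illustration.

```python
from typing import List, Optional, Tuple

# Hypothetical model: one lane holds an (a, b) operand pair, or None once vacated.
Lane = Optional[Tuple[int, int]]
Instr = List[Lane]


def effectual_mask(instr: Instr) -> List[bool]:
    """Mask generation: a lane is effectual only if neither operand is zero."""
    return [slot is not None and slot[0] != 0 and slot[1] != 0 for slot in instr]


def coalesce(stream: List[Instr], window: int) -> List[Instr]:
    """Fill ineffectual lanes with effectual work from the same lane of up to
    `window` subsequent instructions, then drop instructions left empty."""
    work = [list(instr) for instr in stream]          # do not mutate the input
    masks = [effectual_mask(instr) for instr in work]
    for i in range(len(work)):
        for lane in range(len(work[i])):
            if masks[i][lane]:
                continue                              # lane already does useful work
            for j in range(i + 1, min(i + 1 + window, len(work))):
                if masks[j][lane]:
                    # Replace the ineffectual computation with the effectual one.
                    work[i][lane], work[j][lane] = work[j][lane], None
                    masks[i][lane], masks[j][lane] = True, False
                    break
    return [instr for instr, mask in zip(work, masks) if any(mask)]


def process(stream: List[Instr], num_lanes: int) -> List[int]:
    """Lane processing: each lane accumulates a*b into its own accumulator slot."""
    acc = [0] * num_lanes
    for instr in stream:
        for lane, slot in enumerate(instr):
            if slot is not None:
                acc[lane] += slot[0] * slot[1]
    return acc


# Three 4-lane instructions with zero operands scattered across lanes.
example = [
    [(1, 2), (0, 3), (4, 0), (0, 0)],
    [(0, 5), (2, 2), (3, 1), (0, 7)],
    [(6, 1), (0, 0), (0, 9), (5, 5)],
]
packed = coalesce(example, window=2)
print(len(packed), process(packed, 4))  # prints: 2 [8, 4, 3, 25]
```

In this sketch a lane vacated by one coalescing step is itself ineffectual, so it can in turn be filled from a later instruction, which is the chaining described in claim 17. Moving work between instructions is only safe here because every instruction in the stream accumulates to the same output register and the same lane, so the per-lane sum does not depend on which instruction carries each product.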
PCT/CN2021/112508 2021-08-13 2021-08-13 Systems and methods for sparsity-aware vector processing in general purpose cpus WO2023015560A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/112508 WO2023015560A1 (en) 2021-08-13 2021-08-13 Systems and methods for sparsity-aware vector processing in general purpose cpus

Publications (1)

Publication Number Publication Date
WO2023015560A1 (en) 2023-02-16

Family

ID=85199751

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105960630A (en) * 2014-02-07 2016-09-21 Arm Limited A data processing apparatus and method for performing segmented operations
CN108008999A (en) * 2016-11-02 2018-05-08 Huawei Technologies Co., Ltd. Index evaluating method and device
CN108369509A (en) * 2015-12-21 2018-08-03 Intel Corporation Instruction for the scatter operation that strides based on channel and logic
CN109643234A (en) * 2016-09-22 2019-04-16 Intel Corporation For merging data element and generate processor, method, system and the instruction of index upgrade
CN110554887A (en) * 2018-06-01 2019-12-10 Intel Corporation Indirect memory fetcher

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Gong, Zhangxiaowen; Ji, Houxiang; Fletcher, Christopher W.; Hughes, Christopher J.; Baghsorkhi, Sara; Torrellas, Josep: "SAVE: Sparsity-Aware Vector Engine for Accelerating DNN Training and Inference on CPUs", 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), IEEE, 17 October 2020 (2020-10-17), pages 796-810, XP033856382, DOI: 10.1109/MICRO50266.2020.00070 *

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 21953173; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 21953173; Country of ref document: EP; Kind code of ref document: A1)