WO2023015560A1

WO2023015560A1 - Systems and methods for sparsity-aware vector processing in general purpose CPUs

Info

Publication number: WO2023015560A1
Application number: PCT/CN2021/112508
Authority: WO
Inventors: Mostafa MAHMOUD; Reza AZIMI; Dawei Li; Wenbo SUN
Original assignee: Huawei Technologies Co.,Ltd.
Priority date: 2021-08-13
Filing date: 2021-08-13
Publication date: 2023-02-16

Abstract

Systems and methods for sparsity-aware vector processing in general purpose CPUs are described. An aspect of the disclosure provides for a method including receiving a stream of vector instructions for processing. The method further includes determining an ineffectual computation corresponding to a first lane of a first vector instruction of the stream of vector instructions and determining an effectual computation corresponding to a second lane of a second vector instruction of the stream of vector instructions, where the second vector instruction is subsequent to the first vector instruction according to a processing order of the stream of vector instructions, and the second lane of the second vector instruction and the first lane of the first vector instruction correspond to a same lane. The method further includes coalescing the second lane with the first lane and processing said stream of vector instructions.

Description

SYSTEMS AND METHODS FOR SPARSITY-AWARE VECTOR PROCESSING IN GENERAL PURPOSE CPUS

CROSS-REFERENCE TO RELATED APPLICATIONS

This is the first application filed for the present invention.

TECHNICAL FIELD

The present invention pertains to the field of computing, and in particular to systems and methods for sparsity-aware vector processing in general purpose CPUs.

BACKGROUND

Many high-performance computing (HPC) and artificial intelligence (AI) applications involve sparse data. Sparsity in vector operations presents challenges during processing such as unnecessary power consumption and wasting execution time. Existing techniques for dealing with sparsity in data have limitations and deficiencies that render their implementations infeasible or unjustified. For example, existing techniques employ complex hardware requirements that may limit the operating frequency of a vector unit, leading to increased power consumption and chip area. Further, existing techniques rely on hardware dependency checking modules for checking and resolving dependencies among vector instructions, which adds a further layer of complexity to the hardware requirements.

Therefore, there is a need for systems and methods to obviate or mitigate one or more limitations of the prior art.

This background information is provided to reveal information believed by the applicant to be of possible relevance to the present invention. No admission is necessarily intended, nor should be construed, that any of the preceding information constitutes prior art against the present invention.

SUMMARY

An aspect of the disclosure provides for a method. The method includes receiving a stream of vector instructions for processing, the stream of vector instructions including a plurality of vector instructions. The method further includes determining an ineffectual computation corresponding to a first lane of a first vector instruction of the stream of vector instructions. The method further includes determining an effectual computation corresponding to a second lane of a second vector instruction of the stream of vector instructions. The second vector instruction is subsequent to the first vector instruction according to a processing order of the stream of vector instructions, and the second lane of the second vector instruction and the first lane of the first vector instruction correspond to a same lane. The method further includes coalescing the second lane of the second vector instruction with the first lane of the first vector instruction. The method further includes processing the stream of vector instructions. The method may provide for reduced hardware complexity and reduced cost due to same-lane coalescing. The method may leverage instruction set architecture (ISA) support for simpler implementations.

In some embodiments, the determining an effectual computation is based on a combination window size indicating a number of vector instructions of the stream of vector instructions subsequent to the first vector instruction. The combination window size may indicate an upper limit or a peak for the performance gain that the method may achieve.

In some embodiments, the stream of vector instructions accumulates to one output register. The method may further provide for simplified hardware support due to coalescing that is based on a stream of instructions that accumulate to the same output register, which may obviate the need for the hardware dependency checking modules that are needed for cross-lane coalescing.

In some embodiments, the coalescing includes replacing the ineffectual computation with the effectual computation. In some embodiments, the processing the stream of vector instructions includes processing the first vector instruction including the effectual computation. The method may provide for condensing a stream of instructions into a reduced form.

In some embodiments, the ineffectual computation includes a vector operand having a zero value. The method may provide for extracting sparsity at a finer granularity in addition to the algorithmic level.

In some embodiments, the stream of vector instructions is indicated by a start-stream indicator and an end-stream indicator. The method may provide for enhanced ISA extensions to mark the beginning and end of a stream of target instructions.

In some embodiments, the method further includes determining a second effectual computation corresponding to a third lane of a third vector instruction of the stream of vector instructions. In some embodiments, the third vector instruction is subsequent to the second vector instruction according to the processing order of the stream of vector instructions. In some embodiments, the third lane of the third vector instruction and the second lane of the second vector instruction correspond to a same lane. In some embodiments, the method further includes coalescing the third lane of the third vector instruction with the second lane of the second vector instruction. The method may provide for reducing a stream of instructions to a condensed form.

In some embodiments, the processing the stream of vector instructions includes processing the second vector instruction including the second effectual computation.

In some embodiments, the method further includes receiving a second stream of vector instructions for processing. In some embodiments, the second stream of vector instructions accumulates to a second output register. In some embodiments, the second stream of vector instructions is indicated by a second start-stream indicator and a second end-stream indicator. In some embodiments, the method further includes processing the second stream of vector instructions.

In some embodiments, the processing the second stream of vector instructions is performed after processing the stream of vector instructions.

Another aspect of the disclosure provides for an apparatus. The apparatus includes one or more mask generation units. The apparatus further includes one or more lane processing units. The apparatus further includes one or more lane coalescing units. The apparatus further includes at least one processor. The apparatus further includes at least one machine readable medium storing executable instructions which when executed by the at least one processor configure the apparatus to perform the methods described herein. For example, the apparatus is configured for receiving a stream of vector instructions for processing, the stream of vector instructions including a plurality of vector instructions. The apparatus is further configured for determining, via the one or more mask generation units, an ineffectual computation corresponding to a first lane of a first vector instruction of the stream of vector instructions. The apparatus is further configured for determining, via the one or more mask generation units, an effectual computation corresponding to a second lane of a second vector instruction of the stream of vector instructions. The second vector instruction is subsequent to the first vector instruction according to a processing order of the stream of vector instructions, and the second lane of the second vector instruction and the first lane of the first vector instruction correspond to a same lane. The apparatus is further configured for coalescing, via the one or more lane coalescing units, the second lane of the second vector instruction with the first lane of the first vector instruction. The apparatus is further configured for processing, via one or more lane processing units, the stream of vector instructions, wherein each lane processing unit corresponds to a corresponding lane in the stream of vector instructions. The apparatus may provide for reduced hardware complexity and reduced cost due to same-lane coalescing. The apparatus may leverage instruction set architecture (ISA) support for simpler implementations.

In some embodiments, the configuration for determining an effectual computation is based on a combination window size indicating a number of vector instructions of the stream of vector instructions subsequent to the first vector instruction. The combination window size may indicate an upper limit or a peak for performance gain that the apparatus may achieve.

In some embodiments, the stream of vector instructions accumulates to one output register. The apparatus may further provide for simplified hardware support due to coalescing that is based on a stream of instructions that accumulate to the same output register, which may obviate the need for the hardware dependency checking modules that are needed for cross-lane coalescing.

In some embodiments, the coalescing includes replacing the ineffectual computation with the effectual computation. In some embodiments, the processing the stream of vector instructions includes processing the first vector instruction including the effectual computation. The apparatus may provide for condensing a stream of instructions into a reduced form.

In some embodiments, the ineffectual computation includes a vector operand having a zero value. The apparatus may provide for extracting sparsity at a finer granularity in addition to the algorithmic level.

In some embodiments, the stream of vector instructions is indicated by a start-stream indicator and an end-stream indicator. The apparatus may provide for enhanced ISA extensions to mark the beginning and end of a stream of target instructions.

In some embodiments, the executable instructions, when executed by the at least one processor, further configure the apparatus for determining, via the one or more mask generation units, a second effectual computation corresponding to a third lane of a third vector instruction of the stream of vector instructions. In some embodiments, the third vector instruction is subsequent to the second vector instruction according to the processing order of the stream of vector instructions. In some embodiments, the third lane of the third vector instruction and the second lane of the second vector instruction correspond to a same lane. In some embodiments, the apparatus is further configured for coalescing, via the one or more lane coalescing units, the third lane of the third vector instruction with the second lane of the second vector instruction. The apparatus may provide for a sparsity-aware vector processing unit (sVPU) for general purpose CPUs that may address the challenge posed by high sparsity ratios. The apparatus may provide for reducing a stream of instructions to a condensed form.

In some embodiments, the executable instructions which when executed by the at least one processor further configure the apparatus for receiving a second stream of vector instructions for processing. In some embodiments, the second stream of vector instructions accumulates to a second output register. In some embodiments, the second stream of vector instructions is indicated by a second start-stream indicator and a second end-stream indicator. In some embodiments, the apparatus is further configured for processing, via the one or more lane processing units, the second stream of vector instructions.

In some embodiments, the processing, via the one or more lane processing units, the second stream of vector instructions is performed after processing the stream of vector instructions.

Other aspects of the disclosure provide for machine readable mediums, apparatus and systems configured to implement the methods disclosed herein. For example, an electronic device can be configured with machine readable memory containing instructions, which when executed by the processors of these devices, configures the device to perform the methods disclosed herein.

Embodiments have been described above in conjunction with aspects of the present invention upon which they can be implemented. Those skilled in the art will appreciate that embodiments may be implemented in conjunction with the aspect with which they are described but may also be implemented with other embodiments of that aspect. When embodiments are mutually exclusive, or are incompatible with each other, it will be apparent to those skilled in the art. Some embodiments may be described in relation to one aspect, but may also be applicable to other aspects, as will be apparent to those of skill in the art.

BRIEF DESCRIPTION OF THE DRAWINGS

Further features and advantages of the present invention will become apparent from the following detailed description, taken in combination with the appended drawings, in which:

FIG. 1 illustrates vector lane coalescing, according to an embodiment of the present disclosure.

FIG. 2 illustrates an example of vector dot (vdot) instruction, according to an embodiment of the present disclosure.

FIG. 3A and 3B illustrate a matrix-vector multiplication, according to an embodiment of the present disclosure.

FIG. 4A and 4B illustrate a matrix-vector multiplication applying block compressed sparse row (BCSR) optimization, according to an embodiment of the present disclosure.

FIG. 5A and 5B illustrate a matrix-vector multiplication applying lane coalescing, according to an embodiment of the present disclosure.

FIG. 6A illustrates sVPU performance gain over BCSR based on SuiteSparse benchmark, according to an embodiment of the present disclosure.

FIG. 6B illustrates sVPU performance gain as a function of combination window size based on SuiteSparse benchmark, according to an embodiment of the present disclosure.

FIG. 6C illustrates sVPU performance gain as a function of combination window size based on high performance conjugate gradient (HPCG) workload, according to an embodiment of the present disclosure.

FIG. 6D illustrates the average coalescing distance based on HPCG workload, according to an embodiment of the present disclosure.

FIG. 7 and 8 illustrate use of stream guards for indicating different streams of instructions, according to an embodiment of the present disclosure.

FIG. 9 illustrates a block diagram of an sVPU u-architecture, according to an embodiment of the present disclosure.

FIG. 10 illustrates a mask generation unit according to an embodiment of the present disclosure.

FIG. 11 illustrates a coalescing method, according to an embodiment of the present disclosure.

FIG. 12 illustrates a schematic diagram of an electronic device that may perform any or all of operations of the methods and features explicitly or implicitly described herein, according to different embodiments of the present disclosure.

It will be noted that throughout the appended drawings, like features are identified by like reference numerals.

DETAILED DESCRIPTION

Exploiting sparsity in CPU vector processing units has been moderately explored. While existing works target sparsity in operands of vector instructions, their proposals come with a number of limitations and deficiencies. For example, the hardware complexity of existing proposals may be prohibitive and render their implementation either infeasible or unjustified.

Existing works have a number of deficiencies for which embodiments described herein may provide solutions. As mentioned herein, existing works are limited due to their high hardware complexity requirements. High hardware complexity may limit the frequency at which the vector unit can run, leading to unjustified high power consumption and extra chip area. The high hardware complexity in existing works may be due to cross-lane coalescing, where zero values in a vector lane can be replaced with non-zero values from other lanes of subsequent vector instructions.

A lane may refer to a position in a vector register. When a vector is being processed, a lane may refer to a position in the sub-vector that is being processed at a time. In the case of processing N elements at a time, there may be N lanes, numbered 0 to N-1. Accordingly, in an embodiment, a vector processing unit may comprise N processing lanes for processing N elements at a time such that each processing lane may correspond to an element or position in the sub-vector that is being processed at a time. Embodiments described herein further describe lane definition.

Embodiments described herein may provide for reduced hardware complexity and reduced cost due to same-lane coalescing. Same-lane coalescing may refer to, for example, replacing an ineffectual value with a value from the same lane of a subsequent instruction. Same-lane coalescing may provide for a reduced hardware cost in the u-Arch.

The high hardware complexity in existing works may be further due to the need for hardware dependency checking modules. Hardware dependency checking modules may be needed for checking and resolving dependencies between instructions that are candidates for coalescing. Since candidates are expected to be writing to arbitrary destination vector registers and coalescing operations are based on cross-lane coalescing, existing methods require such dependency checking modules.

Embodiments described herein may provide for simplified hardware support due to coalescing that is based on a stream of instructions that accumulate to the same output register. As may be appreciated by a person skilled in the art, accumulating to the same output register may obviate the need for the hardware dependency checking modules that are needed for cross-lane coalescing.

Existing works are further limited in that they do not leverage instruction set architecture (ISA) support for simpler implementation. Embodiments described herein may leverage ISA support to enable a same-lane, stream-based approach to coalescing candidate instructions. Embodiments may provide for enhanced ISA extensions (or may use writes to control registers) to mark the beginning and end of a stream of target instructions, thereby marking them as "eligible for coalescing without further dependency checks" since they will all accumulate to the same output.
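By way of non-limiting illustration only, the effect of such stream markers may be sketched in Python. The marker mnemonics `sstart` and `send`, the textual instruction format, and the function name `split_streams` are hypothetical; they are not taken from the disclosure or from any existing ISA. The sketch assumes that only instructions between a start marker and the matching end marker are candidates for coalescing.

```python
def split_streams(program, start="sstart", end="send"):
    """Group instructions into streams delimited by hypothetical start/end markers.

    Instructions outside any marked stream are ignored here, because they are
    not declared eligible for coalescing.
    """
    streams = []
    current = None                      # None means "not inside a stream"
    for inst in program:
        if inst == start:
            current = []                # open a new stream
        elif inst == end:
            if current is not None:
                streams.append(current) # close the stream
                current = None
        elif current is not None:
            current.append(inst)
    return streams


# Illustrative program text; register names are placeholders.
program = [
    "sstart",
    "vdot v15, v1, v2",
    "vdot v15, v3, v4",
    "send",
    "vadd v7, v8, v9",                  # outside any stream
    "sstart",
    "vdot v14, v5, v6",
    "send",
]
streams = split_streams(program)
```

In this sketch every instruction of a given stream writes the same accumulator register, which is the property that would let coalescing proceed without further dependency checks.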

Existing works are further limited in terms of coverage and applicability since such works are limited to vector fused multiply-add instructions. As such, the use cases that may be based on existing techniques are limited.

Embodiments described herein may be applied to a wider set of instructions, thereby broadening the target instructions to include vector dot, vector add, etc. Embodiments described herein may extend to one or more operations and primitives that perform a stream of computations and a reduction on sparse operands.

Data that are being processed by many high-performance computing (HPC) and artificial intelligence (AI) applications are sparse, which means that many data elements have a zero (0) value during execution time. For example, high performance conjugate gradient (HPCG) input matrices can be 99.99% sparse. These matrices may be very large, with dimensions of 1 million x 1 million. Also, for machine learning (ML) applications such as computer vision, natural language processing, and speech recognition, high sparsity levels have been identified by several previous research works. Sparsity in deep neural network (DNN) model parameters can reach up to 98% due to advancements in model pruning techniques. In addition, sparsity of runtime activation values can be around 60%, and even higher with techniques like “dropout” being in use.

Such high sparsity ratios bring a hefty challenge to a central processing unit (CPU) running such applications. Processing a large number of operations (mostly multiplications, additions, and multiply-accumulate (MAC) operations) with zero operands wastes execution time. Since these operations (i.e., those having zero operands) will not affect the output, they can be safely skipped altogether. Accordingly, embodiments may provide for a sparsity-aware vector processing unit (sVPU) for general purpose CPUs that may address the challenge posed by high sparsity ratios.

In some embodiments, an sVPU may skip processing ineffectual computational operations involving zero operands. In some embodiments, for a vector instruction having one or more vector operand elements as zeros, an sVPU may fill the one or more lanes having the zero operands with effectual values from subsequent instructions. Effectively, an sVPU may coalesce multiple vector instructions with sparse vector operands into a single denser vector instruction with reduced (no or fewer) zero values in its operands.

Embodiments described herein may apply to any operations or compute primitives that perform a stream of computations and a reduction on sparse operands. Such operations or primitives may include multiply-accumulate (MAC), sparse matrix-matrix (SpMM) multiplication, sparse matrix-vector (SpMV) multiplication, and embedding operators in recommendation systems (e.g., Sparse Length Sum).

In some embodiments, an sVPU may apply to one or more sets of applications including machine learning applications (e.g., convolution, multilayer perceptron (MLP), and recommendation systems) and HPC.

In some embodiments, vector instructions that are ready to be executed may be allocated to reservation stations (RS) waiting for execution. An sVPU may operate on top of an existing vector unit, as follows. In some embodiments, an sVPU may search through the operands of instructions pending in the reservation stations. An sVPU may further perform lane coalescing operations. Lane coalescing operations may comprise the sVPU finding and dynamically scheduling effectual lanes from subsequent instructions to vacant ineffectual VPU lanes in the current instruction. As a result, fewer instructions may be executed. As may be appreciated by a person skilled in the art, coalescing (e.g., lane coalescing) may lead to skipping compute cycles and executing instructions based on already scheduled effectual lanes, thereby leading to speed-ups.

FIG. 1 illustrates vector lane coalescing, according to an embodiment of the present disclosure. A VPU (e.g., sVPU 100) may comprise one or more lane processing units 105 (e.g., lane 106 processing unit 102 and lane 108 processing unit 104), corresponding to the one or more lanes (e.g., lanes 106 and 108) of a set or a stream of vector instructions (e.g., inst 110, inst 120, and inst 130). Each of the one or more lane processing units 105 may process vector operands in the corresponding instruction lane of the stream of instructions. For example, lane 106 processing unit 102 may process vector operands in the corresponding lane (e.g., 106) of the stream of vector instructions, and lane 108 processing unit 104 may process vector operands in the corresponding lane (e.g., 108) of the stream of vector instructions. Accordingly, in an embodiment, sVPU 100 may process vector instructions (e.g., inst 110, inst 120, and inst 130) where operands are vector registers and the corresponding lanes, or vector elements, across the input vector operands are processed through the same processing lane as illustrated. In an embodiment, the operands that the instructions operate on may be vector registers of, for example, N lanes.

The stream of vector instructions (e.g., inst 110, inst 120, inst 130) may reside in reservation stations ready to be executed. Inst 110 may comprise operation A0 X B0 in lane 106 and C0 X D0 in lane 108. Inst 120 may comprise operation A1 X B1 in lane 106 and C1 X D1 in lane 108. Inst 130 may comprise operation A2 X B2 in lane 106 and C2 X D2 in lane 108. One or more vector operands in the stream of vector instructions may have a value of zero. For example, as illustrated, inst 110 may have vector operand A0 in lane 106 as 0, inst 120 may have vector operands C1 and D1 in lane 108 as 0, and inst 130 may have vector operand B2 in lane 106 and vector operand C2 in lane 108 as 0. A vector element with the value ‘zero’ may indicate that the corresponding lane operation is not effectual (ineffectual) and does not affect the final output.
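By way of non-limiting illustration, the zero pattern described above may be expressed as a per-lane effectual mask. The following Python sketch assumes that a lane is effectual only when both of its multiplication operands are non-zero. The function name `effectual_mask` and all non-zero operand values are placeholders chosen for illustration; only the positions of the zeros follow the description of FIG. 1.

```python
def effectual_mask(first_operand, second_operand):
    """Return one boolean per lane: True when the lane product can be non-zero."""
    return [a != 0 and b != 0 for a, b in zip(first_operand, second_operand)]


# Index 0 stands for lane 106 (A x B), index 1 for lane 108 (C x D).
# Each entry is ([A, C], [B, D]) for one instruction.
stream = [
    ([0, 3], [5, 7]),   # inst 110: A0 = 0
    ([2, 0], [4, 0]),   # inst 120: C1 = 0 and D1 = 0
    ([6, 0], [0, 9]),   # inst 130: B2 = 0 and C2 = 0
]
masks = [effectual_mask(a, b) for a, b in stream]
```

With these placeholder values, inst 110 is effectual only in lane 108, inst 120 only in lane 106, and inst 130 in neither lane.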

In an embodiment, sVPU 100 may determine one or more ineffectual lane operations in a stream of vector instructions. sVPU 100 may further determine one or more subsequent effectual lane operations corresponding to the one or more ineffectual lane operations in the stream of vector instructions. sVPU 100 may coalesce each of the one or more subsequent effectual lane operations with the corresponding ineffectual lane operation. sVPU 100 may then execute the coalesced instructions.

In an embodiment, sVPU 100 may replace an ineffectual lane operation with an effectual operation from a subsequent instruction in the same lane. For example, referring to lane 106 of inst 110, vector operation A0 X B0 is ineffectual since the value of A0 is zero. However, lane 106 of inst 120 may have an effectual operation since no vector operand (e.g., A1 or B1) has a zero value. Accordingly, in an embodiment, sVPU 100 may coalesce lane 106 of inst 120 with lane 106 of inst 110, thereby replacing an ineffectual operation (e.g., the vector operation in lane 106 of inst 110 (A0 X B0)) with a subsequent effectual operation (e.g., the vector operation in lane 106 of inst 120 (A1 X B1)).

Therefore, sVPU 100 may fill the “bubble” in lane 106 of inst 110 with a subsequent vector operation in the same lane 106, which in the embodiment of FIG. 1 happens to be lane 106 of inst 120.
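By way of non-limiting illustration, same-lane coalescing may be modelled in software as packing the effectual operations of each lane toward the front of the stream. The Python sketch below is a simplified model and not the disclosed hardware: it assumes an unbounded search scope, a multiply-accumulate operation per lane, and a stream whose instructions all accumulate to one output. The names `coalesce_same_lane` and `accumulate` and the non-zero operand values are hypothetical.

```python
def coalesce_same_lane(stream):
    """Pack the effectual (a, b) pairs of each lane toward the front of the stream.

    stream: list of instructions, each a list of per-lane (a, b) operand pairs.
    """
    lanes = len(stream[0])
    # Per lane, keep only the operand pairs whose product can be non-zero.
    queues = [
        [inst[lane] for inst in stream if inst[lane][0] != 0 and inst[lane][1] != 0]
        for lane in range(lanes)
    ]
    depth = max(len(q) for q in queues)
    # Lanes that run out of effectual work are padded with an ineffectual (0, 0).
    return [[q[t] if t < len(q) else (0, 0) for q in queues] for t in range(depth)]


def accumulate(stream):
    """Sum the per-lane products of every instruction into one output vector."""
    lanes = len(stream[0]) if stream else 0
    return [sum(inst[lane][0] * inst[lane][1] for inst in stream) for lane in range(lanes)]


# Zero positions follow FIG. 1; index 0 is lane 106 and index 1 is lane 108.
fig1 = [
    [(0, 5), (3, 7)],   # inst 110
    [(2, 4), (0, 0)],   # inst 120
    [(6, 0), (0, 9)],   # inst 130
]
dense = coalesce_same_lane(fig1)
```

Here the three sparse instructions collapse into a single dense instruction, and the accumulated output is unchanged because each operation stays in its own lane and only the order of accumulation changes. With floating-point operands, reordering the accumulation may alter rounding; the sketch uses integers and does not model that effect.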

In some embodiments, for practical implementations, sVPU 100 may limit its search scope for determining a subsequent effectual lane to a combination window (CW) 140 of N instructions, where N may be a design time parameter. A size N combination window may indicate that the sVPU may look into the operands of up to N ready instructions residing in the reservation stations. In the embodiment of FIG. 1, a size 3 combination window 140 (e.g., 3 instructions: 110, 120 and 130) is illustrated, which includes the current instruction (e.g., inst 110).
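By way of non-limiting illustration, the effect of a bounded combination window may be sketched in Python as a greedy search. The sketch assumes that the window counts the current instruction, that the first effectual operation found in the same lane is pulled forward, and that instructions left with no effectual lane are dropped. These policies, the function name `coalesce_with_window`, and the non-zero operand values are assumptions made for illustration.

```python
def coalesce_with_window(stream, window):
    """Greedy same-lane coalescing limited to `window` instructions.

    stream: list of instructions, each a list of per-lane (a, b) operand pairs.
    window: number of instructions inspected, counting the current one.
    Returns the instructions that would actually be issued.
    """
    pending = [list(inst) for inst in stream]   # work on a copy
    lanes = len(pending[0]) if pending else 0
    issued = []
    for i, current in enumerate(pending):
        for lane in range(lanes):
            a, b = current[lane]
            if a != 0 and b != 0:
                continue                        # lane already effectual
            # Search later instructions inside the window, same lane only.
            for j in range(i + 1, min(i + window, len(pending))):
                c, d = pending[j][lane]
                if c != 0 and d != 0:
                    current[lane] = (c, d)      # fill the bubble
                    pending[j][lane] = (0, 0)   # mark the donor lane as consumed
                    break
        if any(a != 0 and b != 0 for a, b in current):
            issued.append(current)              # all-ineffectual instructions are skipped
    return issued


# Zero positions follow FIG. 1; index 0 is lane 106 and index 1 is lane 108.
fig1 = [
    [(0, 5), (3, 7)],   # inst 110
    [(2, 4), (0, 0)],   # inst 120
    [(6, 0), (0, 9)],   # inst 130
]
no_lookahead = coalesce_with_window(fig1, window=1)
with_cw3 = coalesce_with_window(fig1, window=3)
```

In this example a window of 1 only drops the all-ineffectual inst 130 and issues two instructions, while a window of 3 issues a single dense instruction, which illustrates why the window size bounds the achievable gain.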

Embodiments described in reference to FIG. 1, including operations performed via sVPU 100 (e.g., searching, coalescing and executing mechanisms as described herein), may be applied to one or more vector instructions including vector fused multiply-add, vector dot operations, and vector reduction instructions.

FIG. 2 illustrates an example of a vector dot (vdot) instruction, according to an embodiment of the present disclosure. The vdot instruction of FIG. 2 may be similar to that implemented in RISC-V Divided Element Extension (EDIV). Similar vector dot instructions may also be available in other architectures with extensions such as ARM SVE and Intel x86 AVX.

The example vector dot instruction 200 may take two input vectors (e.g., vOp 210 and vOp 220), each of length 16 elements (a 16 x 32b vector register for each operand). The dot product between each pair of corresponding 4-element sub-vectors (indicated by matching hash pattern) in the two operands (e.g., vOp 210 and vOp 220) may be performed and the results accumulated to the corresponding accumulation register (as indicated by matching hash pattern) in the accumulator operand vAcc 230.

For example, the dot product between the sub-vector of 4 elements 212 of vOp 210 and the sub-vector of 4 elements 222 of vOp 220 may be performed and accumulated to the corresponding accumulation register 232 in the accumulator operand vAcc 230. As illustrated, vAcc 230 may comprise a 4 x 128b vector register for accumulation, for which only 64b may be used (indicated by hash pattern).
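By way of non-limiting illustration, the arithmetic of such a vector dot instruction may be sketched in Python. The sketch models only the grouping and accumulation described above; element widths, register sizes and any saturation behaviour are not modelled. The function name `vdot` and the operand values are placeholders.

```python
def vdot(v_acc, v_op1, v_op2, group=4):
    """Accumulate the dot product of each `group`-element sub-vector pair.

    v_acc holds one accumulator per sub-vector; v_op1 and v_op2 hold
    group * len(v_acc) elements each.
    """
    assert len(v_op1) == len(v_op2) == group * len(v_acc)
    result = []
    for g, acc in enumerate(v_acc):
        lo, hi = g * group, (g + 1) * group
        result.append(acc + sum(x * y for x, y in zip(v_op1[lo:hi], v_op2[lo:hi])))
    return result


v_op1 = list(range(16))     # 16 elements, i.e. four sub-vectors of 4
v_op2 = [1] * 16
acc = vdot([0, 0, 0, 0], v_op1, v_op2)
# acc == [0+1+2+3, 4+5+6+7, 8+9+10+11, 12+13+14+15] == [6, 22, 38, 54]
acc = vdot(acc, v_op1, v_op2)   # a second call keeps accumulating
```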

FIG. 3A and 3B illustrate a matrix-vector multiplication, according to an embodiment of the present disclosure. The matrix-vector multiplication algorithm (e.g., multiplication of matrix A 302 with vector B 304) illustrated in FIG. 3A and 3B may be implemented using the vector dot instruction as described, for example, in reference to FIG. 2, without loss of generality with respect to other implementations, for example, those using vector multiply-add instructions.

Referring to FIG. 3A, the rows of the input matrix A 302 may be grouped such that each group may comprise as many rows as the number of sub-groups in the vdot instruction, e.g., vdot instruction 200. In the example vdot instruction 200, the number of subgroups is 4, so, each 4 rows of the matrix A may be grouped to be processed simultaneously.

Accordingly, referring to FIG. 3A, in an embodiment, each vdot instruction may process a block of, for example, 4x4 elements, shown as b1 311, b2 312, ..., b7 317 against the corresponding sub-vector of B 304 shown as v1 321, v2 322, …, v7 327. As may be appreciated by a person skilled in the art, the illustrated 4x4 block in embodiments described herein is for illustration purposes only, and thus any block dimensions may be used according to the embodiments of the present disclosure. Similarly, the corresponding sub-vector, for illustration purposes, may be 4x1.

Thus, processing each group of 4 rows against the input vector B may be implemented as a stream of vdot instructions, referring to FIG. 3B, wherein each instruction may take a block (e.g., b1 311, b2 312, …, b7 317) as its first vector operand along with a sub-vector of B (e.g., v1 321, v2 322, …, v7 327) broadcast to fill the second vector operand. Each instruction, e.g., vdot 330 involving b1 and v1, may be referred to as one timestep. Accordingly, the multiplication of matrix A 302 with vector B 304 may involve seven timesteps, one for each instruction of the instructions 340 as illustrated. As may be appreciated by a person skilled in the art, the registers indicated in the instructions 340 may refer to the corresponding block of matrix A 302 and sub-vector of B 304. For example, referring to the first instruction, “vdot vReg15, vReg1, vReg2”, vReg1 may refer to the register (e.g., vector register 1 (vReg1)) that comprises block 1, and vReg2 may refer to the register (e.g., vReg2) that comprises sub-vector v1 (repeated 4 times to correspond with block 1). The results of all the instructions belonging to the same stream (which in this embodiment may be defined by the 4-row group) may accumulate to the same output vector register, e.g., vReg15, as illustrated.
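By way of non-limiting illustration, the stream of vdot timesteps for one 4-row group may be sketched in Python. The sketch assumes row-major flattening of each 4x4 block into the first operand, a vector length that is a multiple of the block size, and a single accumulator vector standing in for the output register. The helper names `vdot` and `matvec_row_group` and the matrix values are placeholders.

```python
def vdot(v_acc, v_op1, v_op2, group=4):
    """Accumulate per-group dot products of v_op1 and v_op2 into v_acc."""
    return [
        acc + sum(x * y for x, y in zip(v_op1[g * group:(g + 1) * group],
                                        v_op2[g * group:(g + 1) * group]))
        for g, acc in enumerate(v_acc)
    ]


def matvec_row_group(rows, b, block=4):
    """Multiply `block` matrix rows by vector b as a stream of vdot timesteps."""
    acc = [0] * block                        # stands in for the output register
    for k in range(len(b) // block):         # one timestep per block
        cols = slice(k * block, (k + 1) * block)
        v_block = [x for row in rows for x in row[cols]]   # block, flattened row by row
        v_sub = b[cols] * block              # sub-vector repeated `block` times
        acc = vdot(acc, v_block, v_sub, block)
    return acc


# A 4 x 28 row group, i.e. seven 4x4 blocks and seven timesteps.
rows = [[(r + 1) * (c % 3) for c in range(28)] for r in range(4)]
b = list(range(28))
out = matvec_row_group(rows, b)
```

The result can be compared with a direct row-by-row dot product to check that the blocked stream computes the same four outputs.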

FIG. 4A and 4B illustrate a matrix-vector multiplication applying block compressed sparse row (BCSR) optimization, according to an embodiment of the present disclosure. Referring to FIG. 4A, input matrix A 402 may be a sparse matrix wherein the 4x4 blocks b2 412, b5 415 and b7 417 may be all zeros (i.e., having zero values for all elements in the block). The all-zero blocks, b2 412, b5 415 and b7 417, are illustrated as empty (no hash patterns). For a sparse input matrix, e.g., matrix A 402, a typical algorithmic optimization, such as BCSR, may be used to eliminate those 4x4 blocks that are all zeros. Accordingly, referring to FIG. 4B, the corresponding vdot instructions (for the all-zero blocks) may be avoided altogether (illustrated as crossed out). Thus, processing the 4-row group (which may refer to the part of the multiplication of matrix A 402 with vector B 404 that includes only a group of 4 rows) may only involve the subset of instructions corresponding to blocks b1 411, b3 413, b4 414, and b6 416 as illustrated (in FIG. 4B). Accordingly, the seven-timestep process may be reduced to a four-timestep process 460 by avoiding the all-zero blocks. As may be appreciated by a person skilled in the art, the BCSR optimization may avoid only the all-zero blocks (all elements of the block having zero values).
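By way of non-limiting illustration, the block-level elimination described above may be sketched in Python. The sketch does not build an actual BCSR data structure; it only identifies which 4x4 column blocks of a 4-row group contain a non-zero element, which is the information that determines the number of timesteps. The function name `nonzero_blocks` and the matrix values are placeholders, with the zero blocks placed as in FIG. 4A.

```python
def nonzero_blocks(rows, block=4):
    """Return the indices of column blocks holding at least one non-zero element."""
    n_blocks = len(rows[0]) // block
    return [
        k for k in range(n_blocks)
        if any(x != 0 for row in rows for x in row[k * block:(k + 1) * block])
    ]


# Seven 4x4 blocks; b2, b5 and b7 (indices 1, 4 and 6) are all zeros.
zero_blocks = (1, 4, 6)
rows = [[0 if c // 4 in zero_blocks else r + c + 1 for c in range(28)]
        for r in range(4)]
b = list(range(1, 29))

kept = nonzero_blocks(rows)                 # blocks that still need a timestep
reduced = [
    sum(row[c] * b[c] for k in kept for c in range(k * 4, (k + 1) * 4))
    for row in rows
]
full = [sum(x * y for x, y in zip(row, b)) for row in rows]
```

Only four of the seven blocks remain, and the reduced computation yields the same outputs as the full one. Zeros inside the remaining blocks are not removed by this step.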

Embodiments described herein may provide for extracting sparsity at block level and within blocks. sVPU 100 may extract sparsity at block level, if not implemented on the algorithmic level (e.g., using a BCSR representation of the matrix), as well as fine-grain sparsity within blocks as described herein. For the former (i.e., at block level), sVPU 100 may detect instructions with all-zero input operands and eliminate them altogether from the reservation station. For the latter (i.e., fine-grain sparsity within blocks), the remaining instructions corresponding to blocks with fine-grain sparsity (i.e., having zero value elements but also non-zero elements) may be coalesced into a smaller number of instructions, which may further enhance processing (e.g., speedup) in addition to the algorithmic BCSR optimization.

As may be appreciated by a person skilled in the art, sparsity may be partially extracted on the algorithmic level using a BCSR representation of the matrix as described herein. As described, BCSR may eliminate blocks that are entirely zeros. Embodiments described herein may provide for extracting sparsity both on the block level (like BCSR) and at a finer granularity within blocks.

FIG. 5A and 5B illustrate a matrix-vector multiplication applying lane coalescing, according to an embodiment of the present disclosure. Referring to FIG. 5A, input matrix A 502 may be a sparse matrix wherein the 4x4 blocks b2 512, b5 515 and b7 517 may be all zeros (similar to matrix A 402) . The remaining blocks, b1 511, b3 513, b4 514, and b6 516 may comprise zero and non-zero values. For illustrative purposes, only the elements within the 4x4 blocks b1 511 and b3 513 are shown, in which zero values are indicated as empty (no hash pattern) and non-zero values are indicated via hash patterns.

As discussed previously, the all-zero blocks b2 512, b5 515 and b7 517 may be avoided for processing. Accordingly, the vdot instructions may be based on blocks b1 511, b3 513, b4 514, and b6 516. Referring to FIG. 5B, in an embodiment, sVPU 100 may look at or examine the instructions to be executed next. sVPU 100 may determine one or more lanes having ineffectual computations due to one of the two corresponding input values being zero. For example, sVPU 100 may begin processing the vdot instructions from block b1 511, which is illustrated as one row. sVPU 100 may determine one or more lanes having ineffectual computations, e.g., lanes 530, 532, 534, 536, 538 and 540, due to zero values in these lanes (zero values indicated as empty boxes with no hash pattern).

For each determined lane having an ineffectual computation, sVPU 100 may search, in future or subsequent instructions, for an effectual computation (both operand values are non-zero) corresponding to the same lane (e.g., lane 530) . The search may be based on a combination window of size N. In FIG. 5B, the combination window size N may be, for example, 4 as illustrated.

In an embodiment, for lane 530, sVPU 100 may search in future instructions, e.g., the vdot instruction based on b3 513 and v3 523, and determine an effectual computation for the corresponding lane 530. Upon determining the effectual computation, sVPU 100 may replace 550 the ineffectual computation (due to a zero-valued operand) in the current instruction with the determined effectual computation in the same lane 530, brought forward from a future instruction within the combination window scope as illustrated.

This replacement mechanism may be referred to as coalescing and is indicated via arrows pointing from an effectual lane in a future instruction to the corresponding ineffectual lane in the current instruction.

As may be appreciated by a person skilled in the art, a computation may be ineffectual due to a zero value of either operand (i.e., the operand from matrix A or the operand from vector B). As such, while embodiments described herein may refer to an ineffectual computation based on an operand of matrix A having a zero value, a person skilled in the art may appreciate that a computation may be determined to be ineffectual due to a zero value of the vector B operand despite the matrix A operand being non-zero.

sVPU 100 may take the same approach, as taken with respect to the ineffectual lane 530 of the current instruction, to the ineffectual lanes 532, 534, 536, 538, 540 of the current instruction. As illustrated, ineffectual lanes 532, 534 and 540 corresponding to the current instruction may be replaced with effectual lanes 532, 534, and 540 corresponding to the subsequent instruction (based on block b3 513). For ineffectual lanes 536 and 538 in the current instruction, sVPU 100 may look into further subsequent instructions (based on combination window size N) to determine effectual lanes.

As may be appreciated by a person skilled in the art, combination window may refer to the scope of coalescing. Coalescing may be applied to two or more instructions that belong to the same row of blocks and accumulate to the same output vector register (e.g., vReg 15) .

Accordingly, in addition to the BCSR optimization (which reduced instructions by avoiding all-zero blocks, see 460 in FIG. 4B), embodiments may further enhance processing via coalescing to further reduce the instructions to be processed (i.e., avoiding zeros within BCSR blocks), resulting in, for example, a coalesced stream 560 as described herein.

Accordingly, the stream of instructions corresponding to the row of blocks may be coalesced into a smaller number of mixed instructions 560 where effectual lanes from different instructions may be packed and processed together. As such, in an embodiment, a matrix-vector multiplication which may require seven timesteps (as discussed in reference to FIG. 3A and 3B) may be reduced to four timesteps (as discussed in reference to FIG. 4A and 4B) , and may further be reduced to two timesteps 560 as illustrated.

As discussed, a combination window may determine the scope of coalescing wherein the window size may indicate the depth or how far in future instructions the sVPU may search for effectual lanes. The window size may also determine the upper limit on the speedup that sVPU may achieve. For example, for a combination window size of N, N instructions may be coalesced into 1, which may lead to a speed up of N times. In an embodiment, the combination window may be a moving window, i.e., as an instruction is executed, the combination window may slide or move and may be rebased or repositioned to start at the next instruction to execute.
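The effect of the combination window size on the number of timesteps can be illustrated with a small simulation. This is a sketch under stated assumptions: each instruction is reduced to a list of per-lane booleans (True meaning an effectual lane), the window of size N is taken to include the current instruction plus the next N-1 instructions, and the function name `coalesce_timesteps` is hypothetical.

```python
def coalesce_timesteps(masks, window):
    """Count issued instructions after same-lane coalescing within a sliding window.

    masks: per-instruction lists of booleans, in program order (True = effectual lane).
    window: combination window size N (current instruction included, by assumption).
    """
    # Instructions with no effectual lane are dropped, as with all-zero operands.
    pending = [list(m) for m in masks if any(m)]
    timesteps = 0
    while pending:
        current = pending[0]
        for lane, effectual in enumerate(current):
            if effectual:
                continue
            # Search later instructions in program order, same lane position only.
            for later in pending[1:window]:
                if later[lane]:
                    later[lane] = False   # lane is brought forward
                    current[lane] = True
                    break
        timesteps += 1                    # issue the (possibly mixed) instruction
        # The window slides: fully coalesced instructions are removed.
        pending = [m for m in pending[1:] if any(m)]
    return timesteps

# Four instructions, each with one effectual lane in a distinct lane position.
example = [
    [True, False, False, False],
    [False, True, False, False],
    [False, False, True, False],
    [False, False, False, True],
]
```

For this example, a window of 1 leaves four timesteps, a window of 2 gives two, and a window of 4 gives one, which is consistent with N instructions being coalesced into one at best.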

In an embodiment, sVPU 100 may coalesce instructions that belong to the same stream, i.e., instructions that belong to the same row of blocks and accumulate to the same output vector register, as further described in reference to FIG. 7 and 8. Coalescing instructions that belong to the same stream may be a key factor in reducing the hardware overhead and complexity of the lane coalescing mechanism. Limiting coalescing to the same stream may avoid the need for any data-dependency checking hardware, since all instructions considered for coalescing may accumulate to the same accumulator vector register.

In some embodiments, sVPU 100 may coalesce instructions based on same-lane coalescing, as described, for example, in reference to FIG. 5B. Same-lane coalescing may replace an ineffectual lane x, for example, with an effectual lane x from a future instruction. Same-lane coalescing may avoid the complex hardware needed to support cross-lane coalescing.

As may be appreciated by a person skilled in the art, embodiments described herein may be applied for compute or storage purposes. In an embodiment, once the data to be computed is available at their corresponding registers of a processor, such data may be examined and combined (e.g., coalesced) according to embodiments described herein to produce a reduced number of instructions for compute purposes.

In an embodiment, sVPU 100 was tested with high performance conjugate gradient (HPCG) workload and the SuiteSparse benchmark (1000 matrices were randomly selected from the benchmark suite) . Different combination window sizes were attempted and the results observed. Based on the tests, the following observations may be drawn.

As the combination window size increases, the potential performance gain may also increase. On the other hand, the increased window size may also lead to increased hardware complexity. Accordingly, at a certain combination window size, the performance gain may peak.

For example, with respect to SuiteSparse, the performance gain may peak at combination window size of 16, where the speedup may saturate at 2.2x on average over the 1000 matrices. In the case of HPCG workload, the performance gain may peak at combination window size of 7, where the speedup may saturate at around 3x.

FIG. 6A illustrates sVPU performance gain over BCSR based on the SuiteSparse benchmark, according to an embodiment of the present disclosure. The horizontal axis of FIG. 6A refers to different matrices of the SuiteSparse benchmark (only 29 matrices are shown for illustrative purposes). The vertical axis of FIG. 6A refers to performance gain (i.e., speedup) of the sVPU over BCSR. As mentioned, a random sample of 1000 matrices from the SuiteSparse benchmark was tested. Within the sample, speedups of up to 9x (not shown) and an average speedup of 2.2x (illustrated via line 602) were observed. The speedup of sVPU over BCSR for a portion of the random sample is illustrated.

FIG. 6B illustrates sVPU performance gain as a function of combination window size based on the SuiteSparse benchmark, according to an embodiment of the present disclosure. The horizontal axis of FIG. 6B refers to different combination window sizes. The vertical axis of FIG. 6B refers to performance gain (i.e., speedup) over a combination window size of 2. Referring to FIG. 6B, the performance gain saturates as combination window size increases. The performance gain peaks at combination window size 16 (illustrated via hash lines), with approximately 60% performance gain over combination window size 2, as illustrated via line 604.

FIG. 6C illustrates sVPU performance gain as a function of combination window size based on high performance conjugate gradient (HPCG) workload, according to an embodiment of the present disclosure. The horizontal axis of FIG. 6C refers to different combination window sizes. The vertical axis of FIG. 6C refers to performance gain (i.e., speedup) over BCSR. As illustrated, the performance gain saturates as combination window size increases. The performance gain peaks at combination window size 7 (illustrated via hash lines) reaching a performance gain of just below 3 times that of BCSR (illustrated via line 606) .

FIG. 6D illustrates the average coalescing distance based on HPCG workload, according to an embodiment of the present disclosure. The horizontal axis of FIG. 6D refers to different combination window sizes. The vertical axis of FIG. 6D refers to the average coalescing distance associated with the corresponding combination window size. The average distance illustrated indicates the average distance between the current instruction and the furthest coalesced lane (i.e., in a future instruction). The average coalescing distance based on applying sVPU 100 to HPCG workload was determined to be 2.25 instructions (illustrated via line 608) at a combination window size of 6 (illustrated via hash lines), indicating that an instruction, on average, became sufficiently dense by coalescing lanes from the subsequent 3 (rounding up 2.25) instructions.

The performance gain (i.e., speedups) achieved may be significant considering the small hardware complexity associated with sVPU, coalescing conditions and combination window size. Coalescing conditions may be based on same-lane coalescing within a stream of instructions accumulating to the same accumulation registers. The combination window size may be based on the peak performance gain as described herein.

Indicators for a stream of instructions that accumulate to the same accumulation registers will now be described. Embodiments may provide for start and end indicators for a stream of instructions.

In an embodiment, sVPU 100 may use a stream-guards technique to mark or indicate the beginning and end of a stream of vector instructions that accumulate to the same output vector register. Such a stream of instructions may be safely coalesced without the need for data-dependency checking hardware. Otherwise, if instructions accumulating to different output vector registers are considered for coalescing, special and expensive hardware may be needed to ensure the lane-wise data-independence of the outputs of these candidate instructions.

Accordingly, sVPU 100 may use stream guards surrounding the stream of instructions accumulating to the same output vector register. These stream guards may indicate to the backend sVPU hardware that the instructions within the stream guards are functionally correct to coalesce.

In an embodiment, sVPU 100 may insert stream guards surrounding each stream of vdot instructions where a stream corresponds to a complete row of 4x4 blocks. For example, in reference to FIG. 3 to FIG. 5, sVPU 100 may insert stream guards before the beginning and after the end of matrix A, for example, to indicate that the row of blocks comprising b1 to b7 is one stream of vdot instructions to which coalescing operations, as described herein, may be applied. As such, the stream guards may limit sVPU coalescing to vdot instructions from the same row of blocks and avoid coalescing vdot instructions from the next row of blocks as further described herein.

FIG. 7 and 8 illustrate the use of stream guards for indicating different streams of instructions, according to an embodiment of the present disclosure. In an embodiment, a matrix-vector multiplication may comprise an input matrix A which may include two rows of 4x4 blocks 740 and 742 as illustrated. The first row of blocks 740 may correspond to vector instructions that accumulate to a first output vector register, and the second row of 4x4 blocks 742 may correspond to vector instructions that accumulate to a second output vector register.

In an embodiment, referring to FIG. 8, sVPU 100 may use stream guards to indicate the beginning (via, for example, sVPU_stream_start) and the end (via, for example, sVPU_stream_end) of a stream of instructions that accumulate to the same output vector register. Accordingly, the first row of 4x4 blocks 740 may be indicated as a first stream of instructions 750 using stream guards as illustrated. Similarly, the second row of 4x4 blocks 742 may be indicated as a second stream of instructions 752 using stream guards as illustrated. While the same register (e.g., vReg15) is illustrated as the output register for accumulating both streams of instructions 750 and 752, a person skilled in the art may appreciate that after processing the first stream of instructions 750 and before processing or accumulating the second stream of instructions 752, the outputs in the register may be flushed and stored in a memory and the register may then be initialized to zero in preparation for the second stream of instructions 752. In another embodiment, a different output vector register may be used for each stream of instructions.

As may be appreciated by a person skilled in the art, all-zero blocks from the first row of 4x4 blocks 740 and the second row of 4x4 blocks 742 may be avoided as indicated in the streams of instructions 750 and 752. sVPU 100 may apply coalescing mechanisms as described herein to the remaining blocks (e.g., blocks b1 711, b3 713, b4 714, and b6 716 for the first row of 4x4 blocks 740 corresponding to the first stream of instructions 750, and blocks b9 719, b10 720, b12 722 and b14 724 for the second row of 4x4 blocks 742 corresponding to the second stream of instructions 752), since the remaining blocks may comprise zero value elements.

In an embodiment, stream guards may be implemented either as an extension to the instruction set architecture (ISA) or through introducing a new control/status register (CSR) .

A CSR approach to implementing stream guards may have minimal intrusion to the ISA. In an embodiment, sVPU 100 may introduce a new CSR named SVPU_CR. A stream start may be marked by writing “1” to SVPU_CR and a stream end may be marked by writing “0” to SVPU_CR, thereby enabling (when, for example, “1” is written) and disabling (when, for example, “0” is written) coalescing by the sVPU backend.

In an embodiment, the SVPU_CR may be implemented as a 1-bit register; accordingly, there may be no support for nesting of streams. That is, sVPU 100 may keep track of a single stream at a time. So, if a new stream (i.e., a second stream) needs to be started, the current one (i.e., the first stream) needs to be terminated (with the stream guard closure) before beginning the new stream. Then, after finishing the new stream (i.e., the second stream), the remainder of the old stream that was not processed due to termination may be processed as a new shorter stream (i.e., a third stream). When terminating the first stream, no state needs to be retained.

Typically, a CSR may be written using an atomic CSR read and write instruction, which may be available in every ISA, such as CSRRW for the RISC-V ISA.
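The CSR-based stream guards described above can be modelled in software as follows. This is a hedged sketch rather than the disclosed hardware behaviour: the trace format of `("csr", value)` and `("vdot", name)` tuples and the function name `split_streams` are assumptions made for illustration, and the 1-bit SVPU_CR is modelled as a single boolean so that only one stream is tracked at a time.

```python
def split_streams(trace):
    """Group vdot instructions into coalescable streams using 1-bit guard writes.

    trace: list of ("csr", 0 or 1) writes to SVPU_CR and ("vdot", name) instructions.
    Returns a list of streams; coalescing would only be applied within one stream.
    """
    enabled = False          # models the single bit of SVPU_CR
    streams, current = [], []
    for kind, payload in trace:
        if kind == "csr":
            # Any guard write closes the stream in flight; no state is retained.
            if current:
                streams.append(current)
                current = []
            enabled = (payload == 1)
        elif enabled:
            current.append(payload)
        else:
            # Outside guards, each instruction stands alone (never coalesced).
            streams.append([payload])
    if current:
        streams.append(current)
    return streams

trace = [
    ("csr", 1), ("vdot", "b1"), ("vdot", "b3"), ("csr", 0),
    ("csr", 1), ("vdot", "b9"), ("vdot", "b10"), ("csr", 0),
]
```

Because the register holds a single bit, a second start written before the first end simply closes the stream in flight in this model, which mirrors the absence of nesting support.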

Embodiments described herein introduce the concept of “instructions stream” . Embodiments described herein may limit coalescing to a stream of instructions accumulating to the same output vector register as determined according to stream guards. Coalescing based on a stream of instructions, as described herein, may allow for reduced costs in terms of hardware by obviating the need for data-dependency checking hardware that is otherwise necessary to resolve dependency between candidate instructions. By limiting the coalescing candidate instructions to the same stream of instructions, coalescing may be performed without dependency checks.

FIG. 9 illustrates a block diagram of an sVPU micro-architecture, according to an embodiment of the present disclosure. In an embodiment, the execution backend 900 may comprise one or more of a reorder buffer (ROB) 902, a lane coalescing unit (LCU) 904, a mask generation unit (MGU) 906, a reservation station (RS) 908, a vector register file 910, and an sVPU 100.

FIG. 10 illustrates a mask generation unit according to an embodiment of the present disclosure. Referring to FIG. 9 and FIG. 10, the MGU 906 may generate an effectual lane mask (ELM) for each vdot instruction that resides in RS 908 and is ready to be executed. In an embodiment, for each lane of the vector inputs, the corresponding bit in the generated mask may indicate whether the input value is effectual or not. As such, the ELM may indicate which lanes have effectual values. The generated ELMs may be kept either in a newly added field “ELM 1002” in the RS table or in a mask physical register file if available in the CPU architecture.

FIG. 10 illustrates one or more MGUs 906 including comparators, NOR gates, and a new field “ELM 1002” added to the RS table to keep the generated masks. For each lane, each vector operand 1004 and 1006 may be evaluated to determine whether either operand is zero and thus ineffectual. The results (i.e., ineffectual or effectual) may be indicated in the RS table accordingly. The MGU 906 may be instantiated N times, where N is the size of the combination window, so that up to N instructions may be investigated in parallel at a time.
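The mask generation described above can be sketched per lane as two zero comparators feeding a NOR gate. The function name `effectual_lane_mask` and the representation of vector operands as Python lists are assumptions for illustration only.

```python
def effectual_lane_mask(op_a, op_b):
    """Return one ELM bit per lane: 1 when neither operand of the lane is zero."""
    mask = []
    for a, b in zip(op_a, op_b):
        a_is_zero = (a == 0)    # comparator on the first vector operand
        b_is_zero = (b == 0)    # comparator on the second vector operand
        # NOR gate: the lane is effectual only when both comparator outputs are low.
        mask.append(int(not (a_is_zero or b_is_zero)))
    return mask
```

For instance, operands `[1, 0, 2, 3]` and `[4, 5, 0, 6]` have a zero in lane 1 of the first operand and in lane 2 of the second, so only lanes 0 and 3 are marked effectual.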

Referring to FIG. 9, the LCU 904 may use the masks (e.g., ELM 1002) generated by the MGUs 906 as input and decide which lanes to coalesce accordingly. In an embodiment, the coalescing mechanism (e.g., LCU 904) may employ the method described in FIG. 11 to perform the coalescing operations as described herein.

FIG. 11 illustrates a coalescing method, according to an embodiment of the present disclosure. At 1102, one or more ineffectual lanes in the current vector instruction are identified or determined according to embodiments described herein.

At 1104, for each determined ineffectual lane, LCU 904 may inspect the corresponding lane position in each ready instruction within the combination window in program order.

At 1106, the corresponding effectual lane in the earliest ready instruction is coalesced to fill the bubble of the ineffectual lane and the corresponding input operands are brought forward and packed into the current instruction vector inputs.

At 1108, when a lane is coalesced from a subsequent instruction, the corresponding bit in the ELM mask may be zeroed, thereby marking or indicating that the effectual lane has been successfully coalesced and will be executed. This ensures the lane will not be considered again while performing future coalescing.

In some embodiments, at 1110, the LCU 904 may keep a record of the original instruction from which a lane is coalesced so that, in case of an interruption, the machine state is maintained by squashing the lanes coalesced from subsequent instructions into the current one.

After packing effectual lanes from current and subsequent instructions, the mixed instruction may be issued for execution by the vector processing engine. Any subsequent instruction for which all the effectual lanes have been coalesced and executed is removed from the reservation station and marked as “done”.
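The steps 1102 to 1110 above can be sketched for a single current instruction as follows. This is an illustrative model only: the reservation station is represented as a list of dictionaries with hypothetical keys `id`, `elm`, `a` and `b`, the window is assumed to include the current instruction, and the function name `coalesce_once` is not taken from the disclosure.

```python
def coalesce_once(rs, window):
    """Build one mixed instruction from rs[0] and later entries in the window.

    rs: reservation-station entries in program order; rs[0] is the current one.
    Later entries have their ELM bits cleared in place when a lane is taken.
    Returns (mixed instruction, ids of later entries with no effectual lane left).
    """
    cur = rs[0]
    lanes = len(cur["elm"])
    a, b = list(cur["a"]), list(cur["b"])
    # Record of the original instruction of each lane (step 1110).
    origin = [cur["id"] if cur["elm"][i] else None for i in range(lanes)]
    for lane in range(lanes):
        if cur["elm"][lane]:
            continue                                  # step 1102: lane is effectual
        for later in rs[1:window]:                    # step 1104: program order
            if later["elm"][lane]:
                # Step 1106: bring the operands forward into the same lane.
                a[lane], b[lane] = later["a"][lane], later["b"][lane]
                later["elm"][lane] = 0                # step 1108: clear the ELM bit
                origin[lane] = later["id"]
                break                                 # earliest instruction wins
    done = [e["id"] for e in rs[1:window] if not any(e["elm"])]
    return {"a": a, "b": b, "origin": origin}, done
```

Searching `rs[1:window]` front to back is what gives the earliest ready instruction priority, and the `origin` list is the kind of record that would allow coalesced lanes to be squashed on an interruption.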

For any lane that the LCU 904 may need to coalesce, the LCU 904 may prioritize an earlier instruction ahead of a later instruction according to the program order, i.e., for an ineffectual lane in the current instruction, a corresponding effectual lane from an earlier instruction (e.g., a first subsequent instruction (I₁)) may be given a higher priority over the corresponding lane of a later instruction (e.g., a second subsequent instruction (I₂)) if I₁ is older than I₂ according to the program order. Prioritizing instructions in this way may help simplify the interrupt-handling mechanisms in case an interruption occurs and the coalesced lanes from subsequent instructions need to be squashed.

Embodiments described in reference to the coalescing mechanism may be simple and cost effective since such a mechanism does not need expensive hardware support. The coalescing mechanism embodiments described herein may offer a low-cost approach based on one or more of the following. First, the combination window may be limited to a practical size rather than including all reservation station entries when searching for effectual lanes to coalesce. Second, the hardware support for implementation may be based on an MGU (e.g., MGU 906) comprising comparators to detect zeros and NOR gates to take both operands into account for binary operators. Third, the mechanism used in the LCU 904 may be implemented using priority-based selection hardware (to prioritize instructions in program order), which may have optimized and known implementations.

Embodiments described herein may provide for a vector processing unit (e.g., a sparsity-aware vector processing unit (sVPU) ) that coalesces vector instructions into fewer instructions by packing only effectual lanes from the original instructions.

In some embodiments, the sVPU 100 may consider only same-lane coalescing and avoid cross-lane coalescing, which requires expensive hardware components in the micro-architecture.

In some embodiments, the sVPU 100 may consider only instructions that accumulate to the same output vector register for coalescing. Limiting the instructions in such a way may simplify the implementation and avoid the otherwise necessary data-dependency checking hardware.

In some embodiments, an instruction stream may be marked using “stream guards” that may be implemented using ISA extension instructions or using a new control/status register (CSR) which may be written to indicate the start and end of the stream.

Embodiments described herein may target a wide set of operations or primitives that perform a stream of computations and reductions on sparse operands. Thus, a wide set of CPU instructions may be targeted for coalescing by sVPU 100, including vector dot, vector add, and vector multiply-accumulate.

While embodiments are described in the context of CPU vector unit, embodiments may be equivalently applicable to other commodity architectures such as graphics processing units (GPUs) and digital signal processors (DSPs) . These commodity architectures typically feature vector processing engines and their typical workloads are expected to have similar levels of sparsity in their input data. As such, embodiments described here may be applicable to such other commodity architectures.

FIG. 12 is a schematic diagram of an electronic device 1200 that may perform any or all of the operations of the methods and features explicitly or implicitly described herein, according to different embodiments of the present disclosure.

As shown, the electronic device 1200 may include a processor 1210, such as a Central Processing Unit (CPU) or specialized processors such as a Graphics Processing Unit (GPU) or other such processor unit, memory 1220, non-transitory mass storage 1230, input-output interface 1240, network interface 1250, and a transceiver 1260, all of which are communicatively coupled via bi-directional bus 1270. According to certain embodiments, any or all of the depicted elements may be utilized, or only a subset of the elements. Further, electronic device 1200 may contain multiple instances of certain elements, such as multiple processors, memories, or transceivers. Also, elements of the hardware device may be directly coupled to other elements without the bi-directional bus. Additionally, or alternatively to a processor and memory, other electronics, such as integrated circuits, may be employed for performing the required logical operations.

The memory 1220 may include any type of non-transitory memory such as static random-access memory (SRAM) , dynamic random-access memory (DRAM) , synchronous DRAM (SDRAM) , read-only memory (ROM) , any combination of such, or the like. The mass storage element 1230 may include any type of non-transitory storage device, such as a solid-state drive, hard disk drive, a magnetic disk drive, an optical disk drive, USB drive, or any computer program product configured to store data and machine executable program code. According to certain embodiments, the memory 1220 or mass storage 1230 may have recorded thereon statements and instructions executable by the processor 1210 for performing any of the aforementioned method operations described above.

Embodiments of the present invention can be implemented using electronics hardware, software, or a combination thereof. In some embodiments, the invention is implemented by one or multiple computer processors executing program instructions stored in memory. In some embodiments, the invention is implemented partially or fully in hardware, for example using one or more field programmable gate arrays (FPGAs) or application specific integrated circuits (ASICs) to rapidly perform processing operations.

It will be appreciated that, although specific embodiments of the technology have been described herein for purposes of illustration, various modifications may be made without departing from the scope of the technology. The specification and drawings are, accordingly, to be regarded simply as an illustration of the invention as defined by the appended claims, and are contemplated to cover any and all modifications, variations, combinations or equivalents that fall within the scope of the present invention. In particular, it is within the scope of the technology to provide a computer program product or program element, or a program storage or memory device such as a magnetic or optical wire, tape or disc, or the like, for storing signals readable by a machine, for controlling the operation of a computer according to the method of the technology and/or to structure some or all of its components in accordance with the system of the technology.

Acts associated with the method described herein can be implemented as coded instructions in a computer program product. In other words, the computer program product is a computer-readable medium upon which software code is recorded to execute the method when the computer program product is loaded into memory and executed on the microprocessor of the wireless communication device.

Further, each operation of the method may be executed on any computing device, such as a personal computer, server, PDA, or the like and pursuant to one or more, or a part of one or more, program elements, modules or objects generated from any programming language, such as C++, Java, or the like. In addition, each operation, or a file or object or the like implementing each said operation, may be executed by special purpose hardware or a circuit module designed for that purpose.

Through the descriptions of the preceding embodiments, the present invention may be implemented by using hardware only or by using software and a necessary universal hardware platform. Based on such understandings, the technical solution of the present invention may be embodied in the form of a software product. The software product may be stored in a non-volatile or non-transitory storage medium, which can be a compact disc read-only memory (CD-ROM) , USB flash disk, or a removable hard disk. The software product includes a number of instructions that enable a computer device (personal computer, server, or network device) to execute the methods provided in the embodiments of the present invention. For example, such an execution may correspond to a simulation of the logical operations as described herein. The software product may additionally or alternatively include a number of instructions that enable a computer device to execute operations for configuring or programming a digital logic apparatus in accordance with embodiments of the present invention.

Although the present invention has been described with reference to specific features and embodiments thereof, it is evident that various modifications and combinations can be made thereto without departing from the invention. The specification and drawings are, accordingly, to be regarded simply as an illustration of the invention as defined by the appended claims, and are contemplated to cover any and all modifications, variations, combinations or equivalents that fall within the scope of the present invention.

Claims

A method comprising:

receiving a stream of vector instructions for processing, the stream of vector instructions comprising a plurality of vector instructions;

determining an ineffectual computation corresponding to a first lane of a first vector instruction of the stream of vector instructions;

determining an effectual computation corresponding to a second lane of a second vector instruction of the stream of vector instructions, wherein:

the second vector instruction is subsequent to the first vector instruction according to a processing order of the stream of vector instructions; and

the second lane of the second vector instruction and the first lane of the first vector instruction correspond to a same lane;

coalescing the second lane of the second vector instruction with the first lane of the first vector instruction; and

processing the stream of vector instructions.
The method of claim 1, wherein the determining an effectual computation is based on a combination window size indicating a number of vector instructions of the stream of vector instructions subsequent to the first vector instruction.
The method of claim 1 or 2, wherein the stream of vector instructions accumulates to one output register.
The method of any one of claims 1 to 3, wherein

the coalescing comprises replacing the ineffectual computation with the effectual computation; and

the processing the stream of vector instructions comprises: processing the first vector instruction comprising the effectual computation.
The method of any one of claims 1 to 4, wherein the ineffectual computation comprises a vector operand having a zero value.
The method of any one of claims 1 to 5, wherein the stream of vector instructions is indicated by a start-stream indicator and an end-stream indicator.
The method of any one of claims 1 to 6, further comprising:

determining a second effectual computation corresponding to a third lane of a third vector instruction of the stream of vector instructions, wherein:

the third vector instruction is subsequent to the second vector instruction according to the processing order of the stream of vector instructions; and

the third lane of the third vector instruction and the second lane of the second vector instruction correspond to a same lane; and

coalescing the third lane of the third vector instruction with the second lane of the second vector instruction.
The method of claim 7, wherein the processing the stream of vector instructions comprises: processing the second vector instruction comprising the second effectual computation.
The method of any one of claims 1 to 8, further comprising:

receiving a second stream of vector instructions for processing, wherein the second stream of vector instructions:

accumulates to a second output register; and

is indicated by a second start-stream indicator and a second end-stream indicator; and

processing the second stream of vector instructions.
The method of claim 9, wherein the processing the second stream of vector instructions is performed after processing the stream of vector instructions.
An apparatus comprising:

one or more mask generation units;

one or more lane processing units;

one or more lane coalescing units;

at least one processor; and

at least one machine-readable medium storing executable instructions which, when executed by the at least one processor, configure the apparatus for:

receiving a stream of vector instructions for processing, the stream of vector instructions comprising a plurality of vector instructions;

determining, via the one or more mask generation units, an ineffectual computation corresponding to a first lane of a first vector instruction of the stream of vector instructions;

determining, via the one or more mask generation units, an effectual computation corresponding to a second lane of a second vector instruction of the stream of vector instructions, wherein:

the second vector instruction is subsequent to the first vector instruction according to a processing order of the stream of vector instructions; and

the second lane of the second vector instruction and the first lane of the first vector instruction correspond to a same lane;

coalescing, via the one or more lane coalescing units, the second lane of the second vector instruction with the first lane of the first vector instruction; and

processing, via the one or more lane processing units, the stream of vector instructions, wherein each lane processing unit corresponds to a corresponding lane in the stream of vector instructions.
The apparatus of claim 11, wherein the determining an effectual computation is based on a combination window size indicating a number of vector instructions of the stream of vector instructions subsequent to the first vector instruction.
The apparatus of claim 11 or 12, wherein the stream of vector instructions accumulates to one output register.
The apparatus of any one of claims 11 to 13, wherein

the coalescing comprises replacing the ineffectual computation with the effectual computation; and

the processing the stream of vector instructions comprises: processing the first vector instruction comprising the effectual computation.
The apparatus of any one of claims 11 to 14, wherein the ineffectual computation comprises a vector operand having a zero value.
The apparatus of any one of claims 11 to 15, wherein the stream of vector instructions is indicated by a start-stream indicator and an end-stream indicator.
The apparatus of any one of claims 11 to 16, wherein the executable instructions, when executed by the at least one processor, further configure the apparatus for:

determining, via the one or more mask generation units, a second effectual computation corresponding to a third lane of a third vector instruction of the stream of vector instructions, wherein:

the third vector instruction is subsequent to the second vector instruction according to the processing order of the stream of vector instructions; and

the third lane of the third vector instruction and the second lane of the second vector instruction correspond to a same lane; and

coalescing, via the one or more lane coalescing units, the third lane of the third vector instruction with the second lane of the second vector instruction.
The apparatus of claim 17, wherein the processing comprises: processing the second vector instruction comprising the second effectual computation.
The apparatus of any one of claims 11 to 18, wherein the executable instructions, when executed by the at least one processor, further configure the apparatus for:

receiving a second stream of vector instructions for processing, wherein the second stream of vector instructions:

accumulates to a second output register; and

is indicated by a second start-stream indicator and a second end-stream indicator; and

processing, via the one or more lane processing units, the second stream of vector instructions.
The apparatus of claim 19, wherein the processing, via the one or more lane processing units, the second stream of vector instructions is performed after processing the stream of vector instructions.
A machine-readable medium storing executable instructions which, when executed by a processor, configure the processor to perform a method according to any one of claims 1 to 10.