CN116821576B - Method and apparatus for accelerating N: M sparse networks based on RISC-V - Google Patents

Method and apparatus for accelerating N: M sparse networks based on RISC-V

Info

Publication number
CN116821576B
Authority
CN
China
Prior art keywords
matrix
multiplication
processor
risc
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311091713.9A
Other languages
Chinese (zh)
Other versions
CN116821576A (en)
Inventor
梁华岳
姚安邦
张博为
张新欣
吕剑桥
吴向斌
郑珊珊
张森杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Intel China Research Center Co ltd
Original Assignee
Intel China Research Center Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Intel China Research Center Co ltd filed Critical Intel China Research Center Co ltd
Priority to CN202311091713.9A priority Critical patent/CN116821576B/en
Publication of CN116821576A publication Critical patent/CN116821576A/en
Application granted granted Critical
Publication of CN116821576B publication Critical patent/CN116821576B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/15Correlation function computation including computation of convolution operations
    • G06F17/153Multidimensional correlation or convolution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/544Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
    • G06F7/5443Sum of products
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Algebra (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Complex Calculations (AREA)

Abstract

The present application relates to methods and apparatus for RISC-V based acceleration of N:M sparse networks. The method comprises the following steps: obtaining an effective value matrix A' of a matrix A of the N:M sparse network, wherein the effective value matrix A' only comprises non-zero values of the matrix A; obtaining a matrix mask for indicating locations of non-zero values of matrix A, wherein each row of the matrix mask includes a plurality of location indicators to indicate locations of non-zero values of a corresponding row of matrix A; and performing convolution operations of the matrix A and the matrix B by using the effective value matrix A' and the matrix mask. The invention accelerates the N:M sparse network on RISC-V, thereby linearly reducing the convolution calculation cost in the N:M sparse network.

Description

Method and apparatus for accelerating N: M sparse networks based on RISC-V
Technical Field
The present application relates to the field of neural networks, and more particularly, to a method and apparatus for accelerating an N: M sparse network based on a reduced instruction set computer-V (RISC-V).
Background
To address the excessive scale of neural network models, sparse networks have been introduced to save computation and memory. Because hardware is inefficient at processing a random sparse network, it is difficult to obtain gains proportional to the sparsity; the N:M sparse network was therefore proposed, where N:M sparsity means that each set of M consecutive values includes N zero values. With this fixed parameter tuning mode, the hardware can take advantage of sparsity to achieve linear acceleration without losing model accuracy.
RISC-V is a brand-new open-source instruction set architecture based on the reduced instruction set principle, where the letter V carries two meanings: first, it denotes the fifth-generation instruction set architecture designed at Berkeley, counting from RISC-I; second, it stands for variation and vectors. RISC-V is a newer instruction set architecture that is widely used as a controller or accelerator in embedded devices, edge networks, data centers, clouds, and so on. RISC-V may also be used to accelerate artificial intelligence workloads.
Disclosure of Invention
The application provides a novel mechanism for accelerating the N:M sparse network, so that the calculation cost of convolution in the N:M sparse network is linearly reduced, and the linear acceleration of the N:M sparse network is obtained by using RISC-V.
According to an embodiment of the present disclosure, there is provided a method for accelerating an N:M sparse network based on RISC-V, the method comprising: obtaining an effective value matrix A' of a matrix A of the N:M sparse network, wherein the effective value matrix A' only comprises non-zero values of the matrix A; obtaining a matrix mask for indicating locations of non-zero values of the matrix A, wherein respective rows of the matrix mask include a plurality of location indicators indicating locations of non-zero values of respective rows of the matrix A; and performing a convolution operation of the matrix A and the matrix B using the effective value matrix A' and the matrix mask.
According to an embodiment of the present disclosure, there is provided an apparatus for accelerating an N:M sparse network based on RISC-V, the apparatus comprising: a memory, and at least one processor, the at least one processor to: obtain an effective value matrix A' of a matrix A of the N:M sparse network, wherein the effective value matrix A' only comprises non-zero values of the matrix A; obtain a matrix mask for indicating locations of non-zero values of the matrix A, wherein respective rows of the matrix mask include a plurality of location indicators indicating locations of non-zero values of respective rows of the matrix A; and perform a convolution operation of the matrix A and the matrix B using the effective value matrix A' and the matrix mask.
According to an embodiment of the present disclosure, there is provided a computer-readable storage medium having instructions stored thereon, wherein the instructions, when executed by a processor, cause the processor to perform the above-described method for accelerating an N: M sparse network based on RISC-V.
Drawings
Embodiments of the present disclosure will now be described, by way of example and not limitation, with reference to the figures of the accompanying drawings in which like reference numerals refer to similar elements and in which:
fig. 1 illustrates a block diagram of an example processor and/or SoC 100, which processor and/or SoC 100 may have one or more cores and an integrated memory controller.
FIG. 2 shows a schematic diagram of a comparison of a random sparse matrix and an N: M sparse matrix.
FIG. 3 illustrates an example of a conventional operation for performing convolution calculations using RISC-V vector instructions.
FIG. 4 illustrates an overall operational flow diagram for accelerating an N:M sparse network based on RISC-V according to an embodiment of the present disclosure.
Fig. 5 shows a schematic diagram of a structured sparse matrix storage format according to an embodiment of the present disclosure.
Fig. 6 shows a flow diagram of combining multiplication and addition operations in hardware according to an embodiment of the present disclosure.
FIG. 7 illustrates a flow chart of a method 700 of accelerating an N:M sparse network based on RISC-V in accordance with an embodiment of the present disclosure.
Fig. 8 is a block diagram illustrating components capable of reading instructions from a machine-readable or computer-readable medium (e.g., a non-transitory machine-readable storage medium) and performing any one or more of the methods discussed herein, according to some example embodiments.
Detailed Description
Features and exemplary embodiments of various aspects of the present application are described in detail below. In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced without some of these specific details. The following description of the embodiments is merely intended to provide a better understanding of the present application by showing an example of the present application. The present application is in no way limited to any particular configuration set forth below, but rather covers any modification, substitution, or improvement of elements, components, and algorithms without departing from the spirit of the present application. In the drawings and following description, well-known structures and techniques are not shown in order to avoid unnecessarily obscuring the present application.
Moreover, various operations will be described as multiple discrete operations in a manner that is most helpful in understanding the illustrative embodiments; however, the order of description should not be construed as to imply that these operations are necessarily order dependent. In particular, these operations need not be performed in the order of presentation.
The phrases "in an embodiment," "in one embodiment," and "in some embodiments" are repeated herein. The phrase generally does not refer to the same embodiment; however, it may refer to the same embodiment. The terms "comprising," "having," and "including" are synonymous, unless the context dictates otherwise. The phrases "A or B" and "A/B" mean "(A), (B) or (A and B)".
Fig. 1 illustrates a block diagram of an example processor and/or SoC 100, which processor and/or SoC 100 may have one or more cores and an integrated memory controller. The processor 100 illustrated in solid line boxes has a single core 102 (a), a system agent unit circuit 110, and a set of one or more interface controller unit circuits 116, while the optionally added dashed line boxes illustrate the alternative processor 100 as having a plurality of cores 102 (a) - (N), a set of one or more integrated memory control unit circuits 114 in the system agent unit circuit 110, dedicated logic 108, and a set of one or more interface controller unit circuits 116.
Different implementations of the processor 100 may include: 1) A CPU, wherein the dedicated logic 108 is integrated graphics and/or scientific (throughput) logic (may include one or more cores, not shown), the cores 102 (a) - (N) are one or more general-purpose cores (e.g., general-purpose ordered cores, general-purpose out-of-order cores, or a combination of both); 2) Coprocessors in which cores 102 (a) - (N) are a large number of specialized cores primarily for graphics and/or scientific (throughput) purposes; and 3) a coprocessor, wherein cores 102 (A) - (N) are a number of general purpose ordered cores. Thus, processor 100 may be a general purpose processor, a coprocessor or special-purpose processor, such as, for example, a network or communication processor, compression engine, graphics processor, GPGPU (general purpose graphics processing unit), a high-throughput integrated many-core (MIC) coprocessor (including 30 or more cores), embedded processor, or the like. The processor may be implemented on one or more chips. The processor 100 may be part of one or more substrates and/or may be implemented on one or more substrates using any of a variety of process technologies, such as complementary metal oxide semiconductor (complementary metal oxide semiconductor, CMOS), bipolar CMOS (BiCMOS), P-type metal oxide semiconductor (P-type metal oxide semiconductor, PMOS), or N-type metal oxide semiconductor (N-type metal oxide semiconductor, NMOS).
The memory hierarchy includes one or more levels of cache cell circuitry 104 (a) - (N) within cores 102 (a) - (N), a set of one or more shared cache cell circuitry 106, and an external memory (not shown) coupled to the set of integrated memory controller cell circuitry 114. The set of one or more shared cache unit circuits 106 may include one or more intermediate level caches, such as level 2 (L2), level 3 (L3), level 4 (L4), or other levels of cache, such as a last level cache (LLC), and/or combinations of these. While in some examples the interface network circuitry 112 (e.g., a ring interconnect) provides an interface to the dedicated logic 108 (e.g., integrated graphics logic), the set of shared cache unit circuitry 106, and the system agent unit circuitry 110, alternative examples use any number of well-known techniques to provide an interface to these units. In some examples, coherency is maintained between one or more of the shared cache unit circuits 106 and cores 102 (a) - (N). In some examples, the interface controller unit circuitry 116 couples these cores to one or more other devices 118, such as one or more I/O devices, storage, one or more communication devices (e.g., wireless network, wired network, etc.), and so forth.
In some examples, one or more of cores 102 (A) - (N) have multi-threading capabilities. System agent unit circuitry 110 includes those components that coordinate and operate cores 102 (A) - (N). The system agent unit circuit 110 may include, for example, a power control unit (power control unit, PCU) circuit and/or a display unit circuit (not shown). The PCU may be (or may include) logic and components required to adjust the power states of cores 102 (a) - (N) and/or dedicated logic 108 (e.g., integrated graphics logic). The display element circuit is used to drive one or more externally connected displays.
Cores 102 (a) - (N) may be homogenous in terms of instruction set architecture (instruction set architecture, ISA). Alternatively, cores 102 (A) - (N) may also be heterogeneous with respect to the ISA; that is, a subset of cores 102 (a) - (N) may be capable of executing one ISA, while other cores may be capable of executing only a subset of that ISA or capable of executing another ISA. Processor cores 102 (a) - (N) may employ a RISC-V instruction set architecture in whole or in part.
At present, the RISC-V instruction set architecture is used only to accelerate general artificial intelligence workloads, and how RISC-V processors could accelerate N:M sparse networks has not been considered. Sparsity acceleration for fine-tuning workflows is implemented only on specific GPUs with specific instructions and hardware; for example, researchers have proposed that a network can be tuned to have 2:4 sparsity and the tuned network then mapped to specific tensor cores in the GPU. However, RISC-V processors cannot use these instructions and hardware. The present invention provides a design for accelerating an N:M sparse network based on RISC-V, so that linear acceleration can be obtained by exploiting sparsity.
The present invention aims to obtain linear acceleration of an N:M sparse network using the RISC-V instruction set architecture. To achieve this objective, the present invention accelerates the N:M sparse network on RISC-V by (1) structuring the sparse matrix storage format (converting it into valid values and position indicators); (2) reordering the vector instruction sequence that performs the convolution operation to obtain an instruction sequence in a fixed format (a hardware-trigger mode); and (3) combining the multiplication and/or summation operations in the convolution operation (computing only valid values), thereby linearly reducing the computation cost of convolution in the N:M sparse network.
Matrix sparsity includes random sparsity and N:M sparsity. Random sparsity means that the location and number of non-zero values in the matrix are not fixed, while N:M sparsity means that each set of M consecutive values contains N zero values; for example, 2:4 sparsity means that each set of 4 consecutive values includes 2 zero values, and 3:5 sparsity means that each set of 5 consecutive values includes 3 zero values. Thus, for an N:M sparse network, an efficiency boost of M/(M-N) can theoretically be obtained. FIG. 2 shows a schematic diagram of a comparison of a random sparse matrix and an N:M sparse matrix, with a 2:4 sparse matrix specifically shown. As shown in FIG. 2, for matrix A (m, k), when it has random sparsity, the number and positions of the zero values among the k values of each row are not fixed; when it has, for example, 2:4 sparsity, the matrix is divided into a plurality of groups by columns, for example 2 groups, wherein each row in each group includes 4 values, and 2 of those 4 values are zero values.
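As a minimal illustration of this convention, the following Python sketch (not part of the patent; the matrix values and helper name are assumptions made here) checks that every group of M consecutive values in a row contains exactly N zero values, and prints the theoretical M/(M-N) gain:

import numpy as np

def has_nm_sparsity(matrix, n, m):
    # Check that every group of m consecutive values in each row holds exactly n zeros.
    cols = matrix.shape[1]
    assert cols % m == 0
    for row in matrix:
        for g in range(0, cols, m):
            if np.count_nonzero(row[g:g + m] == 0) != n:
                return False
    return True

a = np.array([[0, 3, 0, 7, 1, 0, 0, 5],
              [2, 0, 4, 0, 0, 6, 8, 0]], dtype=np.int32)   # 2:4 sparse example
print(has_nm_sparsity(a, n=2, m=4))            # True
print("theoretical speedup:", 4 / (4 - 2))     # 2.0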
FIG. 3 illustrates an example of a conventional operation for performing a convolution operation using RISC-V vector instructions. In the example shown in FIG. 3, matrix A is a 2:4 sparse matrix, and matrix A (m, k) is convolved with matrix B (k, n) to form matrix C (m, n). As shown in FIG. 3, in the conventional convolution operation, row 1 of matrix A and column 1 of matrix B undergo a multiply-add operation to obtain row 1, column 1 of matrix C; then row 1 of matrix A and column 2 of matrix B undergo a multiply-add operation to obtain row 1, column 2 of matrix C, and so on. The specific RISC-V vector instruction sequence for performing the multiply-add operations of row 1 of matrix A with columns 1 and 2 of matrix B is as follows:
vmul.vv v4, v1, v2, v0.t
vredsum.vs v6, v5, v4, v0.t # the value in v5 is 0
vmul.vv v4, v1, v3, v0.t
vredsum.vs v7, v5, v4, v0.t
wherein vmul.vv is a vector multiply instruction, vredsum.vs is a vector reduction-sum instruction, register v1 is used to load the values of each row of matrix A, registers v2 and v3 are used to load the values of each column of matrix B, respectively, register v0 is used to load the position indicators as mask bits, register v4 is used to store the result of the multiplication operation, register v5 is used for the summation operation with all of its values being 0, and registers v6 and v7 are used to store the results of the corresponding summation operations, respectively.
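To make the semantics of this masked sequence concrete, the following is a rough Python model of what one masked vmul.vv/vredsum.vs pair computes for a single output element (the helper names and values are assumptions for illustration, not RVV intrinsics):

import numpy as np

def vmul_vv(v_src1, v_src2, v_mask):
    # Element-wise multiply; lanes whose mask bit is 0 contribute nothing.
    return np.where(v_mask, v_src1 * v_src2, 0)

def vredsum_vs(acc, v_src, v_mask):
    # Reduce-sum the active lanes into a scalar, starting from acc (v5 = 0).
    return acc + int(np.sum(np.where(v_mask, v_src, 0)))

v1 = np.array([5, 0, 7, 0])              # one row of matrix A (zeros kept in place)
v2 = np.array([1, 2, 3, 4])              # column 1 of matrix B
v0 = np.array([1, 0, 1, 0], dtype=bool)  # position indicators (mask bits)

v4 = vmul_vv(v1, v2, v0)                 # vmul.vv    v4, v1, v2, v0.t
c_11 = vredsum_vs(0, v4, v0)             # vredsum.vs v6, v5, v4, v0.t
print(c_11)                              # 5*1 + 7*3 = 26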
Embodiments of the present disclosure provide a generic modular design based on RISC-V for accelerating N: M sparse networks. FIG. 4 illustrates an overall operational flow diagram for accelerating an N:M sparse network based on RISC-V according to an embodiment of the present disclosure. Specifically, as shown in FIG. 4, the design has three interdependent operations, including: 402, structuring a storage format of a sparse matrix; 404, reordering the vector instruction sequence for performing sparse matrix convolution operation to obtain an instruction sequence in a fixed format; 406, combining the multiplication and summation operations in the sparse matrix convolution operation.
As shown in fig. 4, at 402, the sparse matrix A is structured into a structured sparse matrix storage format that stores only the non-zero valid values of the sparse matrix and a position indicator indicating the corresponding positions of the non-zero valid values. The position indicator has the same size as the sparse matrix; a 1 indicates that the corresponding position of the sparse matrix holds a non-zero value, and a 0 indicates that it holds a zero value. For example, in the example of fig. 4, the position indicator indicating the non-zero values of a single row of matrix A is an 8-bit indicator. Since only the non-zero valid values in the sparse matrix A will be used in the convolution operation, only the values in matrix B corresponding to the non-zero valid values in matrix A are the valid data required by the convolution operation; the position indicator can therefore be used to select, from the matrix B with which the convolution operation is performed, the corresponding values for the convolution operation.
A schematic diagram of a structured sparse matrix storage format according to an embodiment of the present disclosure is shown in fig. 5. Storage can be saved by using the structured sparse matrix storage format. The example shown in fig. 5 is a 2:4 sparse matrix: the size of the sparse matrix A is m x k, the size of the structured sparse matrix A' is m x k/2, and the position indicator is used as a matrix mask whose size is m x k, as shown in fig. 5.
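A small Python sketch of this conversion (the helper name and example values are assumptions, not from the patent) packs the non-zero values of a 2:4 sparse matrix into A' of size m x k/2 and keeps a 0/1 mask of size m x k:

import numpy as np

def to_structured_format(a, n=2, m=4):
    # Matrix mask: 1 marks a non-zero position of A, 0 marks a zero position.
    mask = (a != 0).astype(np.uint8)
    valid_per_row = a.shape[1] * (m - n) // m          # k/2 for 2:4 sparsity
    a_prime = np.zeros((a.shape[0], valid_per_row), dtype=a.dtype)
    for i, row in enumerate(a):
        a_prime[i, :] = row[row != 0]                  # pack the non-zero valid values
    return a_prime, mask

a = np.array([[0, 3, 0, 7, 1, 0, 0, 5],
              [2, 0, 4, 0, 0, 6, 8, 0]], dtype=np.int32)
a_prime, mask = to_structured_format(a)
print(a_prime)   # shape (2, 4): the effective value matrix A'
print(mask)      # shape (2, 8): the matrix mask of position indicators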
Referring back to fig. 4, after structuring the sparse matrix A into the structured sparse matrix storage format, at 404 the vector instruction sequence performing the sparse matrix convolution operation is reordered to obtain a fixed-format instruction sequence. Specifically, the RISC-V vector instructions for performing the multiplication and addition operations of the convolution operation are reordered so that instructions for multiplication operations to be combined are placed together and instructions for summation operations to be combined are placed together. For example, the RISC-V vector instructions that perform the multiply-add operations of row 1 of matrix A with columns 1 and 2 of matrix B as shown in FIG. 3 may be reordered into the following fixed-format instruction sequence:
vmul.vv v4, v1, v2, v0.t
vmul.vv v4, v1, v3, v0.t
vredsum.vs v6, v5, v4, v0.t
vredsum.vs v7, v5, v4, v0.t
where the valid values of matrix A are loaded into v1, the values of matrix B are loaded column by column into v2 and v3, and the position indicators are loaded into v0 as mask bits. With the reordered fixed-format instruction sequence, the multiple multiplication operations and multiple summation operations to be combined in subsequent computations are placed together, because the hardware can detect successive multiplication instructions that use the same source 1 (v1), mask (v0), and destination (v4) as a hint to merge multiple operations into one operation according to the indicator bits in v0. The summation instructions are reordered similarly.
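The merge hint can be pictured with a small decoder-side sketch in Python (a hypothetical illustration; the tuple layout and function name are assumptions made here): consecutive instructions of the same opcode that share the source 1, mask, and destination registers are grouped as candidates for fusion:

def find_merge_groups(instructions):
    # instructions: list of tuples (opcode, dest, src1, src2, mask)
    groups, current = [], []
    for ins in instructions:
        prev = current[-1] if current else None
        same = prev and ins[0] == prev[0] and ins[1] == prev[1] \
                    and ins[2] == prev[2] and ins[4] == prev[4]
        if same:
            current.append(ins)
        else:
            if current:
                groups.append(current)
            current = [ins]
    if current:
        groups.append(current)
    return groups

seq = [("vmul.vv",    "v4", "v1", "v2", "v0"),
       ("vmul.vv",    "v4", "v1", "v3", "v0"),
       ("vredsum.vs", "v6", "v5", "v4", "v0"),
       ("vredsum.vs", "v7", "v5", "v4", "v0")]
for group in find_merge_groups(seq):
    print(len(group), group[0][0])   # 2 vmul.vv (mergeable), then 1 vredsum.vs twice

In this simplified sketch only the multiplication pair is grouped; the two reduction instructions write different destinations (v6, v7) and would be merged in hardware under their own criterion, as noted above for the summation instructions.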
At 406, the combined multiply and sum operations are used: a plurality of successive multiplication and/or summation operations are combined in one multiplication and/or summation unit so as to be implemented as one operation. For example, as shown in fig. 4, two consecutive multiplication operations are combined to be implemented as one operation in a single multiplication unit: because no zero values take part in the calculation, the multiplication unit has free calculation data lanes that can be used to calculate the values of the subsequent operation.
Fig. 6 shows a flow diagram of combining multiplication and summation operations in hardware according to an embodiment of the disclosure. As shown in fig. 6, the hardware repeats v1 (e.g., the valid values of row 1 of matrix A) multiple times (e.g., twice) as the first operand of the multiplication operation, and selects operands from v2 and v3 (e.g., columns 1 and 2 of matrix B) according to the position indicators in the matrix mask to form the second operand of the multiplication operation. For example, in the example shown in fig. 6, the operation data selected from v2 and v3 are placed in the first half and the second half of the second operand of the multiplication operation, respectively; at this point the vm bits are all filled with 1, which means that all values in the multiplication unit are valid. A plurality of successive multiplication operations are then completed in one multiplication unit; for example, in the example shown in fig. 6, the two multiplication operations of row 1 of matrix A with columns 1 and 2 of matrix B are completed in one multiplication unit. In addition, a plurality of summation operations are also combined into one operation; for example, as shown in fig. 6, the results of the two multiplication operations are placed in the first half and the second half of the summation unit, respectively, whereby a plurality of successive summation operations are completed in one summation unit.
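The following Python sketch mirrors fig. 6 numerically (an illustration with assumed 8-lane units and example values, not the hardware implementation): the valid values of one row of A are repeated twice as operand 1, the matrix mask selects the matching values of columns 1 and 2 of B as operand 2, one combined multiplication runs with all vm bits set, and the two halves of the product vector are reduced separately:

import numpy as np

row_a  = np.array([0, 3, 0, 7, 1, 0, 0, 5])        # one row of a 2:4 sparse matrix A
col_b1 = np.arange(1, 9)                           # column 1 of matrix B
col_b2 = np.arange(11, 19)                         # column 2 of matrix B
mask   = row_a != 0                                # position indicators for this row

valid_a  = row_a[mask]                             # the 4 valid values (from A')
operand1 = np.concatenate([valid_a, valid_a])      # v1 repeated twice
operand2 = np.concatenate([col_b1[mask], col_b2[mask]])  # values selected from v2, v3
product  = operand1 * operand2                     # one combined multiplication, vm all 1

half = len(valid_a)
c_11 = product[:half].sum()                        # first half of the summation unit
c_12 = product[half:].sum()                        # second half of the summation unit
print(c_11, c_12)                                  # 79 239
print(row_a @ col_b1, row_a @ col_b2)              # reference results: 79 239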
It should be appreciated that, while the present disclosure discusses the convolution calculation of a 2:4 sparse matrix as an example, the RISC-V-based method for accelerating an N:M sparse network provided by the present disclosure may be used for sparse networks with any other sparsity (e.g., 3:5).
In one example, the effective value matrix and matrix mask may be similarly obtained for a 3:5 sparse matrix, for example, and multiple multiplication and summation operations in a convolution calculation are combined according to the effective value matrix and matrix mask based on the data bit width of the calculation unit.
In one example, when the data bit width of the computation unit is 5, a 3:5 sparse matrix yields only 2 valid operands per computation instruction. To achieve computation acceleration, i.e., to fully utilize the computing resources in a single computation, 5 valid operands may be gathered from three consecutive computation instructions to fill the 5-wide computation unit and be completed as one operation, while the remaining valid operand of the third instruction, together with the 4 valid operands of the subsequent two computation instructions, refills the 5-wide computation unit and is completed as the next operation. In another example, 4 valid operands may be gathered from two consecutive computation instructions and completed as a single operation.
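A sketch of this packing in Python (a hypothetical illustration; the function name and operand values are assumptions) flattens the 2 valid operands of each instruction into 5-wide units across instruction boundaries, so five instructions yield two full operations instead of five:

def pack_valid_operands(groups, unit_width=5):
    # groups: the valid operands of each computation instruction (2 each for 3:5 sparsity)
    packed, current = [], []
    for group in groups:
        for operand in group:
            current.append(operand)
            if len(current) == unit_width:
                packed.append(current)
                current = []
    if current:
        packed.append(current)        # a final, partially filled unit
    return packed

groups = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]   # 5 instructions, 2 valid operands each
print(pack_valid_operands(groups))
# [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]] -> two combined operations instead of five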
FIG. 7 illustrates a flow chart of a method 700 of accelerating an N:M sparse network based on RISC-V in accordance with an embodiment of the present disclosure. The method 700 may be performed by a processor supporting RISC-V. The method 700 may include steps S702, S704 and S706. However, in some embodiments, method 700 may include more or less distinct steps, and the present disclosure is not limited thereto.
In step S702, an effective value matrix A' of matrix A of the N:M sparse network is obtained, wherein the effective value matrix A' includes only non-zero values of matrix A. In one embodiment, the sparsity of the N:M sparse network is 2:4. In another embodiment, the sparsity of the N:M sparse network is 3:5. In other embodiments, the sparsity of the N:M sparse network may take any other ratio, and the disclosure is not limited thereto.
In step S704, a matrix mask is obtained for indicating the locations of non-zero values of matrix a, wherein each row of the matrix mask includes a plurality of location indicators to indicate the locations of the non-zero values of the corresponding row of matrix a. The matrix mask is the same size as matrix a and in the matrix mask, 1 indicates that the corresponding position in matrix a is a non-zero value and 0 indicates that the corresponding position in matrix a is a zero value.
In step S706, a convolution operation of the matrix a and the matrix B is performed using the effective value matrix a' and the matrix mask. Step S706 may further include: selecting each row of the effective value matrix A' as a first operand of convolution multiplication of convolution operation; and selecting the second operand of the convolution multiplication column by column from the matrix B according to the matrix mask.
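Putting steps S702 to S706 together, the following self-contained Python sketch (assumed helper names and example values, not the patent's implementation) builds the effective values and the matrix mask from A and uses them to perform the multiply-add with mask-selected values of B; the result matches the dense computation:

import numpy as np

def sparse_convolve(a, b):
    mask = a != 0                                      # S704: the matrix mask
    m = a.shape[0]
    c = np.zeros((m, b.shape[1]), dtype=a.dtype)
    for i in range(m):
        a_prime_row = a[i, mask[i]]                    # S702: row of the effective value matrix A'
        for j in range(b.shape[1]):
            b_selected = b[mask[i], j]                 # S706: values of B selected by the mask
            c[i, j] = np.dot(a_prime_row, b_selected)  # multiply-add over valid values only
    return c

a = np.array([[0, 3, 0, 7],
              [2, 0, 4, 0]])                           # 2:4 sparse matrix A
b = np.arange(1, 13).reshape(4, 3)                     # matrix B
print(np.array_equal(sparse_convolve(a, b), a @ b))    # True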
In embodiments of the present disclosure, two or more consecutive single multiplication calculations from a convolution operation are combined into one multiplication calculation.
In embodiments of the present disclosure, multiple successive single summation operations from a convolution operation may be combined into a single summation operation.
In embodiments of the present disclosure, since the zero values in matrix A do not participate in the convolution operation, the free multipliers in a multiplication unit used to perform the convolution operation of matrix A may be used to perform one or more subsequent single multiplication calculations.
In embodiments of the present disclosure, since the zero values in matrix A do not participate in the convolution operation, the free multipliers in a multiplication unit used to perform a single multiplication calculation of the convolution operation of matrix A may be used to perform the calculation of a portion of the effective values of a single multiplication calculation subsequent to that single multiplication calculation.
In embodiments of the present disclosure, multiple multiplication operations of a single row in matrix a and multiple columns in matrix B may be combined into one operation to be implemented in a multiplication unit.
In embodiments of the present disclosure, multiple multiplication computations of multiple rows in matrix a with multiple columns in matrix B may be combined into one operation in a multiplication unit.
In an embodiment of the present disclosure, the number of multiplication and/or summation computations to be combined is determined according to the data bit width of the computation unit.
In an embodiment of the present disclosure, the number of effective values that satisfy the data bit width of the computation unit is determined based on the data bit width of the computation unit, thereby determining the number of multiplication and/or summation computations to be combined.
In embodiments of the present disclosure, the multiplication and/or summation calculations to be combined in the calculation unit may comprise a partial effective value for a single multiplication and/or summation calculation.
In an embodiment of the present disclosure, the calculation of the partial valid values of the individual rows of matrix a may be combined in the calculation unit.
In one example, in a multiplication unit, non-zero values in a single row of matrix a are repeated a number of times in order as a first operand for a multiplication operation; and selecting, from the matrix B, in order from column to column, a plurality of columns of valid values as the second operand of the multiplication operation according to the position indicator in the matrix mask corresponding to the row of the matrix a, wherein the plurality of columns of valid values are values in each column for multiplication with non-zero values of the row of the matrix a.
In an embodiment of the present disclosure, a plurality of consecutive single summation computations in a convolution operation are combined into one summation operation in a summation unit for performing the convolution operation.
In an embodiment of the present disclosure, RISC-V vector instructions for performing convolution operations are reordered to trigger hardware-optimized convolution operations. In an embodiment of the present disclosure, the RISC-V vector instructions for performing convolution operations are reordered so as to instruct the RISC-V processor to combine a plurality of consecutive single multiplication and/or summation calculations in a convolution operation of matrices A and B into a single multiplication and/or summation calculation.
In an embodiment of the present disclosure, the RISC-V vector instructions for performing convolution operations are reordered so as to instruct the RISC-V processor to combine a plurality of consecutive single multiplication calculations and/or a plurality of consecutive single summation calculations of one or more rows in matrix A with the corresponding columns in matrix B into a single multiplication and/or summation calculation.
In one example, in a multiplication unit for performing a convolution operation, two single multiplication computations of a single row in matrix a and two columns in matrix B are combined into one multiplication computation.
In one example, in a multiplication unit, the non-zero values of a single row of matrix a are repeated 2 times as the first operand for the multiplication calculation; and selecting, from the matrix B, valid values of two columns in order from column to column as the second operand of the multiplication according to the position indicators in the matrix mask corresponding to the individual rows of the matrix a, wherein the valid values are values in each column for multiplication with non-zero values of the rows of the matrix a.
The disclosed embodiments eliminate all zero-value-dependent computations in hardware, thus allowing the theoretical efficiency of convolution to be improved by a factor of M/(M-N). The efficiency improvement of the present disclosure was evaluated experimentally with 2:4 sparsity on a convolution kernel of ResNet-32 for CIFAR-10 (A(100, 100), W(3, 3)). The experimental results are shown in Table 1.
TABLE 1
As can be seen from table 1, the method proposed by the present disclosure has a significant reduction in the amount of computational resource usage in both MAC operation and overall operation compared to the original method.
Fig. 8 is a block diagram illustrating components capable of reading instructions from a machine-readable or computer-readable medium (e.g., a non-transitory machine-readable storage medium) and performing any one or more of the methods discussed herein, according to some example embodiments. In particular, fig. 8 shows a schematic diagram of a hardware resource 800, the hardware resource 800 comprising one or more processors (or processor cores) 810, one or more memory/storage devices 820, and one or more communication resources 830, wherein each of these processors, memory/storage devices, and communication resources may be communicatively coupled via a bus 840 or other interface circuitry. For embodiments that utilize node virtualization, such as Network Function Virtualization (NFV), the hypervisor 802 may be executed to provide an execution environment for one or more network slices/sub-slices to utilize the hardware resources 800.
Processor 810 may include, for example, a processor 812 and a processor 814. The processor 810 may be, for example, a Central Processing Unit (CPU), a Reduced Instruction Set Computing (RISC) processor, a Complex Instruction Set Computing (CISC) processor, a Graphics Processing Unit (GPU), a Digital Signal Processor (DSP) such as a baseband processor, an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), a Radio Frequency Integrated Circuit (RFIC), another processor (including those discussed herein), or any suitable combination thereof.
Memory/storage 820 may include main memory, magnetic disk storage, or any suitable combination thereof. Memory/storage 820 may include, but is not limited to, any type of volatile, nonvolatile, or semi-volatile memory such as dynamic random access memory (DRAM), static random access memory (SRAM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), flash memory, solid state memory, and the like.
Communication resources 830 may include interconnection or network interface controllers, components, or other suitable devices to communicate with one or more peripheral devices 804 or one or more databases 806 or other network elements via network 808. For example, the communication resources 830 may include wired communication components (e.g., for coupling via USB, ethernet, etc.), cellular communication components, near Field Communication (NFC) components, bluetooth (or Bluetooth (r) low energy) components, wi-Fi components, and other communication components.
The instructions 850 may include software, programs, applications, applets, apps, or other executable code for causing at least any one of the processors 810 to perform any one or more of the methods discussed herein. The instructions 850 may reside, completely or partially, within at least one of the processor 810 (e.g., in a cache of the processor), the memory/storage device 820, or any suitable combination thereof. Further, any portion of the instructions 850 may be transferred from any combination of the peripheral 804 or the database 806 to the hardware resource 800. Accordingly, the memory of processor 810, memory/storage 820, peripherals 804, and database 806 are examples of computer-readable and machine-readable media.
Some examples may be implemented with or as an article of manufacture or at least one computer readable medium. The computer readable medium may include a non-transitory storage medium to store logic. In some examples, the non-transitory storage medium may include one or more types of computer-readable storage media capable of storing electronic data, including volatile memory or non-volatile memory, removable or non-removable memory, erasable or non-erasable memory, writeable or re-writeable memory, and so forth. In some examples, logic may include various software elements, such as software components, programs, applications, computer programs, application programs, system programs, machine programs, operating system software, middleware, firmware, software modules, routines, subroutines, functions, methods, procedures, software interfaces, APIs, instruction sets, computing code, computer code, code segments, computer code segments, words, values, symbols, or any combination of these.
According to some examples, a computer-readable medium may include a non-transitory storage medium to store or maintain instructions that, when executed by a machine, computing device, or system, cause the machine, computing device, or system to perform methods and/or operations in accordance with the described examples. The instructions may include any suitable type of code, such as source code, compiled code, interpreted code, executable code, static code, dynamic code, and the like. The instructions may be implemented according to a predetermined computer language, manner or syntax, for instructing a machine, computing device, or system to perform a certain function. The instructions may be implemented using any suitable high-level, low-level, object-oriented, visual, compiled and/or interpreted programming language.
One or more aspects of at least one example may be implemented by representative instructions stored on at least one machine-readable medium which represent various logic within a processor, which when read by a machine, computing device, or system, cause the machine, computing device, or system to fabricate logic to perform the techniques described herein. Such a representation, referred to as an "IP core," may be stored on a tangible machine readable medium and provided to various customers or manufacturing facilities for loading into the production machine that actually produces the logic or processor.
The appearances of the phrase "one example" or "an example" are not necessarily all referring to the same example or embodiment. Any aspect described herein may be combined with any other aspect or similar aspect described herein, whether or not the aspects are described with respect to the same drawing or element. The division, omission, or inclusion of block functions depicted in the accompanying figures does not imply that the hardware components, circuits, software, and/or elements for implementing these functions would necessarily be divided, omitted, or included in embodiments.
Some examples may be described using the expression "coupled" and "connected" along with their derivatives. These terms are not necessarily intended as synonyms for each other. For example, a description using the terms "connected" and/or "coupled" may indicate that two or more elements are in direct physical or electrical contact with each other. However, the term "coupled" may also mean that two or more elements are not in direct contact with each other, but yet still co-operate or interact with each other.
The terms "first," "second," and the like, herein do not denote any order, quantity, or importance, but rather are used to distinguish one element from another. The terms "a" and "an" herein do not denote a limitation of quantity, but rather denote the presence of at least one of the referenced item. The term "assert" as used herein with reference to a signal refers to a state of the signal in which the signal is active, and which can be achieved by applying any logic level (whether a logic 0 or a logic 1) to the signal. The term "subsequently" or "after" may refer to immediately following or following some other event or events. Other sequences of steps may also be performed according to alternative embodiments. Furthermore, depending on the particular application, additional steps may be added or removed. Any combination of the variations may be used, and many variations, modifications, and alternative embodiments thereof will be understood by those of ordinary skill in the art having the benefit of this disclosure.
Unless specifically stated otherwise, disjunctive language such as the phrase "at least one of X, Y or Z" is understood within the context to generally recite an item, term, etc. may be X, Y or Z, or any combination thereof (e.g., X, Y and/or Z). Thus, such disjunctive language is generally not intended nor should it be implied that certain embodiments require the presence of each of at least one X, at least one Y, or at least one Z. Furthermore, unless specifically stated otherwise, a connectivity language such as the phrase "at least one of X, Y and Z" should also be understood to refer to X, Y, Z or any combination thereof, including "X, Y and/or Z".
Illustrative examples of the devices, systems, and methods disclosed herein are provided below. Embodiments of the devices, systems, and methods may include any one or more of the examples described below, as well as any combination thereof.

Claims (11)

1. A method for image processing performed by a RISC-V enabled processor, comprising:
obtaining an effective value matrix a 'of a matrix a having an N: M sparsity from an image data input, wherein the effective value matrix a' includes only non-zero values of the matrix a;
obtaining a matrix mask for indicating locations of non-zero values of the matrix a, wherein respective rows of the matrix mask include a plurality of location indicators indicating locations of non-zero values of respective rows of the matrix a; and
performing a convolution operation of the matrix a and the matrix B using the effective value matrix a' and the matrix mask,
wherein performing a convolution operation of the matrix a and the matrix B using the valid value matrix a' and the matrix mask comprises:
selecting each row of the effective value matrix A' as a first operand of a convolution multiplication of the convolution operation;
a second operand of the convolution multiplication is selected column by column from the matrix B according to the matrix mask,
the free multipliers in the multiplication units used to perform the single-multiply computations of the convolution operation are used to perform one or more single-multiply computations subsequent to the single-multiply computation.
2. The method of claim 1, further comprising:
a plurality of successive single multiplication and/or summation calculations from the convolution operation are combined into one multiplication and/or summation calculation.
3. The method of claim 1, further comprising:
the multiplication calculations of one or more rows in the matrix a and corresponding columns in the matrix B are combined into one calculation in a multiplication unit.
4. A method according to claim 3, further comprising:
in the multiplication unit, non-zero values in a single row of the matrix a are repeated a plurality of times in order as a first operand for the multiplication calculation; and is also provided with
According to the position indicators corresponding to the rows of the matrix A in the matrix mask, valid values of a plurality of columns are sequentially selected from the matrix B column by column to serve as second operands of multiplication calculation, wherein the valid values of the plurality of columns are values used for multiplication with non-zero values of the rows of the matrix A in each column.
5. The method of claim 1, further comprising:
the RISC-V vector instructions used to perform the convolution operations are reordered to trigger hardware optimization of the convolution operations.
6. The method of claim 5, wherein the RISC-V vector instructions for performing convolution operations are reordered to: the RISC-V processor is instructed to combine two or more consecutive single multiplication and/or summation calculations in a convolution operation of the matrix a and the matrix B into one multiplication and/or summation calculation.
7. A RISC-V based processor for image processing, comprising:
at least one processor core, the at least one processor core to:
obtaining an effective value matrix a 'of a matrix a having an N: M sparsity from an image data input, wherein the effective value matrix a' includes only non-zero values of the matrix a;
obtaining a matrix mask for indicating locations of non-zero values of the matrix a, wherein respective rows of the matrix mask include a plurality of location indicators indicating locations of non-zero values of respective rows of the matrix a; and
performing a convolution operation of the matrix a and the matrix B using the effective value matrix a' and the matrix mask,
selecting each row of the effective value matrix A' as a first operand of a convolution multiplication of the convolution operation;
selecting a second operand of the convolution multiplication column by column from the matrix B according to the matrix mask; and
the free multipliers in the multiplication units used to perform the single-multiply computations of the convolution operation are used to perform one or more single-multiply computations subsequent to the single-multiply computation.
8. The processor of claim 7, the at least one processor core further to:
a plurality of successive single multiplication and/or summation calculations from the convolution operation are combined into one multiplication and/or summation calculation.
9. The processor of claim 7, the at least one processor core further to:
the multiplication calculations of one or more rows in the matrix a and corresponding columns in the matrix B are combined into one calculation in a multiplication unit.
10. The processor of claim 7, the at least one processor core further to:
the RISC-V vector instructions used to perform the convolution operations are reordered to trigger hardware optimization of the convolution operations.
11. A computer readable storage medium having instructions stored thereon, wherein the instructions, when executed by a processor, cause the processor to perform the method for image processing performed by a RISC-V enabled processor according to any of claims 1-6.
CN202311091713.9A 2023-08-28 2023-08-28 Method and apparatus for accelerating N: M sparse networks based on RISC-V Active CN116821576B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311091713.9A CN116821576B (en) 2023-08-28 2023-08-28 Method and apparatus for accelerating N: M sparse networks based on RISC-V

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311091713.9A CN116821576B (en) 2023-08-28 2023-08-28 Method and apparatus for accelerating N: M sparse networks based on RISC-V

Publications (2)

Publication Number Publication Date
CN116821576A CN116821576A (en) 2023-09-29
CN116821576B true CN116821576B (en) 2023-12-26

Family

ID=88122471

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311091713.9A Active CN116821576B (en) 2023-08-28 2023-08-28 Method and apparatus for accelerating N: M sparse networks based on RISC-V

Country Status (1)

Country Link
CN (1) CN116821576B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110321525A (en) * 2018-03-28 2019-10-11 英特尔公司 Accelerator for sparse-dense matrix multiplication
CN111191784A (en) * 2018-11-14 2020-05-22 辉达公司 Transposed sparse matrix multiplied by dense matrix for neural network training

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110321525A (en) * 2018-03-28 2019-10-11 英特尔公司 Accelerator for sparse-dense matrix multiplication
CN111191784A (en) * 2018-11-14 2020-05-22 辉达公司 Transposed sparse matrix multiplied by dense matrix for neural network training

Also Published As

Publication number Publication date
CN116821576A (en) 2023-09-29

Similar Documents

Publication Publication Date Title
US11263007B2 (en) Convolutional neural network hardware acceleration device, convolutional calculation method, and storage medium
CN110337635B (en) System, method and apparatus for dot product operation
EP3404587A1 (en) Cnn processing method and device
US8433883B2 (en) Inclusive “OR” bit matrix compare resolution of vector update conflict masks
EP3451162A1 (en) Device and method for use in executing matrix multiplication operations
US8321492B1 (en) System, method, and computer program product for converting a reduction algorithm to a segmented reduction algorithm
EP3623941B1 (en) Systems and methods for performing instructions specifying ternary tile logic operations
US11614947B2 (en) Computational memory
US11579883B2 (en) Systems and methods for performing horizontal tile operations
KR20210071073A (en) Matrix multiplier using sub-matrix ordering
Lai et al. Accelerating Strassen-Winograd's matrix multiplication algorithm on GPUs
CN113010213B (en) Simplified instruction set storage and calculation integrated neural network coprocessor based on resistance change memristor
CN114090954A (en) Integer matrix multiplication kernel optimization method based on FT-2000+
US20080288756A1 (en) "or" bit matrix multiply vector instruction
JPWO2016024508A1 (en) Multiprocessor device
CN112446007A (en) Matrix operation method, operation device and processor
CN116821576B (en) Method and apparatus for accelerating N: M sparse networks based on RISC-V
CN113987414A (en) Small and irregular matrix multiplication optimization method based on ARMv8 multi-core processor
CN113313244A (en) Near-storage neural network accelerator facing to addition network and acceleration method thereof
US20200192633A1 (en) Arithmetic processing device and method of controlling arithmetic processing device
US20240004702A1 (en) Thread construction method and device
CN114116208A (en) Short wave radiation transmission mode three-dimensional acceleration method based on GPU
CN110750752B (en) Interpolation method and device for analog data
CN112667241B (en) Machine learning instruction conversion method and device, board card, main board and electronic equipment
CN112434255A (en) Vector-matrix operation and data processing method, multiplier and processor chip

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant