CN114218524A - Large-scale multi-operation floating-point matrix calculation acceleration implementation method and device - Google Patents

Large-scale multi-operation floating-point matrix calculation acceleration implementation method and device

Info

Publication number: CN114218524A
Application number: CN202111283133.0A
Authority: CN (China)
Prior art keywords: matrix, data stream, calculation, data, multiplication
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 彭元喜, 张龙龙, 郭阳, 扈啸, 黄啊慧, 粟毅, 张世亮, 田甜, 李岩
Current Assignee: National University of Defense Technology; Beijing Power Machinery Institute
Original Assignee: National University of Defense Technology; Beijing Power Machinery Institute
Priority date: 2021-11-01
Filing date: 2021-11-01
Publication date: 2022-03-22
Application filed by National University of Defense Technology and Beijing Power Machinery Institute; priority to CN202111283133.0A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 Complex mathematical operations
    • G06F 17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G06F 7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F 7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F 7/48 Methods or arrangements for performing computations using exclusively denominational number representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F 7/483 Computations with numbers represented by a non-linear combination of denominational numbers, e.g. rational numbers, logarithmic number system or floating-point numbers

Abstract

The invention discloses a method for accelerating large-scale multi-operation floating-point matrix calculation, comprising the following steps. Step S1: receive an external input signal indicating the operation type of the matrix to be processed and determine the matrix operation mode; when the mode is matrix addition or matrix subtraction, proceed to step S3; when the mode is matrix multiplication, matrix-vector multiplication, or matrix-scalar multiplication, proceed to step S2. Step S2: initialize the on-chip RAM to zero, then execute step S4. Step S3: load data source C into the on-chip RAM through the RAM channel, then execute step S4. Step S4: preload part of data stream A through the RAM channel, then load and compute on data stream A and data stream B simultaneously. Step S5: after the calculation completes, transfer the result to off-chip memory. The invention further provides a device for implementing the method. The invention has the advantages of low storage requirements, high calculation efficiency, high reusability, and a wide application range.

Description

Large-scale multi-operation floating-point matrix calculation acceleration implementation method and device
Technical Field
The invention relates to the technical field of high-performance computing, and in particular to a method and device for accelerating large-scale multi-operation floating-point matrix calculation.
Background
Matrix computation is a fundamental and widely used operation in many fields of science and engineering, such as the storage, processing, and recognition of digital images, neural network computation, and Kalman filters in control systems. Matrix calculation directly affects the performance of high-performance computers.
At present, platforms such as CPUs and GPGPUs (general-purpose graphics processing units) accelerate matrix calculation with software libraries such as Intel MKL (Math Kernel Library) and NVIDIA cuBLAS, but these approaches are limited by energy consumption and system complexity and perform poorly in mobile or embedded systems.
In the field of Field Programmable Gate Array (FPGA) development, related work exists on dense matrix-matrix multiplication and sparse matrix-vector multiplication. However, most of these architectures handle only a single matrix operation mode. In practice, many engineering applications require support for multiple matrix operation modes rather than a single operation.
To address the fusion of multiple floating-point matrix operations, Chinese patent application "FPGA-based general floating-point matrix processor hardware structure" (publication number CN104391820A) discloses a general matrix computing system that integrates a host with several different types of matrix operation accelerators, simply combining multiple independent accelerators. This technique has drawbacks: the sum of all the modules consumes more logic resources, increases the area, and the structure offers relatively little reusability.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: in view of the problems in the prior art, the invention provides a method and device for accelerating large-scale multi-operation floating-point matrix calculation with low storage requirements, high calculation efficiency, high reusability, and a wide application range.
In order to solve the technical problems, the invention adopts the following technical scheme:
a large-scale multi-operation floating-point matrix calculation acceleration implementation method comprises the following steps:
step S1: receiving an external input signal according to the operation type of the matrix to be processed, and judging a matrix operation mode: when the operation mode is matrix addition and matrix subtraction, the step proceeds to execute step S3, and when the operation mode is matrix multiplication, matrix-vector multiplication, and matrix-scalar multiplication, the step proceeds to execute step S2;
step S2: initializing the on-chip RAM to zero, and executing the step S4;
step S3: loading the data source C into the on-chip RAM through the RAM channel, and executing the step S4;
step S4: preloading a part of data flow A through an RAM channel, and simultaneously calculating and loading the data flow A and the data flow B;
step S5: and after the calculation is finished, transmitting the calculation result to an off-chip memory.
As a further improvement of the method of the invention: matrix multiplication refers to the product of matrix A and matrix B, i.e., realizing C_{M×N} = A_{M×K} × B_{K×N} + C_{M×N}, where data stream A and data stream B are input data streams and data stream C is initialized to zero directly on chip.
As a further improvement of the method of the invention: matrix addition and matrix subtraction refer to the addition or subtraction of matrix A and matrix C, i.e., realizing C_{M×K} = A_{M×K} × I_{K×K} ± C_{M×K}, where data stream A and data stream C are input data streams and data stream B is the introduced identity matrix I, generated directly on chip by a counter.
As a further improvement of the method of the invention: matrix-vector multiplication refers to the product of matrix A and vector B, i.e., realizing C_{M×1} = A_{M×K} × B_{K×1} + C_{M×1}, where data stream A and data stream B are input data streams and data stream C is initialized to zero directly on chip.
As a further improvement of the method of the invention: matrix-scalar multiplication refers to the product of matrix A and scalar b, i.e., realizing C_{M×K} = A_{M×K} × b, where data stream A and data stream B are input data streams and data stream C is initialized to zero directly on chip.
As a further improvement of the method of the invention: the data sources comprise data source A, data source B, and data source C, where data source A refers to matrix A, data source B refers to matrix B, vector B, or scalar b, and data source C refers to the matrix or vector participating in accumulation, stored on chip.
As a further improvement of the method of the invention, a block-matrix method is adopted, comprising the following steps:
before step S1, calculating the block size according to the scale of the matrix to be processed, and then processing the sub-blocks one by one;
after step S5, continuing with the calculation of the next sub-matrix block, i.e., repeating steps S2-S5 until all sub-blocks are computed.
The present invention further provides a device for accelerating large-scale multi-operation floating-point matrix calculation, comprising:
a preprocessing module, for partitioning matrices and vectors, receiving an external input signal indicating the operation type of the matrix to be processed, and determining the matrix operation mode: when the mode is matrix addition or matrix subtraction, loading data source C into the on-chip RAM through the RAM channel; when the mode is matrix multiplication, matrix-vector multiplication, or matrix-scalar multiplication, initializing the on-chip RAM to zero, preloading part of data stream A through the RAM channel, and then loading and computing on data stream A and data stream B;
a data transmission control module, for loading each data source and returning the calculation results;
a matrix calculation acceleration unit module, a linear array structure composed of multiple identical basic computation units (PEs), for realizing the matrix operations.
As a further improvement of the device of the invention: the basic computation unit PE comprises a floating-point computation unit, two registers, a FIFO, and a dual-port RAM.
As a further improvement of the device of the invention: the floating-point computation unit is an IP core conforming to the IEEE 754 standard, integrating multiply-add and multiply-subtract operations; the specific calculation is selected according to an external operation-mode signal.
Compared with the prior art, the invention has the advantages that:
1. The method and device for accelerating large-scale multi-operation floating-point matrix calculation can complete matrix operations in multiple modes, including matrix multiplication, matrix addition, matrix subtraction, matrix-vector multiplication, and matrix-scalar multiplication, giving a wide application range and flexible use. Compared with traditional approaches based on multiple accelerators, they rely on a unified hardware structure with highly reusable logic, consuming fewer resources and occupying a smaller area.
2. By combining a pipeline strategy with a blocking strategy, the method and device effectively exploit computational parallelism and data reuse under limited resources, can handle matrix calculation problems of any scale, and offer low storage requirements and high calculation efficiency.
Drawings
FIG. 1 is a schematic flow diagram of the method of the present invention.
Fig. 2 is a schematic diagram of the topology of the apparatus of the present invention.
FIG. 3 is a schematic diagram of the matrix multiplication function process in the specific application example of the present invention.
FIG. 4 is a schematic diagram of the matrix addition and matrix subtraction function process in the specific application example of the present invention.
FIG. 5 is a diagram illustrating the matrix-vector multiplication function of the present invention in a specific application example.
FIG. 6 is a diagram illustrating the matrix-scalar multiplication function process in a specific application example of the present invention.
Detailed Description
The invention will be described in further detail below with reference to the drawings and specific examples.
As shown in FIG. 1, the method of the present invention for accelerating large-scale multi-operation floating-point matrix calculation comprises the following steps:
step S1: receive an external input signal indicating the operation type of the matrix to be processed and determine the matrix operation mode: when the mode is matrix addition or matrix subtraction, proceed to step S3; when the mode is matrix multiplication, matrix-vector multiplication, or matrix-scalar multiplication, proceed to step S2;
step S2: initialize the on-chip RAM to zero, then execute step S4;
step S3: load data source C into the on-chip RAM through the RAM channel, then execute step S4;
step S4: preload part of data stream A through the RAM channel, then load and compute on data stream A and data stream B simultaneously;
step S5: after the calculation completes, transfer the calculation result to off-chip memory.
in a specific application example, the invention can maximally multiplex the logical structures of operations such as matrix multiplication, matrix addition, matrix subtraction, matrix-vector multiplication, matrix-scalar multiplication and the like, and use similar data sources and data flow modes.
In a specific application example, the data sources include data source A, data source B, and data source C, where data source A refers to matrix A, data source B refers to matrix B, vector B, or scalar b, and data source C refers to the matrix or vector participating in accumulation, stored on chip.
In a specific application example, the data streams correspond to the data sources and comprise data stream A, data stream B, and data stream C. In the present invention, data stream A and data stream C are relatively fixed, while data stream B may be a matrix, a vector, or a scalar.
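To make the shared data-flow pattern concrete, the following is a minimal C sketch of the unified form C = A × B ± C to which all five operation modes reduce. The function name, `sign` parameter, and mode notes are illustrative assumptions for exposition, not the patented hardware.

```c
#include <stddef.h>

/* Unified reference kernel: C[MxN] = A[MxK] * B[KxN] + sign * C[MxN].
 * The five operation modes reduce to this form as follows:
 *   matrix multiplication : B = input matrix,    C preset to 0, sign = +1
 *   matrix add / subtract : B = identity I(KxK), C = input,     sign = +/-1
 *   matrix-vector multiply: N = 1 (B is a column vector), C preset to 0
 *   matrix-scalar multiply: mathematically equivalent to B = b * I(KxK),
 *                           C preset to 0 (the hardware streams b directly)
 * All matrices are row-major. */
void unified_kernel(size_t M, size_t N, size_t K,
                    const float *A, const float *B, float *C, float sign)
{
    for (size_t i = 0; i < M; i++)
        for (size_t j = 0; j < N; j++) {
            float acc = sign * C[i * N + j];   /* preset / accumulated C */
            for (size_t k = 0; k < K; k++)
                acc += A[i * K + k] * B[k * N + j];
            C[i * N + j] = acc;
        }
}
```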
In a specific application example, the matrix multiplication function of the invention refers to the product of matrix A and matrix B, i.e., realizing C_{M×N} = A_{M×K} × B_{K×N} + C_{M×N}, where data stream A and data stream B are input data streams and data stream C is initialized to zero directly on chip.
In a specific application example, the matrix addition and matrix subtraction function of the invention refers to the addition or subtraction of matrix A and matrix C, i.e., realizing C_{M×K} = A_{M×K} × I_{K×K} ± C_{M×K}, where data stream A and data stream C are input data streams and data stream B is the introduced identity matrix I, which can be generated directly on chip by a counter.
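The identity matrix realizes addition and subtraction on the same multiply-accumulate datapath because multiplying by I reproduces A. A one-line derivation restating the formula above (with δ_{kj} the Kronecker delta):

```latex
\left(A_{M\times K}\, I_{K\times K}\right)_{ij}
  = \sum_{k=1}^{K} a_{ik}\,\delta_{kj}
  = a_{ij}
\quad\Longrightarrow\quad
A\,I \pm C = A \pm C .
```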
In a specific application example, the matrix-vector multiplication function of the invention refers to the product of matrix A and vector B, i.e., realizing C_{M×1} = A_{M×K} × B_{K×1} + C_{M×1}, where data stream A and data stream B are input data streams and data stream C is initialized to zero directly on chip.
In a specific application example, the matrix-scalar multiplication function of the invention refers to the product of matrix A and scalar b, i.e., realizing C_{M×K} = A_{M×K} × b, where data stream A and data stream B are input data streams and data stream C is initialized to zero directly on chip.
As a preferred embodiment, the invention uses the idea of a block matrix to decompose a large-scale matrix calculation problem into small matrix calculations, with the following specific steps: before step S1, the block size is calculated according to the scale of the matrix to be processed, and the sub-blocks are then processed one by one; after step S5, the calculation of the next sub-matrix block continues, i.e., steps S2-S5 are repeated until all sub-blocks are computed, as shown in the sketch below.
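A minimal C sketch of this blocking scheme, under stated assumptions: row-major storage, Sp and Sq as the array height and on-chip RAM depth (defined in the embodiments below), and a plain triple loop standing in for the hardware pipeline of steps S2-S5. `process_block` is a hypothetical helper, not named by the patent.

```c
#include <stddef.h>

/* Hypothetical stand-in for steps S2-S5 on one sub-block of C:
 * C_blk += A_panel[rows x K] * B_panel[K x cols]; C_blk is preset
 * to zero or to the input data (steps S2/S3) before this runs. */
static void process_block(size_t rows, size_t cols, size_t K,
                          const float *A, size_t lda,
                          const float *B, size_t ldb,
                          float *C, size_t ldc)
{
    for (size_t k = 0; k < K; k++)          /* K iterations, as in the text */
        for (size_t j = 0; j < cols; j++)
            for (size_t i = 0; i < rows; i++)
                C[i * ldc + j] += A[i * lda + k] * B[k * ldb + j];
}

/* Decompose C[M x N] += A[M x K] * B[K x N] into Sp x Sq sub-blocks,
 * processed one by one; edge blocks may be smaller. */
void blocked_matmul(size_t M, size_t N, size_t K, size_t Sp, size_t Sq,
                    const float *A, const float *B, float *C)
{
    for (size_t i0 = 0; i0 < M; i0 += Sp)       /* row panel of A and C   */
        for (size_t j0 = 0; j0 < N; j0 += Sq) { /* column panel of B, C   */
            size_t rows = (M - i0 < Sp) ? M - i0 : Sp;
            size_t cols = (N - j0 < Sq) ? N - j0 : Sq;
            process_block(rows, cols, K,
                          A + i0 * K, K,
                          B + j0, N,
                          C + i0 * N + j0, N);
        }
}
```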
As shown in FIG. 2, the present invention further provides a device for accelerating large-scale multi-operation floating-point matrix calculation, comprising:
a preprocessing module, for partitioning matrices and vectors, receiving an external input signal indicating the operation type of the matrix to be processed, and determining the matrix operation mode: when the mode is matrix addition or matrix subtraction, step S3 of the method is executed; when the mode is matrix multiplication, matrix-vector multiplication, or matrix-scalar multiplication, step S2 of the method is executed;
a data transmission control module, for loading each data source and returning the calculation results;
a matrix calculation acceleration unit module, a linear array structure composed of multiple identical basic computation units (PEs), for realizing the various matrix operations.
In a specific application example, the basic computation unit PE comprises a floating-point computation unit, two registers, a FIFO and a simple dual-port RAM.
In a specific application example, the floating-point computation unit is an IP core conforming to the IEEE 754 standard, integrating multiply-add and multiply-subtract operations; the specific calculation is selected according to an external operation-mode signal.
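As a behavioural sketch of this mode-switched unit, the C standard library's fmaf models a single-rounding IEEE 754 fused multiply-add; the enum encoding of the external mode signal is an assumption made for illustration.

```c
#include <math.h>

typedef enum { MODE_MAC = 0, MODE_MSC = 1 } fp_mode_t;  /* assumed encoding */

/* One PE update: returns a*b + c in multiply-add mode and a*b - c in
 * multiply-subtract mode. fmaf rounds once, like a fused IEEE 754 core;
 * negating c is exact, so the single rounding is preserved. */
static inline float pe_fpu(float a, float b, float c, fp_mode_t mode)
{
    return fmaf(a, b, (mode == MODE_MAC) ? c : -c);
}
```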
As shown in FIG. 3, this embodiment describes the matrix multiplication operation. The specific function is the product of matrix A and matrix B, i.e., realizing C_{M×N} = A_{M×K} × B_{K×N} + C_{M×N}, where data stream A and data stream B are input data streams and the value of data stream C is initialized to zero directly on chip.
After blocking, the matrix multiplication of this embodiment takes the form

C_{S_p×S_q} = A_{S_p×K} × B_{K×S_q} + C_{S_p×S_q}

where the size of sub-matrix A is S_p × K, the size of sub-matrix B is K × S_q, and the size of sub-matrix C is S_p × S_q. S_p denotes the number of rows of sub-matrix A and also the number of linear-array units; S_q denotes the number of columns of sub-matrix B and also the depth of the on-chip RAM; K is the number of columns of sub-matrix A and the number of rows of sub-matrix B, and also the number of iterations.
The specific calculation process is as follows: initialize sub-matrix C to zero and preload a column of elements of sub-matrix A (S_p of them); then stream the elements of sub-matrix B row by row into the linear-array computing units, completing the c[CNT] updates in a pipelined manner with a rewrite mechanism and storing intermediate results in local memory; after K iterations, transfer the result sub-matrix C to off-chip memory and proceed to the calculation of the next sub-block, repeating until done.
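A behavioural C model of this per-block schedule, under illustrative assumptions (S_p = 16 PEs, on-chip RAM depth S_q = 64; the real sizes are implementation parameters): each iteration preloads one column element of A into every PE's register, then streams a row of B past the array while each PE rewrites c[CNT] in its local RAM.

```c
#include <stddef.h>

enum { SP = 16, SQ = 64 };   /* illustrative: #PEs and local RAM depth */

/* One sub-block: C[SP][SQ] += A[SP][K] * B[K][SQ] over K iterations.
 * Iteration k: preload column k of A into the SP PE registers, then
 * stream row k of B past the array; PE p performs the pipelined
 * read-modify-write c[p][cnt] += a_reg[p] * b[cnt]. */
void linear_array_block(size_t K, const float *A, const float *B,
                        float C[SP][SQ])
{
    for (size_t k = 0; k < K; k++) {
        float a_reg[SP];                       /* per-PE preload register */
        for (size_t p = 0; p < SP; p++)
            a_reg[p] = A[p * K + k];           /* preload column k of A   */
        for (size_t cnt = 0; cnt < SQ; cnt++)  /* row k of B streams in   */
            for (size_t p = 0; p < SP; p++)
                C[p][cnt] += a_reg[p] * B[k * SQ + cnt];  /* c[CNT] rewrite */
    }
}
```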
As shown in FIG. 4, this embodiment describes the matrix addition and matrix subtraction operations. The specific function is the addition or subtraction of matrix A and matrix C, i.e., realizing C_{M×K} = A_{M×K} × I_{K×K} ± C_{M×K}, where data stream A and data stream C are input data streams and data stream B is the introduced identity matrix I, which can be generated directly on chip by a counter.
When this embodiment performs the matrix addition or matrix subtraction operation, the blocked form is

C_{S_p×K} = A_{S_p×K} × I_{K×K} ± C_{S_p×K}

where the size of sub-matrix A is S_p × K, the size of sub-matrix B (the identity block) is K × K, and the size of sub-matrix C is S_p × K. S_p denotes the number of rows of sub-matrix A and also the number of linear-array units; K is the number of columns of sub-matrix A, the number of rows and columns of sub-matrix B, and also the number of iterations.
The specific calculation process is as follows: preload all elements of sub-matrix C and a column of elements of sub-matrix A (S_p of them); then stream the elements of sub-matrix B row by row into the linear-array computing units, completing the c[CNT] updates in a pipelined manner with a rewrite mechanism and storing intermediate results in local memory; after K iterations, transfer the result sub-matrix C to off-chip memory and proceed to the calculation of the next sub-block, repeating until done.
As shown in FIG. 5, this embodiment describes the matrix-vector multiplication operation. The specific function is the product of matrix A and vector B, i.e., realizing C_{M×1} = A_{M×K} × B_{K×1} + C_{M×1}, where data stream A and data stream B are input data streams and the value of data stream C is initialized to zero directly on chip.
When this embodiment performs matrix-vector multiplication, data stream B becomes a vector, and the blocked form is

C_{S_p×1} = A_{S_p×K} × B_{K×1} + C_{S_p×1}

where the size of sub-matrix A is S_p × K, the size of sub-vector B is K × 1, and the size of sub-vector C is S_p × 1. S_p denotes the number of rows of sub-matrix A and also the number of linear-array units; K is the number of columns of sub-matrix A.
The specific calculation process is as follows: initialize sub-vector block C to zero and preload a column of elements of sub-matrix A (S_p of them); then stream the elements of sub-vector B into the linear-array computing units, completing the c[CNT] updates in a pipelined manner with a rewrite mechanism and storing intermediate results in local memory; after K iterations, transfer the result sub-vector block C to off-chip memory and proceed to the calculation of the next sub-block, repeating until done.
As shown in FIG. 6, this embodiment describes the matrix-scalar multiplication operation, which refers to the product of matrix A and scalar b, i.e., realizing C_{M×K} = A_{M×K} × b, where data stream A and data stream B are input data streams and data stream C is initialized to zero directly on chip.
When this embodiment performs matrix-scalar multiplication, data stream B becomes a scalar, and the blocked form is

C_{S_p×K} = A_{S_p×K} × b

where the size of sub-matrix A is S_p × K and the size of sub-matrix C is S_p × K. S_p denotes the number of rows of sub-matrix A and also the number of linear-array units; K is the number of columns of sub-matrix A.
The specific calculation process is as follows: initialize sub-matrix C to zero and preload a column of elements of sub-matrix A (S_p of them); then stream scalar b into the linear-array computing units, complete a single c[CNT] update, transfer the result sub-matrix C to off-chip memory, and proceed to the calculation of the next sub-block, repeating until done.
Therefore, the invention can complete matrix operations in multiple modes, has a wide application range, and is flexible in use. Compared with traditional approaches based on multiple accelerators, each operation is based on a unified hardware structure, giving a high degree of logic reuse, low resource consumption, and a small area.
The above is only a preferred embodiment of the present invention, and the scope of protection of the present invention is not limited to the above embodiment; all technical solutions within the spirit of the invention fall within its scope of protection. It should be noted that those skilled in the art may make modifications and refinements without departing from the principle of the invention, and these shall also be regarded as falling within the scope of protection of the present invention.

Claims (10)

1. A method for accelerating large-scale multi-operation floating-point matrix calculation, characterized by comprising the following steps:
step S1: receiving an external input signal indicating the operation type of the matrix to be processed and determining the matrix operation mode: when the mode is matrix addition or matrix subtraction, proceeding to step S3; when the mode is matrix multiplication, matrix-vector multiplication, or matrix-scalar multiplication, proceeding to step S2;
step S2: initializing the on-chip RAM to zero, then executing step S4;
step S3: loading data source C into the on-chip RAM through the RAM channel, then executing step S4;
step S4: preloading part of data stream A through the RAM channel, then loading and computing on data stream A and data stream B simultaneously;
step S5: after the calculation completes, transferring the calculation result to off-chip memory.
2. The method for accelerating large-scale multi-operation floating-point matrix calculation according to claim 1, wherein matrix multiplication is the product of matrix A and matrix B, i.e., realizing C_{M×N} = A_{M×K} × B_{K×N} + C_{M×N}, where data stream A and data stream B are input data streams and data stream C is initialized to zero directly on chip.
3. The method for accelerating large-scale multi-operation floating-point matrix calculation according to claim 1, wherein matrix addition and matrix subtraction refer to the addition or subtraction of matrix A and matrix C, i.e., realizing C_{M×K} = A_{M×K} × I_{K×K} ± C_{M×K}, where data stream A and data stream C are input data streams and data stream B is the introduced identity matrix I, generated directly on chip by a counter.
4. The method for accelerating large-scale multi-operation floating-point matrix calculation according to claim 1, wherein matrix-vector multiplication is the product of matrix A and vector B, i.e., realizing C_{M×1} = A_{M×K} × B_{K×1} + C_{M×1}, where data stream A and data stream B are input data streams and data stream C is initialized to zero directly on chip.
5. The method for accelerating large-scale multi-operation floating-point matrix calculation according to claim 1, wherein matrix-scalar multiplication is the product of matrix A and scalar b, i.e., realizing C_{M×K} = A_{M×K} × b, where data stream A and data stream B are input data streams and data stream C is initialized to zero directly on chip.
6. The method for accelerating large-scale multi-operation floating-point matrix calculation according to any one of claims 1 to 5, wherein the data sources comprise data source A, data source B, and data source C, where data source A refers to matrix A, data source B refers to matrix B, vector B, or scalar b, and data source C refers to the matrix or vector participating in accumulation, stored on chip.
7. The method for accelerating large-scale multi-operation floating-point matrix calculation according to any one of claims 1 to 5, wherein a block-matrix method is adopted, comprising the following steps:
before step S1, calculating the block size according to the scale of the matrix to be processed, and then processing the sub-blocks one by one;
after step S5, continuing with the calculation of the next sub-matrix block, i.e., repeating steps S2-S5 until all sub-blocks are computed.
8. A device for accelerating large-scale multi-operation floating-point matrix calculation, characterized by comprising:
a preprocessing module, for partitioning matrices and vectors, receiving an external input signal indicating the operation type of the matrix to be processed, and determining the matrix operation mode: when the mode is matrix addition or matrix subtraction, loading data source C into the on-chip RAM through the RAM channel; when the mode is matrix multiplication, matrix-vector multiplication, or matrix-scalar multiplication, initializing the on-chip RAM to zero, preloading part of data stream A through the RAM channel, and then loading and computing on data stream A and data stream B;
a data transmission control module, for loading each data source and returning the calculation results;
a matrix calculation acceleration unit module, a linear array structure composed of multiple identical basic computation units (PEs), for realizing the matrix operations.
9. The apparatus of claim 8, wherein the basic computation unit PE comprises a floating-point computation unit, two registers, a FIFO and a dual-port RAM.
10. The apparatus of claim 8, wherein the floating-point calculation unit is an IP core conforming to the IEEE 754 standard, integrating multiply-add and multiply-subtract operations and selecting between them according to an external operation-mode signal during calculation.
Priority Applications (1)

Application number: CN202111283133.0A; priority date: 2021-11-01; filing date: 2021-11-01; title: Large-scale multi-operation floating-point matrix calculation acceleration implementation method and device

Publications (1)

Publication number: CN114218524A; publication date: 2022-03-22

Family ID: 80696378

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination