CN114218524A - Large-scale multi-operation floating-point matrix calculation acceleration implementation method and device - Google Patents

Large-scale multi-operation floating-point matrix calculation acceleration implementation method and device

Info

Publication number: CN114218524A
Application number: CN202111283133.0A
Authority: CN (China)
Prior art keywords: matrix, data stream, calculation, data, multiplication
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 彭元喜, 张龙龙, 郭阳, 扈啸, 黄啊慧, 粟毅, 张世亮, 田甜, 李岩
Current Assignee: National University of Defense Technology; Beijing Power Machinery Institute
Original Assignee: National University of Defense Technology; Beijing Power Machinery Institute
Priority date: 2021-11-01
Filing date: 2021-11-01
Publication date: 2022-03-22
Application filed by National University of Defense Technology and Beijing Power Machinery Institute; priority to CN202111283133.0A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 Complex mathematical operations
    • G06F 17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G06F 7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F 7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F 7/48 Methods or arrangements for performing computations using exclusively denominational number representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F 7/483 Computations with numbers represented by a non-linear combination of denominational numbers, e.g. rational numbers, logarithmic number system or floating-point numbers

Abstract

The invention discloses a method for accelerating large-scale multi-operation floating-point matrix calculation, comprising the following steps. Step S1: receive an external input signal indicating the operation type of the matrix to be processed and determine the matrix operation mode; when the mode is matrix addition or matrix subtraction, proceed to step S3; when the mode is matrix multiplication, matrix-vector multiplication, or matrix-scalar multiplication, proceed to step S2. Step S2: initialize the on-chip RAM to zero, then execute step S4. Step S3: load data source C into the on-chip RAM through the RAM channel, then execute step S4. Step S4: preload part of data stream A through the RAM channel, then load and compute on data stream A and data stream B simultaneously. Step S5: after the calculation completes, transfer the result to off-chip memory. The invention further provides a device for implementing the method. The invention has the advantages of low storage requirements, high calculation efficiency, high reusability, and a wide application range.

Description

Large-scale multi-operation floating-point matrix calculation acceleration implementation method and device
Technical Field
The invention relates to the technical field of high-performance computing, and in particular to a method and device for accelerating large-scale multi-operation floating-point matrix calculation.
Background
Matrix computation is a fundamental and widely used operation in many fields of science and engineering, such as the storage, processing, and recognition of digital images, neural network computation, and Kalman filters in control systems. Matrix calculation directly affects the performance of high-performance computers.
At present, platforms such as CPUs and GPGPUs (general-purpose graphics processing units) accelerate matrix calculation with software libraries such as Intel MKL (Math Kernel Library) and NVIDIA cuBLAS, but these approaches are limited by energy consumption and system complexity and perform poorly in mobile or embedded systems.
In the field of Field Programmable Gate Array (FPGA) development, related work exists on dense matrix-matrix multiplication and sparse matrix-vector multiplication. However, most of these architectures handle only a single matrix operation mode. In practice, many engineering applications require support for multiple matrix operation modes rather than a single operation.
To address the fusion of multiple floating-point matrix operations, Chinese patent application "FPGA-based general floating-point matrix processor hardware structure" (publication number CN104391820A) discloses a general matrix computing system that integrates a host with several different types of matrix operation accelerators, simply combining multiple independent accelerators. This technique has drawbacks: the sum of all the modules consumes more logic resources, increases the area, and the structure offers relatively little reusability.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: in view of the problems in the prior art, the invention provides a method and device for accelerating large-scale multi-operation floating-point matrix calculation with low storage requirements, high calculation efficiency, high reusability, and a wide application range.
In order to solve the technical problems, the invention adopts the following technical scheme:
a large-scale multi-operation floating-point matrix calculation acceleration implementation method comprises the following steps:
step S1: receiving an external input signal according to the operation type of the matrix to be processed, and judging a matrix operation mode: when the operation mode is matrix addition and matrix subtraction, the step proceeds to execute step S3, and when the operation mode is matrix multiplication, matrix-vector multiplication, and matrix-scalar multiplication, the step proceeds to execute step S2;
step S2: initializing the on-chip RAM to zero, and executing the step S4;
step S3: loading the data source C into the on-chip RAM through the RAM channel, and executing the step S4;
step S4: preloading a part of data flow A through an RAM channel, and simultaneously calculating and loading the data flow A and the data flow B;
step S5: and after the calculation is finished, transmitting the calculation result to an off-chip memory.
As a further improvement of the method of the invention: matrix multiplication refers to the product of matrix A and matrix B, i.e., realizing C_{M×N} = A_{M×K} × B_{K×N} + C_{M×N}, where data stream A and data stream B are input data streams and data stream C is initialized to zero directly on chip.
As a further improvement of the method of the invention: matrix addition and matrix subtraction refer to the addition or subtraction of matrix A and matrix C, i.e., realizing C_{M×K} = A_{M×K} × I_{K×K} ± C_{M×K}, where data stream A and data stream C are input data streams and data stream B is the introduced identity matrix I, generated directly on chip by a counter.
As a further improvement of the method of the invention: matrix-vector multiplication refers to the product of matrix A and vector B, i.e., realizing C_{M×1} = A_{M×K} × B_{K×1} + C_{M×1}, where data stream A and data stream B are input data streams and data stream C is initialized to zero directly on chip.
As a further improvement of the method of the invention: matrix-scalar multiplication refers to the product of matrix A and scalar b, i.e., realizing C_{M×K} = A_{M×K} × b, where data stream A and data stream B are input data streams and data stream C is initialized to zero directly on chip.
As a further improvement of the method of the invention: the data sources comprise data source A, data source B, and data source C, where data source A refers to matrix A, data source B refers to matrix B, vector B, or scalar b, and data source C refers to the matrix or vector participating in accumulation, stored on chip.
As a further improvement of the method of the invention, a block-matrix method is adopted, comprising the following steps:
before step S1, calculating the block size according to the scale of the matrix to be processed, and then processing the sub-blocks one by one;
after step S5, continuing with the calculation of the next sub-matrix block, i.e., repeating steps S2-S5 until all sub-blocks are computed.
The present invention further provides a device for accelerating large-scale multi-operation floating-point matrix calculation, comprising:
a preprocessing module, for partitioning matrices and vectors, receiving an external input signal indicating the operation type of the matrix to be processed, and determining the matrix operation mode: when the mode is matrix addition or matrix subtraction, loading data source C into the on-chip RAM through the RAM channel; when the mode is matrix multiplication, matrix-vector multiplication, or matrix-scalar multiplication, initializing the on-chip RAM to zero, preloading part of data stream A through the RAM channel, and then loading and computing on data stream A and data stream B;
a data transmission control module, for loading each data source and returning the calculation results;
a matrix calculation acceleration unit module, a linear array structure composed of multiple identical basic computation units (PEs), for realizing the matrix operations.
As a further improvement of the device of the invention: the basic computation unit PE comprises a floating-point computation unit, two registers, a FIFO, and a dual-port RAM.
As a further improvement of the device of the invention: the floating-point computation unit is an IP core conforming to the IEEE 754 standard, integrating multiply-add and multiply-subtract operations; the specific calculation is selected according to an external operation-mode signal.
Compared with the prior art, the invention has the advantages that:
1. The method and device for accelerating large-scale multi-operation floating-point matrix calculation can complete matrix operations in multiple modes, including matrix multiplication, matrix addition, matrix subtraction, matrix-vector multiplication, and matrix-scalar multiplication, giving a wide application range and flexible use. Compared with traditional approaches based on multiple accelerators, they rely on a unified hardware structure with highly reusable logic, consuming fewer resources and occupying a smaller area.
2. By combining a pipeline strategy with a blocking strategy, the method and device effectively exploit computational parallelism and data reuse under limited resources, can handle matrix calculation problems of any scale, and offer low storage requirements and high calculation efficiency.
Drawings
FIG. 1 is a schematic flow diagram of the method of the present invention.
Fig. 2 is a schematic diagram of the topology of the apparatus of the present invention.
FIG. 3 is a schematic diagram of the matrix multiplication function process in the specific application example of the present invention.
FIG. 4 is a schematic diagram of the matrix addition and matrix subtraction function process in the specific application example of the present invention.
FIG. 5 is a diagram illustrating the matrix-vector multiplication function of the present invention in a specific application example.
FIG. 6 is a diagram illustrating the matrix-scalar multiplication function process in a specific application example of the present invention.
Detailed Description
The invention will be described in further detail below with reference to the drawings and specific examples.
As shown in FIG. 1, the method of the present invention for accelerating large-scale multi-operation floating-point matrix calculation comprises the following steps:
step S1: receive an external input signal indicating the operation type of the matrix to be processed and determine the matrix operation mode: when the mode is matrix addition or matrix subtraction, proceed to step S3; when the mode is matrix multiplication, matrix-vector multiplication, or matrix-scalar multiplication, proceed to step S2;
step S2: initialize the on-chip RAM to zero, then execute step S4;
step S3: load data source C into the on-chip RAM through the RAM channel, then execute step S4;
step S4: preload part of data stream A through the RAM channel, then load and compute on data stream A and data stream B simultaneously;
step S5: after the calculation completes, transfer the calculation result to off-chip memory.
in a specific application example, the invention can maximally multiplex the logical structures of operations such as matrix multiplication, matrix addition, matrix subtraction, matrix-vector multiplication, matrix-scalar multiplication and the like, and use similar data sources and data flow modes.
In a specific application example, the data sources include data source A, data source B, and data source C, where data source A refers to matrix A, data source B refers to matrix B, vector B, or scalar b, and data source C refers to the matrix or vector participating in accumulation, stored on chip.
In a specific application example, the data streams correspond to the data sources and comprise data stream A, data stream B, and data stream C. In the present invention, data stream A and data stream C are relatively fixed, while data stream B may be a matrix, a vector, or a scalar.
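To make the shared data-flow pattern concrete, the following is a minimal C sketch of the unified form C = A × B ± C to which all five operation modes reduce. The function name, `sign` parameter, and mode notes are illustrative assumptions for exposition, not the patented hardware.

```c
#include <stddef.h>

/* Unified reference kernel: C[MxN] = A[MxK] * B[KxN] + sign * C[MxN].
 * The five operation modes reduce to this form as follows:
 *   matrix multiplication : B = input matrix,    C preset to 0, sign = +1
 *   matrix add / subtract : B = identity I(KxK), C = input,     sign = +/-1
 *   matrix-vector multiply: N = 1 (B is a column vector), C preset to 0
 *   matrix-scalar multiply: mathematically equivalent to B = b * I(KxK),
 *                           C preset to 0 (the hardware streams b directly)
 * All matrices are row-major. */
void unified_kernel(size_t M, size_t N, size_t K,
                    const float *A, const float *B, float *C, float sign)
{
    for (size_t i = 0; i < M; i++)
        for (size_t j = 0; j < N; j++) {
            float acc = sign * C[i * N + j];   /* preset / accumulated C */
            for (size_t k = 0; k < K; k++)
                acc += A[i * K + k] * B[k * N + j];
            C[i * N + j] = acc;
        }
}
```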
In a specific application example, the matrix multiplication function of the invention refers to the product of matrix A and matrix B, i.e., realizing C_{M×N} = A_{M×K} × B_{K×N} + C_{M×N}, where data stream A and data stream B are input data streams and data stream C is initialized to zero directly on chip.
In a specific application example, the matrix addition and matrix subtraction function of the invention refers to the addition or subtraction of matrix A and matrix C, i.e., realizing C_{M×K} = A_{M×K} × I_{K×K} ± C_{M×K}, where data stream A and data stream C are input data streams and data stream B is the introduced identity matrix I, which can be generated directly on chip by a counter.
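The identity matrix realizes addition and subtraction on the same multiply-accumulate datapath because multiplying by I reproduces A. A one-line derivation restating the formula above (with δ_{kj} the Kronecker delta):

```latex
\left(A_{M\times K}\, I_{K\times K}\right)_{ij}
  = \sum_{k=1}^{K} a_{ik}\,\delta_{kj}
  = a_{ij}
\quad\Longrightarrow\quad
A\,I \pm C = A \pm C .
```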
In a specific application example, the matrix-vector multiplication function of the invention refers to the product of matrix A and vector B, i.e., realizing C_{M×1} = A_{M×K} × B_{K×1} + C_{M×1}, where data stream A and data stream B are input data streams and data stream C is initialized to zero directly on chip.
In a specific application example, the matrix-scalar multiplication function of the invention refers to the product of matrix A and scalar b, i.e., realizing C_{M×K} = A_{M×K} × b, where data stream A and data stream B are input data streams and data stream C is initialized to zero directly on chip.
As a preferred embodiment, the invention uses the idea of a block matrix to decompose a large-scale matrix calculation problem into small matrix calculations, with the following specific steps: before step S1, the block size is calculated according to the scale of the matrix to be processed, and the sub-blocks are then processed one by one; after step S5, the calculation of the next sub-matrix block continues, i.e., steps S2-S5 are repeated until all sub-blocks are computed, as shown in the sketch below.
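A minimal C sketch of this blocking scheme, under stated assumptions: row-major storage, Sp and Sq as the array height and on-chip RAM depth (defined in the embodiments below), and a plain triple loop standing in for the hardware pipeline of steps S2-S5. `process_block` is a hypothetical helper, not named by the patent.

```c
#include <stddef.h>

/* Hypothetical stand-in for steps S2-S5 on one sub-block of C:
 * C_blk += A_panel[rows x K] * B_panel[K x cols]; C_blk is preset
 * to zero or to the input data (steps S2/S3) before this runs. */
static void process_block(size_t rows, size_t cols, size_t K,
                          const float *A, size_t lda,
                          const float *B, size_t ldb,
                          float *C, size_t ldc)
{
    for (size_t k = 0; k < K; k++)          /* K iterations, as in the text */
        for (size_t j = 0; j < cols; j++)
            for (size_t i = 0; i < rows; i++)
                C[i * ldc + j] += A[i * lda + k] * B[k * ldb + j];
}

/* Decompose C[M x N] += A[M x K] * B[K x N] into Sp x Sq sub-blocks,
 * processed one by one; edge blocks may be smaller. */
void blocked_matmul(size_t M, size_t N, size_t K, size_t Sp, size_t Sq,
                    const float *A, const float *B, float *C)
{
    for (size_t i0 = 0; i0 < M; i0 += Sp)       /* row panel of A and C   */
        for (size_t j0 = 0; j0 < N; j0 += Sq) { /* column panel of B, C   */
            size_t rows = (M - i0 < Sp) ? M - i0 : Sp;
            size_t cols = (N - j0 < Sq) ? N - j0 : Sq;
            process_block(rows, cols, K,
                          A + i0 * K, K,
                          B + j0, N,
                          C + i0 * N + j0, N);
        }
}
```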
As shown in FIG. 2, the present invention further provides a device for accelerating large-scale multi-operation floating-point matrix calculation, comprising:
a preprocessing module, for partitioning matrices and vectors, receiving an external input signal indicating the operation type of the matrix to be processed, and determining the matrix operation mode: when the mode is matrix addition or matrix subtraction, step S3 of the method is executed; when the mode is matrix multiplication, matrix-vector multiplication, or matrix-scalar multiplication, step S2 of the method is executed;
a data transmission control module, for loading each data source and returning the calculation results;
a matrix calculation acceleration unit module, a linear array structure composed of multiple identical basic computation units (PEs), for realizing the various matrix operations.
In a specific application example, the basic computation unit PE comprises a floating-point computation unit, two registers, a FIFO and a simple dual-port RAM.
In a specific application example, the floating-point computation unit is an IP core conforming to the IEEE 754 standard, integrating multiply-add and multiply-subtract operations; the specific calculation is selected according to an external operation-mode signal.
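As a behavioural sketch of this mode-switched unit, the C standard library's fmaf models a single-rounding IEEE 754 fused multiply-add; the enum encoding of the external mode signal is an assumption made for illustration.

```c
#include <math.h>

typedef enum { MODE_MAC = 0, MODE_MSC = 1 } fp_mode_t;  /* assumed encoding */

/* One PE update: returns a*b + c in multiply-add mode and a*b - c in
 * multiply-subtract mode. fmaf rounds once, like a fused IEEE 754 core;
 * negating c is exact, so the single rounding is preserved. */
static inline float pe_fpu(float a, float b, float c, fp_mode_t mode)
{
    return fmaf(a, b, (mode == MODE_MAC) ? c : -c);
}
```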
As shown in FIG. 3, this embodiment describes the matrix multiplication operation. The specific function is the product of matrix A and matrix B, i.e., realizing C_{M×N} = A_{M×K} × B_{K×N} + C_{M×N}, where data stream A and data stream B are input data streams and the value of data stream C is initialized to zero directly on chip.
After blocking, the matrix multiplication of this embodiment takes the form

C_{S_p×S_q} = A_{S_p×K} × B_{K×S_q} + C_{S_p×S_q}

where the size of sub-matrix A is S_p × K, the size of sub-matrix B is K × S_q, and the size of sub-matrix C is S_p × S_q. S_p denotes the number of rows of sub-matrix A and also the number of linear-array units; S_q denotes the number of columns of sub-matrix B and also the depth of the on-chip RAM; K is the number of columns of sub-matrix A and the number of rows of sub-matrix B, and also the number of iterations.
The specific calculation process is as follows: initialize sub-matrix C to zero and preload a column of elements of sub-matrix A (S_p of them); then stream the elements of sub-matrix B row by row into the linear-array computing units, completing the c[CNT] updates in a pipelined manner with a rewrite mechanism and storing intermediate results in local memory; after K iterations, transfer the result sub-matrix C to off-chip memory and proceed to the calculation of the next sub-block, repeating until done.
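A behavioural C model of this per-block schedule, under illustrative assumptions (S_p = 16 PEs, on-chip RAM depth S_q = 64; the real sizes are implementation parameters): each iteration preloads one column element of A into every PE's register, then streams a row of B past the array while each PE rewrites c[CNT] in its local RAM.

```c
#include <stddef.h>

enum { SP = 16, SQ = 64 };   /* illustrative: #PEs and local RAM depth */

/* One sub-block: C[SP][SQ] += A[SP][K] * B[K][SQ] over K iterations.
 * Iteration k: preload column k of A into the SP PE registers, then
 * stream row k of B past the array; PE p performs the pipelined
 * read-modify-write c[p][cnt] += a_reg[p] * b[cnt]. */
void linear_array_block(size_t K, const float *A, const float *B,
                        float C[SP][SQ])
{
    for (size_t k = 0; k < K; k++) {
        float a_reg[SP];                       /* per-PE preload register */
        for (size_t p = 0; p < SP; p++)
            a_reg[p] = A[p * K + k];           /* preload column k of A   */
        for (size_t cnt = 0; cnt < SQ; cnt++)  /* row k of B streams in   */
            for (size_t p = 0; p < SP; p++)
                C[p][cnt] += a_reg[p] * B[k * SQ + cnt];  /* c[CNT] rewrite */
    }
}
```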
As shown in FIG. 4, this embodiment describes the matrix addition and matrix subtraction operations. The specific function is the addition or subtraction of matrix A and matrix C, i.e., realizing C_{M×K} = A_{M×K} × I_{K×K} ± C_{M×K}, where data stream A and data stream C are input data streams and data stream B is the introduced identity matrix I, which can be generated directly on chip by a counter.
When this embodiment performs the matrix addition or matrix subtraction operation, the blocked form is

C_{S_p×K} = A_{S_p×K} × I_{K×K} ± C_{S_p×K}

where the size of sub-matrix A is S_p × K, the size of sub-matrix B (the identity block) is K × K, and the size of sub-matrix C is S_p × K. S_p denotes the number of rows of sub-matrix A and also the number of linear-array units; K is the number of columns of sub-matrix A, the number of rows and columns of sub-matrix B, and also the number of iterations.
The specific calculation process is as follows: preload all elements of sub-matrix C and a column of elements of sub-matrix A (S_p of them); then stream the elements of sub-matrix B row by row into the linear-array computing units, completing the c[CNT] updates in a pipelined manner with a rewrite mechanism and storing intermediate results in local memory; after K iterations, transfer the result sub-matrix C to off-chip memory and proceed to the calculation of the next sub-block, repeating until done.
As shown in FIG. 5, this embodiment describes the matrix-vector multiplication operation. The specific function is the product of matrix A and vector B, i.e., realizing C_{M×1} = A_{M×K} × B_{K×1} + C_{M×1}, where data stream A and data stream B are input data streams and the value of data stream C is initialized to zero directly on chip.
When this embodiment performs matrix-vector multiplication, data stream B becomes a vector, and the blocked form is

C_{S_p×1} = A_{S_p×K} × B_{K×1} + C_{S_p×1}

where the size of sub-matrix A is S_p × K, the size of sub-vector B is K × 1, and the size of sub-vector C is S_p × 1. S_p denotes the number of rows of sub-matrix A and also the number of linear-array units; K is the number of columns of sub-matrix A.
The specific calculation process is as follows: initialize sub-vector block C to zero and preload a column of elements of sub-matrix A (S_p of them); then stream the elements of sub-vector B into the linear-array computing units, completing the c[CNT] updates in a pipelined manner with a rewrite mechanism and storing intermediate results in local memory; after K iterations, transfer the result sub-vector block C to off-chip memory and proceed to the calculation of the next sub-block, repeating until done.
As shown in FIG. 6, this embodiment describes the matrix-scalar multiplication operation, which refers to the product of matrix A and scalar b, i.e., realizing C_{M×K} = A_{M×K} × b, where data stream A and data stream B are input data streams and data stream C is initialized to zero directly on chip.
When this embodiment performs matrix-scalar multiplication, data stream B becomes a scalar, and the blocked form is

C_{S_p×K} = A_{S_p×K} × b

where the size of sub-matrix A is S_p × K and the size of sub-matrix C is S_p × K. S_p denotes the number of rows of sub-matrix A and also the number of linear-array units; K is the number of columns of sub-matrix A.
The specific calculation process is as follows: initialize sub-matrix C to zero and preload a column of elements of sub-matrix A (S_p of them); then stream scalar b into the linear-array computing units, complete a single c[CNT] update, transfer the result sub-matrix C to off-chip memory, and proceed to the calculation of the next sub-block, repeating until done.
Therefore, the invention can complete matrix operations in multiple modes, has a wide application range, and is flexible in use. Compared with traditional approaches based on multiple accelerators, each operation is based on a unified hardware structure, giving a high degree of logic reuse, low resource consumption, and a small area.
The above is only a preferred embodiment of the present invention, and the scope of protection of the present invention is not limited to the above embodiment; all technical solutions within the spirit of the invention fall within its scope of protection. It should be noted that those skilled in the art may make modifications and refinements without departing from the principle of the invention, and these shall also be regarded as falling within the scope of protection of the present invention.

Claims (10)

1. A method for accelerating large-scale multi-operation floating-point matrix calculation, characterized by comprising the following steps:
step S1: receiving an external input signal indicating the operation type of the matrix to be processed and determining the matrix operation mode: when the mode is matrix addition or matrix subtraction, proceeding to step S3; when the mode is matrix multiplication, matrix-vector multiplication, or matrix-scalar multiplication, proceeding to step S2;
step S2: initializing the on-chip RAM to zero, then executing step S4;
step S3: loading data source C into the on-chip RAM through the RAM channel, then executing step S4;
step S4: preloading part of data stream A through the RAM channel, then loading and computing on data stream A and data stream B simultaneously;
step S5: after the calculation completes, transferring the calculation result to off-chip memory.
2. The method for accelerating large-scale multi-operation floating-point matrix calculation according to claim 1, wherein matrix multiplication is the product of matrix A and matrix B, i.e., realizing C_{M×N} = A_{M×K} × B_{K×N} + C_{M×N}, where data stream A and data stream B are input data streams and data stream C is initialized to zero directly on chip.
3. The method for accelerating large-scale multi-operation floating-point matrix calculation according to claim 1, wherein matrix addition and matrix subtraction refer to the addition or subtraction of matrix A and matrix C, i.e., realizing C_{M×K} = A_{M×K} × I_{K×K} ± C_{M×K}, where data stream A and data stream C are input data streams and data stream B is the introduced identity matrix I, generated directly on chip by a counter.
4. The method for accelerating large-scale multi-operation floating-point matrix calculation according to claim 1, wherein matrix-vector multiplication is the product of matrix A and vector B, i.e., realizing C_{M×1} = A_{M×K} × B_{K×1} + C_{M×1}, where data stream A and data stream B are input data streams and data stream C is initialized to zero directly on chip.
5. The method for accelerating large-scale multi-operation floating-point matrix calculation according to claim 1, wherein matrix-scalar multiplication is the product of matrix A and scalar b, i.e., realizing C_{M×K} = A_{M×K} × b, where data stream A and data stream B are input data streams and data stream C is initialized to zero directly on chip.
6. The method for accelerating large-scale multi-operation floating-point matrix calculation according to any one of claims 1 to 5, wherein the data sources comprise data source A, data source B, and data source C, where data source A refers to matrix A, data source B refers to matrix B, vector B, or scalar b, and data source C refers to the matrix or vector participating in accumulation, stored on chip.
7. The method for accelerating large-scale multi-operation floating-point matrix calculation according to any one of claims 1 to 5, wherein a block-matrix method is adopted, comprising the following steps:
before step S1, calculating the block size according to the scale of the matrix to be processed, and then processing the sub-blocks one by one;
after step S5, continuing with the calculation of the next sub-matrix block, i.e., repeating steps S2-S5 until all sub-blocks are computed.
8. A device for accelerating large-scale multi-operation floating-point matrix calculation, characterized by comprising:
a preprocessing module, for partitioning matrices and vectors, receiving an external input signal indicating the operation type of the matrix to be processed, and determining the matrix operation mode: when the mode is matrix addition or matrix subtraction, loading data source C into the on-chip RAM through the RAM channel; when the mode is matrix multiplication, matrix-vector multiplication, or matrix-scalar multiplication, initializing the on-chip RAM to zero, preloading part of data stream A through the RAM channel, and then loading and computing on data stream A and data stream B;
a data transmission control module, for loading each data source and returning the calculation results;
a matrix calculation acceleration unit module, a linear array structure composed of multiple identical basic computation units (PEs), for realizing the matrix operations.
9. The apparatus of claim 8, wherein the basic computation unit PE comprises a floating-point computation unit, two registers, a FIFO and a dual-port RAM.
10. The apparatus of claim 8, wherein the floating-point calculation unit is an IP core conforming to the IEEE 754 standard, integrating multiply-add and multiply-subtract operations and selecting between them according to an external operation-mode signal during calculation.
Priority Applications (1)

Application number: CN202111283133.0A; priority date: 2021-11-01; filing date: 2021-11-01; title: Large-scale multi-operation floating-point matrix calculation acceleration implementation method and device

Publications (1)

Publication number: CN114218524A; publication date: 2022-03-22

Family ID: 80696378

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination