CN109284475B - Matrix convolution calculating device and matrix convolution calculating method - Google Patents

Matrix convolution calculating device and matrix convolution calculating method

Info

Publication number
CN109284475B
CN109284475B (application number CN201811101509.XA)
Authority
CN
China
Prior art keywords
registers
matrix
multipliers
group
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811101509.XA
Other languages
Chinese (zh)
Other versions
CN109284475A (en)
Inventor
满宏涛
王振江
李拓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhengzhou Yunhai Information Technology Co Ltd
Original Assignee
Zhengzhou Yunhai Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhengzhou Yunhai Information Technology Co Ltd filed Critical Zhengzhou Yunhai Information Technology Co Ltd
Priority to CN201811101509.XA priority Critical patent/CN109284475B/en
Publication of CN109284475A publication Critical patent/CN109284475A/en
Application granted
Publication of CN109284475B publication Critical patent/CN109284475B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/15Correlation function computation including computation of convolution operations
    • G06F17/153Multidimensional correlation or convolution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Abstract

The matrix convolution calculation device is provided with (m-1) memories that form a first-in first-out storage structure among themselves, so that when input data stored in an external storage device are read, m rows of input data do not need to be read from the external storage device simultaneously; instead, the input data are read sequentially, row by row. Using the matrix convolution calculation device provided by the embodiments of the present application therefore reduces the requirements on the interface bandwidth of the external storage device and on the number of interfaces of the FPGA chip, and the device has strong applicability.

Description

Matrix convolution calculating device and matrix convolution calculating method
Technical Field
The present application relates to the field of electronic technologies, and in particular to a device and a method for calculating a matrix convolution based on a field programmable gate array (FPGA).
Background
With the development of science and technology, convolutional neural networks (CNN) are being applied more and more widely. A CNN is a multilayer neural network in which the convolution layer is an important component; the core operation of the convolution layer is the convolution of the input data with a convolver. The input data can be represented as an input matrix, the convolver as a convolver matrix, and the convolution output as an output matrix.
If the input matrix is A, an M × N matrix with elements A(i,j), and the convolver matrix is K, an m × n matrix with elements K(u,v), then in general m ≤ M and n ≤ N. The convolution output Y is:

Y(i,j) = Σ_{u=0}^{m-1} Σ_{v=0}^{n-1} K(u,v)·A(i+u, j+v), where 0 ≤ i ≤ M-m and 0 ≤ j ≤ N-n.

This formula computes one output point of the output matrix: the convolver is multiplied point by point with the input data at the corresponding positions and the products are accumulated. Different output points are obtained by changing the position of the convolver relative to the input data. To complete the whole convolution operation, the convolver moves from left to right, producing one output point for every step of one grid; after one row of outputs is finished, the convolver moves down by one row and again moves from left to right, and so on until the last output point of the last row has been computed.
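As a point of reference, the convolution defined above amounts to the following software sketch (the names convolve2d, A, K and Y are this description's notation, not identifiers from the patent):

    def convolve2d(A, K):
        M, N = len(A), len(A[0])              # input matrix dimensions
        m, n = len(K), len(K[0])              # convolver matrix dimensions
        Y = [[0] * (N - n + 1) for _ in range(M - m + 1)]
        for i in range(M - m + 1):            # convolver moves down one row at a time
            for j in range(N - n + 1):        # and from left to right within a row
                Y[i][j] = sum(K[u][v] * A[i + u][j + v]
                              for u in range(m) for v in range(n))
        return Y

Each output element is the multiply-accumulate of the convolver with the input window at the corresponding position, exactly as described above.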
Implementing the matrix convolution operation with an FPGA has certain advantages: the matrix convolution operation involves a large number of multiply-accumulate operations, so the resources of the FPGA can be fully utilized, and the parallelism of the FPGA also greatly improves the operation speed.
Referring to fig. 1, which is a schematic diagram of implementing a matrix convolution operation with an FPGA in the prior art. Fig. 1 takes a 5 × 5 input matrix and a 3 × 3 convolver matrix as an example and shows how the first element Y(0,0) of the output matrix is calculated. As can be seen from fig. 1, 9 registers 110, 9 multipliers 120 and an addition tree are involved in computing the output matrix from the input matrix and the convolver. The process of computing Y(0,0) is as follows: the 1st, 2nd and 3rd rows of input data are read simultaneously, and each row is fed in sequence into one of three input ports, i.e. the first row of data enters input port 101, the second row enters input port 102 and the third row enters input port 103. After 3 clock cycles the data held in the registers are as shown in fig. 1 and the multiplication starts. The outputs of the multipliers at this time are, listed in the order of the multipliers from top to bottom as shown:

K(0,2)·A(0,2), K(0,1)·A(0,1), K(0,0)·A(0,0), K(1,2)·A(1,2), K(1,1)·A(1,1), K(1,0)·A(1,0), K(2,2)·A(2,2), K(2,1)·A(2,1), K(2,0)·A(2,0).

The outputs of the multipliers enter the addition tree and are summed, which yields the value of Y(0,0).
It will be appreciated that, as data continue to be input, the values in the registers on the next clock cycle are as shown in fig. 2, and the outputs of the multipliers are then, listed in the order of the multipliers from top to bottom as shown:

K(0,2)·A(0,3), K(0,1)·A(0,2), K(0,0)·A(0,1), K(1,2)·A(1,3), K(1,1)·A(1,2), K(1,0)·A(1,1), K(2,2)·A(2,3), K(2,1)·A(2,2), K(2,0)·A(2,1).

The outputs of the multipliers enter the addition tree and are summed, which yields the value of Y(0,1). When the 1st, 2nd and 3rd rows of data have all been input, the values of all elements in the first row of the output matrix have been obtained.
Similarly to the calculation of the values of the elements in the 1st row of the output matrix, when the values of the elements in the 2nd row of the output matrix are calculated, the 2nd, 3rd and 4th rows of the input data are read simultaneously, and each row of data is fed in sequence into input port 101, input port 102 and input port 103 respectively. The method for calculating the values of the elements in row 2 of the output matrix is the same as for row 1 and is not repeated here. It will be appreciated that the calculation of the entire output matrix is complete only once the last three rows of input data have been read.
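For concreteness, the prior-art dataflow of figs. 1 and 2 can be sketched in software as follows; this is a reconstruction under the assumptions of the description above (the name prior_art_output_row and the list-based register model are illustrative): the m rows read simultaneously are each streamed through a chain of n cascaded registers, and once the registers are full the addition tree sums the m x n products on every clock cycle, producing one output element per cycle.

    def prior_art_output_row(rows, K):
        # rows: the m input rows that are read simultaneously (each a list of length N);
        # K: the m x n convolver matrix. Returns one row of the output matrix.
        m, n = len(K), len(K[0])
        N = len(rows[0])
        regs = [[0] * n for _ in range(m)]        # m groups of n cascaded registers
        out_row = []
        for t in range(N):                        # one new element per input row per clock
            for g in range(m):                    # shift: register j takes register (j-1)'s value
                regs[g] = [rows[g][t]] + regs[g][:-1]
            if t >= n - 1:                        # registers full: addition-tree output is valid
                out_row.append(sum(K[g][n - 1 - p] * regs[g][p]
                                   for g in range(m) for p in range(n)))
        return out_row

For the 5 × 5 example, feeding the 1st, 2nd and 3rd input rows yields the first row of the output matrix, and feeding the 2nd, 3rd and 4th rows yields the second row, which is exactly the scheme whose bandwidth cost is discussed next.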
It can be understood that, in practical applications, the input/output (IO) interface resources of an FPGA are limited, and the above scheme for implementing matrix convolution with an FPGA requires 3 rows of input data to be read from the external storage device simultaneously, which places high demands on the interface bandwidth of the external storage device and on the number of interfaces of the FPGA chip. In fact, the number of rows of data read simultaneously from the external storage device is determined by the dimension of the convolver matrix: if the convolver matrix is an m × n matrix, then m rows of input data need to be read from the external storage device simultaneously. It can be understood that the larger m is, the higher the requirements on the interface bandwidth of the external storage device and on the number of interfaces of the FPGA chip, so the applicability of the above scheme in practical applications is not strong.
In view of the above, a solution is needed to solve the above problems.
Disclosure of Invention
The technical problem to be solved by the present application is that the matrix convolution operation implemented based on an FPGA in the prior art has poor applicability; a matrix convolution calculating device and a matrix convolution calculating method are therefore provided.
In a first aspect, an embodiment of the present application provides a matrix convolution calculation device implemented based on an FPGA, where the input matrix is an M × N matrix, the convolver matrix is an m × n matrix, M, N, m and n are positive integers, m is less than or equal to M, and n is less than or equal to N; the convolution calculation device includes: m x n registers, m x n multipliers, an addition tree and (m-1) memories;
the output ends of the m x n multipliers are connected to the input end of the addition tree;
the input to any one of the m x n multipliers is an element of the convolver matrix and the value stored in the corresponding register; the m x n multipliers are organized as m groups of multipliers, each group of multipliers comprising n multipliers; the m x n registers are organized as m groups of registers, each group of registers comprising n registers; and the m x n registers correspond to the m x n multipliers one to one;
among the n registers corresponding to the i-th group of multipliers in the m groups of multipliers, the element of the convolver matrix corresponding to the first register is the element in row (i-1), column (n-1) of the convolver matrix, and the element of the convolver matrix corresponding to the p-th register is the element in row (i-1), column (n-p) of the convolver matrix, where p is an integer greater than or equal to 2 and less than or equal to n, and i is an integer greater than or equal to 1 and less than or equal to m;
the (m-1) memories correspond one to one to (m-1) groups of registers among the m groups of registers, and the output end of a first memory is connected to the input end of the first register in a first group of registers, where the first memory is any one of the (m-1) memories and the first group of registers is the group of registers corresponding to the first memory; the input end of the one group of registers, among the m groups of registers, that does not correspond to any of the (m-1) memories is connected to the output end of an external storage device for storing the input data, where the input end of a group of registers refers to the input end of the first register in that group;
the storage size of each of the (m-1) memories is N;
the n registers in any one of the m groups of registers are cascaded: when a clock edge arrives, the value of the j-th register among the n registers is updated to the value of the (j-1)-th register, and the value of the first register among the n registers is the value read from the corresponding memory, where j is an integer greater than 1 and less than or equal to n;
the (m-1) memories form a first-in first-out storage structure among themselves, with the data output end of the k-th memory connected to the data input end of the (k-1)-th memory, where k is an integer greater than 1 and less than or equal to (m-1).
Optionally, each memory is a first-in first-out (FIFO) queue inside the FPGA or a random access memory (RAM) inside the FPGA.
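As an orientation aid (not part of the claimed subject matter), the resource and interface counts implied by the first aspect can be summarized with a small helper; the function name and dictionary keys are illustrative only.

    def convolution_device_resources(m, n, N):
        # Resource summary for the device of the first aspect, for an m x n convolver
        # matrix and an input matrix with N columns.
        return {
            "registers": m * n,                 # m groups of n cascaded registers
            "multipliers": m * n,               # one multiplier per register
            "addition_tree_inputs": m * n,      # every multiplier output feeds the tree
            "memories": m - 1,                  # FIFO-chained line buffers
            "memory_depth": N,                  # each memory holds one input row
            "rows_read_simultaneously": 1,      # versus m rows in the prior-art scheme
        }

For example, a 3 × 3 convolver matrix over an input matrix with 5 columns gives 9 registers, 9 multipliers, 2 memories of depth 5, and a single row-by-row input stream.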
In a second aspect, an embodiment of the present application provides a method for implementing matrix convolution by using the matrix convolution calculation device of any one of the implementations of the first aspect, where the input matrix is an M × N matrix, the convolver matrix is an m × n matrix, M, N, m and n are positive integers, m is less than or equal to M, and n is less than or equal to N; the method includes:
reading input data from the external storage device and, if a data-full signal of the first memory is detected, starting the multipliers after waiting n clock cycles.
Compared with the prior art, the embodiment of the application has the following advantages:
the embodiment of the application provides a matrix convolution calculating device based on FPGA realizes, includes: m registers, m multipliers, an addition tree and (m-1) memories; the output ends of the m x n multipliers are connected to the input end of the addition tree; the input of any multiplier in the m x n multipliers is an element in a convolver matrix and a value stored in the register; the m x n multipliers are embodied as m groups of multipliers, and one group of multipliers comprises n multipliers; the m × n registers are embodied as m groups of registers, and each group of registers comprises n registers; the m × n registers correspond to the m × n multipliers one by one; in n registers corresponding to an ith group of multipliers in the m groups of multipliers, elements in a convolver corresponding to a first register in the n registers corresponding to the ith group of multipliers are: the (i-1) th row and the (n-1) th column of the convolver, and the element in the convolver corresponding to the p-th register in the n registers corresponding to the i-th group of multipliers is as follows: the data of the (i-1) th row and the (n-p) th column in the convolver, wherein p is an integer which is more than or equal to 2 and less than or equal to n, and i is an integer which is more than or equal to 1 and less than or equal to m; the (m-1) memories are in one-to-one correspondence with (m-1) groups of registers in the m groups of registers, and the output end of the first memory is connected with the input end of a first register in the first group of registers; wherein the first memory is any one of the (m-1) memories, and the first group of registers is a group of registers corresponding to the first memory; the input ends of a group of registers except the (m-1) group of registers in the m groups of registers are connected with the output end of an external storage device for storing input data; wherein, the input end of the group of registers refers to the input end of the first register in the group of registers; the storage size of the (m-1) memories is the N; n registers in any group of registers in the m groups of registers are in cascade connection, when a clock arrives, the value of the jth register in the n registers is updated to the value of the (j-1) th register, the value of the first register in the n registers is the value read from the memory, wherein j is an integer which is less than or equal to n and is greater than 1; a first-in first-out storage structure is formed among the (m-1) storages, the data output end of the kth storage is connected with the data input end of the (k-1) storage, and k is smaller than or equal to the (m-1).
That is to say, in the embodiment of the present application, (m-1) memories are provided and form a first-in first-out storage structure among themselves, so that when the input data stored in the external storage device are read, it is not necessary to read m rows of input data from the external storage device simultaneously; the input data are read sequentially, row by row. Therefore, the matrix convolution calculating device provided by the embodiment of the present application reduces the requirements on the interface bandwidth of the external storage device and on the number of interfaces of the FPGA chip, and has strong applicability.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments described in the present application, and other drawings can be obtained from them by those skilled in the art without creative effort.
FIG. 1 is a schematic diagram of a prior art implementation of matrix convolution operations using an FPGA;
FIG. 2 is a schematic diagram of a prior art implementation of a matrix convolution operation using an FPGA;
fig. 3 is a schematic structural diagram of a matrix convolution calculating apparatus according to an embodiment of the present application;
FIG. 4 is a schematic diagram illustrating movement of data stored in a memory when data is read from an external storage device according to an embodiment of the present disclosure;
fig. 5 is a schematic diagram illustrating a matrix convolution operation implemented by an FPGA according to an embodiment of the present application;
fig. 6 is a further schematic diagram of implementing a matrix convolution operation with an FPGA according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings of the embodiments. It is obvious that the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present application.
The inventors of the present application have found through research that the prior-art scheme of implementing matrix convolution with an FPGA requires several rows of input data to be read from the external storage device simultaneously, which places high demands on the interface bandwidth of the external storage device and on the number of interfaces of the FPGA chip. Specifically, the number of rows of data read simultaneously from the external storage device is determined by the dimension of the convolver matrix: if the convolver matrix is an m × n matrix, m rows of input data need to be read from the external storage device simultaneously. It can be understood that the larger m is, the higher these requirements become, so the applicability of that scheme in practical applications is not strong.
In order to solve the above problem, an embodiment of the present application provides a matrix convolution calculation device implemented based on an FPGA, including: m x n registers, m x n multipliers, an addition tree and (m-1) memories. The output ends of the m x n multipliers are connected to the input end of the addition tree. The input to any one of the m x n multipliers is an element of the convolver matrix and the value stored in the corresponding register; the m x n multipliers are organized as m groups of multipliers, each group comprising n multipliers; the m x n registers are organized as m groups of registers, each group comprising n registers; and the m x n registers correspond to the m x n multipliers one to one. Among the n registers corresponding to the i-th group of multipliers, the element of the convolver matrix corresponding to the first register is the element in row (i-1), column (n-1) of the convolver matrix, and the element corresponding to the p-th register is the element in row (i-1), column (n-p) of the convolver matrix, where p is an integer greater than or equal to 2 and less than or equal to n, and i is an integer greater than or equal to 1 and less than or equal to m. The (m-1) memories correspond one to one to (m-1) groups of registers among the m groups of registers, and the output end of a first memory is connected to the input end of the first register in a first group of registers, where the first memory is any one of the (m-1) memories and the first group of registers is the group of registers corresponding to the first memory; the input end of the one group of registers that does not correspond to any of the (m-1) memories is connected to the output end of an external storage device for storing the input data, where the input end of a group of registers refers to the input end of the first register in that group. The storage size of each of the (m-1) memories is N. The n registers in any one of the m groups of registers are cascaded: when a clock edge arrives, the value of the j-th register is updated to the value of the (j-1)-th register, and the value of the first register is the value read from the corresponding memory, where j is an integer greater than 1 and less than or equal to n. The (m-1) memories form a first-in first-out storage structure among themselves, with the data output end of the k-th memory connected to the data input end of the (k-1)-th memory, where k is an integer greater than 1 and less than or equal to (m-1).
That is to say, in the embodiment of the present application, (m-1) memories are provided and form a first-in first-out storage structure among themselves, so that when the input data stored in the external storage device are read, it is not necessary to read m rows of input data from the external storage device simultaneously; the input data are read sequentially, row by row. Therefore, the matrix convolution calculating device provided by the embodiment of the present application reduces the requirements on the interface bandwidth of the external storage device and on the number of interfaces of the FPGA chip, and has strong applicability.
Various non-limiting embodiments of the present application are described in detail below with reference to the accompanying drawings.
Referring to fig. 3, the diagram is a schematic structural diagram of a matrix convolution calculation apparatus according to an embodiment of the present application.
The matrix convolution calculating device 300 according to the embodiment of the present application may be used to calculate the convolution of an input matrix, which is an M × N matrix, with a convolver matrix, which is an m × n matrix, wherein M, N, m and n are all positive integers, m is less than or equal to M, and n is less than or equal to N.
The convolution calculating device 300 includes: m x n registers 301, m x n multipliers 302, an addition tree 303 and (m-1) memories 304. Fig. 3 takes a 3 × 3 convolver matrix as an example, so the device shown includes 9 registers 301, 9 multipliers 302, an addition tree 303 and 2 memories 304, namely memory 1 and memory 2.
The output ends of the m x n multipliers are connected to the input end of the addition tree 303; it will be understood that the output of the addition tree 303 is the value of one element of the output matrix obtained by convolving the input matrix with the convolver matrix.
The input to any one of the m x n multipliers is an element of the convolver matrix and the value stored in the corresponding register; the m x n multipliers are organized as m groups of multipliers, each group comprising n multipliers; the m x n registers are organized as m groups of registers, each group comprising n registers; and the m x n registers correspond to the m x n multipliers one to one. As shown in fig. 3, the 3 x 3 registers are organized as 3 groups of registers, each group including 3 registers, and the 3 x 3 registers correspond to the 3 x 3 multipliers one to one.
Among the n registers corresponding to the i-th group of multipliers in the m groups of multipliers, the element of the convolver matrix corresponding to the first register is the element in row (i-1), column (n-1) of the convolver matrix, and the element corresponding to the p-th register is the element in row (i-1), column (n-p) of the convolver matrix, where p is an integer greater than or equal to 2 and less than or equal to n, and i is an integer greater than or equal to 1 and less than or equal to m. This can be understood in connection with fig. 3: the 1st group of multipliers consists of the first 3 multipliers from top to bottom in fig. 3; the element of the convolver matrix corresponding to its first register is K(0,2), the element in row 0, column 2 of the convolver matrix, and the element corresponding to the 3rd register of the 3 registers corresponding to the 1st group of multipliers is K(0,0), the element in row 0, column 0 of the convolver matrix.
The (m-1) memories 304 correspond one to one to (m-1) groups of registers among the m groups of registers, and the output end of a first memory is connected to the input end of the first register in a first group of registers, where the first memory is any one of the (m-1) memories and the first group of registers is the group of registers corresponding to the first memory. As can be understood in connection with fig. 3, memory 1 corresponds to the 1st group of registers, and the output end of memory 1 is connected to the input end of the first register of that group of 3 registers (i.e., the first register from top to bottom in fig. 3); memory 2 corresponds to the 2nd group of registers, and the output end of memory 2 is connected to the input end of the first register of that group of 3 registers (i.e., the fourth register from top to bottom in fig. 3).
The input end of the one group of registers, among the m groups of registers, that does not correspond to any of the (m-1) memories is connected to the output end of the external storage device for storing the input data; here, the input end of a group of registers refers to the input end of the first register in that group. As can be understood in connection with fig. 3, the seventh to ninth registers from top to bottom in fig. 3 do not correspond to any memory; in the embodiment of the present application, the input end of the first register of these three registers (i.e., the seventh register from top to bottom in fig. 3), which is the input end of the group of registers they form, is connected to the output end of the external storage device for storing the input data.
In the embodiment of the present application, the storage size of each of the (m-1) memories is N; that is, each memory can store one row of data of the input matrix.
In the embodiment of the present application, the n registers in any one of the m groups of registers are cascaded: when a clock edge arrives, the value of the j-th register among the n registers is updated to the value of the (j-1)-th register, and the value of the first register among the n registers is the value read from the corresponding memory, where j is an integer greater than 1 and less than or equal to n.
It should be noted that the clock mentioned herein may refer to a system clock of the FPGA, or may be a clock obtained by frequency multiplication or frequency division according to the system clock.
It will be appreciated that the value of the j-th register is therefore the value of the (j-1)-th register delayed by one cycle of the clock signal.
The (m-1) memories form a first-in first-out storage structure among themselves: the data output end of the k-th memory is connected to the data input end of the (k-1)-th memory, where k is an integer greater than 1 and less than or equal to (m-1). It can be understood that the data in the (m-1) memories move as data are read from the external storage device. This can be understood with reference to fig. 4, which is a schematic diagram illustrating the movement of the data stored in the (m-1) memories when data are read from the external storage device; fig. 4 shows how external data are stored into the memories 401 and 402. For ease of understanding, fig. 4 illustrates only 6 input data values, which does not limit the embodiments of the present application.
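The data movement of fig. 4 can be sketched as follows. This is a behavioural model under the assumptions of this description, not an implementation (the name step_fifo_chain is illustrative): each memory behaves as a depth-N line buffer, so the tap of memory (m-1) lags the external stream by one row, the tap of memory (m-2) by two rows, and so on, which is what lets a single row-by-row stream feed all m register groups.

    from collections import deque

    def step_fifo_chain(mems, sample, depth):
        # Push one sample read from external storage into the chain of line-buffer
        # memories. mems[0] is memory 1 (feeding the top register group) and mems[-1]
        # is memory (m-1), which is fed directly by external storage. Returns the
        # values presented to register groups 1 .. m on this clock cycle (None where
        # a memory is not yet full and therefore has no output).
        incoming = sample
        taps = [sample]                        # the external sample feeds the bottom group
        for mem in reversed(mems):             # memory (m-1) down to memory 1
            out = mem[0] if len(mem) == depth else None
            if incoming is not None:
                mem.append(incoming)           # deque(maxlen=depth) drops the value read out
            taps.insert(0, out)
            incoming = out                     # the value read out feeds the next memory up
        return taps

    # Example with m = 3 and a 5-column input, elements numbered 0..24 in row order:
    mems = [deque(maxlen=5) for _ in range(2)]
    for x in range(25):
        taps = step_fifo_chain(mems, x, depth=5)
    # Once both memories are full, taps equals [x - 10, x - 5, x]: the same column of
    # three consecutive input rows, obtained from a single row-by-row stream.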
It should be noted that, in the embodiment of the present application, each memory may be a first-in first-out (FIFO) queue inside the FPGA or a random access memory (RAM) inside the FPGA.
The embodiment of the present application does not particularly limit the external storage device; the external storage device may be, for example, an external memory.
In the embodiment of the present application, (m-1) memories are arranged and form a first-in first-out storage structure among themselves, so that when the input data stored in the external storage device are read, m rows of input data do not need to be read from the external storage device simultaneously, and the input data can be read sequentially, row by row. Therefore, the matrix convolution calculating device provided by the embodiment of the present application reduces the requirements on the interface bandwidth of the external storage device and on the number of interfaces of the FPGA chip, and has strong applicability.
The above describes a matrix convolution calculation apparatus, and the following describes a method of performing matrix convolution calculation using the matrix convolution calculation apparatus.
First, input data are read from the external storage device. As can be seen from the structure of the matrix convolution calculation device above, the (m-1) memories gradually fill up as more input data are read. It can be understood that, since one memory can store one row of data of the input matrix, when the data-full signal of the first memory among the (m-1) memories is detected, this indicates that (m-1) rows of data of the input matrix have been stored in the first to (m-1)-th memories. In the embodiment of the present application, if the data-full signal of the first memory is detected, the multipliers are started after waiting n clock cycles, and the calculation of the matrix convolution of the input matrix and the convolver matrix begins.
The following describes, with reference to the accompanying drawings, a method for implementing matrix convolution according to an embodiment of the present application, by taking an input matrix as a (5 × 5) matrix and a convolver matrix as a (3 × 3) matrix as an example.
At the initial stage the multipliers are not yet started. The 1st row of data of the input matrix is read from the external storage device and, according to the matrix convolution calculation device shown in fig. 3, is written into memory 2. Then the 2nd row of data of the input matrix continues to be read from the external storage device; according to fig. 3, the 2nd row of data is written into memory 2 while the data in memory 2 are written into memory 1. When both memory 2 and memory 1 are full, the 3rd row of data continues to be read from the external storage device; after waiting 3 clock cycles, the multipliers are started. At this time, memory 2 outputs the data of the 2nd row and memory 1 outputs the data of the 1st row. After three clock cycles the values of the registers are as shown in fig. 5, at which time the first element Y(0,0) of the output matrix can be calculated; after four clock cycles the values of the registers are as shown in fig. 6, at which time the second element Y(0,1) of the output matrix can be calculated. By analogy, the values of all elements of the first row of the output matrix can be calculated.
It will be appreciated that, during the above process of calculating all the element values of the first row of the output matrix, the data in memory 2 are replaced by the 3rd row and the data in memory 1 by the 2nd row, at which point the 4th row can continue to be read from the external storage device. The data at the inputs of the multipliers are then updated to the data of the 2nd, 3rd and 4th rows, so that the values of the elements of the second row of the output matrix can be calculated. It will be appreciated that the entire matrix convolution calculation is complete once the last row of input data has been read and processed.
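Putting the pieces of this walkthrough together, the behaviour of the device of fig. 3 can be sketched end to end as follows. This is an assumption-level model of the description above, not an implementation: the input is streamed row by row in a single pass, the (m-1) depth-N memories form the FIFO chain of line buffers, each register group holds the n most recent samples of its (delayed) stream, and the multipliers start n clock cycles after the first memory signals full. The helper name line_buffer_convolve and the coordinate bookkeeping are illustrative, and the sketch assumes m is at least 2.

    from collections import deque

    def line_buffer_convolve(A, K):
        M, N = len(A), len(A[0])
        m, n = len(K), len(K[0])
        stream = [x for row in A for x in row]            # single row-by-row read pass
        mems = [deque(maxlen=N) for _ in range(m - 1)]    # memory 1 .. memory (m-1)
        regs = [[0] * n for _ in range(m)]                # m groups of n cascaded registers
        Y = [[0] * (N - n + 1) for _ in range(M - m + 1)]
        started, start_t = False, 0
        for t, x in enumerate(stream):
            # FIFO chain: the new sample enters memory (m-1); every memory that is
            # already full passes its oldest sample on to the next memory up the chain.
            incoming, taps = x, [x]                       # the last tap feeds register group m
            for mem in reversed(mems):                    # memory (m-1) down to memory 1
                out = mem[0] if len(mem) == N else None
                if incoming is not None:
                    mem.append(incoming)
                taps.insert(0, out)                       # tap feeding this memory's register group
                incoming = out
            # Register cascade: register j takes register (j-1)'s value, and register 1
            # takes the value read from its memory (or from external storage).
            for g in range(m):
                if taps[g] is not None:
                    regs[g] = [taps[g]] + regs[g][:-1]
            # Start condition: once memory 1 reports full, wait n further clock cycles.
            if not started and len(mems[0]) == N:
                started, start_t = True, t + n
            if started and t >= start_t:
                # The newest value in the top register group belongs to input element
                # (i, j + n - 1); windows that straddle a row boundary are discarded.
                i, j = divmod(t - (m - 1) * N - (n - 1), N)
                if 0 <= j <= N - n:
                    Y[i][j] = sum(K[g][n - 1 - p] * regs[g][p]
                                  for g in range(m) for p in range(n))
        return Y

For a 5 × 5 input matrix and a 3 × 3 convolver matrix this returns the same 3 × 3 output matrix as a direct evaluation of the convolution formula in the background section, while only one row of input data is ever being read at a time.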
According to the above scheme, the method provided by the embodiment of the present application reduces the requirements on the interface bandwidth of the external storage device and on the number of interfaces of the FPGA chip, and has strong applicability. The reading order of the input data is simple: the input data are stored in the external storage device in row order and are read in that order. In the design of the present application, by cascading several RAMs or FIFOs, all input data can be reused iteratively, and a single read pass satisfies the requirements of the entire matrix convolution operation. This avoids the problems of the prior art, in which, when the FPGA calculates the matrix convolution, the input data stored in the external storage device must support simultaneous reading of any 3 adjacent rows, the requirements on the data storage order and format are high, data management is very complex, each row of data is read repeatedly 3 times, and the reading efficiency of the input data is low.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice in the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the application is limited only by the appended claims.
The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (3)

1. A matrix convolution calculating device realized based on an FPGA (field programmable gate array), wherein an input matrix is an M × N matrix, a convolver matrix is an m × n matrix, M, N, m and n are positive integers, m is less than or equal to M, and n is less than or equal to N; wherein said convolution calculating device comprises: m x n registers, m x n multipliers, an addition tree and (m-1) memories;
the output ends of the m x n multipliers are connected to the input end of the addition tree;
the input to any one of the m x n multipliers is an element of the convolver matrix and the value stored in the corresponding register; the m x n multipliers are organized as m groups of multipliers, each group of multipliers comprising n multipliers; the m x n registers are organized as m groups of registers, each group of registers comprising n registers; and the m x n registers correspond to the m x n multipliers one to one;
among the n registers corresponding to the i-th group of multipliers in the m groups of multipliers, the element of the convolver matrix corresponding to the first register is the element in row (i-1), column (n-1) of the convolver matrix, and the element of the convolver matrix corresponding to the p-th register is the element in row (i-1), column (n-p) of the convolver matrix, where p is an integer greater than or equal to 2 and less than or equal to n, and i is an integer greater than or equal to 1 and less than or equal to m;
the (m-1) memories correspond one to one to (m-1) groups of registers among the m groups of registers, and the output end of a first memory is connected to the input end of the first register in a first group of registers, where the first memory is any one of the (m-1) memories and the first group of registers is the group of registers corresponding to the first memory; the input end of the one group of registers, among the m groups of registers, that does not correspond to any of the (m-1) memories is connected to the output end of an external storage device for storing the input data, where the input end of a group of registers refers to the input end of the first register in that group;
the storage size of each of the (m-1) memories is N;
the n registers in any one of the m groups of registers are cascaded: when a clock edge arrives, the value of the j-th register among the n registers is updated to the value of the (j-1)-th register, and the value of the first register among the n registers is the value read from the corresponding memory, where j is an integer greater than 1 and less than or equal to n;
the (m-1) memories form a first-in first-out storage structure among themselves, with the data output end of the k-th memory connected to the data input end of the (k-1)-th memory, where k is an integer greater than 1 and less than or equal to (m-1).
2. The matrix convolution calculation apparatus of claim 1, wherein the memory includes:
a first-in first-out queue FIFO inside the FPGA or a random access memory RAM inside the FPGA.
3. A method of performing matrix convolution using the matrix convolution calculating device of claim 1, wherein the input matrix is an M × N matrix, the convolver matrix is an m × n matrix, M, N, m and n are positive integers, m is less than or equal to M, and n is less than or equal to N; characterized in that the method comprises:
reading input data from the external storage device and, if a data-full signal of the first memory is detected, starting the multipliers after waiting n clock cycles.
CN201811101509.XA 2018-09-20 2018-09-20 Matrix convolution calculating device and matrix convolution calculating method Active CN109284475B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811101509.XA CN109284475B (en) 2018-09-20 2018-09-20 Matrix convolution calculating device and matrix convolution calculating method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811101509.XA CN109284475B (en) 2018-09-20 2018-09-20 Matrix convolution calculating device and matrix convolution calculating method

Publications (2)

Publication Number Publication Date
CN109284475A CN109284475A (en) 2019-01-29
CN109284475B true CN109284475B (en) 2021-10-29

Family

ID=65181844

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811101509.XA Active CN109284475B (en) 2018-09-20 2018-09-20 Matrix convolution calculating device and matrix convolution calculating method

Country Status (1)

Country Link
CN (1) CN109284475B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110097174B (en) * 2019-04-22 2021-04-20 西安交通大学 Method, system and device for realizing convolutional neural network based on FPGA and row output priority
CN110648313B (en) * 2019-09-05 2022-05-24 北京智行者科技有限公司 Laser stripe center line fitting method based on FPGA
CN111240746B (en) * 2020-01-12 2023-01-10 苏州浪潮智能科技有限公司 Floating point data inverse quantization and quantization method and equipment
CN113536221B (en) * 2020-04-21 2023-12-15 中科寒武纪科技股份有限公司 Operation method, processor and related products
CN112612447B (en) * 2020-12-31 2023-12-08 安徽芯纪元科技有限公司 Matrix calculator and full-connection layer calculating method based on same

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4937774A (en) * 1988-11-03 1990-06-26 Harris Corporation East image processing accelerator for real time image processing applications
CN104915322A (en) * 2015-06-09 2015-09-16 中国人民解放军国防科学技术大学 Method for accelerating convolution neutral network hardware and AXI bus IP core thereof
CN106250103A (en) * 2016-08-04 2016-12-21 东南大学 A kind of convolutional neural networks cyclic convolution calculates the system of data reusing
CN106844294A (en) * 2016-12-29 2017-06-13 华为机器有限公司 Convolution algorithm chip and communication equipment
CN107341544A (en) * 2017-06-30 2017-11-10 清华大学 A kind of reconfigurable accelerator and its implementation based on divisible array
CN107656899A (en) * 2017-09-27 2018-02-02 深圳大学 A kind of mask convolution method and system based on FPGA

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8499021B2 (en) * 2010-08-25 2013-07-30 Qualcomm Incorporated Circuit and method for computing circular convolution in streaming mode

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4937774A (en) * 1988-11-03 1990-06-26 Harris Corporation East image processing accelerator for real time image processing applications
CN104915322A (en) * 2015-06-09 2015-09-16 中国人民解放军国防科学技术大学 Method for accelerating convolution neutral network hardware and AXI bus IP core thereof
CN106250103A (en) * 2016-08-04 2016-12-21 东南大学 A kind of convolutional neural networks cyclic convolution calculates the system of data reusing
CN106844294A (en) * 2016-12-29 2017-06-13 华为机器有限公司 Convolution algorithm chip and communication equipment
CN107341544A (en) * 2017-06-30 2017-11-10 清华大学 A kind of reconfigurable accelerator and its implementation based on divisible array
CN107656899A (en) * 2017-09-27 2018-02-02 深圳大学 A kind of mask convolution method and system based on FPGA

Also Published As

Publication number Publication date
CN109284475A (en) 2019-01-29

Similar Documents

Publication Publication Date Title
CN109284475B (en) Matrix convolution calculating device and matrix convolution calculating method
KR102492477B1 (en) Matrix multiplier
US10140251B2 (en) Processor and method for executing matrix multiplication operation on processor
US10713214B1 (en) Hardware accelerator for outer-product matrix multiplication
WO2021232843A1 (en) Image data storage method, image data processing method and system, and related apparatus
WO2023065983A1 (en) Computing apparatus, neural network processing device, chip, and data processing method
CN111008691B (en) Convolutional neural network accelerator architecture with weight and activation value both binarized
WO2021232422A1 (en) Neural network arithmetic device and control method thereof
CN111178513B (en) Convolution implementation method and device of neural network and terminal equipment
CN114758209B (en) Convolution result obtaining method and device, computer equipment and storage medium
CN108804974B (en) Method and system for estimating and configuring resources of hardware architecture of target detection algorithm
CN112346704B (en) Full-streamline type multiply-add unit array circuit for convolutional neural network
CN110704019B (en) Data buffer and data reading method
CN103179398A (en) FPGA (field programmable gate array) implement method for lifting wavelet transform
CN114115799A (en) Matrix multiplication apparatus and method of operating the same
CN111368250B (en) Data processing system, method and equipment based on Fourier transformation/inverse transformation
CN110929854B (en) Data processing method and device and hardware accelerator
Mohanty et al. Systolic architecture for hardware implementation of two-dimensional non-separable filter-bank
Yin et al. A reconfigurable accelerator for generative adversarial network training based on FPGA
CN108900177A (en) A kind of FIR filter and its method that data are filtered
Mohanty et al. New scan method and pipeline architecture for VLSI implementation of separable 2-D FIR filters without transposition
JP2001160736A (en) Digital filter circuit
WO2023131252A1 (en) Data flow architecture-based image size adjustment structure, adjustment method, and image resizing method and apparatus
Zhang et al. A cache structure and corresponding data access method for Winograd algorithm
Lu et al. A Rotation-based Data Buffering Architecture for Convolution Filtering in a Field Programmable Gate Array.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant