CN109284475B - Matrix convolution calculating device and matrix convolution calculating method - Google Patents

Matrix convolution calculating device and matrix convolution calculating method

Info

Publication number
CN109284475B
CN109284475B (application number CN201811101509.XA)
Authority
CN
China
Prior art keywords
registers
matrix
multipliers
group
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811101509.XA
Other languages
Chinese (zh)
Other versions
CN109284475A (en)
Inventor
满宏涛
王振江
李拓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhengzhou Yunhai Information Technology Co Ltd
Original Assignee
Zhengzhou Yunhai Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhengzhou Yunhai Information Technology Co Ltd filed Critical Zhengzhou Yunhai Information Technology Co Ltd
Priority to CN201811101509.XA priority Critical patent/CN109284475B/en
Publication of CN109284475A publication Critical patent/CN109284475A/en
Application granted
Publication of CN109284475B publication Critical patent/CN109284475B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/15Correlation function computation including computation of convolution operations
    • G06F17/153Multidimensional correlation or convolution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Abstract

The matrix convolution calculation device is provided with (m-1) memories that form a first-in first-out storage structure among themselves, so that when input data stored in an external storage device are read, m rows of input data do not need to be read from the external storage device simultaneously; instead, the input data are read sequentially, row by row. Using the matrix convolution calculation device provided by the embodiments of the present application therefore reduces the requirements on the interface bandwidth of the external storage device and on the number of interfaces of the FPGA chip, and the device has strong applicability.

Description

Matrix convolution calculating device and matrix convolution calculating method
Technical Field
The present application relates to the field of electronic technologies, and in particular to a device and a method for calculating a matrix convolution based on a field programmable gate array (FPGA).
Background
With the development of science and technology, convolutional neural networks (CNN) are being applied more and more widely. A CNN is a multilayer neural network in which the convolution layer is an important component; the core operation of the convolution layer is the convolution of the input data with a convolver. The input data can be represented as an input matrix, the convolver as a convolver matrix, and the convolution output as an output matrix.
If the input matrix is A, an M × N matrix with elements A(i,j), and the convolver matrix is K, an m × n matrix with elements K(u,v), then in general m ≤ M and n ≤ N. The convolution output Y is:

Y(i,j) = Σ_{u=0}^{m-1} Σ_{v=0}^{n-1} K(u,v)·A(i+u, j+v), where 0 ≤ i ≤ M-m and 0 ≤ j ≤ N-n.

This formula computes one output point of the output matrix: the convolver is multiplied point by point with the input data at the corresponding positions and the products are accumulated. Different output points are obtained by changing the position of the convolver relative to the input data. To complete the whole convolution operation, the convolver moves from left to right, producing one output point for every step of one grid; after one row of outputs is finished, the convolver moves down by one row and again moves from left to right, and so on until the last output point of the last row has been computed.
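As a point of reference, the convolution defined above amounts to the following software sketch (the names convolve2d, A, K and Y are this description's notation, not identifiers from the patent):

    def convolve2d(A, K):
        M, N = len(A), len(A[0])              # input matrix dimensions
        m, n = len(K), len(K[0])              # convolver matrix dimensions
        Y = [[0] * (N - n + 1) for _ in range(M - m + 1)]
        for i in range(M - m + 1):            # convolver moves down one row at a time
            for j in range(N - n + 1):        # and from left to right within a row
                Y[i][j] = sum(K[u][v] * A[i + u][j + v]
                              for u in range(m) for v in range(n))
        return Y

Each output element is the multiply-accumulate of the convolver with the input window at the corresponding position, exactly as described above.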
Implementing the matrix convolution operation with an FPGA has certain advantages: the matrix convolution operation involves a large number of multiply-accumulate operations, so the resources of the FPGA can be fully utilized, and the parallelism of the FPGA also greatly improves the operation speed.
Referring to fig. 1, which is a schematic diagram of implementing a matrix convolution operation with an FPGA in the prior art. Fig. 1 takes a 5 × 5 input matrix and a 3 × 3 convolver matrix as an example and shows how the first element Y(0,0) of the output matrix is calculated. As can be seen from fig. 1, 9 registers 110, 9 multipliers 120 and an addition tree are involved in computing the output matrix from the input matrix and the convolver. The process of computing Y(0,0) is as follows: the 1st, 2nd and 3rd rows of input data are read simultaneously, and each row is fed in sequence into one of three input ports, i.e. the first row of data enters input port 101, the second row enters input port 102 and the third row enters input port 103. After 3 clock cycles the data held in the registers are as shown in fig. 1 and the multiplication starts. The outputs of the multipliers at this time are, listed in the order of the multipliers from top to bottom as shown:

K(0,2)·A(0,2), K(0,1)·A(0,1), K(0,0)·A(0,0), K(1,2)·A(1,2), K(1,1)·A(1,1), K(1,0)·A(1,0), K(2,2)·A(2,2), K(2,1)·A(2,1), K(2,0)·A(2,0).

The outputs of the multipliers enter the addition tree and are summed, which yields the value of Y(0,0).
It will be appreciated that, as data continue to be input, the values in the registers on the next clock cycle are as shown in fig. 2, and the outputs of the multipliers are then, listed in the order of the multipliers from top to bottom as shown:

K(0,2)·A(0,3), K(0,1)·A(0,2), K(0,0)·A(0,1), K(1,2)·A(1,3), K(1,1)·A(1,2), K(1,0)·A(1,1), K(2,2)·A(2,3), K(2,1)·A(2,2), K(2,0)·A(2,1).

The outputs of the multipliers enter the addition tree and are summed, which yields the value of Y(0,1). When the 1st, 2nd and 3rd rows of data have all been input, the values of all elements in the first row of the output matrix have been obtained.
Similarly to the calculation of the values of the elements in the 1st row of the output matrix, when the values of the elements in the 2nd row of the output matrix are calculated, the 2nd, 3rd and 4th rows of the input data are read simultaneously, and each row of data is fed in sequence into input port 101, input port 102 and input port 103 respectively. The method for calculating the values of the elements in row 2 of the output matrix is the same as for row 1 and is not repeated here. It will be appreciated that the calculation of the entire output matrix is complete only once the last three rows of input data have been read.
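For concreteness, the prior-art dataflow of figs. 1 and 2 can be sketched in software as follows; this is a reconstruction under the assumptions of the description above (the name prior_art_output_row and the list-based register model are illustrative): the m rows read simultaneously are each streamed through a chain of n cascaded registers, and once the registers are full the addition tree sums the m x n products on every clock cycle, producing one output element per cycle.

    def prior_art_output_row(rows, K):
        # rows: the m input rows that are read simultaneously (each a list of length N);
        # K: the m x n convolver matrix. Returns one row of the output matrix.
        m, n = len(K), len(K[0])
        N = len(rows[0])
        regs = [[0] * n for _ in range(m)]        # m groups of n cascaded registers
        out_row = []
        for t in range(N):                        # one new element per input row per clock
            for g in range(m):                    # shift: register j takes register (j-1)'s value
                regs[g] = [rows[g][t]] + regs[g][:-1]
            if t >= n - 1:                        # registers full: addition-tree output is valid
                out_row.append(sum(K[g][n - 1 - p] * regs[g][p]
                                   for g in range(m) for p in range(n)))
        return out_row

For the 5 × 5 example, feeding the 1st, 2nd and 3rd input rows yields the first row of the output matrix, and feeding the 2nd, 3rd and 4th rows yields the second row, which is exactly the scheme whose bandwidth cost is discussed next.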
It can be understood that, in practical applications, the input/output (IO) interface resources of an FPGA are limited, and the above scheme for implementing matrix convolution with an FPGA requires 3 rows of input data to be read from the external storage device simultaneously, which places high demands on the interface bandwidth of the external storage device and on the number of interfaces of the FPGA chip. In fact, the number of rows of data read simultaneously from the external storage device is determined by the dimension of the convolver matrix: if the convolver matrix is an m × n matrix, then m rows of input data need to be read from the external storage device simultaneously. It can be understood that the larger m is, the higher the requirements on the interface bandwidth of the external storage device and on the number of interfaces of the FPGA chip, so the applicability of the above scheme in practical applications is not strong.
In view of the above, a solution is needed to solve the above problems.
Disclosure of Invention
The technical problem to be solved by the present application is that the matrix convolution operation implemented based on an FPGA in the prior art has poor applicability; a matrix convolution calculating device and a matrix convolution calculating method are therefore provided.
In a first aspect, an embodiment of the present application provides a matrix convolution calculation device implemented based on an FPGA, where the input matrix is an M × N matrix, the convolver matrix is an m × n matrix, M, N, m and n are positive integers, m is less than or equal to M, and n is less than or equal to N; the convolution calculation device includes: m x n registers, m x n multipliers, an addition tree and (m-1) memories;
the output ends of the m x n multipliers are connected to the input end of the addition tree;
the input to any one of the m x n multipliers is an element of the convolver matrix and the value stored in the corresponding register; the m x n multipliers are organized as m groups of multipliers, each group of multipliers comprising n multipliers; the m x n registers are organized as m groups of registers, each group of registers comprising n registers; and the m x n registers correspond to the m x n multipliers one to one;
among the n registers corresponding to the i-th group of multipliers in the m groups of multipliers, the element of the convolver matrix corresponding to the first register is the element in row (i-1), column (n-1) of the convolver matrix, and the element of the convolver matrix corresponding to the p-th register is the element in row (i-1), column (n-p) of the convolver matrix, where p is an integer greater than or equal to 2 and less than or equal to n, and i is an integer greater than or equal to 1 and less than or equal to m;
the (m-1) memories correspond one to one to (m-1) groups of registers among the m groups of registers, and the output end of a first memory is connected to the input end of the first register in a first group of registers, where the first memory is any one of the (m-1) memories and the first group of registers is the group of registers corresponding to the first memory; the input end of the one group of registers, among the m groups of registers, that does not correspond to any of the (m-1) memories is connected to the output end of an external storage device for storing the input data, where the input end of a group of registers refers to the input end of the first register in that group;
the storage size of each of the (m-1) memories is N;
the n registers in any one of the m groups of registers are cascaded: when a clock edge arrives, the value of the j-th register among the n registers is updated to the value of the (j-1)-th register, and the value of the first register among the n registers is the value read from the corresponding memory, where j is an integer greater than 1 and less than or equal to n;
the (m-1) memories form a first-in first-out storage structure among themselves, with the data output end of the k-th memory connected to the data input end of the (k-1)-th memory, where k is an integer greater than 1 and less than or equal to (m-1).
Optionally, each memory is a first-in first-out (FIFO) queue inside the FPGA or a random access memory (RAM) inside the FPGA.
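As an orientation aid (not part of the claimed subject matter), the resource and interface counts implied by the first aspect can be summarized with a small helper; the function name and dictionary keys are illustrative only.

    def convolution_device_resources(m, n, N):
        # Resource summary for the device of the first aspect, for an m x n convolver
        # matrix and an input matrix with N columns.
        return {
            "registers": m * n,                 # m groups of n cascaded registers
            "multipliers": m * n,               # one multiplier per register
            "addition_tree_inputs": m * n,      # every multiplier output feeds the tree
            "memories": m - 1,                  # FIFO-chained line buffers
            "memory_depth": N,                  # each memory holds one input row
            "rows_read_simultaneously": 1,      # versus m rows in the prior-art scheme
        }

For example, a 3 × 3 convolver matrix over an input matrix with 5 columns gives 9 registers, 9 multipliers, 2 memories of depth 5, and a single row-by-row input stream.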
In a second aspect, an embodiment of the present application provides a method for implementing matrix convolution by using the matrix convolution calculation device of any one of the implementations of the first aspect, where the input matrix is an M × N matrix, the convolver matrix is an m × n matrix, M, N, m and n are positive integers, m is less than or equal to M, and n is less than or equal to N; the method includes:
reading input data from the external storage device and, if a data-full signal of the first memory is detected, starting the multipliers after waiting n clock cycles.
Compared with the prior art, the embodiment of the application has the following advantages:
the embodiment of the application provides a matrix convolution calculating device based on FPGA realizes, includes: m registers, m multipliers, an addition tree and (m-1) memories; the output ends of the m x n multipliers are connected to the input end of the addition tree; the input of any multiplier in the m x n multipliers is an element in a convolver matrix and a value stored in the register; the m x n multipliers are embodied as m groups of multipliers, and one group of multipliers comprises n multipliers; the m × n registers are embodied as m groups of registers, and each group of registers comprises n registers; the m × n registers correspond to the m × n multipliers one by one; in n registers corresponding to an ith group of multipliers in the m groups of multipliers, elements in a convolver corresponding to a first register in the n registers corresponding to the ith group of multipliers are: the (i-1) th row and the (n-1) th column of the convolver, and the element in the convolver corresponding to the p-th register in the n registers corresponding to the i-th group of multipliers is as follows: the data of the (i-1) th row and the (n-p) th column in the convolver, wherein p is an integer which is more than or equal to 2 and less than or equal to n, and i is an integer which is more than or equal to 1 and less than or equal to m; the (m-1) memories are in one-to-one correspondence with (m-1) groups of registers in the m groups of registers, and the output end of the first memory is connected with the input end of a first register in the first group of registers; wherein the first memory is any one of the (m-1) memories, and the first group of registers is a group of registers corresponding to the first memory; the input ends of a group of registers except the (m-1) group of registers in the m groups of registers are connected with the output end of an external storage device for storing input data; wherein, the input end of the group of registers refers to the input end of the first register in the group of registers; the storage size of the (m-1) memories is the N; n registers in any group of registers in the m groups of registers are in cascade connection, when a clock arrives, the value of the jth register in the n registers is updated to the value of the (j-1) th register, the value of the first register in the n registers is the value read from the memory, wherein j is an integer which is less than or equal to n and is greater than 1; a first-in first-out storage structure is formed among the (m-1) storages, the data output end of the kth storage is connected with the data input end of the (k-1) storage, and k is smaller than or equal to the (m-1).
That is to say, in the embodiment of the present application, (m-1) memories are provided and form a first-in first-out storage structure among themselves, so that when the input data stored in the external storage device are read, it is not necessary to read m rows of input data from the external storage device simultaneously; the input data are read sequentially, row by row. Therefore, the matrix convolution calculating device provided by the embodiment of the present application reduces the requirements on the interface bandwidth of the external storage device and on the number of interfaces of the FPGA chip, and has strong applicability.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments described in the present application, and other drawings can be obtained from them by those skilled in the art without creative effort.
FIG. 1 is a schematic diagram of a prior art implementation of matrix convolution operations using an FPGA;
FIG. 2 is a schematic diagram of a prior art implementation of a matrix convolution operation using an FPGA;
fig. 3 is a schematic structural diagram of a matrix convolution calculating apparatus according to an embodiment of the present application;
FIG. 4 is a schematic diagram illustrating movement of data stored in a memory when data is read from an external storage device according to an embodiment of the present disclosure;
fig. 5 is a schematic diagram illustrating a matrix convolution operation implemented by an FPGA according to an embodiment of the present application;
fig. 6 is a further schematic diagram of implementing a matrix convolution operation with an FPGA according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings of the embodiments. It is obvious that the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present application.
The inventors of the present application have found through research that the prior-art scheme of implementing matrix convolution with an FPGA requires several rows of input data to be read from the external storage device simultaneously, which places high demands on the interface bandwidth of the external storage device and on the number of interfaces of the FPGA chip. Specifically, the number of rows of data read simultaneously from the external storage device is determined by the dimension of the convolver matrix: if the convolver matrix is an m × n matrix, m rows of input data need to be read from the external storage device simultaneously. It can be understood that the larger m is, the higher these requirements become, so the applicability of that scheme in practical applications is not strong.
In order to solve the above problem, an embodiment of the present application provides a matrix convolution calculation device implemented based on an FPGA, including: m x n registers, m x n multipliers, an addition tree and (m-1) memories. The output ends of the m x n multipliers are connected to the input end of the addition tree. The input to any one of the m x n multipliers is an element of the convolver matrix and the value stored in the corresponding register; the m x n multipliers are organized as m groups of multipliers, each group comprising n multipliers; the m x n registers are organized as m groups of registers, each group comprising n registers; and the m x n registers correspond to the m x n multipliers one to one. Among the n registers corresponding to the i-th group of multipliers, the element of the convolver matrix corresponding to the first register is the element in row (i-1), column (n-1) of the convolver matrix, and the element corresponding to the p-th register is the element in row (i-1), column (n-p) of the convolver matrix, where p is an integer greater than or equal to 2 and less than or equal to n, and i is an integer greater than or equal to 1 and less than or equal to m. The (m-1) memories correspond one to one to (m-1) groups of registers among the m groups of registers, and the output end of a first memory is connected to the input end of the first register in a first group of registers, where the first memory is any one of the (m-1) memories and the first group of registers is the group of registers corresponding to the first memory; the input end of the one group of registers that does not correspond to any of the (m-1) memories is connected to the output end of an external storage device for storing the input data, where the input end of a group of registers refers to the input end of the first register in that group. The storage size of each of the (m-1) memories is N. The n registers in any one of the m groups of registers are cascaded: when a clock edge arrives, the value of the j-th register is updated to the value of the (j-1)-th register, and the value of the first register is the value read from the corresponding memory, where j is an integer greater than 1 and less than or equal to n. The (m-1) memories form a first-in first-out storage structure among themselves, with the data output end of the k-th memory connected to the data input end of the (k-1)-th memory, where k is an integer greater than 1 and less than or equal to (m-1).
That is to say, in the embodiment of the present application, (m-1) memories are provided and form a first-in first-out storage structure among themselves, so that when the input data stored in the external storage device are read, it is not necessary to read m rows of input data from the external storage device simultaneously; the input data are read sequentially, row by row. Therefore, the matrix convolution calculating device provided by the embodiment of the present application reduces the requirements on the interface bandwidth of the external storage device and on the number of interfaces of the FPGA chip, and has strong applicability.
Various non-limiting embodiments of the present application are described in detail below with reference to the accompanying drawings.
Referring to fig. 3, the diagram is a schematic structural diagram of a matrix convolution calculation apparatus according to an embodiment of the present application.
The matrix convolution calculating device 300 according to the embodiment of the present application may be used to calculate the convolution of an input matrix, which is an M × N matrix, with a convolver matrix, which is an m × n matrix, wherein M, N, m and n are all positive integers, m is less than or equal to M, and n is less than or equal to N.
The convolution calculating device 300 includes: m x n registers 301, m x n multipliers 302, an addition tree 303 and (m-1) memories 304. Fig. 3 takes a 3 × 3 convolver matrix as an example, so the device shown includes 9 registers 301, 9 multipliers 302, an addition tree 303 and 2 memories 304, namely memory 1 and memory 2.
The output ends of the m x n multipliers are connected to the input end of the addition tree 303; it will be understood that the output of the addition tree 303 is the value of one element of the output matrix obtained by convolving the input matrix with the convolver matrix.
The input to any one of the m x n multipliers is an element of the convolver matrix and the value stored in the corresponding register; the m x n multipliers are organized as m groups of multipliers, each group comprising n multipliers; the m x n registers are organized as m groups of registers, each group comprising n registers; and the m x n registers correspond to the m x n multipliers one to one. As shown in fig. 3, the 3 x 3 registers are organized as 3 groups of registers, each group including 3 registers, and the 3 x 3 registers correspond to the 3 x 3 multipliers one to one.
Among the n registers corresponding to the i-th group of multipliers in the m groups of multipliers, the element of the convolver matrix corresponding to the first register is the element in row (i-1), column (n-1) of the convolver matrix, and the element corresponding to the p-th register is the element in row (i-1), column (n-p) of the convolver matrix, where p is an integer greater than or equal to 2 and less than or equal to n, and i is an integer greater than or equal to 1 and less than or equal to m. This can be understood in connection with fig. 3: the 1st group of multipliers consists of the first 3 multipliers from top to bottom in fig. 3; the element of the convolver matrix corresponding to its first register is K(0,2), the element in row 0, column 2 of the convolver matrix, and the element corresponding to the 3rd register of the 3 registers corresponding to the 1st group of multipliers is K(0,0), the element in row 0, column 0 of the convolver matrix.
The (m-1) memories 304 correspond one to one to (m-1) groups of registers among the m groups of registers, and the output end of a first memory is connected to the input end of the first register in a first group of registers, where the first memory is any one of the (m-1) memories and the first group of registers is the group of registers corresponding to the first memory. As can be understood in connection with fig. 3, memory 1 corresponds to the 1st group of registers, and the output end of memory 1 is connected to the input end of the first register of that group of 3 registers (i.e., the first register from top to bottom in fig. 3); memory 2 corresponds to the 2nd group of registers, and the output end of memory 2 is connected to the input end of the first register of that group of 3 registers (i.e., the fourth register from top to bottom in fig. 3).
The input end of the one group of registers, among the m groups of registers, that does not correspond to any of the (m-1) memories is connected to the output end of the external storage device for storing the input data; here, the input end of a group of registers refers to the input end of the first register in that group. As can be understood in connection with fig. 3, the seventh to ninth registers from top to bottom in fig. 3 do not correspond to any memory; in the embodiment of the present application, the input end of the first register of these three registers (i.e., the seventh register from top to bottom in fig. 3), which is the input end of the group of registers they form, is connected to the output end of the external storage device for storing the input data.
In the embodiment of the present application, the storage size of each of the (m-1) memories is N; that is, each memory can store one row of data of the input matrix.
In the embodiment of the present application, the n registers in any one of the m groups of registers are cascaded: when a clock edge arrives, the value of the j-th register among the n registers is updated to the value of the (j-1)-th register, and the value of the first register among the n registers is the value read from the corresponding memory, where j is an integer greater than 1 and less than or equal to n.
It should be noted that the clock mentioned herein may refer to a system clock of the FPGA, or may be a clock obtained by frequency multiplication or frequency division according to the system clock.
It will be appreciated that the value of the j-th register is therefore the value of the (j-1)-th register delayed by one cycle of the clock signal.
The (m-1) memories form a first-in first-out storage structure among themselves: the data output end of the k-th memory is connected to the data input end of the (k-1)-th memory, where k is an integer greater than 1 and less than or equal to (m-1). It can be understood that the data in the (m-1) memories move as data are read from the external storage device. This can be understood with reference to fig. 4, which is a schematic diagram illustrating the movement of the data stored in the (m-1) memories when data are read from the external storage device; fig. 4 shows how external data are stored into the memories 401 and 402. For ease of understanding, fig. 4 illustrates only 6 input data values, which does not limit the embodiments of the present application.
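The data movement of fig. 4 can be sketched as follows. This is a behavioural model under the assumptions of this description, not an implementation (the name step_fifo_chain is illustrative): each memory behaves as a depth-N line buffer, so the tap of memory (m-1) lags the external stream by one row, the tap of memory (m-2) by two rows, and so on, which is what lets a single row-by-row stream feed all m register groups.

    from collections import deque

    def step_fifo_chain(mems, sample, depth):
        # Push one sample read from external storage into the chain of line-buffer
        # memories. mems[0] is memory 1 (feeding the top register group) and mems[-1]
        # is memory (m-1), which is fed directly by external storage. Returns the
        # values presented to register groups 1 .. m on this clock cycle (None where
        # a memory is not yet full and therefore has no output).
        incoming = sample
        taps = [sample]                        # the external sample feeds the bottom group
        for mem in reversed(mems):             # memory (m-1) down to memory 1
            out = mem[0] if len(mem) == depth else None
            if incoming is not None:
                mem.append(incoming)           # deque(maxlen=depth) drops the value read out
            taps.insert(0, out)
            incoming = out                     # the value read out feeds the next memory up
        return taps

    # Example with m = 3 and a 5-column input, elements numbered 0..24 in row order:
    mems = [deque(maxlen=5) for _ in range(2)]
    for x in range(25):
        taps = step_fifo_chain(mems, x, depth=5)
    # Once both memories are full, taps equals [x - 10, x - 5, x]: the same column of
    # three consecutive input rows, obtained from a single row-by-row stream.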
It should be noted that, in the embodiment of the present application, each memory may be a first-in first-out (FIFO) queue inside the FPGA or a random access memory (RAM) inside the FPGA.
The embodiment of the present application does not particularly limit the external storage device; the external storage device may be, for example, an external memory.
In the embodiment of the present application, (m-1) memories are arranged and form a first-in first-out storage structure among themselves, so that when the input data stored in the external storage device are read, m rows of input data do not need to be read from the external storage device simultaneously, and the input data can be read sequentially, row by row. Therefore, the matrix convolution calculating device provided by the embodiment of the present application reduces the requirements on the interface bandwidth of the external storage device and on the number of interfaces of the FPGA chip, and has strong applicability.
The above describes a matrix convolution calculation apparatus, and the following describes a method of performing matrix convolution calculation using the matrix convolution calculation apparatus.
First, input data are read from the external storage device. As can be seen from the structure of the matrix convolution calculation device above, the (m-1) memories gradually fill up as more input data are read. It can be understood that, since one memory can store one row of data of the input matrix, when the data-full signal of the first memory among the (m-1) memories is detected, this indicates that (m-1) rows of data of the input matrix have been stored in the first to (m-1)-th memories. In the embodiment of the present application, if the data-full signal of the first memory is detected, the multipliers are started after waiting n clock cycles, and the calculation of the matrix convolution of the input matrix and the convolver matrix begins.
The following describes, with reference to the accompanying drawings, a method for implementing matrix convolution according to an embodiment of the present application, by taking an input matrix as a (5 × 5) matrix and a convolver matrix as a (3 × 3) matrix as an example.
At the initial stage the multipliers are not yet started. The 1st row of data of the input matrix is read from the external storage device and, according to the matrix convolution calculation device shown in fig. 3, is written into memory 2. Then the 2nd row of data of the input matrix continues to be read from the external storage device; according to fig. 3, the 2nd row of data is written into memory 2 while the data in memory 2 are written into memory 1. When both memory 2 and memory 1 are full, the 3rd row of data continues to be read from the external storage device; after waiting 3 clock cycles, the multipliers are started. At this time, memory 2 outputs the data of the 2nd row and memory 1 outputs the data of the 1st row. After three clock cycles the values of the registers are as shown in fig. 5, at which time the first element Y(0,0) of the output matrix can be calculated; after four clock cycles the values of the registers are as shown in fig. 6, at which time the second element Y(0,1) of the output matrix can be calculated. By analogy, the values of all elements of the first row of the output matrix can be calculated.
It will be appreciated that, during the above process of calculating all the element values of the first row of the output matrix, the data in memory 2 are replaced by the 3rd row and the data in memory 1 by the 2nd row, at which point the 4th row can continue to be read from the external storage device. The data at the inputs of the multipliers are then updated to the data of the 2nd, 3rd and 4th rows, so that the values of the elements of the second row of the output matrix can be calculated. It will be appreciated that the entire matrix convolution calculation is complete once the last row of input data has been read and processed.
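Putting the pieces of this walkthrough together, the behaviour of the device of fig. 3 can be sketched end to end as follows. This is an assumption-level model of the description above, not an implementation: the input is streamed row by row in a single pass, the (m-1) depth-N memories form the FIFO chain of line buffers, each register group holds the n most recent samples of its (delayed) stream, and the multipliers start n clock cycles after the first memory signals full. The helper name line_buffer_convolve and the coordinate bookkeeping are illustrative, and the sketch assumes m is at least 2.

    from collections import deque

    def line_buffer_convolve(A, K):
        M, N = len(A), len(A[0])
        m, n = len(K), len(K[0])
        stream = [x for row in A for x in row]            # single row-by-row read pass
        mems = [deque(maxlen=N) for _ in range(m - 1)]    # memory 1 .. memory (m-1)
        regs = [[0] * n for _ in range(m)]                # m groups of n cascaded registers
        Y = [[0] * (N - n + 1) for _ in range(M - m + 1)]
        started, start_t = False, 0
        for t, x in enumerate(stream):
            # FIFO chain: the new sample enters memory (m-1); every memory that is
            # already full passes its oldest sample on to the next memory up the chain.
            incoming, taps = x, [x]                       # the last tap feeds register group m
            for mem in reversed(mems):                    # memory (m-1) down to memory 1
                out = mem[0] if len(mem) == N else None
                if incoming is not None:
                    mem.append(incoming)
                taps.insert(0, out)                       # tap feeding this memory's register group
                incoming = out
            # Register cascade: register j takes register (j-1)'s value, and register 1
            # takes the value read from its memory (or from external storage).
            for g in range(m):
                if taps[g] is not None:
                    regs[g] = [taps[g]] + regs[g][:-1]
            # Start condition: once memory 1 reports full, wait n further clock cycles.
            if not started and len(mems[0]) == N:
                started, start_t = True, t + n
            if started and t >= start_t:
                # The newest value in the top register group belongs to input element
                # (i, j + n - 1); windows that straddle a row boundary are discarded.
                i, j = divmod(t - (m - 1) * N - (n - 1), N)
                if 0 <= j <= N - n:
                    Y[i][j] = sum(K[g][n - 1 - p] * regs[g][p]
                                  for g in range(m) for p in range(n))
        return Y

For a 5 × 5 input matrix and a 3 × 3 convolver matrix this returns the same 3 × 3 output matrix as a direct evaluation of the convolution formula in the background section, while only one row of input data is ever being read at a time.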
According to the above scheme, the method provided by the embodiment of the present application reduces the requirements on the interface bandwidth of the external storage device and on the number of interfaces of the FPGA chip, and has strong applicability. The reading order of the input data is simple: the input data are stored in the external storage device in row order and are read in that order. In the design of the present application, by cascading several RAMs or FIFOs, all input data can be reused iteratively, and a single read pass satisfies the requirements of the entire matrix convolution operation. This avoids the problems of the prior art, in which, when the FPGA calculates the matrix convolution, the input data stored in the external storage device must support simultaneous reading of any 3 adjacent rows, the requirements on the data storage order and format are high, data management is very complex, each row of data is read repeatedly 3 times, and the reading efficiency of the input data is low.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice in the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the application is limited only by the appended claims.
The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (3)

1. A matrix convolution calculating device realized based on an FPGA (field programmable gate array), wherein an input matrix is an M × N matrix, a convolver matrix is an m × n matrix, M, N, m and n are positive integers, m is less than or equal to M, and n is less than or equal to N; wherein said convolution calculating device comprises: m x n registers, m x n multipliers, an addition tree and (m-1) memories;
the output ends of the m x n multipliers are connected to the input end of the addition tree;
the input to any one of the m x n multipliers is an element of the convolver matrix and the value stored in the corresponding register; the m x n multipliers are organized as m groups of multipliers, each group of multipliers comprising n multipliers; the m x n registers are organized as m groups of registers, each group of registers comprising n registers; and the m x n registers correspond to the m x n multipliers one to one;
among the n registers corresponding to the i-th group of multipliers in the m groups of multipliers, the element of the convolver matrix corresponding to the first register is the element in row (i-1), column (n-1) of the convolver matrix, and the element of the convolver matrix corresponding to the p-th register is the element in row (i-1), column (n-p) of the convolver matrix, where p is an integer greater than or equal to 2 and less than or equal to n, and i is an integer greater than or equal to 1 and less than or equal to m;
the (m-1) memories correspond one to one to (m-1) groups of registers among the m groups of registers, and the output end of a first memory is connected to the input end of the first register in a first group of registers, where the first memory is any one of the (m-1) memories and the first group of registers is the group of registers corresponding to the first memory; the input end of the one group of registers, among the m groups of registers, that does not correspond to any of the (m-1) memories is connected to the output end of an external storage device for storing the input data, where the input end of a group of registers refers to the input end of the first register in that group;
the storage size of each of the (m-1) memories is N;
the n registers in any one of the m groups of registers are cascaded: when a clock edge arrives, the value of the j-th register among the n registers is updated to the value of the (j-1)-th register, and the value of the first register among the n registers is the value read from the corresponding memory, where j is an integer greater than 1 and less than or equal to n;
the (m-1) memories form a first-in first-out storage structure among themselves, with the data output end of the k-th memory connected to the data input end of the (k-1)-th memory, where k is an integer greater than 1 and less than or equal to (m-1).
2. The matrix convolution calculation apparatus of claim 1, wherein the memory includes:
a first-in first-out queue FIFO inside the FPGA or a random access memory RAM inside the FPGA.
3. A method of performing matrix convolution using the matrix convolution calculating device of claim 1, wherein the input matrix is an M × N matrix, the convolver matrix is an m × n matrix, M, N, m and n are positive integers, m is less than or equal to M, and n is less than or equal to N; characterized in that the method comprises:
reading input data from the external storage device and, if a data-full signal of the first memory is detected, starting the multipliers after waiting n clock cycles.
CN201811101509.XA 2018-09-20 2018-09-20 Matrix convolution calculating device and matrix convolution calculating method Active CN109284475B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811101509.XA CN109284475B (en) 2018-09-20 2018-09-20 Matrix convolution calculating device and matrix convolution calculating method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811101509.XA CN109284475B (en) 2018-09-20 2018-09-20 Matrix convolution calculating device and matrix convolution calculating method

Publications (2)

Publication Number Publication Date
CN109284475A CN109284475A (en) 2019-01-29
CN109284475B true CN109284475B (en) 2021-10-29

Family

ID=65181844

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811101509.XA Active CN109284475B (en) 2018-09-20 2018-09-20 Matrix convolution calculating device and matrix convolution calculating method

Country Status (1)

Country Link
CN (1) CN109284475B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110097174B (en) * 2019-04-22 2021-04-20 西安交通大学 Method, system and device for realizing convolutional neural network based on FPGA and row output priority
CN110648313B (en) * 2019-09-05 2022-05-24 北京智行者科技有限公司 Laser stripe center line fitting method based on FPGA
CN111240746B (en) * 2020-01-12 2023-01-10 苏州浪潮智能科技有限公司 Floating point data inverse quantization and quantization method and equipment
CN113536221B (en) * 2020-04-21 2023-12-15 中科寒武纪科技股份有限公司 Operation method, processor and related products
CN112612447B (en) * 2020-12-31 2023-12-08 安徽芯纪元科技有限公司 Matrix calculator and full-connection layer calculating method based on same

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4937774A (en) * 1988-11-03 1990-06-26 Harris Corporation East image processing accelerator for real time image processing applications
CN104915322A (en) * 2015-06-09 2015-09-16 中国人民解放军国防科学技术大学 Method for accelerating convolution neutral network hardware and AXI bus IP core thereof
CN106250103A (en) * 2016-08-04 2016-12-21 东南大学 A kind of convolutional neural networks cyclic convolution calculates the system of data reusing
CN106844294A (en) * 2016-12-29 2017-06-13 华为机器有限公司 Convolution algorithm chip and communication equipment
CN107341544A (en) * 2017-06-30 2017-11-10 清华大学 A kind of reconfigurable accelerator and its implementation based on divisible array
CN107656899A (en) * 2017-09-27 2018-02-02 深圳大学 A kind of mask convolution method and system based on FPGA

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8499021B2 (en) * 2010-08-25 2013-07-30 Qualcomm Incorporated Circuit and method for computing circular convolution in streaming mode

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4937774A (en) * 1988-11-03 1990-06-26 Harris Corporation East image processing accelerator for real time image processing applications
CN104915322A (en) * 2015-06-09 2015-09-16 中国人民解放军国防科学技术大学 Method for accelerating convolution neutral network hardware and AXI bus IP core thereof
CN106250103A (en) * 2016-08-04 2016-12-21 东南大学 A kind of convolutional neural networks cyclic convolution calculates the system of data reusing
CN106844294A (en) * 2016-12-29 2017-06-13 华为机器有限公司 Convolution algorithm chip and communication equipment
CN107341544A (en) * 2017-06-30 2017-11-10 清华大学 A kind of reconfigurable accelerator and its implementation based on divisible array
CN107656899A (en) * 2017-09-27 2018-02-02 深圳大学 A kind of mask convolution method and system based on FPGA

Also Published As

Publication number Publication date
CN109284475A (en) 2019-01-29

Similar Documents

Publication Publication Date Title
CN109284475B (en) Matrix convolution calculating device and matrix convolution calculating method
KR102492477B1 (en) Matrix multiplier
US10140251B2 (en) Processor and method for executing matrix multiplication operation on processor
US10713214B1 (en) Hardware accelerator for outer-product matrix multiplication
WO2021232843A1 (en) Image data storage method, image data processing method and system, and related apparatus
WO2023065983A1 (en) Computing apparatus, neural network processing device, chip, and data processing method
CN111008691B (en) Convolutional neural network accelerator architecture with weight and activation value both binarized
WO2021232422A1 (en) Neural network arithmetic device and control method thereof
CN111178513B (en) Convolution implementation method and device of neural network and terminal equipment
CN114758209B (en) Convolution result obtaining method and device, computer equipment and storage medium
CN108804974B (en) Method and system for estimating and configuring resources of hardware architecture of target detection algorithm
CN112346704B (en) Full-streamline type multiply-add unit array circuit for convolutional neural network
CN110704019B (en) Data buffer and data reading method
CN103179398A (en) FPGA (field programmable gate array) implement method for lifting wavelet transform
CN114115799A (en) Matrix multiplication apparatus and method of operating the same
CN111368250B (en) Data processing system, method and equipment based on Fourier transformation/inverse transformation
CN110929854B (en) Data processing method and device and hardware accelerator
Mohanty et al. Systolic architecture for hardware implementation of two-dimensional non-separable filter-bank
Yin et al. A reconfigurable accelerator for generative adversarial network training based on FPGA
CN108900177A (en) A kind of FIR filter and its method that data are filtered
Mohanty et al. New scan method and pipeline architecture for VLSI implementation of separable 2-D FIR filters without transposition
JP2001160736A (en) Digital filter circuit
WO2023131252A1 (en) Data flow architecture-based image size adjustment structure, adjustment method, and image resizing method and apparatus
Zhang et al. A cache structure and corresponding data access method for Winograd algorithm
Lu et al. A Rotation-based Data Buffering Architecture for Convolution Filtering in a Field Programmable Gate Array.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant