CN111652360B - Convolution operation device based on systolic array - Google Patents

Publication number
CN111652360B
Authority
CN
China
Prior art keywords
matrix
convolution
elements
multiplier
array
Prior art date
Legal status
Active
Application number
CN202010447090.4A
Other languages
Chinese (zh)
Other versions
CN111652360A (en)
Inventor
焦海龙
刘敏
周长春
Current Assignee
Peking University Shenzhen Graduate School
Original Assignee
Peking University Shenzhen Graduate School
Priority date
Filing date
Publication date
Application filed by Peking University Shenzhen Graduate School
Priority to CN202010447090.4A
Publication of CN111652360A
Application granted
Publication of CN111652360B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/15 Correlation function computation including computation of convolution operations
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Biophysics (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Neurology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Complex Calculations (AREA)

Abstract

Disclosed is a convolution operation device based on a systolic array, belonging to the technical fields of electronic information and deep learning. The device comprises a to-be-convolved matrix acquisition device, a first convolution matrix controller, a second convolution matrix controller, and a systolic array. The to-be-convolved matrix acquisition device first sends the matrix to be convolved and the convolution kernel matrix to the first convolution matrix controller and the second convolution matrix controller respectively, and the two controllers then input them into the systolic array for the convolution calculation. Because the first convolution matrix controller inputs only the acquired non-zero elements of its matrix into the systolic array, in sequence, the convolution operation is faster and the occupancy of circuit computing resources can be reduced.

Description

Convolution operation device based on systolic array
Technical Field
The invention relates to the technical fields of electronic information and deep learning, and in particular to a convolution operation device based on a systolic array.
Background
With the growing demand for neural-network-based artificial intelligence solutions, convolutional neural networks are being deployed on mobile platforms such as drones and robots, and are profoundly changing how people work and live. The convolutional layer of a convolutional neural network (CNN) is highly effective at extracting feature information from input data, which gives CNNs high recognition accuracy and has led to their wide application in fields such as image recognition and natural language processing. In the design and research of dedicated CNN hardware, implementations based on CPUs, GPUs, FPGAs, ASICs, and novel devices such as RRAM have all been proposed. From the cloud to mobile terminals, different application scenarios place different demands on CNN computing capacity; moreover, CNNs have diverse structures, large data volumes, and heavy computation, all of which pose great challenges for hardware implementations of neural network algorithms. The core of a CNN hardware architecture is the hardware architecture of the convolution operation. In the prior art, one approach implements the convolution with conventional digital circuits such as FPGAs, ASICs, GPUs, and CPUs. However, as process sizes shrink, circuit node leakage increases and supply voltages drop; at a given calculation accuracy, large amounts of circuit computing and storage resources are consumed, so the overall circuit performance in power consumption, area, speed, and precision is continually limited. The other approach designs and implements CNN hardware circuits based on new devices such as RRAM.
In summary, how to accelerate the operation speed of the convolution operation based on the existing circuit computing resources is a hot research and development direction of the neural network accelerator.
Disclosure of Invention
The application provides a convolution operation device based on a systolic array, which addresses the shortcomings of neural network accelerators in the prior art. Specific embodiments are as follows:
according to a first aspect, an embodiment provides a convolution operation device based on a systolic array, configured to perform convolution calculation on a to-be-convolved matrix X and a convolution kernel matrix H to obtain a convolution result matrix P, comprising a to-be-convolved matrix acquisition device, a first convolution matrix controller, a second convolution matrix controller, and the systolic array, where the systolic array includes m rows and n columns of multiplier-adder units, the value of n is the same as the number of columns of the convolution kernel matrix H, and the value of m is the same as the number of rows of the convolution kernel matrix H;
the device for acquiring the matrix to be convolved is used for sending one matrix of the matrix X to be convolved and the convolution kernel matrix H serving as a first convolution matrix to the first convolution matrix controller, and sending the other matrix serving as a second convolution matrix to the second convolution matrix controller;
the first convolution matrix controller is used for acquiring all non-zero elements of the first convolution matrix and broadcasting them to the systolic array in sequence;
the second convolution matrix controller is used for sequentially inputting m rows and n columns of elements of the second convolution matrix to each multiplier-adder unit of the systolic array according to a convolution sequence;
and the systolic array is used for acquiring elements of a convolution result matrix P according to the acquired elements of the first convolution matrix and the second convolution matrix.
Further, the device for acquiring the matrix to be convolved is used for determining that one matrix of the matrix X to be convolved and the convolution kernel matrix H is a first convolution matrix and the other matrix of the matrix X to be convolved and the convolution kernel matrix H is a second convolution matrix according to the sparsity of the matrix X to be convolved and the convolution kernel matrix H;
when the sparsity of the matrix X to be convolved is greater than that of the convolution kernel matrix H, taking the matrix X to be convolved as the first convolution matrix and taking the convolution kernel matrix H as the second convolution matrix;
and otherwise, taking the matrix X to be convolved as the second convolution matrix and taking the convolution kernel matrix H as the first convolution matrix.
Further, the multiplier-adder unit comprises a multiplier, an adder and a register; the multiplier is used for multiplying the elements of the first convolution matrix and the elements of the second convolution matrix obtained by the multiplier and adder unit, and the adder is used for summing the value stored in the register and the product obtained by the multiplier and storing the sum in the register.
Further, the first convolution matrix controller is configured to obtain all non-zero elements of the first convolution matrix and sequentially broadcast and input the non-zero elements to the systolic array, and includes:
the first convolution matrix controller obtains a sparse array of the first convolution matrix, wherein the sparse array comprises k sparse matrices, the k value is the same as the row number of the first convolution matrix, and the elements of each sparse matrix respectively comprise non-zero elements of one row of the first convolution matrix;
and the first convolution matrix controller broadcasts and inputs k elements of the sparse matrix to the systolic array in sequence according to the sequence of the first convolution matrix row.
Further, when the convolution kernel matrix H is the second convolution matrix, the second convolution matrix controller sends the row and column positions of the elements of the second convolution matrix to the multiplier-adder units at the row and column positions corresponding to the systolic array, respectively.
Further, the systolic array is configured to obtain elements of a convolution result matrix P according to the obtained elements of the first convolution matrix and the second convolution matrix, and includes:
the systolic array first acquires the elements of the second convolution matrix, and then acquires one element of one sparse matrix in each clock cycle;
when the systolic array acquires an element of a sparse matrix, each multiplier-adder unit transfers the value of its register, in row-major order, to the register of the h-th multiplier-adder unit after it; h is obtained by comparing the difference between the row index of the sparse-matrix element currently input to the systolic array and the row index of the previously input element with 3, and taking the maximum value;
the multiplier of each multiplier-adder unit multiplies the obtained elements of the sparse matrix and the obtained elements of the second convolution matrix, and the adder of the multiplier-adder unit sums the value stored in the register and the product obtained by the multiplier and stores the obtained sum in the register of the multiplier-adder unit;
when the value stored in the register of the multiplier-adder unit is shifted out of the systolic array, the value is output as an element of the convolution result matrix P.
Further, the systolic array is configured to obtain elements of a convolution result matrix P according to the obtained elements of the first convolution matrix and the second convolution matrix, and further includes:
and when the systolic array has sequentially acquired all elements of one sparse matrix, it completes one calculation cycle; the convolution operation ends when the systolic array has completed k+1 calculation cycles.
Further, when the matrix X to be convolved is the second convolution matrix, the second convolution matrix controller inputs the m rows and n columns of elements of the second convolution matrix to the multiplier-adder units of the systolic array in convolution order, row by row in reverse row order and in forward column order.
Further, the systolic array is configured to obtain elements of a convolution result matrix P according to the obtained elements of the first convolution matrix and the second convolution matrix, and includes:
the systolic array first acquires the elements of the second convolution matrix, and then acquires one element of the sparse matrix in each clock cycle;
when the systolic array acquires an element of the sparse matrix, each multiplier-adder unit transfers the value of its register, in row-major order, to the register of the h-th multiplier-adder unit after it, where h is obtained by comparing the difference between the column index of the sparse-matrix element currently input to the systolic array and the column index of the previously input element with 3, and taking the maximum value;
the multiplier of each multiplier-adder unit multiplies the obtained elements of the sparse matrix and the obtained elements of the second convolution matrix, and the adder of the multiplier-adder unit sums the value stored in the register and the product obtained by the multiplier and stores the obtained sum in the register of the multiplier-adder unit;
when the value stored in the register of the multiplier-adder unit is shifted out of the systolic array, the value is output as an element of the convolution result matrix P.
Further, the systolic array is configured to obtain elements of a convolution result matrix P according to the obtained elements of the first convolution matrix and the second convolution matrix, and further includes:
when the systolic array has sequentially acquired all elements of one sparse matrix, it completes one calculation cycle;
the systolic array completes k calculation cycles after acquiring the m rows and n columns of elements of the second convolution matrix;
and after the systolic array acquires the m rows and n columns of elements of the last second convolution matrix and completes k+1 calculation cycles, the convolution operation ends.
According to the convolution operation device based on a systolic array of the above embodiments, the device comprises a to-be-convolved matrix acquisition device, a first convolution matrix controller, a second convolution matrix controller, and a systolic array. The to-be-convolved matrix acquisition device first sends the matrix to be convolved and the convolution kernel matrix to the first convolution matrix controller and the second convolution matrix controller respectively, and the two controllers then input them into the systolic array for the convolution calculation. Because the first convolution matrix controller inputs only the acquired non-zero elements of its matrix into the systolic array in sequence for the convolution operation, the convolution operation is faster and the occupancy of circuit computing resources can be reduced.
Drawings
FIG. 1 is a block diagram of a convolutional neural network;
FIG. 2 is a schematic diagram of a convolution operation;
FIG. 3 is a schematic diagram of a systolic array;
FIG. 4 is a schematic diagram of the calculation steps of a systolic array;
FIG. 5 is a diagram illustrating an exemplary convolution operation apparatus;
FIG. 6 is a schematic diagram of a convolution calculation of a matrix to be convolved and a convolution kernel matrix;
FIG. 7 is a diagram illustrating an exemplary convolution operation device;
FIG. 8 is a schematic diagram of a matrix to be convolved and a convolution kernel matrix;
FIG. 9 is a diagram illustrating a systolic array acquiring a second convolution matrix in accordance with an embodiment;
FIG. 10 is a diagram illustrating a systolic array acquiring a second convolution matrix in accordance with an embodiment;
FIG. 11 is a diagram illustrating a systolic array acquiring a second convolution matrix in another embodiment;
FIG. 12 is a diagram illustrating the operation of the systolic array in one embodiment;
FIG. 13 is a diagram illustrating the operation of the systolic array in one embodiment.
Detailed Description
The present invention will be described in further detail with reference to the following detailed description and accompanying drawings. Wherein like elements in different embodiments have been given like element numbers associated therewith. In the following description, numerous specific details are set forth in order to provide a better understanding of the present application. However, those skilled in the art will readily recognize that some of the features may be omitted or replaced with other elements, materials, methods in different instances. In some instances, certain operations related to the present application have not been shown or described in this specification in order not to obscure the core of the present application with unnecessary detail, and it is not necessary for those skilled in the art to describe these operations in detail, so that they may be fully understood from the description in the specification and the general knowledge in the art.
Furthermore, the features, operations, or characteristics described in the specification may be combined in any suitable manner to form various embodiments. Also, the various steps or actions in the method descriptions may be transposed or transposed in order, as will be apparent to one of ordinary skill in the art. Thus, the various sequences in the specification and drawings are for the purpose of describing certain embodiments only and are not intended to imply a required sequence unless otherwise indicated where such sequence must be followed.
The numbering of the components as such, e.g., "first", "second", etc., is used herein only to distinguish the objects as described, and does not have any sequential or technical meaning. The term "connected" and "coupled" as used herein includes both direct and indirect connections (couplings), unless otherwise specified.
A convolutional neural network is a feedforward neural network whose artificial neurons respond to surrounding units within a coverage range. It can generally be divided into an input layer, hidden layers, and an output layer, where the hidden layers can be divided into convolutional layers and sampling layers. The structure of the convolutional neural network is explained below with a specific example; please refer to FIG. 1, which is a structural diagram of a convolutional neural network. The network takes an image of a given resolution as input, for example 28 × 28. Convolution layer C1 convolves this image with 6 convolution kernels of 5 × 5 to obtain 6 images of 24 × 24 resolution; bias and activation operations are usually also applied, but these two steps are omitted here for ease of understanding. Sampling layer S2 samples the 6 images of 24 × 24 resolution obtained by C1 to obtain 6 images of 12 × 12 resolution. Convolution layer C3 convolves the 6 images of 12 × 12 resolution obtained by S2 with 12 convolution kernels of 5 × 5 to obtain 12 images of 8 × 8 resolution. Sampling layer S4 samples the 12 images of 8 × 8 resolution obtained by C3 to obtain 12 images of 4 × 4 resolution. The output layer performs fully connected output on the 12 images of 4 × 4 resolution obtained by S4 to obtain 12 items of feature information of the image. The network in this example uses two convolutional layers, and the fully connected output of the output layer is itself a special convolution operation, so the convolution operation is the core of the computation of a convolutional neural network.
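The layer-size arithmetic in this example can be sketched as follows, assuming stride-1 valid convolutions and non-overlapping 2 × 2 sampling:

```python
def conv_out(size, kernel, stride=1):
    # Output side length of a valid (no-padding) convolution.
    return (size - kernel) // stride + 1

def pool_out(size, window=2):
    # Output side length of non-overlapping 2x2 sampling.
    return size // window

s = 28                  # input image: 28 x 28
s = conv_out(s, 5)      # C1: 5 x 5 kernels -> 24 x 24
s = pool_out(s)         # S2: -> 12 x 12
s = conv_out(s, 5)      # C3: 5 x 5 kernels -> 8 x 8
s = pool_out(s)         # final sampling layer: -> 4 x 4
print(s)                # -> 4
```

The same two formulas reproduce every resolution quoted in the example above.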
With the development of mobile devices and the Internet of Things, applying convolutional neural networks to devices with limited hardware resources is a clear trend, and such application scenarios demand low latency, low bandwidth, and low power consumption. However, convolutional neural networks are computation-intensive and storage-intensive, which limits their wide application. Convolutional neural networks also exhibit sparsity. For the feature values (act_in), the Rectified Linear Unit (ReLU) sets negative feature values to 0, making the feature values sparse. For the weights (weight), quantizing floating-point weight values to integers of 8 bits or fewer turns many weights into 0. Some unimportant weights can even be set to 0 directly by pruning, after which the network is retrained to recover accuracy; at very small or even no loss of accuracy, the sparsity of the network is greatly increased, the number of parameters is greatly reduced, and computation and storage are reduced. Many custom CNN hardware accelerators exploit sparsity, but they face many problems: different networks have different sparsity, different layers of the same network have different sparsity, the two data types (feature values and weights) differ in sparsity, and the sparsity of the feature values depends on the actual input.
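The ReLU-induced sparsity described above is easy to illustrate. In this sketch the helper names are illustrative, and sparsity is taken as the fraction of zero elements:

```python
def relu(x):
    # ReLU sets negative feature values to 0, which is the main
    # source of sparsity in the feature maps (act_in).
    return [max(0.0, v) for v in x]

def sparsity(x):
    # Fraction of zero elements.
    return sum(1 for v in x if v == 0) / len(x)

features = [0.7, -1.2, 0.0, 3.1, -0.4, -2.5]
print(sparsity(relu(features)))  # 4 of 6 elements are zero
```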
Please refer to FIG. 2, which is a schematic diagram of a convolution operation, where Xij is an element of the matrix to be convolved, Wij is an element of the convolution kernel matrix, and Yij is an element of the feature matrix. The feature matrix Y is obtained by convolving the matrix X to be convolved with the convolution kernel matrix W. As shown in FIG. 2, a 3 × 3 convolution kernel is convolved with a 6 × 6 input feature map to obtain a 4 × 4 output feature map. The 9 values in the convolution kernel are multiplied by the corresponding values in the input feature map, and the 9 products are summed to form one element of the output feature map. To obtain the next element of the output feature map, the convolution window is slid over the input feature map by a certain step length and the multiply-and-sum operation is repeated, finally yielding the complete output feature map.
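The sliding-window computation of FIG. 2 can be sketched directly. This minimal reference implementation assumes stride 1 and no padding:

```python
def conv2d(x, w):
    # Valid 2-D convolution (cross-correlation form, stride 1):
    # slide the kernel window over the input feature map and take
    # the sum of element-wise products at each window position.
    m, n = len(w), len(w[0])
    rows = len(x) - m + 1
    cols = len(x[0]) - n + 1
    return [[sum(x[r + i][c + j] * w[i][j]
                 for i in range(m) for j in range(n))
             for c in range(cols)]
            for r in range(rows)]

x = [[1] * 6 for _ in range(6)]   # 6 x 6 input feature map
w = [[1] * 3 for _ in range(3)]   # 3 x 3 kernel
y = conv2d(x, w)
print(len(y), len(y[0]))          # -> 4 4  (4 x 4 output feature map)
print(y[0][0])                    # -> 9   (sum of nine products)
```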
The systolic array is a structure proposed in the last century, but it is still widely used in neural network accelerators today. Referring to FIG. 3, a schematic diagram of a systolic array, the array contains as many multiply-add units (MACs) as there are elements Wij in the convolution kernel matrix W. In each calculation cycle, a feature value Xij is broadcast to every MAC in the array, and each MAC stores one element Wij of the convolution kernel matrix W, i.e., one weight value Wij. Each MAC multiplies the feature value Xij by its weight value Wij and sends the result to the adjacent MAC in the next cycle. In each cycle, each MAC adds the product it generates to the partial sum received from the adjacent MAC. The result produced by the final MAC is sent to a FIFO (the triangle shown in the figure) to wait for the next round of computation.
Please refer to FIG. 4, which is a schematic diagram of the calculation steps of a systolic array. In the first calculation cycle, X00 is broadcast to all multiply-add units simultaneously, and MAC1 obtains the product P0_0 = X00 × W00. In the next calculation cycle, X01 is broadcast to all MACs; meanwhile, the P0_0 obtained by MAC1 in the first cycle moves to MAC2. MAC2 computes this cycle's product P0_1 = X01 × W01 and adds it to P0_0, and in the third calculation cycle the sum moves on to MAC3.
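The pipeline of FIG. 4 can be sketched cycle by cycle for a 1 × n row convolution. This is an assumption-level model (zero inputs are fed to drain the pipeline, and the invalid sums produced while the pipeline fills are discarded), not the exact circuit:

```python
def row_systolic(x, w):
    # Each cycle, one input element is broadcast to every MAC;
    # MAC j multiplies it by its stored weight w[j]; partial sums
    # shift one MAC to the right per cycle, and a finished sum
    # leaves the last MAC into the output FIFO.
    n = len(w)
    regs = [0] * n                      # partial-sum register per MAC
    out = []
    for t in range(len(x) + 1):
        xv = x[t] if t < len(x) else 0  # drain the pipeline with zeros
        leaving = regs[-1]              # value shifted out of the last MAC
        regs = [0] + regs[:-1]          # shift partial sums rightward
        for j in range(n):
            regs[j] += xv * w[j]        # broadcast multiply-accumulate
        if n <= t <= len(x):            # sums completed after pipeline fill
            out.append(leaving)
    return out

print(row_systolic([1, 2, 3, 4, 5, 6, 7, 8], [1, 1, 1]))
# -> [6, 9, 12, 15, 18, 21]
```

The six outputs are exactly the sliding sums of three consecutive inputs, matching a 1 × 8 input convolved with a 1 × 3 kernel.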
The embodiment of the invention discloses a convolution operation device based on a systolic array, comprising a device for acquiring the matrix to be convolved, a first convolution matrix controller, a second convolution matrix controller, and a systolic array. The device for acquiring the matrix to be convolved first sends the matrix to be convolved and the convolution kernel matrix to the first convolution matrix controller and the second convolution matrix controller respectively, and the two controllers then input them into the systolic array for the convolution calculation. Because the first convolution matrix controller inputs only the acquired non-zero elements of its matrix into the systolic array in sequence for the convolution operation, the convolution operation is faster and the occupancy of circuit computing resources can be reduced.
Example one
Please refer to fig. 5, which is a schematic structural diagram of a convolution operation apparatus in an embodiment, including a to-be-convolved matrix obtaining apparatus 1, a first convolution matrix controller 2, a second convolution matrix controller 3, and a systolic array 4, configured to perform convolution calculation on a to-be-convolved matrix X and a convolution kernel matrix H to obtain a convolution result matrix P, where the systolic array 4 includes m rows and n columns of multiplier-adder units, n is the same as the number of columns of the convolution kernel matrix H, and m is the same as the number of rows of the convolution kernel matrix H. The device 1 for acquiring the matrix to be convolved is used for sending one matrix of the matrix X to be convolved and the convolution kernel matrix H as a first convolution matrix to the first convolution matrix controller 2, and sending the other matrix as a second convolution matrix to the second convolution matrix controller 3. The first convolution matrix controller 2 is configured to obtain all non-zero elements of the first convolution matrix and broadcast them to the systolic array 4 in sequence. The second convolution matrix controller 3 is configured to sequentially input m rows and n columns of elements of the second convolution matrix to each multiplier-adder unit of the systolic array 4 according to a convolution order. The systolic array 4 is used for obtaining the elements of the convolution result matrix P according to the obtained elements of the first convolution matrix and the second convolution matrix. In an embodiment, the convolution operation apparatus further includes a result matrix obtaining apparatus 5, configured to obtain a convolution result matrix P according to elements of the convolution result matrix P output by the systolic array 4.
When designing a convolutional neural network accelerator, the most central computation module is the multiplier-adder. The operations these multiply-add modules solve are essentially matrix multiplications. When the matrix is sparse, computing it with a general dense matrix-multiplication hardware structure undoubtedly wastes a great deal of resources. This design starts from the mathematics, explores the general rules of the convolution calculation, and develops hardware on that basis to realize multiplication of sparse matrices. The matrix multiplication method adopted in this design is explained below, taking row convolution as an example. Please refer to FIG. 6, which is a schematic diagram of a convolution calculation of a matrix to be convolved and a convolution kernel matrix, comprising a 1 × 8 matrix to be convolved, a 1 × 3 convolution kernel matrix, and a 1 × 6 convolution result matrix. Each value of the 1 × 6 convolution result matrix is a polynomial of products of elements act_in of the 1 × 8 matrix to be convolved and elements weight of the 1 × 3 convolution kernel matrix. Therefore, to determine the value of the element act_out at a fixed position of the 1 × 6 result matrix, one needs to know which act_in × weight products make up the corresponding polynomial at that point. From the operation rule of matrix multiplication and the characteristics of the matrices in FIG. 6, we can draw the following conclusion:
act_out(k) = Σ act_in(i) × weight(j), summed over all i, j with k = i − j, where 0 ≤ j ≤ 2 and 0 ≤ k ≤ 5
wherein i, j, and k respectively denote the positions of the to-be-convolved matrix element act_in, the convolution kernel matrix element weight, and the convolution result matrix element act_out, corresponding to the indices in FIG. 6. With this conclusion, it is no longer necessary to compute with every value of the elements act_in of the matrix to be convolved; knowing only the positions of the non-zero elements act_in is enough to compute the values of the result elements act_out to which those non-zero elements contribute.
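The conclusion above can be checked with a short sketch: a sparse row convolution that visits only the non-zero act_in elements and scatters each product into act_out via k = i − j gives the same result as the dense computation. The function name is illustrative:

```python
def sparse_row_conv(act_in, weight):
    # Every nonzero act_in[i] contributes act_in[i] * weight[j]
    # to act_out[k] with k = i - j, so zero inputs are skipped.
    out_len = len(act_in) - len(weight) + 1
    act_out = [0] * out_len
    for i, a in enumerate(act_in):
        if a == 0:
            continue                    # sparsity: skip zero elements
        for j, w in enumerate(weight):
            k = i - j
            if 0 <= k < out_len:        # position falls inside act_out
                act_out[k] += a * w
    return act_out

act_in = [0, 2, 0, 0, 3, 0, 0, 1]       # sparse 1 x 8 row
weight = [1, 2, 3]                      # 1 x 3 kernel
dense = [sum(act_in[k + j] * weight[j] for j in range(3))
         for k in range(6)]
print(sparse_row_conv(act_in, weight) == dense)  # -> True
```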
Referring to FIG. 7, which is a schematic structural diagram of a convolution operation apparatus according to an embodiment, the systolic array 4 includes a plurality of multiplier-adder units 41, and each multiplier-adder unit 41 includes a multiplier, an adder, and a register. The multiplier multiplies the element of the first convolution matrix and the element of the second convolution matrix acquired by the multiplier-adder unit, and the adder sums the value stored in the register with the product obtained by the multiplier and stores the sum back in the register.
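The multiplier-adder unit described above can be modelled in a few lines. This is a simplified sketch, not the patent's exact circuit: it assumes the shifted-in partial sum from a neighbouring unit and the multiply-accumulate are folded into a single step:

```python
class MacUnit:
    # One multiplier-adder unit: a multiplier, an adder, one register.
    def __init__(self, weight):
        self.weight = weight    # stored element of the second convolution matrix
        self.register = 0       # accumulated partial sum

    def step(self, x, shifted_in):
        # Receive a partial sum shifted in from a neighbouring unit,
        # then add the product of the broadcast element x and the
        # stored weight, storing the sum back in the register.
        self.register = shifted_in + x * self.weight

mac = MacUnit(weight=3)
mac.step(x=2, shifted_in=10)
print(mac.register)  # -> 16
```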
In one embodiment, the device 1 for acquiring a matrix to be convolved is configured to determine, according to the sparsity of the matrix X to be convolved and of the convolution kernel matrix H, which of the two is the first convolution matrix and which is the second: when the sparsity of the matrix X to be convolved is greater than that of the convolution kernel matrix H, the matrix X to be convolved is used as the first convolution matrix and the convolution kernel matrix H as the second convolution matrix; otherwise, the matrix X to be convolved is used as the second convolution matrix and the convolution kernel matrix H as the first convolution matrix. Sparsity here refers to the proportion of zero elements in the matrix. Please refer to fig. 8, which is a schematic diagram of a matrix to be convolved and a convolution kernel matrix, comprising a matrix X to be convolved and a convolution kernel matrix H. The non-zero elements of the 5 × 5 matrix X to be convolved comprise only the 11 elements a, b, c, d, e, f, g, h, i, j, and k, whereas the elements Wij of the 3 × 3 convolution kernel matrix H are all non-zero, so the sparsity of the matrix X to be convolved is greater than that of the convolution kernel matrix H. In general, the matrix to be convolved is sparser than the convolution kernel matrix.
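As a rough behavioural sketch of this role assignment (function names are ours, not from the patent): whichever operand has the higher fraction of zero elements is routed to the first convolution matrix controller.

```python
def sparsity(matrix):
    """Fraction of zero elements; a higher value means a sparser matrix."""
    elems = [e for row in matrix for e in row]
    return sum(1 for e in elems if e == 0) / len(elems)

def assign_roles(x, h):
    """Return (first_convolution_matrix, second_convolution_matrix).

    The sparser operand becomes the first convolution matrix, whose
    non-zero elements are broadcast into the systolic array."""
    return (x, h) if sparsity(x) > sparsity(h) else (h, x)
```

For the fig. 8 example, a 5 × 5 matrix with 11 non-zeros has sparsity 14/25, the all-non-zero 3 × 3 kernel has sparsity 0, so the matrix to be convolved is chosen as the first convolution matrix.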
In one embodiment, the first convolution matrix controller is configured to obtain all non-zero elements of the first convolution matrix and broadcast and input the non-zero elements to the systolic array in sequence, and includes the specific steps of:
the first convolution matrix controller obtains a sparse array of the first convolution matrix, wherein the sparse array comprises k sparse matrices, the k value is the same as the number of rows of the first convolution matrix, and elements of each sparse matrix respectively comprise non-zero elements of one row of the first convolution matrix. And the first convolution matrix controller broadcasts and inputs the elements of the k sparse matrices to the systolic array in sequence according to the sequence of the rows of the first convolution matrix.
As shown in fig. 8, the first convolution matrix controller obtains the sparse array of the 5 × 5 matrix to be convolved, where the sparse array includes a 1st sparse matrix, a 2nd sparse matrix, a 3rd sparse matrix, a 4th sparse matrix, and a 5th sparse matrix, the number of sparse matrices being the same as the number of rows of the 5 × 5 matrix to be convolved, where:
the 1 st sparse matrix comprises elements a and b;
the 2 nd sparse matrix comprises elements c, d, and e;
the 3 rd sparse matrix comprises elements f and g;
the 4 th sparse matrix comprises elements h and i;
the 5 th sparse matrix includes elements j and k.
The elements of each sparse matrix respectively comprise the non-zero elements of one row of the 5 × 5 matrix to be convolved.
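The row-wise sparse array described above can be sketched as follows (a hypothetical helper; the layout is illustrative): each row of the first convolution matrix is reduced to a list of (column, value) pairs for its non-zero entries, and the column index is the position information that later drives the register shifts.

```python
def build_sparse_array(matrix):
    """One sparse matrix per row of the input: the (column, value)
    pairs of that row's non-zero elements, in column order."""
    return [[(col, v) for col, v in enumerate(row) if v != 0] for row in matrix]
```

For the fig. 8 example this yields five lists: two entries for row 1 (a, b), three for row 2 (c, d, e), and two each for rows 3 through 5.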
In an embodiment, the to-be-convolved matrix obtaining apparatus uses the convolution kernel matrix H as the second convolution matrix and the matrix X to be convolved as the first convolution matrix. Please refer to fig. 9, which is a schematic diagram of the systolic array obtaining the second convolution matrix in an embodiment: the second convolution matrix controller sends each element of the second convolution matrix to the multiplier-adder unit at the corresponding row and column position of the systolic array. That is, the elements Wij of the 3 × 3 convolution kernel matrix H shown in fig. 8 are input by the second convolution matrix controller in one-to-one correspondence with the row and column positions of the multiplier-adder units in the systolic array.
The systolic array obtains the elements of the convolution result matrix P according to the obtained elements of the first convolution matrix and the second convolution matrix, and the specific steps include:
1) The systolic array first acquires the elements of the second convolution matrix, and then acquires one element of a sparse matrix in each clock cycle;
2) When the systolic array acquires an element of a sparse matrix, each multiplier-adder unit transfers the value of its register to the register of the h-th multiplier-adder unit after it, in row order from front to back; h is the difference between the column serial number of the sparse-matrix element currently input to the systolic array and that of the previously input element, capped at 3 (the smaller of the two values is taken);
3) The multiplier of each multiplier-adder unit multiplies the obtained elements of the sparse matrix and the obtained elements of the second convolution matrix, the adder of the multiplier-adder unit sums the value stored in the register and the product obtained by the multiplier, and the obtained sum is stored in the register of the multiplier-adder unit;
wherein when the value stored in the register of the multiplier-adder unit is shifted out of the systolic array, the value is output as an element of the convolution result matrix P.
Further, when the systolic array has sequentially acquired all elements of a sparse matrix, it completes one calculation cycle, and the convolution operation ends when the systolic array completes k + 1 calculation cycles. After the first k calculation cycles, partial results still reside in the registers of the multiplier-adder units of the systolic array; one additional calculation cycle shifts these results out.
For convenience of description, the process by which the systolic array obtains the 1 × 6 convolution result matrix is described using the convolution of the 1 × 8 matrix to be convolved with the 1 × 3 convolution kernel matrix shown in fig. 6. Please refer to fig. 10, which is a schematic diagram of the systolic array obtaining the second convolution matrix in an embodiment; the specific steps include:
1) The 3 multiplier-adder units of the systolic array acquire the 3 elements W0, W1, and W2 of the 1 × 3 convolution kernel matrix (the second convolution matrix) in row order, the row and column position of each multiplier-adder unit corresponding to that of the acquired element: the multiplier-adder unit 51 obtains W0, the multiplier-adder unit 52 obtains W1, and the multiplier-adder unit 53 obtains W2. The values of the elements weight of the 1 × 3 convolution kernel matrix then stay fixed, while the elements act_in of the 1 × 8 matrix to be convolved are broadcast to the three multiplier-adder units, one element of the sparse matrix per clock cycle. The 'a' shown in fig. 6 is the first non-zero value among the elements act_in of the 1 × 8 matrix to be convolved, and its position information is '3'.
2) In the first clock cycle, the multiplier-adder units multiply 'a' by 'W0', 'W1', and 'W2' respectively, obtaining partial sums of three elements act_out of the 1 × 6 convolution result matrix, which are stored in the corresponding registers. In the second clock cycle, the second non-zero value 'b' of act_in is input to the systolic array; its position information is '5'. From the second cycle on, the value in each register is determined jointly by the output of the multiplier in the current cycle and the register values of the previous clock cycle. In this example, the register values first shift right by 2 (= 5 − 3) units: the value stored in the register of the first multiplier-adder unit 51 is shifted into the register of the multiplier-adder unit 53, while the values stored in the registers of the multiplier-adder unit 52 and the multiplier-adder unit 53 are shifted out of the systolic array and output as elements of the convolution result matrix P. The output of each multiplier-adder unit is then added to the shifted-in value and stored in the register. The step size index_shift by which the register values of the multiplier-adder units shift is determined by the following equation:
index_shift = min(index_current − index_previous, 3)
where index_shift is the shift step size, index_previous is the position information of the element acquired in the previous clock cycle, and index_current is the position information of the element acquired in the current clock cycle. In this example, index_previous is the position information '3' of 'a' and index_current is the position information '5' of 'b'. Because act_in in this example contains only 3 non-zero values, the row convolution in fig. 6 can be completed in only 3 cycles, whereas the general method would require at least 8 cycles.
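The whole weight-stationary row-convolution scheme of steps 1)–2) can be simulated behaviourally. This is a sketch under stated assumptions, not the patent's implementation: the function name is ours, the cap of 3 is written generically as the kernel width K, and the shift step is taken as the smaller of the position difference and K, consistent with the worked example's shift of 2 (= 5 − 3).

```python
def sparse_row_conv(x, w):
    """Weight-stationary systolic row convolution that skips zero inputs.

    PE j holds w[j]; each non-zero input broadcast at position pos is
    multiplied into every PE, and before the broadcast the partial-sum
    registers shift right by min(pos - prev, K). Values shifted out of
    the last PE are finished elements of the result row."""
    K = len(w)
    out_len = len(x) - K + 1
    out = [0] * out_len
    regs = [0] * K                 # regs[j]: partial sum held by PE j
    prev = None                    # position of the previous non-zero input
    for pos, val in ((i, v) for i, v in enumerate(x) if v != 0):
        if prev is not None:
            for s in range(min(pos - prev, K)):    # capped shift step
                out_idx = prev - (K - 1) + s       # index leaving the last PE
                if 0 <= out_idx < out_len:
                    out[out_idx] = regs[K - 1]
                regs = [0] + regs[:-1]             # shift right by one PE
        for j in range(K):                         # broadcast multiply-accumulate
            regs[j] += val * w[j]
        prev = pos
    if prev is not None:                           # final flush cycle
        for s in range(K):
            out_idx = prev - (K - 1) + s
            if 0 <= out_idx < out_len:
                out[out_idx] = regs[K - 1]
            regs = [0] + regs[:-1]
    return out
```

With illustrative values x = [0, 0, 0, 2, 0, 3, 4, 0] and w = [1, 10, 100], only three broadcast cycles plus the flush are needed, and the result matches a dense evaluation of the same row convolution.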
In an embodiment, the to-be-convolved matrix obtaining apparatus uses the convolution kernel matrix H as the first convolution matrix and the matrix X to be convolved as the second convolution matrix. Please refer to fig. 11, which is a schematic diagram of the systolic array obtaining the second convolution matrix in another embodiment, again for the convolution of a 1 × 8 matrix to be convolved with a 1 × 3 convolution kernel matrix. The second convolution matrix controller inputs the second convolution matrix into the multiplier-adder units of the systolic array in convolution order, the m rows and n columns of elements being input in reverse row order and forward column order. That is, the second convolution matrix controller inputs 3 elements of the 1 × 8 matrix X to be convolved shown in fig. 11 at a time, in convolution order, with the order reversed relative to the order of the multiplier-adder units: it first sends the first group of elements 'a', 'b', and 'c' of the second convolution matrix that take part in the convolution calculation to the systolic array, so that the first multiplier-adder unit in the first row obtains element 'c', the second multiplier-adder unit in the first row obtains element 'b', and the third multiplier-adder unit in the first row obtains element 'a'. The second convolution matrix controller inputs each subsequent group of elements of the second convolution matrix in the same order, and so on.
The convolution kernel matrix H is used as the first convolution matrix: the first convolution matrix controller obtains the sparse array of the convolution kernel matrix H and then sequentially inputs the elements of each sparse matrix in the sparse array to the systolic array. As shown in fig. 11, the sparse array of the first convolution matrix includes one sparse matrix, whose elements are 'W0' and 'W2'.
The systolic array obtains the elements of the convolution result matrix P according to the obtained elements of the first convolution matrix and the second convolution matrix, and the specific steps include:
1) The systolic array first acquires the elements of the second convolution matrix, and then acquires one element of a sparse matrix in each clock cycle. Each multiplier-adder unit of the systolic array acquires its element of the second convolution matrix with the m rows and n columns taken in reverse row order and forward column order. That is, the systolic array first obtains the first group of elements 'a', 'b', and 'c' of the second convolution matrix that take part in the convolution calculation: the first multiplier-adder unit in the first row obtains element 'c', the second obtains element 'b', and the third obtains element 'a'. In the first clock cycle, the systolic array acquires the first element 'W0' of the first sparse matrix of the sparse array; in the second clock cycle, it acquires the second element 'W2'. Each multiplier-adder unit of the systolic array stores the result computed by its multiplier in the first clock cycle in its register.
2) When the systolic array acquires an element of a sparse matrix, each multiplier-adder unit transfers the value of its register to the register of the h-th multiplier-adder unit after it, in row order from front to back, where h is the difference between the column serial number of the sparse-matrix element currently input to the systolic array and that of the previously input element, capped at 3 (the smaller of the two values is taken). In this embodiment the column serial numbers of the non-zero convolution kernel elements weight repeat cyclically: with the non-zero elements 'W0' and 'W2', the first cycle inputs 'W0' and the second cycle inputs 'W2', in which case h is '2'; the third cycle inputs 'W0' again, in which case h is '1'.
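The shift step h for cyclically re-entered column positions admits a small closed form. This is our formula, inferred from the two cases given in the text (a forward step from 'W0' to 'W2' gives h = 2; the wrap-around back to 'W0' gives h = 1), with the cap written generically as the kernel width K:

```python
def cyclic_shift_step(cur_col, prev_col, K=3):
    """Shift step h when the non-zero elements of one kernel row are
    input cyclically: a forward step gives the plain column difference,
    a wrap-around completes the cycle, and h never exceeds K."""
    return (cur_col - prev_col - 1) % K + 1
```

The formula also covers the degenerate case of a single non-zero element re-entering its own column, which yields a full flush of h = K.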
3) The multiplier of each multiplier-adder unit multiplies the obtained elements of the sparse matrix and the obtained elements of the second convolution matrix, the adder of the multiplier-adder unit sums the value stored in the register and the product obtained by the multiplier, and the obtained sum is stored in the register of the multiplier-adder unit;
wherein when the value stored in the register of the multiplier-adder unit is shifted out of the systolic array, the value is output as an element of the convolution result matrix P.
Further, when the systolic array has sequentially acquired all elements of a sparse matrix, it completes one calculation cycle; when the systolic array has acquired the m rows and n columns of elements of a second convolution matrix, it completes k calculation cycles. When the systolic array has acquired the m rows and n columns of elements of the last second convolution matrix and completed k + 1 calculation cycles, the convolution operation ends. In the (k + 1)-th calculation cycle, the values stored in the registers of the multiplier-adder units of the systolic array are output.
In addition, when the convolution operation is performed in the above manner, the requirements of convolution mean that the product of every element of the matrix X to be convolved with every element of the convolution kernel matrix H is not needed, so not every multiplier-adder unit in the systolic array takes part in the operation in every clock cycle. Fig. 12 is an operation schematic diagram of the systolic array in an embodiment in which the convolution kernel matrix H is the second convolution matrix and the matrix X to be convolved is the first convolution matrix; here the systolic array includes 9 multiplier-adder units, each referred to in this application as a PE. The data in the blocks of the PEs indicate the position of each multiplier-adder unit in the systolic array and, correspondingly, the position of the stored element weight of the convolution kernel matrix H: the left number in brackets is the row and the right number the column, so brackets containing 2 and 3 denote the second row, third column. Each multiplier-adder unit obtains the value of the element weight of the convolution kernel matrix H corresponding to its row and column. The elements of the matrix X to be convolved are broadcast to the PEs in row order until all elements of the row have been processed, while the 9 elements weight of the convolution kernel matrix H stored in the multiplier-adder units remain stationary. The table below the systolic array in fig. 12 shows the output of the systolic array at different times; Or-i refers to row i of the systolic array output. For a 3 × 3 convolution kernel, at time T0 the three multiplier-adder units in the first row compute the first partial result of Or-1.
At time T1, the first row of multiplier-adder units starts to compute the first partial result of Or-2, while the second row starts to compute the second partial result of Or-1. The last partial result of Or-1 is therefore not obtained by the third row of the systolic array until time T2, yielding the complete result of Or-1. In this process, some multiplier-adder units of the systolic array do not operate in every calculation period; whether a unit operates is set according to its row and column position in the systolic array.
Fig. 13 is a schematic diagram of the operation of the systolic array in an embodiment in which the convolution kernel matrix H is the first convolution matrix and the matrix X to be convolved is the second convolution matrix; the systolic array again includes 9 multiplier-adder units. The data in the blocks of the PEs represent the position information of the elements of the matrix X to be convolved that are currently undergoing the convolution operation. The elements weight of the convolution kernel matrix H are broadcast to the multiplier-adder units of the PEs cyclically, row by row, in the order of the elements within each row. In the first convolution operation period, the first row of multiplier-adder units stores the third row of elements of the convolution operation matrix of the matrix X to be convolved, the second row stores its second row of elements, and the third row stores its first row of elements. Inputting the convolution operation matrix of the matrix X to be convolved in this way ensures that the row-column order of the elements of the convolution result matrix P output by the systolic array is the same as the output order of the convolution operation mode shown in fig. 12, which reduces additional hardware overhead. After the convolution operation matrix of the matrix X to be convolved finishes the convolution of the first three rows, the systolic array obtains the next convolution operation matrix of the matrix X to be convolved in convolution order and starts the next convolution operation period. A convolution operation period is the period in which the systolic array computes over all elements of the convolution kernel matrix H once.
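The reverse-row loading described above can be sketched as a small mapping (a hypothetical helper, not from the patent): PE row r of an m-row array stores row m − 1 − r of the current convolution window, so the first PE row holds the window's last row.

```python
def load_window_reversed(window):
    """Map PE rows to the rows of a convolution window in reverse order:
    PE row 0 gets the last window row, PE row m-1 gets the first."""
    m = len(window)
    return {pe_row: window[m - 1 - pe_row] for pe_row in range(m)}
```

For a 3-row window this reproduces the assignment in fig. 13: the first PE row stores the third window row, the second stores the second, and the third stores the first.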
In the embodiments of the present application, a convolution operation device based on a systolic array is disclosed, comprising a to-be-convolved matrix acquisition device, a first convolution matrix controller, a second convolution matrix controller, and a systolic array. The to-be-convolved matrix acquisition device first sends the matrix to be convolved and the convolution kernel matrix to the first and second convolution matrix controllers respectively, and the two controllers then feed them into the systolic array for convolution calculation. Because the first convolution matrix controller inputs only the acquired non-zero elements of its matrix into the systolic array, in sequence, the convolution operation is faster and the occupancy of circuit computing resources is reduced.
The present invention has been described with reference to specific examples, which are intended only to aid understanding of the invention and not to limit it. For a person skilled in the art to which the invention pertains, several simple deductions, modifications, or substitutions may be made according to the idea of the invention.

Claims (10)

1. A convolution operation device based on a systolic array, characterized in that the convolution operation device is used for performing convolution calculation on a matrix X to be convolved and a convolution kernel matrix H to obtain a convolution result matrix P, and comprises a to-be-convolved matrix obtaining device, a first convolution matrix controller, a second convolution matrix controller, and a systolic array, wherein the systolic array comprises m × n multiplier-adder units arranged in m rows and n columns, the value of n being the same as the number of columns of the convolution kernel matrix H and the value of m being the same as the number of rows of the convolution kernel matrix H;
the device for acquiring the matrix to be convolved is used for sending one matrix of the matrix X to be convolved and the convolution kernel matrix H serving as a first convolution matrix to the first convolution matrix controller, and sending the other matrix serving as a second convolution matrix to the second convolution matrix controller;
the first convolution matrix controller is used for acquiring all non-zero elements of the first convolution matrix and broadcasting and inputting all non-zero elements to the systolic array in sequence;
the second convolution matrix controller is used for sequentially inputting m rows and n columns of elements of the second convolution matrix to each multiplier-adder unit of the systolic array according to a convolution sequence;
and the systolic array is used for acquiring elements of a convolution result matrix P according to the acquired elements of the first convolution matrix and the second convolution matrix.
2. The convolution operation apparatus according to claim 1, wherein the to-be-convolved matrix obtaining means is configured to determine, according to sparsity of the to-be-convolved matrix X and a convolution kernel matrix H, that one of the to-be-convolved matrix X and the convolution kernel matrix H is the first convolution matrix, and the other matrix is the second convolution matrix;
when the sparsity of the matrix X to be convolved is greater than the sparsity of the convolution kernel matrix H, taking the matrix X to be convolved as the first convolution matrix and taking the convolution kernel matrix H as the second convolution matrix;
and otherwise, taking the matrix X to be convolved as the second convolution matrix and taking the convolution kernel matrix H as the first convolution matrix.
3. The convolution operation device of claim 1, wherein the multiplier-adder unit includes a multiplier, an adder, and a register; the multiplier is used for multiplying the elements of the first convolution matrix and the elements of the second convolution matrix obtained by the multiplier and adder unit, and the adder is used for summing the value stored in the register and the product obtained by the multiplier and storing the sum in the register.
4. The convolutional arithmetic device of claim 3, wherein the first convolutional matrix controller is configured to obtain all non-zero elements of the first convolutional matrix and broadcast them to the systolic array in turn, comprising:
the first convolution matrix controller obtains a sparse array of the first convolution matrix, wherein the sparse array comprises k sparse matrices, the k value is the same as the row number of the first convolution matrix, and the elements of each sparse matrix respectively comprise non-zero elements of one row of the first convolution matrix;
and the first convolution matrix controller broadcasts and inputs the elements of the k sparse matrices to the systolic array in sequence, in the order of the rows of the first convolution matrix.
5. The convolution operation apparatus according to claim 4, wherein when the to-be-convolved matrix obtaining apparatus sets the convolution kernel matrix H as the second convolution matrix, the second convolution matrix controller sends the row and column positions of the elements of the second convolution matrix to the multiplier-adder units in the row and column positions corresponding to the systolic array, respectively.
6. The convolution operation apparatus as claimed in claim 5, wherein the systolic array is configured to obtain elements of a convolution result matrix P according to the obtained elements of the first convolution matrix and the second convolution matrix, and includes:
the systolic array first acquires the elements of the second convolution matrix, and then acquires one element of one sparse matrix in each clock cycle;
when the systolic array acquires an element of the sparse matrix, each multiplier-adder unit sequentially transfers the value of its register to the register of the h-th multiplier-adder unit after it, in row order from front to back; h is the difference between the column serial number of the sparse-matrix element currently input into the systolic array and that of the previously input element, capped at 3 (the smaller of the two values is taken);
the multiplier of each multiplier-adder unit multiplies the acquired element of the sparse matrix by the acquired element of the second convolution matrix, and the adder of the multiplier-adder unit sums the value stored in the register with the product obtained by the multiplier and stores the obtained sum in the register of the multiplier-adder unit;
when the value stored in the register of the multiplier-adder unit is shifted out of the systolic array, the value is output as an element of the convolution result matrix P.
7. The convolution operation apparatus according to claim 6, wherein the systolic array is configured to obtain elements of a convolution result matrix P according to the obtained elements of the first convolution matrix and the second convolution matrix, and further includes:
and when the systolic array sequentially acquires all elements of one sparse matrix, the systolic array completes one calculation cycle, and the convolution operation ends when the systolic array completes k + 1 calculation cycles.
8. The convolution operation apparatus according to claim 4, wherein when the to-be-convolved matrix obtaining means sets the matrix X to be convolved as the second convolution matrix, the second convolution matrix controller sequentially inputs the m rows and n columns of elements of the second convolution matrix into each multiplier-adder unit of the systolic array in convolution order, the m rows and n columns of elements being input in reverse row order and forward column order.
9. The convolution operation apparatus according to claim 8, wherein the systolic array is configured to obtain elements of a convolution result matrix P according to the obtained elements of the first convolution matrix and the second convolution matrix, and includes:
the systolic array first acquires the elements of the second convolution matrix, and then acquires one element of the sparse matrix in each clock cycle;
when the systolic array acquires an element of the sparse matrix, each multiplier-adder unit sequentially transfers the value of its register to the register of the h-th multiplier-adder unit after it, in row order from front to back, wherein h is the difference between the column serial number of the sparse-matrix element currently input into the systolic array and that of the previously input element, capped at 3 (the smaller of the two values is taken);
the multiplier of each multiplier-adder unit multiplies the obtained elements of the sparse matrix and the obtained elements of the second convolution matrix, and the adder of the multiplier-adder unit sums the value stored in the register and the product obtained by the multiplier and stores the obtained sum in the register of the multiplier-adder unit;
when the value stored in the register of the multiplier-adder unit is shifted out of the systolic array, the value is output as an element of the convolution result matrix P.
10. The convolution operation apparatus according to claim 9, wherein the systolic array is configured to obtain elements of a convolution result matrix P according to the obtained elements of the first convolution matrix and the second convolution matrix, and further includes:
when the systolic array sequentially acquires all elements of one sparse matrix, the systolic array completes one calculation cycle;
the systolic array completes k calculation cycles after acquiring the m rows and n columns of elements of one second convolution matrix;
and when the systolic array has acquired the m rows and n columns of elements of the last second convolution matrix and completed k + 1 calculation cycles, the convolution operation ends.
CN202010447090.4A 2020-05-25 2020-05-25 Convolution operation device based on pulsation array Active CN111652360B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010447090.4A CN111652360B (en) 2020-05-25 2020-05-25 Convolution operation device based on pulsation array

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010447090.4A CN111652360B (en) 2020-05-25 2020-05-25 Convolution operation device based on pulsation array

Publications (2)

Publication Number Publication Date
CN111652360A CN111652360A (en) 2020-09-11
CN111652360B true CN111652360B (en) 2023-03-14

Family

ID=72349708

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010447090.4A Active CN111652360B (en) 2020-05-25 2020-05-25 Convolution operation device based on pulsation array

Country Status (1)

Country Link
CN (1) CN111652360B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112614040B (en) * 2020-12-16 2021-09-21 上海壁仞智能科技有限公司 Method, computing device and computer-readable storage medium for convolution calculation
CN112632464B (en) * 2020-12-28 2022-11-29 上海壁仞智能科技有限公司 Processing device for processing data
CN113076521B (en) * 2021-06-03 2021-09-21 沐曦集成电路(上海)有限公司 Reconfigurable architecture method based on GPGPU and computing system
CN113435586B (en) * 2021-08-03 2021-11-30 北京大学深圳研究生院 Convolution operation device and system for convolution neural network and image processing device
CN116108902B (en) * 2023-02-22 2024-01-05 成都登临科技有限公司 Sampling operation implementation system, method, electronic device and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109598335A (en) * 2018-12-04 2019-04-09 郑州云海信息技术有限公司 A kind of two-dimensional convolution systolic array architecture and implementation method
CN110543934A (en) * 2019-08-14 2019-12-06 北京航空航天大学 Pulse array computing structure and method for convolutional neural network
CN110674927A (en) * 2019-09-09 2020-01-10 之江实验室 Data recombination method for pulse array structure

Also Published As

Publication number Publication date
CN111652360A (en) 2020-09-11

Similar Documents

Publication Publication Date Title
CN111652360B (en) Convolution operation device based on systolic array
US10824934B2 (en) Methods and apparatus for matrix processing in a convolutional neural network
CN108364064B (en) Method, device and system for operating neural network
CN110543939B (en) Hardware acceleration device for FPGA-based convolutional neural network backward training
CN114781629B (en) Hardware accelerator of convolutional neural network based on parallel multiplexing and parallel multiplexing method
WO2022134465A1 (en) Sparse data processing method and device for accelerating reconfigurable processor operation
CN110766128A (en) Convolution calculation unit, calculation method and neural network calculation platform
KR102396447B1 (en) Deep learning apparatus for ANN with pipeline architecture
CN115423081A (en) FPGA-based neural network accelerator for the CNN_LSTM algorithm
Duan et al. Energy-efficient architecture for FPGA-based deep convolutional neural networks with binary weights
CN111882028B (en) Convolution operation device for convolution neural network
CN111652359B (en) Multiplier array for matrix operations and multiplier array for convolution operations
CN107368459B (en) Scheduling method of reconfigurable computing structure based on arbitrary dimension matrix multiplication
CN111222090B (en) Convolution calculation module, neural network processor, chip and electronic equipment
CN115879530B (en) Array structure optimization method for RRAM (resistive random-access memory) in-memory computing systems
CN112101510A (en) Convolutional neural network inference acceleration method, accelerator, device and storage medium
CN110716751B (en) High-parallelism computing platform, system and computing implementation method
CN109190755B (en) Matrix conversion device and method for neural network
CN111723906A (en) Accelerated calculation method and system of recurrent neural network and related device
CN116702851A (en) Systolic array unit and systolic array structure suitable for weight-multiplexing neural networks
CN113283591B (en) Efficient convolution implementation method and device based on Winograd algorithm and approximate multiplier
US11928176B2 (en) Time domain unrolling sparse matrix multiplication system and method
JPH076146A (en) Parallel data processing system
US20220101083A1 (en) Methods and apparatus for matrix processing in a convolutional neural network
CN110555519B (en) Low-complexity convolutional neural network architecture based on symbol random calculation
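The documents above cluster around one technique: mapping convolution onto a systolic array ("pulsation array" and "pulse array" are machine translations of 脉动阵列, i.e. systolic array), typically by flattening input patches into a matrix (im2col) and streaming operands through a grid of multiply-accumulate cells. As illustrative context only — none of this code comes from the patent, and all names are invented — here is a minimal Python simulation of an output-stationary systolic matrix multiply applied to 2-D convolution:

```python
# Minimal sketch: 2-D convolution as im2col + an output-stationary
# systolic matrix multiply. Illustrative only, NOT the patent's design.

def im2col(image, k):
    """Unroll every k x k patch of a 2-D image (list of lists) into a row."""
    h, w = len(image), len(image[0])
    return [[image[i + di][j + dj] for di in range(k) for dj in range(k)]
            for i in range(h - k + 1) for j in range(w - k + 1)]

def systolic_matmul(A, B):
    """Cycle-by-cycle simulation of an M x N output-stationary PE grid
    computing C = A @ B with skewed (staggered) operand injection:
    PE (i, j) consumes A[i][k] and B[k][j] at cycle t = i + j + k."""
    M, K = len(A), len(A[0])
    N = len(B[0])
    C = [[0] * N for _ in range(M)]
    for t in range(M + N + K - 1):          # total pipeline cycles
        for i in range(M):
            for j in range(N):
                k = t - i - j               # operand index arriving now
                if 0 <= k < K:
                    C[i][j] += A[i][k] * B[k][j]  # one MAC per PE per cycle
    return C

def conv2d(image, kernel):
    """Valid 2-D convolution (no kernel flip) via im2col + systolic matmul."""
    k = len(kernel)
    A = im2col(image, k)                    # each patch becomes a row
    B = [[kernel[i][j]] for i in range(k) for j in range(k)]  # kernel column
    flat = systolic_matmul(A, B)
    out_w = len(image[0]) - k + 1
    return [[flat[r * out_w + c][0] for c in range(out_w)]
            for r in range(len(flat) // out_w)]

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]
print(conv2d(image, kernel))  # → [[6, 8], [12, 14]]
```

The output-stationary choice (each partial sum stays pinned to one PE while operands flow past) is only one of the dataflows these patents use; weight-stationary variants instead pin the kernel values to the PEs.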

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant