CN110674927A - Data reorganization method for systolic array structure - Google Patents

Data reorganization method for systolic array structure

Info

Publication number
CN110674927A
CN110674927A (application CN201910857692.4A)
Authority
CN
China
Prior art keywords
data
convolution
row
input
feature map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201910857692.4A
Other languages
Chinese (zh)
Inventor
胡塘 (Hu Tang)
徐志伟 (Xu Zhiwei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhijiang Laboratory
Original Assignee
Zhijiang Laboratory
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhijiang Laboratory filed Critical Zhijiang Laboratory
Priority to CN201910857692.4A priority Critical patent/CN110674927A/en
Publication of CN110674927A publication Critical patent/CN110674927A/en
Withdrawn legal-status Critical Current

Links

Images

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Abstract

The invention discloses a data reorganization method for a systolic array structure. An input feature map is first read from an off-chip DDR memory bank into an on-chip buffer in the original NCHW data format. The feature map is then read out with the two-dimensional plane size of the convolution kernel as the basic unit: following the movement of the stride S, the feature-map data required by convolution are supplied to each row input port of the systolic array, while each convolution kernel supplies its weight coefficients to the corresponding column input port. Each PE unit of the systolic array performs the convolution of the feature-map data with the corresponding weight coefficients, and the convolution results are output in sequence. The method saves the software and hardware overhead of data reorganization, simplifies the design complexity of data scheduling and reorganization, improves the timing of the data reorganization circuit, and reduces the number of off-chip DRAM accesses, thereby lowering the overall power consumption of the systolic array.

Description

Data reorganization method for systolic array structure
Technical Field
The invention relates to the field of chip design, and in particular to a data reorganization method for a systolic array structure.
Background
CNN-based deep learning is an important field of current artificial intelligence. In the various CNN accelerators, the PE array computing unit occupies the core position, and its array scale, data flow mode, multi-precision support, energy-efficiency ratio, peak computing power, and so on serve as performance indicators of the whole accelerator. Among them, the CNN accelerator based on the systolic array structure stands out because its input feature maps, weight coefficients, and output results can be transferred in a simple and highly efficient systolic manner, and the highest working clock F_max of the whole array is markedly raised, which markedly improves overall performance. The current commercial TPU chip greatly surpasses various CPUs, GPUs, and other accelerator chips in performance, making it a research hotspot in the CNN field. The advantage of the systolic array architecture is that the operands and intermediate results participating in internal operations can be pipelined in parallel at high speed; however, this advantage cannot be exploited without the correct and efficient operation of the data reorganization unit outside the systolic array, otherwise the whole array will wait or idle and the true effective computing power will be far below the nominal peak. However, the prior art does not disclose how a data reorganization unit outside the TPU systolic array performs data interaction with the array, nor how to organize the input feature map, weight coefficients, and so on effectively according to definite rules.
Chinese patent application No. CN201811641086 discloses a reconfigurable implementation based on a systolic array structure, but gives little description of the scheduler and of data reorganization. From the analysis of public data, the row- and column-direction inputs of its systolic array all adopt the NHWC format common in the art. Because the original data of the application layer are generally provided in NCHW order, i.e., all pixels of one channel are stored contiguously before the next channel begins, a format conversion is required, which occupies a large amount of hardware overhead if implemented in hardware and a large amount of CPU resources if implemented in software. In addition, the method cannot exploit the local data reuse of the convolution-window "receptive field" under the movement of the stride S: for convolution-kernel two-dimensional plane sizes such as 3x3, 5x5, 7x7 and 11x11, the contiguously stored data within the window region overlap heavily between adjacent positions, and this property cannot be used to reduce the number of off-chip DRAM accesses, improve throughput efficiency, and lower power consumption.
Chinese patent application No. CN201810391781 discloses how to implement systolic arrays automatically from high-level programs, using double buffering and SIMD to improve performance. However, the input feature map interface, the weight-coefficient input interface, and the convolution-result output interface each have only one port, which severely limits the transmission bandwidth between the systolic array and the external data reorganization unit; the systolic array then repeatedly waits for data input or result output, becoming the bottleneck of overall performance.
Therefore, finding a data reorganization method and designing a data reorganization circuit with a relatively simple logic structure, one that can match the highest working clock F_max inside the array, is particularly important.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a data reorganization method for a systolic array structure. It follows the NCHW format of the original data, reads the input feature map with the two-dimensional plane size of the convolution kernel as the basic unit, supplies the feature-map data required by convolution to each row input port of the systolic array along the moving direction of the stride S, and lets each convolution kernel supply its weight coefficients to the corresponding column input port, thereby saving the cost of converting NCHW to NHWC and reducing the off-chip DRAM access frequency. The specific technical scheme is as follows:
A data reorganization method for a systolic array structure, wherein the systolic array has n rows x m columns and both the input feature map and the convolution kernels have N channels, characterized by comprising the following steps:
S1: first, read the input feature map from the off-chip DDR memory bank into a buffer according to the NCHW format of the original data;
S2: in the horizontal direction, following the NCHW format of the original data, read the feature-map data of the 0th channel with the convolution kernel's two-dimensional plane size on the 0th channel as the basic unit, and supply the feature-map data required by convolution to the row input ports of the systolic array, row by row from left to right along the moving direction of the stride S; then read the feature-map data of the 1st channel with the convolution kernel's two-dimensional plane size on the 1st channel as the basic unit and supply them in the same way; and so on, until the feature-map data of the (N-1)th channel are read with the convolution kernel's two-dimensional plane size on the (N-1)th channel as the basic unit and supplied to each row input port of the systolic array, row by row from left to right along the moving direction of the stride S;
each datum then propagates along the row direction all the way to the rightmost column;
S3: in the vertical direction, following the NCHW format, the weight coefficients of the 0th to (N-1)th channels of each convolution kernel are transmitted in sequence to the corresponding column input ports of the systolic array and passed down along the column direction to the bottommost row;
S4: each PE unit of the systolic array performs the convolution of the input feature map with the corresponding weight coefficients; the PE units obtain partial-sum results row by row, i.e., all PEs of the same row obtain their respective partial convolution results at the same time, and each row obtains its partial results one beat earlier than the next adjacent row;
S5: the m columns output in parallel; for each column, the convolution result of moving the stride S 0 times is output first, then the result of moving the stride S once, and so on, until the result of moving the stride S (n-1) times is output.
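As a rough behavioral illustration of steps S1 and S2 (not the patent's circuit), the sketch below gathers, for each stride-S position of the convolution window, the K_x*K_y*C feature-map values in NCHW order and distributes the windows over the row input ports; the function name and the round-robin assignment of windows to rows are assumptions made for illustration only.

```python
def feed_row_ports(fmap, kx, ky, stride, num_rows):
    """fmap is a C x H x W nested list. For each stride-S position of the
    convolution window, gather the Kx*Ky*C values in NCHW order (channel-
    major, then window row, then window column) and assign the windows
    round-robin to the row input ports (an illustrative assumption)."""
    C, H, W = len(fmap), len(fmap[0]), len(fmap[0][0])
    streams = [[] for _ in range(num_rows)]
    pos = 0
    for y in range(0, H - ky + 1, stride):      # window moves with stride S
        for x in range(0, W - kx + 1, stride):
            window = [fmap[c][y + j][x + i]
                      for c in range(C) for j in range(ky) for i in range(kx)]
            streams[pos % num_rows].append(window)
            pos += 1
    return streams
```

On a 1-channel 4x4 map with a 3x3 kernel and stride 1 there are four window positions, one per row port; each window is a flat list of 9 values in the NCHW window order described later in the extraction rule.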
Further, the byte address label, within the buffer of step S1, of the input feature map data in the horizontal direction is obtained by formula (1):
[Formula (1) is shown as image BDA0002195945740000031 in the original and is not reproduced here.]
Here initial_middle_offset_pixel denotes the byte address label, in the input feature map buffer, of the central pixel of the current round's initial convolution window; at the very first starting stage its value equals 0, and the initial_middle_offset_pixel of the next round is the offset_pixel obtained in the previous round, iterating in a loop. P(window_z, window_y, window_x) denotes the real-time position of the convolution window in the input feature map cube, where window_z is the coordinate in the width W direction, window_y the coordinate in the height H direction, and window_x the coordinate in the channel depth C direction; K_x and K_y are the two-dimensional sizes of the convolution kernel. The result of formula (1), the byte address label offset_pixel, gives the first address of a contiguous segment of feature-map data in the input feature map buffer; the offset_pixel of the current round serves as the initial_middle_offset_pixel of the next round, and so on in a loop.
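Formula (1) itself is available only as an image in the source. As a hedged stand-in, the sketch below shows a generic NCHW byte-offset computation consistent with the variables described above (window_z along W, window_y along H, window_x along C); it is not the patent's actual formula (1).

```python
def nchw_offset(window_x, window_y, window_z, H, W, bytes_per_pixel=1):
    """Generic NCHW byte offset of the pixel at channel window_x, height
    window_y, width window_z in a C x H x W feature map, one byte per pixel
    by default. NOT the patent's formula (1), whose image is unavailable."""
    return ((window_x * H + window_y) * W + window_z) * bytes_per_pixel
```

Whatever its exact form, formula (1) must reduce to such channel-major linear addressing, since the buffer stores the feature map in NCHW order.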
Further, when the stride S is 1 or 2, the data delivered to each row input port of the systolic array are drawn from three fixed sources through the following circuit process, so as to simplify the data-path implementation:
Two groups of buffers are used for shifting: one group is fixed to shift right, namely buffer0, and the other is fixed to shift left, namely buffer1, the two forming a parallel pipeline structure. The required right-shift and left-shift amounts are determined by a modulo operation on the byte address label obtained from formula (1). Finally, according to the byte address label from formula (1) and the current input feature map size, the value of each beat of feature-map data stored in each FIFO queue is selected from the following three sources:
(1) buffer0 after right-shift processing;
(2) buffer1 after left-shift processing;
(3) zero padding according to the Padding rule.
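A behavioral sketch of the three-source selection (assumed function names, not RTL): the bytes for one FIFO beat always lie within the two consecutive 128-byte rows buffer0 and buffer1, and boundary positions are zero-filled per the Padding rule; the shift circuits in hardware realize what the list concatenation emulates here.

```python
def select_beat(buffer0, buffer1, first_byte, count=64, pad_mask=None):
    """buffer0/buffer1 are the two consecutive 128-byte buffer rows (buffer0
    has the smaller address). Each output byte comes from the right-shifted
    buffer0 region, the left-shifted buffer1 region, or is zeroed by Padding."""
    joined = list(buffer0) + list(buffer1)   # emulates the two shift paths
    out = []
    for i in range(count):
        if pad_mask and pad_mask[i]:
            out.append(0)                    # source (3): zero padding
        else:
            out.append(joined[first_byte + i])  # source (1) or (2)
    return out
```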
The invention has the following beneficial effects:
the data recombination method of the invention transmits the input characteristic diagram processed by the movement of the step length S to the input ports of corresponding rows of the systolic array in an NCHW sequencing mode by analyzing the characteristics that a convolution window of a convolution kernel has a large amount of data repetition along with the movement of the step length S on a two-dimensional plane of the input characteristic diagram, and continuous relation exists among data blocks and the data repetition is consistent with the NCHW storage format of the most original input characteristic diagram, and solves the problems of high fan-out of signals, massive vector signal connection, complex control logic and the like in a data recombination system by replacing a right shift circuit and a left shift circuit of parallel flow. The method flexibly supports various CNN algorithms of various current mainstream convolution kernel sizes, input characteristic diagram sizes, different step length S and Padding rule combinations, saves software and hardware expenses for converting NCHW into NHWC format in the current common implementation method, simplifies the complexity of data scheduling and reorganization design, optimizes the time sequence of a data reorganization circuit, and simultaneously reduces the number of times of accessing an off-chip DRAM so as to reduce the overall power consumption of a pulse array.
Drawings
FIG. 1 is an architectural diagram of a data reorganization system for a systolic array structure;
FIG. 2 is a data flow diagram of the three types of interface channels of the systolic array;
FIG. 3 is a data-pattern diagram of the convolution window moved 0 to 63 times by stride S;
FIG. 4 is a schematic diagram of the buffer storage order of the input feature map;
FIG. 5 is a schematic diagram of stride-S movement and Padding processing of the input feature map on a two-dimensional plane;
FIG. 6 is a schematic diagram of the input sources of the 64 FIFO queues;
FIG. 7 is a diagram of the parallel pipelined shift process of the fixed right-shift buffer0.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and preferred embodiments, so that its objects and effects become clearer; it should be understood that the specific embodiments described herein merely illustrate the invention and are not intended to limit it.
The invention discloses a data reorganization method and system for a systolic array structure. It fully exploits the large amount of data repetition that occurs as the convolution window moves with stride S over the two-dimensional plane of the input feature map, and optimizes the data-path design with parallel pipelined right-shift and left-shift circuits, so as to simplify data scheduling and reorganization, raise the F_max of the data reorganization circuit, reduce the number of off-chip DRAM accesses, and lower the overall power consumption of the systolic array.
First, technical term explanations are given:
(1) CNN: Convolutional Neural Network.
(2) TPU: Tensor Processing Unit.
(3) Stride: the moving step of the convolution kernel over the input feature map, abbreviated as stride S.
(4) PE: Processing Element, the arithmetic processing unit.
(5) Systolic Array: one way of organizing the PE array units in a CNN accelerator.
(6) F_max: the highest working clock frequency; optimizing circuit timing helps raise F_max.
(7) NCHW, NHWC: two common tensor storage formats.
In NCHW storage, the element of row 0, column 0 of the 0th channel comes first, then the element of row 0, column 1, and so on along the horizontal direction to the last column; then row 1, column 0, and so on until the bottom-right element of the channel. After the 0th channel is stored completely, the 1st channel follows in the same way, up to channel C-1.
In NHWC storage, the element of row 0, column 0 of the 0th channel comes first, then row 0, column 0 of the 1st channel, through row 0, column 0 of channel C-1; then row 0, column 1 of the 0th channel along the horizontal direction, and so on.
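The two orderings in item (7) can be contrasted with plain index arithmetic; a minimal sketch for a single sample (N = 1), with function names chosen here for illustration:

```python
def nchw_index(c, h, w, C, H, W):
    """Flat offset of element (c, h, w) under NCHW: channels outermost,
    so one channel's full H x W plane is stored before the next channel."""
    return (c * H + h) * W + w

def nhwc_index(c, h, w, C, H, W):
    """Flat offset of element (c, h, w) under NHWC: channels innermost,
    so all C channel values of one pixel are stored together."""
    return (h * W + w) * C + c
```

For a 2-channel 2x2 map, NCHW stores channel 0's four pixels first, while NHWC interleaves the two channels pixel by pixel, which is exactly why a format conversion is needed when the array consumes NHWC but the application supplies NCHW.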
(8) Following the general convention in chip design, numbering in this description starts from 0 rather than 1, e.g., row 0 and column 0; addresses are likewise numbered from 0x0000.
as shown in FIG. 1, the present invention describes the system framework structure of the whole invention by taking a pulsation array of 64x64 as an example. In the system, the transmission rate of 12.8GB/s can be achieved between an off-chip DDR memory bank and an on-chip DDR controller through a high-speed DDR interface such as DDR 3-1600; the input characteristic diagram enters a data recombination subsystem consisting of an input characteristic diagram buffer, an input characteristic diagram data recombination processing unit and 64 input characteristic FIFO queues after data recombination, and then is transmitted to a 64 multiplied by 64 pulse array PE operation unit according to a certain rule for operation, and a convolution operation result is output according to a certain rule.
In a CNN, the input feature map and the convolution kernels participating in the computation are both presented as three-dimensional cubes; in the actual convolution computation, however, a cube is decomposed into multiple two-dimensional planes along the channel direction, and when the computation is realized in a hardware circuit, e.g., for data storage and transmission, it must further be converted into one-dimensional form.
As the module for intensive computation, the 64x64 systolic-array PE unit uses parallel pipelining between adjacent PEs; its data-path logic and control logic are simple and the fan-out of each signal is small, so the internal highest working clock F_max, and with it the peak computing power of the whole systolic array, can be raised markedly.
But exploiting the peak computing power of a systolic array requires the correct, high-speed operation of its external interfaces, which handle three types of data paths: horizontal input, vertical input, and vertical output.
Input in the vertical direction: the weight coefficients of the 64 convolution kernels at the top correspond to the respective columns, and the coefficients are passed down beat by beat according to fixed rules. The convolution kernels are relatively fixed during the operation of each convolution layer, and their two-dimensional plane sizes are relatively small, e.g., 3x3, 5x5, 7x7 and 11x11; the scheduling order of the weight coefficients is closely tied to the data reorganization of the input feature map in the horizontal direction.
Output in the vertical direction: the partial sums and results processed by the PE units of the systolic array are output from the top of the array according to fixed rules; each column corresponds to its own convolution kernel, and the output rule is closely tied to the data reorganization of the input feature map in the horizontal direction.
Input in the horizontal direction: the input feature map of each layer is read from the off-chip DRAM through the DDR controller and first sent to the input feature map buffer, which improves DDR read efficiency and promptly yields control of the DDR controller to other ports; to further improve overall throughput, the buffer is designed as a ping-pong double buffer. The data entering the input feature map buffer are rearranged by the input feature map data reorganization unit, processed separately for stride-S movements 0 through 63, stored in the 64 corresponding input-feature FIFO queues, and delivered one by one to the corresponding rows of the systolic array.
Therefore, the invention provides a data reorganization method for a systolic array structure, which raises the peak computing power of the whole systolic array through a well-designed reorganization of the input feature map in the horizontal direction. In the method, the systolic array has n rows x m columns, and both the input feature map and the convolution kernels have N channels. The method specifically includes the following steps:
S1: first, read the input feature map from the off-chip DDR memory bank into a buffer according to the NCHW format of the original data;
(1) CNN convolution calculation input data rule analysis
In the horizontal direction of the systolic array, as the stride S advances, the two-dimensional plane of each channel of the convolution kernel is convolved with the corresponding region, the "receptive field", on the two-dimensional plane of the input feature map. The data pattern of the input feature map under the moving convolution window is shown in FIG. 3: when the stride S has moved 0 times, the 9 data in the dashed box at the top-left corner of FIG. 3 participate in the convolution; after 1 move, the 9 data in the dashed box whose top-left corner starts at the letter "B" participate; and so on, until after 63 moves the 9 data in the dashed box starting at the letter "P" participate.
FIG. 3 shows that there is a large amount of data duplication between adjacent stride positions and that their storage in NCHW format is contiguous. Taking FIG. 3 as an example, in the two consecutive dashed boxes at the top-left corner the letters "B, C, B', C', B'', C''" are repeated, and the 64 dashed boxes are completely contiguous. Using this pattern, the contiguous data from the initial letter "A" of row 0, column 0 along the horizontal direction through the letter "r'" of row 4 can be fetched from the DDR3 memory at one time, which improves off-chip access efficiency, reduces the number of off-chip DDR reads, and promptly releases DDR access control to other ports that urgently need it, benefiting overall system performance.
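The data reuse described above can be quantified with a short sketch: after one horizontal move of stride S, a K_x*K_y window shares K_x - S of its K_x columns with the previous window (when S is smaller than K_x).

```python
def reuse_fraction(kx, ky, stride):
    """Fraction of a Kx*Ky convolution window's pixels that were already
    fetched by the previous window after one horizontal move of stride S."""
    shared_cols = max(kx - stride, 0)
    return (shared_cols * ky) / (kx * ky)
```

For the 3x3, S=1 case of FIG. 3 this is 6/9; for an 11x11 kernel at S=1 it exceeds 90%, which is why batching contiguous reads cuts off-chip DDR traffic so effectively.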
(2) Input feature map buffer design
Because of its large data volume, the input feature map of each CNN layer is generally stored in the external DDR main memory; at the start of each layer's convolution, the feature map must therefore be read from the external DDR memory into an on-chip buffer to speed up the computation. To further improve overall throughput, the input feature map buffer is designed as a ping-pong double buffer: while the systolic array computes on buffer A, the next batch of the input feature map can be carried in parallel, ahead of time, from the external DDR memory into buffer B; when the computation on buffer A finishes, computation switches to buffer B while buffer A prepares the next batch in advance, and so on in a cycle. In the embodiment of the invention, taking the mainstream YOLOv3@416x416 as an example, the SRAM of the buffer is 1024 bits wide and 5408 entries deep, 676 KB in total. FIG. 4 depicts the order in which the elements of one channel's two-dimensional plane of the input feature map are stored in buffer A or B.
(3) Input feature map data reorganization processing unit and 64 FIFO queues
The core task of the data reorganization processing unit is to arrange the data of input feature map buffer A or B into the general matrix-multiplication order required by the systolic array computation, store the rearranged data in the 64 corresponding FIFO queues, and then deliver them, on the clock beats and in the order given by the control signals, to the horizontal input ports of the systolic array, where they finally participate in the convolution computation.
The greatest difficulty in implementing the data reorganization is adapting to the many combinations of convolution kernel size, input feature map size, convolution stride S, Padding rule, current position within the input feature map, and so on. Taking the current position as an example, it must be determined whether the position is at the top-left, top-right, bottom-left, or bottom-right corner of the two-dimensional plane, at the top row, bottom row, left column, or right column, or at an ordinary middle position, because different positions determine whether Padding is needed, whether a line change is needed, whether the next adjacent channel must be entered, and so on. Applying the stride-S movement and Padding processing to FIG. 4 yields FIG. 5.
Compared with the left side of FIG. 4, the left side of FIG. 5 adds the outermost ring of "0" through Padding; with a convolution-kernel two-dimensional size K_x*K_y of 3x3, the data participating in the convolution are acquired as the stride S moves over the plane. The right side of FIG. 5 is the reorganized result to be written into the 64 FIFO queues, each line representing one FIFO, 64 FIFOs in total. The K_x*K_y*C data of the 0th convolution window on the left of the figure (centered on the original datum "0") are stored in the 0th of the 64 FIFO queues, i.e., line 0 on the right of the figure, arranged in NCHW fashion within the K_x*K_y window: first along the W direction of the 0th channel, i.e., the K_x = 3 data "0, 0, 0", then along the H direction, i.e., the next rows "0, 0, 1" and "0, 10, 11", and then the data of the next channel are extracted. The K_x*K_y*C data of the convolution window centered on "1" are stored in the 1st FIFO, i.e., line 1 on the right of the figure, and so on. The last convolution window of row 0 on the left of the figure is centered on "9", and the window after it must change lines, which is why the current position must be judged when the data reorganization is realized.
Because the logical relationship is relatively complex, the data reorganization method of the invention uses the following extraction rule. For the K_x*K_y*C data of each convolution window, with the position of each datum denoted by coordinates (k_x, k_y, c), the storage order in the corresponding FIFO is (0,0,0) -> (1,0,0) -> ... -> (K_x-1,0,0) -> (0,1,0) -> (1,1,0) -> ... -> (K_x-1,1,0) -> ... -> (0,K_y-1,0) -> (1,K_y-1,0) -> ... -> (K_x-1,K_y-1,0) -> (0,0,1) -> ... -> (0,0,C-1) -> ... -> (K_x-1,K_y-1,C-1); that is, the K_x*K_y data of input channel 0 are stored first, then the K_x*K_y data of input channel 1, up to the K_x*K_y data of input channel C-1.
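The extraction rule above reduces to a triple loop with k_x varying fastest; a minimal sketch (function name assumed):

```python
def fifo_extraction_order(kx, ky, c):
    """Coordinate sequence (k_x, k_y, c) in which one convolution window's
    Kx*Ky*C data are written into its FIFO: k_x varies fastest, then k_y,
    then the channel index, matching the rule stated above."""
    return [(x, y, z) for z in range(c) for y in range(ky) for x in range(kx)]
```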
Observing the data on the left and right sides of FIG. 5 yields the following law: with a convolution stride S of 1, and ignoring the special cases in which the convolution window must wrap to the next line or apply Padding after reaching the last position of a row of the input feature map, the data in any one column of the 64-FIFO block in the column direction on the right of FIG. 5 remain contiguous. For example, the data in column 5 (counting from column 0) on the right of FIG. 5 run from "1", "2", up to "65", except that a "0" appears between "9" and "11" in place of "10", because the boundary of the input feature map requires Padding with 0. Apart from these special points, the conclusion is: the 64 data in such a column are 64 contiguously stored data in the input feature map buffer, the preceding-stage submodule, and the same holds for the other columns.
Therefore, in order to efficiently read from the external DDR memory in a continuous batch and simplify the complexity of the entire data reorganization circuit, the byte address label in the buffer of the data in the horizontal direction of the input feature map transmitted to S1 is obtained by formula (1):
Figure BDA0002195945740000081
the initial _ middle _ offset _ pixel represents the byte address label of the central pixel point of the initial convolution window of the current round in the input characteristic diagram buffer, for the most initial starting stage, the value is equal to 0, the initial _ middle _ offset _ pixel of the next round is the offset _ pixel obtained by the previous round of calculation, and the iteration is carried out in a circulating way; p (window _ z, window _ y, window _ x) represents the real-time position of the convolution window in the input profile cube, window _ z represents the value in the width W direction, window _ y represents the value in the height H direction, window _ x represents the value in the channel depth C direction, K represents the value in the channel depth C direction, andx、Kyis the two-dimensional size of the convolution kernel; the calculation result, namely the byte address label offset _ pixel, represents the first address of a continuous section of input characteristic diagram data in the input characteristic diagram buffer, and the offset _ pixel obtained by the calculation of the current round is used as the in of the next rounditerative _ middle _ offset _ pixel, and so on.
For fig. 5, offset_pixel in formula (1) is the byte address label, in the input feature map buffer, of each 8-bit datum of the row-0 FIFO on the right side of fig. 5 (see the right side of fig. 4: for example, for the data "128" in the last column of address 0x0001, the byte address label is 128). It is the head address of that column's 64 data in the input feature map buffer. initial_middle_offset_pixel is the byte address label of the center pixel of the current round's initial convolution window in the input feature map buffer, and it is updated once per 64 moves of step S.
The embodiment of the present invention supports step sizes S = 1 and S = 2. Any 64 consecutive data (S = 1), or any 64 data at a fixed interval of 1 (S = 2), necessarily lie within two consecutive rows on the right side of fig. 4. Hence only the byte address label computed by formula (1) is needed: reading the row containing the current head address and the row immediately after it yields the 64 input feature map data the column requires. For example, if the byte address label of the first datum computed by formula (1) is "130", it suffices to read the row 0x0001 containing "130" and the immediately following row 0x0002 of the input feature map buffer.
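A minimal sketch of this two-row lookup (plain Python, not the patent's circuit; the 128-byte row width follows fig. 4):

```python
def rows_to_read(offset_pixel, row_bytes=128):
    """Return the buffer row holding the head byte address and the
    immediately following row. With S = 1 (64 consecutive bytes) or
    S = 2 (64 bytes at a fixed interval of 1), the required data
    always fit within these two consecutive rows."""
    row = offset_pixel // row_bytes
    return row, row + 1

# Head address "130" lies in row 0x0001, so rows 0x0001 and 0x0002 are read.
```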
Although 64 consecutive data, or 64 data at a fixed interval of 1, necessarily lie in two consecutive rows of the buffer on the right side of fig. 4, their distribution within those rows is arbitrary, giving 128 possibilities in total. Implementing these 128 branches directly as combinational logic would make the data path around the systolic array exceptionally complex, with high signal fan-out, tight timing, and difficult placement and routing, turning it into the bottleneck of the whole systolic array. The present invention therefore adopts two sets of buffers that shift independently: one set always shifts right, the other always shifts left. Taking step size S = 1 as an example: of the two consecutive 128-byte rows on the right side of fig. 4, the row with the smaller address is designated buffer0 and the other buffer1. If the byte address label of the first datum is byte 0 of buffer0, the 64 consecutive bytes are exactly the lower 64 bytes of buffer0; no shift is needed, and the lower 64 bytes of buffer0 are taken directly. If the byte address label of the first datum is byte 1 of buffer0, buffer0 as a whole is shifted right by 1 byte, and the lower 64 bytes of the shifted buffer0 are taken. And so on: when the byte address label of the first datum is byte 65 of buffer0, the last datum falls in the other row, at byte 0 of buffer1; buffer0 is still shifted right as a whole, by 65 bytes, while buffer1 is shifted left as a whole by 127 bytes, and the lower 63 of the lower 64 bytes of the right-shifted buffer0 are spliced with the highest byte of the upper 64 bytes of the left-shifted buffer1 into 64 consecutive data. Extending this reasoning to all 128 cases gives the result that, whatever the case, the 64 bytes of data always come from the fixed lower 64 bytes of buffer0 or the fixed upper 64 bytes of buffer1. In addition, to cover the special cases, a branch that sets the multiplexer MUX to Padding 0 is required. There are thus only three fixed sources in total (detailed as the gray-filled data blocks in fig. 6), which greatly simplifies the data path design. The principle is shown in figs. 6 and 7.
The control signal of the multiplexer MUX can be computed from the byte address label of formula (1).
The right-shifted buffer0 in fig. 6 is the final output of fig. 7. Note that for the first datum, the row-0 FIFO, there are only two possible input sources: the least significant bytes of the right-shift-processed buffer0, or Padding 0; in the other 63 cases there is a third possible source, the left-shift-processed buffer1. In the right-shift processing circuit of fig. 7, RS_64 denotes a right shift of 64 bytes or a bypass; RS_32 denotes a right shift of 32 bytes or a bypass; and RS_16, RS_8, RS_4, RS_2, and RS_1 follow the same principle. All shift stages work in parallel as a pipeline, so after 7 clock beats a result is produced every beat. The operating modes of RS_64, RS_32, ..., RS_1 are controlled by the bypass control signal bypass[6:0], covering all 128 right-shift cases; bypass[6:0] is obtained by taking the byte address label from formula (1) modulo 128. The left-shift processing of buffer1 is similar to the right shift of buffer0 in fig. 7, except that the direction is exactly opposite.
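The seven-stage logarithmic shifter can be sketched behaviorally (a software model of the RS_64 ... RS_1 cascade; the hardware pipelining and 7-beat latency are not modeled, only the shift-or-bypass function of each stage):

```python
def right_shift_pipeline(data, shift):
    """Model of the seven-stage logarithmic right shifter of fig. 7.
    Each stage RS_64 .. RS_1 either shifts by its power-of-two amount
    or bypasses, selected by the corresponding bit of
    bypass[6:0] = shift mod 128."""
    assert len(data) == 128
    bypass = shift % 128
    for bit in range(6, -1, -1):       # stages RS_64 down to RS_1
        amount = 1 << bit
        if bypass & amount:            # stage enabled; otherwise bypass
            data = data[amount:] + [0] * amount
    return data
```

Because each stage is a fixed shift-or-bypass, the control logic per stage is a single bit, which is what keeps the circuit's fan-out low and its Fmax high.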
In summary, so that the data supplied to each row input port of the systolic array originate from fixed locations, for step sizes S = 1 and S = 2 the following circuit processing is implemented:
Two sets of buffers shift independently: one set always shifts right, namely buffer0, the other always shifts left, namely buffer1. Both sets shift with a parallel pipelined structure. The left-shift and right-shift amounts are obtained by a modulo operation on the byte address label from formula (1), and the value of each datum in each FIFO queue is selected, according to that byte address label combined with the current input feature map size, from the following three sources:
(1) the right-shift-processed buffer0;
(2) the left-shift-processed buffer1;
(3) zero filling according to the Padding rule.
Through the parallel pipelined shift processing of buffer0 and buffer1, the invention achieves the following advantages. First, the whole shift circuit works as a parallel pipeline: after 7 clock beats, it produces a required result every beat. Second, the logic of the shift circuit is simple, so the maximum operating clock Fmax can be raised to match the systolic array's running clock. Third, the connection to the data paths of the 64 FIFO queues is simple: the original 128 cases would require 128 input data sources, all in vector bit-width form (128 × 8 × 64 = 64K signal lines), and this massive internal signal interconnect would make placement and routing challenging in an ASIC or FPGA implementation and circuit timing difficult to meet.
After the input feature map data processed by the data reorganization subsystem are written into their respective FIFO queues, they are delivered to the systolic array, under control of the systolic array read signal, as follows:
S2: in the horizontal direction, following the NCHW format of the original data, the input feature map corresponding to the 0th channel is read with the two-dimensional plane size of the 0th channel's convolution kernel as the basic unit, and the operands required for convolution are supplied to the row input ports of the systolic array, row by row from left to right along the movement direction of step S; the input feature map corresponding to the 1st channel is then read with the two-dimensional plane size of the 1st channel's convolution kernel as the basic unit, and the operands required for convolution are supplied likewise; and so on, until the input feature map corresponding to the (N-1)th channel is read with the two-dimensional plane size of the (N-1)th channel's convolution kernel as the basic unit, and the operands required for convolution are supplied to the row input ports of the systolic array, row by row from left to right along the movement direction of step S;
and the data are passed rightward along the row direction all the way to the rightmost side;
S3: in the vertical direction, following the NCHW format, the weight coefficients of channels 0 to N-1 of each convolution kernel are transmitted in turn to the corresponding column input ports of the systolic array and passed downward along the column direction all the way to the bottom;
S4: the input feature map and the corresponding weight coefficients complete the convolution operation in the PE units of the systolic array; the PE units of each row obtain partial-sum results in sequence, i.e., the PEs of the same row obtain partial convolution results simultaneously, and each row of PEs obtains its partial convolution results 1 beat earlier than the row immediately below it;
S5: the m columns output in parallel; for each column, the convolution result for 0 moves of step S is output first, then the result for 1 move of step S, and so on, until the convolution result for n-1 moves of step S is output.
The data flow rules of the three types of data paths of the systolic array, and of the interfaces between the systolic array and those data paths, are shown in fig. 2.
In fig. 2, the data flow of the three types of data channels at the systolic array interface is described taking convolution step S = 1 and convolution kernel size Kx*Ky*C = 3 × 3 × 3 as an example.
In the horizontal direction of the systolic array, 64 row input ports in total receive input feature map data that have been processed with different numbers of step-S moves. Horizontal row 0: step S moved 0 times; beat by beat it receives the 0th channel's a0, a1, a2, a10, a11, a12, a20, a21, a22, then the next channel's b0, b1, b2, ..., and finally the last channel's c0, c1, c2, .... Horizontal row 1: step S moved 1 time; beat by beat it receives the 0th channel's a1, a2, a3, a11, a12, a13, a21, a22, a23, then the next channel's b1, b2, b3, ..., and finally the last channel's c1, c2, c3, .... And so on, until the input feature map data moved 63 times by step S are delivered beat by beat to the row-63 input port. Then, as the clock beats run, the data propagate from left to right until they reach the rightmost column of the systolic array.
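The horizontal feed order can be modeled with a short Python sketch (a hypothetical golden model, not the patent's circuit; element indices such as a10 suggest a feature-map row stride of 10 in the fig. 2 example, which is assumed here):

```python
def horizontal_feed(channels, row_idx, Kx, Ky, W, S=1):
    """Beat-by-beat element labels fed to systolic-array row `row_idx`:
    the Kx*Ky window of each channel in turn, with the window shifted
    row_idx*S positions along the width. W is the assumed row stride
    of the input feature map."""
    seq = []
    base = row_idx * S
    for ch in channels:                 # e.g. ["a", "b", ..., "c"]
        for ky in range(Ky):
            for kx in range(Kx):
                seq.append(f"{ch}{ky * W + base + kx}")
    return seq
```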
At the vertical-direction input of the systolic array, the 64 weight-coefficient input ports of the 64 columns correspond to 64 convolution kernels. The handling is similar to that of the input feature map, except that there is no step-S movement. Taking convolution kernel 0 as an example: beat by beat the port receives the 0th channel's e0, e1, e2, e3, e4, e5, e6, e7, e8, then the next channel's f0, f1, f2, ..., and finally the last channel's g0, g1, g2, .... As the clock beats run, the weights enter the column-0 input port, pass through PE0,0 in row 0, column 0, then on to PE1,0 in row 1, column 0, and so on down to PE63,0 in row 63, column 0 at the bottom of the systolic array. The weight coefficients of the other convolution kernels in columns 1 to 63 are processed the same way.
At the vertical-direction output of the systolic array, 64 columns of convolution results are output. Taking column 0 as an example: PE0,0 in row 0, column 0 first obtains the result psum0-S0, where psum0-S0 denotes the convolution of the input feature map moved 0 times by step S with convolution kernel 0; then PE1,0 in row 1, column 0 obtains psum0-S1, the convolution of the input feature map moved 1 time by step S with convolution kernel 0; and so on, until PE63,0 in row 63, column 0 obtains psum0-S63. Each result is output from the top of its column in a vertically systolic fashion.
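The per-column output order can be summarized with a tiny sketch (labels only; the psumX-SY naming follows fig. 2, where X is the kernel/column index and Y the number of step-S moves):

```python
def column_outputs(col, rows=64):
    """Order in which column `col` of the systolic array emits results:
    PE r,col produces the partial sum of the input window shifted r
    times by step S with kernel `col`, one beat after row r-1, and
    each result leaves from the top of the column."""
    return [f"psum{col}-S{r}" for r in range(rows)]
```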
The data reorganization method of the embodiment has the following beneficial effects:
(1) throughout the data reorganization process, the data continuity and massive data reuse exhibited as the convolution window moves by step S over the two-dimensional plane of the input feature map are kept consistent with the original NCHW data format. This saves the software and hardware overhead of data format conversion; and because the data are continuous and massively reused, the advantages of DMA block access can be exploited, reducing the number of off-chip DRAM accesses, improving data access efficiency, and lowering power consumption;
(2) the input feature map data processed with 64 consecutive step-S moves are delivered to the corresponding horizontal-direction inputs of the systolic array; because these 64 input data are consecutive, only their head address needs to be found, and it can be computed by formula (1), which simplifies the data reorganization logic;
(3) the parallel pipelined right-shift and left-shift circuits produce a required result every beat; the shift-circuit logic is simple, so the maximum operating clock Fmax of the data reorganization circuit outside the systolic array can be raised to match the systolic array's internal running clock. The data path is realized with only two simple fixed data input sources plus one fixed Padding 0, avoiding the massive signal interconnect of the 128-branch case and the placement-and-routing difficulty and tight circuit timing it would cause.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and although the invention has been described in detail with reference to the foregoing examples, it will be apparent to those skilled in the art that various changes in the form and details of the embodiments may be made and equivalents may be substituted for elements thereof. All modifications, equivalents and the like which come within the spirit and principle of the invention are intended to be included within the scope of the invention.

Claims (3)

1. A data reorganization method for a systolic array structure, the systolic array having a size of n rows × m columns, the input feature map having N channels, and the convolution kernels likewise having N channels, characterized in that the method comprises the following steps:
S1: first, the input feature map is read from the off-chip DDR memory into a buffer in the original-data NCHW format;
S2: in the horizontal direction, following the NCHW format of the original data, input feature map data corresponding to the 0th channel are read with the two-dimensional plane size of the 0th channel's convolution kernel as the basic unit, and the input feature map data required for convolution are supplied to the row input ports of the systolic array, row by row from left to right along the movement direction of step S; input feature map data corresponding to the 1st channel are then read with the two-dimensional plane size of the 1st channel's convolution kernel as the basic unit and supplied likewise; and so on, until input feature map data corresponding to the (N-1)th channel are read with the two-dimensional plane size of the (N-1)th channel's convolution kernel as the basic unit, and the input feature map data required for convolution are supplied to the row input ports of the systolic array, row by row from left to right along the movement direction of step S;
and the data are passed rightward along the row direction all the way to the rightmost side.
S3: in the vertical direction, following the NCHW format, the weight coefficients of channels 0 to N-1 of each convolution kernel are transmitted in turn to the corresponding column input ports of the systolic array and passed downward along the column direction all the way to the bottom;
S4: the input feature map and the corresponding weight coefficients complete the convolution operation in the PE units of the systolic array; the PE units of each row obtain partial-sum results in sequence, i.e., the PEs of the same row obtain their respective partial convolution results simultaneously, and each row of PEs obtains its partial convolution results 1 beat earlier than the row immediately below it;
S5: the m columns output in parallel; for each column, the convolution result for 0 moves of step S is output first, then the result for 1 move of step S, and so on, until the convolution result for n-1 moves of step S is output.
2. The data reorganization method for a systolic array structure according to claim 1, characterized in that the byte address label, within the input feature map buffer, of the horizontal-direction input feature map data delivered in S1 is obtained by formula (1):
Here initial_middle_offset_pixel denotes the byte address label, in the input feature map buffer, of the center pixel of the current round's initial convolution window; at the very first starting stage its value is 0, and in each subsequent round it takes the offset_pixel computed in the previous round, iterating cyclically. P(window_z, window_y, window_x) denotes the real-time position of the convolution window in the input feature map cube, where window_z is the coordinate in the width W direction, window_y the coordinate in the height H direction, and window_x the coordinate in the channel depth C direction; Kx and Ky are the two-dimensional size of the convolution kernel. The result of formula (1), the byte address label offset_pixel, is the head address of a consecutive segment of input feature map data in the input feature map buffer; the offset_pixel computed in the current round serves as the initial_middle_offset_pixel of the next round, iterating cyclically.
3. The method of claim 2, characterized in that, when the step size S is 1 or the step size S is 2, the data sent to each row input port of the systolic array are derived from three fixed positions through the following circuit processing, simplifying the complexity of the data path implementation:
two sets of buffers are used for shifting independently, one set always shifting right, namely buffer0, and the other always shifting left, namely buffer1, both in a parallel pipelined structure; the required left-shift and right-shift amounts are determined by a modulo operation on the byte address label obtained by formula (1); and finally the value of each beat of input feature map data stored in each FIFO queue is selected, according to the byte address label obtained by formula (1) combined with the current input feature map size, from the following three sources:
(1) shift right the processed buffer 0;
(2) shift left processed buffer 1;
(3) zero Padding is performed according to Padding rules.
CN201910857692.4A 2019-09-09 2019-09-09 Data recombination method for pulse array structure Withdrawn CN110674927A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910857692.4A CN110674927A (en) 2019-09-09 2019-09-09 Data recombination method for pulse array structure

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910857692.4A CN110674927A (en) 2019-09-09 2019-09-09 Data recombination method for pulse array structure

Publications (1)

Publication Number Publication Date
CN110674927A true CN110674927A (en) 2020-01-10

Family

ID=69077802

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910857692.4A Withdrawn CN110674927A (en) 2019-09-09 2019-09-09 Data recombination method for pulse array structure

Country Status (1)

Country Link
CN (1) CN110674927A (en)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111506343A (en) * 2020-03-05 2020-08-07 北京大学深圳研究生院 Deep learning convolution operation implementation method based on pulse array hardware architecture
CN111506344A (en) * 2020-03-05 2020-08-07 北京大学深圳研究生院 Deep learning hardware system based on systolic array architecture
CN111506344B (en) * 2020-03-05 2023-07-07 北京大学深圳研究生院 Deep learning hardware system based on systolic array architecture
CN111506343B (en) * 2020-03-05 2023-07-07 北京大学深圳研究生院 Deep learning convolution operation implementation method based on systolic array hardware architecture
CN111427537B (en) * 2020-03-17 2023-06-30 云南大学 Pulse array parallel ordering method and device based on FPGA
CN111427537A (en) * 2020-03-17 2020-07-17 云南大学 FPGA-based pulse array parallel sorting method and device
EP4156079A4 (en) * 2020-05-22 2024-03-27 Inspur Electronic Information Industry Co Ltd Image data storage method, image data processing method and system, and related apparatus
WO2021232843A1 (en) * 2020-05-22 2021-11-25 浪潮电子信息产业股份有限公司 Image data storage method, image data processing method and system, and related apparatus
CN111652360A (en) * 2020-05-25 2020-09-11 北京大学深圳研究生院 Convolution operation device based on pulsation array
CN111652360B (en) * 2020-05-25 2023-03-14 北京大学深圳研究生院 Convolution operation device based on pulsation array
CN111897579B (en) * 2020-08-18 2024-01-30 腾讯科技(深圳)有限公司 Image data processing method, device, computer equipment and storage medium
CN111897579A (en) * 2020-08-18 2020-11-06 腾讯科技(深圳)有限公司 Image data processing method, image data processing device, computer equipment and storage medium
CN114489496A (en) * 2022-01-14 2022-05-13 南京邮电大学 Data storage and transmission method based on FPGA artificial intelligence accelerator
CN116108902A (en) * 2023-02-22 2023-05-12 成都登临科技有限公司 Sampling operation implementation system, method, electronic device and storage medium
CN116108902B (en) * 2023-02-22 2024-01-05 成都登临科技有限公司 Sampling operation implementation system, method, electronic device and storage medium

Similar Documents

Publication Publication Date Title
CN110674927A (en) Data recombination method for pulse array structure
CN108805266B (en) Reconfigurable CNN high-concurrency convolution accelerator
CN111897579B (en) Image data processing method, device, computer equipment and storage medium
CN107229967B (en) Hardware accelerator and method for realizing sparse GRU neural network based on FPGA
CN107704916B (en) Hardware accelerator and method for realizing RNN neural network based on FPGA
CN108108809B (en) Hardware architecture for reasoning and accelerating convolutional neural network and working method thereof
US7574466B2 (en) Method for finding global extrema of a set of shorts distributed across an array of parallel processing elements
JP2744526B2 (en) Quasi-hexadecimal processor and method
CN110738308B (en) Neural network accelerator
CN110851779B (en) Systolic array architecture for sparse matrix operations
CN109144469B (en) Pipeline structure neural network matrix operation architecture and method
CN110796236B (en) Vectorization implementation method for pooling of multi-sample multi-channel convolutional neural network
WO2022110386A1 (en) Data processing method and artificial intelligence processor
CN110766128A (en) Convolution calculation unit, calculation method and neural network calculation platform
Liu et al. WinoCNN: Kernel sharing Winograd systolic array for efficient convolutional neural network acceleration on FPGAs
CN111506343A (en) Deep learning convolution operation implementation method based on pulse array hardware architecture
CN112950656A (en) Block convolution method for pre-reading data according to channel based on FPGA platform
Que et al. Recurrent neural networks with column-wise matrix–vector multiplication on FPGAs
CN110688616B (en) Convolution module of stripe array based on ping-pong RAM and operation method thereof
CN115238863A (en) Hardware acceleration method, system and application of convolutional neural network convolutional layer
Shrivastava et al. A survey of hardware architectures for generative adversarial networks
CN110766136B (en) Compression method of sparse matrix and vector
CN112561943B (en) Image processing method based on data multiplexing of pulse array convolution operation
CN112862079B (en) Design method of running water type convolution computing architecture and residual error network acceleration system
CN113988280B (en) Array computing accelerator architecture based on binarization neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (application publication date: 20200110)