CN116911365A - Data flow path device and method suitable for memory and calculation integrated array - Google Patents

Data flow path device and method suitable for memory and calculation integrated array

Info

Publication number
CN116911365A
CN116911365A (application CN202310744336.8A)
Authority
CN
China
Prior art keywords
data
memory
data flow
calculation
pooling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310744336.8A
Other languages
Chinese (zh)
Inventor
潘红兵
林雨生
傅高鸣
王宇宣
彭成磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN202310744336.8A priority Critical patent/CN116911365A/en
Publication of CN116911365A publication Critical patent/CN116911365A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 - Digital computers in general; Data processing equipment in general
    • G06F15/76 - Architectures of general purpose stored program computers
    • G06F15/78 - Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807 - System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F15/7821 - Tightly coupled to memory, e.g. computational memory, smart memory, processor in memory
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 - Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 - Interfaces specially adapted for storage systems
    • G06F3/0602 - Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061 - Improving I/O performance
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 - Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 - Interfaces specially adapted for storage systems
    • G06F3/0628 - Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638 - Organizing or formatting or addressing of data
    • G06F3/0644 - Management of space entities, e.g. partitions, extents, pools
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 - Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 - Interfaces specially adapted for storage systems
    • G06F3/0628 - Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0655 - Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
    • G06F3/0656 - Data buffering arrangements
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/0464 - Convolutional networks [CNN, ConvNet]
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Computer Hardware Design (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Neurology (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Complex Calculations (AREA)

Abstract

The invention provides a data flow path device suitable for a memory-calculation integrated (compute-in-memory) array, and a method thereof. The device comprises: a global buffer, used for buffering intermediate results of the reasoning process and transmitting data that is ready for the next step to the waiting-area register set; a waiting-area register set, used for storing input data and driving it into the memory-calculation integrated array for calculation; a memory-calculation integrated array, used for realizing the convolution operations and fully-connected-layer calculations of the neural network reasoning process according to pre-mapped weight information, and for transmitting the obtained intermediate results to the single-instruction multiple-data-stream module; and a single-instruction multiple-data-stream module, used for realizing the pooling and activation-function operations of the neural network reasoning process and transmitting the results to the global buffer. The data flow path device and method simplify the storage and scheduling of data and improve the reusability of circuit modules.

Description

Data flow path device and method suitable for memory and calculation integrated array
Technical Field
The invention relates to a data flow path device and a method thereof suitable for a memory and calculation integrated array, belonging to the field of integrated circuits.
Background
Deep neural networks have received extensive attention in academia and industry owing to their excellent performance in artificial-intelligence applications such as image recognition. Most existing neural network implementations use application-specific integrated circuits to carry out the computation, and among these, memory-calculation integrated (compute-in-memory) devices make a particularly outstanding contribution. Such devices feature little data movement, low power consumption and similar advantages, and are widely applied in the field of neural network inference.
In the neural network computation process, the configuration of data streams is an important link in deploying a neural network on a chip. Because the weights in a memory-calculation integrated array cannot be rewritten, they remain fixed during a single inference pass, which makes data stream configuration and weight mapping considerably more difficult; in addition, since the array is built from memory-calculation integrated devices, the input, output and storage of data streams cannot be designed in the traditional way.
Disclosure of Invention
In order to improve hardware friendliness and device reusability, the invention provides a data flow path device suitable for a memory-calculation integrated array, and a method thereof.
The technical scheme of the device is as follows:
a data flow path apparatus adapted for use in a memory array, the apparatus comprising:
the global buffer zone is used for buffering the intermediate result of the reasoning process and transmitting the ready data of the next step to the register group of the waiting zone;
the waiting area register set is used for storing and driving input data to be calculated by entering the memory and calculation integrated array;
the memory-calculation integrated array is used for realizing convolution operation and full-connection layer calculation of the neural network reasoning process according to the pre-mapped weight information, and transmitting the obtained calculation intermediate result to the single-instruction multi-data stream module;
and the single-instruction multi-data flow module is used for realizing the functions of pooling operation and function activation in the neural network reasoning process and transmitting the result to the global buffer area.
Further, the global buffer area comprises two static random access memories forming a ping-pong structure, an input selection unit and an output selection unit, wherein the input selection unit is used for transmitting external data into the static random access memories, and the output selection unit is used for transmitting the data in the static random access memories into the register group of the waiting area; one static random access memory is used for storing input data of the current step of the neural network reasoning process, and the other memory is used for storing data of the subsequent step, namely backup data.
Further, both memories in the global buffer are connected to an external general interface, the waiting-area register set and the single-instruction multiple-data-stream module.
Further, the single-instruction multiple-data-stream module comprises a pooling module for performing three-step pooling on the to-be-pooled results received from the memory-calculation integrated array.
The invention also provides an operation method of the data flow path device suitable for the memory-calculation integrated array, which comprises the following specific steps (a schematic sketch of the resulting control loop is given after step (6)):
(1) Mapping the trained weights of the neural network, comprising convolution-layer weights and fully-connected-layer weights, into the memory-calculation integrated array;
(2) Inputting the initial feature map into the global buffer through the general interface, where the input selection unit stores the data stream into the first memory; the input selection unit then switches the route of the data flow path, so that the data stream of the next buffer period is stored into the second memory when it arrives; in the same way, the input selection unit alternately directs the data streams of subsequent buffer periods to the first and the second memory;
(3) The output selection unit fetches the data stream from whichever memory currently holds it and transmits it to the waiting-area register set; the control module then drives the data stream into the memory-calculation integrated array, where it is computed against the mapped weights, and the generated intermediate result is transmitted to the single-instruction multiple-data-stream module;
(4) Determining, according to the configuration information of the neural network, whether a pooling or activation operation is needed: if the next step is a pooling operation, the data stream is configured to enter the pooling module for the three-step pooling operation; if the next step is an activation operation, the data stream is configured to enter the activation unit to be activated; if the next step is any other operation, the data stream is left unprocessed;
(5) Transmitting the result obtained in step (4) to the memory selected by the input selection unit in the global buffer, and repeating steps (2) to (4) until the reasoning process of the neural network is completed;
(6) Outputting the calculation result from the global buffer to the outside of the system through the general interface.
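The control flow of steps (2) to (6) can be summarized by the following minimal Python sketch. It is an illustration only: the function and field names (cim_mac, next_op, three_step_pool and so on) are assumptions for readability, not the actual interfaces of the device.

```python
import numpy as np

def relu(x):                                   # activation unit
    return np.maximum(x, 0)

def three_step_pool(x):                        # stand-in; see the pooling sketch in the embodiment
    return x.reshape(-1, 2).max(axis=1)

def cim_mac(weights, data):                    # stand-in for the CIM-array multiply-accumulate
    return data @ weights

def run_inference(layers, feature_map):
    sram = [None, None]                        # the two ping-pong SRAMs of the global buffer
    write_sel, read_sel = 0, 1                 # input / output selection units
    sram[write_sel] = feature_map              # step (2): initial feature map enters the first memory
    for layer in layers:
        write_sel, read_sel = read_sel, write_sel   # switch the data flow path routes
        standby_regs = sram[read_sel]               # step (3): data to the waiting-area registers
        partial = cim_mac(layer["weights"], standby_regs)
        if layer["next_op"] == "pool":              # step (4): SIMD post-processing
            result = three_step_pool(partial)
        elif layer["next_op"] == "activation":
            result = relu(partial)
        else:
            result = partial
        sram[write_sel] = result                    # step (5): write back to the selected memory
    return sram[write_sel]                          # step (6): output through the general interface
```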
Further, in step (1), if the weight currently to be mapped is a convolution-layer weight and the weight block is small, the weight block is replicated multiple times by diagonal filling, wherein the copy multiple takes as its reference the number of steps the convolution sliding window needs to sweep an entire row or an entire column, and is made as large as possible without exceeding that reference.
Further, in step (1), if the weight currently to be mapped is a convolution-layer weight, the weight block is small, and the step following the current convolution operation is a pooling operation, then during weight mapping the weights of the same channel in adjacent replicated weight blocks are remapped to adjacent columns of the memory-calculation integrated array.
Further, in the step (4), the three-step pooling operation specifically includes:
1) Executing a pooling operation on every two adjacent data in the first line of data to be pooled, and temporarily storing the result in a built-in buffer;
2) Waiting for the arrival of the second line of data to be pooled, corresponding to the data in step 1), and then executing a pooling operation on every two adjacent data;
3) Carrying out a third pooling operation on the results obtained in steps 1) and 2) to obtain the final pooling result.
The invention has the following beneficial effects:
(1) The data flow path device provided by the invention can be applied to a memory-calculation integrated array and to systems with a similar compute-in-memory structure, and therefore has a certain generality.
(2) By using the memory with ping-pong operation to realize the storage and scheduling of global data, the efficiency of data stream transmission is improved.
(3) Mapping the weight blocks with a bounded number of copies preserves the local integrity of the data stream, simplifies the data-stream merging step, and reduces the computation time of small neural networks.
(4) The weight blocks are remapped according to the channel adjacency principle, so that the output channels of the pooled operands are adjacent, and the pooling operation addressing difficulty is reduced.
(5) By the three-step pooling method, pooling operation is completed by fully utilizing the pooling module, and the multiplexing degree of the pooling module is improved.
Drawings
Fig. 1 is a general architecture diagram of a data flow path device of the present invention.
Fig. 2 is an illustration of input data size and network configuration information for various layers of a LeNet neural network.
FIG. 3 is a schematic diagram of weight mapping in the memory-calculation integrated array, in which a convolution-layer weight block is replicated a certain number of times.
Fig. 4 is the weight distribution diagram of LeNet convolutional layer 1 on the memory-calculation integrated array.
Fig. 5 is the weight distribution diagram of LeNet convolutional layer 2 on the memory-calculation integrated array.
Fig. 6 is a rule diagram of weight mapping, wherein the left graph is an original weight mapping graph, and the right graph is a new weight mapping graph.
Fig. 7 illustrates the input data of LeNet convolutional layer 1 at the abstract level (left), the way it is stored in the SRAM of the global buffer (middle), and the way it is stored in the waiting-area register set (right).
FIG. 8 is a specific flow diagram of a three-step pooling operation performed by a single instruction multiple data stream module.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings.
The overall structure of the data flow path device suitable for the memory-calculation integrated array is shown in Fig. 1 and comprises: the Global Buffer, used for buffering intermediate results of the reasoning process and transmitting data that is ready for the next step into the waiting-area register set, where these data comprise the input excitation of the first network layer received from an external module and the intermediate results of the other network layers produced by the single-instruction multiple-data-stream module; the waiting-area register set StandBy Regs, used for storing input data and driving it into the memory-calculation integrated array for convolution or fully-connected-layer calculation; the memory-calculation integrated array CIM Array, used for realizing the convolution operations and fully-connected-layer calculations of the neural network reasoning process according to weight information mapped in advance, and for transmitting the obtained intermediate results to the single-instruction multiple-data-stream module; the analog-to-digital converter ADC, used for converting the analog values output by the memory-calculation integrated array into the digital values required by subsequent circuits; and the single-instruction multiple-data-stream module SIMD, used for realizing the pooling and activation-function operations of the reasoning process and transmitting the results to the global buffer. The Interface in Fig. 1 refers to the general interface through which the device interacts with external circuits.
The global buffer comprises two static random access memories (SRAM) forming a ping-pong structure, an input selection unit and an output selection unit; the input selection unit transmits external data into the SRAMs, and the output selection unit transmits data from the SRAMs to the waiting-area register set. Each SRAM uses a 64-bit word as the unit of transmission and storage; intermediate results of the neural network reasoning process are stored channel by channel, with any remainder smaller than one data unit still occupying a full unit, and the bits of a single network data point are stored contiguously.
The two SRAMs are connected to the external general Interface, the waiting-area register set and the single-instruction multiple-data-stream module, so that the input and output directions can be configured flexibly. If one of the memories is already used to store the input data of the current step of the neural network reasoning process, the other memory is used to store the data of the subsequent step, i.e. the backup data. Specifically, in the first buffering period the input data stream is buffered into memory 1; in the second buffering period the input data stream is buffered into memory 2 through switching of the input selection unit, while the data of memory 1 are transmitted to the waiting-area register set through switching of the output selection unit; in the third buffering period the input data stream is buffered into memory 1 by switching of the input selection unit, and the data of memory 2 are transferred to the waiting-area register set by switching of the output selection unit, and the cycle repeats.
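The alternation described above amounts to a simple modulo-2 selection per buffer period; the following toy Python sketch (with illustrative names, not the device's actual control signals) shows which memory each selection unit addresses in successive periods.

```python
def select_memories(period):
    """Return (memory written by the input selection unit, memory drained by the
    output selection unit) for a 1-based buffer period; memories are numbered 1 and 2."""
    write_mem = 1 if period % 2 == 1 else 2
    read_mem = 2 if write_mem == 1 else 1
    return write_mem, read_mem

for period in range(1, 4):
    w, r = select_memories(period)
    note = " (nothing to drain yet)" if period == 1 else ""
    print(f"period {period}: input stream -> memory {w}, "
          f"memory {r} -> waiting-area register set{note}")
```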
The waiting-area register set mainly stores input data that is about to enter the memory-calculation integrated array for convolution or fully-connected-layer calculation, and drives that data into the array. The register set is 8 bits wide, and its depth equals the number of input channels of the memory-calculation integrated array, matching the array dimension. When driving data into the array, the waiting-area register set first inputs the first bit of all data into the array for computation; after that computation finishes, it drives the second bit of all data into the array; this repeats until every bit of all data in the waiting-area register set has been transmitted.
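A minimal sketch of this bit-serial driving, assuming the 8-bit inputs are held as unsigned integers; whether the "first bit" is the least or most significant bit is not specified in the text, so the LSB-first order below is an assumption.

```python
import numpy as np

def drive_bit_serial(standby_regs, n_bits=8):
    """Yield one bit-plane per computation cycle: the k-th bit of every register value.

    standby_regs: one unsigned integer per row of the CIM array (illustrative layout).
    """
    regs = np.asarray(standby_regs, dtype=np.uint32)
    for k in range(n_bits):            # first bit of all data, then second bit, and so on
        yield (regs >> k) & 1          # assumed LSB-first order

# Example with three 8-bit values
for plane in drive_bit_serial([0b10110001, 0b00001111, 0b11111111]):
    print(plane)
```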
The memory-calculation integrated array computes the convolution layers and fully-connected layers of the neural network reasoning process. The weights are mapped into the array in advance; the array then computes the mapped weights against the input data to obtain results, converting between analog and digital signals where necessary. Each convolution kernel is unfolded into a plane, and the kernels of the different channels are stacked and then mapped into the memory-calculation integrated array. If enough idle devices remain in the array, the weight block is copied by diagonal filling and used to compute the result of the next convolution region. Diagonal filling means mapping the copied weight block multiple times, each copy offset in the input direction (row direction) according to how much of the input data stream is reused between consecutive convolution steps of the layer, and offset in the output direction (column direction) so that each copy occupies its own output columns. The copy multiple takes as its reference the number of steps the convolution sliding window needs to sweep a whole row or column, and is made as large as possible without exceeding that reference. In addition, when weight blocks are replicated, every two adjacent weight blocks are remapped: the weights of the same channel of the two blocks are mapped to adjacent output channels, i.e. the two weight blocks are cross-mapped.
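The replication rule can be sketched as follows. This is an illustration under assumptions: the weight block is taken as a matrix of flattened-kernel rows by output-channel columns, and the per-copy row offset (the input values not reused between adjacent steps) is my reading of the diagonal-filling description.

```python
import numpy as np

def diagonal_fill(kernel_block, in_size, kernel, stride, in_ch, array_rows, array_cols):
    """Map a flattened convolution weight block into the array with diagonal filling.

    kernel_block: (kernel*kernel*in_ch) x out_ch matrix of flattened kernels.
    Each copy is shifted down by the number of new input values per sliding step
    and right by one block of output channels.
    """
    rows, out_ch = kernel_block.shape
    steps = (in_size - kernel) // stride + 1        # reference for the copy multiple
    row_shift = kernel * stride * in_ch             # input values not reused between steps
    max_by_rows = (array_rows - rows) // row_shift + 1
    max_by_cols = array_cols // out_ch
    copies = min(steps, max_by_rows, max_by_cols)   # as large as possible, never above the reference
    mapped = np.zeros((array_rows, array_cols), dtype=kernel_block.dtype)
    for i in range(copies):
        r, c = i * row_shift, i * out_ch
        mapped[r:r + rows, c:c + out_ch] = kernel_block
    return mapped, copies

# LeNet convolution layer 1: 25x6 block, 32x32x1 input, 5x5 kernel -> 28 copies
_, n = diagonal_fill(np.ones((25, 6)), in_size=32, kernel=5, stride=1,
                     in_ch=1, array_rows=1152, array_cols=256)
print(n)   # 28
```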
The single-instruction multiple-data-stream module implements the pooling operation Pool and the activation function Activation (for example the ReLU activation function) of the reasoning process, receives the fully-connected-layer result FC, and transmits the pooling result Pool_Res, the activation result Acti_Res or the fully-connected-layer result FC_Res to the global buffer. The pooling operation is completed by the pooling module, which supports pooling and accumulation both between data of a single channel at different moments and between adjacent channels at the same moment. With the weight mapping scheme of the memory-calculation integrated array matched to it, the pooling operation is first performed directly on every two adjacent values of the first row of data to be pooled; then, once the second row of data to be pooled has been obtained, the same operation is performed on it; finally, the intermediate results of the previous two steps are matched position by position and pooled once more, completing all the pooling operations for the two rows of data. In addition, the activation-function module consists of several activation units, and the data to be activated complete the activation-function operation after passing through an activation unit. After the data have been processed, the single-instruction multiple-data-stream module transmits them to the global buffer to await further processing.
Examples
This embodiment uses the data flow structure shown in Fig. 1. The two ping-pong SRAMs of the global buffer use 64 bits as the addressing unit and each has a size of 2 KB; the waiting-area register set has a size of 1152 × 8 bit; the memory-calculation integrated array has a size of 1152 × 256, four arrays form a Tile, and the four arrays share one group of ADCs. During inference the memory-calculation integrated array computes only one network layer at a time, while the weights of the entire network are mapped onto the array beforehand. The reasoning process of the LeNet neural network is carried out with this data flow structure and memory-calculation integrated array. The LeNet architecture is shown in Fig. 2; it includes 2 convolutional layers, 2 pooling layers, 3 fully-connected layers and 2 ReLU activation-function operations.
The weights of the LeNet network are first mapped into the memory-calculation integrated array, as shown in Fig. 3. The input of convolution layer 1 is 32×32×1, the convolution kernel is 5×5 and the stride is 1, so the kernel needs 28 steps to sweep one column. The weights of convolution layer 1 are therefore copied 28 times before being mapped into the memory-calculation integrated array, so that a whole column of convolution results is obtained in a single computation. Similarly, the weights of convolution layer 2 need to be replicated into 10 copies. Fig. 4 shows the weight distribution of convolution layer 1: while the kernel sweeps down one column, every two adjacent convolution operations share the same 5×(5−1)=20 input values, so 20 longitudinally-input data can be multiplexed between every two adjacent weight blocks. A similar multiplexing relationship exists in convolution layer 2, whose weight distribution is shown in Fig. 5.
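The step counts and the reuse amount quoted above follow from the usual convolution-output arithmetic; a small Python check (the 14×14 input of convolution layer 2 after pooling is the standard LeNet shape assumed here):

```python
def sweep_steps(in_size, kernel, stride=1):
    # number of sliding-window positions down one column
    return (in_size - kernel) // stride + 1

def reused_values(kernel, stride=1):
    # input values shared by two adjacent sliding-window positions (per channel)
    return kernel * (kernel - stride)

print(sweep_steps(32, 5))    # conv layer 1: 28 steps -> weight block copied 28 times
print(sweep_steps(14, 5))    # conv layer 2 (assumed 14x14 input after pooling): 10 copies
print(reused_values(5))      # 5 x (5 - 1) = 20 values shared between adjacent steps
```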
Fig. 6 illustrates the weight-block remapping rule. Weight block 1 and weight block 2 are a pair of adjacent weight blocks of a convolution layer; if the step following the convolution operation is a pooling operation, the data to be pooled are the outputs of the same channel in the two weight blocks. With the traditional weight mapping of the left diagram of Fig. 6, pooling would require several complicated addressing operations; with the improved weight mapping of the right diagram of Fig. 6, in which the same channels of weight block 1 and weight block 2 are placed at adjacent positions in the memory-calculation integrated array, the result is obtained simply by pooling each pair of adjacent data among the output channels. To simplify the structure of the pooling module, the weights are therefore cross-mapped in the weight mapping stage.
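A toy sketch of this cross mapping, assuming each weight block is stored as a matrix of flattened-kernel rows by output-channel columns (an assumed layout, not the exact array format):

```python
import numpy as np

def cross_map(block1, block2):
    """Interleave the output columns of two adjacent weight blocks channel by channel,
    so that the two to-be-pooled outputs of every channel land in adjacent CIM columns."""
    assert block1.shape == block2.shape
    rows, ch = block1.shape
    remapped = np.empty((rows, 2 * ch), dtype=block1.dtype)
    remapped[:, 0::2] = block1      # channel n of block 1 -> column 2n
    remapped[:, 1::2] = block2      # channel n of block 2 -> column 2n + 1
    return remapped

# Two 25x6 blocks become one 25x12 block with same-channel columns adjacent
print(cross_map(np.zeros((25, 6)), np.ones((25, 6))).shape)   # (25, 12)
```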
After the weight mapping is completed, data are input and the convolution-layer calculation is performed. An input feature map of size 32×32×1 is transmitted to the global buffer through the interface between the system and the external circuit; the data width is 8 bits. The first 5 columns of the input feature map are the data needed by the memory-calculation integrated array for the first convolution operation, as shown in the left diagram of Fig. 7. During the whole transmission process:
(1) The input selection unit of the global buffer selects memory 1 to store the input feature map column by column. Memory 1 stores 64-bit words; each word holds 8 values of the input feature map, so 4 words hold 1 column of the feature map. Thus, as shown in the middle diagram of Fig. 7, 20 words of memory hold the 5 columns of the input feature map;
(2) A single storage location of the waiting-area register set is 8 bits, and each addressing operation fetches 64 bits, i.e. 1 word of the global-buffer SRAM, corresponding to 8 register locations. One column of the input feature map occupies 32 registers, and the 5 columns required for the convolution operation occupy 32×5=160 registers, as shown in the right diagram of Fig. 7 (the arithmetic is sketched below).
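A short arithmetic check of the word and register counts above (plain Python, illustrative only):

```python
# Storage bookkeeping for the 32x32x1 input feature map
data_width    = 8                               # bits per input value
word_width    = 64                              # SRAM addressing unit
vals_per_word = word_width // data_width        # 8 values per 64-bit word
rows_per_col  = 32                              # one column of the feature map
words_per_col = rows_per_col // vals_per_word   # 4 words per feature-map column
cols_needed   = 5                               # first 5 columns feed the first convolution

print(words_per_col * cols_needed)              # 20 words in the global-buffer SRAM
print(rows_per_col * cols_needed)               # 160 waiting-area registers (8 bits each)
```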
The waiting-area register set then inputs the corresponding data into the memory-calculation integrated array. In the array the data are multiplied by the mapped weights, accumulated along the output channels, subjected to the necessary analog-to-digital conversion and output to the single-instruction multiple-data-stream module. Each computation step of convolution layer 1 produces 6×28=168 results, i.e. one column of convolution results supplied to the single-instruction multiple-data-stream module. As indicated in Fig. 2, convolution layer 1 is followed by the ReLU activation function and pooling layer 1, so the single-instruction multiple-data-stream module first passes all results through the activation module and then performs the maximum pooling operation.
FIG. 8 illustrates the specific flow of the three-step pooling operation performed by the single-instruction multiple-data-stream module. The pooling module first receives the 168 data of the 1stLine row in Fig. 8; from left to right, every 12 data form a unit representing the results of two adjacent convolution steps, giving 14 data units in total, and within each unit the results of the earlier and the later convolution step of channels 1-6 are arranged from left to right, with the two results of each channel adjacent thanks to the cross mapping. The first pooling operation therefore takes place between every two adjacent data of 1stLine, and its intermediate result is the data vector of the 1stPool row in Fig. 8, in which each datum is denoted m-Cn, where m is the m-th data unit and Cn is the n-th channel. The 84 data of 1stPool thus represent 14 pooled values for each of the 6 channels; they are stored in the buffer of the pooling module to await the third pooling operation.
The second pooling step is the same as the first, converting the 2ndLine input data (the second row of data to be pooled) into the 2ndPool data vector. A third pooling step is then carried out on 1stPool and 2ndPool, matched by data unit and by channel, to obtain the final result PoolRes for the two rows of data to be pooled. Since the data are 8 bits wide, the above activation and pooling operations are repeated 8 times, after which the result is written into the corresponding SRAM of the global buffer by step and by channel. These steps are repeated until all the data shown in Fig. 7 have been input and computed, at which point the system has completed the calculation of convolution layer 1 and pooling layer 1. The steps for convolution layer 2 and pooling layer 2 are the same and are not repeated here. It should be noted, however, that after the single-instruction multiple-data-stream module receives the results of a fully-connected layer, it simply forwards the data to the global buffer without any additional computation. Once the calculation of fully-connected layer 3 is finished, the global buffer transmits the final result to the external circuit through the interface, completing a full LeNet inference.
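A minimal Python sketch of this three-step pooling on the numbers used above (168 values per row, 14 units of 12 values, 6 channels); the channel-adjacent layout and the use of max pooling are assumptions consistent with the description:

```python
import numpy as np

CH, UNITS = 6, 14                        # channels / data units per output row of conv layer 1

def pool_adjacent(line):
    """Steps 1 and 2: max-pool every two adjacent values of one 168-value row -> 84 values.
    Thanks to the cross mapping, the two operands of each channel sit next to each other."""
    return line.reshape(UNITS, CH, 2).max(axis=2)          # shape (14, 6): the m-Cn layout

def three_step_pool(line1, line2):
    pool1 = pool_adjacent(line1)         # 1stPool, buffered inside the pooling module
    pool2 = pool_adjacent(line2)         # 2ndPool
    return np.maximum(pool1, pool2)      # step 3: final PoolRes for the two rows

# Toy usage with random 8-bit data standing in for two rows of convolution results
rng = np.random.default_rng(0)
res = three_step_pool(rng.integers(0, 256, 168), rng.integers(0, 256, 168))
print(res.shape)                         # (14, 6): 14 pooled positions x 6 channels
```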

Claims (8)

1. A data flow path apparatus adapted for use with a memory-calculation integrated array, the apparatus comprising:
a global buffer, used for buffering intermediate results of the reasoning process and transmitting data that is ready for the next step to the waiting-area register set;
a waiting-area register set, used for storing input data and driving it into the memory-calculation integrated array for calculation;
a memory-calculation integrated array, used for realizing the convolution operations and fully-connected-layer calculations of the neural network reasoning process according to pre-mapped weight information, and for transmitting the obtained intermediate results to the single-instruction multiple-data-stream module;
and a single-instruction multiple-data-stream module, used for realizing the pooling and activation-function operations of the neural network reasoning process and transmitting the results to the global buffer.
2. The data flow path apparatus for a memory-calculation integrated array according to claim 1, wherein the global buffer comprises two static random access memories forming a ping-pong structure, an input selection unit and an output selection unit, the input selection unit being used for transmitting external data into the static random access memories and the output selection unit being used for transmitting data from the static random access memories to the waiting-area register set; one static random access memory is used for storing the input data of the current step of the neural network reasoning process, and the other memory is used for storing the data of the subsequent step, i.e. the backup data.
3. The data flow path apparatus for a memory-calculation integrated array according to claim 1, wherein both memories in the global buffer are connected to an external general interface, the waiting-area register set and the single-instruction multiple-data-stream module.
4. The data flow path apparatus for a memory-calculation integrated array according to claim 1, wherein the single-instruction multiple-data-stream module comprises a pooling module for performing three-step pooling on the to-be-pooled results received from the memory-calculation integrated array.
5. A method of operating the data flow path apparatus for a memory-calculation integrated array according to claim 1, comprising the following steps:
(1) Mapping the trained weights of the neural network, comprising convolution-layer weights and fully-connected-layer weights, into the memory-calculation integrated array;
(2) Inputting the initial feature map into the global buffer through the general interface, where the input selection unit stores the data stream into the first memory; the input selection unit then switches the route of the data flow path, so that the data stream of the next buffer period is stored into the second memory when it arrives; in the same way, the input selection unit alternately directs the data streams of subsequent buffer periods to the first and the second memory;
(3) The output selection unit fetches the data stream from whichever memory currently holds it and transmits it to the waiting-area register set; the control module then drives the data stream into the memory-calculation integrated array, where it is computed against the mapped weights, and the generated intermediate result is transmitted to the single-instruction multiple-data-stream module;
(4) Determining, according to the configuration information of the neural network, whether a pooling or activation operation is needed: if the next step is a pooling operation, the data stream is configured to enter the pooling module for the three-step pooling operation; if the next step is an activation operation, the data stream is configured to enter the activation unit to be activated; if the next step is any other operation, the data stream is left unprocessed;
(5) Transmitting the result obtained in step (4) to the memory selected by the input selection unit in the global buffer, and repeating steps (2) to (4) until the reasoning process of the neural network is completed;
(6) Outputting the calculation result from the global buffer to the outside of the system through the general interface.
6. The method according to claim 5, wherein in step (1), if the weight currently to be mapped is a convolution-layer weight and the weight block is small, the weight block is replicated multiple times by diagonal filling, the copy multiple taking as its reference the number of steps the convolution sliding window needs to sweep an entire row or an entire column and being as large as possible without exceeding that reference.
7. The method according to claim 6, wherein in step (1), if the weight currently to be mapped is a convolution-layer weight, the weight block is small, and the step following the current convolution operation is a pooling operation, then during weight mapping the weights of the same channel in adjacent replicated weight blocks are remapped to adjacent columns of the memory-calculation integrated array.
8. The method according to claim 5, wherein in step (4) the three-step pooling operation comprises the following steps:
1) Executing a pooling operation on every two adjacent data in the first line of data to be pooled, and temporarily storing the result in a built-in buffer;
2) Waiting for the arrival of the second line of data to be pooled, corresponding to the data in step 1), and then executing a pooling operation on every two adjacent data;
3) Carrying out a third pooling operation on the results obtained in steps 1) and 2) to obtain the final pooling result.
CN202310744336.8A 2023-06-21 2023-06-21 Data flow path device and method suitable for memory and calculation integrated array Pending CN116911365A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310744336.8A CN116911365A (en) 2023-06-21 2023-06-21 Data flow path device and method suitable for memory and calculation integrated array

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310744336.8A CN116911365A (en) 2023-06-21 2023-06-21 Data flow path device and method suitable for memory and calculation integrated array

Publications (1)

Publication Number Publication Date
CN116911365A true CN116911365A (en) 2023-10-20

Family

ID=88354004

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310744336.8A Pending CN116911365A (en) 2023-06-21 2023-06-21 Data flow path device and method suitable for memory and calculation integrated array

Country Status (1)

Country Link
CN (1) CN116911365A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117519802A (en) * 2024-01-08 2024-02-06 之江实验室 Data processing device based on integrated memory and calculation unit
CN117519802B (en) * 2024-01-08 2024-04-30 之江实验室 Data processing device based on integrated memory and calculation unit

Similar Documents

Publication Publication Date Title
Chu et al. PIM-prune: Fine-grain DCNN pruning for crossbar-based process-in-memory architecture
CN109063825B (en) Convolutional neural network accelerator
Ma et al. End-to-end scalable FPGA accelerator for deep residual networks
EP0085520B1 (en) An array processor architecture utilizing modular elemental processors
US4679163A (en) Inverse discrete cosine transform calculation processor
US20180046905A1 (en) Efficient Data Access Control Device for Neural Network Hardware Acceleration System
CN108171317A (en) A kind of data-reusing convolutional neural networks accelerator based on SOC
CN107229967A (en) A kind of hardware accelerator and method that rarefaction GRU neutral nets are realized based on FPGA
CN116911365A (en) Data flow path device and method suitable for memory and calculation integrated array
CN111095300B (en) Neural network operation circuit using semiconductor memory element
US20200167405A1 (en) Convolutional operation device with dimensional conversion
CN112329910B (en) Deep convolution neural network compression method for structure pruning combined quantization
CN112950656A (en) Block convolution method for pre-reading data according to channel based on FPGA platform
CN110866596A (en) Semiconductor integrated circuit having a plurality of transistors
CN112181895A (en) Reconfigurable architecture, accelerator, circuit deployment and data flow calculation method
CN110766136B (en) Compression method of sparse matrix and vector
CN115879530B (en) RRAM (remote radio access m) memory-oriented computing system array structure optimization method
CN113159302B (en) Routing structure for reconfigurable neural network processor
US6467020B1 (en) Combined associate processor and memory architecture
CN116050492A (en) Expansion unit
US11139839B1 (en) Polar code decoder and a method for polar code decoding
CN112115665B (en) Integrated memory array and convolution operation method thereof
CN109583577B (en) Arithmetic device and method
CN115496190A (en) Efficient reconfigurable hardware accelerator for convolutional neural network training
CN112766453A (en) Data processing device and data processing method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination