CN113128675A - Multiplication-free convolution scheduler based on spiking neural network and hardware implementation method thereof - Google Patents

Multiplication-free convolution scheduler based on spiking neural network and hardware implementation method thereof

Info

Publication number
CN113128675A
Authority
CN
China
Prior art keywords
neural network
convolution
calculation
neuron
state
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110431741.5A
Other languages
Chinese (zh)
Other versions
CN113128675B (en)
Inventor
李丽
徐瑾
傅玉祥
陈沁雨
王心沅
沈思睿
李伟
何书专
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University
Priority to CN202110431741.5A priority Critical patent/CN113128675B/en
Publication of CN113128675A publication Critical patent/CN113128675A/en
Application granted granted Critical
Publication of CN113128675B publication Critical patent/CN113128675B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

The invention provides a multiplication-free convolution scheduler based on a spiking neural network (SNN) and a hardware implementation method thereof. Exploiting the event-driven nature of the SNN, convolution calculation in the SNN is realized in hardware, providing an effective convolution scheduling method for SNNs in image segmentation. The method buffers the input neuron states in a FIFO and sends them to a "1" filter to extract the valid states, so that invalid states do not participate in calculation, improving calculation efficiency without any multiplication. A parallel storage structure tailored to the characteristics of the data stream achieves parallel access with few storage resources, matching the high parallelism of the calculation unit. During calculation, the result of each time step is stored back in place, improving the utilization of storage resources. Finally, 3 × 3 convolution over inputs of any size to the spiking neural network is supported, with 64-way parallel calculation. The method improves the performance of convolution calculation in the neural network, reduces calculation complexity and power consumption, and offers high flexibility.

Description

Multiplication-free convolution scheduler based on spiking neural network and hardware implementation method thereof
Technical Field
The invention relates to the field of convolutional neural network algorithms, and in particular to a multiplication-free convolution scheduler based on a spiking neural network and a hardware implementation method thereof.
Background
In recent years, the complexity of neural networks has steadily increased. A traditional artificial neural network (ANN) typically has a huge number of parameters that must participate in matrix multiplication, so it consumes large amounts of memory and power when implemented on a hardware platform. Compared with the traditional ANN, the spiking neural network (SNN), inspired by the biological brain, models neurons directly in hardware, each neuron carrying thousands of synapses. The spike connections between neurons are binary, which means no multiplication is needed: the operation can be completed with addition alone. Moreover, an SNN is event-driven, and the sparsity of activated neurons can be exploited to pursue higher efficiency and lower power consumption.
In a spiking neural network (SNN), each neuron corresponds to a membrane potential that is updated per time step. At each step, the input of a neuron consists of the spike activation signals of the neurons in the previous layer; the new membrane potential is obtained by accumulating the weights corresponding to the valid neurons within a convolution kernel, and when the membrane potential reaches a threshold, the neuron is activated and emits a spike activation signal to the subsequent neurons.
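For reference, a minimal integrate-and-fire update consistent with this description can be written as follows (the notation is ours; the patent states the rule only in prose):

```latex
V_j[t] = V_j[t-1] + \sum_{i \in K_j,\; s_i[t]=1} w_{ij},
\qquad
s_j[t] =
\begin{cases}
1, & V_j[t] \ge V_{th} \\
0, & \text{otherwise}
\end{cases}
```

Because the spikes s_i[t] are binary, the multiply-accumulate of a conventional convolution collapses into a conditional accumulation of weights, which is why the scheduler needs no hardware multiplier.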
By utilizing these advantages of the SNN, realizing an efficient SNN biological model in hardware has gradually become a research hotspot. However, SNNs execute far from ideally on current CPU or GPU architectures: compared with the architecture a CPU provides, an SNN needs higher parallelism, while a GPU offers highly parallel calculation but is ill-suited to an event-driven computation mode. Meanwhile, although an SNN can achieve higher efficiency by reducing latency and computational effort, it must receive and process input data across multiple time steps, and the repeated data accesses can result in low throughput. In addition, the activated neurons of an SNN are sparse, so its data storage structure and control flow also require special consideration.
Disclosure of Invention
The purpose of the invention is as follows: to overcome the defects of the prior art, the invention provides a multiplication-free convolution scheduler based on a spiking neural network and a hardware implementation method thereof. The input neuron states are buffered in a FIFO and sent to a "1" filter to extract the valid states, so that invalid states do not participate in calculation, improving calculation efficiency without any multiplication. A parallel storage structure tailored to the characteristics of the data stream achieves parallel access with few storage resources, matching the high parallelism of the calculation unit. During calculation, the result of each time step is stored back in place, improving the utilization of storage resources. Finally, 3 × 3 convolution over inputs of any size is supported, with 64-way parallel calculation.
The technical scheme is as follows: a multiplication-free convolution scheduler based on a spiking neural network comprises a processor, an external DDR memory and a hardware accelerator, wherein the hardware accelerator comprises a convolution controller, a storage unit and a calculation unit;
the convolution controller is responsible for decoding instructions from the processor, controlling the overall execution of the convolution calculation, reading and writing data from the storage unit, managing the input and output of the calculation unit, and updating the neuron states according to the read data;
the storage unit comprises three separate parts, storing the neuron states (NeuronState), the membrane potentials (Vmem) and the synaptic weights (Weight), respectively;
the calculation unit bears most of the computing load: it accumulates the spike signals emitted by the valid neurons of the previous layer, judges from them whether the neurons of the current layer are activated, and finally updates the neuron states.
In a further embodiment, the convolution controller manages the input and output of the calculation unit so that only the influence of activated neurons on the next layer is considered; non-activated neurons are kept out of the calculation, effectively saving calculation time.
In a further embodiment, the convolution controller controls the overall execution of the convolution calculation as follows: first, the neuron states corresponding to one kernel are read into the FIFO in order; next, the word in the FIFO is sent to the "1" filter, which extracts the indexes of the neurons whose state value is 1 within the kernel and decodes each index into a weight address; then, the corresponding Weight and Vmem values are fetched from the storage unit according to the weight address; finally, the Weight and Vmem are sent to the calculation unit, and the resulting Vmem and state are stored back in place.
In a further embodiment, a small RAM is used and the storage structure of NeuronState is specially arranged so that one layer of NeuronState within a kernel can be fetched 8 ways in parallel; through three levels of caching, all NeuronState values of a kernel can be fetched in one cycle and stored into the FIFO, achieving pipelined access.
In a further embodiment, each position of the FIFO buffers N neuron states, where N is the number of weights corresponding to one kernel and a "1" denotes an activated neuron. With M denoting the number of activated neurons in the kernel, the "1" filter extracts the indexes of the M activated neurons in M cycles, so the processing speed is N/M times that of a conventional CNN; for sparse data streams, where N/M is large, the optimization effect is substantial.
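The patent publishes no RTL for the "1" filter; the Verilog sketch below shows one plausible realization as an iterative priority encoder that emits the lowest set-bit index each cycle and clears it, so a word with M ones is consumed in M cycles. The module name, ports, 5-bit index width (valid for N ≤ 32) and load/valid handshake are all our assumptions.

```verilog
// Hypothetical sketch of the "1" filter: emits one set-bit index per cycle,
// so M active neurons cost M cycles instead of N.
module one_filter #(
    parameter N = 27                    // states per kernel (3 x 3 x 3)
)(
    input  wire          clk,
    input  wire          rst_n,
    input  wire          load,          // latch a new STATE word from the FIFO
    input  wire [N-1:0]  state_in,      // neuron states of one kernel
    output reg  [4:0]    index,         // index of the current '1'
    output reg           index_valid,   // index is valid this cycle
    output wire          done           // all '1's have been emitted
);
    reg [N-1:0] pending;                // '1's not yet emitted
    assign done = (pending == {N{1'b0}});

    // combinational lowest-set-bit search
    reg [4:0] lsb;
    reg       found;
    integer   i;
    always @* begin
        lsb   = 5'd0;
        found = 1'b0;
        for (i = N - 1; i >= 0; i = i - 1)
            if (pending[i]) begin       // last hit wins: the lowest index
                lsb   = i[4:0];
                found = 1'b1;
            end
    end

    always @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            pending     <= {N{1'b0}};
            index       <= 5'd0;
            index_valid <= 1'b0;
        end else if (load) begin
            pending     <= state_in;    // a new word enters the filter
            index_valid <= 1'b0;
        end else if (found) begin
            index       <= lsb;         // one weight address per cycle
            index_valid <= 1'b1;
            pending     <= pending & ~(1'b1 << lsb); // clear the emitted bit
        end else begin
            index_valid <= 1'b0;
        end
    end
endmodule
```

Loaded with the 27-bit example word given later in the description (000_010_000_101_000_100_000_000_010), this encoder would emit the indexes 1, 11, 15, 17 and 22 over five cycles.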
In a further embodiment, the convolution module adopts 64-way parallel calculation and supports 3 × 3 convolution over inputs of any size: 8 controllers compute 8 rows of convolution results, and each controller is responsible for the results of 8 channels, where 1 channel represents 1 channel of the convolution result; every channel shares the same input neuron states and corresponds to different weights.
In a further embodiment, Vmem and Weight require 8-way parallel storage to match the high parallelism of the calculation unit. The Vmem data are stored row by row into RAMs numbered 0-7, so that 8 rows of source data can be supplied simultaneously during calculation; within each RAM word the Vmem results of channels 1-8 are packed from high to low, and if there are more than 8 channels, channels 9-16 continue to be stored after channels 1-8, and so on.
In a further embodiment, the Weight specification is 3 × 3 × InC, where InC is the number of channels of the input layer; the results of each layer share the same weights, and there are OutC weights of the 3 × 3 × InC specification, where OutC is the number of channels of the result. Each RAM word packs the 1st to 8th Weight values from high to low, and if OutC is larger than 8, storage continues downward.
To sum up, the hardware implementation method of the multiplication-free convolution scheduler based on the spiking neural network comprises the following steps (a sequencing sketch follows the list):
step 1, storing the NeuronState values corresponding to one kernel into the FIFO in the order of the next layer's results;
step 2, sending the word in the FIFO to the "1" filter, decoding the addresses of the valid weights, and fetching the next word once filtering is finished;
step 3, fetching the corresponding Weight and Vmem values from the storage unit according to the decoded address;
step 4, sending the Weight and Vmem into the calculation unit, and storing the resulting Vmem and NeuronState back in place in the storage unit.
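As a hedged illustration of how these four steps might be sequenced in hardware (the patent describes the schedule but not its RTL), the Verilog skeleton below sketches a four-phase controller; every state name and handshake signal is our assumption:

```verilog
// Minimal, hypothetical sequencing skeleton for the four steps above.
module conv_sched_fsm (
    input  wire       clk,
    input  wire       rst_n,
    input  wire       start,         // a kernel window is ready to process
    input  wire       state_loaded,  // step 1 done: states sit in the FIFO
    input  wire       filter_done,   // step 2 done: all '1' indexes emitted
    input  wire       fetch_done,    // step 3 done: Weight/Vmem words read
    input  wire       calc_done,     // step 4 done: results written back
    output reg  [2:0] phase
);
    localparam IDLE        = 3'd0,
               FETCH_STATE = 3'd1,   // step 1: kernel states -> FIFO
               FILTER_ONES = 3'd2,   // step 2: '1' filter -> weight addresses
               FETCH_SRC   = 3'd3,   // step 3: read Weight and Vmem
               CALC_STORE  = 3'd4;   // step 4: accumulate, store back in place

    always @(posedge clk or negedge rst_n) begin
        if (!rst_n) phase <= IDLE;
        else case (phase)
            IDLE:        if (start)        phase <= FETCH_STATE;
            FETCH_STATE: if (state_loaded) phase <= FILTER_ONES;
            FILTER_ONES: if (filter_done)  phase <= FETCH_SRC;
            FETCH_SRC:   if (fetch_done)   phase <= CALC_STORE;
            CALC_STORE:  if (calc_done)    phase <= IDLE;
            default:                       phase <= IDLE;
        endcase
    end
endmodule
```

In the actual design the phases overlap (the description notes that access and calculation are pipelined), so a production controller would let step 1 of the next kernel proceed while steps 2-4 of the current one drain; this flat FSM only fixes the ordering.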
Beneficial effects:
firstly, the convolution controller achieves pipelined access to NeuronState: through the parallel storage and caching of NeuronState, all states corresponding to one kernel can be stored into the FIFO in one beat, greatly reducing the time spent on data transport and improving the data throughput;
secondly, the "1" filter exploits the sparsity of activated neurons to keep inactive neurons out of the calculation, effectively saving calculation time and achieving low power consumption alongside high efficiency;
thirdly, in an SNN the spike connections between neurons are binary, so no hardware multiplier is needed, reducing computing resources and hardware implementation complexity.
In summary, the present invention realizes convolution calculation in an SNN in hardware by exploiting the advantages of the SNN; it can effectively improve the performance of convolution calculation in a neural network, reduce calculation complexity, obtain higher efficiency and lower power consumption, and offer high flexibility.
Drawings
Fig. 1 is a hardware structure diagram of the convolution scheduler of the present invention.
Fig. 2 is a control flow diagram of the convolution scheduler of the present invention.
Fig. 3 shows the read-neuron-state module of the convolution scheduler of the present invention.
Fig. 4 is a diagram of the filtering and access module of the present invention.
Fig. 5 is a diagram of the calculation and result-saving module of the present invention.
Fig. 6 is a storage structure diagram of NeuronState in the present invention.
Fig. 7 is a storage structure diagram of Weight in the present invention.
Fig. 8 is a storage structure diagram of Vmem in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings. It should be understood that the detailed description and specific examples are intended for purposes of illustration only and are not intended to limit the scope of the invention.
The first embodiment is as follows:
As shown in Fig. 1, this embodiment proposes the hardware structure of a multiplication-free convolution scheduler based on a spiking neural network, composed of a processor, an external DDR memory, a storage unit, a calculation unit and a convolution controller. The storage unit stores the neuron states (NeuronState), membrane potentials (Vmem) and synaptic weights (Weight); the calculation unit (snpu) comprises 8 × 8 = 64 parallel calculation units and is responsible for judging whether the neurons of the current layer are activated and updating the neuron states; the convolution controller is responsible for decoding instructions from the processor, controlling the overall execution of the convolution calculation, reading and writing data from the storage unit and managing the input and output of the calculation unit.
The convolution controller controls the execution of the convolution calculation; the overall control flow is shown in Fig. 2. First, the neuron STATEs corresponding to one kernel are read into the FIFO in order; next, the STATE word in the FIFO is sent to the "1" filter, which extracts the indexes of the neurons whose STATE value is 1 within the kernel and decodes them into weight addresses; then, the corresponding Weight and Vmem values are fetched from the storage unit according to the weight address; finally, the Weight and Vmem are sent to the calculation unit, and the resulting Vmem and STATE are stored back in place.
For a 3 × 3 convolution with stride 1, the input layer specification is (N+2) × (N+2) × InC and there are OutC weight kernels of specification 3 × 3 × InC, so the output layer is N × N × OutC. Here N denotes the length or width of the input image (the convolution is performed after zero padding), InC is the number of channels of the input layer, and OutC is the number of channels of the output layer, which is also the number of weight kernels.
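As a quick check of these dimensions, the standard output-size formula for a stride-1, 3 × 3 kernel with one pixel of zero padding gives:

```latex
N_{out} = \frac{(N + 2 \cdot 1) - 3}{1} + 1 = N
```

so the padded (N+2) × (N+2) input indeed yields an N × N output map per channel.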
Example two:
On the basis of the first embodiment, an example of the invention is described in detail below with N = 80, InC = 3 and OutC = 8; in practical applications, the invention supports 3 × 3 convolution over inputs of any size. The hardware acceleration system is designed in the Verilog HDL language, and basic test verification has been completed with VCS and on an FPGA. The specific steps are as follows:
Step 1
fetch_STATE_to_FIFO reads the STATEs into the FIFO. As shown in Fig. 3, one kernel corresponds to 3 × 3 × 3 = 27 STATEs, stored into the FIFO in a specific order: FIFO_DATA[2:0] holds the 3 STATEs of row 1, column 1 of the kernel, FIFO_DATA[5:3] holds the 3 STATEs of row 1, column 2, and so on, up to FIFO_DATA[26:24], which holds the 3 STATEs of row 3, column 3.
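A minimal Verilog sketch of this packing, assuming (our assumption, not stated in the text) that each kernel row arrives as a 9-bit slice from the banked NeuronState RAM with column 1 in the low bits and the channel index fastest:

```verilog
// Hypothetical packing of one 3x3x3 window into the 27-bit FIFO word.
// With each row slice ordered {col3_ch3..1, col2_ch3..1, col1_ch3..1}, a
// simple concatenation yields FIFO_DATA[2:0] = row 1 / column 1 and
// FIFO_DATA[26:24] = row 3 / column 3, matching FIG. 3.
module state_pack (
    input  wire [8:0]  row1_states, // kernel row 1: 3 columns x 3 channels
    input  wire [8:0]  row2_states, // kernel row 2
    input  wire [8:0]  row3_states, // kernel row 3
    output wire [26:0] fifo_data    // word pushed into the STATE FIFO
);
    assign fifo_data = {row3_states, row2_states, row1_states};
endmodule
```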
The method adopts 64-way parallel calculation. For Step 1, 8 controllers compute 8 rows of convolution results; their source data are different STATEs, so 8 STATE FIFOs are needed, respectively storing the kernel source STATEs corresponding to the 8 rows of convolution results. Meanwhile, each controller simultaneously computes the results of 8 channels; every channel shares the same source STATE and corresponds to different weights.
Further, to obtain the result of row 1, column 1, the STATEs of rows 1-3 and columns 1-3 must be fetched, the positions of the 1s in the STATE determined, and the corresponding weights fetched accordingly. This requires the 27 STATEs to be stored into the FIFO according to the unified order described above; for example, the 27-bit word 000_010_000_101_000_100_000_000_010 is stored in the STATE FIFO, as shown in Fig. 3.
In this design the convolutional layer is followed by a pooling layer that performs max pooling with kernel = 2 × 2, so the convolution result of row 1 must be computed first and row 2 computed after it; that is, 8 odd rows are computed first and then the even rows. The calculation of the odd rows is denoted state1 and that of the even rows state2: rows 1, 3, 5, … 15 are computed in state1, then the even rows 2, 4, … 16 are computed 8 ways in parallel in state2, and after the convolution of the first 16 rows is complete, the convolution results of the next 16 rows are computed.
Furthermore, to improve read efficiency, the 27 STATE values must be fetched within one clock cycle, i.e. eight 27-bit words are stored into the 8 FIFOs per beat; fetching one layer of neuron STATEs of a kernel 8 ways in parallel therefore places certain requirements on how the STATEs are stored.
Fig. 6 shows the storage structure of NeuronState. When computing the convolution result of row 16, STATE rows 16, 17 and 18 must be read, which requires rows 16, 17 and 18 to be stored at the same address in the NeuronState RAM; likewise, computing the convolution result of row 17 requires reading STATE rows 17, 18 and 19, so these must also share one address. Since the RAM width is limited, boundary rows must be stored repeatedly: as shown in Fig. 6, STATE rows 17 and 18 are stored twice, which enables pipelined STATE fetching.
Step 2
This module adopts 8-way parallelism: the STATE words are taken out of the 8 FIFOs and sent to the filters, the indexes of the neurons whose STATE value is 1 within the kernel are output, and once one round of filtering is finished the next word is popped from each FIFO.
In Fig. 3, the 27-bit word 000_010_000_101_000_100_000_000_010 stored in the STATE FIFO contains five 1s; after 5 beats the filtering operation is complete, and the filter outputs the corresponding indexes 1, 11, 15, 17 and 22.
Step 3
The fetch_src_to_snpu module decodes the filter output index into a Weight address, fetches the corresponding Weight and Vmem values from the storage unit according to that address, and sends them into the snpu calculation unit for calculation.
The index resolved by the filter is decoded into a Weight address, from which the activated weight is fetched, together with the corresponding 8 Vmem values. The 8 controllers compute 8 rows of convolution results while each controller simultaneously computes the results of 8 channels, so the calculation unit completes 8 × 8 = 64-way parallel convolution calculation, and the calculation between layers is pipelined.
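The patent gives no RTL for a single processing element; the sketch below shows, under our own naming, width and signedness assumptions, how each of the 64 elements could reduce the convolution to conditional accumulation plus a threshold compare, with no multiplier:

```verilog
// Hypothetical single processing element of the 8 x 8 array: one conditional
// accumulation per filtered '1', then a threshold compare -- no multiplier.
module snpu_pe #(
    parameter WBITS = 16               // precision: an assumption, not stated
)(
    input  wire                     clk,
    input  wire                     rst_n,
    input  wire                     load_vmem, // start of a new output neuron
    input  wire                     valid,     // a filtered '1' arrived this cycle
    input  wire signed [WBITS-1:0]  weight,    // weight selected by the index
    input  wire signed [WBITS-1:0]  vmem_in,   // membrane potential read from RAM
    input  wire signed [WBITS-1:0]  v_th,      // firing threshold
    output reg  signed [WBITS-1:0]  vmem_out,  // updated potential, stored back
    output wire                     spike      // next-layer activation bit
);
    always @(posedge clk or negedge rst_n) begin
        if (!rst_n)         vmem_out <= {WBITS{1'b0}};
        else if (load_vmem) vmem_out <= vmem_in;
        else if (valid)     vmem_out <= vmem_out + weight; // add, never multiply
    end
    assign spike = (vmem_out >= v_th);
endmodule
```

Sixty-four such elements (8 rows × 8 channels) would share the filtered index stream, each pairing its own Weight and Vmem operands.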
Fig. 4 contains more details of Step 3. One kernel corresponds to 27 weights, the 8 kernels correspond to the 8 channels of the result, and every channel shares the same input neuron STATE, so the valid weight positions in kernels 0-7 are identical and the same address can be fetched for all of them simultaneously. Each address stores the Weight values of kernels 0-7 packed from high to low, and within each kernel the weights are stored at addresses 0-26 in row-major order; the storage structure of Weight is shown in Fig. 7.
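A hedged sketch of the read side of this Weight layout, assuming (our assumption, not stated here) a 16-bit weight precision:

```verilog
// Hypothetical read-side slicing for the Weight layout of FIG. 7: address a
// (0..26) holds kernel position a for all 8 kernels, kernel 0 in the
// highest bits.
module weight_slice #(
    parameter WBITS = 16
)(
    input  wire [4:0]          one_index,   // index from the '1' filter, 0..26
    input  wire [8*WBITS-1:0]  weight_word, // one RAM word: kernels 0..7
    output wire [4:0]          weight_addr, // address to present to the RAM
    output wire [8*WBITS-1:0]  weight_flat  // kernel k's weight at [k*WBITS +: WBITS]
);
    // The filtered index is itself the weight address, since weights sit at
    // addresses 0-26 in the same row-major order as the states.
    assign weight_addr = one_index;

    genvar k;
    generate
        for (k = 0; k < 8; k = k + 1) begin : g_slice
            // kernels are packed high to low, so kernel k occupies slot 7-k
            assign weight_flat[k*WBITS +: WBITS] = weight_word[(7-k)*WBITS +: WBITS];
        end
    endgenerate
endmodule
```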
Meanwhile, the Vmem value of the output layer from the previous time step must be read and updated on that basis. The storage structure of Vmem is shown in Fig. 8: the Vmem data are stored row by row into the RAMs numbered 0-7, so that 8 rows of source data can be supplied during calculation; within each RAM word the Vmem results of channels 1-8 are packed from high to low, and if there are more than 8 channels, channels 9-16 continue to be stored after channels 1-8, and so on.
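The row-to-RAM mapping implied by Fig. 8 can be sketched as follows; the widths and exact address arithmetic are our assumptions, and the multiply here is constant address generation only, so the compute datapath itself stays multiplier-free:

```verilog
// Hypothetical address decode for the Vmem layout of FIG. 8: output row r
// lives in RAM (r % 8), and rows 8 apart share a RAM. Channel packing within
// a word (channels 1-8, high to low) is not shown.
module vmem_map #(
    parameter ROW_WORDS = 80        // words per output row, e.g. N = 80
)(
    input  wire [7:0]  row,         // output row index
    input  wire [7:0]  col,         // output column index
    output wire [2:0]  ram_sel,     // which of the 8 Vmem RAMs
    output wire [12:0] ram_addr     // word address within that RAM
);
    assign ram_sel  = row[2:0];     // row % 8
    // (row / 8) full rows precede this one inside the selected RAM.
    assign ram_addr = (row >> 3) * ROW_WORDS + col;
endmodule
```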
Step 4
fetch_res_from_snpu saves the result Vmem and neuron state returned by the calculation unit back to the storage unit in place.
Design verification was carried out according to this scheme: with N neurons in total, of which M are activated, an optimization of N/M times over a conventional convolutional neural network is obtained, and the sparser the activations, the better the effect. In summary, the hardware implementation method of the multiplication-free convolution scheduler based on the spiking neural network supports 3 × 3 convolution over inputs of any size, supports 64-way parallel calculation, needs no hardware multiplier, can effectively improve the performance of convolution calculation in a neural network, reduces calculation complexity, obtains higher efficiency and lower power consumption, and offers high flexibility.
Although the preferred embodiments of the present invention have been described in detail, the present invention is not limited to the details of those embodiments; various equivalent modifications can be made within the technical spirit of the present invention, and all such modifications fall within the scope of the present invention.

Claims (7)

1. A multiplication-free convolution scheduler based on a spiking neural network, comprising:
a processor;
a storage unit comprising at least one set of first storage regions for storing neuron states, at least one set of second storage regions for storing membrane potentials, and at least one set of third storage regions for storing synaptic weights;
a convolution controller for decoding instructions from the processor and controlling the overall execution of the convolution calculation, wherein the convolution controller reads and writes data from the storage unit, manages the input and output of the calculation unit, and updates the neuron states according to the data;
and a calculation unit, electrically connected with the convolution controller, for accumulating the spike signals emitted by the valid neurons of the previous layer, judging from them whether the neurons of the current layer are activated, and finally updating the neuron states.
2. The multiplication-free convolution scheduler based on a spiking neural network of claim 1, wherein the membrane potentials and synaptic weights are stored at least 8 ways in parallel; the membrane potential data are stored row by row into RAMs numbered 0-7, each RAM word packing the membrane potential results of channels 1-8 from high to low, and if there are more than 8 channels, channels 9-16 continue to be stored after channels 1-8, and so on.
3. The multiplication-free convolution scheduler based on a spiking neural network of claim 1, wherein the synaptic weight specification is 3 × 3 × InC, where InC is the number of channels of the input layer; the results of each layer share the same weights, and there are OutC weights of the 3 × 3 × InC specification, where OutC is the number of channels of the result; each RAM word packs the 1st to 8th synaptic weight values from high to low, and if OutC is larger than 8, storage continues downward.
4. The multiplication-free convolution scheduler based on a spiking neural network of claim 1, wherein the convolution controller adopts at least 64-way parallel calculation: 8 controllers compute 8 rows of convolution results, each controller being responsible for the results of 8 channels, 1 channel representing 1 channel of the convolution result; every channel shares the same input neuron states and corresponds to different weights.
5. A hardware implementation method of a multiplication-free convolution scheduler based on a spiking neural network, characterized by comprising the following steps:
step 1, storing the neuron states corresponding to one kernel into a first-in first-out queue in the order of the next layer's results;
step 2, sending the word in the first-in first-out queue into a "1" filter, filtering out the position indexes of the neurons whose state value is 1 within the kernel, decoding the addresses of the valid weights, and fetching the next word once filtering is finished;
step 3, reading the corresponding synaptic weight value and membrane potential value from the storage unit according to the decoded address;
step 4, sending the synaptic weight value and the membrane potential value into a calculation unit, and storing the calculation result back in place in the storage unit.
6. The hardware implementation method of the multiplication-free convolution scheduler based on the spiking neural network of claim 5, wherein a RAM smaller than a predetermined size is used so that one layer of neuron states of at least 1 kernel is fetched in parallel, all neuron states of the kernel are fetched within 1 predetermined cycle through at least three levels of caching, and the fetched neuron states are stored into the first-in first-out queue to realize pipelined fetching.
7. The hardware implementation method of the multiplication-free convolution scheduler based on the spiking neural network of claim 5, wherein each position in the first-in first-out queue stores N neuron states;
wherein N is the number of weights corresponding to 1 kernel, a state value of 1 represents an activated neuron, M is the number of activated neurons in the kernel, and the indexes corresponding to the M neurons whose state is 1 are filtered out in M cycles by the "1" filter.
CN202110431741.5A 2021-04-21 2021-04-21 Multiplication-free convolution scheduler based on spiking neural network and hardware implementation method thereof Active CN113128675B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110431741.5A CN113128675B (en) 2021-04-21 2021-04-21 Multiplication-free convolution scheduler based on spiking neural network and hardware implementation method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110431741.5A CN113128675B (en) 2021-04-21 2021-04-21 Multiplication-free convolution scheduler based on spiking neural network and hardware implementation method thereof

Publications (2)

Publication Number Publication Date
CN113128675A true CN113128675A (en) 2021-07-16
CN113128675B CN113128675B (en) 2023-12-26

Family

ID=76778696

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110431741.5A Active CN113128675B (en) 2021-04-21 2021-04-21 Multiplication-free convolution scheduler based on spiking neural network and hardware implementation method thereof

Country Status (1)

Country Link
CN (1) CN113128675B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113688983A (en) * 2021-08-09 2021-11-23 上海新氦类脑智能科技有限公司 Convolution operation implementation method, circuit and terminal for reducing weight storage in impulse neural network
CN114781633A (en) * 2022-06-17 2022-07-22 电子科技大学 Processor fusing artificial neural network and pulse neural network
CN114819114A (en) * 2022-07-04 2022-07-29 南京大学 Pulse neural network hardware accelerator and optimization method thereof in convolution operation
CN116205274A (en) * 2023-04-27 2023-06-02 苏州浪潮智能科技有限公司 Control method, device, equipment and storage medium of impulse neural network
CN117054396A (en) * 2023-10-11 2023-11-14 天津大学 Raman spectrum detection method and device based on double-path multiplicative neural network

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015016640A1 (en) * 2013-08-02 2015-02-05 Ahn Byungik Neural network computing device, system and method
CN107092959A (en) * 2017-04-07 2017-08-25 武汉大学 Hardware friendly impulsive neural networks model based on STDP unsupervised-learning algorithms
US20180189648A1 (en) * 2016-12-30 2018-07-05 Intel Corporation Event driven and time hopping neural network
CN108846408A (en) * 2018-04-25 2018-11-20 中国人民解放军军事科学院军事医学研究院 Image classification method and device based on impulsive neural networks
US20190005376A1 (en) * 2017-06-30 2019-01-03 Intel Corporation In-memory spiking neural networks for memory array architectures
KR20210004349A (en) * 2019-07-04 2021-01-13 한국과학기술연구원 Neuromodule device and signaling method performed on the same

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015016640A1 (en) * 2013-08-02 2015-02-05 Ahn Byungik Neural network computing device, system and method
US20180189648A1 (en) * 2016-12-30 2018-07-05 Intel Corporation Event driven and time hopping neural network
CN107092959A (en) * 2017-04-07 2017-08-25 武汉大学 Hardware friendly impulsive neural networks model based on STDP unsupervised-learning algorithms
US20190005376A1 (en) * 2017-06-30 2019-01-03 Intel Corporation In-memory spiking neural networks for memory array architectures
CN108846408A (en) * 2018-04-25 2018-11-20 中国人民解放军军事科学院军事医学研究院 Image classification method and device based on impulsive neural networks
KR20210004349A (en) * 2019-07-04 2021-01-13 한국과학기술연구원 Neuromodule device and signaling method performed on the same

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
沈阳靖; 沈君成; 叶俊; 马琪: "Design of a spiking neural network accelerator based on FPGA" (基于FPGA的脉冲神经网络加速器设计), 电子科技 (Electronic Science & Technology), no. 10

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113688983A (en) * 2021-08-09 2021-11-23 上海新氦类脑智能科技有限公司 Convolution operation implementation method, circuit and terminal for reducing weight storage in impulse neural network
CN114781633A (en) * 2022-06-17 2022-07-22 电子科技大学 Processor fusing artificial neural network and pulse neural network
CN114819114A (en) * 2022-07-04 2022-07-29 南京大学 Pulse neural network hardware accelerator and optimization method thereof in convolution operation
CN114819114B (en) * 2022-07-04 2022-09-13 南京大学 Pulse neural network hardware accelerator and optimization method thereof in convolution operation
CN116205274A (en) * 2023-04-27 2023-06-02 苏州浪潮智能科技有限公司 Control method, device, equipment and storage medium of impulse neural network
CN117054396A (en) * 2023-10-11 2023-11-14 天津大学 Raman spectrum detection method and device based on double-path multiplicative neural network
CN117054396B (en) * 2023-10-11 2024-01-05 天津大学 Raman spectrum detection method and device based on double-path multiplicative neural network

Also Published As

Publication number Publication date
CN113128675B (en) 2023-12-26

Similar Documents

Publication Publication Date Title
CN113128675B (en) Multiplication-free convolution scheduler based on spiking neural network and hardware implementation method thereof
CN109598338B (en) Convolutional neural network accelerator based on FPGA (field programmable Gate array) for calculation optimization
CN111325321B (en) Brain-like computing system based on multi-neural network fusion and execution method of instruction set
CN110334799A (en) Integrated ANN Reasoning and training accelerator and its operation method are calculated based on depositing
CN115516450B (en) Inference engine circuit architecture
CN111105023B (en) Data stream reconstruction method and reconfigurable data stream processor
CN111626403B (en) Convolutional neural network accelerator based on CPU-FPGA memory sharing
CN111144556A (en) Hardware circuit of range batch processing normalization algorithm for deep neural network training and reasoning
Sommer et al. Efficient hardware acceleration of sparsely active convolutional spiking neural networks
CN111582465A (en) Convolutional neural network acceleration processing system and method based on FPGA and terminal
US11501151B2 (en) Pipelined accumulator
CN115423081A (en) Neural network accelerator based on CNN _ LSTM algorithm of FPGA
CN113191488A (en) LSTM network model-oriented hardware acceleration system
CN114429214A (en) Arithmetic unit, related device and method
CN116822600A (en) Neural network search chip based on RISC-V architecture
CN108073548B (en) Convolution operation device and convolution operation method
CN115222028A (en) One-dimensional CNN-LSTM acceleration platform based on FPGA and implementation method
CN115719088B (en) Intermediate cache scheduling circuit device supporting in-memory CNN
Yu et al. Implementation of convolutional neural network with co-design of high-level synthesis and verilog HDL
CN118070855B (en) Convolutional neural network accelerator based on RISC-V architecture
CN115936064B (en) Neural network acceleration array based on weight circulation data stream
CN113988280B (en) Array computing accelerator architecture based on binarization neural network
US20220284265A1 (en) Hardware architecture for spiking neural networks and method of operating
CN110738310B (en) Sparse neural network accelerator and implementation method thereof
CN112487352B (en) Fast Fourier transform operation method on reconfigurable processor and reconfigurable processor

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant