CN110738308B - Neural network accelerator - Google Patents

Neural network accelerator

Info

Publication number
CN110738308B
CN110738308B (application CN201910900439.2A)
Authority
CN
China
Prior art keywords
module
data processing
processing module
channel
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910900439.2A
Other languages
Chinese (zh)
Other versions
CN110738308A (en)
Inventor
陈小柏
赖青松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CN201910900439.2A priority Critical patent/CN110738308B/en
Publication of CN110738308A publication Critical patent/CN110738308A/en
Application granted granted Critical
Publication of CN110738308B publication Critical patent/CN110738308B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a neural network accelerator that is externally connected to a global control module, a first direct memory access (DMA) module and a memory. The neural network accelerator comprises a second direct memory access (DMA) module, a convolution module, a single data processing module, a plane data processing module, a channel data processing module and a probability calculation module, which are connected in a pipeline. The convolution module performs multiply-add operations on the data; the single data processing module performs normalization, proportional (scaling) operation and activation-function processing; the plane data processing module performs maximum, minimum and average pooling; the channel data processing module performs channel splicing, surface rearrangement and matrix permutation; the probability calculation module finds the five largest values in the data and computes their probabilities; the second DMA module transmits data to the convolution module; and the convolution module and the channel data processing module share a DMA control bus.

Description

Neural network accelerator
Technical Field
The invention relates to the technical field of integrated circuits, in particular to a neural network accelerator.
Background
Convolutional Neural Networks (CNNs) are important algorithms for deep learning and are very widely used in computer vision, especially image recognition. At present, almost all recognition and detection problems take convolutional neural networks as the first-choice method, and major IT companies worldwide are competing in related research.
From the perspective of a computer, an image is simply a two-dimensional matrix, and a convolutional neural network extracts features from that two-dimensional array by convolution, pooling and other operations in order to recognize the image. In principle, any data that can be converted into a two-dimensional matrix can be recognized and detected with a convolutional neural network. For example, a sound file can be divided into short segments and the pitch level of each segment converted into numbers, so that the whole sound file becomes a two-dimensional matrix; likewise, text data in natural language, chemical data from medical experiments and the like can be recognized and detected with a convolutional neural network.
Compared with conventional algorithms, CNNs demand much more computation and memory bandwidth. At present the computation is done mainly with Central Processing Unit (CPU) arrays and Graphics Processing Unit (GPU) arrays. However, general-purpose processors such as CPUs and GPUs cannot fully exploit the characteristics of convolutional neural networks, so the operation efficiency is low and larger power consumption and cost overheads result. Furthermore, there is a growing need to run artificial-intelligence algorithms on terminal devices at low cost, low power consumption and high performance, which existing general-purpose processors cannot meet.
Disclosure of Invention
The invention provides a neural network accelerator that addresses the low operation efficiency of prior-art solutions, which cannot fully exploit the characteristics of convolutional neural networks; the accelerator improves the operation efficiency of the neural network and saves operation time.
In order to achieve the above purpose of the present invention, the following technical solution is adopted: the neural network accelerator is used for externally connecting a global control module, a first direct memory access module DMA and a memory; the neural network accelerator comprises a second direct memory access module DMA, a convolution module, a single data processing module, a plane data processing module, a channel data processing module and a probability calculation module;
the convolution module performs multiply-add operation on input data;
the single data processing module is used for sequentially performing normalization, activation-function and proportional (scaling) operations on the data;
the plane data processing module is used for performing maximum pooling, minimum pooling and average pooling on the data;
the channel data processing module is used for performing channel splicing, surface rearrangement and matrix permutation on the data;
the probability calculation module is used for finding the five largest values in the data and computing the probabilities of these five values;
the convolution module, the single data processing module, the plane data processing module, the channel data processing module and the probability calculation module are connected in a pipeline mode; the second DMA transmits the data to the convolution module;
the convolution module and the channel data processing module share a DMA control bus.
Preferably, the convolution module, the single data processing module, the plane data processing module, the probability calculation module and the channel data processing module are all provided with bypass options;
when an operation only needs the convolution module and the single data processing module, the plane data processing module, the channel data processing module and the probability calculation module are bypassed;
when an operation only needs the convolution module, the single data processing module, the plane data processing module, the channel data processing module and the probability calculation module are all bypassed.
Further, the second direct memory access module DMA is in communication connection with the memory through the AXI communication protocol, and the global control module is provided with an instruction FIFO. The first direct memory access module DMA is controlled by the CPU to load data into the memory and load instructions into the instruction FIFO of the global control module. After all loading is finished, the global control module is started and operation begins: the neural network accelerator reads data through the second DMA to perform the operation, returns an interrupt to the CPU after the operation is finished, stores the resulting data into the memory through the second DMA, and the resulting data is then read out through the first DMA.
Still further, the data includes features and weights. The features are stored in the memory in an N-channel arrangement: a feature is a three-dimensional matrix with width Wi, height Hi and channel number C, arranged in groups of N channels, and the features of each group of N channels are stored in the memory at continuous addresses; the sum of all N equals C; N is a power of 2.
Still further, the global control module includes an instruction FIFO; the global control module receives a starting command, takes out an instruction from the instruction FIFO and distributes the instruction to the convolution module, the single data processing module, the plane data processing module and the probability calculation module;
each instruction is 256 bytes, wherein the 0 th to 3 rd bytes represent a module enabling option, representing whether the module needs to be used or not; the 4 th to 67 th bytes represent control information of the convolution module, including characteristic height, width, channel, convolution kernel size, convolution step and convolution filling information; bytes 68-131 represent control information of the channel data processing module, including characteristic height, width, channel and rearrangement mode information; bytes 132-163 represent control information of the single data processing module, including height, width, channel, operation mode and parameter size information; the 164 th to 195 th bytes represent control information of the plane data processing module, including characteristic height, width, channel, size, pooling step and pooling filling information; bytes 196-227 represent probability calculation module control information, including class length information.
Still further, the convolution module comprises a DMA controller, a data distribution module, a ping-pong RAM, a RAM reading module, a stripe array module, a block accumulation module and a register module;
the DMA controller is used for controlling the DMA to read data from the memory; the data is separated from the characteristics and the weights through a data splitting module and then cached in a ping-pong RAM; the RAM reading module reads characteristics and weights from the ping-pong RAM to the stripe array module for operation processing, and an operation result is output through the block accumulation module; the register module receives instructions from the CPU through the global control module so as to control the operation of the convolution module.
Still further, the ping-pong RAM includes a feature RAM and a weight RAM; the feature RAM comprises two continuous output RAMs, denoted ramA and ramB, each of which comprises Mk sub-RAMs; the weight RAM comprises ramA' and ramB';
the ping-pong RAM adopts a continuous caching mode: a feature is cached in ramA, and while the RAM reading module reads the feature in ramA, the next feature is cached into ramB; after the RAM reading module finishes reading the feature in ramA, it reads the feature in ramB;
similarly, the weights are cached in ramA', and while the RAM reading module reads the weights in ramA', the next weights are cached into ramB'; after the RAM reading module finishes reading the weights in ramA', it reads the weights in ramB';
the stripe array module comprises Mk PE operation units, and the PE operation units comprise Tk/2 multipliers.
Still further, the planar data processing module performs width direction pooling operation on the feature data, and the planar data processing module is provided with N width direction operation units for parallel operation of N channels;
and then carrying out height direction pooling operation on the characteristic data, and simultaneously arranging N height direction operation units to operate N channels in parallel.
Still further, the channel data processing module includes a BUF buffer, 2 selectors, and a rearrangement operator disposed between the 2 selectors; the rearrangement operator comprises a channel splicing operator, a surface rearrangement operator and a matrix permutation operator;
the channel splicing operator is used for splicing two matrices in the channel direction; the two matrices have the same height H and width W but not necessarily the same number of channels, and after channel splicing they become a new feature with height H, width W and C0+C1 channels, where C0 and C1 denote the channel numbers of the two matrices;
the surface rearrangement operator is used for rearranging each surface of a feature into four surfaces, producing a new feature with height H/2, width W/2 and C×4 channels, where H denotes the height of the original feature, W its width and C its number of channels;
the matrix permutation operator is used for changing the dimensions of the matrix, permuting the dimension order of the feature to obtain a different dimension order.
Still further, the probability calculation module comprises a top5_comp module, a probability operation unit and a reg register; the top5_comp module uses a downward searching method to find the 5 largest values in the input data stream; the probability operation unit performs a probability operation on the 5 largest values obtained and transmits the result to the second DMA through the reg register;
the specific formula of the probability operation is as follows:
y_i = e^(x_i) / Σ_j e^(x_j)
where x_i represents the input classification data.
The beneficial effects of the invention are as follows:
1. the invention classifies the operations of the neural network into four types, handled respectively by the convolution module, the single data processing module, the plane data processing module and the channel data processing module, so that the whole neural network accelerator is modularized and the operation is more efficient. Meanwhile, compared with using a traditional CPU as the general controller, using the global control module as the instruction distributor makes processing quicker.
2. The invention adopts the pipeline method to work among the convolution module, the data processing module, the plane data processing module, the channel data processing module and the probability calculation module, thereby avoiding frequent memory caching operation and saving operation time.
3. The convolution module of the invention adopts a mode of independently controlling the second DMA, the speed of reading the memory is faster, and the convolution module adopts a stripe array layer to operate, which can support any convolution kernel size.
Drawings
Fig. 1 is a schematic diagram of the neural network accelerator according to embodiment 1.
Fig. 2 is the CPU control flow described in embodiment 1.
Fig. 3 is a schematic diagram of the N-channel arrangement of the features in embodiment 1.
Fig. 4 is the Glb control flow in embodiment 1.
Fig. 5 is a schematic diagram of the structure of the convolution module in embodiment 1.
Fig. 6 is a schematic diagram of the structure of the single data processing module of embodiment 1.
Fig. 7 is a schematic diagram of the structure of the plane data processing module of embodiment 1.
Fig. 8 is a schematic diagram of the structure of the channel data processing module of embodiment 1.
Fig. 9 is a schematic diagram of the channel splicing of fig. 8.
Fig. 10 is a schematic diagram of the surface rearrangement of fig. 8.
Fig. 11 is a schematic diagram of the matrix permutation of fig. 8.
Fig. 12 is a schematic diagram of the structure of the probability calculation module in embodiment 1.
Fig. 13 is a schematic diagram of the working principle of the top5_comp module in fig. 12.
Fig. 14 is a schematic diagram of bypassing Pdp and Softmax in embodiment 1.
Fig. 15 is a schematic diagram of bypassing Sdp, Pdp and Softmax in embodiment 1.
Fig. 16 is the feature segmentation schematic described in embodiment 2.
Fig. 17 is the ping-pong RAM caching schematic described in embodiment 2.
Fig. 18 is a schematic diagram of the continuous output RAM address and data output described in embodiment 2.
Fig. 19 is a schematic diagram of the multiplier operation described in embodiment 2.
Fig. 20 is a schematic diagram of the structure of the stripe array module described in embodiment 2.
Fig. 21 is a schematic diagram of the operation of the stripe array module described in embodiment 2.
Detailed Description
The invention is described in detail below with reference to the drawings and the detailed description.
Example 1
As shown in fig. 1, a neural network accelerator is used for externally connecting a global control module Glb, a first direct memory access module DMA and a memory DDR3; the neural network accelerator comprises a second direct memory access module DMA, a convolution module conv, a single data processing module sdp, a plane data processing module pdp, a channel data processing module cdp and a probability calculation module softmax;
the convolution module conv performs multiply-add operation on input data;
the single data processing module sdp is used for sequentially performing normalization, activation-function and proportional (scaling) operations on the data;
the plane data processing module pdp is used for performing maximum pooling, minimum pooling and average pooling on the data;
the channel data processing module cdp is used for performing channel splicing, surface rearrangement and matrix permutation on the data;
the probability calculation module softmax is used for finding the five largest values in the data and computing the probabilities of these five values;
the convolution module, the single data processing module, the plane data processing module, the channel data processing module and the probability calculation module are connected in a pipeline mode; the second DMA transmits the data to the convolution module;
the convolution module and the channel data processing module share a DMA control bus.
As shown in fig. 1, the second direct memory access module DMA is in communication connection with the memory through the AXI communication protocol, and the global control module is provided with an instruction FIFO. The first direct memory access module DMA is controlled by the CPU to load data into the memory and load instructions into the instruction FIFO of the global control module. After all loading is finished, the global control module is started and operation begins: the neural network accelerator reads data through the second DMA to perform the operation, returns an interrupt to the CPU after the operation is finished, stores the resulting data into the memory through the second DMA, and the resulting data is then read out through the first DMA. The data comprises features and weights.
In this embodiment, the CPU prepares in advance the features, the weights and all the instructions required by the neural network accelerator, then loads them through the first DMA, with the features and weights placed in the DDR3 and the instructions placed in the instruction FIFO in the global control module Glb. After all loading is finished, Glb is started and operation begins; after the accelerator finishes its operation it returns an interrupt to the CPU, which then reads the operation result of the probability calculation module softmax. The CPU control flow is shown in fig. 2.
The data storage structure of this embodiment determines the performance of the neural network accelerator. In this embodiment the Feature data are stored in the DDR3 in an N-channel arrangement, as shown in fig. 3: the total number of feature channels is C, the channels are grouped N at a time, and the features of each group of N channels are stored in the DDR3 at continuous addresses. The sum of all N equals C. N is typically a power of 2, e.g. 2, 4, 8, 16 or 32. The benefits of the N-channel arrangement are two-fold. First, DDR3 read and write operations are burst transfers and must be byte aligned, typically to 8, 16 or 32 bytes; an individual feature is sometimes not byte aligned, but the features of N channels taken together can be made byte aligned. Second, every module can operate on N data in parallel, which is very beneficial for accelerating the algorithm.
The memory DDR3 stores the feature, whose size is width Wi, height Hi and channel number C, arranged in groups of N channels: the 1st group of N-channel features is stored at a first block of continuous addresses, the 2nd group of N-channel features at a second block of continuous addresses, and so on. The sum of all N equals C.
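As an illustrative sketch of the N-channel arrangement described above, assuming a NumPy (C, H, W) layout; the ordering of elements inside each N-channel group is an assumption, since the description only requires each group to occupy continuous addresses:

```python
import numpy as np

def pack_n_channel(feature: np.ndarray, n: int) -> np.ndarray:
    """Pack a (C, H, W) feature into N-channel groups stored contiguously.

    The first n*H*W elements hold channels 0..n-1, the next n*H*W elements
    hold channels n..2n-1, and so on, mirroring the arrangement of Fig. 3.
    """
    c, h, w = feature.shape
    assert c % n == 0, "this sketch assumes C is a multiple of N"
    # (C, H, W) -> (C/N, N, H, W): each leading index is one N-channel group
    groups = feature.reshape(c // n, n, h, w)
    # Assumed intra-group order: interleave the N channels per pixel so that
    # one burst read returns N parallel channel values for the same position.
    groups = groups.transpose(0, 2, 3, 1)          # (C/N, H, W, N)
    return np.ascontiguousarray(groups).ravel()    # contiguous addresses per group

# Example: C=8, H=4, W=4 packed with N=4 -> two contiguous 4-channel groups
feat = np.arange(8 * 4 * 4, dtype=np.int8).reshape(8, 4, 4)
flat = pack_n_channel(feat, n=4)
print(flat.shape)  # (128,)
```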
The global control module in this embodiment includes an instruction FIFO. After receiving the start command, the global control module Glb takes an instruction out of the instruction FIFO and distributes it to the convolution module, the single data processing module, the plane data processing module, the channel data processing module and the probability calculation module. Each instruction is 256 bytes and contains the module enable options and the module registers; registers are dispatched to a module only if that module is enabled.
Each instruction contains control information for the five modules, namely the convolution module, the single data processing module, the plane data processing module, the channel data processing module and the probability calculation module. Bytes 0-3 represent the module enable options, indicating whether each module needs to be used; bytes 4-67 represent control information of the convolution module conv, including feature height, width, channel, convolution kernel size, convolution stride and convolution padding information; bytes 68-131 represent control information of the channel data processing module Cdp, including feature height, width, channel and rearrangement mode information; bytes 132-163 represent control information of the single data processing module, including feature height, width, channel, operation mode and parameter size information; bytes 164-195 represent control information of the plane data processing module Pdp, including feature height, width, channel, size, pooling stride and pooling padding information; bytes 196-227 represent control information of the probability calculation module Softmax, including class length information.
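A minimal sketch of how the 256-byte instruction could be split at the byte offsets listed above; the internal layout of each control block is not specified here, so the blocks are returned as raw bytes, and the use of the trailing bytes 228-255 is an assumption:

```python
def split_instruction(instr: bytes) -> dict:
    """Split one 256-byte Glb instruction into the per-module fields at the
    byte offsets given in the description; each block is returned raw."""
    assert len(instr) == 256
    return {
        "module_enable": instr[0:4],     # bytes 0-3: module enable options
        "conv_ctrl":     instr[4:68],    # bytes 4-67: convolution module control
        "cdp_ctrl":      instr[68:132],  # bytes 68-131: channel data processing control
        "sdp_ctrl":      instr[132:164], # bytes 132-163: single data processing control
        "pdp_ctrl":      instr[164:196], # bytes 164-195: plane data processing control
        "softmax_ctrl":  instr[196:228], # bytes 196-227: probability calculation control
        "reserved":      instr[228:256], # remaining bytes (use not described)
    }

fields = split_instruction(bytes(256))
print({k: len(v) for k, v in fields.items()})
```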
One instruction processes one complete data stream, where a complete data stream means data read from the memory DDR3 and then written back to DDR3. The Glb control flow is shown in fig. 4: one run of the neural network accelerator comprises a plurality of instructions, and after each instruction finishes, the next instruction is distributed until all instructions have been processed.
The global control module Glb serves as the instruction distributor. A network often requires at least ten-odd instructions and frequently more than a hundred; if the CPU were used as the controller, one run of the neural network accelerator would interrupt the CPU and require it to respond and reconfigure the accelerator many times, which would hurt both the CPU performance and the neural network accelerator performance.
In this embodiment, the convolution module Conv is the core module of the neural network accelerator and determines its performance. The convolution operation involves a large number of multiply-add operations, and operation efficiency is usually the bottleneck of the convolution module, so improving operation efficiency is the main focus of its design. Operation efficiency is determined by two factors: the first is reducing the idle time of the operation units, and the second is increasing the number of operations per clock cycle. The convolution module Conv in this embodiment is deeply optimized on both points, so its operation efficiency is very high.
As shown in fig. 5, the convolution module includes a DMA controller, a data splitting module, a ping-pong RAM, a RAM reading module, a stripe array module, a block accumulation module, and a register module;
the DMA controller is used for controlling the DMA to read data from the memory; the data is separated from the characteristics and the weights through a data splitting module and then cached in a ping-pong RAM; the RAM reading module reads characteristics and weights from the ping-pong RAM to the stripe array module for operation processing, and an operation result is output through the block accumulation module; the register module receives instructions from the CPU through the global control module so as to control the operation of the convolution module.
As shown in fig. 5, the ping-pong RAM includes a feature RAM and a weight RAM; the feature RAM comprises two continuous output RAMs, denoted ramA and ramB, each of which comprises Mk sub-RAMs; the weight RAM comprises ramA' and ramB';
the ping-pong RAM adopts a continuous caching mode: a feature is cached in ramA, and while the RAM reading module reads the feature in ramA, the next feature is cached into ramB; after the RAM reading module finishes reading the feature in ramA, it reads the feature in ramB;
similarly, the weights are cached in ramA', and while the RAM reading module reads the weights in ramA', the next weights are cached into ramB'; after the RAM reading module finishes reading the weights in ramA', it reads the weights in ramB';
the stripe array module comprises Mk PE operation units, and the PE operation units comprise Tk/2 multipliers.
As shown in fig. 6, a schematic structural diagram of a single data processing module is shown, where the single data processing module performs data normalization processing, activation function processing, and proportional operation processing in sequence, where the activation functions include Sigmoid function, tanh function, and ReLU function, and the embodiment uses the ReLU function as the activation function.
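A minimal sketch of the three-stage single data processing described above, assuming a batch-normalization-style normalisation with hypothetical parameters mean, var, scale and bias; the patent only fixes the order of the stages and the use of ReLU in this embodiment:

```python
import numpy as np

def single_data_processing(x, mean, var, scale, bias, eps=1e-5):
    """Per-element normalisation, then the ReLU activation used in this
    embodiment, then a proportional (scaling) step."""
    x = (x - mean) / np.sqrt(var + eps)   # normalisation (assumed form)
    x = np.maximum(x, 0.0)                # ReLU activation
    return scale * x + bias               # proportional operation

print(single_data_processing(np.array([-1.0, 0.5, 2.0]), 0.0, 1.0, 2.0, 0.1))
```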
As shown in fig. 7, the planar data processing module performs a width direction pooling operation on the feature data, where the planar data processing module is provided with N width direction operation units for parallel operation of N channels;
and then carrying out height direction pooling operation on the characteristic data, and simultaneously arranging N height direction operation units to operate N channels in parallel.
The plane data processing module supports max pooling, min pooling and average pooling operations.
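A minimal sketch of the separable pooling described above, assuming a stride equal to the window size and no padding; the N channels are processed together, mirroring the N parallel operation units:

```python
import numpy as np

def plane_pool(feature: np.ndarray, k: int, mode: str = "max") -> np.ndarray:
    """Pool along the width first, then along the height; feature is (N, H, W)."""
    reduce = {"max": np.max, "min": np.min, "avg": np.mean}[mode]
    n, h, w = feature.shape
    # Width-direction pooling: (N, H, W) -> (N, H, W//k)
    wpool = reduce(feature[:, :, : (w // k) * k].reshape(n, h, w // k, k), axis=3)
    # Height-direction pooling: (N, H, W//k) -> (N, H//k, W//k)
    hpool = reduce(wpool[:, : (h // k) * k, :].reshape(n, h // k, k, w // k), axis=2)
    return hpool

x = np.arange(2 * 4 * 4, dtype=float).reshape(2, 4, 4)
print(plane_pool(x, k=2, mode="max"))
```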
As shown in fig. 8, the channel data processing module includes a BUF buffer, 2 selectors, and a rearrangement operator disposed between the 2 selectors; the rearrangement operator comprises a channel splicing operator, a surface rearrangement operator and a matrix permutation operator;
the input data is first buffered into the RAM in the BUF buffer, and a selector then selects which rearrangement operator the data passes through.
As shown in fig. 9, the channel splicing operator is configured to splice two matrices in the channel direction; the two matrices have the same height H and width W but not necessarily the same number of channels, and after channel splicing they become a new feature with height H, width W and C0+C1 channels, where C0 and C1 denote the channel numbers of the two matrices;
as shown in fig. 10, the surface rearrangement operator is configured to rearrange each surface of a feature into four surfaces, producing a new feature with height H/2, width W/2 and C×4 channels, where H denotes the height of the original feature, W its width and C its number of channels;
as shown in fig. 11, the matrix permutation operator is configured to change the dimensions of the matrix, permuting the dimension order of the feature to obtain a different dimension order.
As shown in fig. 12, the probability calculation module includes a top5_comp module, a probability operation unit and a reg register; the top5_comp module uses a downward searching method to find the 5 largest values in the input data stream. The working principle of the top5_comp module is shown in fig. 13: the module has a logic judgment function and is composed of a plurality of comparators, a plurality of AND gates and the MAX1, MAX2, MAX3, MAX4 and MAX5 modules. MAX1 is updated only when the input data is larger than MAX1; MAX2 is updated when the input data is larger than MAX2 and smaller than MAX1; MAX3 is updated when the input data is larger than MAX3 and smaller than MAX2; MAX4 is updated when the input data is larger than MAX4 and smaller than MAX3; and MAX5 is updated when the input data is larger than MAX5 and smaller than MAX4.
The probability operation unit performs probability operation on the obtained 5 maximum values, and transmits an obtained operation result to a second DMA through a reg register;
the specific formula of the probability operation is as follows:
y_i = e^(x_i) / Σ_j e^(x_j)
where x_i represents the input classification data.
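A minimal sketch of the top-5 search and the probability operation, mirroring the MAX1-MAX5 update rule above; whether the hardware normalises over the five retained values or over all classes is an assumption (the five retained values are used here):

```python
import math

def top5_softmax(stream):
    """Keep the five largest values seen so far, then apply the softmax
    formula above to those five values."""
    top5 = []                      # kept sorted, largest first (MAX1..MAX5)
    for x in stream:
        if len(top5) < 5 or x > top5[-1]:
            top5.append(x)
            top5.sort(reverse=True)
            top5 = top5[:5]
    exps = [math.exp(x) for x in top5]
    total = sum(exps)
    return top5, [e / total for e in exps]

values, probs = top5_softmax([0.3, 2.1, -1.0, 4.5, 0.9, 3.3, 1.2, 2.8])
print(values)      # [4.5, 3.3, 2.8, 2.1, 1.2]
print(sum(probs))  # ~1.0
```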
The convolution module, the single data processing module, the plane data processing module, the probability calculation module and the channel data processing module are all provided with bypass options;
as shown in fig. 14, when the operation only needs a convolution module and a single data processing module, a bypass plane data processing module, a channel data processing module and a probability calculation module;
as shown in fig. 15, when the operation only needs the convolution module, the single data processing module, the plane data processing module, the channel data processing module, and the probability calculation module are bypassed.
Example 2
The specific operation method of the convolution module in this embodiment includes the following steps:
s1: setting the size of a single convolution kernel of the weight as size×size, where size = 1, 2, 3 … n; the number of PE operation units of the stripe array is Mk; in this embodiment, Mk = 5;
s2: since the feature is too large to be loaded into the ping-pong RAM at once, the feature needs to be split; the splitting is shown in fig. 16, and this embodiment splits in two directions. The first is division along the height H direction into m parts, where m is a positive integer; for example, this embodiment divides it into 4 parts, so H0+H1+H2+H3 = H. The second is division along the channel direction, also into m parts; for example, this embodiment divides it into 4 parts, so C0+C1+C2+C3 = C, and the whole feature is divided into 4×4 = 16 sub-features;
s3: the DMA controller calculates the address of each part of sub-feature and the address of each weight, reads the sub-feature and the weight from the memory through the DMA, and separates the sub-feature and the weight through the data distribution module;
s4: dividing each part of sub-features into Mk parts, and caching the Mk parts of sub-features in a continuous output RAM, wherein 1/Mk part of sub-feature data is stored in each sub-RAM;
the specific steps of the sub-feature buffer storage in the feature RAM are as follows:
a1: the sub-feature read by the DMA from an address is divided into Mk parts and stored respectively into the Mk sub-RAMs in ramA, each sub-RAM storing 1/Mk of the sub-feature data;
a2: while the RAM reading module sequentially reads the data in the Mk sub-RAMs in ramA according to the address calculation formula to form the sub-feature, the DMA divides the sub-feature at the next address into Mk parts and stores them respectively into the Mk sub-RAMs in ramB, each sub-RAM storing 1/Mk of the sub-feature data;
a3: after the RAM reading module finishes reading the sub-feature in ramA, it sequentially reads the data in the Mk sub-RAMs in ramB according to the address calculation formula to form the sub-feature;
a4: the above steps are repeated until all sub-features have been read.
Taking the sub-features as an example, the DMA first reads sub-feature 0 and stores it into ramA; the stripe array then computes on sub-feature 0 while the DMA simultaneously stores sub-feature 1 into ramB; after the stripe array finishes sub-feature 0 it computes sub-feature 1, as shown in the feature RAM caching schematic of fig. 17. The whole operation is thus seamless, which effectively improves efficiency.
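A minimal sketch of this ping-pong scheme; dma_load and stripe_array_compute are placeholder functions, and the hardware overlap of loading and computing is only indicated by comments:

```python
def pingpong_stream(sub_features):
    """While the stripe array works on the sub-feature held in one buffer,
    the DMA fills the other, so the computation never waits for a load
    after the first one."""
    buffers = [None, None]                   # buffers[0] ~ ramA, buffers[1] ~ ramB
    buffers[0] = dma_load(sub_features[0])   # prefetch sub-feature 0
    for i in range(len(sub_features)):
        nxt = (i + 1) % 2
        if i + 1 < len(sub_features):
            buffers[nxt] = dma_load(sub_features[i + 1])   # fill the idle buffer
        stripe_array_compute(buffers[i % 2])               # compute from the full buffer

# Placeholder hooks for the sketch; in hardware these run concurrently.
def dma_load(x):
    return x

def stripe_array_compute(x):
    pass

pingpong_stream(["sub-feature 0", "sub-feature 1", "sub-feature 2"])
```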
According to the embodiment, when the RAM reading module reads each continuous output RAM sub-feature, data in the Mk sub-RAMs can be read simultaneously, and quick reading is achieved.
Meanwhile, the DMA caches the weight in the weight RAM;
the specific steps of the weight buffer in the weight RAM are as follows:
b1: the weight of the last address read by the DMA is stored in a ramA';
b2: when the PE operation unit reads the weight in the ramA 'through the RAM reading module, the DMA stores the weight of the next address into the ramB';
b3: after the PE operation unit reads the weight in ramA ', the weight in ramB' is read;
b4: repeating the steps until the weight is read.
Similarly, the embodiment realizes seamless connection of reading weights and effectively improves operation efficiency.
S5: the RAM reading module sequentially reads the data in the Mk sub-RAMs of the continuous output RAM according to the address calculation formula and assembles them into one sub-feature; meanwhile, the RAM reading module reads the weights in the weight RAM;
the address calculation formula in this embodiment is as follows:
address= (wr_addr/Mk)
Wr i-1 _en=((wr_addr%Mk)=i-1)
Writing in the i-1 st ram;
wherein i=1, 2, …, mk; wraddr represents the address written into the sub-RAM, wr i-1 En represents the write enable signal of the i-1 th sub-RAM;
the stripe array reads data, namely continuously outputting the Mk continuous data addresses and the data corresponding to the addresses by continuously outputting the RAM specifically comprises the following steps:
address=(rd_addr/Mk)+((rd_addr/Mk)>0)
rd i-1 _en=i-1
where rd_addr represents the address of the read sub-RAM, rd i-1 En represents the read enable signal of the i-1 th sub-RAM.
The input to the stripe array in this embodiment is 5 consecutive data, i.e. Mk = 5. Taking one continuous output RAM as an example, a sub-feature is cached into its 5 sub-RAMs and then read out, as shown in fig. 18. The continuous output RAM achieves one-input, continuous multi-output: it contains 5 sub-RAMs, denoted RAM0, RAM1, RAM2, RAM3 and RAM4; the sub-feature is divided into 5 parts according to the above address calculation formula and written into the 5 sub-RAMs respectively, each sub-RAM storing 1/5 of the sub-feature; when reading, the 5 sub-RAMs each output one sub-feature datum in the same cycle, which are combined in order to produce data at consecutive addresses.
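A minimal sketch of the write-side interleaving and the combined read of the 5 sub-RAMs; the per-sub-RAM address offset applied on the read side is an assumption made so that Mk consecutive elements come out together:

```python
def write_interleaved(sub_feature, mk=5):
    """Write side: element wr_addr goes into sub-RAM (wr_addr % Mk) at local
    address (wr_addr // Mk), following the write formulas above."""
    rams = [[] for _ in range(mk)]
    for wr_addr, value in enumerate(sub_feature):
        rams[wr_addr % mk].append(value)   # write enable for sub-RAM wr_addr % Mk
    return rams

def read_window(rams, rd_addr, mk=5):
    """Read side: the Mk sub-RAMs are read in the same cycle so that Mk
    consecutive elements starting at rd_addr come out together."""
    return [rams[(rd_addr + i) % mk][(rd_addr + i) // mk] for i in range(mk)]

rams = write_interleaved(list(range(20)), mk=5)
print(read_window(rams, rd_addr=7))   # [7, 8, 9, 10, 11]
```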
S6: the PE operation unit acquires the sub-features and weights input by the RAM reading module; the multiplier in the PE operation unit performs multiplication operation on the input sub-features and weights to obtain corresponding operation results;
the sub-feature described in this embodiment is 8 bits, and the weight is 8 bits, so the convolution operation is 8×8. The prior art Xilinx FPGA internal multiplier bit width is typically 18 x 25, and implementing two multiplication operations a x B and a x C traditionally requires 2 multipliers.
In the PE operation unit of this embodiment, the multiplier multiplies the input sub-feature and weights as shown in fig. 19. The multiplier algorithm of this embodiment is as follows: the weight C is shifted left by 16 bits, the weight B of the next address is added, and the result is multiplied by the sub-feature A; the low 16 bits of the product are the result of sub-feature A times weight B, and the high bits are the result of sub-feature A times weight C; finally the two multiplication results are separated and accumulated respectively. In this way one multiplier performs 2 multiplication operations, which greatly improves multiplier utilization. The weights C and B denote weight data input to the multiplier; the sub-feature A denotes sub-feature data input to the multiplier according to the address calculation formula.
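A minimal sketch of this packed multiplication for unsigned 8-bit operands; the signed case needs correction logic that is not shown:

```python
def packed_mul(a, b, c, bits=8):
    """One multiplication that yields both a*b and a*c: shift c left by
    16 bits, add b, multiply by a, then split the product."""
    packed = (c << 16) + b            # c occupies the high field, b the low field
    product = a * packed              # single wide multiplication (fits an 18x25 DSP)
    low = product & 0xFFFF            # low 16 bits  -> a * b
    high = product >> 16              # high bits    -> a * c
    return low, high

print(packed_mul(7, 13, 20))   # (91, 140), i.e. 7*13 and 7*20
```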
S7: the block accumulation module accumulates and outputs the operation result of the sub-feature in the height H direction.
As described above, the RAM reading module reads Mk sub-feature data and Tk weights in each clock cycle and feeds them to the Mk PE operation units for multiplication; each PE operation unit receives one sub-feature datum and Tk weights per clock cycle and computes Tk results. The Mk PE operation units operate in parallel, so one calculation period outputs Mk×Tk operation results. One calculation period is size×size clock cycles, so the time for calculating one input sub-feature is:
Time = W × H × C × (size × size) / (Mk × Tk), unit: clock cycles
where W is the input sub-feature width, H is the input sub-feature height, C is the number of input sub-feature channels, and size is the convolution kernel size of the weight.
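A worked example of this cycle estimate with illustrative numbers (not taken from the patent):

```python
def conv_cycles(w, h, c, size, mk=5, tk=16):
    """Clock-cycle estimate from the formula above:
    Time = W * H * C * size * size / (Mk * Tk)."""
    return w * h * c * size * size / (mk * tk)

# Hypothetical 56x56x64 input sub-feature with 3x3 kernels, Mk=5, Tk=16
print(conv_cycles(56, 56, 64, 3))   # 22579.2 clock cycles
```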
It follows that the calculation time is smaller as Mk and Tk are larger.
The stripe array module in this embodiment is shown in fig. 20 and includes a plurality of PE operation units, each of which operates on Tk input features and Tk weights. One PE operation unit includes Tk×Tk/2 multipliers, and since one multiplier can complete two multiplication operations, one PE operation unit can complete Tk×Tk multiplication operations in one clock cycle. The stripe array module comprises Mk PE operation units, so the whole stripe array module has Mk×Tk×Tk/2 multipliers in total and can complete Mk×Tk×Tk multiplication operations in one clock cycle.
In this embodiment, taking the Xilinx ZU4CG chip as an example, the chip has 728 multipliers in total; configuring Mk as 5 and Tk as 16, the stripe array consumes 5×16×16/2 = 640 multipliers, a utilization rate of 87.9%, which is very high.
The number Mk of PE operation units in the stripe array of this embodiment may be configured arbitrarily, because all PE operation units are used to calculate the same line of data; each PE operation unit accumulates its multiplier results, and the Mk operation units obtain Mk operation results after every size×size clock cycles, size being the convolution kernel size.
In this embodiment, each PE operation unit multiplies the input sub-feature in parallel with the weights w0 to w(Tk-1), obtaining Tk operation results in one calculation period.
This embodiment is provided with 5 PE operation units; the 5 PE units process the input sub-features and weights in parallel, and 5 operation results are obtained in each calculation period, so Tk×5 operation results are obtained in each calculation period.
As shown in fig. 21, a schematic diagram of the stripe array formed by the Mk PE operation units of this embodiment, take a 10×10 feature as an example, with the weight single convolution kernel size set to 3×3 and Mk = 5; clk is the clock signal and 1-9 denote 9 clock cycles. In the 1st clock cycle PE0 calculates 0×w0, PE1 calculates 1×w0, PE2 calculates 2×w0, PE3 calculates 3×w0 and PE4 calculates 4×w0; in the 2nd clock cycle PE0 calculates 1×w1, PE1 calculates 2×w1, PE2 calculates 3×w1, PE3 calculates 4×w1 and PE4 calculates 5×w1; in the 3rd clock cycle PE0 calculates 2×w2, PE1 calculates 3×w2, PE2 calculates 4×w2, PE3 calculates 5×w2 and PE4 calculates 6×w2; and so on. Each PE operation unit accumulates its results over the clock cycles, and after 9 clock cycles the 5 operation results F0-F4 are obtained.
The convolution kernel of this example is 3×3, and the stripe array obtains the 5 results F0-F4 after 9 clock cycles; when the weight single convolution kernel size is 5×5, the stripe array obtains the 5 results F0-F4 after 25 clock cycles; when the weight single convolution kernel size is 7×7, the stripe array obtains the 5 results F0-F4 after 49 clock cycles. The method can therefore support any convolution kernel size, and all multipliers are used regardless of the convolution kernel, with a utilization rate of 100%.
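A minimal cycle-level sketch of this behaviour, computing the five outputs F0-F4 of one output row of a 3×3 convolution over a 10×10 feature in 9 broadcast cycles (the kernel values here are stand-ins for w0-w8):

```python
import numpy as np

def stripe_array_conv(feature, kernel, mk=5):
    """Mk PEs compute Mk horizontally adjacent outputs of one output row.
    In each of the size*size clock cycles one weight is broadcast to all PEs
    and each PE multiplies it with its own shifted input sample, accumulating
    the result."""
    size = kernel.shape[0]
    acc = np.zeros(mk)                         # one accumulator per PE (F0..F4)
    cycle = 0
    for ky in range(size):
        for kx in range(size):
            w = kernel[ky, kx]                 # weight broadcast this clock cycle
            for pe in range(mk):               # Mk PEs work in parallel in hardware
                acc[pe] += feature[ky, pe + kx] * w
            cycle += 1
    return acc, cycle                          # Mk results after size*size cycles

feat = np.arange(100, dtype=float).reshape(10, 10)   # the 10x10 example feature
kern = np.ones((3, 3))                               # stand-in 3x3 kernel (w0..w8)
results, cycles = stripe_array_conv(feat, kern)
print(cycles)    # 9 clock cycles
print(results)   # five outputs F0..F4 of the first output row
```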
It is to be understood that the above examples of the present invention are provided by way of illustration only and not by way of limitation of the embodiments of the present invention. Any modification, equivalent replacement, improvement, etc. which come within the spirit and principles of the invention are desired to be protected by the following claims.

Claims (10)

1. The neural network accelerator is used for externally connecting a global control module, a first direct memory access module DMA and a memory; the method is characterized in that: the neural network accelerator comprises a second direct memory access module DMA, a convolution module, a single data processing module, a plane data processing module, a channel data processing module and a probability calculation module;
the convolution module performs multiply-add operation on input data;
the single data processing module is used for sequentially carrying out normalization, activation function and proportional operation on the data;
the plane data processing module is used for carrying out maximum pooling, minimum pooling and average pooling processing on the data;
the channel data processing module is used for carrying out channel splicing, surface rearrangement and matrix replacement processing on the data;
the probability calculation module is used for finding out the maximum 5 values in the data and completing the probability operation of the maximum 5 values;
the convolution module, the single data processing module, the plane data processing module, the channel data processing module and the probability calculation module are connected in a pipeline mode; the second direct memory access module DMA transmits data to the convolution module;
the convolution module and the channel data processing module share a DMA control bus.
2. The neural network accelerator of claim 1, wherein: the convolution module, the single data processing module, the plane data processing module, the probability calculation module and the channel data processing module are all provided with bypass options;
when an operation only needs the convolution module and the single data processing module, the plane data processing module, the channel data processing module and the probability calculation module are bypassed;
when an operation only needs the convolution module, the single data processing module, the plane data processing module, the channel data processing module and the probability calculation module are all bypassed.
3. The neural network accelerator of claim 1, wherein: the second direct memory access module DMA is in communication connection with the memory through an AXI communication protocol; the global control module is provided with an instruction FIFO; the DMA of the first direct memory access module is controlled by the CPU, data is loaded into the memory, and instructions are loaded into the FIFO of the global control module; and starting the global control module to start operation after all loading is finished, wherein the neural network accelerator reads data through the second direct memory access module DMA to perform operation, returns an interrupt to the CPU after the operation is finished, stores the data obtained by the operation into the memory through the second direct memory access module DMA, and reads the data obtained by the operation through the first direct memory access module DMA.
4. A neural network accelerator according to claim 2 or 3, wherein: the data comprises features and weights; the features are stored in the memory in an N-channel arrangement; a feature is a three-dimensional matrix with width Wi, height Hi and channel number C, arranged in groups of N channels, and the features of each group of N channels are stored in the memory at continuous addresses; the sum of all N equals C; N is a power of 2.
5. The neural network accelerator of claim 4, wherein: the global control module comprises an instruction FIFO; the global control module receives a starting command, takes out an instruction from the instruction FIFO and distributes the instruction to the convolution module, the single data processing module, the plane data processing module and the probability calculation module;
each instruction is 256 bytes, wherein bytes 0-3 represent the module enabling options, indicating whether each module needs to be used; bytes 4-67 represent control information of the convolution module, including feature height, width, channel, convolution kernel size, convolution stride and convolution padding information; bytes 68-131 represent control information of the channel data processing module, including feature height, width, channel and rearrangement mode information; bytes 132-163 represent control information of the single data processing module, including height, width, channel, operation mode and parameter size information; bytes 164-195 represent control information of the plane data processing module, including feature height, width, channel, size, pooling stride and pooling padding information; bytes 196-227 represent control information of the probability calculation module, including class length information.
6. The neural network accelerator of claim 4, wherein: the convolution module comprises a DMA controller, a data distribution module, a ping-pong RAM, a RAM reading module, a stripe array module, a block accumulation module and a register module;
the DMA controller is used for controlling the DMA to read data from the memory; the data is separated from the characteristics and the weights through a data splitting module and then cached in a ping-pong RAM; the RAM reading module reads characteristics and weights from the ping-pong RAM to the stripe array module for operation processing, and an operation result is output through the block accumulation module; the register module receives instructions from the CPU through the global control module so as to control the operation of the convolution module.
7. The neural network accelerator of claim 6, wherein: the ping-pong RAM comprises a feature RAM and a weight RAM; the feature RAM comprises two continuous output RAMs, denoted ramA and ramB, each of which comprises Mk sub-RAMs; the weight RAM comprises ramA' and ramB';
the ping-pong RAM adopts a continuous caching mode: a feature is cached in ramA, and while the RAM reading module reads the feature in ramA, the next feature is cached into ramB; after the RAM reading module finishes reading the feature in ramA, it reads the feature in ramB;
similarly, the weights are cached in ramA', and while the RAM reading module reads the weights in ramA', the next weights are cached into ramB'; after the RAM reading module finishes reading the weights in ramA', it reads the weights in ramB';
the stripe array module comprises Mk PE operation units, and the PE operation units comprise Tk/2 multipliers.
8. The neural network accelerator of claim 7, wherein: the plane data processing module firstly carries out width direction pooling operation on the characteristic data, and is provided with N width direction operation units for parallel operation of N channels;
and then carrying out height direction pooling operation on the characteristic data, and simultaneously arranging N height direction operation units to operate N channels in parallel.
9. The neural network accelerator of claim 8, wherein: the channel data processing module comprises a BUF buffer, 2 selectors and a rearrangement operator arranged between the 2 selectors; the rearrangement operator comprises a channel splicing operator, a surface rearrangement operator and a matrix replacement operator;
the channel splicing operator is used for splicing two matrix channel directions, wherein the two matrix channels have the same height H and width W, the channels are not necessarily the same, the channels become a new feature after being spliced, the height is H, the width is W, the number of the channels is C0+C1, and the C0 and the C1 represent the number of the channels of different matrix channels;
the surface rearrangement operator is used for rearranging each surface of one feature into four surfaces, and changing the surface rearrangement operator into a new feature, wherein the height is H/2, the width is W/2, and the number of channels is C4; wherein H represents the height of the original feature, W represents the width of the original feature, and C represents the number of channels of the original feature;
the matrix replacement operator is used for changing the dimension of the matrix and replacing the dimension sequence of the features to obtain different dimension sequences.
10. The neural network accelerator of claim 8, wherein: the probability calculation module comprises a top5_comp module, a probability operation unit and a reg register; the top5_comp module adopts a downward searching method to search 5 maximum values in the input data stream; the probability operation unit performs probability operation on the obtained 5 maximum values, and transmits an obtained operation result to a second direct memory access module DMA through a reg register;
the specific formula of the probability operation is as follows:
y_i = e^(x_i) / Σ_j e^(x_j)
where x_i represents the input classification data.
CN201910900439.2A 2019-09-23 2019-09-23 Neural network accelerator Active CN110738308B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910900439.2A CN110738308B (en) 2019-09-23 2019-09-23 Neural network accelerator

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910900439.2A CN110738308B (en) 2019-09-23 2019-09-23 Neural network accelerator

Publications (2)

Publication Number Publication Date
CN110738308A CN110738308A (en) 2020-01-31
CN110738308B true CN110738308B (en) 2023-05-26

Family

ID=69269457

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910900439.2A Active CN110738308B (en) 2019-09-23 2019-09-23 Neural network accelerator

Country Status (1)

Country Link
CN (1) CN110738308B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111340224B (en) * 2020-02-27 2023-11-21 浙江芯劢微电子股份有限公司 Accelerated design method of CNN (computer network) suitable for low-resource embedded chip
CN111401541A (en) * 2020-03-10 2020-07-10 湖南国科微电子股份有限公司 Data transmission control method and device
CN112052941B (en) * 2020-09-10 2024-02-20 南京大学 Efficient memory calculation system applied to CNN (computer numerical network) convolution layer and operation method thereof
CN111931918B (en) * 2020-09-24 2021-02-12 深圳佑驾创新科技有限公司 Neural network accelerator
CN111931927B (en) * 2020-10-19 2021-02-19 翱捷智能科技(上海)有限公司 Method and device for reducing occupation of computing resources in NPU
CN113688069B (en) * 2021-09-10 2022-08-02 北京百度网讯科技有限公司 Data processing method, device, electronic equipment and medium
CN115687181B (en) * 2022-11-07 2023-05-12 苏州亿铸智能科技有限公司 Addressing method for memory processing unit
CN117195989A (en) * 2023-11-06 2023-12-08 深圳市九天睿芯科技有限公司 Vector processor, neural network accelerator, chip and electronic equipment


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10621486B2 (en) * 2016-08-12 2020-04-14 Beijing Deephi Intelligent Technology Co., Ltd. Method for optimizing an artificial neural network (ANN)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002024157A (en) * 2000-07-07 2002-01-25 Matsushita Electric Ind Co Ltd Method and device for processing dma
EP3346423A1 (en) * 2017-01-04 2018-07-11 STMicroelectronics Srl Deep convolutional network heterogeneous architecture system and device
CN106779060A (en) * 2017-02-09 2017-05-31 武汉魅瞳科技有限公司 A kind of computational methods of the depth convolutional neural networks for being suitable to hardware design realization

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Efficient convolution architectures for convolutional neural network; Jichen Wang et al.; 2016 8th International Conference on Wireless Communications & Signal Processing (WCSP); 2016-10-13; 1-5 *
FPGA implementation and optimization of convolutional neural networks; Wang Kaiyu (王开宇) et al.; Laboratory Science (实验室科学); 2018-08-28 (No. 04); 79-84 *

Also Published As

Publication number Publication date
CN110738308A (en) 2020-01-31


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant