CN110738308A - Neural network accelerator - Google Patents

Neural network accelerator

Info

Publication number
CN110738308A
CN110738308A
Authority
CN
China
Prior art keywords
module
data processing
processing module
channel
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910900439.2A
Other languages
Chinese (zh)
Other versions
CN110738308B (en)
Inventor
陈小柏
赖青松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CN201910900439.2A priority Critical patent/CN110738308B/en
Publication of CN110738308A publication Critical patent/CN110738308A/en
Application granted granted Critical
Publication of CN110738308B publication Critical patent/CN110738308B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a neural network accelerator which is externally connected with a global control module, a first direct memory access module DMA and a memory, and which comprises a second direct memory access module DMA, a convolution module, a single data processing module, a plane data processing module, a channel data processing module and a probability calculation module connected sequentially in a pipeline. The convolution module performs multiply-add operations on the data; the single data processing module performs normalization, scaling and activation-function processing; the plane data processing module performs maximum pooling, minimum pooling and average pooling; the channel data processing module performs channel splicing, surface rearrangement and matrix permutation; the probability calculation module finds the five largest values in the data and completes the probability operation on those five values. The second DMA transmits data to the convolution module, and the convolution module and the channel data processing module share the second DMA control bus.

Description

Neural network accelerator
Technical Field
The invention relates to the technical field of integrated circuits, in particular to a neural network accelerator.
Background
The Convolutional Neural Network (CNN) is an important algorithm for deep learning, and has very extensive application in the field of computer vision, particularly image recognition.
From the computer's perspective an image is simply a two-dimensional matrix. A convolutional neural network extracts features from this two-dimensional array through operations such as convolution and pooling and then recognizes the image. In principle, any data that can be converted into a two-dimensional matrix can be recognized and detected with a convolutional neural network. For example, a sound file can be divided into very short segments and the pitch of each segment converted into numbers, so that the whole sound file becomes a two-dimensional matrix; similarly, text data in natural language, chemical data from medical experiments and the like can all be recognized and detected with convolutional neural networks.
Compared with the conventional algorithm, CNN requires a higher amount of computation and bandwidth. At present, the calculation is mainly completed by a Central Processing Unit (CPU) array and a Graphic Processing Unit (GPU) array. However, general processors such as CPU and GPU cannot fully utilize the characteristics of the convolutional neural network, which results in low operation efficiency and large power consumption and cost overhead. Furthermore, there is an increasing need to perform the computation of artificial intelligence algorithms on terminal devices with low cost, low power consumption, and high performance, which cannot be met by existing general purpose processors.
Disclosure of Invention
In order to solve the problem in the prior art that the characteristics of the convolutional neural network cannot be fully exploited, resulting in low operation efficiency, a neural network accelerator is provided which improves the operation efficiency of the neural network and saves operation time.
In order to achieve the purpose of the invention, the adopted technical solution is a neural network accelerator which is externally connected with a global control module, a first direct memory access module DMA and a memory, and which comprises a second direct memory access module DMA, a convolution module, a single data processing module, a plane data processing module, a channel data processing module and a probability calculation module;
the convolution module carries out multiply-add operation on input data;
the single data processing module is used for sequentially performing normalization, activation-function and scaling (proportional) operations on data;
the plane data processing module is used for performing maximum pooling, minimum pooling and average pooling on data;
the channel data processing module is used for carrying out channel splicing, surface rearrangement and matrix permutation processing on data;
the probability calculation module is used for finding the five largest values in the data and completing the probability calculation for these five values;
the convolution module, the single data processing module, the plane data processing module, the channel data processing module and the probability calculation module are connected in a pipeline mode; the second DMA transmits data to a convolution module;
the convolution module and the channel data processing module share the second DMA control bus.
Preferably, the convolution module, the single data processing module, the plane data processing module, the probability calculation module and the channel data processing module are all provided with bypass options;
when the operation only needs the convolution module and the single data processing module, the plane data processing module, the channel data processing module and the probability calculation module are bypassed;
when the operation only needs the convolution module, the single data processing module, the plane data processing module, the channel data processing module and the probability calculation module are bypassed.
Further, the second DMA is connected with the memory through the AXI communication protocol, and the global control module is provided with an instruction FIFO. The first direct memory access module DMA is controlled by the CPU to load data into the memory and to load instructions into the FIFO in the global control module. After all instructions are loaded, the global control module is started and begins operation; the neural network accelerator reads data through the second DMA to perform the operations, returns an interrupt to the CPU after the operations are finished, and stores the resulting data into the memory through the second DMA; the CPU then reads the resulting data through the first DMA.
Further, the data, including features and weights, are stored in the memory in an N-channel arrangement. A feature is a three-dimensional matrix with width Wi, height Hi and channel number C; it is arranged in groups of N channels, the features of every N channels are stored in the memory at consecutive addresses, the accumulated sum of all N equals C, and N is a power of 2.
Further, the global control module comprises an instruction FIFO; after receiving the start command, the global control module takes an instruction out of the instruction FIFO and distributes it to the convolution module, the single data processing module, the plane data processing module and the probability calculation module;
each instruction is 256 bytes. Bytes 0 to 3 represent module enable options, indicating whether each module needs to be used; bytes 4 to 67 represent convolution module control information, including feature height, width, channel, convolution kernel size, convolution stride and convolution padding information; bytes 68 to 131 represent channel data processing module control information, including feature height, width, channel and rearrangement mode information; bytes 132 to 163 represent single data processing module control information, including feature height, width, channel, operation mode and parameter size information; bytes 164 to 195 represent plane data processing module control information, including feature height, width, channel, pooling size, pooling stride and pooling padding information; bytes 196 to 227 represent probability calculation module control information, including classification length information.
Further, the convolution module comprises a DMA controller, a data distribution module, a ping-pong RAM, a RAM reading module, a stripe array module, a block accumulation module and a register module;
the DMA controller is used for controlling the second DMA to read data from the memory; the data are separated into features and weights by the data distribution module and then cached in the ping-pong RAM; the RAM reading module reads the features and weights from the ping-pong RAM and sends them to the stripe array module for processing, and the operation result is output through the block accumulation module; the register module receives instructions from the CPU through the global control module, thereby controlling the operation of the convolution module.
Further, the ping-pong RAM comprises a feature RAM and a weight RAM. The feature RAM comprises two continuous-output RAMs, denoted ramA and ramB, each containing Mk sub-RAMs; the weight RAM comprises ramA' and ramB';
the ping-pong RAM adopts a continuous caching mode: features are first cached into ramA; while the RAM reading module reads the features in ramA, the next features are simultaneously cached into ramB; and when the RAM reading module finishes reading the features in ramA, it reads the features in ramB;
similarly, weights are cached into ramA'; while the RAM reading module reads the weights in ramA', the next weights are cached into ramB'; and when the RAM reading module finishes reading the weights in ramA', it reads the weights in ramB';
the stripe array module comprises Mk PE arithmetic units, and each PE arithmetic unit comprises Tk/2 multipliers.
Further, the plane data processing module first performs a width-direction pooling operation on the feature data, and is provided with N width-direction operation units to operate on N channels in parallel;
it then performs a height-direction pooling operation on the feature data, likewise with N height-direction operation units operating on N channels in parallel.
Further, the channel data processing module comprises a BUF buffer, two selectors and rearrangement operators arranged between the two selectors, wherein the rearrangement operators comprise a channel splicing operator, a surface rearrangement operator and a matrix permutation operator;
the channel splicing operator is used for splicing two matrices along the channel direction; the two matrices have the same height H and width W while their channel numbers need not be equal, and after channel splicing they become a new feature with height H, width W and channel number C0 + C1, where C0 and C1 represent the channel numbers of the two matrices;
the surface rearrangement operator is used for rearranging each surface of a feature into four surfaces; after surface rearrangement the feature becomes a new feature with height H/2, width W/2 and channel number C × 4, where H represents the height of the original feature, W its width and C its channel number;
the matrix permutation operator is used for changing the dimensionality of the matrix, permuting the dimension order of the feature to obtain a different dimension order.
Further, the probability calculation module comprises a top5_comp module, a probability operation unit and a reg register, wherein the top5_comp module searches for the five maximum values in the input data stream using a downward search method;
the probability operation is specifically as follows:
pi = exp(xi) / Σj exp(xj)
where xi represents the input classification data.
The invention has the following beneficial effects:
1. The invention classifies neural network operations into distinct types, each handled by a dedicated module such as the single data processing module, the plane data processing module and the channel data processing module, so that the whole neural network accelerator is modularized and operation is more efficient. Compared with the traditional approach of using the central processing unit (CPU) as the master controller, using the global control module as the instruction distributor makes processing quicker.
2. The invention pipelines the operations of the convolution module, the single data processing module, the plane data processing module, the channel data processing module and the probability calculation module, thereby avoiding frequent memory caching operations and saving operation time.
3. The convolution module of the invention autonomously controls the second DMA, so memory reads are faster; it performs its operations with a stripe array and can support any convolution kernel size.
Drawings
Fig. 1 is a schematic structural diagram of a neural network accelerator according to embodiment 1.
FIG. 2 shows a control flow of the CPU described in embodiment 1.
Fig. 3 is a schematic diagram of an N-channel arrangement of the features of example 1.
FIG. 4 shows the Glb control scheme in example 1.
Fig. 5 is a schematic structural diagram of a convolution module in embodiment 1.
FIG. 6 is a schematic diagram of a single data processing module according to embodiment 1.
Fig. 7 is a schematic structural diagram of a planar data processing module according to embodiment 1.
Fig. 8 is a schematic structural diagram of a channel data processing module according to embodiment 1.
Fig. 9 is a schematic illustration of the channel splice of fig. 8.
FIG. 10 is a schematic representation of the face rearrangement of FIG. 8.
Fig. 11 is a schematic diagram of the matrix permutation in fig. 8.
Fig. 12 is a schematic configuration diagram of a probability calculation module in embodiment 1.
Fig. 13 is a schematic diagram of the operation principle of the top5_ comp module in fig. 12.
FIG. 14 is a schematic view of bypass Pdp and Softmax in embodiment 1.
FIG. 15 is a schematic diagram of bypassing Sdp, Pdp and Softmax in example 1.
Fig. 16 is a schematic diagram of feature segmentation described in embodiment 2.
Fig. 17 is a schematic diagram of a ping-pong RAM buffer in embodiment 2.
Fig. 18 is a schematic diagram of continuous-output RAM address input and output in embodiment 2.
Fig. 19 is a diagram illustrating the operation of the multiplier in embodiment 2.
Fig. 20 is a schematic structural view of the strip array module described in embodiment 2.
FIG. 21 is a schematic diagram of the operation of the stripe array module in embodiment 2.
Detailed Description
The invention is described in detail below with reference to the drawings and the detailed description.
Example 1
As shown in fig. 1, a neural network accelerator is externally connected with a global control module Glb, a first direct memory access module DMA and a memory DDR3, and comprises a second direct memory access module DMA, a convolution module conv, a single data processing module sdp, a plane data processing module pdp, a channel data processing module cdp and a probability calculation module softmax;
the convolution module conv performs multiply-add operation on the input data;
the single data processing module sdp is used for sequentially performing normalization, activation-function and scaling (proportional) operations on data;
the plane data processing module pdp is used for performing maximum pooling, minimum pooling and average pooling on data;
the channel data processing module cdp is configured to perform channel splicing, surface rearrangement, and matrix permutation on data;
the probability calculation module softmax is used for finding out the maximum 5 values in the data and completing probability calculation of the 5 maximum values;
the convolution module, the single data processing module, the plane data processing module, the channel data processing module and the probability calculation module are connected in a pipeline mode; the second DMA transmits data to a convolution module;
the convolution module and the channel data processing module share the second DMA control bus.
As shown in FIG. 1, the second DMA is communicatively connected with the memory through the AXI communication protocol, and the global control module is provided with an instruction FIFO. The first DMA is controlled by the CPU to load data into the memory and to load instructions into the FIFO in the global control module. After everything is loaded the global control module is started and operation begins: the neural network accelerator reads data through the second DMA and performs the operations, returns an interrupt to the CPU when the operations are finished, and stores the resulting data into the memory through the second DMA; the CPU then reads the resulting data through the first DMA.
In this embodiment the CPU prepares in advance the features, weights and all instructions that the neural network accelerator needs to execute, loads the features and weights into the DDR3 through the first DMA, and loads the instructions into the instruction FIFO of the global control module Glb. After everything is loaded the CPU starts Glb; when the accelerator finishes its operation it returns an interrupt to the CPU, which then reads the operation result of the probability calculation module softmax. The CPU control flow chart is shown in FIG. 2.
The data storage structure largely determines the performance of a neural network accelerator. In this embodiment the Feature data are stored in the DDR3 in an N-channel arrangement, as shown in fig. 3: the total number of feature channels is C, the features are arranged in groups of N channels, the features of every N channels are stored in the DDR3 at consecutive addresses, and the accumulated sum of all N equals C. N is usually a power of 2, such as 2, 4, 8, 16 or 32. The N-channel arrangement has two advantages. First, read-write operations on the DDR3 are burst transmissions that must be byte-aligned, usually to 8, 16 or 32 bytes; a single feature channel is sometimes not byte-aligned, but the sum of the features of N channels is. Second, each module can operate on N data in parallel, which greatly benefits acceleration of the algorithm.
The memory DDR3 stores features of width Wi, height Hi and channel number C, arranged in groups of N channels in order: the features of the 1st group of N channels are stored at a first block of consecutive addresses, the features of the 2nd group of N channels at a second block of consecutive addresses, and so on.
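As an illustration, the following Python sketch reorders a C-channel feature into this N-channel grouping. The per-pixel interleaving inside each group is an assumption, since the text only specifies that each group of N channels occupies consecutive addresses, and the helper name to_n_channel_layout is hypothetical.

```python
import numpy as np

def to_n_channel_layout(feature, n):
    """Reorder a (C, H, W) feature into groups of N channels stored contiguously.

    Returns a flat array whose address order matches the N-channel arrangement
    described above: all values of channel group 0, then group 1, and so on.
    """
    c, h, w = feature.shape
    assert c % n == 0, "channel count must be a multiple of N in this sketch"
    groups = feature.reshape(c // n, n, h, w)      # split channels into groups of N
    # inside each group, interleave the N channels point by point (assumed ordering)
    interleaved = groups.transpose(0, 2, 3, 1)     # (C//N, H, W, N)
    return interleaved.reshape(-1)                 # consecutive addresses per group

# Example: a C=8, H=4, W=4 feature stored with N=4
feat = np.arange(8 * 4 * 4, dtype=np.int16).reshape(8, 4, 4)
flat = to_n_channel_layout(feat, n=4)
print(flat.shape)  # (128,) -> two groups of 4 channels, each stored contiguously
```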
After receiving the start command, the global control module Glb in this embodiment fetches an instruction from the instruction FIFO and distributes it to the convolution module, the single data processing module, the plane data processing module, the channel data processing module and the probability calculation module. Each instruction is 256 bytes and comprises module enable options and module register fields; not every module needs its registers distributed, and the registers of a module are distributed only when that module is enabled.
Each instruction needs to contain control information of five modules, namely a convolution module, a single data processing module, a plane data processing module, a channel data processing module and a probability calculation module, wherein 0-3 bytes represent module enabling options and represent whether the module needs to be used or not; bytes 4-67 represent conv control information of a convolution module, including feature height, width, channel, convolution kernel size, convolution stepping stride and convolution filling padding information; bytes 68-131 represent control information of the Cdp of the channel data processing module, including feature height, width, channel and rearrangement mode information; the 132 th-163 th bytes represent single data processing module control information, including feature height, width, channel, operation mode and parameter size information; bytes 164-195 represent the control information of the plane data processing module Pdp, including the information of feature height, width, channel, size, pooling stepping stride and pooling filling padding; bytes 196-227 represent control information of the Softmax probability calculation module, and comprise classification length information.
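The byte layout above can be illustrated with a small decoder sketch. The byte ranges follow the description, while the field names and the little-endian enable bitmap are assumptions made only for illustration.

```python
import struct

# Byte ranges taken from the instruction layout described above.
FIELD_RANGES = {
    "module_enable": (0, 4),     # bytes 0-3: module enable options
    "conv":          (4, 68),    # bytes 4-67: convolution module control
    "cdp":           (68, 132),  # bytes 68-131: channel data processing control
    "sdp":           (132, 164), # bytes 132-163: single data processing control
    "pdp":           (164, 196), # bytes 164-195: plane data processing control
    "softmax":       (196, 228), # bytes 196-227: probability calculation control
}

def decode_instruction(instr: bytes) -> dict:
    """Split a 256-byte instruction into raw per-module byte slices."""
    assert len(instr) == 256, "each instruction is 256 bytes"
    fields = {name: instr[lo:hi] for name, (lo, hi) in FIELD_RANGES.items()}
    # The enable word is assumed here to be a little-endian 32-bit bitmap,
    # one bit per module; the real encoding is not specified in the text.
    fields["enable_bits"] = struct.unpack("<I", fields["module_enable"])[0]
    return fields

# Usage sketch: decode a dummy instruction with only the conv module enabled.
dummy = bytearray(256)
dummy[0] = 0b00001    # hypothetical: bit 0 enables the convolution module
decoded = decode_instruction(bytes(dummy))
print(hex(decoded["enable_bits"]), len(decoded["conv"]))
```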
One instruction processes one complete data stream: data are read from the DDR3 and, after the operation, written back to the DDR3, representing one complete data flow. The Glb control flow chart is shown in FIG. 4. A neural network computation comprises a plurality of instructions; after the operation of one instruction is completed the next instruction is distributed, until all instructions have been processed.
The advantage of using the global control module Glb as the instruction distributor is as follows. One neural network computation typically involves many instructions, on the order of tens in small cases and over a hundred in larger ones. If the CPU were used as the controller, a single run of the neural network accelerator would require many interrupt responses and many accelerator configurations, which would hurt the performance of both the CPU and the neural network accelerator.
The convolution module Conv in this embodiment is the core module of the neural network accelerator and largely determines its performance. The convolution operation involves a large number of multiply-add operations, and its efficiency is often the bottleneck, so improving the operation efficiency of the convolution module is a key design concern. The operation efficiency is determined by two factors: first, reducing the idle time of the operation, and second, increasing the number of operations completed per clock cycle.
As shown in fig. 5, the convolution module includes a DMA controller, a data splitting module, a ping-pong RAM, a RAM reading module, a stripe array module, a block accumulation module, and a register module;
the DMA controller is used for controlling the DMA to read data from the memory; the data is subjected to characteristic and weight separation through a data distribution module and then cached in a ping-pong RAM; the RAM reading module reads the characteristics and the weight from the ping-pong RAM and sends the characteristics and the weight to the strip array module for operation processing, and an operation result is output through the block accumulation module; the register module receives an instruction from a CPU through a global control module, so that the operation of the convolution module is controlled.
As shown in fig. 5, the ping-pong RAM includes a feature RAM, a weight RAM; the characteristic RAM comprises two continuous output RAMs which are respectively marked as ramA and ramB, wherein the continuous output RAMs comprise Mk sub RAMs; the weight RAM comprises ramA 'and ramB';
the ping-pong RAM adopts a continuous cache mode, namely, the features are cached into ramA, when the RAM reading module reads the features in the ramA, simultaneously, the ramB is cached into the next features, and when the RAM reading module finishes reading the features in the ramA, the features in the ramB are read;
similarly, the weight is cached into the ramA ', when the RAM reading module reads the weight in the ramA ', the next weights are cached into the ramB ', and when the RAM reading module finishes reading the weight in the ramA ', the weight in the ramB ' is read;
the stripe array module comprises Mk PE arithmetic units, and the PE arithmetic units comprise Tk/2 multipliers.
Fig. 6 is a schematic structural diagram of the single data processing module. The single data processing module sequentially performs normalization, activation-function processing and scaling (proportional) operations on the data. Supported activation functions include the Sigmoid function, the tanh function and the ReLU function; the ReLU function is used as the activation function in this embodiment.
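As an illustration, a minimal sketch of this per-element pipeline is given below, assuming simple per-tensor normalization and scaling parameters; in the actual accelerator these parameters come from the instruction fields.

```python
import numpy as np

def sdp_pipeline(x, mean, inv_std, scale, bias):
    """Single data processing sketch: normalization -> activation (ReLU) -> scaling.

    x is a flat array of feature values; mean, inv_std, scale and bias are
    assumed per-tensor parameters used only for illustration.
    """
    y = (x - mean) * inv_std          # normalization
    y = np.maximum(y, 0.0)            # ReLU activation (used in this embodiment)
    return y * scale + bias           # proportional (scaling) operation

out = sdp_pipeline(np.array([-1.0, 0.5, 2.0]), mean=0.0, inv_std=1.0, scale=0.5, bias=0.0)
print(out)  # [0.   0.25 1.  ]
```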
As shown in fig. 7, the plane data processing module first performs a width-direction pooling operation on the feature data, using N width-direction operation units to process N channels in parallel;
it then performs a height-direction pooling operation on the feature data, likewise using N height-direction operation units to process N channels in parallel.
The plane data processing module processes max pooling, min pooling and average pooling.
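The width-then-height pooling scheme can be sketched as follows; the non-overlapping windows and the absence of padding are simplifying assumptions.

```python
import numpy as np

def pool_1d(x, size, op):
    """Pool a 1-D array along its last axis with non-overlapping windows."""
    n = x.shape[-1] // size
    windows = x[..., :n * size].reshape(*x.shape[:-1], n, size)
    return op(windows, axis=-1)

def plane_pool(feature, size, op=np.max):
    """Width-direction pooling followed by height-direction pooling.

    feature has shape (N, H, W); all N channels are processed at once by the
    vectorized operations, mirroring the N parallel units described above.
    """
    pooled_w = pool_1d(feature, size, op)                    # (N, H, W//size)
    pooled_h = pool_1d(pooled_w.swapaxes(-1, -2), size, op)  # pool along H
    return pooled_h.swapaxes(-1, -2)                         # (N, H//size, W//size)

f = np.arange(2 * 4 * 4).reshape(2, 4, 4)
print(plane_pool(f, size=2, op=np.max).shape)  # (2, 2, 2); use np.mean for average pooling
```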
As shown in fig. 8, the channel data processing module includes a BUF buffer, 2 selectors, and a reordering operator disposed between the 2 selectors; the rearrangement operator comprises a channel splicing operator, a surface rearrangement operator and a matrix permutation operator;
the data input is first buffered to RAM in the BUF buffer and then selectors are used to select which sort operator the data undergoes.
As shown in fig. 9, the channel splicing operator is used for splicing two matrices along the channel direction; the two matrices have the same height H and width W while their channel numbers need not be equal, and after channel splicing they become a new feature with height H, width W and channel number C0 + C1, where C0 and C1 represent the channel numbers of the two matrices;
as shown in fig. 10, the surface rearrangement operator is configured to rearrange each surface of a feature into four surfaces; after surface rearrangement the feature becomes a new feature with height H/2, width W/2 and channel number C × 4, where H denotes the height of the original feature, W its width and C its channel number;
as shown in fig. 11, the matrix permutation operator is configured to change the dimension of the matrix, and permute the feature dimension order to obtain different dimension orders.
As shown in fig. 12, the probability calculation module includes a top5_comp module, a probability operation unit and a reg register. The top5_comp module searches for the five maximum values in the input data stream using a downward search method; its working principle is shown in fig. 13. It comprises several comparators, several AND modules, and MAX1, MAX2, MAX3, MAX4 and MAX5 modules with logic decision functions. MAX1 is updated whenever the input data is greater than MAX1; MAX2 is updated when the input data is greater than MAX2 and less than MAX1; MAX3 is updated when the input data is greater than MAX3 and less than MAX2; MAX4 is updated when the input data is greater than MAX4 and less than MAX3; and MAX5 is updated when the input data is greater than MAX5 and less than MAX4.
The probability operation unit performs probability operation on the obtained 5 maximum values and transmits an obtained operation result to the second DMA through a reg register;
the probability operation is specifically as follows:
pi = exp(xi) / Σj exp(xj)
where xi represents the input classification data.
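A behavioural sketch of the top5_comp cascade and the probability operation is given below; the shift of the lower MAX registers when a new value is inserted is an assumed detail not spelled out in the text, and the softmax form follows the formula above.

```python
import math

def top5_comp(stream):
    """Downward-search sketch: keep MAX1 >= MAX2 >= ... >= MAX5 while streaming.

    The comparison conditions follow the description above; shifting the lower
    registers down on insertion is an assumption about the hardware behaviour.
    """
    maxes = [float("-inf")] * 5                # MAX1..MAX5 registers
    for x in stream:
        for i in range(5):
            # update register i only if x beats it and stays below the register above
            if x > maxes[i] and (i == 0 or x < maxes[i - 1]):
                maxes[i + 1:] = maxes[i:-1]    # shift lower registers down (assumption)
                maxes[i] = x
                break
    return maxes

def softmax(values):
    """Probability operation: p_i = exp(x_i) / sum_j exp(x_j)."""
    exps = [math.exp(v) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

data = [0.1, 2.3, -1.0, 4.2, 3.3, 0.7, 5.1, 2.2]
top5 = top5_comp(data)
print(top5)            # [5.1, 4.2, 3.3, 2.3, 2.2]
print(softmax(top5))   # probabilities of the five maxima, summing to 1
```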
In this embodiment, the convolution module, the single data processing module, the plane data processing module, the probability calculation module and the channel data processing module are all provided with bypass options;
as shown in fig. 14, when only the convolution module and the single data processing module are needed for operation, the plane data processing module, the channel data processing module, and the probability calculation module are bypassed;
as shown in fig. 15, when only the convolution module is needed for operation, the single data processing module, the plane data processing module, the channel data processing module, and the probability calculation module are bypassed.
Example 2
The specific operation method of the convolution module in this embodiment includes the following steps:
S1: set the size of a single convolution kernel of the weights to size × size, where size = 1, 2, 3, …, n; the number of PE arithmetic units in the stripe array is Mk, and in this embodiment Mk = 5;
S2: because the feature is too large to be loaded into the ping-pong RAM at one time, it must be segmented; a segmentation schematic diagram is shown in fig. 16. In this embodiment segmentation is needed in two directions. The first is segmentation in the height (H) direction into m parts, where m is a positive integer; in this embodiment the feature is segmented into 4 parts, so H0 + H1 + H2 + H3 = H. The second is segmentation in the channel direction, likewise into several parts; in this embodiment the feature is segmented into 4 parts, so C0 + C1 + C2 + C3 = C. The whole feature is thus segmented into 4 × 4 = 16 sub-features;
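A tiling sketch of this segmentation is given below; splitting into equal parts is an assumption, since the text only requires that the parts sum to H and C respectively.

```python
import numpy as np

def segment_feature(feature, h_parts, c_parts):
    """Split a (C, H, W) feature into h_parts x c_parts sub-features.

    Splitting is done along the height and channel directions only; the
    width direction is kept whole, as in fig. 16.
    """
    c_chunks = np.array_split(feature, c_parts, axis=0)          # channel-direction split
    tiles = []
    for c_chunk in c_chunks:
        tiles.extend(np.array_split(c_chunk, h_parts, axis=1))   # height-direction split
    return tiles

feat = np.zeros((64, 32, 32))                # C=64, H=32, W=32
subs = segment_feature(feat, h_parts=4, c_parts=4)
print(len(subs), subs[0].shape)              # 16 sub-features, each (16, 8, 32)
```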
S3: the DMA controller calculates the address of each sub-feature and the address of the weights, reads the sub-features and weights from the memory through the DMA, and separates them through the data distribution module;
S4: each sub-feature is divided into Mk parts and cached in the continuous-output RAM, where each sub-RAM stores 1/Mk of the sub-feature data;
the specific steps of caching the sub-features in the feature RAM are as follows:
a1: the sub-features of the last address read by the DMA are divided into Mk parts and respectively stored in Mk sub-RAMs in the ramA, wherein each sub-RAM stores data of 1/Mk part of the sub-features;
a2: while the RAM reading module sequentially reads the data in the Mk sub-RAMs of ramA according to the address calculation formula to reassemble the sub-feature, the DMA divides the sub-feature of the next address into Mk parts and stores them respectively into the Mk sub-RAMs of ramB, each sub-RAM storing 1/Mk of the sub-feature data;
a3: after the RAM reading module reads the sub-features in the ramA, the RAM reading module sequentially reads data in Mk sub-RAMs in the ramB according to an address calculation formula to form the sub-features;
a4: and repeating the steps until the sub-features are read.
In this embodiment, taking sub-feature 0 as an example: the DMA reads sub-feature 0 and stores it in ramA, then the stripe array computes on sub-feature 0 while the DMA simultaneously stores sub-feature 1 into ramB; after the stripe array finishes sub-feature 0 it computes on sub-feature 1. As shown in the feature RAM cache schematic of fig. 17, the whole operation is seamless and the efficiency is effectively improved.
In this embodiment, when the RAM reading module reads a sub-feature from a continuous-output RAM, the data in the Mk sub-RAMs can be read simultaneously, achieving fast reading.
Meanwhile, DMA buffers the weight in a weight RAM;
the specific steps of the weight caching in the weight RAM are as follows:
b1: the weight of the last address read by the DMA is stored into ramA';
b2: when the PE arithmetic unit reads the weight in the ramA 'through the RAM reading module, the DMA stores the weight of the next address into the ramB';
b3: after the PE arithmetic unit reads the weight in the ramA ', the weight in the ramB' is read;
b4: and repeating the steps until the weight is read.
Similarly, this embodiment achieves seamless reading of the weights and effectively improves operation efficiency, as sketched below.
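The ping-pong (double-buffer) behaviour for both features and weights can be sketched as follows; the producer and consumer functions stand in for the DMA and the stripe array and are purely illustrative.

```python
def ping_pong(produce, consume, items):
    """Overlap producing item i+1 with consuming item i using two buffers.

    produce(item) fills a buffer; consume(buffer) reads it. In hardware the two
    happen in parallel; here the ordering just shows that buffer A and buffer B
    alternate so the consumer never waits on an empty buffer.
    """
    buffers = [None, None]            # ramA / ramB (or ramA' / ramB' for weights)
    buffers[0] = produce(items[0])    # prefill ramA
    for i, item in enumerate(items):
        nxt = (i + 1) % 2
        if i + 1 < len(items):
            buffers[nxt] = produce(items[i + 1])   # DMA fills the other buffer ...
        consume(buffers[i % 2])                    # ... while the array consumes this one

ping_pong(produce=lambda s: f"buffered({s})",
          consume=lambda b: print("computing on", b),
          items=["sub-feature 0", "sub-feature 1", "sub-feature 2"])
```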
S5: the RAM reading module reads the data in the Mk sub-RAMs of the continuous-output RAM according to the address calculation formula, sequentially reassembling the sub-features, and simultaneously reads the weights from the weight RAM;
the address calculation formula in this embodiment is as follows:
get address ═ wr _ address/Mk)
Wri-1_en=((wr_addr%Mk)=i-1)
Writing into the (i-1) th ram;
wherein i is 1, 2, …, Mk; wr _ addr represents an address written into the sub-RAM, Wri-1_ en denotes a write enable signal of the i-1 st sub-RAM;
the strip array reads data, namely continuously outputs Mk continuous data addresses of the RAM and data corresponding to the addresses, specifically as follows:
address=(rd_addr/Mk)+((rd_addr/Mk)>0)
rdi-1_en=i-1
wherein rd _ addr represents the address of the read sub-RAM, rdi-1And en denotes a read enable signal of the i-1 st sub-RAM.
In this embodiment the input of the stripe array is 5 consecutive data values, i.e. Mk = 5, which requires that the ping-pong RAM can output data at any 5 consecutive addresses within one clock cycle; the sub-features therefore need to be read from the feature RAM quickly. Taking one continuous-output RAM as an example, the process of caching a sub-feature into 5 sub-RAMs and reading it back is shown in fig. 18. The continuous-output RAM achieves continuous input and output as follows: it comprises 5 sub-RAMs, denoted RAM0, RAM1, RAM2, RAM3 and RAM4; the sub-feature is divided into 5 parts and written into the 5 sub-RAMs according to the above address calculation formula, each sub-RAM storing 1/5 of the sub-feature; when reading, the 5 sub-RAMs output sub-feature data simultaneously and the outputs are combined in order, achieving the effect of outputting data at consecutive addresses.
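The address mapping of the continuous-output RAM can be sketched as follows; the read-side formula follows the reconstruction above and should be read as an assumed interpretation of the intended scheme.

```python
Mk = 5

def write_slot(wr_addr):
    """Map a linear write address to (sub_ram_index, row) inside the continuous-output RAM."""
    return wr_addr % Mk, wr_addr // Mk

def read_burst(rd_addr):
    """Rows to read from the Mk sub-RAMs to get Mk consecutive values starting at rd_addr."""
    base, offset = rd_addr // Mk, rd_addr % Mk
    rows = [base + (offset > j) for j in range(Mk)]   # row to read in sub-RAM j
    # Reassemble in address order: value rd_addr+k sits in sub-RAM (rd_addr+k) % Mk.
    order = [(rd_addr + k) % Mk for k in range(Mk)]
    return [(j, rows[j]) for j in order]

# Usage: the first 7 addresses land in sub-RAMs 0..4 round-robin ...
print([write_slot(a) for a in range(7)])   # [(0,0),(1,0),(2,0),(3,0),(4,0),(0,1),(1,1)]
# ... and a burst starting at address 3 reads one row from every sub-RAM at once.
print(read_burst(3))                       # [(3,0),(4,0),(0,1),(1,1),(2,1)]
```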
S6: the PE arithmetic unit acquires the sub-characteristics and the weight input by the RAM reading module; a multiplier in the PE arithmetic unit carries out multiplication operation on the input sub-characteristics and the weight to obtain a corresponding operation result;
the sub-feature described in this embodiment is 8 bits, and the weight is 8 bits, so the convolution operation is 8 × 8. The multiplier bit width inside the state of the art Xilinx FPGA is typically 18 × 25, and conventionally 2 multipliers are required to implement two multiplication operations a × B and a × C.
The multiplier in the PE operation unit multiplies the input sub-feature by the weights, as shown in FIG. 19. The multiplier operation in this embodiment is as follows: the weight C is shifted left by 16 bits and added to the weight B at the adjacent address; the packed value is then multiplied by the sub-feature A; the lower 16 bits of the result are the product of sub-feature A and weight B, and the upper bits are the product of sub-feature A and weight C; finally the two products are separated and accumulated respectively.
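A numeric sketch of this packing trick, assuming unsigned 8-bit operands (signed operands would require a correction that is not covered here):

```python
def packed_multiply(a, b, c):
    """Compute a*b and a*c with a single wide multiplication.

    a, b and c are assumed to be unsigned 8-bit values, so a*b fits in 16 bits
    and does not spill into the upper product.
    """
    packed = (c << 16) + b          # weight C shifted left by 16 bits, plus weight B
    product = a * packed            # one wide multiplication (fits an 18x25 DSP)
    ab = product & 0xFFFF           # lower 16 bits: a*b
    ac = product >> 16              # upper bits: a*c
    return ab, ac

a, b, c = 200, 123, 45
assert packed_multiply(a, b, c) == (a * b, a * c)
print(packed_multiply(a, b, c))     # (24600, 9000)
```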
S7: the block accumulation module accumulates the operation results of the sub-features in the height (H) direction and outputs them.
The RAM reading module reads Mk sub-feature values and Tk weights in each clock cycle and feeds them to the Mk PE operation units for multiplication. In each clock cycle each PE operation unit computes one sub-feature value against Tk weights, producing Tk products; the Mk PE operation units work in parallel, so Mk × Tk accumulated results are output per calculation cycle, where one calculation cycle is size × size clock cycles. The time for computing one input sub-feature is therefore:
Time = W × H × C × (size × size) / (Mk × Tk), in units of clock cycles
where W is the input sub-feature width, H is the input sub-feature height, C is the number of input sub-feature channels, and size is the convolution kernel size.
It follows that the calculation time is smaller as Mk and Tk are larger.
As shown in fig. 20, the stripe array module of this embodiment comprises a plurality of PE operation units, each of which operates on the input feature and Tk weights. Since each PE operation unit contains Tk/2 multipliers and each multiplier completes two multiplications, each PE unit completes Tk multiplications per clock cycle. With Mk PE operation units, the whole stripe array module contains Mk × Tk/2 multipliers in total and completes Mk × Tk multiplications per clock cycle.
In this embodiment, taking the Xilinx ZU4CG chip as an example, the chip has 728 multipliers in total. With Mk configured to 5 and Tk to 16, the stripe array consumes 5 × 16 × 16/2 = 640 multipliers, and the utilization rate reaches 87.9%, which is very high.
The number Mk of PE operation units in the stripe array of this embodiment can be configured arbitrarily, because all PE operation units calculate the same rows of data. Each PE operation unit accumulates the results of its multipliers, so the Mk operation units produce Mk operation results every size × size clock cycles (one convolution kernel).
In this embodiment, each PE operation unit multiplies the input sub-feature by weight 0 through weight (Tk-1) in parallel, so that Tk operation results are obtained per calculation cycle.
This embodiment is provided with 5 PE operation units; the 5 PE units compute the input sub-feature against the weights in parallel, obtaining 5 operation results in each calculation period, so Tk × 5 operation results are obtained in each calculation period.
As shown in fig. 21, a schematic diagram of the operation of the stripe array composed of Mk PE operation units in the present embodiment is shown, in the present embodiment, taking a feature of 10 × 10 as an example, let a weight single convolution kernel size be 3 × 3, Mk ═ 5, clk be a clock signal, 1 to 9 represent 9 clock cycles, PE0 calculates 0 × w0 in the 1 st clock cycle, PE1 calculates 1 × w0, PE2 calculates 2 × w0, PE3 calculates 3 × w0, and PE4 calculates 4 × w 0; the 2 nd clock cycle PE0 calculates 1 × w1, PE1 calculates 2 × w1, PE2 calculates 3 × w1, PE3 calculates 4 × w1, PE4 calculates 5 × w 1; the 3 rd clock cycle PE0 calculates 2 × w2, PE1 calculates 3 × w2, PE2 calculates 4 × w2, PE3 calculates 5 × w2, PE4 calculates 6 × w 2; and by analogy, each PE operation unit accumulates operation results of each clock cycle, and finally, 5 operation results F0-F4 are obtained after 9 clock cycles.
The convolution kernel in the above example is 3 × 3, and the stripe array yields the 5 results F0-F4 after 9 clock cycles; with a 5 × 5 kernel the stripe array obtains the 5 results after 25 clock cycles, and with a 7 × 7 kernel after 49 clock cycles. The method therefore supports any convolution kernel size, and all multipliers are used no matter which kernel size is used, with 100% utilization.
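A small simulation reproducing the Mk = 5, 3 × 3 schedule: each PE accumulates one product per clock cycle, offset by one input column from its neighbour, and after size × size cycles the five outputs F0-F4 are complete. Treating the example's operand indices as per-row column offsets is an interpretation of the "and so on" in the description above.

```python
import numpy as np

def stripe_array_row(feature_rows, kernel, Mk=5):
    """Simulate Mk PEs computing Mk adjacent convolution outputs of one output row.

    feature_rows: (size, W) slice of the input; kernel: (size, size) weights.
    Each clock cycle every PE accumulates one product; after size*size cycles
    the Mk partial sums F0..F(Mk-1) are the convolution outputs at columns 0..Mk-1.
    """
    size = kernel.shape[0]
    acc = np.zeros(Mk)
    cycle = 0
    for r in range(size):              # kernel row
        for k in range(size):          # kernel column -> one clock cycle
            cycle += 1
            for j in range(Mk):        # PE j works one column to the right of PE j-1
                acc[j] += feature_rows[r, j + k] * kernel[r, k]
    return cycle, acc

feat = np.arange(100, dtype=float).reshape(10, 10)   # the 10x10 feature of the example
w = np.ones((3, 3))
cycles, F = stripe_array_row(feat[:3], w)
print(cycles, F)   # 9 clock cycles; F0..F4 are the 3x3 convolution sums at columns 0..4
```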
It should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention, and are not intended to limit the embodiments of the present invention. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.

Claims (10)

  1. A neural network accelerator, externally connected with a global control module, a first direct memory access module DMA and a memory, and comprising a second direct memory access module DMA, a convolution module, a single data processing module, a plane data processing module, a channel data processing module and a probability calculation module;
    the convolution module carries out multiply-add operation on input data;
    the single data processing module is used for sequentially performing normalization, activation-function and scaling (proportional) operations on data;
    the plane data processing module is used for performing maximum pooling, minimum pooling and average pooling on data;
    the channel data processing module is used for carrying out channel splicing, surface rearrangement and matrix permutation processing on data;
    the probability calculation module is used for finding out the maximum 5 values in the data and completing the probability calculation of the 5 maximum values;
    the convolution module, the single data processing module, the plane data processing module, the channel data processing module and the probability calculation module are connected in a pipeline mode; the second DMA transmits data to a convolution module;
    the convolution module and the channel data processing module share the second DMA control bus.
  2. 2. The neural network accelerator of claim 1, wherein: the convolution module, the single data processing module, the plane data processing module, the probability calculation module and the channel data processing module are all provided with bypass options;
    when the operation only needs the convolution module and the single data processing module, the plane data processing module, the channel data processing module and the probability calculation module are bypassed;
    when the operation only needs the convolution module, the single data processing module, the plane data processing module, the channel data processing module and the probability calculation module are bypassed.
  3. 3. The neural network accelerator of claim 1, wherein the second direct memory access module DMA is communicatively connected with the memory through the AXI communication protocol, and the global control module is provided with an instruction FIFO; the first direct memory access module DMA is controlled by a CPU to load data into the memory and to load instructions into the FIFO in the global control module; after all instructions are loaded the global control module is started and begins operation; the neural network accelerator reads data through the second DMA to perform the operations, returns an interrupt to the CPU after the operations are completed, and stores the resulting data into the memory through the second DMA; and the CPU reads the resulting data through the first DMA.
  4. 4. The neural network accelerator according to , wherein the data includes features and weights, the features are stored in a memory in an N-channel arrangement, the features are three-dimensional matrices, the width of each three-dimensional matrix is Wi, the height of each three-dimensional matrix is Hi, the number of channels is C, the arrangement order is N channels, the features of each N channel are stored in the memory in consecutive addresses, the cumulative sum of all N is equal to C, and N is a power of 2.
  5. 5. The neural network accelerator of claim 4, wherein: the global control module comprises an instruction FIFO; the global control module takes out an instruction from the instruction FIFO and distributes the instruction to the convolution module, the single data processing module, the plane data processing module and the probability calculation module after receiving the starting command;
    each instruction is 256 bytes. Bytes 0 to 3 represent module enable options, indicating whether each module needs to be used; bytes 4 to 67 represent convolution module control information, including feature height, width, channel, convolution kernel size, convolution stride and convolution padding information; bytes 68 to 131 represent channel data processing module control information, including feature height, width, channel and rearrangement mode information; bytes 132 to 163 represent single data processing module control information, including feature height, width, channel, operation mode and parameter size information; bytes 164 to 195 represent plane data processing module control information, including feature height, width, channel, pooling size, pooling stride and pooling padding information; bytes 196 to 227 represent probability calculation module control information, including classification length information.
  6. 6. The neural network accelerator of claim 4, wherein: the convolution module comprises a DMA controller, a data distribution module, a ping-pong RAM, an RAM reading module, a strip array module, a block accumulation module and a register module;
    the DMA controller is used for controlling the DMA to read data from the memory; the data is subjected to characteristic and weight separation through a data distribution module and then cached in a ping-pong RAM; the RAM reading module reads the characteristics and the weight from the ping-pong RAM and sends the characteristics and the weight to the strip array module for operation processing, and an operation result is output through the block accumulation module; the register module receives an instruction from a CPU through a global control module, so that the operation of the convolution module is controlled.
  7. 7. The neural network accelerator of claim 6, wherein: the ping-pong RAM comprises a feature RAM and a weight RAM; the characteristic RAM comprises two continuous output RAMs which are respectively marked as ramA and ramB, wherein the continuous output RAMs comprise Mk sub RAMs; the weight RAM comprises ramA 'and ramB';
    the ping-pong RAM adopts a continuous cache mode, namely, the features are cached into ramA, when the RAM reading module reads the features in the ramA, simultaneously, the ramB is cached into the next features, and when the RAM reading module finishes reading the features in the ramA, the features in the ramB are read;
    similarly, the weight is cached into the ramA ', when the RAM reading module reads the weight in the ramA ', the next weights are cached into the ramB ', and when the RAM reading module finishes reading the weight in the ramA ', the weight in the ramB ' is read;
    the stripe array module comprises Mk PE arithmetic units, and the PE arithmetic units comprise Tk/2 multipliers.
  8. 8. The neural network accelerator of claim 7, wherein: the planar data processing module firstly performs width direction pooling operation on the characteristic data, and is provided with N width direction operation units for operating N channels in parallel;
    and then, carrying out height direction pooling operation on the characteristic data, and also setting N height direction operation units to operate N channels in parallel.
  9. 9. The neural network accelerator of claim 8, wherein: the channel data processing module comprises a BUF buffer, 2 selectors and a rearrangement operator arranged among the 2 selectors; the rearrangement operator comprises a channel splicing operator, a surface rearrangement operator and a matrix permutation operator;
    the channel splicing operator is used for splicing two matrix channel directions, wherein the two matrix channels have the same height H and width W, the channels are not fixed to be the same, the two matrix channels become new characteristics after channel splicing, the height is H, the width is W, the channels are C0+ C1, and C0 and C1 represent the channel number of different matrix channels;
    the surface rearrangement operator is used for rearranging each surface of a feature into four surfaces; after surface rearrangement the feature becomes a new feature with height H/2, width W/2 and channel number C × 4, where H represents the height of the original feature, W represents the width of the original feature and C represents the channel number of the original feature;
    and the matrix permutation operator is used for changing the dimensionality of the matrix and permuting the dimensionality sequence of the features to obtain different dimensionality sequences.
  10. 10. The neural network accelerator of claim 8, wherein: the probability calculation module comprises a top5_ comp module, a probability operation unit and a reg register; the top5_ comp module searches for 5 maximum values in the input data stream by adopting a downward search method; the probability operation unit performs probability operation on the obtained 5 maximum values and transmits an obtained operation result to the second DMA through a reg register;
    the probability operation is specifically as follows:
    pi = exp(xi) / Σj exp(xj)
    where xi represents the input classification data.
CN201910900439.2A 2019-09-23 2019-09-23 Neural network accelerator Active CN110738308B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910900439.2A CN110738308B (en) 2019-09-23 2019-09-23 Neural network accelerator

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910900439.2A CN110738308B (en) 2019-09-23 2019-09-23 Neural network accelerator

Publications (2)

Publication Number Publication Date
CN110738308A true CN110738308A (en) 2020-01-31
CN110738308B CN110738308B (en) 2023-05-26

Family

ID=69269457

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910900439.2A Active CN110738308B (en) 2019-09-23 2019-09-23 Neural network accelerator

Country Status (1)

Country Link
CN (1) CN110738308B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111340224A (en) * 2020-02-27 2020-06-26 杭州雄迈集成电路技术股份有限公司 Accelerated design method of CNN network suitable for low-resource embedded chip
CN111401541A (en) * 2020-03-10 2020-07-10 湖南国科微电子股份有限公司 Data transmission control method and device
CN111931927A (en) * 2020-10-19 2020-11-13 翱捷智能科技(上海)有限公司 Method and device for reducing occupation of computing resources in NPU
CN111931918A (en) * 2020-09-24 2020-11-13 深圳佑驾创新科技有限公司 Neural network accelerator
CN112052941A (en) * 2020-09-10 2020-12-08 南京大学 Efficient storage and calculation system applied to CNN network convolution layer and operation method thereof
CN113688069A (en) * 2021-09-10 2021-11-23 北京百度网讯科技有限公司 Data processing method, device, electronic equipment and medium
CN115687181A (en) * 2022-11-07 2023-02-03 上海亿铸智能科技有限公司 Addressing method for storage processing unit
CN117195989A (en) * 2023-11-06 2023-12-08 深圳市九天睿芯科技有限公司 Vector processor, neural network accelerator, chip and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002024157A (en) * 2000-07-07 2002-01-25 Matsushita Electric Ind Co Ltd Method and device for processing dma
CN106779060A (en) * 2017-02-09 2017-05-31 武汉魅瞳科技有限公司 A kind of computational methods of the depth convolutional neural networks for being suitable to hardware design realization
US20180046894A1 (en) * 2016-08-12 2018-02-15 DeePhi Technology Co., Ltd. Method for optimizing an artificial neural network (ann)
EP3346423A1 (en) * 2017-01-04 2018-07-11 STMicroelectronics Srl Deep convolutional network heterogeneous architecture system and device

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002024157A (en) * 2000-07-07 2002-01-25 Matsushita Electric Ind Co Ltd Method and device for processing dma
US20180046894A1 (en) * 2016-08-12 2018-02-15 DeePhi Technology Co., Ltd. Method for optimizing an artificial neural network (ann)
EP3346423A1 (en) * 2017-01-04 2018-07-11 STMicroelectronics Srl Deep convolutional network heterogeneous architecture system and device
CN106779060A (en) * 2017-02-09 2017-05-31 武汉魅瞳科技有限公司 A kind of computational methods of the depth convolutional neural networks for being suitable to hardware design realization

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JICHEN WANG等: "Efficient convolution architectures for convolutional neural network", 《2016 8TH INTERNATIONAL CONFERENCE ON WIRELESS COMMUNICATIONS & SIGNAL PROCESSING (WCSP)》 *
王开宇 et al.: "FPGA Implementation and Optimization of Convolutional Neural Networks", 《实验室科学》 (Laboratory Science)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111340224B (en) * 2020-02-27 2023-11-21 浙江芯劢微电子股份有限公司 Accelerated design method of CNN (computer network) suitable for low-resource embedded chip
CN111340224A (en) * 2020-02-27 2020-06-26 杭州雄迈集成电路技术股份有限公司 Accelerated design method of CNN network suitable for low-resource embedded chip
CN111401541A (en) * 2020-03-10 2020-07-10 湖南国科微电子股份有限公司 Data transmission control method and device
CN112052941B (en) * 2020-09-10 2024-02-20 南京大学 Efficient memory calculation system applied to CNN (computer numerical network) convolution layer and operation method thereof
CN112052941A (en) * 2020-09-10 2020-12-08 南京大学 Efficient storage and calculation system applied to CNN network convolution layer and operation method thereof
CN111931918A (en) * 2020-09-24 2020-11-13 深圳佑驾创新科技有限公司 Neural network accelerator
CN111931927A (en) * 2020-10-19 2020-11-13 翱捷智能科技(上海)有限公司 Method and device for reducing occupation of computing resources in NPU
CN113688069B (en) * 2021-09-10 2022-08-02 北京百度网讯科技有限公司 Data processing method, device, electronic equipment and medium
CN113688069A (en) * 2021-09-10 2021-11-23 北京百度网讯科技有限公司 Data processing method, device, electronic equipment and medium
CN115687181A (en) * 2022-11-07 2023-02-03 上海亿铸智能科技有限公司 Addressing method for storage processing unit
CN115687181B (en) * 2022-11-07 2023-05-12 苏州亿铸智能科技有限公司 Addressing method for memory processing unit
CN117195989A (en) * 2023-11-06 2023-12-08 深圳市九天睿芯科技有限公司 Vector processor, neural network accelerator, chip and electronic equipment
CN117195989B (en) * 2023-11-06 2024-06-04 深圳市九天睿芯科技有限公司 Vector processor, neural network accelerator, chip and electronic equipment

Also Published As

Publication number Publication date
CN110738308B (en) 2023-05-26

Similar Documents

Publication Publication Date Title
CN110738308B (en) Neural network accelerator
US10943167B1 (en) Restructuring a multi-dimensional array
CN106940815B (en) Programmable convolutional neural network coprocessor IP core
CN111897579B (en) Image data processing method, device, computer equipment and storage medium
CN107679620B (en) Artificial neural network processing device
US7574466B2 (en) Method for finding global extrema of a set of shorts distributed across an array of parallel processing elements
CN107679621B (en) Artificial neural network processing device
CN111626414B (en) Dynamic multi-precision neural network acceleration unit
US7454451B2 (en) Method for finding local extrema of a set of values for a parallel processing element
CN108388537B (en) Convolutional neural network acceleration device and method
US20040215677A1 (en) Method for finding global extrema of a set of bytes distributed across an array of parallel processing elements
US20160093343A1 (en) Low power computation architecture
CN112840356A (en) Operation accelerator, processing method and related equipment
CN110321997B (en) High-parallelism computing platform, system and computing implementation method
CN110188869B (en) Method and system for integrated circuit accelerated calculation based on convolutional neural network algorithm
CN110688616B (en) Convolution module of stripe array based on ping-pong RAM and operation method thereof
US11579921B2 (en) Method and system for performing parallel computations to generate multiple output feature maps
CN111465943A (en) On-chip computing network
CN114450699A (en) Method implemented by a processing unit, readable storage medium and processing unit
US20200356809A1 (en) Flexible pipelined backpropagation
CN111860773B (en) Processing apparatus and method for information processing
CN110580519A (en) Convolution operation structure and method thereof
CN114330638A (en) Reconfigurable hardware buffers in neural network accelerator frameworks
CN115310037A (en) Matrix multiplication computing unit, acceleration unit, computing system and related method
JP7410961B2 (en) arithmetic processing unit

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant