WO2020258528A1 - Configurable universal convolutional neural network accelerator - Google Patents

Configurable universal convolutional neural network accelerator

Info

Publication number
WO2020258528A1
WO2020258528A1 (PCT/CN2019/105533)
Authority
WO
WIPO (PCT)
Prior art keywords
accelerator
data
state
read
feature map
Prior art date
Application number
PCT/CN2019/105533
Other languages
French (fr)
Chinese (zh)
Inventor
陆生礼
庞伟
舒程昊
刘昊
范雪梅
苏晶晶
Original Assignee
东南大学 (Southeast University)
Priority date
Filing date
Publication date
Application filed by 东南大学 (Southeast University)
Publication of WO2020258528A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/20Processor architectures; Processor configuration, e.g. pipelining
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/60Memory management

Definitions

  • The invention discloses a configurable general convolutional neural network accelerator, belonging to the technical field of computing, calculating and counting.
  • GPU: Graphics Processing Unit
  • CPU: Central Processing Unit (multi-core)
  • ASIC: Application-Specific Integrated Circuit
  • FPGA: Field-Programmable Gate Array
  • Existing accelerators suffer from low versatility and cannot adapt to changeable neural network structures.
  • This application therefore proposes a configurable general convolutional neural network accelerator that applies storage and computing resources at different scales to different network structures, achieving excellent throughput and energy efficiency.
  • The purpose of the present invention is to provide a configurable general convolutional neural network accelerator that remedies the deficiencies of the above background art.
  • By configuring network parameters, convolutional neural network structures of various scales can be accelerated; networks of different structures use different data reuse modes and highly parallel processing units, obtaining high computing throughput with fewer resources.
  • This solves the technical problems that existing hardware accelerators cannot meet the application requirements of changeable neural network structures and that dedicated accelerators accelerate changed network structures poorly.
  • A configurable general convolutional neural network accelerator includes: a state controller, a feature map buffer, a weight buffer, a register stack, a PE array, an output buffer, a function module and an AXI4 bus interface.
  • The state controller selects the accelerator's data reuse mode and state transition sequence according to the network parameters and controls the switching of the accelerator's working state.
  • The feature map buffer caches feature map data read from external memory through the AXI4 bus interface; before computation starts, the feature map data required for one convolution computation is stored into the register stack.
  • The weight buffer caches weight data read from external memory through the AXI4 bus interface; once computation starts, the weight data is fed directly to each PE unit.
  • The register stack caches the feature map data required for one computation; after computation starts, the feature map data in the register stack is updated incrementally.
  • The PE array reads feature map data from the register stack and weight data from the weight buffer, and stores the convolution results in the output buffer.
  • The output buffer stores the convolution results and sends them to the function module after computation completes.
  • The function module performs the post-convolution bias addition, BN computation, ReLU computation, average pooling and max pooling operations, then packs the final results and sends them to external memory through the AXI4 bus interface.
  • The state controller consists of a network parameter register and a working state controller.
  • In the read-network-parameters state, the state controller reads the network parameters from external memory through the AXI4 bus interface and updates its own network parameter register.
  • Updating the network parameter register updates the accelerator configuration, so that neural network structures of different sizes can be accelerated with optimal configuration parameters.
  • Configuration parameters include: data reuse mode, feature map size, convolution kernel size, array size, number of sub-buffers, number of input channels, number of output channels, and function module configuration information.
  • The accelerator's working states are: waiting, reading network parameters, reading BN parameters, reading feature maps, reading weights, computing, and sending.
  • The working state controller controls the switching of the accelerator's working state according to the network parameters it has read and sends the corresponding control signals to the other modules.
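  • As a behavioral illustration only, the configuration parameters above can be modeled as a per-layer register file; the sketch below is an assumption made for exposition (field names and types are not the patent's register definition):

```python
from dataclasses import dataclass
from enum import Enum

class ReuseMode(Enum):
    INPUT = 0   # keep the input feature map, swap weights
    OUTPUT = 1  # keep partial results, swap feature map and weights
    WEIGHT = 2  # keep the weights, swap the feature map

@dataclass
class NetworkParams:
    """One per-layer configuration word mirroring the parameter list above."""
    reuse_mode: ReuseMode
    fmap_size: int         # input feature map height/width
    kernel_size: int       # K
    stride: int            # S
    array_rows: int        # PE rows in use (fixes R output sub-buffers)
    array_cols: int        # PE columns in use (fixes N weight sub-buffers)
    num_fmap_buffers: int  # M feature map sub-buffers
    in_channels: int
    out_channels: int
    func_cfg: int          # function module configuration bits
```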
  • Data reuse modes include input reuse, output reuse and weight reuse. An appropriate reuse mode is chosen for convolutional layers of different sizes to minimize the number of memory accesses and improve accelerator performance.
  • The data reuse mode used by each layer is configured through the network parameters.
  • In input reuse mode, after a batch of data is computed, the input feature map is retained and the weight data is replaced. The accelerator first enters the read-feature-map state, then the read-weights state, then the compute state; after computation it returns to the read-weights state and repeats this state sequence until the controller signals entry into the sending state.
  • In output reuse mode, after a batch of data is computed, the intermediate results are retained while the feature map and weight data are both replaced. The accelerator first enters the read-feature-map state, then the read-weights state, then the compute state; after computation it returns to the read-feature-map state and repeats this state sequence until the controller signals entry into the sending state.
  • In weight reuse mode, after a batch of data is computed, the weights are retained and the feature map data is replaced. The accelerator first enters the read-weights state, then the read-feature-map state, then the compute state; after computation it returns to the read-feature-map state and repeats this state sequence until the controller signals entry into the sending state.
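  • The three reuse modes differ only in the initial read order and in the state that each compute pass loops back to. A minimal sketch of this state cycling, using descriptive state labels rather than the patent's signal names:

```python
# Initial read order, and the state each compute pass loops back to,
# for the three reuse modes described above.
STATE_CYCLES = {
    "input_reuse":  (["read_fmap", "read_weights", "compute"], "read_weights"),
    "output_reuse": (["read_fmap", "read_weights", "compute"], "read_fmap"),
    "weight_reuse": (["read_weights", "read_fmap", "compute"], "read_fmap"),
}

def next_state(mode: str, state: str, done: bool) -> str:
    """Advance the working state: loop per the reuse mode after each
    compute pass until the controller signals completion ('send')."""
    seq, loop_back = STATE_CYCLES[mode]
    if state == "compute":
        return "send" if done else loop_back
    return seq[seq.index(state) + 1]
```

  • For example, next_state("output_reuse", "compute", done=False) returns "read_fmap", matching the output-reuse cycle described above.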
  • The feature map buffer is divided into M feature map sub-buffers, where M is determined by the number of sub-buffers in the configuration parameters.
  • The feature map data of each input channel, read from external memory through the AXI4 interface, is stored row by row into the corresponding feature map sub-buffers.
  • Once the last feature map sub-buffer has stored a row of image data, the next row of the feature map wraps around and is stored in the first feature map sub-buffer.
  • The feature map data of the next input channel is stored into the feature map buffer in the same manner.
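  • A minimal behavioral sketch of this row-wise round-robin placement (the in-buffer layout is an assumption; only the row-to-sub-buffer assignment is taken from the description above):

```python
def store_channel_rows(fmap_rows, num_sub_buffers):
    """Distribute the rows of one input channel over M sub-buffers:
    row i goes to sub-buffer i mod M, so after the last sub-buffer has
    received a row, the next row wraps back to the first sub-buffer."""
    sub_buffers = [[] for _ in range(num_sub_buffers)]
    for i, row in enumerate(fmap_rows):
        sub_buffers[i % num_sub_buffers].append(row)
    return sub_buffers

# Embodiment figures: 15 rows over M = 5 sub-buffers leaves 3 rows in each,
# matching the "3 rows per feature map sub-buffer" described later.
```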
  • The weight buffer is divided into N weight sub-buffers, where N is determined by the number of PE array columns in the configuration parameters.
  • Weight data read from external memory through the AXI4 interface is stored into the weight sub-buffers in filter order.
  • Each PE column shares one weight sub-buffer; during computation, the weight sub-buffer sends weights to every PE in the corresponding column.
  • The output buffer is divided into R output sub-buffers, where R is determined by the number of PE array rows in use; each PE row corresponds to one output sub-buffer.
  • Each PE row outputs one row of data of multiple output feature maps, stored into that row's output sub-buffer in output-feature-map order.
  • Before computation starts, the register stack buffers all the feature map data required for one computation of the PE array.
  • During computation, each time K*S feature points have been computed (K: convolution kernel size, S: stride), the feature map data in the register stack begins to be updated, guaranteeing that the required feature map data is fully cached before the next convolution computation starts.
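  • Read arithmetically, the refill trigger is periodic in the number of computed feature points; a sketch under that reading (how much data is refilled per trigger is left to the implementation):

```python
def refill_due(points_computed: int, K: int, S: int) -> bool:
    """True each time another K*S feature points have been computed --
    the trigger for updating part of the register stack so the data for
    the next convolution is cached before it starts."""
    return points_computed > 0 and points_computed % (K * S) == 0

# With the embodiment's K = 3, S = 1, a refill is triggered every 3 feature
# points, matching the later description of updating 15*1 register-stack
# entries after every 3 multiply-accumulate operations.
```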
  • The PE array is a two-dimensional systolic array composed of multiple arithmetic units and performs the convolution operations.
  • Each PE row computes one row of the output feature maps, and each PE column computes one output feature map.
  • Input feature map data enters at the first PE column and is passed in turn to each adjacent next column.
  • Weight data is input directly to each PE from the weight sub-buffer corresponding to its column.
  • The AXI4 bus interface allows the accelerator to be mounted on any bus device using the AXI4 protocol.
  • The AXI4 bus is wider than the data width used in computation, so multiple data words are spliced into one bus word for transmission, improving transmission efficiency.
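  • For example, with 16-bit results on a 64-bit bus, as in the embodiment below, four results fit in one bus word. A sketch of the splicing, assuming little-endian packing order:

```python
def pack_words(results, data_bits=16, bus_bits=64):
    """Splice several narrow results into one bus word: result j of a
    group occupies bits [j*data_bits, (j+1)*data_bits)."""
    per_word = bus_bits // data_bits          # 4 results per 64-bit word
    mask = (1 << data_bits) - 1
    words = []
    for i in range(0, len(results), per_word):
        word = 0
        for j, r in enumerate(results[i:i + per_word]):
            word |= (r & mask) << (j * data_bits)
        words.append(word)
    return words
```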
  • The present invention adopts the above technical scheme and has the following beneficial effects: the state controller configures, according to the network parameters, the optimal accelerator parameters matching the neural network structure, flexibly adjusting the PE array size and the division of the sub-buffers to meet changing application requirements and obtain the best acceleration effect under given resource constraints. At the same time, a configurable data reuse scheme applies the optimal reuse mode to each network structure, making full use of the transmission bandwidth, and the highly parallel PE array structure achieves a high data throughput rate.
  • Fig. 1 is a schematic structural diagram of a general convolutional neural network accelerator disclosed in the present invention.
  • Fig. 2 is a schematic diagram of the data flow of the PE array in the present invention.
  • Fig. 3 is a schematic diagram of the work flow of the general convolutional neural network accelerator disclosed in the present invention.
  • The configurable general convolutional neural network accelerator designed by the present invention is shown in Figure 1. Its working method is described in detail with the following example: two PE arrays each of size 14*16, a 3*3 convolution kernel with stride 1, a 15*15 input feature map (after padding), 14 input channels per batch, 32 output channels per batch, and output reuse as the data reuse mode.
  • After the accelerator in the waiting state receives the start signal, it reads the network parameters from external memory through the bus interface, updates the network parameter register, and determines the data reuse mode and the working-state switching sequence according to the register values.
  • The BN parameters read through the accelerator interface are split into two parts and stored in two BN parameter buffers, serving the outputs of the two PE arrays.
  • The feature map data read through the accelerator interface is cached row by row into 5 feature map sub-buffers, each holding 3 rows of feature map data.
  • The weight data read through the accelerator interface is stored into 32 weight sub-buffers in filter order.
  • Before computation starts, 15*3*14 feature map data are read from the feature map sub-buffers into the register stack.
  • During computation, the PE array fetches data from the register stack and the weight sub-buffers for the convolution operation. After every 3 multiply-accumulate operations, 15*1 feature map data in the register stack are updated; after every 3*3*14 data have been computed, one result is output.
  • The results are stored into the corresponding output sub-buffers in feature map order. Since this layer's data reuse mode is output reuse, the accelerator returns to the read-feature-map state after computation, then enters the read-weights and compute states in turn, repeating this cycle until the state controller issues the computation-complete command.
  • After the convolution computation finishes, the accelerator feeds the data into the function module to compute the final output, jumps from the compute state to the sending state, and sends the data to external memory through the AXI4 bus interface.
  • The input data of each PE array row is provided by the register stack, the weight data of each column by the corresponding weight sub-buffer, and the output data of each PE row is stored into the corresponding output sub-buffer.
  • Before computation starts, 15*3*14 feature map data are stored in the register stack.
  • Rows 1 to 3 of the data are sent to the first PE of column 1, rows 2 to 4 to the second PE of column 1, ..., and rows 13 to 15 to the last PE of column 1.
  • Within a 3*3 convolution window, taking the first PE of the first column as an example, data is read in column order: in the first three clock cycles it reads the first datum of each of the first three rows in the register stack.
  • The accelerator has 7 working states: waiting, reading network parameters, reading BN parameters, reading feature maps, reading weights, computing and sending.
  • The selection and switching of the working state are determined by the accelerator's working state controller.
  • The working state controller decides whether the read-BN-parameters state is needed by reading the network parameter register values, and determines the cycling order of the read-feature-map, read-weights and compute states by checking the accelerator's data reuse mode.
  • In input reuse mode, the accelerator enters the read-feature-map state, then the read-weights state, then the compute state, and returns to the read-weights state when the compute state ends. In output reuse mode, it enters the read-feature-map state, then the read-weights state, then the compute state, and returns to the read-feature-map state when the compute state ends. In weight reuse mode, it enters the read-weights state, then the read-feature-map state, then the compute state, and returns to the read-feature-map state when the compute state ends.
  • Waiting state: after initialization the accelerator is in the waiting state, and the working state controller waits for an external start signal. On receiving the start signal, the accelerator jumps to the read-network-parameters state; after the last convolution layer is computed, it returns to the waiting state and awaits the next trigger.
  • Read-network-parameters state: the accelerator reads the previously stored network parameters from external memory through the AXI4 bus interface, parses the returned bus data and stores it into the corresponding network parameter registers. The parameters include the data storage offset address, the data reuse mode, the function module configuration, the network size and the convolution kernel size; optimal accelerator parameters can be selected for each network size, achieving the best performance for different networks.
  • Read-BN-parameters state: the accelerator reads the BN and bias parameters from external memory through the AXI4 bus interface and stores them into the two BN parameter storage areas and the bias parameter storage area. When reading completes, it enters the read-feature-map or read-weights state according to the current data reuse mode.
  • Read-feature-map state: the number of feature map sub-buffers in use is determined by the configured network parameters; feature map data is read from external memory through the AXI4 bus interface and stored in row order into the feature map sub-buffers. When reading completes, the accelerator enters the read-weights or compute state according to the current data reuse mode.
  • Read-weights state: the number of weight sub-buffers in use is determined by the network parameters; weight data is read from external memory through the AXI4 bus interface and stored into the weight sub-buffers in filter order. When reading completes, the accelerator enters the read-feature-map or compute state according to the current data reuse mode.
  • Compute state: the accelerator reads computation data from the register stack and the weight buffer in turn and completes the convolution computation. When computation finishes, the working state controller's signal decides whether to enter the function module computation or to return to a data-reading state; once the convolution results have been output through the function module, the compute state ends and the accelerator enters the sending state.
  • Sending state: the accelerator packs the results output by the function module and sends them to the external storage area through the AXI4 bus interface. The output result width is 16 bits and the bus width is 64 bits, so 4 output results are combined into one bus word; after sending completes, the accelerator returns to the waiting state and awaits the next trigger.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

A configurable universal convolutional neural network accelerator. The accelerator comprises: a PE array, a state controller, a function module, a weight cache region, a feature map cache region, an output cache region, and a register stack; the state controller comprises a network parameter register and a working state controller. An excellent acceleration effect can be achieved for networks of different scales by configuring the network parameter register, and the working state controller controls the switching of the accelerator's working state and sends control signals to the other modules. The weight cache region, the feature map cache region, and the output cache region are each composed of multiple data sub-cache regions and respectively store weight data, feature map data, and calculation results. Suitable data reuse modes, array sizes, and numbers of sub-cache regions can be configured for different network characteristics, giving the accelerator good universality, low power consumption, and high throughput.

Description

A Configurable General Convolutional Neural Network Accelerator

Technical Field
The invention discloses a configurable general convolutional neural network accelerator, belonging to the technical field of computing, calculating and counting.
Background Art
In recent years, deep neural networks have developed rapidly and been widely applied, achieving remarkable results in fields such as text recognition, image recognition, target tracking, and face detection and recognition. The scale of deep neural networks keeps growing as application scenarios become more complex, requiring large numbers of parameters to be stored and computed. How to accelerate and implement large-scale deep neural networks in hardware has therefore become an important problem in the field of machine learning.
GPUs (Graphics Processing Units) and multi-core CPUs (Central Processing Units) are common devices for accelerating large-scale deep neural networks, but porting such networks onto mobile devices with limited power and volume budgets is almost impossible; dedicated acceleration circuits must therefore be designed to meet the computation and storage requirements of large-scale deep neural networks. Compared with GPUs and multi-core CPUs, ASICs (Application-Specific Integrated Circuits) offer higher performance and lower power consumption, but long development cycles, high cost and low design flexibility. FPGAs (Field-Programmable Gate Arrays) are another mainstream acceleration hardware; compared with ASICs they feature short development cycles, low cost and high design flexibility, but lower performance and higher power consumption. Suitable acceleration hardware should be chosen for each specific application scenario.
Existing hardware accelerators can achieve good throughput and energy efficiency for specific deep neural network structures, but in complex application scenarios the network structures keep changing, and dedicated hardware accelerators often accelerate changed structures poorly. A configurable general neural network accelerator is therefore needed for changeable deep neural network structures.
Based on the above analysis, existing accelerators suffer from low versatility and cannot adapt to changeable neural network structures. This application proposes a configurable general convolutional neural network accelerator that applies storage and computing resources at different scales to different network structures, achieving excellent throughput and energy efficiency.
Summary of the Invention
The purpose of the present invention is to provide, in view of the deficiencies of the above background art, a configurable general convolutional neural network accelerator. By configuring network parameters it accelerates convolutional neural network structures of various scales; for networks of different structures it adopts different data reuse modes and highly parallel processing units, obtaining high computing throughput with fewer resources. This solves the technical problems that existing hardware accelerators cannot meet the application requirements of changeable neural network structures and that dedicated hardware accelerators accelerate changed network structures poorly.
To achieve the above purpose, the present invention adopts the following technical solution:
A configurable general convolutional neural network accelerator includes: a state controller, a feature map buffer, a weight buffer, a register stack, a PE array, an output buffer, a function module and an AXI4 bus interface.
The state controller selects the accelerator's data reuse mode and state transition sequence according to the network parameters and controls the switching of the accelerator's working state. The feature map buffer caches feature map data read from external memory through the AXI4 bus interface and, before computation starts, stores the feature map data required for one convolution computation into the register stack. The weight buffer caches weight data read from external memory through the AXI4 bus interface and, once computation starts, feeds the weight data directly to each PE unit. The register stack caches the feature map data required for one computation and updates it incrementally after computation starts. The PE array reads feature map data from the register stack and weight data from the weight buffer and stores the convolution results in the output buffer. The output buffer stores the convolution results and sends them to the function module when computation completes. The function module performs the post-convolution bias addition, BN computation, ReLU computation, average pooling and max pooling operations, then packs the final results and sends them to external memory through the AXI4 bus interface.
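As a behavioral illustration of the function module's post-processing chain, the following sketch applies bias addition, BN, ReLU and optional pooling to one output feature map. Folding BN into a per-map scale and shift is an assumption about how the stored parameters are applied, not the patent's circuit:

```python
import numpy as np

def function_module(rows, bias, bn_scale, bn_shift, pool=None, pool_size=2):
    """Post-process convolution results: bias -> BN -> ReLU -> optional pooling.
    `rows` is one output feature map as a 2-D array."""
    x = rows + bias
    x = bn_scale * x + bn_shift          # BN assumed pre-folded to scale/shift
    x = np.maximum(x, 0)                 # ReLU
    if pool in ("max", "avg"):
        h, w = x.shape
        x = x[:h - h % pool_size, :w - w % pool_size]   # crop to a multiple
        x = x.reshape(x.shape[0] // pool_size, pool_size,
                      x.shape[1] // pool_size, pool_size)
        x = x.max(axis=(1, 3)) if pool == "max" else x.mean(axis=(1, 3))
    return x
```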
The state controller consists of a network parameter register and a working state controller. In the read-network-parameters state, the state controller reads the network parameters from external memory through the AXI4 bus interface and updates its own network parameter register. Updating the network parameter register updates the accelerator configuration, so that neural network structures of different sizes can be accelerated with optimal configuration parameters. Configuration parameters include: data reuse mode, feature map size, convolution kernel size, array size, number of sub-buffers, number of input channels, number of output channels and function module configuration information. The accelerator's working states are: waiting, reading network parameters, reading BN parameters, reading feature maps, reading weights, computing and sending. The working state controller controls the switching of the accelerator's working state according to the network parameters it has read and sends the corresponding control signals to the other modules.
Data reuse modes include input reuse, output reuse and weight reuse. An appropriate reuse mode is chosen for convolutional layers of different sizes to minimize the number of memory accesses and improve accelerator performance. The reuse mode used by each layer is configured through the network parameters. In input reuse mode, after a batch of data is computed, the input feature map is retained and the weight data is replaced: the accelerator enters the read-feature-map state, then the read-weights state, then the compute state, returning to the read-weights state after computation and repeating this sequence until the controller signals the sending state. In output reuse mode, after a batch of data is computed, the intermediate results are retained while the feature map and weight data are both replaced: the accelerator enters the read-feature-map state, then the read-weights state, then the compute state, returning to the read-feature-map state after computation and repeating this sequence until the controller signals the sending state. In weight reuse mode, after a batch of data is computed, the weights are retained and the feature map data is replaced: the accelerator enters the read-weights state, then the read-feature-map state, then the compute state, returning to the read-feature-map state after computation and repeating this sequence until the controller signals the sending state.
The feature map buffer is divided into M feature map sub-buffers, where M is determined by the number of sub-buffers in the configuration parameters. The feature map data of each input channel, read from external memory through the AXI4 interface, is stored row by row into the corresponding feature map sub-buffers. Once the last sub-buffer has stored a row of image data, the next row wraps around and is stored in the first sub-buffer. The feature map data of the next input channel is stored into the feature map buffer in the same manner.
The weight buffer is divided into N weight sub-buffers, where N is determined by the number of PE array columns in the configuration parameters. Weight data read from external memory through the AXI4 interface is stored into the weight sub-buffers in filter order. Each PE column shares one weight sub-buffer; during computation, the sub-buffer sends weights to every PE in the corresponding column.
The output buffer is divided into R output sub-buffers, where R is determined by the number of PE array rows in use; each PE row corresponds to one output sub-buffer. Each PE row outputs one row of data of multiple output feature maps, stored into that row's output sub-buffer in output-feature-map order.
Before computation starts, the register stack buffers all the feature map data required for one computation of the PE array. During computation, each time K*S feature points have been computed (K: convolution kernel size, S: stride), the feature map data in the register stack begins to be updated, guaranteeing that the required feature map data is fully cached before the next convolution computation starts.
The PE array is a two-dimensional systolic array composed of multiple arithmetic units and performs the convolution operations. Each PE row computes one row of the output feature maps, and each PE column computes one output feature map. Input feature map data enters at the first PE column and is passed in turn to each adjacent next column. Weight data is input directly to each PE from the weight sub-buffer corresponding to its column.
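A functional, non-cycle-accurate sketch of this row/column mapping: PE row r produces output row r of every output map, and PE column c produces output map c. The loop nest mirrors the array's spatial mapping rather than its timing, and stride-1 valid convolution is assumed as in the embodiment:

```python
import numpy as np

def pe_array_conv(fmap, weights, K=3, S=1):
    """fmap: (C_in, H, W); weights: (C_out, C_in, K, K).
    PE row r -> output row r of every map; PE column c -> output map c."""
    C_in, H, W = fmap.shape
    C_out = weights.shape[0]
    H_out = (H - K) // S + 1
    W_out = (W - K) // S + 1
    out = np.zeros((C_out, H_out, W_out))
    for c in range(C_out):            # one PE column per output map
        for r in range(H_out):        # one PE row per output row
            for x in range(W_out):    # the PE slides along the row over time
                window = fmap[:, r*S:r*S+K, x*S:x*S+K]
                out[c, r, x] = np.sum(window * weights[c])
    return out
```

With the embodiment's figures (15*15 padded input, K=3, S=1), each output map is 13*13, so 13 of the 14 PE rows are active and the two 16-column arrays together cover the 32 output channels of one batch.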
The AXI4 bus interface allows the accelerator to be mounted on any bus device using the AXI4 protocol. The AXI4 bus is wider than the data width used in computation, so multiple data words are spliced into one bus word for transmission, improving transmission efficiency.
The present invention adopts the above technical solution and has the following beneficial effects: the state controller configures, according to the network parameters, the optimal accelerator parameters matching the neural network structure, flexibly adjusting the PE array size and the division of the sub-buffers to meet changing application requirements and obtain the best acceleration effect under given resource constraints; at the same time, a configurable data reuse scheme applies the optimal reuse mode to each network structure, making full use of the transmission bandwidth, and the highly parallel PE array structure achieves a high data throughput rate.
Description of the Drawings
Fig. 1 is a schematic structural diagram of the general convolutional neural network accelerator disclosed in the present invention.
Fig. 2 is a schematic diagram of the data flow of the PE array in the present invention.
Fig. 3 is a schematic diagram of the workflow of the general convolutional neural network accelerator disclosed in the present invention.
Detailed Description of the Embodiments
The technical solution of the invention is described in detail below in conjunction with the drawings.
The configurable general convolutional neural network accelerator designed by the present invention is shown in Fig. 1. Its working method is described in detail with the following example: two PE arrays each of size 14*16, a 3*3 convolution kernel with stride 1, a 15*15 input feature map (after padding), 14 input channels per batch, 32 output channels per batch, and output reuse as the data reuse mode.
After the accelerator in the waiting state receives the start signal, it reads the network parameters from external memory through the bus interface, updates the network parameter register, and determines the data reuse mode and the working-state switching sequence according to the register values. The BN parameters read through the accelerator interface are split into two parts and stored in two BN parameter buffers, serving the outputs of the two PE arrays. The feature map data read through the accelerator interface is cached row by row into 5 feature map sub-buffers, each holding 3 rows of feature map data. The weight data read through the accelerator interface is stored into 32 weight sub-buffers in filter order. Before computation starts, 15*3*14 feature map data are read from the feature map sub-buffers into the register stack. During computation, the PE array fetches data from the register stack and the weight sub-buffers for the convolution operation. After every 3 multiply-accumulate operations, 15*1 feature map data in the register stack are updated; after every 3*3*14 data have been computed, one result is output. The results are stored into the corresponding output sub-buffers in feature map order. Since this layer's data reuse mode is output reuse, the accelerator returns to the read-feature-map state after computation, then enters the read-weights and compute states in turn, repeating this cycle until the state controller issues the computation-complete command. After the convolution computation finishes, the accelerator feeds the data into the function module to compute the final output, jumps from the compute state to the sending state, and sends the data to external memory through the AXI4 bus interface.
Referring to Fig. 2, the input data of each PE array row is provided by the register stack, the weight data of each column by the corresponding weight sub-buffer, and the output data of each PE row is stored into the corresponding output sub-buffer. Before computation starts, 15*3*14 feature map data are stored in the register stack. Rows 1 to 3 of the data are sent to the first PE of column 1, rows 2 to 4 to the second PE of column 1, ..., and rows 13 to 15 to the last PE of column 1. Within a 3*3 convolution window, taking the first PE of the first column as an example, data is read in column order: in the first three clock cycles it reads the first datum of each of the first three rows in the register stack. After 3*1 feature map data have been computed, the first column of feature map data in the register stack is updated. After computation the data is output to the corresponding output sub-buffer, each of which stores one row of the output feature maps. Taking the output sub-buffer of the first PE row as an example, the 13 data of the first output feature map are stored at addresses 0 to 12, addresses 13 to 25 hold the data of the second output feature map, and so on for all the output feature maps.
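The address layout in this example generalizes to address = map_index * output_row_width + column; a small sketch with the embodiment's 13-wide output rows:

```python
def output_address(map_index: int, x: int, w_out: int = 13) -> int:
    """Address of output point x of output map `map_index` inside one
    PE row's output sub-buffer: maps are stored back to back, so the
    first map occupies addresses 0..12, the second 13..25, and so on."""
    return map_index * w_out + x

assert output_address(0, 12) == 12   # end of the first output map
assert output_address(1, 0) == 13    # start of the second output map
```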
As shown in Fig. 3, the accelerator has 7 working states: waiting, reading network parameters, reading BN parameters, reading feature maps, reading weights, computing and sending. The selection and switching of the working state are determined by the accelerator's working state controller, which decides whether the read-BN-parameters state is needed by reading the network parameter register values and determines the cycling order of the read-feature-map, read-weights and compute states by checking the accelerator's data reuse mode.
In input reuse mode, the accelerator enters the read-feature-map state, then the read-weights state, then the compute state, returning to the read-weights state when the compute state ends. In output reuse mode, it enters the read-feature-map state, then the read-weights state, then the compute state, returning to the read-feature-map state when the compute state ends. In weight reuse mode, it enters the read-weights state, then the read-feature-map state, then the compute state, returning to the read-feature-map state when the compute state ends.
Waiting state: after initialization the accelerator is in the waiting state, and the working state controller waits for an external start signal. On receiving the start signal, the accelerator jumps to the read-network-parameters state; after the last convolution layer is computed, it returns to the waiting state and awaits the next trigger.
Read-network-parameters state: the accelerator reads the previously stored network parameters from external memory through the AXI4 bus interface, parses the returned bus data and stores it into the corresponding network parameter registers. The parameters include the data storage offset address, the data reuse mode, the function module configuration, the network size and the convolution kernel size; optimal accelerator parameters can be selected for each network size, achieving the best performance for different networks.
Read-BN-parameters state: the accelerator reads the BN and bias parameters from external memory through the AXI4 bus interface and stores them into the two BN parameter storage areas and the bias parameter storage area. When reading completes, it enters the read-feature-map or read-weights state according to the current data reuse mode.
Read-feature-map state: the number of feature map sub-buffers in use is determined by the configured accelerator network parameters; feature map data is read from external memory through the AXI4 bus interface and stored in row order into the feature map sub-buffers. When reading completes, the accelerator enters the read-weights or compute state according to the current data reuse mode.
Read-weights state: the number of weight sub-buffers in use is determined by the network parameters; weight data is read from external memory through the AXI4 bus interface and stored into the weight sub-buffers in filter order. When reading completes, the accelerator enters the read-feature-map or compute state according to the current data reuse mode.
Compute state: the accelerator reads computation data from the register stack and the weight buffer in turn and completes the convolution computation. When computation finishes, the working state controller's signal decides whether to enter the function module computation or to return to a data-reading state; once the convolution results have been output through the function module, the compute state ends and the accelerator enters the sending state.
Sending state: the accelerator packs the results output by the function module and sends them to the external storage area through the AXI4 bus interface. In the embodiment the output result width is 16 bits and the bus width is 64 bits, so 4 output results are combined into one bus word; after sending completes, the accelerator returns to the waiting state and awaits the next trigger.
The embodiments merely illustrate the technical ideas of the present invention and cannot limit its scope of protection; any modification made on the basis of the disclosed technical solution that conforms to the inventive concept of this application falls within the protection scope of the present invention.

Claims (8)

  1. A configurable general convolutional neural network accelerator, characterized in that it comprises:
    a state controller, which reads network parameters from an external memory, configures accelerator parameters including the data reuse mode, the array size and the number of sub-buffers according to the network parameters, and switches the accelerator's working state according to the data reuse mode;
    a feature map buffer comprising multiple sub-buffers, which caches row by row the feature map data read from the external memory, according to the number of sub-buffers configured by the state controller;
    a register stack, which caches the feature map data required for one computation of the PE array;
    a weight buffer comprising multiple sub-buffers, which caches in filter order the weight data read from the external memory, according to the number of sub-buffers configured by the state controller;
    a PE array, in which each row of PE units reads feature map data from the register stack and each column of PE units reads the weight data cached in one shared weight sub-buffer, performing convolution computation on the feature map data and the weight data; and
    an output buffer comprising multiple sub-buffers, which caches the rows of different output feature maps produced by each row of PE units.
  2. The configurable general convolutional neural network accelerator according to claim 1, characterized in that the network parameters read by the state controller from the external memory include the convolutional layer size, according to which the state controller configures the data reuse mode with the fewest memory accesses; the data reuse modes include: input data reuse mode, weight data reuse mode and output data reuse mode.
  3. The configurable general convolutional neural network accelerator according to claim 1, characterized in that the accelerator further comprises:
    a BN parameter storage area, which caches the BN parameters read from the external memory when the network parameters read by the state controller include function module configuration information;
    a bias parameter storage area, which caches the bias parameters read from the external memory when the network parameters read by the state controller include function module configuration information; and
    a function module which, upon receiving the state controller's instruction to perform function operations, applies bias addition, normalization, activation and pooling in turn to the feature map row data stored in the output buffer, and finally outputs the computation results of the neural network.
  4. The configurable general convolutional neural network accelerator according to claim 1, characterized in that the number of sub-buffers of the feature map buffer is determined by the number of sub-buffers configured by the state controller.
  5. The configurable general convolutional neural network accelerator according to claim 1, characterized in that the number of sub-buffers of the weight buffer is determined by the number of array columns configured by the state controller.
  6. The configurable general convolutional neural network accelerator according to claim 1, characterized in that the number of sub-buffers of the output buffer is determined by the number of array rows configured by the state controller.
  7. The configurable general convolutional neural network accelerator according to claim 2, characterized in that the state controller initializes the accelerator into the read-network-parameters state and, after completing the accelerator parameter configuration according to the read network parameters, switches the accelerator in input data reuse mode through the read-feature-map, read-weights and compute states in turn; in weight data reuse mode through the read-weights, read-feature-map and compute states in turn; and in output data reuse mode through the read-feature-map, read-weights and compute states in turn; after the convolution computation is completed, the accelerator switches to the data-sending state.
  8. The configurable general convolutional neural network accelerator according to claim 7, characterized in that when the network parameters read by the state controller from the external memory include function module configuration information, the accelerator switches to the state of reading the BN parameters and bias parameters, and then switches working states according to the data reuse mode.
PCT/CN2019/105533 2019-06-25 2019-09-12 Configurable universal convolutional neural network accelerator WO2020258528A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910554533.7A CN110390384B (en) 2019-06-25 2019-06-25 Configurable general convolutional neural network accelerator
CN201910554533.7 2019-06-25

Publications (1)

Publication Number Publication Date
WO2020258528A1 true WO2020258528A1 (en) 2020-12-30

Family

ID=68285786

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/105533 WO2020258528A1 (en) 2019-06-25 2019-09-12 Configurable universal convolutional neural network accelerator

Country Status (2)

Country Link
CN (1) CN110390384B (en)
WO (1) WO2020258528A1 (en)

Families Citing this family (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111382094B (en) * 2018-12-29 2021-11-30 深圳云天励飞技术有限公司 Data processing method and device
CN112819022B (en) * 2019-11-18 2023-11-07 同方威视技术股份有限公司 Image recognition device and image recognition method based on neural network
US11216375B2 (en) 2020-02-26 2022-01-04 Hangzhou Zhicun Intelligent Technology Co., Ltd. Data caching
CN113313228B (en) * 2020-02-26 2022-10-14 杭州知存智能科技有限公司 Data caching circuit and method
CN111401543B (en) * 2020-06-08 2020-11-10 深圳市九天睿芯科技有限公司 Neural network accelerator with full on-chip storage and implementation method thereof
CN111832717B (en) * 2020-06-24 2021-09-28 上海西井信息科技有限公司 Chip and processing device for convolution calculation
CN111967587B (en) * 2020-07-27 2024-03-29 复旦大学 Method for constructing operation unit array structure facing neural network processing
CN111626414B (en) * 2020-07-30 2020-10-27 电子科技大学 Dynamic multi-precision neural network acceleration unit
CN111931911B (en) * 2020-07-30 2022-07-08 山东云海国创云计算装备产业创新中心有限公司 CNN accelerator configuration method, system and device
CN112232499B (en) * 2020-10-13 2022-12-23 华中光电技术研究所(中国船舶重工集团公司第七一七研究所) Convolutional neural network accelerator
KR20220049325A (en) 2020-10-14 2022-04-21 삼성전자주식회사 Accelerator and electronic device including the same
CN112465110B (en) * 2020-11-16 2022-09-13 中国电子科技集团公司第五十二研究所 Hardware accelerator for convolution neural network calculation optimization
CN112766479B (en) * 2021-01-26 2022-11-11 东南大学 Neural network accelerator supporting channel separation convolution based on FPGA
CN112949847B (en) * 2021-03-29 2023-07-25 上海西井科技股份有限公司 Neural network algorithm acceleration system, scheduling system and scheduling method
CN113570034B (en) * 2021-06-18 2022-09-27 北京百度网讯科技有限公司 Processing device, neural network processing method and device
CN113807509B (en) * 2021-09-14 2024-03-22 绍兴埃瓦科技有限公司 Neural network acceleration device, method and communication equipment
CN113792868B (en) * 2021-09-14 2024-03-29 绍兴埃瓦科技有限公司 Neural network computing module, method and communication equipment
CN113792687A (en) * 2021-09-18 2021-12-14 兰州大学 Human intrusion behavior early warning system based on monocular camera
CN114239816B (en) * 2021-12-09 2023-04-07 电子科技大学 Reconfigurable hardware acceleration architecture of convolutional neural network-graph convolutional neural network
CN114820630B (en) * 2022-07-04 2022-09-06 国网浙江省电力有限公司电力科学研究院 Target tracking algorithm model pipeline acceleration method and circuit based on FPGA
CN116010313A (en) * 2022-11-29 2023-04-25 中国科学院深圳先进技术研究院 Universal and configurable image filtering calculation multi-line output system and method

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11775313B2 (en) * 2017-05-26 2023-10-03 Purdue Research Foundation Hardware accelerator for convolutional neural networks and method of operation thereof
CN108241890B (en) * 2018-01-29 2021-11-23 清华大学 Reconfigurable neural network acceleration method and architecture

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018196863A1 (en) * 2017-04-28 2018-11-01 北京市商汤科技开发有限公司 Convolution acceleration and calculation processing methods and apparatuses, electronic device and storage medium
CN107341544A (en) * 2017-06-30 2017-11-10 清华大学 A kind of reconfigurable accelerator and its implementation based on divisible array
CN108805272A (en) * 2018-05-03 2018-11-13 东南大学 A kind of general convolutional neural networks accelerator based on FPGA
CN109102065A (en) * 2018-06-28 2018-12-28 广东工业大学 A kind of convolutional neural networks accelerator based on PSoC
CN109598338A (en) * 2018-12-07 2019-04-09 东南大学 A kind of convolutional neural networks accelerator of the calculation optimization based on FPGA

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210295145A1 (en) * 2020-03-23 2021-09-23 Mentium Technologies Inc. Digital-analog hybrid system architecture for neural network acceleration
CN113222129A (en) * 2021-04-02 2021-08-06 西安电子科技大学 Convolution operation processing unit and system based on multi-level cache cyclic utilization
CN113222129B (en) * 2021-04-02 2024-02-13 西安电子科技大学 Convolution operation processing unit and system based on multi-level cache cyclic utilization
CN113313251B (en) * 2021-05-13 2023-05-23 中国科学院计算技术研究所 Depth separable convolution fusion method and system based on data flow architecture
CN113313251A (en) * 2021-05-13 2021-08-27 中国科学院计算技术研究所 Deep separable convolution fusion method and system based on data stream architecture
CN113962361A (en) * 2021-10-09 2022-01-21 西安交通大学 Winograd-based data conflict-free scheduling method for CNN accelerator system
CN113962361B (en) * 2021-10-09 2024-04-05 西安交通大学 Winograd-based CNN accelerator system data conflict-free scheduling method
CN114707649A (en) * 2022-03-28 2022-07-05 北京理工大学 General convolution arithmetic device
CN114781632A (en) * 2022-05-20 2022-07-22 重庆科技学院 Deep neural network accelerator based on dynamic reconfigurable pulse tensor operation engine
CN114997386B (en) * 2022-06-29 2024-03-22 桂林电子科技大学 CNN neural network acceleration design method based on multi-FPGA heterogeneous architecture
CN114997386A (en) * 2022-06-29 2022-09-02 桂林电子科技大学 CNN neural network acceleration design method based on multi-FPGA heterogeneous architecture
CN116050474A (en) * 2022-12-29 2023-05-02 上海天数智芯半导体有限公司 Convolution calculation method, SOC chip, electronic equipment and storage medium
CN115965067B (en) * 2023-02-01 2023-08-25 苏州亿铸智能科技有限公司 Neural network accelerator for ReRAM
CN115965067A (en) * 2023-02-01 2023-04-14 苏州亿铸智能科技有限公司 Neural network accelerator for ReRAM
CN118070855A (en) * 2024-04-18 2024-05-24 南京邮电大学 Convolutional neural network accelerator based on RISC-V architecture
CN118070855B (en) * 2024-04-18 2024-07-09 南京邮电大学 Convolutional neural network accelerator based on RISC-V architecture

Also Published As

Publication number Publication date
CN110390384B (en) 2021-07-06
CN110390384A (en) 2019-10-29

Similar Documents

Publication Publication Date Title
WO2020258528A1 (en) Configurable universal convolutional neural network accelerator
CN108171317B (en) Data multiplexing convolution neural network accelerator based on SOC
CN110390385B (en) BNRP-based configurable parallel general convolutional neural network accelerator
CN109598338B (en) Convolutional neural network accelerator based on FPGA (field programmable Gate array) for calculation optimization
US20190026626A1 (en) Neural network accelerator and operation method thereof
CN107301455B (en) Hybrid cube storage system for convolutional neural network and accelerated computing method
CN110334799B (en) Neural network reasoning and training accelerator based on storage and calculation integration and operation method thereof
CN108416437A (en) The processing system and method for artificial neural network for multiply-add operation
CN108805272A (en) A kind of general convolutional neural networks accelerator based on FPGA
CN110516801A (en) A kind of dynamic reconfigurable convolutional neural networks accelerator architecture of high-throughput
CN108537331A (en) A kind of restructural convolutional neural networks accelerating circuit based on asynchronous logic
CN105912501B (en) A kind of SM4-128 Encryption Algorithm realization method and systems based on extensive coarseness reconfigurable processor
CN115880132B (en) Graphics processor, matrix multiplication task processing method, device and storage medium
CN115860080B (en) Computing core, accelerator, computing method, apparatus, device, medium, and system
US20230128421A1 (en) Neural network accelerator
US20230376733A1 (en) Convolutional neural network accelerator hardware
RU2294561C2 (en) Device for hardware realization of probability genetic algorithms
CN113673691A (en) Storage and computation combination-based multi-channel convolution FPGA (field programmable Gate array) framework and working method thereof
CN106569968A (en) Inter-array data transmission structure and scheduling method used for reconfigurable processor
CN101452572A (en) Image rotating VLSI structure based on cubic translation algorithm
CN117291240B (en) Convolutional neural network accelerator and electronic device
US11068200B2 (en) Method and system for memory control
CN115965067B (en) Neural network accelerator for ReRAM
CN113177877B (en) Schur elimination accelerator oriented to SLAM rear end optimization
Rezaei et al. Smart Memory: Deep Learning Acceleration In 3D-Stacked Memories

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19935697

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19935697

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 31/08/2022)
