WO2020258528A1 - Configurable universal convolutional neural network accelerator - Google Patents

Configurable universal convolutional neural network accelerator

Info

Publication number
WO2020258528A1
WO2020258528A1 (PCT/CN2019/105533)
Authority
WO
WIPO (PCT)
Prior art keywords
accelerator
data
state
read
feature map
Prior art date
Application number
PCT/CN2019/105533
Other languages
French (fr)
Chinese (zh)
Inventor
陆生礼
庞伟
舒程昊
刘昊
范雪梅
苏晶晶
Original Assignee
东南大学 (Southeast University)
Priority date
Filing date
Publication date
Application filed by 东南大学 (Southeast University)
Publication of WO2020258528A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/20Processor architectures; Processor configuration, e.g. pipelining
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/60Memory management

Definitions

  • The invention discloses a configurable general convolutional neural network accelerator, belonging to the technical field of computing, calculating and counting.
  • GPU: Graphics Processing Unit
  • CPU: Central Processing Unit (multi-core)
  • ASIC: Application-Specific Integrated Circuit
  • FPGA: Field-Programmable Gate Array
  • Existing accelerators suffer from low versatility and cannot adapt to changeable neural network structures.
  • This application therefore proposes a configurable general convolutional neural network accelerator that applies storage and computing resources at different scales to different network structures, achieving excellent throughput and energy efficiency.
  • The purpose of the present invention is to provide a configurable general convolutional neural network accelerator that remedies the deficiencies of the above background art.
  • By configuring network parameters, convolutional neural network structures of various scales can be accelerated; networks of different structures use different data reuse modes and highly parallel processing units, obtaining high computing throughput with fewer resources.
  • This solves the technical problems that existing hardware accelerators cannot meet the application requirements of changeable neural network structures and that dedicated accelerators accelerate changed network structures poorly.
  • A configurable general convolutional neural network accelerator includes: a state controller, a feature map buffer, a weight buffer, a register stack, a PE array, an output buffer, a function module and an AXI4 bus interface.
  • The state controller selects the accelerator's data reuse mode and state transition sequence according to the network parameters and controls the switching of the accelerator's working state.
  • The feature map buffer caches feature map data read from external memory through the AXI4 bus interface; before computation starts, the feature map data required for one convolution computation is stored into the register stack.
  • The weight buffer caches weight data read from external memory through the AXI4 bus interface; once computation starts, the weight data is fed directly to each PE unit.
  • The register stack caches the feature map data required for one computation; after computation starts, the feature map data in the register stack is updated incrementally.
  • The PE array reads feature map data from the register stack and weight data from the weight buffer, and stores the convolution results in the output buffer.
  • The output buffer stores the convolution results and sends them to the function module after computation completes.
  • The function module performs the post-convolution bias addition, BN computation, ReLU computation, average pooling and max pooling operations, then packs the final results and sends them to external memory through the AXI4 bus interface.
  • The state controller consists of a network parameter register and a working state controller.
  • In the read-network-parameters state, the state controller reads the network parameters from external memory through the AXI4 bus interface and updates its own network parameter register.
  • Updating the network parameter register updates the accelerator configuration, so that neural network structures of different sizes can be accelerated with optimal configuration parameters.
  • Configuration parameters include: data reuse mode, feature map size, convolution kernel size, array size, number of sub-buffers, number of input channels, number of output channels, and function module configuration information.
  • The accelerator's working states are: waiting, reading network parameters, reading BN parameters, reading feature maps, reading weights, computing, and sending.
  • The working state controller controls the switching of the accelerator's working state according to the network parameters it has read and sends the corresponding control signals to the other modules.
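  • As a behavioral illustration only, the configuration parameters above can be modeled as a per-layer register file; the sketch below is an assumption made for exposition (field names and types are not the patent's register definition):

```python
from dataclasses import dataclass
from enum import Enum

class ReuseMode(Enum):
    INPUT = 0   # keep the input feature map, swap weights
    OUTPUT = 1  # keep partial results, swap feature map and weights
    WEIGHT = 2  # keep the weights, swap the feature map

@dataclass
class NetworkParams:
    """One per-layer configuration word mirroring the parameter list above."""
    reuse_mode: ReuseMode
    fmap_size: int         # input feature map height/width
    kernel_size: int       # K
    stride: int            # S
    array_rows: int        # PE rows in use (fixes R output sub-buffers)
    array_cols: int        # PE columns in use (fixes N weight sub-buffers)
    num_fmap_buffers: int  # M feature map sub-buffers
    in_channels: int
    out_channels: int
    func_cfg: int          # function module configuration bits
```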
  • Data reuse modes include input reuse, output reuse and weight reuse. An appropriate reuse mode is chosen for convolutional layers of different sizes to minimize the number of memory accesses and improve accelerator performance.
  • The data reuse mode used by each layer is configured through the network parameters.
  • In input reuse mode, after a batch of data is computed, the input feature map is retained and the weight data is replaced. The accelerator first enters the read-feature-map state, then the read-weights state, then the compute state; after computation it returns to the read-weights state and repeats this state sequence until the controller signals entry into the sending state.
  • In output reuse mode, after a batch of data is computed, the intermediate results are retained while the feature map and weight data are both replaced. The accelerator first enters the read-feature-map state, then the read-weights state, then the compute state; after computation it returns to the read-feature-map state and repeats this state sequence until the controller signals entry into the sending state.
  • In weight reuse mode, after a batch of data is computed, the weights are retained and the feature map data is replaced. The accelerator first enters the read-weights state, then the read-feature-map state, then the compute state; after computation it returns to the read-feature-map state and repeats this state sequence until the controller signals entry into the sending state.
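  • The three reuse modes differ only in the initial read order and in the state that each compute pass loops back to. A minimal sketch of this state cycling, using descriptive state labels rather than the patent's signal names:

```python
# Initial read order, and the state each compute pass loops back to,
# for the three reuse modes described above.
STATE_CYCLES = {
    "input_reuse":  (["read_fmap", "read_weights", "compute"], "read_weights"),
    "output_reuse": (["read_fmap", "read_weights", "compute"], "read_fmap"),
    "weight_reuse": (["read_weights", "read_fmap", "compute"], "read_fmap"),
}

def next_state(mode: str, state: str, done: bool) -> str:
    """Advance the working state: loop per the reuse mode after each
    compute pass until the controller signals completion ('send')."""
    seq, loop_back = STATE_CYCLES[mode]
    if state == "compute":
        return "send" if done else loop_back
    return seq[seq.index(state) + 1]
```

  • For example, next_state("output_reuse", "compute", done=False) returns "read_fmap", matching the output-reuse cycle described above.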
  • The feature map buffer is divided into M feature map sub-buffers, where M is determined by the number of sub-buffers in the configuration parameters.
  • The feature map data of each input channel, read from external memory through the AXI4 interface, is stored row by row into the corresponding feature map sub-buffers.
  • Once the last feature map sub-buffer has stored a row of image data, the next row of the feature map wraps around and is stored in the first feature map sub-buffer.
  • The feature map data of the next input channel is stored into the feature map buffer in the same manner.
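  • A minimal behavioral sketch of this row-wise round-robin placement (the in-buffer layout is an assumption; only the row-to-sub-buffer assignment is taken from the description above):

```python
def store_channel_rows(fmap_rows, num_sub_buffers):
    """Distribute the rows of one input channel over M sub-buffers:
    row i goes to sub-buffer i mod M, so after the last sub-buffer has
    received a row, the next row wraps back to the first sub-buffer."""
    sub_buffers = [[] for _ in range(num_sub_buffers)]
    for i, row in enumerate(fmap_rows):
        sub_buffers[i % num_sub_buffers].append(row)
    return sub_buffers

# Embodiment figures: 15 rows over M = 5 sub-buffers leaves 3 rows in each,
# matching the "3 rows per feature map sub-buffer" described later.
```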
  • The weight buffer is divided into N weight sub-buffers, where N is determined by the number of PE array columns in the configuration parameters.
  • Weight data read from external memory through the AXI4 interface is stored into the weight sub-buffers in filter order.
  • Each PE column shares one weight sub-buffer; during computation, the weight sub-buffer sends weights to every PE in the corresponding column.
  • The output buffer is divided into R output sub-buffers, where R is determined by the number of PE array rows in use; each PE row corresponds to one output sub-buffer.
  • Each PE row outputs one row of data of multiple output feature maps, stored into that row's output sub-buffer in output-feature-map order.
  • Before computation starts, the register stack buffers all the feature map data required for one computation of the PE array.
  • During computation, each time K*S feature points have been computed (K: convolution kernel size, S: stride), the feature map data in the register stack begins to be updated, guaranteeing that the required feature map data is fully cached before the next convolution computation starts.
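  • Read arithmetically, the refill trigger is periodic in the number of computed feature points; a sketch under that reading (how much data is refilled per trigger is left to the implementation):

```python
def refill_due(points_computed: int, K: int, S: int) -> bool:
    """True each time another K*S feature points have been computed --
    the trigger for updating part of the register stack so the data for
    the next convolution is cached before it starts."""
    return points_computed > 0 and points_computed % (K * S) == 0

# With the embodiment's K = 3, S = 1, a refill is triggered every 3 feature
# points, matching the later description of updating 15*1 register-stack
# entries after every 3 multiply-accumulate operations.
```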
  • The PE array is a two-dimensional systolic array composed of multiple arithmetic units and performs the convolution operations.
  • Each PE row computes one row of the output feature maps, and each PE column computes one output feature map.
  • Input feature map data enters at the first PE column and is passed in turn to each adjacent next column.
  • Weight data is input directly to each PE from the weight sub-buffer corresponding to its column.
  • The AXI4 bus interface allows the accelerator to be mounted on any bus device using the AXI4 protocol.
  • The AXI4 bus is wider than the data width used in computation, so multiple data words are spliced into one bus word for transmission, improving transmission efficiency.
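  • For example, with 16-bit results on a 64-bit bus, as in the embodiment below, four results fit in one bus word. A sketch of the splicing, assuming little-endian packing order:

```python
def pack_words(results, data_bits=16, bus_bits=64):
    """Splice several narrow results into one bus word: result j of a
    group occupies bits [j*data_bits, (j+1)*data_bits)."""
    per_word = bus_bits // data_bits          # 4 results per 64-bit word
    mask = (1 << data_bits) - 1
    words = []
    for i in range(0, len(results), per_word):
        word = 0
        for j, r in enumerate(results[i:i + per_word]):
            word |= (r & mask) << (j * data_bits)
        words.append(word)
    return words
```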
  • The present invention adopts the above technical scheme and has the following beneficial effects: the state controller configures, according to the network parameters, the optimal accelerator parameters matching the neural network structure, flexibly adjusting the PE array size and the division of the sub-buffers to meet changing application requirements and obtain the best acceleration effect under given resource constraints. At the same time, a configurable data reuse scheme applies the optimal reuse mode to each network structure, making full use of the transmission bandwidth, and the highly parallel PE array structure achieves a high data throughput rate.
  • Fig. 1 is a schematic structural diagram of a general convolutional neural network accelerator disclosed in the present invention.
  • Fig. 2 is a schematic diagram of the data flow of the PE array in the present invention.
  • Fig. 3 is a schematic diagram of the work flow of the general convolutional neural network accelerator disclosed in the present invention.
  • The configurable general convolutional neural network accelerator designed by the present invention is shown in Figure 1. Its working method is described in detail with the following example: two PE arrays each of size 14*16, a 3*3 convolution kernel with stride 1, a 15*15 input feature map (after padding), 14 input channels per batch, 32 output channels per batch, and output reuse as the data reuse mode.
  • After the accelerator in the waiting state receives the start signal, it reads the network parameters from external memory through the bus interface, updates the network parameter register, and determines the data reuse mode and the working-state switching sequence according to the register values.
  • The BN parameters read through the accelerator interface are split into two parts and stored in two BN parameter buffers, serving the outputs of the two PE arrays.
  • The feature map data read through the accelerator interface is cached row by row into 5 feature map sub-buffers, each holding 3 rows of feature map data.
  • The weight data read through the accelerator interface is stored into 32 weight sub-buffers in filter order.
  • Before computation starts, 15*3*14 feature map data are read from the feature map sub-buffers into the register stack.
  • During computation, the PE array fetches data from the register stack and the weight sub-buffers for the convolution operation. After every 3 multiply-accumulate operations, 15*1 feature map data in the register stack are updated; after every 3*3*14 data have been computed, one result is output.
  • The results are stored into the corresponding output sub-buffers in feature map order. Since this layer's data reuse mode is output reuse, the accelerator returns to the read-feature-map state after computation, then enters the read-weights and compute states in turn, repeating this cycle until the state controller issues the computation-complete command.
  • After the convolution computation finishes, the accelerator feeds the data into the function module to compute the final output, jumps from the compute state to the sending state, and sends the data to external memory through the AXI4 bus interface.
  • The input data of each PE array row is provided by the register stack, the weight data of each column by the corresponding weight sub-buffer, and the output data of each PE row is stored into the corresponding output sub-buffer.
  • Before computation starts, 15*3*14 feature map data are stored in the register stack.
  • Rows 1 to 3 of the data are sent to the first PE of column 1, rows 2 to 4 to the second PE of column 1, ..., and rows 13 to 15 to the last PE of column 1.
  • Within a 3*3 convolution window, taking the first PE of the first column as an example, data is read in column order: in the first three clock cycles it reads the first datum of each of the first three rows in the register stack.
  • The accelerator has 7 working states: waiting, reading network parameters, reading BN parameters, reading feature maps, reading weights, computing and sending.
  • The selection and switching of the working state are determined by the accelerator's working state controller.
  • The working state controller decides whether the read-BN-parameters state is needed by reading the network parameter register values, and determines the cycling order of the read-feature-map, read-weights and compute states by checking the accelerator's data reuse mode.
  • In input reuse mode, the accelerator enters the read-feature-map state, then the read-weights state, then the compute state, and returns to the read-weights state when the compute state ends. In output reuse mode, it enters the read-feature-map state, then the read-weights state, then the compute state, and returns to the read-feature-map state when the compute state ends. In weight reuse mode, it enters the read-weights state, then the read-feature-map state, then the compute state, and returns to the read-feature-map state when the compute state ends.
  • Waiting state: after initialization the accelerator is in the waiting state, and the working state controller waits for an external start signal. On receiving the start signal, the accelerator jumps to the read-network-parameters state; after the last convolution layer is computed, it returns to the waiting state and awaits the next trigger.
  • Read-network-parameters state: the accelerator reads the previously stored network parameters from external memory through the AXI4 bus interface, parses the returned bus data and stores it into the corresponding network parameter registers. The parameters include the data storage offset address, the data reuse mode, the function module configuration, the network size and the convolution kernel size; optimal accelerator parameters can be selected for each network size, achieving the best performance for different networks.
  • Read-BN-parameters state: the accelerator reads the BN and bias parameters from external memory through the AXI4 bus interface and stores them into the two BN parameter storage areas and the bias parameter storage area. When reading completes, it enters the read-feature-map or read-weights state according to the current data reuse mode.
  • Read-feature-map state: the number of feature map sub-buffers in use is determined by the configured network parameters; feature map data is read from external memory through the AXI4 bus interface and stored in row order into the feature map sub-buffers. When reading completes, the accelerator enters the read-weights or compute state according to the current data reuse mode.
  • Read-weights state: the number of weight sub-buffers in use is determined by the network parameters; weight data is read from external memory through the AXI4 bus interface and stored into the weight sub-buffers in filter order. When reading completes, the accelerator enters the read-feature-map or compute state according to the current data reuse mode.
  • Compute state: the accelerator reads computation data from the register stack and the weight buffer in turn and completes the convolution computation. When computation finishes, the working state controller's signal decides whether to enter the function module computation or to return to a data-reading state; once the convolution results have been output through the function module, the compute state ends and the accelerator enters the sending state.
  • Sending state: the accelerator packs the results output by the function module and sends them to the external storage area through the AXI4 bus interface. The output result width is 16 bits and the bus width is 64 bits, so 4 output results are combined into one bus word; after sending completes, the accelerator returns to the waiting state and awaits the next trigger.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

A configurable universal convolutional neural network accelerator. The accelerator comprises: a PE array, a state controller, a function module, a weight cache region, a feature map cache region, an output cache region, and a register stack; the state controller comprises a network parameter register and a working state controller. An excellent acceleration effect can be achieved for networks of different scales by configuring the network parameter register, and the working state controller controls the switching of the accelerator's working state and sends control signals to the other modules. The weight cache region, the feature map cache region, and the output cache region are each composed of multiple data sub-cache regions and respectively store weight data, feature map data, and calculation results. Suitable data reuse modes, array sizes, and numbers of sub-cache regions can be configured for different network characteristics, giving the accelerator good universality, low power consumption, and high throughput.

Description

A Configurable General Convolutional Neural Network Accelerator

Technical Field
The invention discloses a configurable general convolutional neural network accelerator, belonging to the technical field of computing, calculating and counting.
Background Art
In recent years, deep neural networks have developed rapidly and been widely applied, achieving remarkable results in fields such as text recognition, image recognition, target tracking, and face detection and recognition. The scale of deep neural networks keeps growing as application scenarios become more complex, requiring large numbers of parameters to be stored and computed. How to accelerate and implement large-scale deep neural networks in hardware has therefore become an important problem in the field of machine learning.
GPUs (Graphics Processing Units) and multi-core CPUs (Central Processing Units) are common devices for accelerating large-scale deep neural networks, but porting such networks onto mobile devices with limited power and volume budgets is almost impossible; dedicated acceleration circuits must therefore be designed to meet the computation and storage requirements of large-scale deep neural networks. Compared with GPUs and multi-core CPUs, ASICs (Application-Specific Integrated Circuits) offer higher performance and lower power consumption, but long development cycles, high cost and low design flexibility. FPGAs (Field-Programmable Gate Arrays) are another mainstream acceleration hardware; compared with ASICs they feature short development cycles, low cost and high design flexibility, but lower performance and higher power consumption. Suitable acceleration hardware should be chosen for each specific application scenario.
Existing hardware accelerators can achieve good throughput and energy efficiency for specific deep neural network structures, but in complex application scenarios the network structures keep changing, and dedicated hardware accelerators often accelerate changed structures poorly. A configurable general neural network accelerator is therefore needed for changeable deep neural network structures.
Based on the above analysis, existing accelerators suffer from low versatility and cannot adapt to changeable neural network structures. This application proposes a configurable general convolutional neural network accelerator that applies storage and computing resources at different scales to different network structures, achieving excellent throughput and energy efficiency.
Summary of the Invention
The purpose of the present invention is to provide, in view of the deficiencies of the above background art, a configurable general convolutional neural network accelerator. By configuring network parameters it accelerates convolutional neural network structures of various scales; for networks of different structures it adopts different data reuse modes and highly parallel processing units, obtaining high computing throughput with fewer resources. This solves the technical problems that existing hardware accelerators cannot meet the application requirements of changeable neural network structures and that dedicated hardware accelerators accelerate changed network structures poorly.
To achieve the above purpose, the present invention adopts the following technical solution:
A configurable general convolutional neural network accelerator includes: a state controller, a feature map buffer, a weight buffer, a register stack, a PE array, an output buffer, a function module and an AXI4 bus interface.
The state controller selects the accelerator's data reuse mode and state transition sequence according to the network parameters and controls the switching of the accelerator's working state. The feature map buffer caches feature map data read from external memory through the AXI4 bus interface and, before computation starts, stores the feature map data required for one convolution computation into the register stack. The weight buffer caches weight data read from external memory through the AXI4 bus interface and, once computation starts, feeds the weight data directly to each PE unit. The register stack caches the feature map data required for one computation and updates it incrementally after computation starts. The PE array reads feature map data from the register stack and weight data from the weight buffer and stores the convolution results in the output buffer. The output buffer stores the convolution results and sends them to the function module when computation completes. The function module performs the post-convolution bias addition, BN computation, ReLU computation, average pooling and max pooling operations, then packs the final results and sends them to external memory through the AXI4 bus interface.
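As a behavioral illustration of the function module's post-processing chain, the following sketch applies bias addition, BN, ReLU and optional pooling to one output feature map. Folding BN into a per-map scale and shift is an assumption about how the stored parameters are applied, not the patent's circuit:

```python
import numpy as np

def function_module(rows, bias, bn_scale, bn_shift, pool=None, pool_size=2):
    """Post-process convolution results: bias -> BN -> ReLU -> optional pooling.
    `rows` is one output feature map as a 2-D array."""
    x = rows + bias
    x = bn_scale * x + bn_shift          # BN assumed pre-folded to scale/shift
    x = np.maximum(x, 0)                 # ReLU
    if pool in ("max", "avg"):
        h, w = x.shape
        x = x[:h - h % pool_size, :w - w % pool_size]   # crop to a multiple
        x = x.reshape(x.shape[0] // pool_size, pool_size,
                      x.shape[1] // pool_size, pool_size)
        x = x.max(axis=(1, 3)) if pool == "max" else x.mean(axis=(1, 3))
    return x
```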
The state controller consists of a network parameter register and a working state controller. In the read-network-parameters state, the state controller reads the network parameters from external memory through the AXI4 bus interface and updates its own network parameter register. Updating the network parameter register updates the accelerator configuration, so that neural network structures of different sizes can be accelerated with optimal configuration parameters. Configuration parameters include: data reuse mode, feature map size, convolution kernel size, array size, number of sub-buffers, number of input channels, number of output channels and function module configuration information. The accelerator's working states are: waiting, reading network parameters, reading BN parameters, reading feature maps, reading weights, computing and sending. The working state controller controls the switching of the accelerator's working state according to the network parameters it has read and sends the corresponding control signals to the other modules.
Data reuse modes include input reuse, output reuse and weight reuse. An appropriate reuse mode is chosen for convolutional layers of different sizes to minimize the number of memory accesses and improve accelerator performance. The reuse mode used by each layer is configured through the network parameters. In input reuse mode, after a batch of data is computed, the input feature map is retained and the weight data is replaced: the accelerator enters the read-feature-map state, then the read-weights state, then the compute state, returning to the read-weights state after computation and repeating this sequence until the controller signals the sending state. In output reuse mode, after a batch of data is computed, the intermediate results are retained while the feature map and weight data are both replaced: the accelerator enters the read-feature-map state, then the read-weights state, then the compute state, returning to the read-feature-map state after computation and repeating this sequence until the controller signals the sending state. In weight reuse mode, after a batch of data is computed, the weights are retained and the feature map data is replaced: the accelerator enters the read-weights state, then the read-feature-map state, then the compute state, returning to the read-feature-map state after computation and repeating this sequence until the controller signals the sending state.
The feature map buffer is divided into M feature map sub-buffers, where M is determined by the number of sub-buffers in the configuration parameters. The feature map data of each input channel, read from external memory through the AXI4 interface, is stored row by row into the corresponding feature map sub-buffers. Once the last sub-buffer has stored a row of image data, the next row wraps around and is stored in the first sub-buffer. The feature map data of the next input channel is stored into the feature map buffer in the same manner.
The weight buffer is divided into N weight sub-buffers, where N is determined by the number of PE array columns in the configuration parameters. Weight data read from external memory through the AXI4 interface is stored into the weight sub-buffers in filter order. Each PE column shares one weight sub-buffer; during computation, the sub-buffer sends weights to every PE in the corresponding column.
The output buffer is divided into R output sub-buffers, where R is determined by the number of PE array rows in use; each PE row corresponds to one output sub-buffer. Each PE row outputs one row of data of multiple output feature maps, stored into that row's output sub-buffer in output-feature-map order.
Before computation starts, the register stack buffers all the feature map data required for one computation of the PE array. During computation, each time K*S feature points have been computed (K: convolution kernel size, S: stride), the feature map data in the register stack begins to be updated, guaranteeing that the required feature map data is fully cached before the next convolution computation starts.
The PE array is a two-dimensional systolic array composed of multiple arithmetic units and performs the convolution operations. Each PE row computes one row of the output feature maps, and each PE column computes one output feature map. Input feature map data enters at the first PE column and is passed in turn to each adjacent next column. Weight data is input directly to each PE from the weight sub-buffer corresponding to its column.
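A functional, non-cycle-accurate sketch of this row/column mapping: PE row r produces output row r of every output map, and PE column c produces output map c. The loop nest mirrors the array's spatial mapping rather than its timing, and stride-1 valid convolution is assumed as in the embodiment:

```python
import numpy as np

def pe_array_conv(fmap, weights, K=3, S=1):
    """fmap: (C_in, H, W); weights: (C_out, C_in, K, K).
    PE row r -> output row r of every map; PE column c -> output map c."""
    C_in, H, W = fmap.shape
    C_out = weights.shape[0]
    H_out = (H - K) // S + 1
    W_out = (W - K) // S + 1
    out = np.zeros((C_out, H_out, W_out))
    for c in range(C_out):            # one PE column per output map
        for r in range(H_out):        # one PE row per output row
            for x in range(W_out):    # the PE slides along the row over time
                window = fmap[:, r*S:r*S+K, x*S:x*S+K]
                out[c, r, x] = np.sum(window * weights[c])
    return out
```

With the embodiment's figures (15*15 padded input, K=3, S=1), each output map is 13*13, so 13 of the 14 PE rows are active and the two 16-column arrays together cover the 32 output channels of one batch.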
The AXI4 bus interface allows the accelerator to be mounted on any bus device using the AXI4 protocol. The AXI4 bus is wider than the data width used in computation, so multiple data words are spliced into one bus word for transmission, improving transmission efficiency.
The present invention adopts the above technical solution and has the following beneficial effects: the state controller configures, according to the network parameters, the optimal accelerator parameters matching the neural network structure, flexibly adjusting the PE array size and the division of the sub-buffers to meet changing application requirements and obtain the best acceleration effect under given resource constraints; at the same time, a configurable data reuse scheme applies the optimal reuse mode to each network structure, making full use of the transmission bandwidth, and the highly parallel PE array structure achieves a high data throughput rate.
Description of the Drawings
Fig. 1 is a schematic structural diagram of the general convolutional neural network accelerator disclosed in the present invention.
Fig. 2 is a schematic diagram of the data flow of the PE array in the present invention.
Fig. 3 is a schematic diagram of the workflow of the general convolutional neural network accelerator disclosed in the present invention.
Detailed Description of the Embodiments
The technical solution of the invention is described in detail below in conjunction with the drawings.
The configurable general convolutional neural network accelerator designed by the present invention is shown in Fig. 1. Its working method is described in detail with the following example: two PE arrays each of size 14*16, a 3*3 convolution kernel with stride 1, a 15*15 input feature map (after padding), 14 input channels per batch, 32 output channels per batch, and output reuse as the data reuse mode.
After the accelerator in the waiting state receives the start signal, it reads the network parameters from external memory through the bus interface, updates the network parameter register, and determines the data reuse mode and the working-state switching sequence according to the register values. The BN parameters read through the accelerator interface are split into two parts and stored in two BN parameter buffers, serving the outputs of the two PE arrays. The feature map data read through the accelerator interface is cached row by row into 5 feature map sub-buffers, each holding 3 rows of feature map data. The weight data read through the accelerator interface is stored into 32 weight sub-buffers in filter order. Before computation starts, 15*3*14 feature map data are read from the feature map sub-buffers into the register stack. During computation, the PE array fetches data from the register stack and the weight sub-buffers for the convolution operation. After every 3 multiply-accumulate operations, 15*1 feature map data in the register stack are updated; after every 3*3*14 data have been computed, one result is output. The results are stored into the corresponding output sub-buffers in feature map order. Since this layer's data reuse mode is output reuse, the accelerator returns to the read-feature-map state after computation, then enters the read-weights and compute states in turn, repeating this cycle until the state controller issues the computation-complete command. After the convolution computation finishes, the accelerator feeds the data into the function module to compute the final output, jumps from the compute state to the sending state, and sends the data to external memory through the AXI4 bus interface.
Referring to Fig. 2, the input data of each PE array row is provided by the register stack, the weight data of each column by the corresponding weight sub-buffer, and the output data of each PE row is stored into the corresponding output sub-buffer. Before computation starts, 15*3*14 feature map data are stored in the register stack. Rows 1 to 3 of the data are sent to the first PE of column 1, rows 2 to 4 to the second PE of column 1, ..., and rows 13 to 15 to the last PE of column 1. Within a 3*3 convolution window, taking the first PE of the first column as an example, data is read in column order: in the first three clock cycles it reads the first datum of each of the first three rows in the register stack. After 3*1 feature map data have been computed, the first column of feature map data in the register stack is updated. After computation the data is output to the corresponding output sub-buffer, each of which stores one row of the output feature maps. Taking the output sub-buffer of the first PE row as an example, the 13 data of the first output feature map are stored at addresses 0 to 12, addresses 13 to 25 hold the data of the second output feature map, and so on for all the output feature maps.
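The address layout in this example generalizes to address = map_index * output_row_width + column; a small sketch with the embodiment's 13-wide output rows:

```python
def output_address(map_index: int, x: int, w_out: int = 13) -> int:
    """Address of output point x of output map `map_index` inside one
    PE row's output sub-buffer: maps are stored back to back, so the
    first map occupies addresses 0..12, the second 13..25, and so on."""
    return map_index * w_out + x

assert output_address(0, 12) == 12   # end of the first output map
assert output_address(1, 0) == 13    # start of the second output map
```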
As shown in Fig. 3, the accelerator has 7 working states: waiting, reading network parameters, reading BN parameters, reading feature maps, reading weights, computing and sending. The selection and switching of the working state are determined by the accelerator's working state controller, which decides whether the read-BN-parameters state is needed by reading the network parameter register values and determines the cycling order of the read-feature-map, read-weights and compute states by checking the accelerator's data reuse mode.
In input reuse mode, the accelerator enters the read-feature-map state, then the read-weights state, then the compute state, returning to the read-weights state when the compute state ends. In output reuse mode, it enters the read-feature-map state, then the read-weights state, then the compute state, returning to the read-feature-map state when the compute state ends. In weight reuse mode, it enters the read-weights state, then the read-feature-map state, then the compute state, returning to the read-feature-map state when the compute state ends.
Waiting state: after initialization the accelerator is in the waiting state, and the working state controller waits for an external start signal. On receiving the start signal, the accelerator jumps to the read-network-parameters state; after the last convolution layer is computed, it returns to the waiting state and awaits the next trigger.
Read-network-parameters state: the accelerator reads the previously stored network parameters from external memory through the AXI4 bus interface, parses the returned bus data and stores it into the corresponding network parameter registers. The parameters include the data storage offset address, the data reuse mode, the function module configuration, the network size and the convolution kernel size; optimal accelerator parameters can be selected for each network size, achieving the best performance for different networks.
Read-BN-parameters state: the accelerator reads the BN and bias parameters from external memory through the AXI4 bus interface and stores them into the two BN parameter storage areas and the bias parameter storage area. When reading completes, it enters the read-feature-map or read-weights state according to the current data reuse mode.
Read-feature-map state: the number of feature map sub-buffers in use is determined by the configured accelerator network parameters; feature map data is read from external memory through the AXI4 bus interface and stored in row order into the feature map sub-buffers. When reading completes, the accelerator enters the read-weights or compute state according to the current data reuse mode.
Read-weights state: the number of weight sub-buffers in use is determined by the network parameters; weight data is read from external memory through the AXI4 bus interface and stored into the weight sub-buffers in filter order. When reading completes, the accelerator enters the read-feature-map or compute state according to the current data reuse mode.
Compute state: the accelerator reads computation data from the register stack and the weight buffer in turn and completes the convolution computation. When computation finishes, the working state controller's signal decides whether to enter the function module computation or to return to a data-reading state; once the convolution results have been output through the function module, the compute state ends and the accelerator enters the sending state.
Sending state: the accelerator packs the results output by the function module and sends them to the external storage area through the AXI4 bus interface. In the embodiment the output result width is 16 bits and the bus width is 64 bits, so 4 output results are combined into one bus word; after sending completes, the accelerator returns to the waiting state and awaits the next trigger.
The embodiments merely illustrate the technical ideas of the present invention and cannot limit its scope of protection; any modification made on the basis of the disclosed technical solution that conforms to the inventive concept of this application falls within the protection scope of the present invention.

Claims (8)

  1. A configurable general convolutional neural network accelerator, characterized in that it comprises:
    a state controller, which reads network parameters from an external memory, configures accelerator parameters including the data reuse mode, the array size and the number of sub-buffers according to the network parameters, and switches the accelerator's working state according to the data reuse mode;
    a feature map buffer comprising multiple sub-buffers, which caches row by row the feature map data read from the external memory, according to the number of sub-buffers configured by the state controller;
    a register stack, which caches the feature map data required for one computation of the PE array;
    a weight buffer comprising multiple sub-buffers, which caches in filter order the weight data read from the external memory, according to the number of sub-buffers configured by the state controller;
    a PE array, in which each row of PE units reads feature map data from the register stack and each column of PE units reads the weight data cached in one shared weight sub-buffer, performing convolution computation on the feature map data and the weight data; and
    an output buffer comprising multiple sub-buffers, which caches the rows of different output feature maps produced by each row of PE units.
  2. The configurable general convolutional neural network accelerator according to claim 1, characterized in that the network parameters read by the state controller from the external memory include the convolutional layer size, according to which the state controller configures the data reuse mode with the fewest memory accesses; the data reuse modes include: input data reuse mode, weight data reuse mode and output data reuse mode.
  3. The configurable general convolutional neural network accelerator according to claim 1, characterized in that the accelerator further comprises:
    a BN parameter storage area, which caches the BN parameters read from the external memory when the network parameters read by the state controller include function module configuration information;
    a bias parameter storage area, which caches the bias parameters read from the external memory when the network parameters read by the state controller include function module configuration information; and
    a function module which, upon receiving the state controller's instruction to perform function operations, applies bias addition, normalization, activation and pooling in turn to the feature map row data stored in the output buffer, and finally outputs the computation results of the neural network.
  4. The configurable general convolutional neural network accelerator according to claim 1, characterized in that the number of sub-buffers of the feature map buffer is determined by the number of sub-buffers configured by the state controller.
  5. The configurable general convolutional neural network accelerator according to claim 1, characterized in that the number of sub-buffers of the weight buffer is determined by the number of array columns configured by the state controller.
  6. The configurable general convolutional neural network accelerator according to claim 1, characterized in that the number of sub-buffers of the output buffer is determined by the number of array rows configured by the state controller.
  7. The configurable general convolutional neural network accelerator according to claim 2, characterized in that the state controller initializes the accelerator into the read-network-parameters state and, after completing the accelerator parameter configuration according to the read network parameters, switches the accelerator in input data reuse mode through the read-feature-map, read-weights and compute states in turn; in weight data reuse mode through the read-weights, read-feature-map and compute states in turn; and in output data reuse mode through the read-feature-map, read-weights and compute states in turn; after the convolution computation is completed, the accelerator switches to the data-sending state.
  8. The configurable general convolutional neural network accelerator according to claim 7, characterized in that when the network parameters read by the state controller from the external memory include function module configuration information, the accelerator switches to the state of reading the BN parameters and bias parameters, and then switches working states according to the data reuse mode.
PCT/CN2019/105533 2019-06-25 2019-09-12 Configurable universal convolutional neural network accelerator WO2020258528A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910554533.7A CN110390384B (en) 2019-06-25 2019-06-25 Configurable general convolutional neural network accelerator
CN201910554533.7 2019-06-25

Publications (1)

Publication Number Publication Date
WO2020258528A1 true WO2020258528A1 (en) 2020-12-30

Family

ID=68285786

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/105533 WO2020258528A1 (en) 2019-06-25 2019-09-12 Configurable universal convolutional neural network accelerator

Country Status (2)

Country Link
CN (1) CN110390384B (en)
WO (1) WO2020258528A1 (en)

Families Citing this family (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111382094B (en) * 2018-12-29 2021-11-30 深圳云天励飞技术有限公司 Data processing method and device
CN112819022B (en) * 2019-11-18 2023-11-07 同方威视技术股份有限公司 Image recognition device and image recognition method based on neural network
US11216375B2 (en) 2020-02-26 2022-01-04 Hangzhou Zhicun Intelligent Technology Co., Ltd. Data caching
CN113313228B (en) * 2020-02-26 2022-10-14 杭州知存智能科技有限公司 Data caching circuit and method
CN111401543B (en) * 2020-06-08 2020-11-10 深圳市九天睿芯科技有限公司 Neural network accelerator with full on-chip storage and implementation method thereof
CN111832717B (en) * 2020-06-24 2021-09-28 上海西井信息科技有限公司 Chip and processing device for convolution calculation
CN111967587B (en) * 2020-07-27 2024-03-29 复旦大学 Method for constructing operation unit array structure facing neural network processing
CN111626414B (en) * 2020-07-30 2020-10-27 电子科技大学 Dynamic multi-precision neural network acceleration unit
CN111931911B (en) * 2020-07-30 2022-07-08 山东云海国创云计算装备产业创新中心有限公司 CNN accelerator configuration method, system and device
CN112232499B (en) * 2020-10-13 2022-12-23 华中光电技术研究所(中国船舶重工集团公司第七一七研究所) Convolutional neural network accelerator
KR20220049325A (en) 2020-10-14 2022-04-21 삼성전자주식회사 Accelerator and electronic device including the same
CN112465110B (en) * 2020-11-16 2022-09-13 中国电子科技集团公司第五十二研究所 Hardware accelerator for convolution neural network calculation optimization
CN112766479B (en) * 2021-01-26 2022-11-11 东南大学 Neural network accelerator supporting channel separation convolution based on FPGA
CN112949847B (en) * 2021-03-29 2023-07-25 上海西井科技股份有限公司 Neural network algorithm acceleration system, scheduling system and scheduling method
CN113570034B (en) * 2021-06-18 2022-09-27 北京百度网讯科技有限公司 Processing device, neural network processing method and device
CN113807509B (en) * 2021-09-14 2024-03-22 绍兴埃瓦科技有限公司 Neural network acceleration device, method and communication equipment
CN113792868B (en) * 2021-09-14 2024-03-29 绍兴埃瓦科技有限公司 Neural network computing module, method and communication equipment
CN113792687A (en) * 2021-09-18 2021-12-14 兰州大学 Human intrusion behavior early warning system based on monocular camera
CN114239816B (en) * 2021-12-09 2023-04-07 电子科技大学 Reconfigurable hardware acceleration architecture of convolutional neural network-graph convolutional neural network
CN114820630B (en) * 2022-07-04 2022-09-06 国网浙江省电力有限公司电力科学研究院 Target tracking algorithm model pipeline acceleration method and circuit based on FPGA
CN116010313A (en) * 2022-11-29 2023-04-25 中国科学院深圳先进技术研究院 Universal and configurable image filtering calculation multi-line output system and method

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11775313B2 (en) * 2017-05-26 2023-10-03 Purdue Research Foundation Hardware accelerator for convolutional neural networks and method of operation thereof
CN108241890B (en) * 2018-01-29 2021-11-23 清华大学 Reconfigurable neural network acceleration method and architecture

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018196863A1 (en) * 2017-04-28 2018-11-01 北京市商汤科技开发有限公司 Convolution acceleration and calculation processing methods and apparatuses, electronic device and storage medium
CN107341544A (en) * 2017-06-30 2017-11-10 清华大学 A kind of reconfigurable accelerator and its implementation based on divisible array
CN108805272A (en) * 2018-05-03 2018-11-13 东南大学 A kind of general convolutional neural networks accelerator based on FPGA
CN109102065A (en) * 2018-06-28 2018-12-28 广东工业大学 A kind of convolutional neural networks accelerator based on PSoC
CN109598338A (en) * 2018-12-07 2019-04-09 东南大学 A kind of convolutional neural networks accelerator of the calculation optimization based on FPGA

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210295145A1 (en) * 2020-03-23 2021-09-23 Mentium Technologies Inc. Digital-analog hybrid system architecture for neural network acceleration
CN113222129A (en) * 2021-04-02 2021-08-06 西安电子科技大学 Convolution operation processing unit and system based on multi-level cache cyclic utilization
CN113222129B (en) * 2021-04-02 2024-02-13 西安电子科技大学 Convolution operation processing unit and system based on multi-level cache cyclic utilization
CN113313251B (en) * 2021-05-13 2023-05-23 中国科学院计算技术研究所 Depth separable convolution fusion method and system based on data flow architecture
CN113313251A (en) * 2021-05-13 2021-08-27 中国科学院计算技术研究所 Deep separable convolution fusion method and system based on data stream architecture
CN113962361A (en) * 2021-10-09 2022-01-21 西安交通大学 Winograd-based data conflict-free scheduling method for CNN accelerator system
CN113962361B (en) * 2021-10-09 2024-04-05 西安交通大学 Winograd-based CNN accelerator system data conflict-free scheduling method
CN114707649A (en) * 2022-03-28 2022-07-05 北京理工大学 General convolution arithmetic device
CN114781632A (en) * 2022-05-20 2022-07-22 重庆科技学院 Deep neural network accelerator based on dynamic reconfigurable pulse tensor operation engine
CN114997386B (en) * 2022-06-29 2024-03-22 桂林电子科技大学 CNN neural network acceleration design method based on multi-FPGA heterogeneous architecture
CN114997386A (en) * 2022-06-29 2022-09-02 桂林电子科技大学 CNN neural network acceleration design method based on multi-FPGA heterogeneous architecture
CN116050474A (en) * 2022-12-29 2023-05-02 上海天数智芯半导体有限公司 Convolution calculation method, SOC chip, electronic equipment and storage medium
CN115965067B (en) * 2023-02-01 2023-08-25 苏州亿铸智能科技有限公司 Neural network accelerator for ReRAM
CN115965067A (en) * 2023-02-01 2023-04-14 苏州亿铸智能科技有限公司 Neural network accelerator for ReRAM
CN118070855A (en) * 2024-04-18 2024-05-24 南京邮电大学 Convolutional neural network accelerator based on RISC-V architecture
CN118070855B (en) * 2024-04-18 2024-07-09 南京邮电大学 Convolutional neural network accelerator based on RISC-V architecture

Also Published As

Publication number Publication date
CN110390384B (en) 2021-07-06
CN110390384A (en) 2019-10-29

Similar Documents

Publication Publication Date Title
WO2020258528A1 (en) Configurable universal convolutional neural network accelerator
CN108171317B (en) Data multiplexing convolution neural network accelerator based on SOC
CN110390385B (en) BNRP-based configurable parallel general convolutional neural network accelerator
CN109598338B (en) Convolutional neural network accelerator based on FPGA (field programmable Gate array) for calculation optimization
US20190026626A1 (en) Neural network accelerator and operation method thereof
CN107301455B (en) Hybrid cube storage system for convolutional neural network and accelerated computing method
CN110334799B (en) Neural network reasoning and training accelerator based on storage and calculation integration and operation method thereof
CN108416437A (en) The processing system and method for artificial neural network for multiply-add operation
CN108805272A (en) A kind of general convolutional neural networks accelerator based on FPGA
CN110516801A (en) A kind of dynamic reconfigurable convolutional neural networks accelerator architecture of high-throughput
CN108537331A (en) A kind of restructural convolutional neural networks accelerating circuit based on asynchronous logic
CN105912501B (en) A kind of SM4-128 Encryption Algorithm realization method and systems based on extensive coarseness reconfigurable processor
CN115880132B (en) Graphics processor, matrix multiplication task processing method, device and storage medium
CN115860080B (en) Computing core, accelerator, computing method, apparatus, device, medium, and system
US20230128421A1 (en) Neural network accelerator
US20230376733A1 (en) Convolutional neural network accelerator hardware
RU2294561C2 (en) Device for hardware realization of probability genetic algorithms
CN113673691A (en) Storage and computation combination-based multi-channel convolution FPGA (field programmable Gate array) framework and working method thereof
CN106569968A (en) Inter-array data transmission structure and scheduling method used for reconfigurable processor
CN101452572A (en) Image rotating VLSI structure based on cubic translation algorithm
CN117291240B (en) Convolutional neural network accelerator and electronic device
US11068200B2 (en) Method and system for memory control
CN115965067B (en) Neural network accelerator for ReRAM
CN113177877B (en) Schur elimination accelerator oriented to SLAM rear end optimization
Rezaei et al. Smart Memory: Deep Learning Acceleration In 3D-Stacked Memories

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19935697

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19935697

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 31/08/2022)
