WO2020258841A1 - Deep neural network hardware accelerator based on power exponent quantization - Google Patents

Deep neural network hardware accelerator based on power exponent quantization

Info

Publication number
WO2020258841A1
WO2020258841A1 PCT/CN2020/071150 CN2020071150W
Authority
WO
WIPO (PCT)
Prior art keywords
data
input
weight
shift
buffer
Prior art date
Application number
PCT/CN2020/071150
Other languages
English (en)
French (fr)
Inventor
陆生礼
庞伟
武瑞利
樊迎博
刘昊
黄成
Original Assignee
东南大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 东南大学 filed Critical 东南大学
Priority to US17/284,480 priority Critical patent/US20210357736A1/en
Publication of WO2020258841A1 publication Critical patent/WO2020258841A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F5/00Methods or arrangements for data conversion without changing the order or content of the data handled
    • G06F5/01Methods or arrangements for data conversion without changing the order or content of the data handled for shifting, e.g. justifying, scaling, normalising
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/544Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
    • G06F7/552Powers or roots, e.g. Pythagorean sums
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/544Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
    • G06F7/556Logarithmic or exponential functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2207/00Indexing scheme relating to methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F2207/38Indexing scheme relating to groups G06F7/38 - G06F7/575
    • G06F2207/48Indexing scheme relating to groups G06F7/48 - G06F7/575
    • G06F2207/4802Special implementations
    • G06F2207/4818Threshold devices
    • G06F2207/4824Neural networks

Definitions

  • The invention discloses a deep neural network hardware accelerator based on power exponent quantization, relates to a processor structure for hardware acceleration of deep neural network convolution computation, and belongs to the technical field of computing, calculating and counting.
  • Deep learning is a hot area of machine learning research and is widely used in mainstream artificial intelligence (AI) algorithms; the deep convolutional neural network (DCNN) is one of its key techniques.
  • Deep convolutional neural networks often have billions or even tens of billions of parameters, which lets them outperform other existing machine learning algorithms in accuracy, but their huge computing and storage requirements make them difficult to deploy on resource-limited devices such as mobile communication devices, Internet of Things devices, wearable devices and robots.
  • Some researchers have made great efforts to compress CNN models by training CNN model data with low precision; BinaryNet, XNOR-net and DoReFa-net are all compressed CNN models. However, the compressed models still use floating-point data, which is not particularly favorable for hardware design.
  • A hardware-friendly quantization scheme is to quantize the model data into a power-of-two representation.
  • The purpose of the present invention is to provide a deep neural network hardware accelerator based on power exponent quantization in view of the deficiencies of the above-mentioned background art.
  • By designing a hardware accelerator that performs shift operations based on the result of power exponent quantization of the neural network parameters, the invention avoids the use of complex multiplication circuits for floating-point multiplication, reduces processor power consumption and chip area, and solves the technical problems of existing processors for deep neural network convolution operations: complex circuits, large storage requirements, and high power consumption.
  • The hardware accelerator includes an AXI-4 bus interface, an input buffer, an output buffer, a weight buffer, a weight index buffer, an encoding module, a configurable state controller module, and a PE array.
  • The PE array contains R*C PE units, and each PE unit replaces floating-point multiplication with binary shift operations; the input buffer and output buffer are designed as line buffer structures, and the input buffer and weight buffer are used to cache, respectively, the input data and the power-exponent-quantized weight data read from the external memory DDR over the AXI-4 bus.
  • The AXI-4 bus interface allows the accelerator to be mounted on any bus device using the AXI-4 protocol; the output buffer caches the computation results output by the PE array; the configurable state controller module controls the working state of the accelerator and switches between working states; the encoder encodes the quantized weight data according to an ordered quantization set to obtain weight index values that contain the sign of each weight and its position in the ordered quantization set, and the ordered quantization set stores all possible absolute values of the quantized weights (in power-of-two form).
  • During computation, the PE units read data from the input buffer and the weight index buffer, perform the calculation, and send the results to the output buffer when the calculation is finished.
  • Because shift operations replace multiplications, few DSP multipliers are used, hardware resources are abundant, and a systolic array offers high parallelism that can greatly improve the throughput of the accelerator, so the PE array is designed as a systolic array.
  • Data is loaded from the input buffer into the leftmost column of the PE array, and every clock cycle the input data moves one step to the right; each column of the PE array outputs different points of the same output channel in parallel, and different columns output data of different output channels.
  • The PE unit uses the weight index value to determine the sign of the quantized weight and the number of bits and direction by which the input data is shifted: if the weight index value is positive, the quantized weight is positive, and the absolute value of the weight index is used to look up the shift table (whose entries are determined by the exponents of the powers of two in the ordered quantization set and are arranged in the same order as the data in the quantization set) to determine the shift direction and the number of bits to shift.
  • The number of rows of the input buffer is determined by the convolution kernel size of the current layer of the deep neural network, the convolution stride, and the size of the output feature map.
  • The data range of the ordered quantization set is determined by the weight quantization precision and the maximum absolute value of the unquantized weights.
  • The data in the ordered quantization set are arranged in a fixed order, and the stored values are the quantized absolute values. For example, if the weights are quantized to 5 bits, the quantization set may be {2¹, 2⁰, 2⁻¹, 2⁻², 2⁻³}.
  • The direction and magnitude of the shift applied to the input data are determined from the shift table stored in the PE unit: if the corresponding value in the shift table is positive the input data is shifted left, and if it is negative the input data is shifted right; the absolute value of each element of the shift table represents the shift magnitude.
  • When the weight index value is positive the shift operation is performed directly; when the weight index value is negative, the input data is negated first and then shifted.
  • The weight index value obtained after encoding consists of two parts: a sign and an index.
  • The sign represents the sign of the quantized weight data.
  • The index represents the position of the absolute value of the weight data in the ordered quantization set.
  • The AXI-4 bus interface concatenates several data items into one transfer, which increases the computation speed.
  • The present invention replaces multiplication with shift operations in the accelerator computation and implements the binary shift operations through a systolic PE array, avoiding complex multiplication circuits, lowering the requirements on computing resources, storage resources and communication bandwidth, and thereby improving the computational efficiency of the accelerator.
  • The PE unit implements the binary shift operation by looking up a shift table corresponding to the exponents obtained from power exponent quantization; the hardware is easy to implement and occupies a small area, which reduces hardware resource consumption and energy consumption and achieves high computing performance.
  • When the hardware accelerator of this application is used for a neural network based on power exponent quantization, binary shift convolution operations based on the power exponent quantization results replace the traditional floating-point multiplication, reducing the computational load of the convolution operations; lowering the precision of the neural network parameters reduces their storage requirements; the computation rate of the neural network is increased while the hardware resources consumed by its complex operations are reduced, creating the possibility of applying compressed neural networks on hardware such as small embedded terminal devices.
  • Fig. 1 is a schematic diagram of the structure of the hardware accelerator disclosed in the present invention.
  • Fig. 2 is a schematic diagram of the structure of the PE unit disclosed in the present invention.
  • Figure 3 is a schematic diagram of the line buffer structure.
  • Fig. 4(a) is a schematic diagram of weight coding
  • Fig. 4(b) is a schematic diagram of a table lookup method for weight index values.
  • Figure 5 is a schematic diagram of the configurable state controller switching the working state of the hardware accelerator.
  • The accelerator caches the input data and the power-exponent-quantized weight data into the input buffer and the weight buffer over the AXI-4 bus.
  • According to the size of the PE array, the AXI-4 bus reads 16 convolution kernels from the DDR and stores them in the weight buffer, so that 16 weight index values can be fed into the PE array in parallel.
  • The convolution kernel data stored in the weight buffer is read into the encoding module, and the weight data is then encoded in a fixed way according to the sign of the power-exponent-quantized weights and their positions in the ordered quantization set, so that shift computation can be performed in the PE units.
  • During computation, the input buffer outputs 16 data items in parallel by row, which are input in parallel to the first PE unit of each row of the PE array for computation and passed in turn to the adjacent PE units; the input data is buffered in the input sub-buffer of each PE unit.
  • The weight index values are input in parallel through 16 weight index sub-buffers to the first PE unit of each column of the PE array, passed in turn to the adjacent PE units in each column, and finally buffered in the weight index sub-buffer of each PE unit.
  • Each PE unit reads the weight index value from its weight index sub-buffer and determines the direction and magnitude of the data shift by the table lookup method, i.e. by looking up the established shift table.
  • Each PE unit is shown in Fig. 2: the input data is buffered in the input sub-buffer, and the encoded convolution kernel data (weight index values) is buffered in the weight index sub-buffer.
  • The PE control unit generates, from the weight index information, the flag bits S1S0 that configure the shift operation mode of the PE unit: the flag bit S0, representing the negation operation, is generated from the sign of the weight, and the flag bit S1, representing the shift direction, is generated from the position of the weight in the ordered quantization set; data is read from the weight index sub-buffer and the input sub-buffer for processing.
  • The PE unit has two flag bits S1 and S0. When S0 is 0 the weight is negative, and the value with the opposite sign to data A is used for the shift operation; otherwise data A is shifted directly. When S1 is 0, the corresponding value b in the shift table is positive and the output data is sent to the data left-shift unit for the shift operation; when S1 is 1, the value b is negative and the output data is sent to the data right-shift unit for the shift operation.
  • The absolute value of the shift value b represents the magnitude of the shift.
  • The input buffer and output buffer use the line buffer structure shown in Fig. 3.
  • When the input feature map is stored in the input buffer in blocks, (r-1)*n data items are shared between adjacent blocks, where r is the number of columns of the convolution kernel and n is the number of rows per block of the input feature map. To reuse data, n+m line buffers are provided in on-chip memory, m being the convolution stride.
  • Each input line buffer holds M*W data items, where M is the number of channels of the input feature map and W is its width; each output line buffer holds N*C data items, where N is the number of channels of the output feature map and C is its width.
  • The line buffers form a circular buffer.
  • The accelerator reads n rows of data from the line buffers for computation while it reads data from the external memory DDR into the remaining m line buffers, so the computation of the n rows and the caching of the m rows are executed in parallel. After the n rows of data have been computed, the accelerator skips m rows and continues reading the next n rows for computation, and the skipped m rows are overwritten by incoming data.
  • The encoded convolution kernel data (weight index value) consists of two parts: the sign shown in Fig. 4(a) and the index shown in Fig. 4(b); the sign represents the sign of the power-exponent-quantized weight data.
  • The index represents the position of the power-exponent-quantized weight data in the ordered quantization set.
  • The encoded convolution kernel data determines the direction and magnitude of the shift by the table lookup method. For example, with the quantization set {2¹, 2⁰, 2⁻¹, 2⁻², 2⁻³} and the shift table {1, 0, -1, -2, -3}, the quantized weight 2 is encoded as 0; looking up the shift table yields 1, so the input data is shifted left by 1 bit.
  • The accelerator enters the waiting (idle) state and waits for the state signal flag.
  • When flag is 001, it enters the sending-input-data state.
  • When flag is 010, it enters the sending-convolution-kernel-data state.
  • When flag is 10000, it enters the data computation state. When the computation is finished, it automatically jumps to the sending-computation-results state and returns to the waiting state after the transmission is completed, waiting for the next data write.
  • Sending input data (map) state: when flag is 001, the accelerator reads data from the DDR over the AXI-4 bus and reads 16 rows of the input feature map into the input buffer; because the input buffer is designed as line buffers, 16 data items can be output in parallel from the 16 line buffers of the input buffer in one clock cycle and input in parallel into the input sub-buffers of each row of the PE array, and every clock cycle the data moves to the right within the PE array.
  • Sending convolution kernel data (weight) state: when flag is 010, the accelerator reads 16 convolution kernels (power-exponent-quantized weight data) from the DDR and stores them in the weight buffer, encodes the data to obtain the weight indices, and stores the weight indices in the weight index buffer. Within one clock cycle the weight index buffer outputs 16 data items in parallel to the 16 PE units of each column of the PE array, and the data is finally buffered in the weight index sub-buffer of each PE unit.
  • Data computation (cal) state: when flag is 10000, the shift table is looked up according to the position information carried by the weight indices in the weight index sub-buffers to determine the shift direction and magnitude for the input data; after 3*3*(number of input channels) shift operations have been performed, all data has been computed and the next clock cycle jumps to the sending-computation-results state.
  • Sending computation results (send) state: the computation results are read out from the 16 result buffers in turn; the data of the first output channel is taken from each result buffer, every four items are packed into one 64-bit output word and sent to the external memory DDR through the AXI-4 bus interface; after the data of all 16 output channels has been sent to the external memory DDR in turn, the accelerator jumps back to the waiting state.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Neurology (AREA)
  • Mathematical Optimization (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Complex Calculations (AREA)

Abstract

A deep neural network hardware accelerator based on power exponent quantization relates to a processor structure for hardware acceleration of deep neural network convolution computation and belongs to the technical field of computing, calculating and counting. The hardware accelerator comprises an AXI-4 bus interface, an input buffer, an output buffer, a weight buffer, a weight index buffer, an encoding module, a configurable state controller module and a PE array. The input buffer and the output buffer are designed as line buffer structures; the encoder encodes the weights according to an ordered quantization set, which stores all possible values of the absolute values of the quantized weights. During computation, the PE units read data from the input buffer and the weight index buffer, perform shift operations, and send the results to the output buffer. The accelerator replaces floating-point multiplication with shift operations, which lowers the requirements on computing resources, storage resources and communication bandwidth and thereby improves the computational efficiency of the accelerator.

Description

Deep neural network hardware accelerator based on power exponent quantization
Technical Field
The present invention discloses a deep neural network hardware accelerator based on power exponent quantization, relates to a processor structure for hardware acceleration of deep neural network convolution computation, and belongs to the technical field of computing, calculating and counting.
Background Art
In recent years, artificial intelligence has permeated every area of life and has had a major impact on the world economy and social activities. Deep learning is a hot area of machine learning research and is widely used in mainstream artificial intelligence algorithms. The deep convolutional neural network (DCNN), as one deep learning technique, is now widely used in many artificial intelligence (AI) applications and has achieved remarkable results in technical fields such as computer vision, speech recognition and robotics, especially in image recognition.
Deep convolutional neural networks often have billions or even tens of billions of parameters, which lets them outperform other existing machine learning algorithms in accuracy, but their huge computing and storage requirements make them difficult to deploy on small, resource-limited devices such as mobile communication devices, Internet of Things devices, wearable devices and robots. To reduce the demands on computation, memory and communication bandwidth, some researchers have made great efforts to compress CNN models by training CNN model data with low precision; BinaryNet, XNOR-net and DoReFa-net are all compressed CNN models. However, the compressed models still use floating-point data, which is not particularly favorable for hardware design. A hardware-friendly quantization scheme is to quantize the model data into a power-of-two representation; practice has shown that a logarithmic data representation allows a CNN to be quantized down to 3 bits without any significant loss of accuracy. In addition, researchers have proposed incremental network quantization, which quantizes a CNN model to 4 bits without any drop in accuracy. Quantizing model data into a power-of-two representation converts most of the computationally demanding multiplications into efficient bitwise shift operations, which reduces the computation and storage requirements. The present application aims to propose a neural network hardware accelerator that adopts this quantization scheme, simplifying the implementation of the shift operations in the hardware design and thereby reducing the hardware resources consumed by the complex multiplications of deep neural networks.
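As an illustration of this idea, the following minimal Python sketch quantizes a floating-point weight to the nearest power of two and replaces the multiplication by a bit shift on a fixed-point activation; the exponent range, bit handling and function names are assumptions made for the example, not part of the patent text.

```python
import math

def quantize_to_power_of_two(w, exp_min=-3, exp_max=1):
    """Quantize weight w to sign * 2^e with e clipped to [exp_min, exp_max] (assumed range)."""
    if w == 0.0:
        return 0, exp_min          # treat zero as the smallest magnitude
    sign = 1 if w > 0 else -1
    e = int(round(math.log2(abs(w))))
    e = max(exp_min, min(exp_max, e))
    return sign, e

def shift_multiply(x_fixed, sign, e):
    """Multiply a fixed-point activation by sign*2^e using shifts only."""
    y = x_fixed << e if e >= 0 else x_fixed >> -e
    return y if sign > 0 else -y

# Example: activation 24 (fixed point), weight 0.26 is quantized to 2^-2
s, e = quantize_to_power_of_two(0.26)
print(s, e, shift_multiply(24, s, e))   # -> 1 -2 6, i.e. 24 * 0.25
```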
Summary of the Invention
In view of the deficiencies of the above background art, the purpose of the present invention is to provide a deep neural network hardware accelerator based on power exponent quantization. By designing a hardware accelerator that performs shift operations based on the result of power exponent quantization of the neural network parameters, the invention avoids using complex multiplication circuits to implement floating-point multiplication, reduces processor power consumption and chip area, and solves the technical problems of existing processors for deep neural network convolution operations, namely complex circuits, huge storage requirements and high power consumption.
To achieve the above purpose, the invention adopts the following technical solution:
The deep neural network is trained with the incremental network quantization method so that the weights are quantized into powers of two (done in software), allowing the multiplications of the neural network to be implemented by shifts. The hardware accelerator includes an AXI-4 bus interface, an input buffer, an output buffer, a weight buffer, a weight index buffer, an encoding module, a configurable state controller module and a PE array. The PE array contains R*C PE units, and each PE unit replaces floating-point multiplication with binary shift operations. The input buffer and output buffer are designed as line buffer structures; the input buffer and the weight buffer are used to cache, respectively, the input data and the power-exponent-quantized weight data read from the external memory DDR over the AXI-4 bus, and the AXI-4 bus interface allows the accelerator to be mounted on any bus device using the AXI-4 protocol. The output buffer caches the results output by the PE array. The configurable state controller module controls the working state of the accelerator and switches between working states. The encoder encodes the quantized weight data according to an ordered quantization set to obtain weight index values that contain the sign of each weight and its position in the ordered quantization set; the ordered quantization set stores all possible absolute values of the quantized weights (in power-of-two form). During computation, the PE units read data from the input buffer and the weight index buffer, perform the computation, and send the results to the output buffer when the computation is finished.
Because the design replaces multiplication with shift operations, few DSP multipliers are used and hardware resources are abundant; a systolic array has a high degree of parallelism and can greatly improve the throughput of the accelerator, so the PE array is designed as a systolic array. Data is loaded from the input buffer into the leftmost column of the PE array, and the input data moves one step to the right every clock cycle; each column of the PE array outputs different points of the same output channel in parallel, and different columns output data of different output channels.
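A purely behavioral sketch of this dataflow (a toy model under the assumption that each PE simply forwards its buffered input to its right-hand neighbour; names and the array size are illustrative):

```python
def step_systolic_row(row_regs, new_input):
    """Advance one row of the systolic array by one clock cycle:
    every PE passes its buffered input to its right-hand neighbour,
    and the leftmost PE receives fresh data from the input buffer."""
    return [new_input] + row_regs[:-1]   # data moves one step to the right

row = [None, None, None, None]           # one row with 4 PEs (toy size)
for cycle, x in enumerate([10, 11, 12, 13]):
    row = step_systolic_row(row, x)
    print(cycle, row)
```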
The PE unit uses the weight index value to determine the sign of the quantized weight and the number of bits and direction by which the input data is shifted: if the weight index value is positive, the quantized weight is positive; the absolute value of the weight index is then used to look up the shift table (the entries of the shift table are determined by the exponents of the powers of two in the ordered quantization set and are arranged in the same order as the data in the quantization set) to determine the shift direction and the number of bits to shift.
On the basis of the above technical solution, the number of rows of the input buffer is determined by the convolution kernel size of the current layer of the deep neural network, the convolution stride, and the size of the output feature map.
On the basis of the above technical solution, the data range of the ordered quantization set is determined by the weight quantization precision and the maximum absolute value of the unquantized weights. The data in the ordered quantization set are arranged in a fixed order, and the stored values are the quantized absolute values. For example, if the weights are quantized to 5 bits, the quantization set may be {2¹, 2⁰, 2⁻¹, 2⁻², 2⁻³}.
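A small sketch of how such an ordered quantization set and its shift table could be derived (illustrative only; the exponent range and function name are assumptions chosen to reproduce the example set above):

```python
def build_quantization_tables(max_exp=1, num_levels=5):
    """Build the ordered quantization set {2^max_exp, ..., 2^(max_exp-num_levels+1)}
    and the matching shift table holding the exponents in the same order."""
    exponents = [max_exp - i for i in range(num_levels)]     # e.g. [1, 0, -1, -2, -3]
    quant_set = [2.0 ** e for e in exponents]                # e.g. [2, 1, 0.5, 0.25, 0.125]
    shift_table = exponents                                  # shift amount = exponent
    return quant_set, shift_table

quant_set, shift_table = build_quantization_tables()
print(quant_set)    # [2.0, 1.0, 0.5, 0.25, 0.125]
print(shift_table)  # [1, 0, -1, -2, -3]
```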
On the basis of the above technical solution, the direction and magnitude of the shift applied to the input data are determined from the shift table stored in the PE unit: if the corresponding value in the shift table is positive the input data is shifted left, and if it is negative the input data is shifted right; the absolute value of each element of the shift table represents the shift magnitude. When the weight index value is positive the shift operation is performed directly; when the weight index value is negative, the input data is negated first and then shifted.
On the basis of the above technical solution, the weight index value obtained after encoding consists of two parts, a sign and an index: the sign represents the sign of the quantized weight data, and the index represents the position of the absolute value of the weight data in the ordered quantization set.
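For illustration, a minimal encoder in this style might look like the following sketch (the sign convention, data layout and function name are assumptions, not the patent's circuit description):

```python
def encode_weight(w_quantized, quant_set):
    """Encode a power-of-two quantized weight as (sign, index):
    sign  - 1 if the weight is non-negative, 0 if it is negative (assumed convention)
    index - position of |w| in the ordered quantization set."""
    sign = 1 if w_quantized >= 0 else 0
    index = quant_set.index(abs(w_quantized))
    return sign, index

# Example with the set {2, 1, 0.5, 0.25, 0.125}: weight -0.25 -> (0, 3)
print(encode_weight(-0.25, [2.0, 1.0, 0.5, 0.25, 0.125]))
```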
On the basis of the above technical solution, the AXI-4 bus interface concatenates several data items into one data word per transfer, which increases the computation speed.
By adopting the above technical solution, the present invention has the following beneficial effects:
(1) The invention replaces multiplication with shift operations in the accelerator computation and implements the binary shift operations through a systolic PE array, avoiding complex multiplication circuits, lowering the requirements on computing resources, storage resources and communication bandwidth, and thereby improving the computational efficiency of the accelerator;
(2) The PE unit implements the binary shift operation by looking up a shift table corresponding to the exponents obtained from power exponent quantization; the hardware is easy to implement and occupies a small area, which reduces hardware resource consumption and energy consumption and achieves high computing performance;
(3) When the hardware accelerator of the present application is used for a neural network based on power exponent quantization, binary shift convolution operations based on the power exponent quantization results replace traditional floating-point multiplication, reducing the computational load of the convolution operations; lowering the precision of the neural network parameters reduces their storage requirements; the computation rate of the neural network is increased while the hardware resources consumed by its complex operations are reduced, creating the possibility of applying compressed neural networks on hardware such as small embedded terminal devices.
Brief Description of the Drawings
Fig. 1 is a schematic diagram of the structure of the hardware accelerator disclosed in the present invention.
Fig. 2 is a schematic diagram of the structure of the PE unit disclosed in the present invention.
Fig. 3 is a schematic diagram of the line buffer structure.
Fig. 4(a) is a schematic diagram of weight encoding, and Fig. 4(b) is a schematic diagram of the table lookup method for weight index values.
Fig. 5 is a schematic diagram of the configurable state controller switching the working state of the hardware accelerator.
Detailed Description of the Embodiments
The technical solution of the invention is described in detail below with reference to the accompanying drawings.
The hardware structure of the deep neural network accelerator designed in the present invention is shown in Fig. 1. Its operation is described using a 16*16 PE array, a 3*3 convolution kernel and a stride of 1 as an example. The accelerator caches the input data and the power-exponent-quantized weight data into the input buffer and the weight buffer over the AXI-4 bus; according to the size of the PE array, the AXI-4 bus reads 16 convolution kernels from the DDR and stores them in the weight buffer so that 16 weight index values can be fed into the PE array in parallel. The convolution kernel data stored in the weight buffer is read into the encoding module, and the weight data is then encoded in a fixed way according to the sign of the power-exponent-quantized weights and their positions in the ordered quantization set, so that shift computation can be performed in the PE units. During computation, the input buffer outputs 16 data items in parallel by row, which are input in parallel to the first PE unit of each row of the PE array and passed in turn to the adjacent PE units; the input data is cached in the input sub-buffer of each PE unit. The weight index values are input in parallel through 16 weight index sub-buffers to the first PE unit of each column of the PE array, passed in turn to the adjacent PE units in each column, and finally cached in the weight index sub-buffer inside each PE unit. Each PE unit reads the weight index value from its weight index sub-buffer and determines the direction and magnitude of the data shift by the table lookup method, i.e. by looking up the established shift table.
Each PE unit is shown in Fig. 2. The input data is cached in the input sub-buffer, and the encoded convolution kernel data (weight index values) is cached in the weight index sub-buffer. The PE control unit generates, from the weight index information, the flag bits S1S0 that configure the shift operation mode of the PE unit: the flag bit S0, representing the negation operation, is generated from the sign of the weight, and the flag bit S1, representing the shift direction, is generated from the position of the weight in the ordered quantization set; the control unit reads data from the weight index sub-buffer and the input sub-buffer for processing, and the absolute value of the weight index is used to look up the shift table to determine the direction and magnitude of the shift applied to the input data. The PE unit has two flag bits S1 and S0. When S0 is 0 the weight is negative, and the value with the opposite sign to data A is used for the shift operation; otherwise data A is shifted directly. When S1 is 0, the corresponding value b in the shift table is positive and the output data is sent to the data left-shift unit for the shift operation; when S1 is 1, the value b is negative and the output data is sent to the data right-shift unit for the shift operation. The absolute value of the shift value b represents the magnitude of the shift.
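A behavioral sketch of this PE datapath (illustrative Python; the flag conventions follow the paragraph above, while the function name and integer model are assumptions):

```python
def pe_shift(data_a, s0, s1, b):
    """Model one PE shift operation.
    s0: 0 -> weight is negative, negate the input first; 1 -> use data_a directly.
    s1: 0 -> shift-table value b is positive, shift left by b;
        1 -> b is negative, shift right by |b|."""
    operand = data_a if s0 == 1 else -data_a
    if s1 == 0:
        return operand << b          # left shift by a positive b
    return operand >> (-b)           # right shift by |b| when b is negative

# weight -0.25 = -(2^-2): S0=0 (negate), S1=1, b=-2; input A = 32 -> -8
print(pe_shift(32, s0=0, s1=1, b=-2))
```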
The input buffer and output buffer use the line buffer structure shown in Fig. 3. When the input feature map is stored in the input buffer in blocks, (r-1)*n data items are shared between adjacent blocks, where r is the number of columns of the convolution kernel and n is the number of rows per block of the input feature map. To reuse data, n+m line buffers are provided in on-chip memory (m being the convolution stride). Each input line buffer holds M*W data items, where M is the number of channels of the input feature map and W is its width; each output line buffer holds N*C data items, where N is the number of channels of the output feature map and C is its width. The line buffers form a circular buffer: the accelerator reads n rows of data from the line buffers for computation while it reads data from the external memory DDR into the remaining m line buffers, so the computation of the n rows and the caching of the m rows are executed in parallel. After the n rows of data have been computed, the accelerator skips m rows and continues reading the next n rows for computation, and the skipped m rows are overwritten by incoming data.
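The rotating use of the n+m line buffers can be sketched as follows (an illustrative model only; index handling and names are assumptions, with n=3 and m=1 matching a 3-row kernel at stride 1):

```python
from collections import deque

def run_line_buffer(feature_rows, n, m):
    """Model a circular buffer of n+m line buffers: n rows are computed
    while the next m rows are being loaded from DDR."""
    buffers = deque(feature_rows[:n + m], maxlen=n + m)   # initially filled
    next_row = n + m
    while True:
        yield list(buffers)[:n]                           # n rows go to the PE array
        for _ in range(m):                                # m oldest rows are overwritten
            if next_row >= len(feature_rows):
                return
            buffers.append(feature_rows[next_row])        # deque drops the oldest row
            next_row += 1

for window in run_line_buffer(list(range(10)), n=3, m=1):
    print(window)
```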
The encoded convolution kernel data (weight index value) consists of two parts: the sign shown in Fig. 4(a) and the index shown in Fig. 4(b). The sign represents the sign of the power-exponent-quantized weight data, and the index represents the position of the power-exponent-quantized weight data in the ordered quantization set. The encoded convolution kernel data determines the direction and magnitude of the shift by the table lookup method. For example, with the quantization set {2¹, 2⁰, 2⁻¹, 2⁻², 2⁻³} and the shift table {1, 0, -1, -2, -3}, the quantized weight 2 is encoded as 0; looking up the shift table yields 1, so the input data is shifted left by 1 bit.
As shown in Fig. 5, the accelerator has five working states, switched by the configurable state controller; the five states are waiting (idle), sending input data (map), sending convolution kernel data (weight), data computation (cal), and sending computation results (send). After initialization the accelerator enters the waiting state (idle) and waits for the state signal flag: when flag is 001 it enters the sending-input-data state, when flag is 010 it enters the sending-convolution-kernel-data state, and when flag is 10000 it enters the data computation state. When the computation is finished it automatically jumps to the sending-computation-results state, and after the transmission is completed it returns to the waiting state to wait for the next data write.
Sending input data (map) state: when flag is 001, the accelerator reads data from the DDR over the AXI-4 bus and reads 16 rows of the input feature map into the input buffer. Because the input buffer is designed as line buffers, 16 data items can be output in parallel from the 16 line buffers of the input buffer in one clock cycle and input in parallel into the input sub-buffers of each row of the PE array; in every clock cycle the data moves to the right within the PE array.
Sending convolution kernel data (weight) state: when flag is 010, the accelerator reads 16 convolution kernels (power-exponent-quantized weight data) from the DDR and stores them in the weight buffer, encodes the data to obtain the weight indices, and stores the weight indices in the weight index buffer. Within one clock cycle the weight index buffer outputs 16 data items in parallel to the 16 PE units of each column of the PE array, and the data is finally cached in the weight index sub-buffer of each PE unit.
Data computation (cal) state: when flag is 10000, the shift table is looked up according to the position information carried by the weight indices in the weight index sub-buffers to determine the shift direction and magnitude for the input data; after 3*3*(number of input channels) shift operations have been performed, all data has been computed and the next clock cycle jumps to the sending-computation-results state.
Sending computation results (send) state: the computation results are read out from the 16 result buffers in turn; the data of the first output channel is taken from each result buffer, every four items are packed into one 64-bit output word and sent to the external memory DDR through the AXI-4 bus interface; after the data of all 16 output channels has been sent to the external memory DDR in turn, the accelerator jumps back to the waiting state.
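How four results could be packed into one 64-bit word for the AXI-4 transfer is sketched below (illustrative; the 16-bit result width and the packing order are assumptions not fixed by the text):

```python
def pack_results(results, width=16):
    """Pack groups of four fixed-point results into 64-bit words (lowest lane first)."""
    words = []
    for i in range(0, len(results), 4):
        word = 0
        for j, r in enumerate(results[i:i + 4]):
            word |= (r & ((1 << width) - 1)) << (j * width)
        words.append(word)
    return words

print([hex(w) for w in pack_results([0x0011, 0x0022, 0x0033, 0x0044])])
# ['0x44003300220011']
```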
The embodiment merely illustrates the technical idea of the present invention and cannot be used to limit its scope of protection; any modification made on the basis of the technical solution in accordance with the technical idea proposed by the present invention falls within the scope of protection of the present invention.

Claims (8)

  1. A deep neural network hardware accelerator based on power exponent quantization, characterized by comprising:
    an input buffer, which caches input feature map data read from an external memory,
    a weight buffer, which caches convolution kernel weight data in power exponent form read from the external memory,
    an encoder, which encodes the weight data to obtain weight index values representing the sign of the weights and their positions in an ordered quantization set,
    a weight index buffer, which caches the weight index data representing the sign of the convolution kernel weight data and its position in the ordered quantization set,
    a PE array, which reads the input feature map data and the weight index data from the input buffer and the weight index buffer, looks up a shift table according to the position information of the weight index data in the ordered quantization set to determine the shift direction and the number of bits to shift, performs the shift operation on the input feature map data, and outputs the shift operation results,
    an output buffer, which caches the shift operation results output by the PE array, and
    a state controller, which generates switching instructions that switch the accelerator among the waiting state, the sending-input-data state, the sending-convolution-kernel-data state, the data computation state and the sending-computation-results state.
  2. The deep neural network hardware accelerator based on power exponent quantization according to claim 1, characterized in that the PE array is a systolic array in which each column outputs different points of the same output channel in parallel and different columns output data of different output channels, and the input feature map data is loaded into the leftmost column of the PE array in the initial clock cycle and moves one step to the right of the current column in every clock cycle after the initial clock cycle.
  3. The deep neural network hardware accelerator based on power exponent quantization according to claim 1, characterized in that the deep neural network hardware accelerator further comprises an AXI-4 bus interface, and the input buffer and the weight buffer read the input feature map data and the convolution kernel weight data in power exponent form from the external memory through the AXI-4 bus interface.
  4. The deep neural network hardware accelerator based on power exponent quantization according to claim 1, characterized in that the input buffer and the output buffer are both line buffer structures containing n+m line buffers; the PE array reads n rows of data from the input buffer for the shift operations while the input buffer reads m rows of input feature map data from the external memory into the remaining m line buffers, where n is the number of rows per block of the input feature map and m is the convolution stride.
  5. The deep neural network hardware accelerator based on power exponent quantization according to claim 1, characterized in that the ordered quantization set stores the absolute values of the power exponent quantization values of all convolution kernel weight data.
  6. The deep neural network hardware accelerator based on power exponent quantization according to claim 2, characterized in that the PE units in the PE array comprise:
    an input sub-buffer, which caches the input feature map data read from the input buffer,
    a weight index sub-buffer, which caches the weight index data read from the weight index buffer,
    a PE control unit, which reads the input feature map data and the weight index data from the input sub-buffer and the weight index sub-buffer, generates a flag bit for the negation operation according to the sign of the weight, looks up the shift data according to the position of the weight in the ordered quantization set, and generates the corresponding flag bit for the shift direction,
    a first data selector, whose address input receives the negation flag bit and whose two data inputs receive the input feature map data and its negation, and which outputs the input feature map data when the weight is positive and the negation of the input feature map data when the weight is negative,
    a second data selector, whose address input receives the shift direction flag bit and whose data input receives the shift data, and which outputs an instruction to shift left by the shift data when the shift data is positive and an instruction to shift right by the opposite of the shift data when the shift data is negative, and
    a shift unit, which receives the output signals of the first data selector and the second data selector and shifts the input feature map data or its negation according to the instruction output by the second data selector.
  7. The deep neural network hardware accelerator based on power exponent quantization according to claim 3, characterized in that the data bit width of the AXI-4 bus interface is greater than the bit width of a single convolution kernel weight datum or input feature map datum.
  8. The deep neural network hardware accelerator based on power exponent quantization according to claim 5, characterized in that the shift table is formed by arranging the power exponents of the elements of the ordered quantization set in order.
PCT/CN2020/071150 2019-06-25 2020-01-09 Deep neural network hardware accelerator based on power exponent quantization WO2020258841A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/284,480 US20210357736A1 (en) 2019-06-25 2020-01-09 Deep neural network hardware accelerator based on power exponential quantization

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910554531.8 2019-06-25
CN201910554531.8A CN110390383B (zh) 2019-06-25 2019-06-25 一种基于幂指数量化的深度神经网络硬件加速器

Publications (1)

Publication Number Publication Date
WO2020258841A1 true WO2020258841A1 (zh) 2020-12-30

Family

ID=68285777

Family Applications (2)

Application Number Title Priority Date Filing Date
PCT/CN2019/105532 WO2020258527A1 (zh) 2019-06-25 2019-09-12 一种基于幂指数量化的深度神经网络硬件加速器
PCT/CN2020/071150 WO2020258841A1 (zh) 2019-06-25 2020-01-09 一种基于幂指数量化的深度神经网络硬件加速器

Family Applications Before (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/105532 WO2020258527A1 (zh) 2019-06-25 2019-09-12 一种基于幂指数量化的深度神经网络硬件加速器

Country Status (3)

Country Link
US (1) US20210357736A1 (zh)
CN (1) CN110390383B (zh)
WO (2) WO2020258527A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
NL2035521A (en) * 2021-05-05 2023-08-17 Uniquify Inc Implementations and methods for processing neural network in semiconductor hardware

Families Citing this family (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110390383B (zh) * 2019-06-25 2021-04-06 东南大学 一种基于幂指数量化的深度神经网络硬件加速器
CN111062472B (zh) * 2019-12-11 2023-05-12 浙江大学 一种基于结构化剪枝的稀疏神经网络加速器及其加速方法
CN113095471B (zh) * 2020-01-09 2024-05-07 北京君正集成电路股份有限公司 一种提高检测模型效率的方法
CN113222126B (zh) * 2020-01-21 2022-01-28 上海商汤智能科技有限公司 数据处理装置、人工智能芯片
CN111240640B (zh) * 2020-01-21 2022-05-10 苏州浪潮智能科技有限公司 基于硬件环境的数据量化方法、装置及可读存储介质
CN111488983B (zh) * 2020-03-24 2023-04-28 哈尔滨工业大学 一种基于fpga的轻量级cnn模型计算加速器
CN113627600B (zh) * 2020-05-07 2023-12-29 合肥君正科技有限公司 一种基于卷积神经网络的处理方法及其系统
CN112073225A (zh) * 2020-08-25 2020-12-11 山东理工职业学院 一种基于校园网速设计的加速器系统以及流程
CN112200301B (zh) * 2020-09-18 2024-04-09 星宸科技股份有限公司 卷积计算装置及方法
CN112786021B (zh) * 2021-01-26 2024-05-14 东南大学 一种基于分层量化的轻量级神经网络语音关键词识别方法
KR20220114890A (ko) * 2021-02-09 2022-08-17 한국과학기술정보연구원 뉴럴 네트워크 연산 방법 및 이를 위한 장치
WO2022235517A2 (en) * 2021-05-05 2022-11-10 Uniquify, Inc. Implementations and methods for processing neural network in semiconductor hardware
CN113298236B (zh) * 2021-06-18 2023-07-21 中国科学院计算技术研究所 基于数据流结构的低精度神经网络计算装置及加速方法
KR20230020274A (ko) * 2021-08-03 2023-02-10 에스케이하이닉스 주식회사 반도체 메모리 장치와 반도체 메모리 장치의 동작 방법
CN113869494A (zh) * 2021-09-28 2021-12-31 天津大学 基于高层次综合的神经网络卷积fpga嵌入式硬件加速器
CN114139693B (zh) * 2021-12-03 2024-08-13 安谋科技(中国)有限公司 神经网络模型的数据处理方法、介质和电子设备
CN114565501B (zh) * 2022-02-21 2024-03-22 格兰菲智能科技有限公司 用于卷积运算的数据加载方法及其装置
CN114781632B (zh) * 2022-05-20 2024-08-27 重庆科技大学 基于动态可重构脉动张量运算引擎的深度神经网络加速器
CN115034372A (zh) * 2022-05-25 2022-09-09 复旦大学 用于DoA估计的TB-Net硬件加速实现方法
WO2024117562A1 (ko) * 2022-11-29 2024-06-06 한국전자통신연구원 곱셈 누적 연산 방법 및 장치
CN118519962B (zh) * 2024-07-19 2024-10-08 深存科技(无锡)有限公司 单边输入输出的脉动阵列加速器架构和通用型加速处理器

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107704916A (zh) * 2016-08-12 2018-02-16 北京深鉴科技有限公司 一种基于fpga实现rnn神经网络的硬件加速器及方法
US20180341495A1 (en) * 2017-05-26 2018-11-29 Purdue Research Foundation Hardware Accelerator for Convolutional Neural Networks and Method of Operation Thereof
CN109284822A (zh) * 2017-07-20 2019-01-29 上海寒武纪信息科技有限公司 一种神经网络运算装置及方法
CN109359735A (zh) * 2018-11-23 2019-02-19 浙江大学 深度神经网络硬件加速的数据输入装置与方法
CN109598338A (zh) * 2018-12-07 2019-04-09 东南大学 一种基于fpga的计算优化的卷积神经网络加速器
CN110390383A (zh) * 2019-06-25 2019-10-29 东南大学 一种基于幂指数量化的深度神经网络硬件加速器

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160379109A1 (en) * 2015-06-29 2016-12-29 Microsoft Technology Licensing, Llc Convolutional neural networks on hardware accelerators
CN107894957B (zh) * 2017-11-14 2020-09-01 河南鼎视智能科技有限公司 面向卷积神经网络的存储器数据访问与插零方法及装置
CN108171317B (zh) * 2017-11-27 2020-08-04 北京时代民芯科技有限公司 一种基于soc的数据复用卷积神经网络加速器
US10831702B2 (en) * 2018-09-20 2020-11-10 Ceva D.S.P. Ltd. Efficient utilization of systolic arrays in computational processing

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107704916A (zh) * 2016-08-12 2018-02-16 北京深鉴科技有限公司 一种基于fpga实现rnn神经网络的硬件加速器及方法
US20180341495A1 (en) * 2017-05-26 2018-11-29 Purdue Research Foundation Hardware Accelerator for Convolutional Neural Networks and Method of Operation Thereof
CN109284822A (zh) * 2017-07-20 2019-01-29 上海寒武纪信息科技有限公司 一种神经网络运算装置及方法
CN109359735A (zh) * 2018-11-23 2019-02-19 浙江大学 深度神经网络硬件加速的数据输入装置与方法
CN109598338A (zh) * 2018-12-07 2019-04-09 东南大学 一种基于fpga的计算优化的卷积神经网络加速器
CN110390383A (zh) * 2019-06-25 2019-10-29 东南大学 一种基于幂指数量化的深度神经网络硬件加速器

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
NL2035521A (en) * 2021-05-05 2023-08-17 Uniquify Inc Implementations and methods for processing neural network in semiconductor hardware

Also Published As

Publication number Publication date
CN110390383B (zh) 2021-04-06
CN110390383A (zh) 2019-10-29
US20210357736A1 (en) 2021-11-18
WO2020258527A1 (zh) 2020-12-30

Similar Documents

Publication Publication Date Title
WO2020258841A1 (zh) 一种基于幂指数量化的深度神经网络硬件加速器
WO2020258529A1 (zh) 一种基于bnrp的可配置并行通用卷积神经网络加速器
CN108805266B (zh) 一种可重构cnn高并发卷积加速器
CN110070178B (zh) 一种卷积神经网络计算装置及方法
CN108647773B (zh) 一种可重构卷积神经网络的硬件互连系统
CN109447241B (zh) 一种面向物联网领域的动态可重构卷积神经网络加速器架构
CN106991477B (zh) 一种人工神经网络压缩编码装置和方法
CN108647779B (zh) 一种低位宽卷积神经网络可重构计算单元
CN107256424B (zh) 三值权重卷积网络处理系统及方法
CN112465110B (zh) 一种卷积神经网络计算优化的硬件加速装置
WO2021057085A1 (zh) 一种基于混合精度存储的深度神经网络加速器
CN111860773B (zh) 处理装置和用于信息处理的方法
CN112596701B (zh) 基于单边雅克比奇异值分解的fpga加速实现方法
CN113298237A (zh) 一种基于fpga的卷积神经网络片上训练加速器
CN115018062A (zh) 一种基于fpga的卷积神经网络加速器
WO2023040389A1 (zh) 转数方法、存储介质、装置及板卡
Wu et al. An efficient lightweight CNN acceleration architecture for edge computing based-on FPGA
CN110766136B (zh) 一种稀疏矩阵与向量的压缩方法
CN113392963B (zh) 基于fpga的cnn硬件加速系统设计方法
CN115482456A (zh) 一种yolo算法的高能效fpga加速架构
CN112906886B (zh) 结果复用的可重构bnn硬件加速器及图像处理方法
CN112346704B (zh) 一种用于卷积神经网络的全流水线型乘加单元阵列电路
WO2021169914A1 (zh) 数据量化处理方法、装置、电子设备和存储介质
CN111445018B (zh) 基于加速卷积神经网络算法的紫外成像实时信息处理方法
CN116681108A (zh) 基于FPGA实现Tanh激励函数的计算方法及系统

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20831162

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20831162

Country of ref document: EP

Kind code of ref document: A1