CN107862381A

CN107862381A - An FIR filter implementation suitable for multiple convolution modes

Info

Publication number: CN107862381A
Application number: CN201711101343.7A
Authority: CN
Inventors: 王中风; 袁炅; 林军
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2017-11-06
Filing date: 2017-11-06
Publication date: 2018-03-30

Abstract

The invention discloses an FIR filter applicable to multiple convolution modes and its hardware implementation. The structure supports the mainstream convolution operations in current convolutional neural networks, such as 3*3 and 5*5 convolutions with a stride of 1 and 3*3 convolutions with a stride of 2, and uses a 6-parallel fast FIR algorithm to reduce hardware consumption, lower the computational complexity of convolution, and improve data throughput. The invention derives the hardware structure of a 3-parallel convolution with a stride of 2 and merges it with the hardware structure of the 6-parallel fast FIR filter without adding any adders or multipliers, so that the structure adapts to each mode and hardware resources are highly utilized. Through different configurations of a single hardware structure, the invention can carry out most of the current mainstream convolutional neural network computations, which improves hardware utilization, offers high versatility, and simplifies the hardware implementation of convolutional neural networks.

Description

An FIR Filter Implementation Suitable for Multiple Convolution Modes

Technical Field

The present invention relates to the fields of computer science and electronics, and in particular to the hardware implementation of convolutional neural networks in deep learning: a general architecture, and its hardware implementation, that is compatible with both stride-1 and stride-2 convolution.

Background

Owing to their excellent performance on images, audio, and other data, convolutional neural networks (CNNs) have become one of the most popular and widely used deep learning algorithms. With the rapid development of CNNs in recent years, large convolution kernels appear in models less and less often. The most widely used operations are now 3*3 and 5*5 convolutions, and convolutions with a stride of 2 are used by more and more models. However, there has been no good hardware optimization scheme for stride-2 convolution. Conventional stride-1 convolution can use the fast FIR algorithm to increase parallelism and reduce the number of multipliers.

In the time domain, an N-tap FIR filter is represented as:

$$y(n) = \sum_{i=0}^{N-1} h(i)\,x(n-i), \quad n = 0, 1, 2, \ldots$$

Or, in the z domain, it can be expressed as:

$$Y(z) = H(z)\,X(z) = X(z)\sum_{n=0}^{N-1} h(n)\,z^{-n}$$

If the length-N coefficient sequence {h(n)} of the FIR filter is used as the coefficients of an N-point discrete convolution, the filter performs an N-point convolution. Combining N such filters yields the N*N convolution used in convolutional neural networks. The fast FIR algorithm achieves high parallelism, and it lowers complexity by trading multipliers for additional adders. It is not well suited to stride-2 convolution, however: computing every output this way and then selectively keeping outputs to obtain a stride of 2 seriously wastes hardware, because in every cycle about 50% of the hardware resources have no effect on the result. There is therefore a need for a general hardware architecture that supports both conventional stride-1 convolution and stride-2 convolution with low complexity, high parallelism, and high hardware utilization.
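The relationship between an N-tap FIR filter and an N*N convolution can be illustrated with a short Python sketch. It is only a functional illustration under stated assumptions: the helper names `fir` and `conv2d_rows` are hypothetical, NumPy is assumed, samples before time 0 are taken as zero, and the 2-D result is a true convolution (kernel flipped) over the valid region. The last line of `conv2d_rows` emulates a stride of 2 by computing every output and discarding part of them, which is the wasteful approach described above.

```python
import numpy as np

def fir(x, h):
    """Direct N-tap FIR: y(n) = sum_i h(i) * x(n - i), with x(n) = 0 for n < 0.

    Assumes len(x) >= len(h).
    """
    x = np.asarray(x, dtype=float)
    y = np.zeros(len(x))
    for i, hi in enumerate(h):
        # Tap i contributes h(i) * x(n - i) to every output with n >= i.
        y[i:] += hi * x[:len(x) - i]
    return y

def conv2d_rows(image, kernel, stride=1):
    """N*N valid convolution assembled from N row-wise N-tap FIR filters."""
    image = np.asarray(image, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    n = kernel.shape[0]
    rows, cols = image.shape
    out = np.zeros((rows - n + 1, cols - n + 1))
    for r in range(rows - n + 1):
        for a in range(n):
            # Kernel row a filters image row r + n - 1 - a; the first n - 1
            # FIR outputs depend on the zero history and are dropped.
            y = fir(image[r + n - 1 - a], kernel[a])
            out[r] += y[n - 1:]
    # Stride > 1 by decimation: every output is computed, then some are discarded.
    return out[::stride, ::stride]
```

With `stride=2`, only one output in two along each axis is kept, although the row filters computed all of them.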

Summary of the Invention

To address the above problems, the present invention proposes a convolution architecture, built on the fast FIR algorithm framework, that is compatible with both stride 1 and stride 2, together with its hardware implementation. A single hardware architecture implements three computation modes: one 6-tap 6-parallel convolution, three independent 3-tap 3-parallel convolutions, and two independent 3-tap 3-parallel convolutions with a stride of 2. The invention is highly versatile: through different configurations of the hardware architecture it can perform most current mainstream convolution operations. The details are as follows:

An FIR filter applicable to multiple convolution modes, whose hardware architecture comprises:

1) A data input selection unit, which, according to the convolution mode, reselects and rearranges the input data and feeds it to the corresponding convolution calculation module.

2) A convolution calculation unit, whose basic building block is a 3-parallel 3-tap fast FIR filter, with data selectors inserted to steer the data flow for the different convolution operations.

3) A post-convolution calculation unit, which processes the outputs of the convolution calculation unit so as to cascade the independent building blocks inside it into a multi-parallel, multi-tap fast FIR filter.

4) A data output selection unit, which, according to the convolution mode, selects the corresponding results as the module output.
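One common way to cascade smaller fast FIR units into a larger one is the 2-parallel fast FIR decomposition, which needs three sub-filters. The sketch below applies it to a 6-tap filter. It is an assumption that the post-convolution calculation unit uses this exact arrangement; the function names `sub_filter` and `cascade6` are hypothetical, NumPy is assumed, and each 3-tap sub-filter is modelled by a plain convolution standing in for one basic 3-parallel 3-tap unit.

```python
import numpy as np

def sub_filter(x, h3):
    """Stand-in for one basic 3-tap unit: causal 3-tap convolution."""
    return np.convolve(x, h3)[:len(x)]

def cascade6(x, h6):
    """6-tap filter built from three 3-tap sub-filters (2-parallel fast FIR)."""
    x = np.asarray(x, dtype=float)
    h6 = np.asarray(h6, dtype=float)
    assert len(h6) == 6 and len(x) % 2 == 0
    xe, xo = x[0::2], x[1::2]        # even and odd input phases
    he, ho = h6[0::2], h6[1::2]      # even and odd tap phases (3 taps each)

    u = sub_filter(xe, he)           # sub-filter 1: He * Xe
    v = sub_filter(xo, ho)           # sub-filter 2: Ho * Xo
    w = sub_filter(xe + xo, he + ho) # sub-filter 3: (He + Ho) * (Xe + Xo)

    # Post-processing: additions and one delay only, no further multiplications.
    ye = u + np.concatenate(([0.0], v[:-1]))  # Ye = He*Xe + (delayed) Ho*Xo
    yo = w - u - v                            # Yo = He*Xo + Ho*Xe

    y = np.empty(len(x))
    y[0::2], y[1::2] = ye, yo
    return y
```

The three sub-filters correspond in number to the three basic units of the convolution calculation unit, and the combining step uses only adders and a delay.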

The second computation mode of the invention consists of three independent 3-tap 3-parallel fast FIR hardware structures. The 3-tap 3-parallel fast FIR structure is the most basic building block; derivation gives the following relationship between each output Y and the inputs X:

$$Y_0 = H_0X_0 - z^{-3}H_2X_2 + z^{-3}\left[(H_1+H_2)(X_1+X_2) - H_1X_1\right]$$

$$Y_1 = \left[(H_0+H_1)(X_0+X_1) - H_1X_1\right] - \left[H_0X_0 - z^{-3}H_2X_2\right]$$

$$Y_2 = \left[(H_0+H_1+H_2)(X_0+X_1+X_2)\right] - \left[(H_0+H_1)(X_0+X_1) - H_1X_1\right] - \left[(H_1+H_2)(X_1+X_2) - H_1X_1\right]$$
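The three equations above can be checked numerically. The following Python sketch is a behavioural model only, assuming NumPy; the names `delay` and `ffa3` are hypothetical. It treats X₀, X₁, X₂ as the three polyphase components of the input (samples 3k, 3k+1, 3k+2), H₀, H₁, H₂ as the three taps, and z⁻³ as a delay of one block of three samples.

```python
import numpy as np

def delay(a):
    """One block delay (z^-3 in sample units): shift by one block, zero fill."""
    return np.concatenate(([0.0], a[:-1]))

def ffa3(x, h):
    """3-tap, 3-parallel fast FIR: six multiplications per block of 3 outputs."""
    x = np.asarray(x, dtype=float)
    assert len(x) % 3 == 0 and len(h) == 3
    x0, x1, x2 = x[0::3], x[1::3], x[2::3]   # polyphase inputs X0, X1, X2
    h0, h1, h2 = h

    # The six products shared by the three outputs.
    m0 = h0 * x0
    m1 = h1 * x1
    m2 = h2 * x2
    m01 = (h0 + h1) * (x0 + x1)
    m12 = (h1 + h2) * (x1 + x2)
    m012 = (h0 + h1 + h2) * (x0 + x1 + x2)

    a = m0 - delay(m2)   # H0X0 - z^-3 H2X2
    b = m01 - m1         # (H0+H1)(X0+X1) - H1X1
    c = m12 - m1         # (H1+H2)(X1+X2) - H1X1

    y0 = a + delay(c)
    y1 = b - a
    y2 = m012 - b - c

    y = np.empty(len(x))
    y[0::3], y[1::3], y[2::3] = y0, y1, y2   # interleave back to sample order
    return y
```

A direct 3-tap filter producing three outputs per block would need nine multiplications; the model uses six, at the cost of extra additions.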

For a 3-tap 3-parallel convolution with a stride of 2, the relationship between each output Y and the inputs X can be derived as:

$$Y_0 = H_0X_0 + z^{-6}\left(H_2X_4 + H_1X_5\right)$$

$$Y_1 = H_2X_0 + H_1X_1 + H_0X_2$$

$$Y_2 = H_2X_2 + H_1X_3 + H_0X_4$$
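As a hedged check of the stride-2 equations, the Python sketch below evaluates them block by block and can be compared with an ordinary 3-tap convolution whose odd-indexed outputs are discarded. It assumes NumPy; the name `stride2_parallel3` is hypothetical. X₀ to X₅ are taken to be the six polyphase components of the input (samples 6k to 6k+5), and z⁻⁶ a delay of one block of six samples.

```python
import numpy as np

def stride2_parallel3(x, h):
    """3-tap, stride-2, 3-parallel convolution: 6 inputs in, 3 outputs out per block."""
    x = np.asarray(x, dtype=float)
    assert len(x) % 6 == 0 and len(h) == 3
    h0, h1, h2 = h
    X = [x[j::6] for j in range(6)]          # polyphase inputs X0..X5

    def delay(a):
        # One block delay (z^-6 in sample units), zero initial state.
        return np.concatenate(([0.0], a[:-1]))

    # Nine products and six additions per block.
    y0 = h0 * X[0] + delay(h2 * X[4] + h1 * X[5])
    y1 = h2 * X[0] + h1 * X[1] + h0 * X[2]
    y2 = h2 * X[2] + h1 * X[3] + h0 * X[4]

    y = np.empty(len(x) // 2)
    y[0::3], y[1::3], y[2::3] = y0, y1, y2   # outputs y(6k), y(6k+2), y(6k+4)
    return y
```

Every product computed here contributes to a kept output, in contrast to computing a stride-1 result and then dropping half of it.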

By sharing multipliers and adders and adding multiplexers, the stride-2 convolution is combined with the fast FIR algorithm, yielding a convolution hardware architecture with high hardware utilization, high parallelism, and high versatility.

Description of the Drawings

Fig. 1 is the hardware structure diagram of the 3-parallel convolution with a stride of 2.

Fig. 2 is the hardware structure diagram of the 3-tap 3-parallel fast FIR filter.

Fig. 3 is the overall architecture diagram of the invention.

Fig. 4 is the hardware structure diagram of the FIR filter of the invention applicable to multiple convolution modes.

Detailed Description

The specific implementation of the invention is further described below with reference to the drawings. Stride-2 convolution differs considerably from stride-1 convolution in implementation: each input is spaced 2 samples from the previous one, so when the fast FIR algorithm is used for parallel computation, the non-contiguous data make half of the computed results useless. Without the fast FIR algorithm, the input-output relationship is:

$$Y_2 = H_2X_2 + H_1X_3 + H_0X_4$$

It follows that a 3-parallel implementation of the stride-2 convolution has the hardware structure shown in Fig. 1.

This structure uses 9 multipliers and 6 adders in total and produces 3 outputs for every 6 inputs. Compared with the fast FIR algorithm at the same degree of parallelism, it achieves high utilization of hardware resources.

The hardware structure of a fast FIR filter is shown in Fig. 2. By adding adders to reduce the number of multipliers, it lowers implementation complexity while achieving high parallelism. For conventional stride-1 convolution, this structure computes efficiently.

By merging the two hardware structures above, the invention efficiently supports convolution with strides of both 1 and 2, and at high parallelism it offers low complexity, high hardware utilization, and strong versatility. Fig. 3 shows the hardware architecture of the invention, which comprises four modules: the data input selection unit, the convolution calculation unit, the post-convolution calculation unit, and the data output selection unit.

The specific hardware circuit is shown in Fig. 4. By sharing multipliers and adders and adding multiplexers, the invention merges the 3-parallel stride-2 convolution structure with the 3-parallel fast FIR filter, and supports both the stride-2 and stride-1 convolution modes without adding a single multiplier or adder. Three 3-parallel fast FIR filters are cascaded to form a 6-parallel fast FIR filter. Three working modes can be selected. The first is three independent 3-parallel 3-tap fast FIR filters; by setting the tap coefficients H, a 3*3 convolution can be realized. The second is two independent 3-tap convolutions with a stride of 2. The third is a single 6-parallel fast FIR filter; setting its last tap coefficient to 0 realizes a 5*5 convolution. Only the input data X, the tap coefficients H, and the working mode M need to be supplied to the circuit: according to the mode, the invention automatically distributes X and H to the sub-modules for computation, and multiplexers select the results that match the mode as the output.
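The three working modes can be summarised by a small behavioural model in Python. This is a sketch of the input-output behaviour only, not of the circuit in Fig. 4: the `Mode` names, the `run` function, and the list-based interface for X and H are all hypothetical, NumPy is assumed, and each mode is modelled with a reference convolution rather than the shared multiplier and adder datapath.

```python
from enum import Enum
import numpy as np

class Mode(Enum):
    THREE_3TAP = 1    # three independent 3-tap filters, stride 1
    TWO_STRIDE2 = 2   # two independent 3-tap filters, stride 2
    ONE_6TAP = 3      # one 6-tap filter, stride 1 (5 taps if the last is 0)

def run(mode, xs, hs):
    """Return one output sequence per independent filter active in `mode`."""
    xs = [np.asarray(x, dtype=float) for x in xs]
    hs = [np.asarray(h, dtype=float) for h in hs]
    if mode is Mode.THREE_3TAP:
        assert len(xs) == len(hs) == 3 and all(len(h) == 3 for h in hs)
        return [np.convolve(x, h)[:len(x)] for x, h in zip(xs, hs)]
    if mode is Mode.TWO_STRIDE2:
        assert len(xs) == len(hs) == 2 and all(len(h) == 3 for h in hs)
        # Only the even-indexed outputs exist in stride-2 mode.
        return [np.convolve(x, h)[:len(x)][::2] for x, h in zip(xs, hs)]
    if mode is Mode.ONE_6TAP:
        assert len(xs) == len(hs) == 1 and len(hs[0]) in (5, 6)
        h = np.zeros(6)
        h[:len(hs[0])] = hs[0]   # a 5-tap row leaves the last coefficient at 0
        return [np.convolve(xs[0], h)[:len(xs[0])]]
    raise ValueError(f"unknown mode: {mode}")
```

For a 5*5 kernel, each kernel row would be passed as a 5-element H in `Mode.ONE_6TAP`, which corresponds to setting the sixth tap coefficient to 0.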

Of course, the invention is not limited to convolutional neural network computation; it also has many applications in digital image processing, digital signal processing, wireless communication, and other fields. Moreover, applying the same approach of sharing multipliers and adders and inserting multiplexers to other filters can support further convolution modes. Those of ordinary skill in the art may make improvements and refinements without departing from the principle of the invention, and these shall also be regarded as falling within its scope of protection. Any component not specified in this embodiment can be implemented with existing technology.

Claims

1. An FIR filter applicable to multiple convolution modes, whose hardware architecture comprises:

1) A data input selection unit, which, according to the convolution mode, reselects and rearranges the input data and feeds it to the corresponding convolution calculation module.

2) A convolution calculation unit, whose basic building block is a 3-parallel 3-tap fast FIR filter, with data selectors inserted to steer the data flow for the different convolution operations.

3) A post-convolution calculation unit, which processes the outputs of the convolution calculation unit so as to cascade the independent building blocks inside it into a multi-parallel, multi-tap fast FIR filter.

4) A data output selection unit, which, according to the convolution mode, selects the corresponding results as the module output.