CN107645287B

CN107645287B - 6-parallel fast FIR filter

Info

Publication number: CN107645287B
Application number: CN201710396331.5A
Authority: CN
Inventors: 王中风; 王昊楠; 林军
Original assignee: Nanjing Fengxing Technology Co ltd
Current assignee: Nanjing Fengxing Technology Co., Ltd.
Priority date: 2017-05-24
Filing date: 2017-05-24
Publication date: 2020-12-22
Anticipated expiration: 2037-05-24
Also published as: CN107645287A

Abstract

The invention discloses a size-configurable convolution hardware implementation based on a cascaded 6-parallel fast finite impulse response (FIR) filter structure. It can perform convolution calculations of four sizes, namely 3 × 3, 5 × 5, 7 × 7 and 11 × 11, reducing the complexity of the convolution calculation and improving throughput under the 6-parallel structure. The invention introduces the 2-parallel and 3-parallel fast FIR filter algorithm structures and then derives a 6-parallel fast FIR algorithm (FFA) by cascading 3-parallel substructures inside a 2-parallel structure. On the basis of the 6-parallel FFA, configurable sub-filters are used to design a fast convolution hardware architecture that can perform convolution calculations of all four sizes. Compared with a traditional 6-parallel FIR filter at the same throughput, the algorithm saves 50% of the multiplications at the cost of some additional additions. Because a multiplier occupies far more area and consumes far more power than an adder in hardware, the area and power consumption of the structure can be reduced by approximately 50%. The invention can be applied wherever convolution calculations of these typical sizes (3 × 3, 5 × 5, 7 × 7 and 11 × 11) are required, such as convolutional neural networks, video and image processing, and wireless communications, either to increase the effective throughput of the original filter or to reduce its power consumption.

Description

6-parallel fast FIR filter

Technical Field

The invention relates to the fields of integrated circuits and machine learning, and in particular to a 6-parallel fast FIR filter structure used to implement in hardware a general-purpose convolution circuit that supports all four convolution sizes used in convolutional neural networks: 3 × 3, 5 × 5, 7 × 7 and 11 × 11.

Background

Convolutional neural networks (CNNs) are currently among the most studied and most widely used machine learning algorithms. Convolution is the most computation-intensive part of a CNN. In hardware, the convolution operation is implemented as multiply-accumulate operations, and multipliers are expensive: a multiplier occupies about ten times the area and consumes about ten times the power of an adder. Optimizing the hardware implementation of the convolution operation is therefore of great significance. Most convolutional networks use 3 × 3 or 5 × 5 convolution kernels, a small fraction use the larger 7 × 7 and 11 × 11 kernels, and other sizes are rarely used in practice.

In the time domain, an N-tap FIR filter can be expressed as

y(n)＝Σ_{i=0}^{N-1} h(i)x(n-i), n＝0, 1, 2, …

and in the z-domain as

Y(z)＝H(z)X(z)＝[Σ_{n=0}^{N-1} h(n)z^-n]·[Σ_{n=0}^{∞} x(n)z^-n]

where the sequence x(n) is an infinitely long input sequence and the sequence h(n) contains the N FIR filter coefficients. If {h(n)} is regarded as the coefficients of an N-point discrete convolution, the FIR filter performs one N × N convolution calculation.
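The relationship between an N-tap FIR filter and a discrete convolution can be illustrated with a minimal Python sketch. The function name `fir_direct` and the sample values are illustrative assumptions, and the infinite input stream is replaced by a short finite list.

```python
def fir_direct(h, x):
    """Direct-form FIR: y(n) = sum_{i=0}^{N-1} h(i) * x(n - i).

    Samples of x outside the given list are treated as zero.
    """
    n_taps = len(h)
    y = [0] * (len(x) + n_taps - 1)
    for n in range(len(y)):
        for i in range(n_taps):
            if 0 <= n - i < len(x):
                y[n] += h[i] * x[n - i]   # one multiply-accumulate
    return y

h = [1, 2, 3]        # arbitrary 3-tap coefficient sequence
x = [4, 5, 6, 7]     # finite stand-in for the input stream
y = fir_direct(h, x)
print(y)
```

Each output sample costs N multiplications in this form, which is the cost the fast FIR algorithm aims to reduce.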

The fast FIR algorithm (FFA) is obtained by applying algorithmic strength reduction to an FIR filter. Its core idea is to reduce hardware complexity by sharing substructures.

Disclosure of Invention

The main innovations of the invention are as follows:

building on the existing parallel fast FIR algorithm and the FFA cascading scheme for large block sizes, a hardware implementation of the 6-parallel fast FIR algorithm (FFA) is proposed for the first time;

on the basis of the 6-parallel fast convolution structure, a general-purpose fast convolution hardware circuit is designed that is compatible with all four convolution kernel sizes commonly used in convolutional neural networks, namely 3 × 3, 5 × 5, 7 × 7 and 11 × 11.

The theoretical analysis of the invention is as follows.

In the z-domain, the polynomial representation of an N-tap FIR filter is

Y(z)＝H(z)X(z)＝[Σ_{n=0}^{N-1} h(n)z^-n]·[Σ_{n=0}^{∞} x(n)z^-n]

First, we discuss a 2-parallel fast FIR filter in a one-stage structure.

The input sequence { x (0), x (1), x (2), x (3), … } may be split into odd and even terms as follows

X(z)＝x(0)+x(1)z^-1+x(2)z^-2+x(3)z^-3+…

＝x(0)+x(2)z^-2+x(4)z^-4+…

+z^-1[x(1)+x(3)z^-2+x(5)z^-4+…]

＝X₀+z^-1X₁

where X₀ and X₁ are the z-transforms of x(2k) and x(2k+1), respectively. Similarly, the length-N filter coefficient polynomial H(z) can be split into two parts

H(z)＝H₀+z^-1H₁

where H₀(z²) and H₁(z²) each have length N/2 and correspond to the even sub-filter and the odd sub-filter, respectively. The output sequence y(n) is likewise split into even and odd parts and is calculated as follows

Y(z)＝Y₀+z^-1Y₁

＝(X₀+z^-1X₁)(H₀+z^-1H₁)

＝(X₀H₀+z^-2X₁H₁)+z^-1(X₁H₀+X₀H₁)

Wherein

Y₀＝X₀H₀+z^-2X₁H₁

Y₁＝X₁H₀+X₀H₁

Applying the fast FIR algorithm (FFA) to this first-level structure, that is, the 2-parallel FIR filter, yields a number of 2-parallel FFA structures. A typical one is

Y₀＝X₀H₀+z^-2X₁H₁

Y₁＝(H₀+H₁)(X₀+X₁)-X₀H₀-X₁H₁
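The 2-parallel FFA identity can be checked numerically. The following is a minimal Python sketch, assuming finite even-length sequences and integer data. The helper names (`conv`, `add`, `sub`, `ffa2`) are illustrative and not taken from the patent, and each call to `conv` stands in for one sub-filter.

```python
def conv(a, b):
    """Direct polynomial product; stands in for one sub-filter."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def add(a, b):
    return [u + v for u, v in zip(a, b)]

def sub(a, b):
    return [u - v for u, v in zip(a, b)]

def ffa2(h, x):
    """2-parallel FFA; len(h) and len(x) are assumed to be even."""
    h0, h1 = h[0::2], h[1::2]            # even / odd sub-filters
    x0, x1 = x[0::2], x[1::2]            # even / odd input phases
    p0 = conv(h0, x0)                    # X0*H0
    p1 = conv(h1, x1)                    # X1*H1
    p2 = conv(add(h0, h1), add(x0, x1))  # (H0+H1)(X0+X1)
    y0 = add(p0 + [0], [0] + p1)         # Y0 = X0H0 + z^-2 X1H1
    y1 = sub(sub(p2, p0), p1) + [0]      # Y1 = (H0+H1)(X0+X1) - X0H0 - X1H1
    y = [0] * (2 * len(y0))
    y[0::2], y[1::2] = y0, y1            # interleave even and odd outputs
    return y[:len(h) + len(x) - 1]

h = [1, 2, 3, 4]
x = [1, -1, 2, 0, 3, 5]
print(ffa2(h, x) == conv(h, x))
```

Only three half-length sub-filter products are used instead of the four that the direct expressions for Y₀ and Y₁ require.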

Next we discuss the 3-parallel fast FIR filter structure. Using a three-phase polyphase decomposition, the input sequence x(n) and the filter coefficient sequence h(n) can be decomposed into

X(z)＝X₀(z³)+z^-1X₁(z³)+z^-2X₂(z³)

H(z)＝H₀(z³)+z^-1H₁(z³)+z^-2H₂(z³)

where X₀(z³)，X₁(z³)，X₂(z³) correspond to the time-domain sequences x(3k), x(3k+1) and x(3k+2), respectively, and H₀(z³)，H₁(z³)，H₂(z³) correspond to the three sub-filters. The output of the system is

Y(z)＝Y₀(z³)+z^-1Y₁(z³)+z^-2Y₂(z³)

＝(X₀+z^-1X₁+z^-2X₂)(H₀+z^-1H₁+z^-2H₂)

In theory, many optimized 3-parallel fast FIR filter structures are available. In matrix form they can be expressed as

Y＝QHP·X

where P and Q are the pre-processing matrix and the post-processing matrix, respectively, and H is the sub-filter matrix. A hardware block diagram of a 3-parallel FFA follows directly from this formula. The most common 3-parallel FFA structure is shown in FIG. 1 as an example.
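As an illustration of one such structure, the sketch below implements a commonly cited 3-parallel FFA that uses six sub-filter products, 3 pre-additions and 7 post-additions. Whether it matches FIG. 1 exactly is an assumption, since the figure is not reproduced in the text. The helper names are illustrative, and sequence lengths are assumed to be multiples of 3.

```python
def conv(a, b):
    """Direct polynomial product; stands in for one sub-filter."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def add(a, b): return [u + v for u, v in zip(a, b)]
def sub(a, b): return [u - v for u, v in zip(a, b)]
def delay(p): return [0] + p   # z^-3: one block delay
def pad(p): return p + [0]     # match the length of a delayed term

def ffa3(h, x):
    """3-parallel FFA; len(h) and len(x) are assumed to be multiples of 3."""
    h0, h1, h2 = h[0::3], h[1::3], h[2::3]
    x0, x1, x2 = x[0::3], x[1::3], x[2::3]
    # Six sub-filter products (the diagonal H matrix acting on P*X).
    m0, m1, m2 = conv(h0, x0), conv(h1, x1), conv(h2, x2)
    m01 = conv(add(h0, h1), add(x0, x1))
    m12 = conv(add(h1, h2), add(x1, x2))
    m012 = conv(add(add(h0, h1), h2), add(add(x0, x1), x2))
    # Post-processing (the Q matrix): 7 additions/subtractions.
    t01 = sub(m01, m1)                   # H0X0 + H0X1 + H1X0
    t12 = sub(m12, m1)                   # H1X2 + H2X1 + H2X2
    s = sub(pad(m0), delay(m2))          # H0X0 - z^-3 H2X2 (shared term)
    y0 = add(s, delay(t12))              # Y0 = H0X0 + z^-3 (H1X2 + H2X1)
    y1 = sub(pad(t01), s)                # Y1 = H0X1 + H1X0 + z^-3 H2X2
    y2 = pad(sub(sub(m012, t01), t12))   # Y2 = H0X2 + H1X1 + H2X0
    y = [0] * (3 * len(y0))
    y[0::3], y[1::3], y[2::3] = y0, y1, y2
    return y[:len(h) + len(x) - 1]

h = [1, 2, 3]
x = [4, 5, 6]
print(ffa3(h, x) == conv(h, x))
```

The direct 3-parallel form needs nine sub-filter products, so this variant trades three products for extra additions.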

The 6-parallel FFA structure can be realized by nesting any type of 3-parallel substructure inside any type of 2-parallel structure. Cascading the two most typical FFA structures gives the output expression

Y＝Y₀+z^-1Y₁+z^-2Y₂+z^-3Y₃+z^-4Y₄+z^-5Y₅

＝(X′₀+z^-1X′₁)(H′₀+z^-1H′₁)

＝[X′₀H′₀+z^-2X′₁H′₁]+z^-1[(X′₀+X′₁)(H′₀+H′₁)-X′₀H′₀-X′₁H′₁]

First, a 2-parallel fast FIR filter structure is used, in which

X′₀＝(X₀+z^-2X₂+z^-4X₄)

X′₁＝(X₁+z^-2X₃+z^-4X₅)

H′₀＝(H₀+z^-2H₂+z^-4H₄)

H′₁＝(H₁+z^-2H₃+z^-4H₅)

Each sub-product then corresponds to a 3-parallel FFA with the same output structure, so the outputs of the three sub-filters are

X′₀H′₀＝a₀+a₁+a₂＝a₀+z^-2b₁+z^-4b₂

X′₁H′₁＝a₃+a₄+a₅＝a₃+z^-2b₄+z^-4b₅

(X′₀+X′₁)(H′₀+H′₁)＝a₆+a₇+a₈＝a₆+z^-2b₇+z^-4b₈

Note that the three terms of each sub-filter output carry the factors z⁰, z^-2 and z^-4, respectively. Substituting them into the output expression of the parent structure, that is, the 2-parallel structure, and collecting terms by phase gives the following, where a_i on the right-hand side denotes the i-th term with its factor z^-2 or z^-4 removed (b_i in the notation above):

Y₀＝a₀+z^-6a₅

Y₁＝-a₀-a₃+a₆

Y₂＝a₁+a₃

Y₃＝-a₁-a₄+a₇

Y₄＝a₂+a₄

Y₅＝-a₂-a₅+a₈
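The nesting of a 3-parallel FFA inside a 2-parallel FFA can be sketched in Python as follows. This is an illustration under stated assumptions rather than the patented circuit: the 3-parallel variant is the six-product textbook form, all helper names and the `MULTS` counter are hypothetical, and sequence lengths are assumed to be multiples of 6.

```python
MULTS = 0  # scalar multiplication counter, for illustration only

def conv(a, b):
    """Direct polynomial product; stands in for one sub-filter."""
    global MULTS
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
            MULTS += 1
    return out

def add(a, b): return [u + v for u, v in zip(a, b)]
def sub(a, b): return [u - v for u, v in zip(a, b)]
def delay(p): return [0] + p   # one block delay at the current level
def pad(p): return p + [0]     # match the length of a delayed term

def interleave(phases, n):
    """Merge polyphase outputs back into one sequence of length n."""
    m = len(phases)
    y = [0] * (m * len(phases[0]))
    for k, p in enumerate(phases):
        y[k::m] = p
    return y[:n]

def ffa3(h, x):
    """Second level: 3-parallel FFA with six sub-filter products."""
    h0, h1, h2 = h[0::3], h[1::3], h[2::3]
    x0, x1, x2 = x[0::3], x[1::3], x[2::3]
    m0, m1, m2 = conv(h0, x0), conv(h1, x1), conv(h2, x2)
    m01 = conv(add(h0, h1), add(x0, x1))
    m12 = conv(add(h1, h2), add(x1, x2))
    m012 = conv(add(add(h0, h1), h2), add(add(x0, x1), x2))
    t01, t12 = sub(m01, m1), sub(m12, m1)
    s = sub(pad(m0), delay(m2))
    y0 = add(s, delay(t12))
    y1 = sub(pad(t01), s)
    y2 = pad(sub(sub(m012, t01), t12))
    return interleave([y0, y1, y2], len(h) + len(x) - 1)

def ffa6(h, x):
    """First level: 2-parallel FFA whose three products are 3-parallel FFAs."""
    h0, h1 = h[0::2], h[1::2]              # H'0, H'1
    x0, x1 = x[0::2], x[1::2]              # X'0, X'1
    p0 = ffa3(h0, x0)                      # X'0 H'0
    p1 = ffa3(h1, x1)                      # X'1 H'1
    p2 = ffa3(add(h0, h1), add(x0, x1))    # (X'0+X'1)(H'0+H'1)
    y_even = add(pad(p0), delay(p1))       # Y0, Y2, Y4
    y_odd = pad(sub(sub(p2, p0), p1))      # Y1, Y3, Y5
    return interleave([y_even, y_odd], len(h) + len(x) - 1)

h = [1, 2, 3, 4, 5, 6]
x = [1, 0, -1, 2, 3, -2]
y_fast = ffa6(h, x)
fast_mults = MULTS
MULTS = 0
y_ref = conv(h, x)
direct_mults = MULTS
print(y_fast == y_ref, fast_mults, direct_mults)
```

For one block of six inputs and a 6-tap filter, the nested form performs 18 scalar multiplications and the direct product performs 36, which is consistent with the operation counts stated in the description.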

The circuit of the 6-parallel fast FIR filter can be built from these output expressions. The 6-parallel general convolution kernel contains three 3-parallel FIR filters. The sub-filter part of the circuit can perform three independent channels of 3 × 3 convolution simultaneously, the whole filter can perform single-channel 5 × 5 convolution, and with reconfigurable second-order FIR sub-filters the hardware is compatible with convolution calculations of all four sizes, namely 3 × 3, 5 × 5, 7 × 7 and 11 × 11. Mode selection is implemented by adding MUX elements. The circuit schematic is shown in FIG. 2, and the schematic of the reconfigurable second-order FIR sub-filter is shown in FIG. 3.

The output module delivers 6 output results in parallel at a time. A traditional direct-form 6-parallel FIR filter with 6 taps needs 36 multiplications and 30 additions to compute 6 output results, whereas the 6-parallel fast FIR filter of the invention needs 18 multiplications and 42 additions. Because a multiplier consumes far more area and power than an adder in hardware, the 6-parallel fast FIR filter of the invention, which halves the number of multiplications at the cost of 12 extra additions, can save about 50% of the hardware resources compared with the traditional direct-form FIR filter. On this basis, a general circuit is realized that supports all four convolution sizes used in convolutional neural networks.
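The operation counts above can be reproduced with simple arithmetic. The sketch below takes the adder counts per level from claim 1 and assumes single-tap sub-filters, with six sub-filters in each 3-parallel substructure. The variable names are illustrative.

```python
# Direct form: each of the 6 outputs needs 6 multiplications and 5 additions.
direct_mults = 6 * 6
direct_adds = 6 * 5

# Fast form: three 3-parallel substructures, each with 6 single-tap sub-filters.
ffa_mults = 3 * 6

# Adders: first level (3 pre + 9 post) plus three second-level
# substructures (3 pre + 7 post each).
ffa_adds = (3 + 9) + 3 * (3 + 7)

mult_saving = 1 - ffa_mults / direct_mults
extra_adds = ffa_adds - direct_adds
print(direct_mults, direct_adds, ffa_mults, ffa_adds, mult_saving, extra_adds)
```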

Drawings

FIG. 1 is a block diagram of a 3-parallel fast FIR filter;

FIG. 2 is a detailed circuit diagram of the general 6-parallel fast FIR filter;

FIG. 3 is a circuit schematic of the second-order reconfigurable FIR sub-filter;

FIG. 4 is a block-level schematic diagram of the 6-parallel fast FIR filter.

Detailed Description

When mode selection module A is set to 0 and mode selection module B is set to 0, the circuit performs three-channel 3 × 3 convolution. For each channel i, the input sequence is x_i{n}＝{x_i0，x_i1，x_i2} and the convolution coefficient sequence is h_i{n}＝{h_i0，h_i1，h_i2}. The inputs are assigned as follows

X0←x₀₀，X2←x₀₁，X4←x₀₂；H00←h₀₀，H01←h₀₁，H02←h₀₂；

X6←x₁₀，X7←x₁₁，X8←x₁₂；H10←h₁₀，H11←h₁₁，H12←h₁₂；

X1←x₂₀，X3←x₂₁，X5←x₂₂；H20←h₂₀，H21←h₂₁，H22←h₂₂；

When mode selection module A is set to 1 and mode selection module B is set to 0, the circuit performs single-channel 5 × 5 convolution. The single-channel input data are still converted into 6-channel parallel input by the serial-to-parallel pre-processing circuit. The input sequence of the general convolution kernel is then x{n}＝{x₀，x₁，x₂，x₃，x₄，x₅} and the parameter sequence is h{n}＝{h₀，h₁，h₂，h₃，h₄，0}. The 5 × 5 convolution is realized as the special case of a 6 × 6 convolution in which the coefficient h₅ is 0, so the inputs are assigned as follows

X0←x₀；H00←h₀

X2←x₂；H01←h₂

X4←x₄；H02←h₄

X6←z；H10←h₀+h₁

X7←z；H11←h₂+h₃

X8←z；H12←h₄

X1←x₁；H20←h₁

X3←x₃；H21←h₃

X5←x₅；H22←0

When mode selection module A is set to 1 and mode selection module B is set to 1, the circuit performs single-channel 11 × 11 convolution. The single-channel input data are still converted into 6-channel parallel input by the serial-to-parallel pre-processing circuit. The input sequence is x{n}＝{x₀，x₁，…，x₅}，{x₆，x₇，…，x₁₁} and the parameter sequence is h{n}＝{h₀，h₁，…，h₁₀，0}. The 11 × 11 convolution is realized as the special case of a 12 × 12 convolution in which the coefficient h₁₁ is 0, and the inputs are assigned as follows

X0←{x₀，x₆}；H00←{h₀，h₆}

X2←{x₂，x₈}；H01←{h₂，h₈}

X4←{x₄，x₁₀}；H02←{h₄，h₁₀}

X6←z；H10←{h₀+h₁，h₆+h₇}

X7←z；H11←{h₂+h₃，h₈+h₉}

X8←z；H12←{h₄+h₅，h₁₀}

X1←{x₁，x₇}；H20←{h₁，h₇}

X3←{x₃，x₉}；H21←{h₃，h₉}

X5←{x₅，x₁₁}；H22←{h₅，0}

When mode selection module A is set to 1 and mode selection module B is set to 1, the circuit can also realize a single-channel 7 × 7 convolution mode through a change of the input assignment. The single-channel input data are still converted into 6-channel parallel input by the serial-to-parallel pre-processing circuit. The input sequence is x{n}＝{x₀，x₁，…，x₅}，{x₆，x₇，…，x₁₁} and the parameter sequence is h{n}＝{h₀，h₁，…，h₆，0，0，0，0，0}. The 7 × 7 convolution is realized as the special case of a 12 × 12 convolution in which the coefficients h₇，…，h₁₁ are 0, and the inputs are assigned as follows

X0←{x₀，x₆}；H00←{h₀，h₆}

X2←{x₂，x₈}；H01←{h₂，0}

X4←{x₄，x₁₀}；H02←{h₄，0}

X6←z；H10←{h₀+h₁，h₆}

X7←z；H11←{h₂+h₃，0}

X8←z；H12←{h₄+h₅，0}

X1←{x₁，x₇}；H20←{h₁，0}

X3←{x₃，x₉}；H21←{h₃，0}

X5←{x₅，x₁₁}；H22←{h₅，0}
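The coefficient assignments listed for the 5 × 5, 11 × 11 and 7 × 7 modes follow one pattern: zero-pad the coefficients to 6 or 12 taps, split them into six phases, and feed the even phases, the odd phases and their sums to the three sub-filter groups. The sketch below is a hypothetical helper (`coefficient_map` is not a name from the patent) that reproduces the H00 to H22 columns for the coefficient inputs only.

```python
def coefficient_map(h, taps):
    """Map coefficients onto sub-filter inputs H00..H22.

    taps = 1 for the 6-tap mode (5 x 5),
    taps = 2 for the 12-tap mode (7 x 7 and 11 x 11).
    """
    n = 6 * taps
    hp = list(h) + [0] * (n - len(h))        # zero-pad to 6 or 12 taps
    phase = [hp[k::6] for k in range(6)]     # phase[k] = [h_k, h_(k+6)]
    even = [phase[0], phase[2], phase[4]]    # components of H'0
    odd = [phase[1], phase[3], phase[5]]     # components of H'1
    out = {}
    for j in range(3):
        out[f"H0{j}"] = even[j]
        out[f"H1{j}"] = [a + b for a, b in zip(even[j], odd[j])]
        out[f"H2{j}"] = odd[j]
    return out

# Powers of two make each sum of coefficients easy to recognise.
h5 = [2 ** k for k in range(5)]     # h0..h4
h7 = [2 ** k for k in range(7)]     # h0..h6
h11 = [2 ** k for k in range(11)]   # h0..h10

m5 = coefficient_map(h5, taps=1)
m7 = coefficient_map(h7, taps=2)
m11 = coefficient_map(h11, taps=2)
print(m5["H12"], m7["H10"], m11["H22"])
```

With this encoding, H12 in the 5 × 5 mode is h₄ alone because h₅ is 0, and the second tap of H22 is always 0 in the 12-tap modes.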

In summary, if only the 3 × 3 and 5 × 5 modes are supported, the structure of the invention uses 18 multipliers, 42 adders and 7 delay units, which can save about 50% of the hardware resources. By using second-order filters in the sub-filter structure, at a cost of 35 multipliers, 59 adders and 25 delay units, it efficiently implements in hardware the convolution calculations of all four sizes commonly used in convolutional neural networks. Given the high level of circuit integration available today, this provides an efficient general-purpose neural-network convolution kernel that supports 3 × 3, 5 × 5, 7 × 7 and 11 × 11 convolution kernels.

Claims

1. A 6-parallel fast FIR filter, having a structure of cascaded 3-parallel fast FIR filters, comprising:

a mode selection module for selecting one of the four convolution calculation modes 3 × 3, 5 × 5, 7 × 7 and 11 × 11;

a data input module for converting serial input data into the parallel input of the selected mode and feeding it to the input channels of that mode;

a fast convolution module for performing reduced-complexity fast convolution on the parallel input data;

a data output module for outputting the parallel data of the selected mode;

wherein the fast convolution module further comprises:

a 2-parallel structure cascaded with 3-parallel fast FIR filter substructures;

the first-level 2-parallel structure comprises 3 pre-adders, 9 post-adders, 1 data register and 3 second-level 3-parallel fast FIR filter substructures;

the 3 second-level 3-parallel fast FIR filter substructures each comprise 3 pre-adders, 7 post-adders and 2 data registers, and together comprise 18 second-order reconfigurable FIR sub-filters;

and each of the 6 second-order reconfigurable FIR sub-filters of a substructure comprises 2 multipliers, 1 adder, 1 data register and one 2-to-1 MUX unit.

2. In the 6-parallel fast FIR filter according to claim 1, a method of implementing a 5 × 5 fast convolution algorithm;

a method of implementing a 7 × 7 fast convolution algorithm;

a method of implementing an 11 × 11 fast convolution algorithm.

3. In the 6-parallel fast FIR filter according to claim 1, a general reconfigurable FIR sub-filter that can be switched between a first-order mode and a second-order mode.