CN107645287A

CN107645287A - Size-configurable convolution hardware implementation based on a cascaded 6-parallel fast finite impulse response filter structure

Info

Publication number: CN107645287A
Application number: CN201710396331.5A
Authority: CN
Inventors: 王中风; 王昊楠; 林军
Original assignee: Nanjing University
Current assignee: Nanjing Fengxing Technology Co Ltd
Priority date: 2017-05-24
Filing date: 2017-05-24
Publication date: 2018-01-30
Anticipated expiration: 2037-05-24
Also published as: CN107645287B

Abstract

The invention discloses a size-configurable convolution hardware implementation based on a cascaded 6-parallel fast finite impulse response (FIR) filter structure. The structure can perform convolution at four kernel sizes, 3*3, 5*5, 7*7 and 11*11, reduces the computational complexity of convolution, and improves throughput through its 6-parallel organization. The invention first describes the 2-parallel and 3-parallel fast FIR algorithm structures, and then derives a 6-parallel fast FIR algorithm (FFA) by cascading 3-parallel sub-structures inside a 2-parallel structure. On the basis of the 6-parallel FFA, and using configurable sub-filters, a fast convolution hardware architecture that supports all four sizes (3*3, 5*5, 7*7 and 11*11) is designed. Compared with a traditional 6-parallel FIR filter at the same throughput, the algorithm saves 50% of the multiplications at the cost of some additional additions. Because a multiplier occupies far more area and consumes far more power than an adder in hardware, the architecture can therefore save 50% of area and power. The invention can be used wherever convolution at several typical sizes (3*3, 5*5, 7*7 and 11*11) is required, such as convolutional neural networks, computer vision and wireless communication, either to raise the effective throughput of an existing filter or to reduce its power consumption.

Description

Size-configurable convolution hardware implementation based on a cascaded 6-parallel fast finite impulse response filter structure

Technical field

The present invention relates to the fields of integrated circuits and machine learning, and more particularly to a 6-parallel fast FIR filter structure and to a general-purpose hardware circuit that uses it to perform convolution at all four sizes found in convolutional neural networks: 3*3, 5*5, 7*7 and 11*11.

Background art

Convolutional neural networks (CNNs) are among the most studied and most widely applied machine learning algorithms today. Convolution is the part of a CNN that consumes the most computing resources. In hardware, a convolution operation is realized as repeated multiply-accumulate calculations, and a multiplier is an expensive resource whose area and power consumption are more than ten times those of an adder, so optimizing the hardware implementation of convolution is highly worthwhile. The overwhelming majority of convolutional networks use 3*3 or 5*5 kernels, a small number use the larger 7*7 and 11*11 kernels, and other sizes have not yet been used effectively.

An N-tap FIR filter is expressed in the time domain as

y(n)=Σ_{i=0}^{N-1} h(i)x(n-i), n=0, 1, 2, ...

and in the z domain as

Y(z)=H(z)X(z)=[Σ_{n=0}^{N-1} h(n)z^-n][Σ_{n=0}^{∞} x(n)z^-n]

where the sequence {x(n)} is an input sequence of unbounded length and the sequence {h(n)} contains the coefficients of the length-N FIR filter. It can be seen that, if {h(n)} is regarded as the coefficients of a discrete convolution of size N, the FIR filter implements the N × N convolution calculation.
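As a point of reference for the fast structures discussed later, the time-domain definition of an N-tap FIR filter (each output is a sum of N products of coefficients and delayed inputs) can be sketched in a few lines of Python. The function name `fir_direct` and the sample data are illustrative only and do not come from the patent.

```python
def fir_direct(x, h):
    """Direct-form FIR: y(n) = sum over i of h(i) * x(n - i), for a finite input x."""
    n_taps = len(h)
    y = []
    for n in range(len(x) + n_taps - 1):
        acc = 0
        for i in range(n_taps):
            if 0 <= n - i < len(x):
                acc += h[i] * x[n - i]  # one multiply-accumulate per tap
        y.append(acc)
    return y


x = [1, 2, 3, 4]   # illustrative input samples
h = [1, 0, -1]     # illustrative 3-tap kernel
print(fir_direct(x, h))
```

Every output sample costs N multiplications in this form, which is the cost the fast FIR algorithm sets out to reduce.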

Applying algorithmic strength reduction to finite impulse response (FIR) filters yields the fast FIR algorithm (FFA), whose core idea is to reduce hardware complexity by sharing sub-structures.

Summary of the invention

The essential novel features of the present invention are:

● building on existing parallel fast finite impulse response (FIR) algorithms and on the scheme of cascading small FFA blocks, a hardware implementation of the 6-parallel fast FIR algorithm (FFA) is proposed for the first time;

● on the basis of the 6-parallel fast convolution kernel, a general-purpose fast convolution hardware circuit is designed that is compatible with all four kernel sizes commonly used in convolutional neural networks: 3*3, 5*5, 7*7 and 11*11.

The theoretical analysis of the present invention is as follows:

In the z domain, an N-tap FIR filter is expressed as

Y(z)=X(z)H(z)

First, we discuss the 2-parallel fast FIR filter that forms the first-level structure.

The input sequence {x(0), x(1), x(2), x(3), ...} can be split into its even-indexed and odd-indexed terms as follows

X(z)=x(0)+x(1)z^-1+x(2)z^-2+x(3)z^-3+...

=x(0)+x(2)z^-2+x(4)z^-4+...

+z^-1[x(1)+x(3)z^-2+x(5)z^-4+...]

=X₀+z^-1X₁

where X₀ and X₁ are the z-transforms of x(2k) and x(2k+1), respectively. Similarly, the N-tap filter H(z) can be split into two parts

H (z)=H₀+z^-1H₁

where H₀(z²) and H₁(z²) both have length N/2 and correspond to the even and odd sub-filters. Expressing the output sequence y(n) in terms of its even and odd parts as well, the calculation is as follows

Y (z)=Y₀+z^-1Y₁

=(X₀+z^-1X₁)(H₀+z^-1H₁)

=(X₀H₀+z^-2X₁H₁)+z^-1(X₁H₀+X₀H₁)

Wherein

Y₀=X₀H₀+z^-2X₁H₁

Y₁=X₁H₀+X₀H₁

Applying the fast FIR algorithm (FFA) to this first-level structure gives the 2-parallel fast FIR filter. Many 2-parallel FFA structures can be obtained; a typical one is as follows

Y₀=X₀H₀+z^-2X₁H₁

Y₁=(H₀+H₁)(X₀+X₁)-X₀H₀-X₁H₁
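The 2-parallel FFA expressions can be checked numerically. The sketch below treats each phase (X₀, X₁, H₀, H₁) as a short sequence, forms the three sub-filter products, and compares the re-interleaved result with an ordinary convolution. The helper names `conv` and `ffa2` are illustrative, and the sketch assumes the input and the kernel both have even length.

```python
def conv(a, b):
    """Full linear convolution, i.e. the product of two polynomials in z^-1."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def ffa2(x, h):
    """2-parallel FFA; len(x) and len(h) are assumed to be even."""
    x0, x1 = x[0::2], x[1::2]          # X0 <-> x(2k), X1 <-> x(2k+1)
    h0, h1 = h[0::2], h[1::2]          # H0: even taps, H1: odd taps
    p0 = conv(x0, h0)                  # X0*H0
    p1 = conv(x1, h1)                  # X1*H1
    xs = [a + b for a, b in zip(x0, x1)]
    hs = [a + b for a, b in zip(h0, h1)]
    p2 = conv(xs, hs)                  # (X0+X1)*(H0+H1)
    # Y0 = X0*H0 + z^-2 * X1*H1; z^-2 is a one-sample delay within a phase
    y0 = [a + b for a, b in zip(p0 + [0], [0] + p1)]
    # Y1 = (H0+H1)(X0+X1) - X0*H0 - X1*H1
    y1 = [c - a - b for a, b, c in zip(p0, p1, p2)]
    y = [0] * (len(y0) + len(y1))
    y[0::2], y[1::2] = y0, y1          # re-interleave even and odd outputs
    return y


x = [1, 2, 3, 4, 5, 6]   # illustrative data
h = [1, -2, 3, 4]
print(ffa2(x, h) == conv(x, h))
```

Only three half-length products are formed instead of the four (X₀H₀, X₁H₁, X₁H₀, X₀H₁) needed by the direct polyphase form, at the cost of extra additions.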

Next we discuss the 3-parallel fast FIR filter structure. For a three-phase polyphase decomposition, the input sequence x(n) and the filter coefficient sequence h(n) can be decomposed as

X (z)=X₀(z³)+z^-1X₁(z³)+z^-2X₂(z³)

H (z)=H₀(z³)+z^-1H₁(z³)+z^-2H₂(z³)

where X₀(z³), X₁(z³) and X₂(z³) correspond to the time-domain sequences x(3k), x(3k+1) and x(3k+2), respectively, and H₀(z³), H₁(z³) and H₂(z³) correspond to the three sub-filters. The output of the system is then

Y(z)=Y₀(z³)+z^-1Y₁(z³)+z^-2Y₂(z³)

=(X₀+z^-1X₁+z^-2X₂)(H₀+z^-1H₁+z^-2H₂)

In theory, many optimized 3-parallel fast FIR filter structures can be obtained; in matrix form they can be written as

Y=QHPX

where P and Q are the pre-processing and post-processing matrices, respectively, and the matrix H is the sub-filter matrix. The hardware block diagram of a 3-parallel FFA follows readily from this formula; Fig. 1 shows the most commonly used 3-parallel FFA structure as an example.
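Since Fig. 1 is not reproduced in this text, the following Python sketch illustrates one widely used 3-parallel FFA with six sub-filter products, three input pre-adders and seven post-adders. It is assumed here, not confirmed by the text, that this corresponds to the structure in Fig. 1. The helper names (`conv`, `add`, `ffa3`) are illustrative, and the input and kernel lengths are assumed to be multiples of 3.

```python
def conv(a, b):
    """Full linear convolution (polynomial product)."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def add(a, b, sign=1):
    """a + sign*b, zero-padding the shorter sequence at the end."""
    n = max(len(a), len(b))
    a = a + [0] * (n - len(a))
    b = b + [0] * (n - len(b))
    return [p + sign * q for p, q in zip(a, b)]


def ffa3(x, h):
    """3-parallel FFA; len(x) and len(h) are assumed to be multiples of 3."""
    x0, x1, x2 = x[0::3], x[1::3], x[2::3]
    h0, h1, h2 = h[0::3], h[1::3], h[2::3]
    x01, x12 = add(x0, x1), add(x1, x2)   # pre-adders (role of P)
    x012 = add(x01, x2)
    h01, h12 = add(h0, h1), add(h1, h2)   # coefficient sums, precomputable
    h012 = add(h01, h2)
    m0, m1, m2 = conv(x0, h0), conv(x1, h1), conv(x2, h2)   # six sub-filters
    m3, m4, m5 = conv(x01, h01), conv(x12, h12), conv(x012, h012)
    t0 = add(m0, [0] + m2, -1)            # H0X0 - z^-3 H2X2
    t1 = add(m3, m1, -1)                  # (H0+H1)(X0+X1) - H1X1
    t2 = add(m4, m1, -1)                  # (H1+H2)(X1+X2) - H1X1
    y0 = add(t0, [0] + t2)                # Y0 = t0 + z^-3 t2
    y1 = add(t1, t0, -1)                  # Y1 = t1 - t0
    y2 = add(add(m5, t1, -1), t2, -1)     # Y2 = m5 - t1 - t2
    y = [0] * (len(y0) + len(y1) + len(y2))
    y[0::3], y[1::3], y[2::3] = y0, y1, y2
    return y


x = list(range(1, 10))        # illustrative data
h = [2, -1, 3, 0, 5, 1]
print(ffa3(x, h) == conv(x, h))
```

In the Y=QHPX view, the pre-adders play the role of P, the six products form the diagonal of H, and the seven post-additions play the role of Q.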

A 6-parallel FFA structure can be obtained by nesting any form of 3-parallel sub-structure inside any form of 2-parallel structure. Cascading the two most typical FFA structures gives the output expression

Y=Y₀+z^-1Y₁+z^-2Y₂+z^-3Y₃+z^-4Y₄+z^-5Y₅

=(X′₀+z^-1X′₁)(H′₀+z^-1H′₁)

=[X′₀H′₀+z^-2X′₁H′₁]+z^-1[(X′₀+X′₁)(H′₀+H′₁)-X′₀H′₀-X′₁H′₁]

Here the structure of the 2-parallel fast FIR filter is applied first, where

X′₀=(X₀+z^-2X₂+z^-4X₄)

X′₁=(X₁+z^-2X₃+z^-4X₅)

H′₀=(H₀+z^-2H₂+z^-4H₄)

H′₁=(H₁+z^-2H₃+z^-4H₅)

Each product term then corresponds to one 3-parallel FFA, and all three have the same output structure. Let the outputs of the three sub-filters be

X′₀H′₀=a₀+a₁+a₂=a₀+z^-2b₁+z^-4b₂

X′₁H′₁=a₃+a₄+a₅=a₃+z^-2b₄+z^-4b₅

(X′₀+X′₁)(H′₀+H′₁)=a₆+a₇+a₈=a₆+z^-2b₇+z^-4b₈

Note that the three terms of each of the three sub-filter output expressions carry the factors z⁰, z^-2 and z^-4, i.e. each has a 3-parallel output structure. Substituting them into the output expression of the parent structure, i.e. the 2-parallel structure, gives

Y₀=a₀+z^-6a₅

Y₁=-a₀-a₃+a₆

Y₂=a₁+a₃

Y₃=-a₁-a₄+a₇

Y₄=a₂+a₄

Y₅=-a₂-a₅+a₈
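The cascade can be sketched in Python by using a 3-parallel FFA wherever the 2-parallel FFA needs a product. This is a behavioural model under stated assumptions, not the circuit of Fig. 2: the helper names are illustrative, the inner 3-parallel structure is assumed to be a common six-product variant, and the model does not reproduce the a/b term wiring. It recombines whole phase sequences, so each aᵢ in the output expressions is read as the corresponding phase component. The input and kernel lengths are assumed to be multiples of 6.

```python
def conv(a, b):
    """Full linear convolution (polynomial product)."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def add(a, b, sign=1):
    """a + sign*b, zero-padding the shorter sequence at the end."""
    n = max(len(a), len(b))
    a = a + [0] * (n - len(a))
    b = b + [0] * (n - len(b))
    return [p + sign * q for p, q in zip(a, b)]


def subfilter(a, b):
    """One sub-filter product; counts how many sub-filters are used."""
    subfilter.count += 1
    return conv(a, b)


subfilter.count = 0


def ffa3(x, h):
    """Inner 3-parallel FFA (six sub-filter products)."""
    x0, x1, x2 = x[0::3], x[1::3], x[2::3]
    h0, h1, h2 = h[0::3], h[1::3], h[2::3]
    x01, x12 = add(x0, x1), add(x1, x2)
    h01, h12 = add(h0, h1), add(h1, h2)
    m0, m1, m2 = subfilter(x0, h0), subfilter(x1, h1), subfilter(x2, h2)
    m3, m4 = subfilter(x01, h01), subfilter(x12, h12)
    m5 = subfilter(add(x01, x2), add(h01, h2))
    t0 = add(m0, [0] + m2, -1)
    t1 = add(m3, m1, -1)
    t2 = add(m4, m1, -1)
    y0, y1 = add(t0, [0] + t2), add(t1, t0, -1)
    y2 = add(add(m5, t1, -1), t2, -1)
    y = [0] * (len(y0) + len(y1) + len(y2))
    y[0::3], y[1::3], y[2::3] = y0, y1, y2
    return y


def ffa6(x, h):
    """Outer 2-parallel FFA whose three products are 3-parallel FFAs."""
    x0, x1 = x[0::2], x[1::2]              # X'0, X'1
    h0, h1 = h[0::2], h[1::2]              # H'0, H'1
    p0 = ffa3(x0, h0)                      # X'0 H'0
    p1 = ffa3(x1, h1)                      # X'1 H'1
    p2 = ffa3(add(x0, x1), add(h0, h1))    # (X'0+X'1)(H'0+H'1)
    y_even = add(p0, [0] + p1)             # X'0H'0 + z^-2 X'1H'1
    y_odd = add(add(p2, p0, -1), p1, -1)
    y = [0] * (len(y_even) + len(y_odd))
    y[0::2], y[1::2] = y_even, y_odd
    return y


x = list(range(1, 13))       # illustrative input, 12 samples
h = [1, 2, 3, 4, 5, 0]       # 5-tap kernel zero-padded to 6 taps
print(ffa6(x, h) == conv(x, h), subfilter.count)
```

The counter shows that three 3-parallel sub-structures with six products each use 18 sub-filters in total.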

The circuit of the 6-parallel fast FIR filter can then be built from the output expressions. The 6-parallel general-purpose convolution kernel contains three 3-parallel FIR filters, so the sub-filter part of the circuit can perform three independent 3 × 3 convolutions (three channels) simultaneously, while the filter as a whole can perform a single-channel 5 × 5 convolution. By using reconfigurable 2nd-order FIR sub-filters, a hardware implementation compatible with all four convolution sizes (3 × 3, 5 × 5, 7 × 7 and 11 × 11) is obtained. Mode selection is implemented by adding MUX elements. The detailed circuit schematic is shown in Fig. 2, and the detailed circuit schematic of the reconfigurable 2nd-order FIR sub-filter is shown in Fig. 3.

The output module outputs 6 results in parallel at a time. A traditional direct-form 6-tap FIR filter needs 36 multiplications and 30 additions to compute 6 output results, whereas the 6-parallel fast FIR filter of the present invention needs 18 multiplications and 42 additions. In a hardware implementation the area and power consumed by a multiplier are far greater than those of an adder, so compared with the traditional direct-form FIR filter the 6-parallel fast FIR filter of the present invention can save 50% of the hardware resources. On this basis, a general-purpose circuit is realized that supports all four convolution sizes used in convolutional neural networks.
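The operation counts quoted above can be retraced with simple arithmetic. The adder breakdown uses the figures given in claim 4 (3 pre-adders and 9 post-adders in the first-level structure, 3 pre-adders and 7 post-adders in each second-level sub-structure). The figure of six sub-filter products per 3-parallel sub-structure is an assumption that is consistent with the stated total of 18.

```latex
\begin{aligned}
M_{\text{direct}} &= 6 \times 6 = 36, & A_{\text{direct}} &= 6 \times (6 - 1) = 30,\\
M_{\text{FFA}} &= 3 \times 6 = 18, & A_{\text{FFA}} &= 3 + 9 + 3 \times (3 + 7) = 42,\\
1 - \frac{M_{\text{FFA}}}{M_{\text{direct}}} &= 1 - \frac{18}{36} = 50\%.
\end{aligned}
```

The 50% figure therefore refers to multipliers; the adder count rises from 30 to 42.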

Brief description of the drawings

Fig. 1 is a structural diagram of the 3-parallel fast FIR filter;

Fig. 2 is the detailed circuit diagram of the general-purpose 6-parallel fast FIR filter;

Fig. 3 is the circuit diagram of the 2nd-order reconfigurable FIR sub-filter;

Fig. 4 is a schematic diagram of the modules of the 6-parallel fast FIR filter.

Embodiment

When 0 is input to mode selection module A and 0 is input to mode selection module B, the circuit performs three-channel 3 × 3 convolution. The input sequences are x_i{n}={x_i0, x_i1, x_i2} and the convolution coefficient sequences are h_i{n}={h_i0, h_i1, h_i2}, i=0, 1, 2. The input mapping is

X0←x₀₀, X2←x₀₁, X4←x₀₂; H00←h₀₀, H01←h₀₁, H02←h₀₂;

X6←x₁₀, X7←x₁₁, X8←x₁₂; H10←h₁₀, H11←h₁₁, H12←h₁₂;

X1←x₂₀, X3←x₂₁, X5←x₂₂; H20←h₂₀, H21←h₂₁, H22←h₂₂;

When 1 is input to mode selection module A and 0 is input to mode selection module B, the circuit performs single-channel 5 × 5 convolution. The single-channel input sequence is converted into 6 parallel inputs by a serial-to-parallel front-end signal processing circuit. The input sequence of the general-purpose convolution kernel is x{n}={x₀, x₁, x₂, x₃, x₄, x₅} and the coefficient sequence is h{n}={h₀, h₁, h₂, h₃, h₄, 0}. The 5 × 5 convolution is realized as the special case of a 6 × 6 convolution in which the coefficient h₅=0, so the input mapping is

X0←x₀；H00←h₀

X2←x₂；H01←h₂

X4←x₄；H02←h₄

X6←z；H10←h₀+h₁

X7←z；H11←h₂+h₃

X8←z；H12←h₄

X1←x₁；H20←h₁

X3←x₃；H21←h₃

X5←x₅；H22←0
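The coefficient side of the 5 × 5 mapping follows a regular pattern: the kernel is zero-padded to six taps, the H0k slots take the even taps, the H2k slots take the odd taps, and the H1k slots take their pre-computed sums. A minimal Python sketch of that pattern follows; the function name `map_coefficients_5x5` is hypothetical, and the slot names are taken from the table above.

```python
def map_coefficients_5x5(h):
    """Return the H00..H22 slot values for a 5-tap kernel h = [h0, ..., h4]."""
    if len(h) != 5:
        raise ValueError("expected exactly 5 coefficients")
    h6 = list(h) + [0]                  # h5 = 0 turns the 6-tap case into 5 taps
    table = {}
    for k in range(3):
        even, odd = h6[2 * k], h6[2 * k + 1]
        table[f"H0{k}"] = even          # even-phase sub-filter coefficient
        table[f"H1{k}"] = even + odd    # pre-added coefficient for the sum branch
        table[f"H2{k}"] = odd           # odd-phase sub-filter coefficient
    return table


print(map_coefficients_5x5([1, 2, 3, 4, 5]))
```

Because h₅=0, slot H12 reduces to h₄ and slot H22 to 0, matching the table.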

When 1 is input to mode selection module A and 1 is input to mode selection module B, the circuit performs single-channel 11 × 11 convolution. The single-channel input sequence is again converted into 6 parallel inputs by the serial-to-parallel front-end signal processing circuit. The input sequence is x{n}={x₀, x₁, ..., x₅}, {x₆, x₇, ..., x₁₁} and the coefficient sequence is h{n}={h₀, h₁, ..., h₁₀, 0}. The 11 × 11 convolution is realized as the special case of a 12 × 12 convolution in which the coefficient h₁₁=0, and the input mapping is

X0←{x₀, x₆}；H00←{h₀, h₆}

X2←{x₂, x₈}；H01←{h₂, h₈}

X4←{x₄, x₁₀}；H02←{h₄, h₁₀}

X6←z；H10←{h₀+h₁, h₆+h₇}

X7←z；H11←{h₂+h₃, h₈+h₉}

X8←z；H12←{h₄+h₅, h₁₀}

X1←{x₁, x₇}；H20←{h₁, h₇}

X3←{x₃, x₉}；H21←{h₃, h₉}

X5←{x₅, x₁₁}；H22←{h₅, 0 }

When 1 is input to mode selection module A and 1 is input to mode selection module B, the circuit can also realize the single-channel 7 × 7 convolution mode by changing the input mapping. The single-channel input sequence is again converted into 6 parallel inputs by the serial-to-parallel front-end signal processing circuit. The input sequence is x{n}={x₀, x₁, ..., x₅}, {x₆, x₇, ..., x₁₁} and the coefficient sequence is h{n}={h₀, h₁, ..., h₆, 0, 0, 0, 0, 0}. The 7 × 7 convolution is realized as the special case of a 12 × 12 convolution in which the coefficients h₇, ..., h₁₁ are all 0, and the input mapping is

X0←{x₀, x₆}；H00←{h₀, h₆}

X2←{x₂, x₈}；H01←{h₂, 0 }

X4←{x₄, x₁₀}；H02←{h₄, 0 }

X6←z；H10←{h₀+h₁, h₆}

X7←z；H11←{h₂+h₃, 0 }

X8←z；H12←{h₄+h₅, 0 }

X1←{x₁, x₇}；H20←{h₁, 0 }

X3←{x₃, x₉}；H21←{h₃, 0 }

X5←{x₅, x₁₁}；H22←{h₅, 0 }
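The 11 × 11 and 7 × 7 coefficient tables share one rule: zero-pad the kernel to 12 taps, split it into a low half (h₀ to h₅) and a high half (h₆ to h₁₁), and give each 2nd-order sub-filter slot one value from each half. The Python sketch below reproduces that rule; the function name `map_coefficients_paired` and the tuple representation of {low, high} are assumptions made for illustration.

```python
def map_coefficients_paired(h):
    """Map a 7- or 11-tap kernel onto the nine {low, high} coefficient slots."""
    if len(h) not in (7, 11):
        raise ValueError("expected 7 or 11 coefficients")
    h12 = list(h) + [0] * (12 - len(h))     # zero-pad to the 12-tap case
    lo, hi = h12[:6], h12[6:]               # taps 0..5 and taps 6..11
    table = {}
    for k in range(3):
        e, o = 2 * k, 2 * k + 1
        table[f"H0{k}"] = (lo[e], hi[e])                  # even-phase pair
        table[f"H1{k}"] = (lo[e] + lo[o], hi[e] + hi[o])  # pre-added pair
        table[f"H2{k}"] = (lo[o], hi[o])                  # odd-phase pair
    return table


seven = map_coefficients_paired([1, 2, 3, 4, 5, 6, 7])
print(seven["H00"], seven["H10"], seven["H22"])
```

For a 7-tap kernel every high-half entry except the one derived from h₆ is zero, which is why most slots in the 7 × 7 table have 0 as their second element.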

In summary, if only the 3 × 3 and 5 × 5 modes are supported, the structure uses 18 multipliers, 42 adders and 7 delay units, and can save 50% of the hardware resources. By using 2nd-order filters in the sub-filter structure, an efficient hardware implementation of all four convolution sizes commonly used in convolutional neural networks is obtained, using 35 multipliers, 59 adders and 25 delay units. Given today's very high level of large-scale circuit integration, this realizes an efficient general-purpose neural network convolution kernel that supports all four kernel sizes: 3 × 3, 5 × 5, 7 × 7 and 11 × 11.

Claims

1. A 6-parallel fast FIR filter, having a structure formed by cascading 3-parallel fast FIR filters, comprising:

a mode selection module, for selecting one of four convolution modes: 3*3, 5*5, 7*7 and 11*11;

a data input module, for converting serial input data into the parallel input format of the selected mode and feeding it to the input channels of that mode;

a fast convolution module, for performing reduced-complexity fast convolution on the parallel input data;

a data output module, for outputting the parallel data of the selected mode.

2. The 6-parallel fast FIR filter according to claim 1, providing a method for implementing the 5*5 fast convolution algorithm;

a method for implementing the 7*7 fast convolution algorithm;

a method for implementing the 11*11 fast convolution algorithm.

3. The 6-parallel fast FIR filter according to claim 1, implementing a general-purpose reconfigurable FIR sub-filter that can be switched between a 1st-order mode and a 2nd-order mode.

4. The 6-parallel fast FIR filter according to claim 1, wherein the fast convolution module further comprises:

● a 2-parallel fast FIR filter structure cascaded with 3-parallel fast FIR filter sub-structures;

● the first-level 2-parallel structure, comprising 3 pre-adders, 9 post-adders, 1 data register and 3 second-level 3-parallel fast FIR filter sub-structures;

● 3 second-level 3-parallel fast FIR filter sub-structures, comprising respectively 3 pre-adders, 7 post-adders, 2 data registers and 18 second-order reconfigurable FIR sub-filters;

● 6 second-order reconfigurable FIR sub-filters, each comprising 2 multipliers, 1 adder, 1 data register and one 2-to-1 MUX unit.