CN106951961B - Coarse-grained reconfigurable convolutional neural network accelerator and system - Google Patents

Coarse-grained reconfigurable convolutional neural network accelerator and system

Info

Publication number
CN106951961B
CN106951961B (application CN201710104029.8A)
Authority
CN
China
Prior art keywords
unit
coarse-grained
adder unit
convolutional neural
neural networks
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710104029.8A
Other languages
Chinese (zh)
Other versions
CN106951961A (en)
Inventor
袁哲
刘勇攀
杨华中
岳金山
李金阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN201710104029.8A (2017-02-24)
Publication of CN106951961A (2017-07-14)
Application granted
Publication of CN106951961B (2019-11-26)
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 1/00 General purpose image data processing
    • G06T 1/20 Processor architectures; Processor configuration, e.g. pipelining
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 1/00 General purpose image data processing
    • G06T 1/60 Memory management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2200/00 Indexing scheme for image data processing or generation, in general
    • G06T 2200/28 Indexing scheme for image data processing or generation, in general involving image processing hardware

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Complex Calculations (AREA)

Abstract

The present invention provides a coarse-grained reconfigurable convolutional neural network accelerator and system. The accelerator comprises multiple processing unit clusters; each processing unit cluster contains several basic computing units connected through a sub-adder unit, and the sub-adder units of the multiple clusters are each connected to a mother adder unit. Each sub-adder unit generates the partial sum of its adjacent basic computing units, and the mother adder unit accumulates the outputs of the sub-adder units. Through coarse-grained reconfiguration, the invention links different weight and image tracks via SRAM or other interconnect units to realize different convolution kernel processing structures, efficiently supporting networks and convolution kernels of different sizes while greatly reducing reconfiguration overhead.

Description

Coarse-grained reconfigurable convolutional neural network accelerator and system
Technical field
The present invention relates to the field of energy-efficient hardware accelerator design, and more particularly to a coarse-grained reconfigurable convolutional neural network accelerator and system.
Background art
A convolutional neural network (Convolutional Neural Network, CNN) is a feedforward neural network whose artificial neurons respond to surrounding units within a local receptive field; CNNs perform outstandingly in large-scale image processing. They have become the most common algorithms in fields such as image recognition and speech recognition, but these methods require very large amounts of computation and therefore need dedicated accelerators. They also have good application prospects on mobile devices. However, because mobile devices are resource-constrained, the accelerators currently designed on GPU and FPGA (Field Programmable Gate Array) platforms are difficult to use on such low-power, resource-constrained platforms.
Since convolutional neural networks come in network structures and convolution kernels of many sizes, a dedicated convolutional network accelerator should support these different networks and kernels efficiently. To support this diversity, traditional accelerators can generally be divided into two classes. The first class is instruction-based accelerators, which decompose the computations of different convolution kernels into individual instructions and fetch the correct weight and image data at each moment; this approach requires a large amount of on-chip bandwidth and on-chip storage. It is fairly efficient when the network's weight data fits on chip, but when the weight data cannot be stored entirely on chip, energy efficiency declines severely. The second class supports networks and kernels of different sizes by means of fine-grained reconfigurable circuits, for example by reconfiguring a network-on-chip: each processing element is assigned an address, and every datum is sent to the corresponding address. Although this is more efficient than instruction-based accelerators when handling different convolutional neural networks, fine-grained reconfigurable circuits bring considerable extra energy and reconfiguration overhead.
In the field of large-scale computing, reconfigurable systems are a research hotspot of current computer architecture: they combine the flexibility of general-purpose processors with the efficiency of ASICs (Application Specific Integrated Circuits) and are a promising approach to large-scale computing. Traditional DSP (Digital Signal Processing) solutions suffer from low arithmetic speed, non-reconfigurable hardware structures, long development and upgrade cycles, and poor portability, and these disadvantages become even more obvious in large-scale computing. ASICs have clear advantages in performance, area, and power consumption, but the complexity of changeable and rapidly growing application demands makes ASIC design and verification difficult and development cycles long, so ASICs struggle to meet the requirement of rapid product deployment. Among programmable logic devices, although the Virtex-6 FPGA series from Xilinx achieves more than 1000 GMACS (1×10^12 multiply-accumulate operations per second) using DSP48E1 slices running at 600 MHz, when facing large-scale computing the circuit scale to be configured is too large, synthesis and configuration take too long, and the actual operating frequency is not high, making it difficult to maintain high performance while pursuing flexibility and low power consumption.
Therefore, a dedicated low-power, energy-efficient accelerator architecture is urgently needed to meet the requirements of low-power mobile devices.
Summary of the invention
The present invention provides a coarse-grained reconfigurable convolutional neural network accelerator and system that overcome, or at least partially solve, the above problems. Through coarse-grained reconfiguration, different weight and image tracks are linked by SRAM (Static Random Access Memory) or other interconnect units to realize different convolution kernel processing structures, so that networks and convolution kernels of different sizes can be supported efficiently while reconfiguration overhead is greatly reduced.
According to one aspect of the present invention, a coarse-grained reconfigurable convolutional neural network accelerator is provided, comprising multiple processing unit clusters. Each processing unit cluster contains several basic computing units connected through a sub-adder unit, and the sub-adder units of the multiple processing unit clusters are each connected to a mother adder unit. Each sub-adder unit is used to generate the partial sum of its adjacent basic computing units, and the mother adder unit is used to accumulate the outputs of the sub-adder units.
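Although the embodiment is a hardware design, the two-level accumulation hierarchy can be illustrated with a short software model. The following Python sketch is a minimal illustration, not part of the patent; the class names (SubAdder, MotherAdder) and the assumption that each basic computing unit has already produced one partial sum per cycle are ours:

```python
from dataclasses import dataclass

@dataclass
class SubAdder:
    """Sums the partial results of the basic computing units in one cluster."""
    unit_outputs: list                  # one partial sum per basic computing unit

    def partial_sum(self):
        return sum(self.unit_outputs)

@dataclass
class MotherAdder:
    """Accumulates the outputs of all sub-adder units."""
    sub_adders: list

    def total(self):
        return sum(s.partial_sum() for s in self.sub_adders)

# Four clusters of four basic computing units each, as in the preferred embodiment.
clusters = [SubAdder([1, 2, 3, 4]) for _ in range(4)]
print(MotherAdder(clusters).total())    # -> 40
```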
Preferably, each basic computing unit comprises a 3×3 convolution unit.
Preferably, there are 4 processing unit clusters arranged in a 2×2 matrix, and each processing unit cluster comprises 4 basic computing units, also arranged in a 2×2 matrix.
Preferably, each basic computing unit comprises 9 multipliers arranged in a 3×3 grid and 1 adder, and the input registers of the 3 multipliers in the same column form a shift register.
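As a behavioral illustration of this preferred structure, the sketch below models one basic computing unit: a 3×3 grid of multipliers reduced by a single adder, with each column of image registers acting as a 3-deep shift register so that image data moves downward one row per step and the shifted-out row can feed the next unit. All names and the data layout are our assumptions for illustration:

```python
class BasicUnit3x3:
    def __init__(self, weights):
        assert len(weights) == 3 and all(len(r) == 3 for r in weights)
        self.weights = weights
        self.cols = [[0, 0, 0] for _ in range(3)]   # one shift register per column

    def shift_in(self, new_row):
        """Shift every column down one place and load one new pixel per column.
        Returns the row shifted out, which can feed the unit below."""
        out_row = [col[2] for col in self.cols]
        for c, px in enumerate(new_row):
            self.cols[c] = [px, self.cols[c][0], self.cols[c][1]]
        return out_row

    def mac(self):
        """The 9 multiplies, reduced by the unit's single adder."""
        return sum(self.weights[r][c] * self.cols[c][r]
                   for r in range(3) for c in range(3))

u = BasicUnit3x3([[1, 0, -1]] * 3)                  # e.g. a vertical-edge kernel
for row in ([1, 2, 3], [4, 5, 6], [7, 8, 9]):
    u.shift_in(row)
print(u.mac())                                      # -> -6
```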
Preferably, adjacent basic computing units in each row of a processing unit cluster matrix are connected to the weight track through weight interconnect units, and adjacent basic computing units in each column are connected to the image track through image interconnect units;
The weight interconnect unit connects each basic computing unit to the weight track and, under SRAM control, selects weight data from the weight track for each basic computing unit;
The image interconnect unit connects the basic computing units to the image track and, under SRAM control, selects 3 consecutive data values from the union of the image track and the preceding basic computing unit's output.
Preferably, the multipliers and adders in each processing unit cluster are switched off when not in use, and the sub-adder units and the mother adder unit are powered off when not in use.
According to another aspect of the present invention, a coarse-grained reconfigurable convolutional neural network acceleration system is provided, comprising several of the above convolutional neural network accelerators in parallel.
The present application proposes a coarse-grained reconfigurable convolutional neural network accelerator and system. Through coarse-grained reconfiguration, different weight and image tracks are linked by SRAM or other interconnect units to realize different convolution kernel processing structures, efficiently supporting networks and convolution kernels of different sizes while greatly reducing reconfiguration overhead. This coarse-grained reconfigurable accelerator hardware architecture supports different networks with little reconfiguration overhead, and provides computing units that efficiently support the coarse-grained reconfigurable structure, an interconnect architecture that supports coarse-grained reconfiguration, and a mechanism for reconstructing large convolution kernels from small ones. Compared with a traditional reconfigurable FPGA, reconfiguration speed improves by a factor of 10^5 and energy efficiency reaches 18.8 times; compared with a traditional fine-grained reconfigurable ASIC, reconfiguration time is reduced by 81.0% and average energy efficiency improves by 80.0%.
Brief description of the drawings
Fig. 1 is a structural schematic diagram of the coarse-grained reconfigurable convolutional neural network accelerator according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the operating modes in which coarse-grained configuration supports convolution kernels of different sizes according to an embodiment of the present invention;
Fig. 3 is an equivalent circuit schematic of the accelerator architecture configured in 5×5 mode according to an embodiment of the present invention;
Fig. 4 is a schematic comparison of the efficiency of a fine-grained reconfigurable ASIC accelerator, a traditional reconfigurable FPGA, and the coarse-grained reconfigurable convolutional neural network accelerator of the present invention.
Specific embodiment
Specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings and examples. The following embodiments are intended to illustrate the present invention, not to limit its scope.
Fig. 1 shows a coarse-grained reconfigurable convolutional neural network accelerator comprising multiple processing unit clusters. Each processing unit cluster contains several basic computing units connected through a sub-adder unit (ADDB1-ADDB4 in Fig. 1), and the sub-adder units of the multiple clusters are each connected to a mother adder unit (ADDB0 in Fig. 1); the sub-adder units and the mother adder unit have identical structures. Each sub-adder unit generates the partial sum of its adjacent basic computing units, and the mother adder unit accumulates the outputs of the sub-adder units.
Granularity refers to the bit width of the operation data of the system's reconfigurable components (or reconfigurable processing units); computing units are divided into fine-grained, coarse-grained, and mixed-grained. In this embodiment, each basic computing unit contains a 3×3 convolution unit, 3×3 being the most common convolution kernel in neural networks. Because fine-grained reconfigurability brings large chip-area and power overheads, the present invention proposes an accelerator architecture that is specially optimized for 3×3 convolution kernels and supports other kernel types through coarse-grained reconfiguration. Having been specially optimized for 3×3, the accelerator processes 3×3 kernels efficiently, and since 3×3 kernels account for a large proportion of common neural networks, this markedly improves efficiency; coarse-grained reconfiguration then combines these 3×3 convolution units into larger kernels. Supporting other kernels through coarse-grained reconfiguration thus greatly reduces reconfiguration overhead without losing much performance.
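The kernel-composition idea can be checked numerically. The following sketch (our illustration, using NumPy; not code from the patent) zero-pads a 5×5 kernel to 6×6, splits it into four 3×3 tiles so that each tile maps onto one basic computing unit, and lets a sub-adder sum the four partial products; the result equals a direct 5×5 convolution:

```python
import numpy as np

def pad_and_tile(kernel5x5):
    """Zero-pad a 5x5 kernel to 6x6 and cut it into four 3x3 tiles."""
    k = np.zeros((6, 6))
    k[:5, :5] = kernel5x5
    return [k[r:r + 3, c:c + 3] for r in (0, 3) for c in (0, 3)]

def conv5x5_via_tiles(patch6x6, tiles):
    """Each basic unit multiplies its 3x3 weight tile with its 3x3 image tile;
    the sub-adder accumulates the four partial sums."""
    corners = [(0, 0), (0, 3), (3, 0), (3, 3)]
    parts = [np.sum(patch6x6[r:r + 3, c:c + 3] * t)
             for (r, c), t in zip(corners, tiles)]
    return sum(parts)                                # sub-adder output

rng = np.random.default_rng(0)
kernel = rng.integers(-2, 3, (5, 5))
patch = rng.integers(0, 10, (6, 6))
direct = np.sum(patch[:5, :5] * kernel)              # reference 5x5 convolution
assert conv5x5_via_tiles(patch, pad_and_tile(kernel)) == direct
```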
In this embodiment there are 4 processing unit clusters arranged in a 2×2 matrix, and each processing unit cluster contains 4 basic computing units. As shown in Fig. 1, NE11, NE12, NE21, NE22 and sub-adder unit ADDB1 form the first processing unit cluster; NE13, NE14, NE23, NE24 and sub-adder unit ADDB2 form the second; NE31, NE32, NE41, NE42 and sub-adder unit ADDB3 form the third; and NE33, NE34, NE43, NE44 and sub-adder unit ADDB4 form the fourth. The 4 basic computing units in each cluster are arranged in a 2×2 matrix. As shown in Fig. 1(e), the sub-adder unit has four inputs (inputs 0, 1, 2 and 3 in Fig. 1) and a buffer; the four inputs are connected to the first, second, third and fourth processing unit clusters respectively, and the buffer serves as the sub-adder unit's output (the adder output in the figure).
Preferably, each basic computing unit contains 9 multipliers (MUL) arranged in a 3×3 grid and 1 adder (ADD); the 9 multipliers and the adder can be switched off when not in use to save power. The input registers of the three multipliers in the same column form a shift register, so image data can move downward through the column. In addition, each basic computing unit has an output port through which image data can be moved out of the unit.
As shown in Fig. 1(d), adjacent basic computing units in each row of a processing unit cluster matrix are connected to the weight track through weight interconnect units (FC), and adjacent basic computing units in each column are connected to the image track through image interconnect units (IC);
The weight interconnect unit FC connects each basic computing unit to the weight track and, under the control of the SRAM (Static Random Access Memory), selects weight data from the weight track for each basic computing unit;
The image interconnect unit connects the basic computing units to the image track. Since each basic computing unit has three columns, the image interconnect unit, under SRAM control, selects three consecutive data values from the union of the image track and the preceding basic computing unit's output. When the chip needs to be reconfigured, it is only necessary to load data into the configuration SRAM to complete the reconfiguration.
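The selection performed by the image interconnect unit can be pictured as an SRAM-controlled multiplexer. The toy model below is our assumption of the behavior (the parameter cfg_offset stands in for the contents of the configuration SRAM):

```python
def image_interconnect(image_track, prev_unit_out, cfg_offset):
    """Select 3 consecutive values from the union of the image track and the
    previous unit's output; only cfg_offset changes on reconfiguration."""
    candidates = list(image_track) + list(prev_unit_out)
    return candidates[cfg_offset:cfg_offset + 3]

# e.g. route the last image-track pixel plus two pixels from the unit above
print(image_interconnect([10, 11, 12], [20, 21, 22], cfg_offset=2))
# -> [12, 20, 21]
```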
Preferably, the multipliers and adders in each processing unit cluster are switched off when not in use, and the sub-adder units and the mother adder unit are powered off when not in use, to save power.
Fig. 2 shows the operating modes in which coarse-grained configuration supports convolution kernels of different sizes. The present invention supports kernel sizes from 1×1 to 12×12, configurable as 16 kernels of size 1×1 to 3×3, 4 kernels of size 4×4 to 6×6, or 1 kernel of size 7×7 to 12×12. For example, a 5×5 kernel is formed by 4 basic computing units and one sub-adder unit, in which part of the multipliers in three of the four basic computing units can be powered down, preserving the 5×5 kernel size while saving power.
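The mode selection described above can be summarized in a few lines. This sketch is our reading of Fig. 2 (the unit counts follow the 16-unit array of the embodiment; the power-gating count is derived, not quoted from the patent):

```python
def configure(k: int):
    """Return (parallel kernels, active multipliers, total multipliers)."""
    if 1 <= k <= 3:
        kernels, units_per_kernel = 16, 1    # one basic unit per kernel
    elif 4 <= k <= 6:
        kernels, units_per_kernel = 4, 4     # one 4-unit cluster per kernel
    elif 7 <= k <= 12:
        kernels, units_per_kernel = 1, 16    # all 16 units for one kernel
    else:
        raise ValueError("supported kernel sizes are 1x1 to 12x12")
    active = kernels * k * k                 # multipliers doing useful work
    total = kernels * units_per_kernel * 9   # multipliers allocated
    return kernels, active, total

for k in (3, 5, 12):
    kernels, active, total = configure(k)
    print(f"{k}x{k}: {kernels} kernel(s), {active}/{total} multipliers active")
# in 5x5 mode, 100 of 144 multipliers are active; the rest can be power-gated
```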
Fig. 3 shows the equivalent circuit of the accelerator architecture configured in 5×5 mode. Through coarse-grained reconfiguration, taking a 5×5 kernel as an example, the accelerator forms an efficient processing structure that exploits two data-reuse modes, greatly reducing data movement and improving computational efficiency. The first is reuse within a convolution kernel: for a 5×5 kernel built by coarse-grained reconfiguration, 4 pixels can be reused between adjacent kernel positions and need not be reloaded. Meanwhile, through coarse-grained reconfiguration, each image datum can be shared by N convolution kernels until all N kernels have been processed; this inter-kernel reuse reduces the movement of image data. After all N kernels have been processed, the whole image shifts down one row and the above process repeats, which simultaneously realizes intra-kernel data reuse in the other direction.
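A rough count shows why these two reuse modes matter. The sketch below uses assumed numbers (image width 64, N = 8 resident kernels) purely for illustration: with stride 1, only one new column of k pixels is fetched per window slide, and each fetched pixel serves all N kernels:

```python
def pixel_fetches(width, k=5, n_kernels=8, reuse=True):
    windows = width - k + 1
    if not reuse:
        # every window of every kernel reloads all k*k pixels
        return windows * n_kernels * k * k
    # intra-kernel reuse: one new column (k pixels) per slide;
    # inter-kernel reuse: that column is shared by all N kernels.
    return k * k + (windows - 1) * k

print(pixel_fetches(64, reuse=False))   # 12000 pixel loads
print(pixel_fetches(64, reuse=True))    #   320 pixel loads, ~37x fewer
```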
Fig. 4 compares the efficiency of a fine-grained reconfigurable ASIC accelerator, a traditional reconfigurable FPGA, and the coarse-grained reconfigurable convolutional neural network accelerator of the present invention when applied to the AlexNet deep convolutional network, the Clarifai network model, the Overfeat algorithm, and the VGG16 deep convolutional neural network. As the figure shows, compared with a traditional reconfigurable FPGA, the present invention improves reconfiguration speed by a factor of 10^5 and reaches 18.8 times the energy efficiency; compared with a traditional fine-grained reconfigurable ASIC, it reduces reconfiguration time by 81.0% and improves average energy efficiency by 80.0%.
This embodiment also provides a coarse-grained reconfigurable convolutional neural network acceleration system comprising several of the above convolutional neural network accelerators in parallel. Since there is no data exchange between different accelerators, the speedup brought by this parallel architecture is linear.
In summary, the coarse-grained reconfigurable convolutional neural network accelerator and system proposed by the present application support different networks with little reconfiguration overhead, providing computing units that efficiently support the coarse-grained reconfigurable structure, an interconnect architecture for coarse-grained reconfiguration, and a mechanism for building large convolution kernels from small ones. Compared with a traditional reconfigurable FPGA, reconfiguration speed improves by a factor of 10^5 and energy efficiency reaches 18.8 times; compared with a traditional fine-grained reconfigurable ASIC, reconfiguration time is reduced by 81.0% and average energy efficiency improves by 80.0%.
Finally, the above embodiments are merely preferred implementations of the present application and are not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its protection scope.

Claims (6)

1. A coarse-grained reconfigurable convolutional neural network accelerator, characterized in that it comprises multiple processing unit clusters, each processing unit cluster comprising several basic computing units connected through a sub-adder unit, the sub-adder units of the multiple processing unit clusters being each connected to a mother adder unit; each sub-adder unit is used to generate the partial sum of its adjacent basic computing units, and the mother adder unit is used to accumulate the outputs of the sub-adder units; there are 4 processing unit clusters arranged in a 2×2 matrix, and each processing unit cluster comprises 4 basic computing units arranged in a 2×2 matrix.
2. The coarse-grained reconfigurable convolutional neural network accelerator according to claim 1, characterized in that each basic computing unit comprises a 3×3 convolution unit.
3. The coarse-grained reconfigurable convolutional neural network accelerator according to claim 2, characterized in that each basic computing unit comprises 9 multipliers arranged in a 3×3 grid and 1 adder, the input registers of the 3 multipliers in the same column forming a shift register.
4. The coarse-grained reconfigurable convolutional neural network accelerator according to claim 1, characterized in that adjacent basic computing units in each row of a processing unit cluster matrix are connected to the weight track through weight interconnect units, and adjacent basic computing units in each column are connected to the image track through image interconnect units;
the weight interconnect unit connects each basic computing unit to the weight track and, under SRAM control, selects weight data from the weight track for each basic computing unit;
the image interconnect unit connects the basic computing units to the image track and, under SRAM control, selects 3 consecutive data values from the union of the image track and the preceding basic computing unit's output.
5. The coarse-grained reconfigurable convolutional neural network accelerator according to claim 3, characterized in that the multipliers and adders in each processing unit cluster are switched off when not in use, and the sub-adder units and the mother adder unit are powered off when not in use.
6. A coarse-grained reconfigurable convolutional neural network acceleration system, characterized in that it comprises several parallel convolutional neural network accelerators according to any one of claims 1 to 5.
CN201710104029.8A 2017-02-24 2017-02-24 Coarse-grained reconfigurable convolutional neural network accelerator and system Active CN106951961B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710104029.8A CN106951961B (en) 2017-02-24 2017-02-24 Coarse-grained reconfigurable convolutional neural network accelerator and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710104029.8A CN106951961B (en) 2017-02-24 2017-02-24 Coarse-grained reconfigurable convolutional neural network accelerator and system

Publications (2)

Publication Number Publication Date
CN106951961A CN106951961A (en) 2017-07-14
CN106951961B (en) 2019-11-26

Family

ID=59466600

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710104029.8A Active CN106951961B (en) 2017-02-24 2017-02-24 Coarse-grained reconfigurable convolutional neural network accelerator and system

Country Status (1)

Country Link
CN (1) CN106951961B (en)

Families Citing this family (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180189641A1 (en) 2017-01-04 2018-07-05 Stmicroelectronics S.R.L. Hardware accelerator engine
CN108269224B (en) 2017-01-04 2022-04-01 意法半导体股份有限公司 Reconfigurable interconnect
CN109284827A (en) * 2017-07-19 2019-01-29 阿里巴巴集团控股有限公司 Neural computing method, equipment, processor and computer readable storage medium
CN109284822B (en) * 2017-07-20 2021-09-21 上海寒武纪信息科技有限公司 Neural network operation device and method
CN107491416B (en) * 2017-08-31 2020-10-23 中国人民解放军信息工程大学 Reconfigurable computing structure suitable for convolution requirement of any dimension and computing scheduling method and device
US11609623B2 (en) * 2017-09-01 2023-03-21 Qualcomm Incorporated Ultra-low power neuromorphic artificial intelligence computing accelerator
CN108958801B (en) * 2017-10-30 2021-06-25 上海寒武纪信息科技有限公司 Neural network processor and method for executing vector maximum value instruction by using same
CN108256628B (en) * 2018-01-15 2020-05-22 合肥工业大学 Convolutional neural network hardware accelerator based on multicast network-on-chip and working method thereof
US11468302B2 (en) 2018-03-13 2022-10-11 Recogni Inc. Efficient convolutional engine
CN108510066B (en) * 2018-04-08 2020-05-12 湃方科技(天津)有限责任公司 Processor applied to convolutional neural network
CN108805266B (en) * 2018-05-21 2021-10-26 南京大学 Reconfigurable CNN high-concurrency convolution accelerator
CN110826707B (en) * 2018-08-10 2023-10-31 北京百度网讯科技有限公司 Acceleration method and hardware accelerator applied to convolutional neural network
US11990137B2 (en) 2018-09-13 2024-05-21 Shanghai Cambricon Information Technology Co., Ltd. Image retouching method and terminal device
CN109919826B (en) * 2019-02-02 2023-02-17 西安邮电大学 Graph data compression method for graph computation accelerator and graph computation accelerator
CN109949202B (en) * 2019-02-02 2022-11-11 西安邮电大学 Parallel graph computation accelerator structure
CN110399883A (en) * 2019-06-28 2019-11-01 苏州浪潮智能科技有限公司 Image characteristic extracting method, device, equipment and computer readable storage medium
CN111126593B (en) * 2019-11-07 2023-05-05 复旦大学 Reconfigurable natural language deep convolutional neural network accelerator
US11593609B2 (en) 2020-02-18 2023-02-28 Stmicroelectronics S.R.L. Vector quantization decoding hardware unit for real-time dynamic decompression for parameters of neural networks
CN111340206A (en) * 2020-02-20 2020-06-26 云南大学 Alexnet forward network accelerator based on FPGA
CN111325327B (en) * 2020-03-06 2022-03-08 四川九洲电器集团有限责任公司 Universal convolution neural network operation architecture based on embedded platform and use method
WO2021189209A1 (en) * 2020-03-23 2021-09-30 深圳市大疆创新科技有限公司 Testing method and verification platform for accelerator
CN111652361B (en) * 2020-06-04 2023-09-26 南京博芯电子技术有限公司 Composite granularity near storage approximate acceleration structure system and method for long-short-term memory network
US11531873B2 (en) 2020-06-23 2022-12-20 Stmicroelectronics S.R.L. Convolution acceleration with embedded vector decompression
CN111610963B (en) * 2020-06-24 2021-08-17 上海西井信息科技有限公司 Chip structure and multiply-add calculation engine thereof
CN111860780A (en) * 2020-07-10 2020-10-30 逢亿科技(上海)有限公司 Hardware acceleration system and calculation method for irregular convolution kernel convolution neural network
CN112183732A (en) * 2020-10-22 2021-01-05 中国人民解放军国防科技大学 Convolutional neural network acceleration method and device and computer equipment
CN112905526B (en) * 2021-01-21 2022-07-08 北京理工大学 FPGA implementation method for multiple types of convolution
CN112686228B (en) * 2021-03-12 2021-06-01 深圳市安软科技股份有限公司 Pedestrian attribute identification method and device, electronic equipment and storage medium
CN115576895B (en) * 2022-11-18 2023-05-02 摩尔线程智能科技(北京)有限责任公司 Computing device, computing method, and computer-readable storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103984560A (en) * 2014-05-30 2014-08-13 东南大学 Embedded reconfigurable system based on large-scale coarseness and processing method thereof
WO2015168774A1 (en) * 2014-05-05 2015-11-12 Chematria Inc. Binding affinity prediction system and method
CN105453021A (en) * 2013-08-01 2016-03-30 经度企业快闪公司 Systems and methods for atomic storage operations
CN105488565A (en) * 2015-11-17 2016-04-13 中国科学院计算技术研究所 Calculation apparatus and method for accelerator chip accelerating deep neural network algorithm
CN106127302A (en) * 2016-06-23 2016-11-16 杭州华为数字技术有限公司 Process the circuit of data, image processing system, the method and apparatus of process data

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7219085B2 (en) * 2003-12-09 2007-05-15 Microsoft Corporation System and method for accelerating and optimizing the processing of machine learning techniques using a graphics processing unit

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105453021A (en) * 2013-08-01 2016-03-30 经度企业快闪公司 Systems and methods for atomic storage operations
WO2015168774A1 (en) * 2014-05-05 2015-11-12 Chematria Inc. Binding affinity prediction system and method
CN103984560A (en) * 2014-05-30 2014-08-13 东南大学 Embedded reconfigurable system based on large-scale coarseness and processing method thereof
CN105488565A (en) * 2015-11-17 2016-04-13 中国科学院计算技术研究所 Calculation apparatus and method for accelerator chip accelerating deep neural network algorithm
CN106127302A (en) * 2016-06-23 2016-11-16 杭州华为数字技术有限公司 Process the circuit of data, image processing system, the method and apparatus of process data

Also Published As

Publication number Publication date
CN106951961A (en) 2017-07-14

Similar Documents

Publication Publication Date Title
CN106951961B (en) Coarse-grained reconfigurable convolutional neural network accelerator and system
Kwon et al. Maeri: Enabling flexible dataflow mapping over dnn accelerators via reconfigurable interconnects
Qin et al. Sigma: A sparse and irregular gemm accelerator with flexible interconnects for dnn training
US11625245B2 (en) Compute-in-memory systems and methods
CN111178519B (en) Convolutional neural network acceleration engine, convolutional neural network acceleration system and method
JP6960700B2 (en) Multicast Network On-Chip Convolutional Neural Network Hardware Accelerator and Its Behavior
CN205139973U (en) BP neural network based on FPGA device founds
Kim et al. FPGA-based CNN inference accelerator synthesized from multi-threaded C software
Kim et al. A highly scalable restricted Boltzmann machine FPGA implementation
CN109740739A (en) Neural computing device, neural computing method and Related product
CN109740754A (en) Neural computing device, neural computing method and Related product
CN109711539A (en) Operation method, device and Related product
Wu et al. Compute-efficient neural-network acceleration
Catthoor et al. Very large-scale neuromorphic systems for biological signal processing
US11645225B2 (en) Partitionable networked computer
Delaye et al. Deep learning challenges and solutions with xilinx fpgas
Gerlinghoff et al. A resource-efficient spiking neural network accelerator supporting emerging neural encoding
Zhang et al. Evaluating low-memory GEMMs for convolutional neural network inference on FPGAS
Sait et al. Engineering a memetic algorithm from discrete cuckoo search and tabu search for cell assignment of hybrid nanoscale CMOL circuits
Morcel et al. Fpga-based accelerator for deep convolutional neural networks for the spark environment
CN109857024A (en) The unit performance test method and System on Chip/SoC of artificial intelligence module
Ascia et al. Networks-on-chip based deep neural networks accelerators for iot edge devices
Greensted et al. Extrinsic evolvable hardware on the RISA architecture
CN110287628B (en) Simulation method of nanometer quantum cellular automatic machine circuit
CN107644143B (en) A kind of high-performance city CA model construction method based on vectorization and parallel computation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant