CN106250103A - Data-reuse system for cyclic convolution computation in convolutional neural networks - Google Patents

Data-reuse system for cyclic convolution computation in convolutional neural networks

Info

Publication number
CN106250103A
CN106250103A (application CN201610633040.9A)
Authority
CN
China
Prior art keywords
data
array
convolution
module
reusing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610633040.9A
Other languages
Chinese (zh)
Inventor
刘波
朱智洋
陈壮
阮星
龚宇
曹鹏
杨军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN201610633040.9A
Publication of CN106250103A
Legal status: Pending


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 - Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/38 - Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F 9/3867 - Concurrent instruction execution, e.g. pipeline or look ahead using instruction pipelines
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 - Complex mathematical operations
    • G06F 17/15 - Correlation function computation including computation of convolution operations
    • G06F 17/153 - Multidimensional correlation or convolution
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 - Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/30098 - Register arrangements

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Mathematics (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a data-reuse system for cyclic convolution computation in convolutional neural networks, oriented toward coarse-grained reconfigurable systems. It comprises four parts: a master controller and interconnect control module, an input data-reuse module, a convolution loop computing array, and a data transmission path. A convolution loop computation is, in essence, the multiplication of multiple two-dimensional input data matrices with multiple two-dimensional template matrices; these matrices are typically large, and their multiplication occupies most of the total convolution time. The present invention uses a coarse-grained reconfigurable array to carry out the convolution. After receiving a convolution request instruction, it uses a circular register-rotation scheme to fully exploit the reusability of the input data in the convolution loop, raising the data utilization rate and reducing memory-bandwidth pressure. The designed array elements are configurable, so convolutions of different loop scales and strides can all be executed.

Description

Data-reuse system for cyclic convolution computation in convolutional neural networks
Technical field
The present invention relates to the field of embedded reconfigurable design, and in particular to a data-reuse system for cyclic convolution computation in convolutional neural networks, oriented toward coarse-grained reconfigurable systems. It can be used in high-performance reconfigurable systems to carry out convolution operations with large loop counts for convolutional neural networks, reusing existing data as far as possible to raise computation speed and reduce the bandwidth pressure of data reads.
Background art
A reconfigurable processor architecture is a good application-acceleration platform: because the hardware structure can be reorganized according to the dataflow graph of the program, reconfigurable arrays have been shown to deliver good performance improvements for scientific computing and multimedia applications.
Convolution has wide uses in image processing; it is required in image filtering, image enhancement, image analysis, and similar tasks. Image convolution is essentially a matrix operation, characterized by a large operand count and a high data-reuse rate, and computing image convolution purely in software can hardly meet real-time requirements.
As a kind of feedforward neural network, a convolutional neural network can learn automatically from large amounts of labeled data and extract complex features from them. Its advantage is that it can recognize visual patterns directly from pixel images with only minor preprocessing of the input image, and it also recognizes highly varied objects well; at the same time, its recognition ability is not easily affected by image distortion or simple geometric transformations. As an important direction of multilayer neural network research, convolutional neural networks have been a research focus for many years.
Place the convolution template at the upper-left corner of the image grid, so that it coincides with the upper-left sub-matrix of the image. Multiply the coinciding elements pairwise and sum all the products to obtain the first result point. Then shift the template one column to the right to obtain the second result point. Proceeding in this way, once the template has traversed the whole image grid, the convolution of one image frame is complete. The reusability of the data is very high, but the traditional approach of caching or reading directly from external memory is limited by the data-read bandwidth and has no configurable array for completing multi-layer convolution loops, so it is inefficient.
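The template traversal described above can be sketched in a few lines (a minimal software illustration with assumed names, not the patented hardware; `stride` generalizes the one-column shift):

```python
def conv2d(image, kernel, stride=1):
    """Slide the template over the image grid; multiply the coinciding
    elements pairwise and sum them to get each result point."""
    K = len(kernel)                      # square K*K template
    H, W = len(image), len(image[0])
    out = []
    for i in range(0, H - K + 1, stride):
        row = []
        for j in range(0, W - K + 1, stride):
            acc = 0
            for di in range(K):
                for dj in range(K):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)              # one result point
        out.append(row)
    return out

image = [[r * 4 + c for c in range(4)] for r in range(4)]
ones = [[1] * 3 for _ in range(3)]
print(conv2d(image, ones))  # [[45, 54], [81, 90]]
```

Note how each step to the right reuses K-1 of the K columns just read; this overlap is exactly the reuse the invention exploits.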
Summary of the invention
Objective of the invention: in view of the problems and shortcomings of the prior art, the present invention provides a data-reuse system for cyclic convolution computation in convolutional neural networks, oriented toward coarse-grained reconfigurable systems, which can meet the demands of accelerating large numbers of convolution computations, reduce the pressure on bandwidth, and make the convolution array configurable. The trade-off between the computing performance of a convolutional neural network and its consumption of hardware resources is what a coarse-grained reconfigurable implementation must balance. The design objective of a convolutional neural network based on a reconfigurable processing array is, on the premise of meeting the application's performance requirements, to make full use of the computing and storage resources the reconfigurable array provides, to use an input-image data-reuse structure that exploits the high reuse rate in cyclic convolution, and, with the added configurability of the coarse-grained reconfigurable array, to complete the convolution under limited data-read bandwidth and computing resources, reaching a good compromise.
Technical scheme: a data-reuse system for cyclic convolution computation in convolutional neural networks, oriented toward coarse-grained reconfigurable systems, comprising a master controller and interconnect control module, an input data-reuse module, a convolution loop computing array, and a data transmission path.
The master controller and interconnect control module receives external convolution requests, loads the configuration information of the computing array, returns computation results, monitors the loop execution state, and controls the data transfer between the external memory and the input data-reuse module.
The input data-reuse module is the data-reuse module connected between the external input-data memory and the cyclic convolution computing array; it performs input data reuse. The upper half of the module is a set of FIFOs, one per column of the image matrix (image-matrix-width many); the lower half is the same number of shift registers. The FIFOs continuously load input data from the external memory, each corresponding to one column of the convolution computation. When the shift registers advance by the convolution stride, the FIFOs replace one column of the shift registers, and a convolution operation is completed, achieving the effect of data reuse. The shift registers take the neighborhood data supplied and refreshed by the FIFOs of the upper half. Because the shift registers use a circular (ring) addressing mode, data arriving from the FIFOs always replace the oldest data in the ring shift register; the data are then transferred to the computing array to complete the convolution.
The module is realized in the following steps:
In one transfer, S 32-bit data words (1 ≤ S < maximum image-matrix width) are input into the FIFOs. When the convolution has consumed the data in one register, the FIFOs transfer their data to the shift registers. The shift registers need to refresh one column of K 32-bit words (1 ≤ K < maximum image-matrix width, where K is the kernel-matrix width of this convolution); together with the original K-1 columns of data, the shift registers transfer K*K words to the convolution computing matrix, then continue to shift backward by the stride, again refreshing one column each time, thereby realizing input data reuse.
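A software analogue of this ring-addressed column replacement may help: the deque below stands in for the ring of shift registers, so each step hands a K*K patch to the computing matrix while reusing K-1 columns and loading only one new column (illustrative names and sizes, not the hardware design):

```python
from collections import deque

K = 3                                               # kernel width
image = [[r * 5 + c for c in range(5)] for r in range(5)]  # toy 5x5 input

# Prime the "shift registers" with the first K columns (K rows tall).
window = deque(([image[r][j] for r in range(K)] for j in range(K)), maxlen=K)

patches = []
for j in range(K, len(image[0]) + 1):
    # K*K patch handed to the computing matrix: K-1 reused columns + 1 fresh one.
    patches.append([col[:] for col in window])
    if j < len(image[0]):
        window.append([image[r][j] for r in range(K)])  # ring addressing:
                                                        # the oldest column drops out

print(len(patches))  # 3 horizontal positions for a 5-wide image with K = 3
```

Only one column crosses the memory interface per step, which is the bandwidth saving the FIFO/shift-register arrangement is designed to deliver.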
The cyclic convolution computing array obtains the required input data from the input data-reuse module, completes the convolution computation, and sends the data out after the computation finishes.
The data transmission path is the data transmission channel between the master controller and interface control module, the cyclic convolution computing array, and the input data-reuse module.
Further, the master controller and interconnect control module comprises a main controller and a connection controller. The connection controller performs prefetch judgment and data-reuse configuration control. The prefetch judgment decides whether the data required by the upcoming convolution are ready in place: if so, the cyclic convolution computing array executes the convolution loop; if not, it waits for the data. Data in the buffer are read from the external memory; the present invention reads them by direct memory access. When external data input is needed, the master controller sends a read command to the external memory and afterwards no longer controls the memory read; the connection controller sends a halt signal to the master controller, which relinquishes the address bus, the data bus, and the relevant control bus. Thus whenever the data of the input data-reuse module need updating, the data in external memory are read directly through the connection controller.
The cyclic convolution computing array comprises an array configuration module, storage processing units, and computation processing units. Working with the data-reuse module, the array configuration module configures the computing array according to the convolution scale and stride so as to use all the computing resources the array can provide; after each computation the array is reconfigured and the computation processing units are adjusted to the new computation scale before the next convolution is carried out.
The configuration controller of the convolution processing array loads configuration information via the interface control module, and the computing array is then arranged according to the loop scale and stride of the cyclic convolution. The convolved image-matrix size can vary anywhere from 1 up to the maximum image-matrix width, and the computing array can be reconfigured for each convolution; even when the kernel is small, the convolution array can still use the whole convolution computing matrix, shortening the total convolution time.
The store instructions of the storage computing unit are tightly associated with the data-reuse module. Driven by the loop-control components, it takes an address from the address queue or computes one directly in the address-generation unit, sends a read request to the data-reuse module, and writes the returned data into the data queue; under the control of the loop-end components, it reads the data in the shift registers.
The computation processing units realize the computing and selection functions in the data flow. The loop subscripts continuously fetch data from the register file and pass them to the computation processing unit array, which operates according to fixed connection relations; the results of the operations are stored at the designated locations.
The cyclic convolution computing array runs as a continuous pipeline. The loop is mapped onto the array by the array configuration module, which configures the initial value, final value, and step value of the loop-control variables; execution of the loop program needs no external control, and the computing array units form pipeline links between them, completing the scheduling of the cyclic convolution on the pipeline.
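The configured loop parameters (initial value, final value, step value) can be pictured as a small descriptor that drives the pipeline without external control, roughly as follows (illustrative names only, not the patent's register layout):

```python
from dataclasses import dataclass

@dataclass
class LoopConfig:
    init: int    # initial value of the loop-control variable
    final: int   # final value (exclusive)
    step: int    # step value, i.e. the configured stride

    def positions(self):
        """Positions the pipelined loop visits once configured."""
        return list(range(self.init, self.final, self.step))

# A 16-wide image with a 3-wide kernel and stride 1 yields 14 positions per row.
cfg = LoopConfig(init=0, final=16 - 3 + 1, step=1)
print(len(cfg.positions()))  # 14
```

Once such a descriptor is written into the array, each unit can advance its own subscript locally, which is what lets the units chain into a pipeline.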
Brief description of the drawings
Fig. 1 is the architecture diagram of the coarse-grained reconfigurable array for convolution computation in the embodiment of the present invention;
Fig. 2 is the hardware diagram of the data round-robin scheduling of the input data-reuse module in the embodiment of the present invention;
Fig. 3 is the block diagram of the storage processing unit in the coarse-grained reconfigurable convolution computing array in the embodiment of the present invention;
Fig. 4 is the block diagram of the computation processing unit of the coarse-grained reconfigurable convolution computing array in the embodiment of the present invention;
Fig. 5 is the flow chart of cyclic convolution execution on the reconfigurable array in the embodiment of the present invention.
Detailed description of the embodiments
The present invention is further elucidated below in conjunction with specific embodiments. It should be understood that these embodiments merely illustrate the present invention and do not limit its scope; after reading the present disclosure, modifications by those skilled in the art to the various equivalent forms of the invention all fall within the scope defined by the claims of this application.
The data-reuse system for cyclic convolution computation in convolutional neural networks oriented toward coarse-grained reconfigurable systems comprises a master controller and interconnect control module, an input data-reuse module, a convolution loop computing array, and a data transmission path.
The master controller and interconnect control module receives external convolution requests, loads the configuration information of the computing array, returns computation results, monitors the loop execution state, and controls the data transfer between the external memory and the input data-reuse module.
The input data-reuse module is the data-reuse module connected between the external input-data memory and the cyclic convolution computing array; the upper half of the module is image-matrix-width many FIFOs, and the lower half the same number of shift registers.
The cyclic convolution computing array obtains the required input data from the input data-reuse module, completes the convolution computation, and sends the data out after the computation finishes.
The data transmission path is the data transmission channel between the master controller and interface control module, the cyclic convolution computing array, and the input data-reuse module.
The master controller and interconnect control module comprises a main controller and a connection controller. The connection controller performs prefetch judgment and data-reuse configuration control; the prefetch judgment decides whether the data required by the upcoming convolution are ready in place: if so, the cyclic convolution computing array executes the convolution loop; if not, it waits for the data. Data in the buffer are read from the external memory; the present invention reads them by direct memory access. When external data input is needed, the master controller sends a read command to the external memory and afterwards no longer controls the memory read; the connection controller sends a halt signal to the master controller, which relinquishes the address bus, the data bus, and the relevant control bus. Thus whenever the data of the input data-reuse module need updating, the data in external memory are read directly through the connection controller.
Fig. 1 shows the concrete computing-array diagram and the coarse-grained reconfigurable array diagram of the data stream. The configurable PE units occupy the main part, since the reconfigurable array is the concrete part that carries out the convolution computation; the remaining parts mainly transmit the start and end instructions. As can be seen in Fig. 1, in the configurable array the storage processing units connect directly to the input data-reuse module (Fig. 2). According to the stride and kernel-size values, the input data-reuse module transmits the data stream needed by the convolution to the computation processing units, and the router routes the configured data stream through the interconnection network to each computation processing unit. Meanwhile the connection controller, once one convolution computation is complete, passes the data and status out; the computation processing units are then reconfigured and a new computation starts.
The data round-robin scheduling hardware of the input data-reuse module is shown in Fig. 2. Taking a kernel of size K*K as an example (K is the kernel width), FIFOs are added between the external memory and the shift registers. In one transfer, S 32-bit data words are input into the FIFOs. When the convolution has consumed the data in one register, the FIFOs transfer their data to the shift registers; the shift registers refresh one column of K 32-bit words, which, together with the original K-1 columns, gives the K*K words transferred to the convolution computing matrix. In this way an input-image data-reuse structure is formed, providing support for high-efficiency convolution.
Fig. 3 corresponds to the block diagram of the storage processing unit. When the input channel receives an address signal, it corresponds to the position of a storage processing unit in the array; these storage processing units generate the addresses of the corresponding data, the generated addresses select the usable data in the input-image data-reuse module, and the data are then output to the computation processing units. Loop control generates the addresses corresponding to the operation data, and at the end of the convolution the computed information is transferred synchronously to the external memory. When the loop-judgment structure finds the data not yet arrived or insufficient, the current operation ends, the information is passed to the external memory, and the data are updated.
Fig. 4 corresponds to the structure diagram of the computation processing unit. On receiving input data, the computation processing unit uses its internal multipliers and adders to complete the convolution operation. After each operation completes, the computation processing units required for the next operation are reconfigured according to the configuration controller, realizing configurable control, so that when the outer loop size or the stride changes, the computation can still be completed well.
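The multiplier-and-adder datapath inside a computation processing unit amounts to a multiply-accumulate over the K*K patch; a minimal software sketch of that operation (illustrative only, not the hardware netlist):

```python
def mac_unit(patch, kernel):
    """Multiply coinciding elements and accumulate, as the PE's
    internal multipliers and adders do for one result point."""
    acc = 0
    for a, b in zip(patch, kernel):   # flattened K*K patch and kernel
        acc += a * b
    return acc

print(mac_unit([1, 2, 3, 4], [5, 6, 7, 8]))  # 5 + 12 + 21 + 32 = 70
```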
In conjunction with Fig. 1 and Fig. 2, the concrete steps of the convolution loop computation are shown in Fig. 5 and comprise the following steps:
1) When the coarse-grained reconfigurable array system is needed to complete a large number of convolutions, a request is first sent to the convolution control system; when the main processor receives the request, it sends an instruction to the connection processing unit;
2) The connection processing unit first judges whether the data required in the input data-reuse module are in place; if not, it sends a wait signal and at the same time transfers data to the buffer by direct memory access;
3) Once the data are ready, the waiting operation instruction is notified and loop control starts. The configuration control unit in the convolution loop computing array configures the array, the memory-access configuration module in the computing array locates the position of the data to be computed, and the computing array then performs the convolution on the data at that position, pipelining backward in sequence;
4) The Y FIFO buffers (Y is the maximum image-matrix width) continuously refresh the used data in the registers by direct memory reads, so that by the time a position is revisited the data update is complete and the computation proceeds uninterrupted, without every convolution having to access data in external memory;
5) The connection controller controls the completion of the loop; when the computation completes, the final data are output to the external memory, and the concrete convolution array finishes.
When actually performing a convolution with a large loop count under limited computing resources, applying the data-reuse method together with a configurable reconfigurable array and completing the convolution in a pipeline improves operating efficiency and speed. A comparative test was set up with contrast verification system A and contrast verification system B. Contrast verification system A is a traditional reconfigurable system supporting neither array configuration nor reuse; contrast verification system B is the reconfigurable system proposed by the present invention, with data prefetch and reuse. Choosing a 16x16 input data matrix, a 3x3 convolution matrix, and a stride of 1, ten input data sets and ten convolution weight matrices were convolved simultaneously. The test results show that contrast verification system B achieves an average performance improvement of 1.76x over contrast verification system A.
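The scale of this test can be cross-checked with standard sliding-window arithmetic (a quick calculation, not a figure quoted from the patent):

```python
H = W = 16   # input data matrix
K = 3        # convolution matrix
S = 1        # stride

out = (H - K) // S + 1        # output positions per dimension
macs_per_pair = out * out * K * K  # multiply-accumulates per image/kernel pair

print(out, macs_per_pair)  # 14 positions per side, 1764 MACs per pair
```

With ten inputs and ten weight matrices convolved simultaneously, the reuse structure spares most of the 8 overlapping reads per 3x3 window, which is consistent with the bandwidth-bound speedup reported.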

Claims (5)

1. A data-reuse system for cyclic convolution computation in convolutional neural networks, oriented toward coarse-grained reconfigurable systems, characterized in that it comprises a master controller and interconnect control module, an input data-reuse module, a convolution loop computing array, and a data transmission path;
the master controller and interconnect control module receives external convolution requests, loads the configuration information of the computing array, returns computation results, monitors the loop execution state, and controls the data transfer between the external memory and the input data-reuse module;
the input data-reuse module is the data-reuse module connected between the external input-data memory and the cyclic convolution computing array, the upper half of the module being image-matrix-width many FIFOs and the lower half the same number of shift registers;
the cyclic convolution computing array obtains the required input data from the input data-reuse module, completes the convolution computation, and sends the data out after the computation finishes.
2. The system as claimed in claim 1, characterized in that: the data transmission path is the data transmission channel between the master controller and interface control module, the cyclic convolution computing array, and the input data-reuse module.
3. The data-reuse system for cyclic convolution computation in convolutional neural networks oriented toward coarse-grained reconfigurable systems as claimed in claim 1, characterized in that: the master controller and interconnect control module comprises a main controller and a connection controller; the connection controller performs prefetch judgment and data-reuse configuration control, the prefetch judgment deciding whether the data required by the upcoming convolution are ready in place, the cyclic convolution computing array executing the convolution loop if they are and waiting for the data otherwise; data in the buffer are read from the external memory by direct memory access; when external data input is needed, the master controller sends a read command to the external memory and afterwards no longer controls the memory read; the connection controller sends a halt signal to the master controller, which relinquishes the address bus, the data bus, and the relevant control bus, so that whenever the data of the input data-reuse module need updating, the data in external memory are read directly through the connection controller.
4. The data-reuse system for cyclic convolution computation in convolutional neural networks oriented toward coarse-grained reconfigurable systems as claimed in claim 1, characterized in that: the cyclic convolution computing array comprises an array configuration module, storage processing units, and computation processing units; working with the input data-reuse module, the array configuration module configures the computing array according to the convolution scale and stride so as to use the computing resources the array can provide; after each computation the array is reconfigured, the computation processing units are adjusted to the computation scale, and the next convolution is carried out; the cyclic convolution computing array runs as a continuous pipeline, the loop being mapped onto the array by the configuration module, which configures the initial value, final value, and step value of the loop-control variables; execution of the loop program needs no external control, and the computing array units form pipeline links between them, completing the scheduling of the cyclic convolution on the pipeline.
5. The data-reuse system for cyclic convolution computation in convolutional neural networks oriented toward coarse-grained reconfigurable systems as claimed in claim 1, characterized in that the input data-reuse module is realized in the following steps:
in one transfer, S 32-bit data words are input into the FIFOs; when the convolution has consumed the data in one register, the FIFOs transfer their data to the shift registers; the shift registers refresh one column of K 32-bit words, which, together with the original K-1 columns, gives the K*K words the shift registers transfer to the convolution computing matrix; shifting then continues backward by the stride, again refreshing one column each time, thereby realizing input data reuse.
CN201610633040.9A 2016-08-04 2016-08-04 Data-reuse system for cyclic convolution computation in convolutional neural networks Pending CN106250103A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610633040.9A CN106250103A (en) 2016-08-04 2016-08-04 Data-reuse system for cyclic convolution computation in convolutional neural networks

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610633040.9A CN106250103A (en) 2016-08-04 2016-08-04 Data-reuse system for cyclic convolution computation in convolutional neural networks

Publications (1)

Publication Number Publication Date
CN106250103A 2016-12-21

Family

ID=58079364

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610633040.9A Pending CN106250103A (en) Data-reuse system for cyclic convolution computation in convolutional neural networks

Country Status (1)

Country Link
CN (1) CN106250103A (en)

Cited By (57)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106775599A (en) * 2017-01-09 2017-05-31 南京工业大学 Many computing unit coarseness reconfigurable systems and method of recurrent neural network
CN106844294A (en) * 2016-12-29 2017-06-13 华为机器有限公司 Convolution algorithm chip and communication equipment
CN107103754A (en) * 2017-05-10 2017-08-29 华南师范大学 A kind of road traffic condition Forecasting Methodology and system
CN107229598A (en) * 2017-04-21 2017-10-03 东南大学 A kind of low power consumption voltage towards convolutional neural networks is adjustable convolution computing module
CN107590085A (en) * 2017-08-18 2018-01-16 浙江大学 A kind of dynamic reconfigurable array data path and its control method with multi-level buffer
CN107635138A (en) * 2017-10-19 2018-01-26 珠海格力电器股份有限公司 Image processing apparatus
CN107832262A (en) * 2017-10-19 2018-03-23 珠海格力电器股份有限公司 Convolution algorithm method and device
CN107862650A (en) * 2017-11-29 2018-03-30 中科亿海微电子科技(苏州)有限公司 The method of speed-up computation two dimensional image CNN convolution
CN108009126A (en) * 2017-12-15 2018-05-08 北京中科寒武纪科技有限公司 A kind of computational methods and Related product
CN108182471A (en) * 2018-01-24 2018-06-19 上海岳芯电子科技有限公司 A kind of convolutional neural networks reasoning accelerator and method
CN108198125A (en) * 2017-12-29 2018-06-22 深圳云天励飞技术有限公司 A kind of image processing method and device
CN108241890A (en) * 2018-01-29 2018-07-03 清华大学 A kind of restructural neural network accelerated method and framework
WO2018137177A1 (en) * 2017-01-25 2018-08-02 北京大学 Method for convolution operation based on nor flash array
CN108564524A (en) * 2018-04-24 2018-09-21 开放智能机器(上海)有限公司 Convolution computation optimization method for visual images
CN108595379A (en) * 2018-05-08 2018-09-28 济南浪潮高新科技投资发展有限公司 Parallelized convolution operation method and system based on multi-level cache
CN108596331A (en) * 2018-04-16 2018-09-28 浙江大学 Optimization method for cellular neural network hardware architecture
CN108665063A (en) * 2018-05-18 2018-10-16 南京大学 Bidirectional parallel processing convolution acceleration system for BNN hardware accelerator
CN108681984A (en) * 2018-07-26 2018-10-19 珠海市微半导体有限公司 Acceleration circuit of 3*3 convolution algorithm
CN108701015A (en) * 2017-11-30 2018-10-23 深圳市大疆创新科技有限公司 Arithmetic device for neural networks, chip, apparatus and related method
CN108717571A (en) * 2018-06-01 2018-10-30 阿依瓦(北京)技术有限公司 Acceleration method and device for artificial intelligence
CN108764182A (en) * 2018-06-01 2018-11-06 阿依瓦(北京)技术有限公司 Optimized acceleration method and device for artificial intelligence
WO2018232615A1 (en) * 2017-06-21 2018-12-27 华为技术有限公司 Signal processing method and device
CN109272112A (en) * 2018-07-03 2019-01-25 北京中科睿芯科技有限公司 Data reuse instruction mapping method, system and device for neural network
CN109284475A (en) * 2018-09-20 2019-01-29 郑州云海信息技术有限公司 Matrix convolution computing module and matrix convolution calculation method
CN109375952A (en) * 2018-09-29 2019-02-22 北京字节跳动网络技术有限公司 Method and apparatus for storing data
CN109460813A (en) * 2018-09-10 2019-03-12 中国科学院深圳先进技术研究院 Acceleration method, device, equipment and storage medium for convolutional neural network computation
CN109711533A (en) * 2018-12-20 2019-05-03 西安电子科技大学 Convolutional neural network module based on FPGA
CN109754359A (en) * 2017-11-01 2019-05-14 腾讯科技(深圳)有限公司 Pooling processing method and system applied to convolutional neural networks
CN109816093A (en) * 2018-12-17 2019-05-28 北京理工大学 Single-path convolution implementation method
CN109992541A (en) * 2017-12-29 2019-07-09 深圳云天励飞技术有限公司 Data transfer method, related product and computer storage medium
CN110069444A (en) * 2019-06-03 2019-07-30 南京宁麒智能计算芯片研究院有限公司 Computing unit, array, module, hardware system and implementation method
CN110325963A (en) * 2017-02-28 2019-10-11 微软技术许可有限责任公司 Multi-function unit for programmable hardware nodes for neural network processing
CN110377874A (en) * 2019-07-23 2019-10-25 江苏鼎速网络科技有限公司 Convolution operation method and system
CN110383237A (en) * 2017-02-28 2019-10-25 德克萨斯仪器股份有限公司 Reconfigurable matrix multiplier system and method
CN110413561A (en) * 2018-04-28 2019-11-05 北京中科寒武纪科技有限公司 Data acceleration processing system
WO2019231254A1 (en) * 2018-05-30 2019-12-05 Samsung Electronics Co., Ltd. Processor, electronics apparatus and control method thereof
CN110705687A (en) * 2019-09-05 2020-01-17 北京三快在线科技有限公司 Convolutional neural network hardware computing device and method
WO2020051751A1 (en) * 2018-09-10 2020-03-19 中国科学院深圳先进技术研究院 Convolution neural network computing acceleration method and apparatus, device, and storage medium
CN111045958A (en) * 2018-10-11 2020-04-21 展讯通信(上海)有限公司 Acceleration engine and processor
WO2020077565A1 (en) * 2018-10-17 2020-04-23 北京比特大陆科技有限公司 Data processing method and apparatus, electronic device, and computer readable storage medium
CN111095242A (en) * 2017-07-24 2020-05-01 特斯拉公司 Vector calculation unit
CN111176727A (en) * 2017-07-20 2020-05-19 上海寒武纪信息科技有限公司 Computing device and computing method
CN111291880A (en) * 2017-10-30 2020-06-16 上海寒武纪信息科技有限公司 Computing device and computing method
CN111465924A (en) * 2017-12-12 2020-07-28 特斯拉公司 System and method for converting matrix input to vectorized input for a matrix processor
US10733742B2 (en) 2018-09-26 2020-08-04 International Business Machines Corporation Image labeling
CN111523642A (en) * 2020-04-10 2020-08-11 厦门星宸科技有限公司 Data reuse method, operation method, device and chip for convolution operation
CN109800867B (en) * 2018-12-17 2020-09-29 北京理工大学 Data calling method based on FPGA off-chip memory
CN111859797A (en) * 2020-07-14 2020-10-30 Oppo广东移动通信有限公司 Data processing method, device and storage medium
WO2021007037A1 (en) * 2019-07-09 2021-01-14 MemryX Inc. Matrix data reuse techniques in processing systems
US10928456B2 (en) 2017-08-17 2021-02-23 Samsung Electronics Co., Ltd. Method and apparatus for estimating state of battery
CN112992248A (en) * 2021-03-12 2021-06-18 西安交通大学深圳研究院 PE (provider edge) calculation unit structure of FIFO (first in first out) -based variable-length cyclic shift register
US11176427B2 (en) 2018-09-26 2021-11-16 International Business Machines Corporation Overlapping CNN cache reuse in high resolution and streaming-based deep learning inference engines
CN114780910A (en) * 2022-06-16 2022-07-22 千芯半导体科技(北京)有限公司 Hardware system and calculation method for sparse convolution calculation
WO2022179075A1 (en) * 2021-02-26 2022-09-01 成都商汤科技有限公司 Data processing method and apparatus, computer device and storage medium
US11694074B2 (en) 2018-09-07 2023-07-04 Samsung Electronics Co., Ltd. Integrated circuit that extracts data, neural network processor including the integrated circuit, and neural network device
CN116842307A (en) * 2023-08-28 2023-10-03 腾讯科技(深圳)有限公司 Data processing method, device, equipment, chip and storage medium
US11893393B2 (en) 2017-07-24 2024-02-06 Tesla, Inc. Computational array microprocessor system with hardware arbiter managing memory requests

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001090927A1 (en) * 2000-05-19 2001-11-29 Philipson Lars H G Method and device in a convolution process
CN102208005A (en) * 2011-05-30 2011-10-05 华中科技大学 2-dimensional (2-D) convolver
CN104077233A (en) * 2014-06-18 2014-10-01 百度在线网络技术(北京)有限公司 Single-channel convolution layer and multi-channel convolution layer handling method and device
CN105681628A (en) * 2016-01-05 2016-06-15 西安交通大学 Convolution network arithmetic unit, reconfigurable convolution neural network processor and image de-noising method of reconfigurable convolution neural network processor

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Dou Yong et al., "Coarse-grained reconfigurable array architecture supporting automatic loop pipelining", Science in China Series E: Information Sciences *
Lu Zhijian, "Research on parallel architecture of convolutional neural networks based on FPGA", China Doctoral Dissertations Full-text Database, Information Science and Technology *

Cited By (92)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106844294B (en) * 2016-12-29 2019-05-03 华为机器有限公司 Convolution operation chip and communication device
CN106844294A (en) * 2016-12-29 2017-06-13 华为机器有限公司 Convolution operation chip and communication device
CN106775599A (en) * 2017-01-09 2017-05-31 南京工业大学 Multi-computing-unit coarse-grained reconfigurable system and method for recurrent neural networks
WO2018137177A1 (en) * 2017-01-25 2018-08-02 北京大学 Method for convolution operation based on NOR flash array
US11309026B2 (en) 2017-01-25 2022-04-19 Peking University Convolution operation method based on NOR flash array
CN110325963A (en) * 2017-02-28 2019-10-11 微软技术许可有限责任公司 The multi-functional unit for programmable hardware node for Processing with Neural Network
CN110383237A (en) * 2017-02-28 2019-10-25 德克萨斯仪器股份有限公司 Reconfigurable matrix multiplier system and method
US11663450B2 (en) 2017-02-28 2023-05-30 Microsoft Technology Licensing, Llc Neural network processing with chained instructions
CN110383237B (en) * 2017-02-28 2023-05-26 德克萨斯仪器股份有限公司 Reconfigurable matrix multiplier system and method
CN110325963B (en) * 2017-02-28 2023-05-23 微软技术许可有限责任公司 Multifunctional unit for programmable hardware nodes for neural network processing
CN107229598A (en) * 2017-04-21 2017-10-03 东南大学 Low-power voltage-adjustable convolution computing module for convolutional neural networks
CN107103754A (en) * 2017-05-10 2017-08-29 华南师范大学 Road traffic condition prediction method and system
WO2018232615A1 (en) * 2017-06-21 2018-12-27 华为技术有限公司 Signal processing method and device
CN111176727A (en) * 2017-07-20 2020-05-19 上海寒武纪信息科技有限公司 Computing device and computing method
CN111176727B (en) * 2017-07-20 2022-05-31 上海寒武纪信息科技有限公司 Computing device and computing method
CN111221578A (en) * 2017-07-20 2020-06-02 上海寒武纪信息科技有限公司 Computing device and computing method
CN111221578B (en) * 2017-07-20 2022-07-15 上海寒武纪信息科技有限公司 Computing device and computing method
CN111095242A (en) * 2017-07-24 2020-05-01 特斯拉公司 Vector calculation unit
US11893393B2 (en) 2017-07-24 2024-02-06 Tesla, Inc. Computational array microprocessor system with hardware arbiter managing memory requests
CN111095242B (en) * 2017-07-24 2024-03-22 特斯拉公司 Vector calculation unit
US10928456B2 (en) 2017-08-17 2021-02-23 Samsung Electronics Co., Ltd. Method and apparatus for estimating state of battery
CN107590085B (en) * 2017-08-18 2018-05-29 浙江大学 Dynamically reconfigurable array data path with multi-level cache and control method therefor
CN107590085A (en) * 2017-08-18 2018-01-16 浙江大学 Dynamically reconfigurable array data path with multi-level cache and control method therefor
CN107832262A (en) * 2017-10-19 2018-03-23 珠海格力电器股份有限公司 Convolution operation method and device
CN107635138A (en) * 2017-10-19 2018-01-26 珠海格力电器股份有限公司 Image processing apparatus
CN111291880A (en) * 2017-10-30 2020-06-16 上海寒武纪信息科技有限公司 Computing device and computing method
CN111291880B (en) * 2017-10-30 2024-05-14 上海寒武纪信息科技有限公司 Computing device and computing method
US11537857B2 (en) 2017-11-01 2022-12-27 Tencent Technology (Shenzhen) Company Limited Pooling processing method and system applied to convolutional neural network
US11734554B2 (en) 2017-11-01 2023-08-22 Tencent Technology (Shenzhen) Company Limited Pooling processing method and system applied to convolutional neural network
CN109754359A (en) * 2017-11-01 2019-05-14 腾讯科技(深圳)有限公司 Pooling processing method and system applied to convolutional neural networks
CN107862650B (en) * 2017-11-29 2021-07-06 中科亿海微电子科技(苏州)有限公司 Method for accelerating calculation of CNN convolution of two-dimensional image
CN107862650A (en) * 2017-11-29 2018-03-30 中科亿海微电子科技(苏州)有限公司 Method for accelerating calculation of CNN convolution of two-dimensional image
CN108701015A (en) * 2017-11-30 2018-10-23 深圳市大疆创新科技有限公司 Arithmetic device for neural networks, chip, apparatus and related method
CN111465924A (en) * 2017-12-12 2020-07-28 特斯拉公司 System and method for converting matrix input to vectorized input for a matrix processor
CN111465924B (en) * 2017-12-12 2023-11-17 特斯拉公司 System and method for converting matrix input into vectorized input for matrix processor
CN108009126A (en) * 2017-12-15 2018-05-08 北京中科寒武纪科技有限公司 Computation method and related product
CN108198125B (en) * 2017-12-29 2021-10-08 深圳云天励飞技术有限公司 Image processing method and device
CN109992541A (en) * 2017-12-29 2019-07-09 深圳云天励飞技术有限公司 Data transfer method, related product and computer storage medium
CN108198125A (en) * 2017-12-29 2018-06-22 深圳云天励飞技术有限公司 Image processing method and device
CN108182471A (en) * 2018-01-24 2018-06-19 上海岳芯电子科技有限公司 Convolutional neural network inference accelerator and method
CN108182471B (en) * 2018-01-24 2022-02-15 上海岳芯电子科技有限公司 Convolutional neural network reasoning accelerator and method
CN108241890B (en) * 2018-01-29 2021-11-23 清华大学 Reconfigurable neural network acceleration method and architecture
CN108241890A (en) * 2018-01-29 2018-07-03 清华大学 Reconfigurable neural network acceleration method and architecture
CN108596331A (en) * 2018-04-16 2018-09-28 浙江大学 Optimization method for cellular neural network hardware architecture
CN108564524A (en) * 2018-04-24 2018-09-21 开放智能机器(上海)有限公司 Convolution computation optimization method for visual images
CN110413561A (en) * 2018-04-28 2019-11-05 北京中科寒武纪科技有限公司 Data acceleration processing system
CN110413561B (en) * 2018-04-28 2021-03-30 中科寒武纪科技股份有限公司 Data acceleration processing system
CN108595379A (en) * 2018-05-08 2018-09-28 济南浪潮高新科技投资发展有限公司 Parallelized convolution operation method and system based on multi-level cache
CN108665063B (en) * 2018-05-18 2022-03-18 南京大学 Bidirectional parallel processing convolution acceleration system for BNN hardware accelerator
CN108665063A (en) * 2018-05-18 2018-10-16 南京大学 Bidirectional parallel processing convolution acceleration system for BNN hardware accelerator
WO2019231254A1 (en) * 2018-05-30 2019-12-05 Samsung Electronics Co., Ltd. Processor, electronics apparatus and control method thereof
US11244027B2 (en) 2018-05-30 2022-02-08 Samsung Electronics Co., Ltd. Processor, electronics apparatus and control method thereof
CN108717571B (en) * 2018-06-01 2020-09-15 阿依瓦(北京)技术有限公司 Acceleration method and device for artificial intelligence
CN108764182B (en) * 2018-06-01 2020-12-08 阿依瓦(北京)技术有限公司 Optimized acceleration method and device for artificial intelligence
CN108717571A (en) * 2018-06-01 2018-10-30 阿依瓦(北京)技术有限公司 Acceleration method and device for artificial intelligence
CN108764182A (en) * 2018-06-01 2018-11-06 阿依瓦(北京)技术有限公司 Optimized acceleration method and device for artificial intelligence
CN109272112B (en) * 2018-07-03 2021-08-27 北京中科睿芯科技集团有限公司 Data reuse instruction mapping method, system and device for neural network
CN109272112A (en) * 2018-07-03 2019-01-25 北京中科睿芯科技有限公司 Data reuse instruction mapping method, system and device for neural network
CN108681984A (en) * 2018-07-26 2018-10-19 珠海市微半导体有限公司 Acceleration circuit of 3*3 convolution algorithm
CN108681984B (en) * 2018-07-26 2023-08-15 珠海一微半导体股份有限公司 Acceleration circuit of 3*3 convolution algorithm
US11694074B2 (en) 2018-09-07 2023-07-04 Samsung Electronics Co., Ltd. Integrated circuit that extracts data, neural network processor including the integrated circuit, and neural network device
CN109460813A (en) * 2018-09-10 2019-03-12 中国科学院深圳先进技术研究院 Acceleration method, device, equipment and storage medium for convolutional neural network computation
WO2020051751A1 (en) * 2018-09-10 2020-03-19 中国科学院深圳先进技术研究院 Convolution neural network computing acceleration method and apparatus, device, and storage medium
CN109284475A (en) * 2018-09-20 2019-01-29 郑州云海信息技术有限公司 Matrix convolution computing module and matrix convolution calculation method
CN109284475B (en) * 2018-09-20 2021-10-29 郑州云海信息技术有限公司 Matrix convolution calculating device and matrix convolution calculating method
US10733742B2 (en) 2018-09-26 2020-08-04 International Business Machines Corporation Image labeling
US11176427B2 (en) 2018-09-26 2021-11-16 International Business Machines Corporation Overlapping CNN cache reuse in high resolution and streaming-based deep learning inference engines
CN109375952B (en) * 2018-09-29 2021-01-26 北京字节跳动网络技术有限公司 Method and apparatus for storing data
CN109375952A (en) * 2018-09-29 2019-02-22 北京字节跳动网络技术有限公司 Method and apparatus for storing data
CN111045958A (en) * 2018-10-11 2020-04-21 展讯通信(上海)有限公司 Acceleration engine and processor
CN111045958B (en) * 2018-10-11 2022-09-16 展讯通信(上海)有限公司 Acceleration engine and processor
WO2020077565A1 (en) * 2018-10-17 2020-04-23 北京比特大陆科技有限公司 Data processing method and apparatus, electronic device, and computer readable storage medium
CN109800867B (en) * 2018-12-17 2020-09-29 北京理工大学 Data calling method based on FPGA off-chip memory
CN109816093B (en) * 2018-12-17 2020-12-04 北京理工大学 Single-path convolution implementation method
CN109816093A (en) * 2018-12-17 2019-05-28 北京理工大学 Single-path convolution implementation method
CN109711533A (en) * 2018-12-20 2019-05-03 西安电子科技大学 Convolutional neural network module based on FPGA
CN109711533B (en) * 2018-12-20 2023-04-28 西安电子科技大学 Convolutional neural network acceleration system based on FPGA
CN110069444A (en) * 2019-06-03 2019-07-30 南京宁麒智能计算芯片研究院有限公司 Computing unit, array, module, hardware system and implementation method
WO2021007037A1 (en) * 2019-07-09 2021-01-14 MemryX Inc. Matrix data reuse techniques in processing systems
US11537535B2 (en) 2019-07-09 2022-12-27 Memryx Incorporated Non-volatile memory based processors and dataflow techniques
CN110377874B (en) * 2019-07-23 2023-05-02 江苏鼎速网络科技有限公司 Convolution operation method and system
CN110377874A (en) * 2019-07-23 2019-10-25 江苏鼎速网络科技有限公司 Convolution operation method and system
CN110705687A (en) * 2019-09-05 2020-01-17 北京三快在线科技有限公司 Convolutional neural network hardware computing device and method
CN111523642B (en) * 2020-04-10 2023-03-28 星宸科技股份有限公司 Data reuse method, operation method, device and chip for convolution operation
CN111523642A (en) * 2020-04-10 2020-08-11 厦门星宸科技有限公司 Data reuse method, operation method, device and chip for convolution operation
CN111859797A (en) * 2020-07-14 2020-10-30 Oppo广东移动通信有限公司 Data processing method, device and storage medium
WO2022179075A1 (en) * 2021-02-26 2022-09-01 成都商汤科技有限公司 Data processing method and apparatus, computer device and storage medium
CN112992248A (en) * 2021-03-12 2021-06-18 西安交通大学深圳研究院 PE (provider edge) calculation unit structure of FIFO (first in first out) -based variable-length cyclic shift register
CN114780910A (en) * 2022-06-16 2022-07-22 千芯半导体科技(北京)有限公司 Hardware system and calculation method for sparse convolution calculation
CN114780910B (en) * 2022-06-16 2022-09-06 千芯半导体科技(北京)有限公司 Hardware system and calculation method for sparse convolution calculation
CN116842307A (en) * 2023-08-28 2023-10-03 腾讯科技(深圳)有限公司 Data processing method, device, equipment, chip and storage medium
CN116842307B (en) * 2023-08-28 2023-11-28 腾讯科技(深圳)有限公司 Data processing method, device, equipment, chip and storage medium

Similar Documents

Publication Publication Date Title
CN106250103A (en) System for data reuse in convolutional neural network cyclic convolution computation
JP7430203B2 (en) System and method for matrix multiplication instructions using floating point operations with specified bias
CN111291880B (en) Computing device and computing method
CN108268943B (en) Hardware accelerator engine
CN109376861B (en) Apparatus and method for performing full connectivity layer neural network training
CN104899182B (en) Matrix multiplication acceleration method supporting variable-size partitioned blocks
JP6960700B2 (en) Multicast Network On-Chip Convolutional Neural Network Hardware Accelerator and Its Behavior
CN108108809B (en) Hardware architecture for reasoning and accelerating convolutional neural network and working method thereof
CN108416436B (en) Method and system for neural network partitioning using multi-core processing module
CN104054108B (en) Dynamically configurable pipeline preprocessor
CN103221918B (en) IC cluster processing equipment with separate data/address bus and messaging bus
CA3051990A1 (en) Accelerated deep learning
US11544525B2 (en) Systems and methods for artificial intelligence with a flexible hardware processing framework
CN109740748B (en) Convolutional neural network accelerator based on FPGA
CN105468568B (en) Efficient coarse-grained reconfigurable computing system
CN109711533A (en) Convolutional neural network module based on FPGA
CN106294278B (en) Adaptive hardware pre-configuration controller for dynamically reconfigurable array computing systems
CN105912501A (en) SM4-128 encryption algorithm implementation method and system based on large-scale coarse-grained reconfigurable processor
WO2018057294A1 (en) Combined world-space pipeline shader stages
CN115136123A (en) Tile subsystem and method for automated data flow and data processing within an integrated circuit architecture
CN110991619A (en) Neural network processor, chip and electronic equipment
CN109657794A (en) Instruction-queue-based distributed deep neural network performance modeling method
CN102446342B (en) Reconfigurable binary arithmetical unit, reconfigurable binary image processing system and basic morphological algorithm implementation method thereof
CN115860066A (en) Batch-processing-based neural network inference pipeline multiplexing method
CN110503179A (en) Calculation method and related product

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 2016-12-21)