CN109409512A - Flexibly configurable neural network computing unit, computing array and construction method thereof - Google Patents

Flexibly configurable neural network computing unit, computing array and construction method thereof

Info

Publication number
CN109409512A
Authority
CN
China
Prior art keywords
data
buffer
state
computing unit
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811133940.2A
Other languages
Chinese (zh)
Other versions
CN109409512B (en)
Inventor
任鹏举
樊珑
赵博然
宗鹏陈
陈飞
郑南宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University
Priority to CN201811133940.2A
Publication of CN109409512A
Application granted
Publication of CN109409512B
Status: Active
Anticipated expiration

Links

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

The present invention discloses a flexibly configurable neural network computing unit, a computing array, and a construction method thereof. The neural network computing unit comprises: a configurable storage module, a configurable control module, and a time-multiplexable multiply-accumulate module. The configurable storage module comprises: a feature-map data buffer, a stride data buffer, and a weight data buffer. The configurable control module comprises: a counter module and a state-machine module. The multiply-accumulate module comprises: a multiplier and an accumulator. The present invention supports convolution computations of arbitrary type as well as the parallel computation of convolution kernels of several sizes, fully exploits the flexibility and data reusability of the convolutional neural network computing unit, greatly reduces the system power consumption caused by data movement, and improves the computational efficiency of the system.

Description

Flexibly configurable neural network computing unit, computing array and construction method thereof
Technical field
The invention belongs to the field of neural network hardware architecture, and in particular relates to a flexibly configurable neural network computing unit, a computing array, and a construction method thereof.
Background art
A flexible hardware computing architecture has an important influence on the hardware implementation of convolutional neural networks. The convolutional layer, as the most important structure in a convolutional neural network, is characterized by a large amount of computation and strong data reusability. Through weight sharing, the convolutional layer reduces the complexity of the network model, considerably reduces the number of parameters, and avoids the complex feature extraction and data reconstruction processes of traditional recognition algorithms.
In a convolutional neural network, the main function of the convolutional layer is to convolve the same group of input feature maps with one group of convolution kernels per output channel, yielding as many output feature maps as there are output channels and thereby completing the feature extraction of the feature maps. As convolutional neural networks continue to develop and the demands placed on them gradually grow, network models multiply, networks become deeper, and the convolution modes of the convolutional layers become complex and varied.
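By way of illustration only, the convolution described above can be sketched in a few lines of NumPy (a minimal model for exposition, not part of the claimed hardware; the array shapes and names are chosen here for clarity):

```python
import numpy as np

def conv_layer(fmap, kernels, stride):
    """Convolve one stack of input feature maps with one kernel group
    per output channel, as described above.
    fmap:    (A, H, W)     A input channels
    kernels: (B, A, K, K)  B output channels, each with A KxK kernels
    returns: (B, H_out, W_out), one output feature map per output channel
    """
    A, H, W = fmap.shape
    B, _, K, _ = kernels.shape
    H_out = (H - K) // stride + 1
    W_out = (W - K) // stride + 1
    out = np.zeros((B, H_out, W_out))
    for b in range(B):                       # one map per output channel
        for i in range(H_out):
            for j in range(W_out):
                win = fmap[:, i*stride:i*stride+K, j*stride:j*stride+K]
                out[b, i, j] = np.sum(win * kernels[b])
    return out
```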
Therefore, a neural network computing unit structure that is highly flexible, offers high computational performance, and can recycle data is of great significance for the hardware implementation of the convolutional layer. Most current hardware implementations of convolutional-layer computing units can only complete one type of convolution mode, cannot support network models containing convolutional layers of different types, and cannot make full use of the data reusability of the convolutional layer.
Summary of the invention
The purpose of the present invention is to provide a flexibly configurable neural network computing unit, a computing array, and a construction method thereof. The method effectively enhances the flexibility of the convolutional layer in hardware implementation, improves the computational efficiency of the system, and gives full play to the data reusability of the convolutional layer, thereby reducing system power consumption and the use of storage resources to a certain extent.
In order to achieve the above purpose, the present scheme adopts the following technical solution:
A flexibly configurable neural network computing unit, comprising: a configurable storage module, a configurable control module, and a time-multiplexable multiply-accumulate module;
The configurable storage module comprises: a feature-map data buffer, a stride data buffer, and a weight data buffer;
The configurable control module comprises: a counter module and a state-machine module;
The multiply-accumulate module comprises: a multiplier and an accumulator.
Further, the feature-map data buffer stores the partial feature-map data used during a convolution computation and recycles feature-map data that is shared between computations. The maximum length of the buffer is L1 = max{K1A1, K2A2, …, KiAi}, where K is the convolution kernel size of a convolutional layer, A is the number of input channels to be mapped within the computing unit, and i is the index of the convolutional layer in the target network;
The stride data buffer supplies the data that must be refreshed in the feature-map buffer when the convolution kernel slides by one stride. The maximum length of the buffer is L2 = max{S1A1, S2A2, …, SiAi}, where S is the stride of the convolution kernel in a convolutional layer;
The weight data buffer stores the weight data and allows it to be reused. Its length is L3 = max{K1A1B1, K2A2B2, …, KiAiBi}, where B is the number of output channels to be mapped within the computing unit.
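The three maxima above size each buffer once for the worst case over all layers of the target network. A small sketch of the sizing rule follows (the per-layer parameter records are a hypothetical representation; K, S, A, B are as defined above):

```python
def buffer_lengths(layers):
    """Worst-case buffer lengths over all convolutional layers.
    layers: list of dicts with kernel size K, stride S, and the numbers
    of mapped input/output channels A and B for each layer."""
    L1 = max(l["K"] * l["A"] for l in layers)           # feature-map buffer
    L2 = max(l["S"] * l["A"] for l in layers)           # stride-data buffer
    L3 = max(l["K"] * l["A"] * l["B"] for l in layers)  # weight buffer
    return L1, L2, L3

# e.g. a 3x3/stride-1 layer and a 5x5/stride-2 layer, 4 in / 8 out channels
print(buffer_lengths([dict(K=3, S=1, A=4, B=8), dict(K=5, S=2, A=4, B=8)]))
# (20, 8, 160)
```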
Further, the counter module comprises an input-data counter, an input-weight counter, an output-data counter, an output-channel counter, and an output-feature-map-size counter;
The state-machine module contains, for each convolution kernel size, a corresponding feature-map-buffer state machine and weight-buffer state machine; each state machine performs its state transitions according to the values of the counters in the counter module.
Further, the neural network computing unit is equipped with a feature-map data input port and a weight data input port;
The feature-map data input port is connected to the input of the first selector; the two outputs of the first selector are connected to the input of the stride data buffer and to the first input of the second selector respectively; the output of the stride data buffer is connected to the second input of the second selector; and the output of the second selector is connected to the input of the feature-map data buffer;
The weight data input port is connected to the input of the weight data buffer;
The outputs of the feature-map data buffer and the weight data buffer are connected to the two inputs of the multiplier; the output of the multiplier is connected, through a register, the accumulator, and the fourth selector, to the output of the neural network computing unit.
Further, the state-machine module contains, for each convolution kernel size, a corresponding feature-map-buffer state machine and weight-buffer state machine; each state machine performs its state transitions according to the values of the counters in the counter module;
The states of the feature-map data buffer comprise: an initialization state, a data-ready state, a wait state, a full-cycle state, a data-update state, a half-cycle state, and a no-cycle state;
The states of the weight data buffer comprise: an initialization state, a data-ready state, a wait state, a full-cycle state, and a no-cycle state.
Further, in the initialization state, no data has yet entered the computing unit;
In the data-ready state, data is entering the computing unit but the amount of input data is not yet sufficient to start computing;
In the wait state, convolution kernels of different sizes are computed in parallel and the output results must stay synchronized; a computing unit responsible for a smaller kernel has less work to do and therefore waits for the computing units handling larger kernels;
In the full-cycle state, the data currently output by the buffer will be reused; while entering the multiply-accumulate module it is also cycled back to the tail of the buffer's allocated space;
The data-update state exists only in the feature-map data buffer; when the data currently output no longer needs to be reused, it enters the multiply-accumulate module while new data taken from the stride data buffer is written to the tail of the feature-map data buffer;
The half-cycle state exists only in the feature-map data buffer and follows the data-update state; the data currently output enters the multiply-accumulate module and is simultaneously returned to the position in the buffer just before the newly updated data;
In the no-cycle state, the data currently output no longer needs to be recycled; it enters the multiply-accumulate module only and is not returned to the buffer.
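The four cycling modes above can be modeled in software as operations on a queue (a behavioral sketch of our reading of the description, not RTL; in particular, the reinsertion position used for the half-cycle state is an assumption):

```python
from collections import deque
from enum import Enum, auto

class Mode(Enum):
    FULL_CYCLE = auto()   # datum reused: recycle to the buffer tail
    UPDATE     = auto()   # datum dead: refill the tail from the stride buffer
    HALF_CYCLE = auto()   # datum reused: reinsert just before the new data
    NO_CYCLE   = auto()   # datum consumed: send to the MAC only

def emit(buf: deque, mode: Mode, stride_buf: deque):
    """Pop one datum for the multiply-accumulate module and recycle or
    refresh the buffer according to the current state."""
    d = buf.popleft()
    if mode is Mode.FULL_CYCLE:
        buf.append(d)
    elif mode is Mode.UPDATE:
        buf.append(stride_buf.popleft())
    elif mode is Mode.HALF_CYCLE:
        buf.insert(len(buf) - 1, d)   # assumed position: before the update
    return d                          # NO_CYCLE: dropped from the buffer
```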
A computing array is generated by instantiating a plurality of the configurable computing units. The computing array is divided into regions, and different regions can be supplied with different convolution layer parameters, completing the parallel computation of convolution modes of different kinds.
A computing array is generated by connecting the flexibly configurable neural network computing units in a row-stationary dataflow. The scale of the computing array is determined by the hardware resources, the target network model, and the computational performance required of the system. The width of the computing array is K, where K must be greater than or equal to the maximum convolution kernel size Kmax in the network model, and greater than or equal to the sum of the kernel sizes that must be computed in parallel when one convolutional layer contains kernels of different sizes. The base length of the computing array is H, where H is the minimum output feature-map size over all convolutional layers in the network model; the actual array length is extended in multiples of 2^n according to the available hardware resources and the computational performance required of the system. When kernels of different sizes in a convolutional layer must be computed in parallel, with kernel sizes K1, K2, …, Ki satisfying K1 + K2 + … + Ki ≤ K, the computing array is divided laterally into i regions of sizes K1*H, K2*H, …, Ki*H; different regions receive different convolution-type parameters, and the computing units of each region configure their own storage and computing modules, completing the parallel computation of kernels of several sizes.
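As an editorial illustration of these sizing rules (a sketch under the stated constraints; the hardware budget is expressed here simply as a maximum array length):

```python
def plan_array(kernel_sizes, k_max_in_model, h_min, max_length):
    """Width K, length H, and lateral regions for a computing array."""
    K = max(k_max_in_model, sum(kernel_sizes))  # width constraint
    H = h_min                                   # base length
    while 2 * H <= max_length:                  # extend in powers of two
        H *= 2
    regions = [(k, k * H) for k in kernel_sizes]  # (width, unit count)
    return K, H, regions

# e.g. 3x3 and 5x5 kernels in parallel, Kmax = 7, minimum output map 14
print(plan_array([3, 5], k_max_in_model=7, h_min=14, max_length=64))
# (8, 56, [(3, 168), (5, 280)])
```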
A construction method of the flexibly configurable neural network computing unit, comprising the following steps:
Step one: extract the network parameters from the target network model;
Step two: design the configurable storage module of the neural network computing unit according to step one; it stores the partial feature-map data and weight data used for computation and comprises: a feature-map data buffer, a stride data buffer, and a weight data buffer;
Step three: design the configurable control module of the neural network computing unit according to step one; for different convolution modes the configurable control module configures different buffer sizes in the storage module, generates the various working modes each buffer uses during convolution, and controls each buffer to work in the corresponding mode; the configurable control module comprises: a counter module and a state-machine module;
Step four: design the multiply-accumulate module of the neural network computing unit according to step one; it multiplies feature-map data by weights and accumulates the products into partial sums of the convolution result, and comprises a time-multiplexable multiplier and adder;
Step five: combining steps two, three, and four, supply the configurable control module of the neural network computing unit with five convolution layer parameters through the external input ports: convolution kernel size k, kernel stride s, output feature-map size h, number of mapped input channels a, and number of mapped output channels b. The configurable control module configures, for the configurable storage module, the buffer space required by the convolution of this layer, and controls the configurable storage module to output the corresponding data to the multiply-accumulate module. By supplying the corresponding convolution parameters to the computing unit, different convolutional layers each complete their share of the convolution on the same neural network computing unit.
Further, step one specifically comprises: according to the target network model, extract the required parameters, including the convolution kernel size Ki and sliding stride Si of each convolutional layer, the output feature-map size Hi of each convolutional layer, and the numbers of input and output channels Ai and Bi to be mapped within the computing unit for each convolutional layer, where i is the index of the convolutional layer;
In step two: the feature-map data buffer stores the partial pixel data used during a convolution computation and recycles pixel data that is shared; its length is max{K1A1, K2A2, …, KiAi}. The stride data buffer supplies the data that must be refreshed in the feature-map buffer when the convolution kernel slides by one stride; its length is max{S1A1, S2A2, …, SiAi}. The weight data buffer stores the weight data and allows it to be reused; its length is max{K1A1B1, K2A2B2, …, KiAiBi};
In step three: the counter module comprises an input-data counter, an input-weight counter, an output-data counter, an output-channel counter, and an output-feature-map-size counter. The state-machine module contains, for each convolution kernel size, a corresponding feature-map-buffer state machine and weight-buffer state machine; each state machine performs its state transitions according to the values of the counters in the counter module. The states comprise: an initialization state, a data-ready state, a wait state, a full-cycle state, a data-update state, a half-cycle state, and a no-cycle state.
Further, the different states of the state machine determine the different working modes of a buffer, specifically:
In the initialization state, no data has yet entered the computing unit;
In the data-ready state, data is entering the computing unit but the amount of input data is not yet sufficient to start computing;
In the wait state, convolution kernels of different sizes are computed in parallel and the output results must stay synchronized; a computing unit responsible for a smaller kernel has less work to do and therefore waits for the computing units handling larger kernels;
In the full-cycle state, the data currently output by the buffer will be reused; while entering the multiply-accumulate module it is also cycled back to the tail of the buffer's allocated space;
The data-update state exists only in the feature-map data buffer; when the data currently output no longer needs to be reused, it enters the multiply-accumulate module while new data taken from the stride data buffer is written to the tail of the feature-map data buffer;
The half-cycle state exists only in the feature-map data buffer and follows the data-update state; the data currently output enters the multiply-accumulate module and is simultaneously returned to the position in the buffer just before the newly updated data;
In the no-cycle state, the data currently output no longer needs to be recycled; it enters the multiply-accumulate module only and is not returned to the buffer.
Further, step four specifically comprises: the multiply-accumulate module comprises a multiplier and an accumulator. By time multiplexing, the working frequency of the multiplier and accumulator is raised N-fold, so that N neural network computing units can share one multiplier and accumulator; the accumulation depth of the accumulator equals the convolution kernel size of the current convolutional layer.
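A behavioral sketch of this time multiplexing (our illustration, not RTL): one multiplier-accumulator running N times faster serves N computing units in rotation, and each accumulation runs over K products, K being the kernel size of the current layer.

```python
def shared_mac(units, K):
    """units: N (feature, weight) operand streams, each K values long.
    The shared MAC performs N fast cycles per slow unit cycle."""
    acc = [0.0] * len(units)
    for k in range(K):                      # slow cycles
        for n, (f, w) in enumerate(units):  # N fast cycles per slow cycle
            acc[n] += f[k] * w[k]
    return acc

print(shared_mac([([1, 2, 3], [4, 5, 6]), ([1, 1, 1], [2, 2, 2])], K=3))
# [32.0, 6.0]
```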
Further, a computing array is generated by instantiating a plurality of the configurable computing units; the array is divided into regions, and different regions can be supplied with different convolution layer parameters, completing the parallel computation of convolution modes of different kinds.
Further, in step five, the external input ports supply five input signals for the current convolutional layer: convolution kernel size k, kernel stride s, output feature-map size h, number of input channels a mapped inside the computing unit, and number of output channels b. The control module configures the storage space of each buffer: the length of the feature-map data buffer is set to k*a, the length of the stride data buffer to s*a, and the length of the weight data buffer to k*a*b. The control module also configures the upper limit of each counter: the upper limits of the input-data counter and the output-data counter are set to k*a, the upper limit of the input-weight counter to k*a*b, the upper limit of the output-channel counter to b, and the upper limit of the output-feature-map-size counter to h. When an input counter reaches its upper limit, the computing unit starts computing, and each output counter counts accordingly and controls the state transitions of the state machine. When every output counter reaches its upper limit, one convolution computation over all or part of the output channels of the convolutional layer has been completed.
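The configuration rules of this step reduce to a handful of products, collected below (a software model of the control module's configuration step; the dictionary keys are illustrative names):

```python
def configure_unit(k, s, h, a, b):
    """Buffer lengths and counter limits derived from the five layer
    parameters: kernel size k, stride s, output feature-map size h,
    mapped input channels a, mapped output channels b."""
    return {
        "feature_buffer_len":   k * a,
        "stride_buffer_len":    s * a,
        "weight_buffer_len":    k * a * b,
        "input_data_limit":     k * a,
        "output_data_limit":    k * a,      # same limit as input data
        "input_weight_limit":   k * a * b,
        "output_channel_limit": b,
        "output_map_limit":     h,
    }

print(configure_unit(k=3, s=1, h=14, a=4, b=8))
```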
Further, according to the hardware resources and the computational performance required of the system, a plurality of the configurable computing units can be instantiated and interconnected to generate a convolution computing array, on which the convolutions of convolutional layers of different types can be completed. For network models in which one convolutional layer contains kernels of two or more sizes, the array can be divided into regions, and different regions can be supplied with different convolution parameters. To guarantee that all regions of the array output their results synchronously, the time difference between the output results of computing units in different regions is derived from the difference in kernel size; the computing units in the region with the lighter workload wait until this time difference reaches zero before resuming computation, guaranteeing the synchronism of the array output and completing the parallel computation of convolution modes of different kinds.
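Under the counter model above, each computing unit spends k*a multiply-accumulate cycles per output datum, so the gap that a small-kernel region must wait out can be estimated as follows (a simplified reading offered for illustration; the text itself only states that the gap follows from the kernel-size difference):

```python
def wait_cycles(k_small, k_large, a):
    """Idle cycles per output for the lighter region, assuming k*a
    multiply-accumulate cycles per output (the input-counter limit)."""
    return (k_large - k_small) * a

print(wait_cycles(3, 5, a=4))  # 8 idle cycles per output in this model
```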
Compared with the prior art, the present invention has the following advantages. The present invention discloses a flexibly configurable neural network computing unit, a computing array, and a construction method thereof. First, the parameters needed for the design are extracted from the target network model, and the internal structure of the neural network computing unit is designed from these parameters. For each convolution mode requested through the external inputs, the control module of the computing unit configures the storage and computing modules so that the corresponding mode completes some or all of the convolution computation. A complete convolution computing array is produced by instantiating and arranging several configurable neural network computing units; the array can be divided into regions, different regions can receive different convolution parameters, and convolution modes of different types can thus be computed in parallel. The present invention provides a hardware architecture for the convolutional layer of convolutional neural networks that, while guaranteeing the computational performance of the system, supports the convolution modes of convolutional layers in different network models, greatly improving the flexibility of the system. The various working modes of the buffers inside the computing unit make full use of the data reusability of convolutional neural networks, effectively reducing the system power consumption produced by data movement and relieving the storage burden to a certain extent. A computing array composed of multiple computing units supports the parallel computation of kernels of different sizes, fully exploiting the algorithmic parallelism and data reusability of the convolutional layer in convolutional neural networks.
Brief description of the drawings
Fig. 1 is an overall structural diagram of a flexibly configurable neural network computing unit of the present invention;
Fig. 2 is a schematic diagram of the control module in a flexibly configurable neural network computing unit of the present invention;
Fig. 3 is the state-machine diagram of the feature-map data buffer in the control module;
Fig. 4 is a schematic diagram of a computing array generated by instantiating multiple computing units of the present invention.
Detailed description of the embodiments
The present invention is described below in further detail with reference to the accompanying drawings.
Referring to Fig. 1, a flexibly configurable neural network computing unit of the present invention comprises: a configurable storage module, a configurable control module, and a time-multiplexable multiply-accumulate module. The configurable storage module comprises: a feature-map data buffer, a stride data buffer, and a weight data buffer. The configurable control module comprises: a counter module and a state-machine module. The multiply-accumulate module comprises: a multiplier and an accumulator.
The feature-map data buffer stores the partial feature-map data used during a convolution computation and recycles feature-map data that is shared; its maximum length is L1 = max{K1A1, K2A2, …, KiAi}, where K is the convolution kernel size of a convolutional layer, A is the number of input channels to be mapped within the computing unit, and i is the index of the convolutional layer in the target network. The stride data buffer supplies the data that must be refreshed in the feature-map buffer when the convolution kernel slides by one stride; its maximum length is L2 = max{S1A1, S2A2, …, SiAi}, where S is the stride of the convolution kernel in a convolutional layer. The weight data buffer stores the weight data and allows it to be recycled; its length is L3 = max{K1A1B1, K2A2B2, …, KiAiBi}, where B is the number of output channels to be mapped within the computing unit.
The neural network computing unit is equipped with a feature-map data input port and a weight data input port. The feature-map data input port is connected to the input of first selector 1; the two outputs of first selector 1 are connected to the input of the stride data buffer and to the first input of second selector 2 respectively; the output of the stride data buffer is connected to the second input of second selector 2; and the output of second selector 2 is connected to the input of the feature-map data buffer. The weight data input port is connected to the input of the weight data buffer. The outputs of the feature-map data buffer and the weight data buffer are connected to the two inputs of the multiplier; the output of the multiplier is connected, through a register, the accumulator, and fourth selector 4, to the output of the neural network computing unit. The output of the feature-map data buffer is also connected back, through the third selector, to the buffer's input.
Referring to Fig. 2, the internal structure of the buffer control module consists mainly of the counter module and the state-machine module. The counter module comprises an input-data counter, an input-weight counter, an output-data counter, an output-channel counter, and an output-feature-map-size counter. The state-machine module contains, for each convolution kernel size, a corresponding feature-map-buffer state machine and weight-buffer state machine; each state machine performs its state transitions according to the values of the counters in the counter module.
Referring to Fig. 3, the state-machine diagram of the feature-map data buffer in the control module comprises the states: initialization state S0, data-ready state S1, wait state S6, full-cycle state S2, data-update state S3, half-cycle state S4, and no-cycle state S5; the different states determine the different working modes of the buffer. The external array control signals supply five items of information to the buffer control module: the convolution kernel size, the sliding stride of the convolution window, the output feature-map size, and the numbers of input and output channels mapped onto the array. From these, the upper limit of each counter is obtained, and the counter values drive the state transitions of the state machine for the corresponding kernel size, completing the operation of each buffer under the different kernel sizes. The states of the weight data buffer comprise: the initialization state, the data-ready state, the wait state, the full-cycle state, and the no-cycle state.
The different states of the state machine determine the different working modes of a buffer, specifically:
In the initialization state, no data has yet entered the computing unit;
In the data-ready state, data is entering the computing unit but the amount of input data is not yet sufficient to start computing;
In the wait state, convolution kernels of different sizes are computed in parallel and the output results must stay synchronized; a computing unit responsible for a smaller kernel has less work to do and therefore waits for the computing units handling larger kernels;
In the full-cycle state, the data currently output by the buffer will be reused; while entering the multiply-accumulate module it is also cycled back to the tail of the buffer's allocated space;
The data-update state exists only in the feature-map data buffer; when the data currently output no longer needs to be reused, it enters the multiply-accumulate module while new data taken from the stride data buffer is written to the tail of the feature-map data buffer;
The half-cycle state exists only in the feature-map data buffer and follows the data-update state; the data currently output enters the multiply-accumulate module and is simultaneously returned to the position in the buffer just before the newly updated data;
In the no-cycle state, the data currently output no longer needs to be recycled; it enters the multiply-accumulate module only and is not returned to the buffer.
Referring to Fig. 4, multiple computing units are interconnected in a row-stationary dataflow to generate a computing array. The scale of the computing array is determined by the hardware resources, the target network model, and the computational performance required of the system. The width of the computing array is K, where K must be greater than or equal to the maximum convolution kernel size Kmax in the network model, and greater than or equal to the sum of the kernel sizes that must be computed in parallel when one convolutional layer contains kernels of different sizes. The base length of the computing array is H, the minimum output feature-map size over all convolutional layers in the network model; the actual array length can be extended in multiples of 2^n according to the available hardware resources and the computational performance required of the system. When kernels of different sizes in a convolutional layer must be computed in parallel, with kernel sizes K1, K2, …, Ki satisfying K1 + K2 + … + Ki ≤ K, the computing array is divided laterally into i regions of sizes K1*H, K2*H, …, Ki*H; different regions receive different convolution-type parameters, and the computing units of each region configure their own storage and computing modules, completing the parallel computation of kernels of several sizes.
A construction method of the flexibly configurable neural network computing unit of the present invention comprises the following steps:
Step one: extract the network parameters from the target network model;
Step two: design the configurable storage module of the neural network computing unit according to step one; it stores the partial feature-map data and weight data used for computation and comprises: a feature-map data buffer, a stride data buffer, and a weight data buffer;
Step three: design the configurable control module of the neural network computing unit according to step one; for different convolution modes the configurable control module configures different buffer sizes in the storage module, generates the various working modes each buffer uses during convolution, and controls each buffer to work in the corresponding mode; the configurable control module comprises: a counter module and a state-machine module;
Step four: design the multiply-accumulate module of the neural network computing unit according to step one; it multiplies feature-map data by weights and accumulates the products into partial sums of the convolution result, and comprises a time-multiplexable multiplier and adder;
Step five: combining steps two, three, and four, supply the configurable control module of the neural network computing unit with five convolution layer parameters through the external input ports: convolution kernel size k, kernel stride s, output feature-map size h, number of mapped input channels a, and number of mapped output channels b. The configurable control module configures, for the configurable storage module, the buffer space required by the convolution of this layer, and controls the configurable storage module to output the corresponding data to the multiply-accumulate module. By supplying the corresponding convolution parameters to the computing unit, different convolutional layers each complete their share of the convolution on the same neural network computing unit.
The present invention provides a flexibly configurable neural network computing unit, a computing array, and a construction method thereof. For each type of convolution mode, only a few convolution parameters need to be supplied to the computing unit; the computing unit configures its internal storage and computing modules by itself, and the convolution computation of the corresponding mode is completed once feature maps and weight data are fed to the computing unit. This greatly improves the flexibility of the hardware implementation of the convolutional layer in convolutional neural networks and facilitates its rapid deployment on hardware.
In the present invention, the external inputs supply the convolution layer parameters to the control module of the computing unit; the control module configures the storage space appropriately and controls the storage module to work in the corresponding mode. Each buffer in the storage module has several working modes, so that shared data can be exploited to the maximum. Different convolutional layers can be completed on the same computing array built from the configurable computing units, and by dividing the array into regions, the parallel convolution of kernels of different sizes can be carried out on the array. The present invention supports convolution computations of arbitrary type as well as the parallel computation of kernels of several sizes, fully exploits the flexibility and data reusability of the convolutional neural network computing unit, greatly reduces the system power consumption caused by data movement, and improves the computational efficiency of the system.

Claims (10)

1. A flexibly configurable neural network computing unit, characterized by comprising: a configurable storage module, a configurable control module, and a time-multiplexable multiply-accumulate module;
The configurable storage module comprises: a feature-map data buffer, a stride data buffer, and a weight data buffer;
The configurable control module comprises: a counter module and a state-machine module;
The multiply-accumulate module comprises: a multiplier and an accumulator.
2. The flexibly configurable neural network computing unit according to claim 1, characterized in that the feature-map data buffer stores the partial feature-map data used during a convolution computation and recycles feature-map data that is shared; the maximum length of the buffer is L1 = max{K1A1, K2A2, …, KiAi}, where K is the convolution kernel size of a convolutional layer, A is the number of input channels to be mapped within the computing unit, and i is the index of the convolutional layer in the target network;
The stride data buffer supplies the data that must be refreshed in the feature-map buffer when the convolution kernel slides by one stride; the maximum length of the buffer is L2 = max{S1A1, S2A2, …, SiAi}, where S is the stride of the convolution kernel in a convolutional layer;
The weight data buffer stores the weight data and allows it to be reused; its length is L3 = max{K1A1B1, K2A2B2, …, KiAiBi}, where B is the number of output channels to be mapped within the computing unit.
3. The flexibly configurable neural network computing unit according to claim 1, characterized in that the counter module comprises an input-data counter, an input-weight counter, an output-data counter, an output-channel counter, and an output-feature-map-size counter;
The state-machine module contains, for each convolution kernel size, a corresponding feature-map-buffer state machine and weight-buffer state machine; each state machine performs its state transitions according to the values of the counters in the counter module.
4. The flexibly configurable neural network computing unit according to claim 1, characterized in that the computing unit is equipped with a feature-map data input port and a weight data input port;
The feature-map data input port is connected to the input of the first selector; the two outputs of the first selector are connected to the input of the stride data buffer and to the first input of the second selector respectively; the output of the stride data buffer is connected to the second input of the second selector; the output of the second selector is connected to the input of the feature-map data buffer;
The weight data input port is connected to the input of the weight data buffer;
The outputs of the feature-map data buffer and the weight data buffer are connected to the two inputs of the multiplier; the output of the multiplier is connected, through a register, the accumulator, and the fourth selector, to the output of the neural network computing unit.
5. The flexibly configurable neural network computing unit according to claim 3, characterized in that the state-machine module contains, for each convolution kernel size, a corresponding feature-map-buffer state machine and weight-buffer state machine; each state machine performs its state transitions according to the values of the counters in the counter module;
The states of the feature-map data buffer comprise: an initialization state, a data-ready state, a wait state, a full-cycle state, a data-update state, a half-cycle state, and a no-cycle state;
The states of the weight data buffer comprise: an initialization state, a data-ready state, a wait state, a full-cycle state, and a no-cycle state.
6. The flexibly configurable neural network computing unit according to claim 5, characterized in that:
in the initialization state, no data has yet entered the computing unit;
in the data-ready state, data is entering the computing unit but the amount of input data is not yet sufficient to start computing;
in the wait state, convolution kernels of different sizes are computed in parallel and the output results must stay synchronized; a computing unit responsible for a smaller kernel has less work to do and therefore waits for the computing units handling larger kernels;
in the full-cycle state, the data currently output by the buffer will be reused; while entering the multiply-accumulate module it is also cycled back to the tail of the buffer's allocated space;
the data-update state exists only in the feature-map data buffer; when the data currently output no longer needs to be reused, it enters the multiply-accumulate module while new data taken from the stride data buffer is written to the tail of the feature-map data buffer;
the half-cycle state exists only in the feature-map data buffer and follows the data-update state; the data currently output enters the multiply-accumulate module and is simultaneously returned to the position in the buffer just before the newly updated data;
in the no-cycle state, the data currently output no longer needs to be recycled; it enters the multiply-accumulate module only and is not returned to the buffer.
7. A computing array, characterized in that it is generated from flexibly configurable neural network computing units according to any one of claims 1 to 6; the computing array is divided into regions, different regions can be supplied with different convolution layer parameters, and the parallel computation of convolution modes of different kinds is completed.
8. The computing array according to claim 7, characterized in that it is generated by connecting the flexibly configurable neural network computing units of any one of claims 1 to 6 in a row-stationary dataflow; the scale of the computing array is determined by the hardware resources, the target network model, and the computational performance required of the system; the width of the computing array is K, where K must be greater than or equal to the maximum convolution kernel size Kmax in the network model, and greater than or equal to the sum of the kernel sizes that must be computed in parallel when one convolutional layer contains kernels of different sizes; the base length of the computing array is H, the minimum output feature-map size over all convolutional layers in the network model; the actual array length is extended in multiples of 2^n according to the available hardware resources and the computational performance required of the system; when kernels of different sizes in a convolutional layer must be computed in parallel, with kernel sizes K1, K2, …, Ki satisfying K1 + K2 + … + Ki ≤ K, the computing array is divided laterally into i regions of sizes K1*H, K2*H, …, Ki*H; different regions receive different convolution-type parameters, and the computing units of each region configure their own storage and computing modules, completing the parallel computation of kernels of several sizes.
9. A construction method of a flexibly configurable neural network computing unit, characterized by comprising the following steps:
Step one: extract the network parameters from the target network model;
Step two: design the configurable storage module of the neural network computing unit according to step one; it stores the partial feature-map data and weight data used for computation and comprises: a feature-map data buffer, a stride data buffer, and a weight data buffer;
Step three: design the configurable control module of the neural network computing unit according to step one; for different convolution modes the configurable control module configures different buffer sizes in the storage module, generates the various working modes each buffer uses during convolution, and controls each buffer to work in the corresponding mode; the configurable control module comprises: a counter module and a state-machine module;
Step four: design the multiply-accumulate module of the neural network computing unit according to step one; it multiplies feature-map data by weights and accumulates the products into partial sums of the convolution result, and comprises a time-multiplexable multiplier and adder;
Step five: combining steps two, three, and four, supply the configurable control module of the neural network computing unit with five convolution layer parameters through the external input ports: convolution kernel size k, kernel stride s, output feature-map size h, number of mapped input channels a, and number of mapped output channels b; the configurable control module configures, for the configurable storage module, the buffer space required by the convolution of this layer, and controls the configurable storage module to output the corresponding data to the multiply-accumulate module; by supplying the corresponding convolution parameters to the computing unit, different convolutional layers each complete their share of the convolution on the same neural network computing unit.
10. The construction method of a flexibly configurable neural network computing unit according to claim 9, characterized in that:
Step one specifically comprises: according to the target network model, extract the required parameters, including the convolution kernel size Ki and sliding stride Si of each convolutional layer, the output feature-map size Hi of each convolutional layer, and the numbers of input and output channels Ai and Bi to be mapped within the computing unit for each convolutional layer, where i is the index of the convolutional layer;
In step two: the feature-map data buffer stores the partial pixel data used during a convolution computation and recycles pixel data that is shared; its length is max{K1A1, K2A2, …, KiAi}; the stride data buffer supplies the data that must be refreshed in the feature-map buffer when the convolution kernel slides by one stride; its length is max{S1A1, S2A2, …, SiAi}; the weight data buffer stores the weight data and allows it to be reused; its length is max{K1A1B1, K2A2B2, …, KiAiBi};
In step three: the counter module comprises an input-data counter, an input-weight counter, an output-data counter, an output-channel counter, and an output-feature-map-size counter; the state-machine module contains, for each convolution kernel size, a corresponding feature-map-buffer state machine and weight-buffer state machine; each state machine performs its state transitions according to the values of the counters in the counter module; the states comprise: an initialization state, a data-ready state, a wait state, a full-cycle state, a data-update state, a half-cycle state, and a no-cycle state.
CN201811133940.2A 2018-09-27 2018-09-27 Flexibly configurable neural network computing unit, computing array and construction method thereof Active CN109409512B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811133940.2A CN109409512B (en) 2018-09-27 2018-09-27 Flexibly configurable neural network computing unit, computing array and construction method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811133940.2A CN109409512B (en) 2018-09-27 2018-09-27 Flexibly configurable neural network computing unit, computing array and construction method thereof

Publications (2)

Publication Number Publication Date
CN109409512A true CN109409512A (en) 2019-03-01
CN109409512B CN109409512B (en) 2021-02-19

Family

ID=65465369

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811133940.2A Active CN109409512B (en) 2018-09-27 2018-09-27 Flexibly configurable neural network computing unit, computing array and construction method thereof

Country Status (1)

Country Link
CN (1) CN109409512B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110029471A1 (en) * 2009-07-30 2011-02-03 Nec Laboratories America, Inc. Dynamically configurable, multi-ported co-processor for convolutional neural networks
CN105681628A (en) * 2016-01-05 2016-06-15 西安交通大学 Convolution network arithmetic unit, reconfigurable convolution neural network processor and image de-noising method of reconfigurable convolution neural network processor
CN106940815A (en) * 2017-02-13 2017-07-11 西安交通大学 A kind of programmable convolutional neural networks Crypto Coprocessor IP Core
CN107918794A (en) * 2017-11-15 2018-04-17 中国科学院计算技术研究所 Neural network processor based on computing array

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110334798B (en) * 2019-03-13 2021-06-08 北京地平线机器人技术研发有限公司 Feature data extraction method and device and instruction generation method and device
CN109656623A (en) * 2019-03-13 2019-04-19 北京地平线机器人技术研发有限公司 It executes the method and device of convolution algorithm operation, generate the method and device of instruction
CN110334798A (en) * 2019-03-13 2019-10-15 北京地平线机器人技术研发有限公司 Characteristic extracting method and device, instruction generation method and device
CN109656623B (en) * 2019-03-13 2019-06-14 北京地平线机器人技术研发有限公司 It executes the method and device of convolution algorithm operation, generate the method and device of instruction
CN110084739A (en) * 2019-03-28 2019-08-02 东南大学 A kind of parallel acceleration system of FPGA of the picture quality enhancement algorithm based on CNN
CN110070178A (en) * 2019-04-25 2019-07-30 北京交通大学 A kind of convolutional neural networks computing device and method
WO2021144126A1 (en) * 2020-01-15 2021-07-22 Graphcore Limited Control of data transfer between processors
US11625357B2 (en) 2020-01-15 2023-04-11 Graphcore Limited Control of data transfer between processors
CN113807506A (en) * 2020-06-11 2021-12-17 杭州知存智能科技有限公司 Data loading circuit and method
US11977969B2 (en) 2020-06-11 2024-05-07 Hangzhou Zhicun Intelligent Technology Co., Ltd. Data loading
CN111610963A (en) * 2020-06-24 2020-09-01 上海西井信息科技有限公司 Chip structure and multiply-add calculation engine thereof
CN112418418A (en) * 2020-11-11 2021-02-26 江苏禹空间科技有限公司 Data processing method and device based on neural network, storage medium and server
CN112346704B (en) * 2020-11-23 2021-09-17 华中科技大学 Full-streamline type multiply-add unit array circuit for convolutional neural network
CN112346704A (en) * 2020-11-23 2021-02-09 华中科技大学 Full-streamline type multiply-add unit array circuit for convolutional neural network
CN113138957A (en) * 2021-03-29 2021-07-20 北京智芯微电子科技有限公司 Chip for neural network inference and method for accelerating neural network inference
CN113592067A (en) * 2021-07-16 2021-11-02 华中科技大学 Configurable convolution calculation circuit for convolution neural network
CN113592067B (en) * 2021-07-16 2024-02-06 华中科技大学 Configurable convolution calculation circuit for convolution neural network
WO2023115529A1 (en) * 2021-12-24 2023-06-29 华为技术有限公司 Data processing method in chip, and chip

Also Published As

Publication number Publication date
CN109409512B (en) 2021-02-19

Similar Documents

Publication Publication Date Title
CN109409512A (en) A kind of neural computing unit, computing array and its construction method of flexibly configurable
AU2019442319B2 (en) Structural topology optimization method based on material-field reduction series expansion
CN110390384B (en) Configurable general convolutional neural network accelerator
CN108171317B (en) Data multiplexing convolution neural network accelerator based on SOC
CN111667051A (en) Neural network accelerator suitable for edge equipment and neural network acceleration calculation method
Liu et al. Social learning discrete Particle Swarm Optimization based two-stage X-routing for IC design under Intelligent Edge Computing architecture
CN110458279A (en) A kind of binary neural network accelerated method and system based on FPGA
CA3124369A1 (en) Neural network processor
CN108537331A (en) A kind of restructural convolutional neural networks accelerating circuit based on asynchronous logic
CN106844703A (en) A kind of internal storage data warehouse query processing implementation method of data base-oriented all-in-one
Jin et al. A parallel optimization method for stencil computation on the domain that is bigger than memory capacity of GPUs
CN109472361A (en) Neural network optimization
CN101882238A (en) Wavelet neural network processor based on SOPC (System On a Programmable Chip)
CN108228970B (en) Structural dynamics analysis explicit different step length parallel computing method
CN106415526B (en) Fft processor and operation method
CN106294278B (en) Adaptive hardware for dynamic reconfigurable array computing system is pre-configured controller
CN105574809A (en) Matrix exponent-based parallel calculation method for electromagnetic transient simulation graphic processor
CN103399927A (en) Distributed computing method and device
CN107064930A (en) Radar foresight imaging method based on GPU
Vartziotis et al. Improved GETMe by adaptive mesh smoothing
CN111079078B (en) Lower triangular equation parallel solving method for structural grid sparse matrix
CN114911619A (en) Batch parallel LU decomposition method of small and medium-sized dense matrix based on GPU for simulation system
CN103838680A (en) Data caching method and device
Sait et al. Optimization of FPGA-based CNN accelerators using metaheuristics
CN111339688B (en) Method for solving rocket simulation model time domain equation based on big data parallel algorithm

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant