CN110232441A

CN110232441A - Stacked autoencoder system and method based on a unidirectional systolic array

Info

Publication number: CN110232441A
Application number: CN201910528794.1A
Authority: CN
Inventors: 李丽; 黄延; 傅玉祥; 陈沁雨; 李伟
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2019-06-18
Filing date: 2019-06-18
Publication date: 2019-09-13
Anticipated expiration: 2039-06-18
Also published as: CN110232441B

Abstract

The invention provides a hardware implementation of stacked autoencoder inference based on a unidirectional systolic array, comprising a signal control module, an input/output control module, a data address generation module, and a computing array module. Signal control module: receives the start signal, controls communication between the modules, and generates the end signal. Input/output control module: on input, reads data from the off-chip DDR and stores it in the on-chip SRAM in a specific layout; on output, writes the on-chip SRAM data back to the DDR in a specific layout. Data address generation module: generates the addresses of source data or result data. Computing array module: performs the inference operations of the neural network algorithm in the manner of a unidirectional systolic array. The invention supports batch processing and pipelined operation, hides part of the computation time and memory access time through ping-pong operation, achieves a high speedup, and scales well.

Description

Stacked autoencoder system and method based on a unidirectional systolic array

Technical field

The present invention relates to the field of artificial intelligence technology, and more particularly to a stacked autoencoder system and method based on a unidirectional systolic array.

Background art

A stacked denoising autoencoder is a typical standard neural network with two main components: a series of autoencoders and a multilayer perceptron (MLP). The inference process of a stacked denoising autoencoder is in effect equivalent to the feed-forward process of a multilayer perceptron. Let the output of the j-th neuron in a given layer of the network be $y_j$, let there be $n$ operands, let the i-th operand be $x_i$ with corresponding weight $w_{ij}$, and let the bias be $b_j$. Then:

$$y_j = f\left(\sum_{i=1}^{n} w_{ij}\, x_i + b_j\right)$$

where $f$ denotes the activation function of the layer.
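As a minimal sketch of the feed-forward computation described here, the following Python function evaluates one fully connected layer, where each neuron output is an activation applied to the weighted sum of the operands plus a bias. The function name `layer_forward`, the list-of-lists weight layout, and the default activation are illustrative assumptions and are not taken from the invention.

```python
import math

def layer_forward(x, w, b, activation=math.tanh):
    """Compute activation(sum_i w[i][j] * x[i] + b[j]) for every neuron j.

    x: list of n operands; w: n x m weights (w[i][j] links input i to neuron j);
    b: list of m biases. Names and layout are illustrative assumptions.
    """
    n, m = len(x), len(b)
    y = []
    for j in range(m):
        acc = b[j]
        for i in range(n):
            acc += w[i][j] * x[i]   # multiply-accumulate
        y.append(activation(acc))
    return y

def relu(v):
    return max(0.0, v)

# Two inputs, two neurons, ReLU activation.
outputs = layer_forward([1.0, 2.0], [[0.5, -1.0], [0.25, 0.5]], [0.0, -1.0], relu)
```

Stacking several such layers, each consuming the previous layer's output, gives the inference pass of the multilayer perceptron.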

A computation-intensive algorithm of this kind requires substantial computing power. Before 2007, given the limited network sizes and data volumes of the time, general-purpose CPU chips could provide enough computing power. Later, with the rapid development of GPUs, their parallel computing characteristics proved well suited to the large-scale parallel computation that artificial intelligence algorithms require, and GPUs became mainstream. Structurally, about 70% of the transistors in a CPU are used to build caches and control units, leaving few arithmetic logic units (ALUs), so a CPU struggles to meet the computing power demands of artificial intelligence algorithms. The computing capability of a GPU far exceeds that of a CPU, but the hardware structure of a GPU is not programmable: if the algorithm changes substantially, the GPU cannot flexibly reconfigure its hardware structure. In addition, both GPUs and CPUs consume considerable energy.

Today, as technology advances and more and more application scenarios emerge, demand for artificial intelligence chips is steadily rising, and the development of artificial intelligence faces new problems. For example, driverless cars require real-time responses with extremely low latency, which rules out GPUs with their high power consumption and high cost.

How to handle the enormous computational load of deep learning under acceptable power and cost constraints, so that neural networks achieve better performance, lower power consumption, and better scalability, is a major technical problem in artificial intelligence today.

Summary of the invention

The object of the present invention is to overcome the problems of the existing technology. To improve the efficiency of neural network computation, make full use of storage and computing resources, and accelerate inference, the invention provides a stacked autoencoder system based on a unidirectional systolic array, implemented by the following technical solution:

The stacked autoencoder system based on a unidirectional systolic array comprises:

Signal control module: receives the start signal, controls communication between the modules, and generates the end signal;

Memory module: comprises an off-chip DDR memory and an on-chip SRAM memory;

Input/output control module: on input, reads data from the off-chip DDR memory and stores it sequentially in the on-chip SRAM memory; on output, writes the data of the on-chip SRAM memory sequentially back to the off-chip DDR memory;

Data address generation module: generates the addresses of source data or result data;

Computing array module: performs the inference operations of the neural network algorithm in the manner of a unidirectional systolic array.

In a further design of the stacked autoencoder system based on a unidirectional systolic array, all results of the neural network algorithm share the same set of storage resources, and the storage locations occupied by intermediate results of the algorithm may be overwritten.

In a further design of the stacked autoencoder system based on a unidirectional systolic array, the computing array module comprises a unidirectional systolic array of size 32x32. Each independent computing unit in the array contains a 16-bit fixed-point multiplier, an adder, a divider, a linear activation function unit supporting the ReLU function, and a nonlinear activation function unit supporting the tanh and sigmoid functions, and thereby implements multiply-accumulate operations and the neural network activation functions.
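The behavior of one computing unit can be sketched in software as a fixed-point multiply-accumulate followed by a selectable activation. The sketch below is a hedged illustration only: the Q8.8 format (`FRAC_BITS = 8`), the wide accumulator, the saturation on output, and all function names are assumptions, since the text specifies only 16-bit fixed-point arithmetic and the three activation functions. The activations are evaluated in floating point here, which does not model how the hardware units or the divider compute them.

```python
import math

FRAC_BITS = 8                      # assumed Q8.8 format; the text only says 16-bit fixed point
INT16_MIN, INT16_MAX = -32768, 32767

def to_fixed(value):
    """Quantize a float to a saturated 16-bit fixed-point integer."""
    raw = int(round(value * (1 << FRAC_BITS)))
    return max(INT16_MIN, min(INT16_MAX, raw))

def to_float(raw):
    return raw / (1 << FRAC_BITS)

def pe_compute(operands, weights, bias, activation="relu"):
    """Model one computing unit: multiply-accumulate, add bias, apply activation.

    operands, weights and bias are 16-bit fixed-point integers. The accumulator
    is kept wide and rescaled once at the end (an assumption of this sketch).
    """
    acc = 0
    for x, w in zip(operands, weights):
        acc += x * w                       # each product carries 2 * FRAC_BITS fractional bits
    acc = (acc >> FRAC_BITS) + bias        # rescale to FRAC_BITS fractional bits, then add bias
    value = to_float(acc)
    if activation == "relu":
        result = max(0.0, value)
    elif activation == "tanh":
        result = math.tanh(value)
    elif activation == "sigmoid":
        result = 1.0 / (1.0 + math.exp(-value))
    else:
        raise ValueError("unsupported activation: " + activation)
    return to_fixed(result)

xs = [to_fixed(v) for v in (1.0, 2.0, -0.5)]
ws = [to_fixed(v) for v in (0.5, 0.25, 1.0)]
out = pe_compute(xs, ws, to_fixed(0.25), "relu")
```

With these inputs the weighted sum is 0.5, the bias raises it to 0.75, and `out` holds the fixed-point encoding of that value.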

In a further design of the stacked autoencoder system based on a unidirectional systolic array, the unidirectional systolic array uses a unidirectional systolic data transfer scheme that combines systolic flow along columns with broadcast along rows. Specifically, operands are transmitted simultaneously, row by row, to the computing units, while weights enter the computing units of a column one after another, column by column; both weights and operands can be reused multiple times.
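The benefit of reusing weights and operands can be illustrated with a simple fetch count. The sketch below assumes that each row of the array handles one batch and each column one neuron, as in the embodiment described later, and compares the number of values fetched from memory with and without sharing. The function name and the counting model are illustrative assumptions, not figures from the invention.

```python
def fetch_counts(rows, cols, length):
    """Return (fetches_without_reuse, fetches_with_reuse) for a rows x cols array.

    Assumes each row handles one batch and each column one neuron, with
    operand and weight vectors of the given length.
    """
    # Without reuse, every computing unit fetches its own operand and weight each step.
    naive = rows * cols * length * 2
    # With broadcast, one operand fetch serves all computing units of a row;
    # with systolic flow, one weight fetch serves all computing units of a column.
    shared = rows * length + cols * length
    return naive, shared

# Values from the embodiment: 3 batches, 32 hidden neurons, 128 operands.
naive, shared = fetch_counts(rows=3, cols=32, length=128)
reuse_factor = naive / shared
```

Under this counting model, sharing reduces the number of fetched values from 24576 to 4480 for the embodiment's dimensions.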

An autoencoding method using the stacked autoencoder system based on a unidirectional systolic array comprises the following steps:

Step 1) The signal control module receives the algorithm start signal and controls the input/output control module to transfer the input data from the DDR memory into the SRAM memory in a specific order;

Step 2) The signal control module then controls the data address generation module to generate source data addresses; according to these addresses, the operands stored in the SRAM memory are passed to the computing array module, and an input-data-valid signal is generated and sent;

Step 3) After receiving the input-data-valid signal and the operands read from the SRAM memory, the computing array module begins the neural network inference computation. During computation, for each column, the weights of the different neurons flow from top to bottom into the computing units; for each row, the input data of the same batch is broadcast from left to right to every computing unit of the computing array; the computation is completed within each computing unit;

Step 4) The computing array module generates an output-data-valid signal; after receiving this signal, the signal control module controls the data address generation module to generate result data addresses, and the result data is written to the SRAM memory according to these addresses;

Step 5) The signal control module controls the input/output control module to write the result data from the on-chip SRAM memory to the off-chip DDR memory and generates the end signal, completing one full neural network inference computation.
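The five steps above amount to a linear control sequence, which can be sketched as a small state machine. The state names and the event names (`load_done`, `store_done`, and so on) are hypothetical labels chosen for this illustration; the text names only the start signal, the input-data-valid signal, the output-data-valid signal, and the end signal.

```python
from enum import Enum, auto

class State(Enum):
    IDLE = auto()
    LOAD_INPUT = auto()      # step 1: DDR -> SRAM
    GEN_SRC_ADDR = auto()    # step 2: source addresses, input-data-valid
    COMPUTE = auto()         # step 3: systolic array inference
    STORE_RESULT = auto()    # step 4: result addresses, results -> SRAM
    WRITE_BACK = auto()      # step 5: SRAM -> DDR
    DONE = auto()            # end signal asserted

# Event that moves the controller out of each state (hypothetical event names).
TRANSITIONS = {
    State.IDLE: ("start", State.LOAD_INPUT),
    State.LOAD_INPUT: ("load_done", State.GEN_SRC_ADDR),
    State.GEN_SRC_ADDR: ("input_valid", State.COMPUTE),
    State.COMPUTE: ("output_valid", State.STORE_RESULT),
    State.STORE_RESULT: ("store_done", State.WRITE_BACK),
    State.WRITE_BACK: ("writeback_done", State.DONE),
}

def step(state, event):
    """Advance the controller only if the event is the one expected in this state."""
    expected, nxt = TRANSITIONS.get(state, (None, state))
    return nxt if event == expected else state

def run(events):
    state = State.IDLE
    trace = [state]
    for event in events:
        state = step(state, event)
        trace.append(state)
    return trace

trace = run(["start", "load_done", "input_valid", "output_valid",
             "store_done", "writeback_done"])
end_signal = trace[-1] is State.DONE
```

Events that arrive out of order leave the state unchanged, which mirrors the strictly sequential handshake between the modules.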

In a further design of the autoencoding method, the input data in step 1) comprises operands and weights; each address of a storage unit in the SRAM memory can hold four 16-bit fixed-point values, and operands and weights are stored sequentially in the storage units.
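Storing four 16-bit values per 64-bit address can be sketched as follows. The lane order (first element in the least significant 16 bits), the zero padding of a partial final word, and the function names are assumptions of this sketch; the text states only that four 16-bit fixed-point values share one address and that data is stored sequentially.

```python
def pack_word(values):
    """Pack four signed 16-bit values into one 64-bit word (lane 0 in the low bits)."""
    if len(values) != 4:
        raise ValueError("expected exactly four values")
    word = 0
    for lane, v in enumerate(values):
        if not -32768 <= v <= 32767:
            raise ValueError("value out of 16-bit range")
        word |= (v & 0xFFFF) << (16 * lane)
    return word

def unpack_word(word):
    """Recover the four signed 16-bit values stored in a 64-bit word."""
    values = []
    for lane in range(4):
        raw = (word >> (16 * lane)) & 0xFFFF
        values.append(raw - 0x10000 if raw & 0x8000 else raw)
    return values

def pack_vector(vector):
    """Store a vector sequentially, four elements per address, zero-padding the tail."""
    padded = list(vector) + [0] * (-len(vector) % 4)
    return [pack_word(padded[i:i + 4]) for i in range(0, len(padded), 4)]

words = pack_vector(range(-3, 125))   # a 128-element vector occupies 32 addresses
```

Element `k` of a vector then sits at address `k // 4`, lane `k % 4`, relative to the vector's base address.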

Beneficial effects of the present invention:

The hardware implementation of stacked autoencoder inference based on a unidirectional systolic array supports a configurable number of network layers and neurons, a choice among three different inter-layer activation functions, pipelined and ping-pong operation, and batch processing; it is flexible to apply and scales well.
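The effect of ping-pong operation on total run time can be illustrated with a simple timing model: with two buffers, the data for the next block is loaded while the current block is being computed, so only the slower of the two activities determines the steady-state cost. The cycle counts below are arbitrary illustrative values, not measurements of the invention, and the function names are assumptions.

```python
def serial_time(tiles, load, compute):
    """Total time when each block is loaded and then computed, with no overlap."""
    return tiles * (load + compute)

def ping_pong_time(tiles, load, compute):
    """Total time with two buffers: block k+1 loads while block k computes."""
    if tiles == 0:
        return 0
    # The first load cannot be hidden; each following stage costs the slower of
    # load and compute; the final compute has no load left to overlap with.
    return load + (tiles - 1) * max(load, compute) + compute

t_serial = serial_time(tiles=8, load=100, compute=150)
t_overlap = ping_pong_time(tiles=8, load=100, compute=150)
```

In this model the memory access time is fully hidden whenever the compute time per block is at least as long as the load time.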

Brief description of the drawings

Fig. 1 is a schematic diagram of a typical stacked autoencoder network model.

Fig. 2 is a schematic diagram of the stacked autoencoder system based on a unidirectional systolic array.

Fig. 3 is a schematic diagram of the data flow of the unidirectional systolic array.

Fig. 4 is a schematic diagram of the operand storage layout.

Fig. 5 is a schematic diagram of the weight storage layout.

Fig. 6 is a schematic diagram of the output storage layout.

Detailed description of the embodiments

The present invention is described in detail below with reference to the accompanying drawings.

As shown in Fig. 1, this embodiment takes a typical standard neural network as an example. Adjacent layers are fully connected: each neuron receives input from all neurons of the previous layer and is connected to all neurons of the next layer. Inputs are propagated through weighted connections together with the bias of each neuron, and the output of a neuron is determined by the weights of the current neuron, the bias of the current neuron, and the outputs of the neurons in the previous layer.

The stacked autoencoder system based on a unidirectional systolic array in this embodiment consists mainly of a signal control module, an input/output control module, a data address generation module, and a computing array module.

The relationships between the modules are shown in Fig. 2. The signal control module is responsible for receiving the start signal, controlling communication between the modules, and generating the end signal. Specifically, it receives the start signal; controls the input/output control module to load the source data; controls the data address generation module to generate source data addresses, according to which the source data is read from the storage units and passed to the computing array module for computation; after the computation finishes, receives the output-data-valid signal; controls the data address generation module to generate result data addresses; controls the input/output control module to write out the result data; and generates the end signal.

The input/output control module is responsible for communication between the on-chip SRAM and the off-chip DDR memory. Specifically, after receiving a request signal, it reads data from the off-chip DDR and writes it into the on-chip SRAM according to a specific rule and order, for use by the computing array. After all computation has finished, it receives the end signal, reads the data from the on-chip SRAM, and writes it to the off-chip DDR according to a specific rule and order.

Data address generation module: before computation, it generates and outputs the addresses of the source data (operands and weights); after computation, it generates and outputs the addresses of the output data (that is, the result data).

The computing array module is designed as a unidirectional systolic array, which performs all multiply-accumulate operations of neural network inference. Specifically, it receives the input-data-valid signal from the signal control module and starts the neural network inference computation; when the computation finishes, it generates an output-result-valid signal and sends it to the signal control module.

A concrete example is given below with reference to Fig. 3. In this example, the neural network inference module consists of one 32x32 unidirectional systolic array, and the memory module consists of 128 data storage units and 32 constant storage units. Each data storage unit is an SRAM with a width of 64 bits and a depth of 8k; each constant storage unit is an SRAM with a width of 64 bits and a depth of 1k. Computation uses 16-bit fixed-point numbers, the number of operands is 128, the number of hidden-layer neurons is 32, and the batch size is set to 3.
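The storage figures in this example can be turned into capacities with a short calculation. The sketch below reads "8k" and "1k" as 8192 and 1024 entries, which is an assumption, and the variable names are illustrative.

```python
WORD_BITS = 64
DATA_UNITS, DATA_DEPTH = 128, 8 * 1024     # "8k" read as 8192 entries (assumption)
CONST_UNITS, CONST_DEPTH = 32, 1 * 1024    # "1k" read as 1024 entries (assumption)
VALUES_PER_WORD = WORD_BITS // 16          # 16-bit fixed-point values per address

def capacity_bytes(units, depth, word_bits=WORD_BITS):
    """Total capacity in bytes of `units` SRAMs, each `depth` words of `word_bits` bits."""
    return units * depth * word_bits // 8

data_bytes = capacity_bytes(DATA_UNITS, DATA_DEPTH)      # data storage units
const_bytes = capacity_bytes(CONST_UNITS, CONST_DEPTH)   # constant storage units
words_per_vector = 128 // VALUES_PER_WORD                # addresses for one 128-element vector
```

Under these assumptions the data storage units total 8 MiB, the constant storage units total 256 KiB, and one 128-element operand or weight vector occupies 32 addresses.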

The specific steps of this embodiment are as follows:

Step 1) The signal control module receives the algorithm start signal and controls the input/output control module to transfer the input data from DDR into SRAM in a specific order. The storage layout of the input data (operands and weights) in SRAM is shown in Fig. 5 and Fig. 6; each address of a storage unit (64 bits) can hold four 16-bit fixed-point values, and operands and weights are stored sequentially in the storage units.

Step 2) After the data transfer of step 1) is complete, the signal control module controls the data address generation module to generate source data addresses; according to these addresses, the operands stored in SRAM are passed to the computing array module, and an input-data-valid signal is generated and sent.

Step 3) After receiving the input-data-valid signal and the operands read from SRAM, the computing array module begins the neural network inference computation. The data flow of the computation is shown in Fig. 4:

Let X⁽¹⁾, X⁽²⁾, X⁽³⁾ denote the operands of batch 1, batch 2, and batch 3, respectively; the operands of all three batches are vectors of length 128. Let W₁, W₂, W₃, ..., W₃₂ denote the weights of hidden-layer neuron 1, neuron 2, neuron 3, ..., neuron 32, respectively; each is a vector of length 128. For example, W₁ = (W_{1_1}, W_{1_2}, W_{1_3}, …, W_{1_128}). The computing unit in row i and column j is denoted PE(i, j).

During computation, for each column, the weights of the different neurons flow from top to bottom into the computing units (MLUs); for each row, the input data of the same batch is broadcast from left to right to every computing unit. The computation is completed within each computing unit, and its main step is multiply-accumulate.

Taking MLU(1,1) in the first row as an example, the input data (the operands of batch 1, X⁽¹⁾) enters the systolic array in row order and is broadcast along the same row, while the corresponding weights W_{1_1}, W_{1_2}, W_{1_3}, …, W_{1_128} enter the array in column order and flow down the same column. The multiply-accumulate of operands and weights is completed inside MLU(1,1). Once it is complete, the result is added to the bias b₁ of the current MLU and passed through the activation function unit (AU), which completes the computation of the output of the 1st neuron. Similarly, MLU(1,2), MLU(1,3), ..., MLU(1,32) complete the computation of the outputs of the 2nd to 32nd neurons in turn. For the MLUs of the second and third rows, the computation is identical to that of the first row, but the results are obtained one clock cycle later.

At this point, the computing array has completed the entire neural network inference computation.
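The data flow of this example can be checked with a small cycle-level model. In the sketch below, row r handles batch r and column c handles neuron c; weight element k of a neuron enters the top row at cycle k and moves down one row per cycle, and the operand of each row is broadcast with the same skew so that matching elements meet. The one-cycle-per-row skew is one reading of the delay mentioned above and is an assumption, as are the function name and the random integer test data. The activation function is left out so that the result can be compared against a plain dot product.

```python
import random

def simulate(X, W, bias):
    """Cycle-level model of the array: returns pre-activation outputs and finish cycles.

    X[r] is the operand vector of batch r, W[c] the weight vector of neuron c.
    Row r sees element k = t - r at cycle t (assumed one-cycle skew per row).
    """
    rows, cols, n = len(X), len(W), len(X[0])
    acc = [[0] * cols for _ in range(rows)]
    done_cycle = [None] * rows
    for t in range(n + rows - 1):
        for r in range(rows):
            k = t - r                       # element index reaching row r this cycle
            if 0 <= k < n:
                for c in range(cols):
                    acc[r][c] += X[r][k] * W[c][k]   # multiply-accumulate in unit (r, c)
                if k == n - 1:
                    done_cycle[r] = t
    out = [[acc[r][c] + bias[c] for c in range(cols)] for r in range(rows)]
    return out, done_cycle

random.seed(0)
X = [[random.randint(-8, 8) for _ in range(128)] for _ in range(3)]     # 3 batches
W = [[random.randint(-8, 8) for _ in range(128)] for _ in range(32)]    # 32 neurons
b = [random.randint(-8, 8) for _ in range(32)]
out, done_cycle = simulate(X, W, b)
```

Each row finishes its accumulation one cycle after the row above it, and every entry of `out` equals the dot product of the corresponding operand and weight vectors plus the neuron's bias.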

Step 4) The computing array module generates an output-data-valid signal; after receiving this signal, the signal control module controls the data address generation module to generate result data addresses, and the result data is written to SRAM according to these addresses.

Step 5) The signal control module controls the input/output control module to write the results from the on-chip SRAM memory to the off-chip DDR memory and generates the end signal, completing one full neural network inference computation.

The above is merely a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be determined by the scope of protection of the claims.

Claims

1. A stacked autoencoder system based on a unidirectional systolic array, characterized by comprising:

Memory module: comprises an off-chip DDR memory and an on-chip SRAM memory;

Input/output control module: on input, reads data from the off-chip DDR memory and stores it sequentially in the on-chip SRAM memory; on output, writes the result data of the on-chip SRAM memory sequentially back to the off-chip DDR memory;

2. The stacked autoencoder system based on a unidirectional systolic array according to claim 1, characterized in that all results of the neural network algorithm share the same set of storage resources, and the storage locations occupied by intermediate results of the algorithm may be overwritten.

3. The stacked autoencoder system based on a unidirectional systolic array according to claim 1, characterized in that the computing array module comprises a unidirectional systolic array of size 32x32, and each independent computing unit in the array contains a 16-bit fixed-point multiplier, an adder, a divider, a linear activation function unit supporting the ReLU function, and a nonlinear activation function unit supporting the tanh and sigmoid functions, thereby implementing multiply-accumulate operations and the neural network activation functions.

4. The stacked autoencoder system based on a unidirectional systolic array according to claim 1, characterized in that the unidirectional systolic array uses a unidirectional systolic data transfer scheme that combines systolic flow along columns with broadcast along rows, specifically: operands are transmitted simultaneously, row by row, to the computing units, while weights enter the computing units of a column one after another, column by column; both weights and operands can be reused multiple times.

5. An autoencoding method using the stacked autoencoder system based on a unidirectional systolic array according to any one of claims 1-4, characterized by comprising the following steps:

Step 1) The signal control module receives the algorithm start signal and controls the input/output control module to transfer the input data from the DDR memory into the SRAM memory in a specific order;

Step 2) The signal control module then controls the data address generation module to generate source data addresses; according to these addresses, the operands stored in the SRAM memory are passed to the computing array module, and an input-data-valid signal is generated and sent;

Step 3) After receiving the input-data-valid signal and the operands read from the SRAM memory, the computing array module begins the neural network inference computation. During computation, for each column, the weights of the different neurons flow from top to bottom into the computing units; for each row, the input data of the same batch is broadcast from left to right to every computing unit of the computing array; the computation is completed within each computing unit;

Step 4) The computing array module generates an output-data-valid signal; after receiving this signal, the signal control module controls the data address generation module to generate result data addresses, and the result data is written to the SRAM memory according to these addresses;

Step 5) The signal control module controls the input/output control module to write the result data from the on-chip SRAM memory to the off-chip DDR memory and generates the end signal, completing one full neural network inference computation.

6. The autoencoding method according to claim 5, characterized in that the input data in step 1) comprises operands and weights; each address of a storage unit in the SRAM memory can hold four 16-bit fixed-point values, and operands and weights are stored sequentially in the storage units.