CN109978143A - Stacked autoencoder based on SIMD architecture and encoding method - Google Patents

Stacked autoencoder based on SIMD architecture and encoding method - Download PDF

Info

Publication number
CN109978143A
CN109978143A (application CN201910251530.6A)
Authority
CN
China
Prior art keywords
layer
neural network
weight
SRAM
bias
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910251530.6A
Other languages
Chinese (zh)
Other versions
CN109978143B (en)
Inventor
李丽
马博涵
傅玉祥
张衡
李伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN201910251530.6A priority Critical patent/CN109978143B/en
Publication of CN109978143A publication Critical patent/CN109978143A/en
Application granted granted Critical
Publication of CN109978143B publication Critical patent/CN109978143B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Image Analysis (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

A stacked autoencoder based on a SIMD architecture and an encoding method of the present invention. The autoencoder includes a DMA interface module, a neural network inference module and a neural network training module. The DMA interface module reads data from off-chip DDR in DMA mode, stores it in on-chip SRAM according to a partitioning scheme, and writes the final operation result back to DDR in DMA mode. The inference computation module of the neural network performs classification inference on new samples using the trained weights and biases. The training module of the neural network is mainly responsible for updating the weights and biases of the neural network layer by layer, working forward from the last layer of the neural network. The invention has the following advantages: the autoencoder places no limit on the number of neural network layers it supports, so it supports inference and training of large-scale neural networks, and it overlaps part of the computation time with the memory-access time through ping-pong operation; it therefore has good practical significance and broad application prospects.

Description

A stacked autoencoder based on a SIMD architecture and an encoding method
Technical field
The present invention relates to the field of hardware implementation of intelligent algorithms, and in particular to a stacked autoencoder based on a SIMD architecture and an encoding method.
Background art
Since the development of the electronic computer began in 1941, technology has been able to create machine intelligence. The term "artificial intelligence" (Artificial Intelligence) was proposed at the Dartmouth conference in 1956; since then, researchers have developed numerous theories and principles, and the concept of artificial intelligence has expanded accordingly. Before 2007, limited by factors such as the algorithms and data of the time, artificial intelligence did not yet place especially strong demands on chips, and general-purpose CPU chips could provide sufficient computing power. Later, driven by the rapid development of high-definition video and the gaming industry, graphics processing unit (GPU) chips developed rapidly. Because a GPU has more logic units for processing data and is a highly parallel architecture, it has an advantage over a CPU in processing graphics data and complex algorithms; and because AI deep-learning models have many parameters, large data scales and heavy computation, GPUs replaced CPUs as the mainstream AI chip for a period of time afterwards. In the great wave of artificial intelligence, many manufacturers have also implemented machine-learning algorithms on field-programmable gate arrays (FPGAs); thanks to their high flexibility, FPGAs have a huge market in the industrial internet and industrial-robotics fields. In addition to the GPU and FPGA acceleration chips for intelligent algorithms, Google has proposed the TPU, a special-purpose processor designed for specific intelligent algorithms, whose chip area is smaller and power consumption lower than those of FPGAs and GPUs.
Communication networks are the foundation of the artificial-intelligence boom. With the arrival of the 5G communication era, the interconnection of all things will generate massive data, and large-scale neural networks need powerful computing capability. As an important neural network algorithm, the stacked autoencoder algorithm is widely used in application scenarios such as face recognition and geographic-information mapping. Based on a reconfigurable intelligent acceleration core, the present invention proposes a hardware implementation of the stacked autoencoder algorithm on a SIMD architecture. Compared with hardware acceleration approaches such as GPUs and FPGAs, this implementation achieves high resource utilization and fast hardware execution. As a typical algorithm among intelligent algorithms, the implementation method has good reference value and broad application prospects.
Summary of the invention
The present invention aims to overcome the above deficiencies of the prior art, effectively reduce the training time of neural networks, make full use of storage resources, and accelerate the computation speed of training and inference, by providing a stacked autoencoder based on a SIMD architecture and an encoding method, which are specifically realized by the following technical scheme:
The stacked autoencoder based on a SIMD architecture is based on a neural network and includes:
a DMA interface module, which reads data from off-chip DDR in DMA mode, stores it in on-chip SRAM according to a partitioning scheme, and writes the final operation result back to DDR in DMA mode;
a neural network inference module, which uses the trained weights and biases to perform classification inference on new samples; and a neural network training module, which, after forward propagation of the training samples, back-propagates from the last layer of the neural network according to the gradient descent algorithm and updates the weights and biases of the neural network.
In a further design of the stacked autoencoder based on a SIMD architecture, the SRAM storing each layer of the neural network contains 4N source-data storage banks, and the SRAM is divided into four parts of N banks each, as follows (an illustrative software model of this partition is sketched after the list):
the first part of the SRAM stores the input x_j;
the second and third parts of the SRAM store the weights W_ij;
the fourth part of the SRAM stores the calculation result of each layer of the neural network;
a constant memory stores the bias b_i.
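For illustration only, the following is a minimal Python sketch of how the four-way SRAM partition described above could be modeled in software. The class name SramPartition, the method names and the word-packing parameter are assumptions for this sketch, not part of the patent; only the four-region split (inputs, two ping-pong weight regions, results) and the separate constant memory for biases come from the text.

    import numpy as np

    class SramPartition:
        """Software model (an assumption) of 4N banks split into four regions of N banks."""

        def __init__(self, n_banks_per_part, bank_depth, words_per_addr=2):
            # region 0: inputs x_j; regions 1-2: ping-pong weight buffers W_ij;
            # region 3: per-layer results; biases b_i live in a separate constant memory
            shape = (n_banks_per_part, bank_depth * words_per_addr)
            self.part = [np.zeros(shape, dtype=np.float32) for _ in range(4)]
            self.const_mem = np.zeros(0, dtype=np.float32)

        def load_inputs(self, x):
            self.part[0].flat[:x.size] = x              # first part: inputs x_j

        def load_weights(self, w, region):
            assert region in (1, 2)                     # second/third part alternate (ping-pong)
            self.part[region].flat[:w.size] = w.ravel()

        def store_result(self, h):
            self.part[3].flat[:h.size] = h              # fourth part: layer output

        def load_biases(self, b):
            self.const_mem = np.asarray(b, dtype=np.float32)  # constant memory for b_i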
Based on the above stacked autoencoder based on a SIMD architecture, an encoding method of the stacked autoencoder based on a SIMD architecture is provided. The method includes an algorithm inference process and an algorithm training process. The algorithm inference process includes:
Step 1-1) initialize the inputs x_j of all neurons of the first layer, the biases b_i, and the weights W_ij between the first neuron of the first layer and all neurons of the second layer of the neural network;
Step 1-2) compute the output of the first neuron of the second layer of the neural network according to formula (1); the multiply-accumulate is performed by a 32-way parallel multiply-add tree structure; after the computation is completed, move the weights W_ij of the second neuron into the third part of the SRAM;

    a_i = Σ_j W_ij · x_j + b_i,  h_i = s(a_i)    (1)

In formula (1), h_i denotes the calculation result of each layer of the neural network, a_i denotes the multiply-accumulate of the weights with the inputs, and s(·) denotes the sigmoid activation function;
Step 1-3) move in weights by ping-pong operation, complete the output computation of the second layer of the neural network, and store the calculation results in the fourth part of the SRAM;
Step 1-4) use the output of the second layer of the neural network as the input of the third layer, compute the output of the third layer of the neural network, and store it in the first part of the SRAM, overwriting the previous contents;
Step 1-5) following this access and computation pattern, obtain the result of the last layer of the neural network, read the result from the SRAM, and write it back to DDR in DMA mode; an illustrative software sketch of steps 1-1 to 1-5 is given below;
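For illustration only, a minimal Python sketch of the inference steps 1-1 to 1-5 above. The layer-by-layer loop, formula (1) and the overwrite of the previous layer's input come from the text; the function and variable names are assumptions.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def sae_inference(x, weights, biases):
        """weights[l] and biases[l] map layer l to layer l+1 (illustrative interface)."""
        h = x
        for W, b in zip(weights, biases):
            a = W @ h + b      # formula (1): a_i = sum_j W_ij * x_j + b_i
            h = sigmoid(a)     # h_i = s(a_i); overwrites the previous layer's input region
        return h               # last-layer result, written back to DDR in the hardware

In the hardware, each a_i is produced by the 32-way parallel multiply-add tree, and the weight rows are moved into the SRAM by ping-pong DMA while the previous neuron is being computed.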
The algorithm training process includes forward propagation and backpropagation. The forward propagation includes the following steps:
Step 2-1-1) initialize the inputs x_j of the first layer, the biases b_i, and the weights W_ij of the first neuron of the first layer;
Step 2-1-2) compute the output of the first neuron of the second layer according to a_i = Σ_j W_ij · x_j + b_i and h_i = s(a_i); the multiply-accumulate is performed by the 32-way parallel multiply-add tree structure; after the computation is completed, move the weights W_ij of the second neuron into the third part of the SRAM and compute the output result of the second neuron;
Step 2-1-3) move in weights by ping-pong operation until the outputs of the 512 neurons of the second layer of the neural network have been computed, store them in the fourth part of the SRAM, and write the data back to DDR in DMA mode;
Step 2-1-4) use the output of the second layer of the neural network as the input of the third layer, compute the output of the third layer of the neural network, and store it in the first part of the SRAM, overwriting the previous contents;
Step 2-1-5) after completing the above steps, obtain the result of the last layer of the neural network, read it from the SRAM, and write it back to DDR in DMA mode;
In the backpropagation, the label data is denoted Std and the delta is denoted delta; the backpropagation specifically includes the following steps:
Step 2-2-1) read the label data Std of the neural network from DDR in DMA mode, and subtract it from the computed last-layer data of the neural network to obtain the error delta of the last layer of the neural network;
Step 2-2-2) read in the transposed weights W_ji of each neuron of the second-to-last layer of the neural network in DMA ping-pong mode, store the weights W_ji in the second and third parts of the SRAM, and update the biases and weights according to formula (2) until the weights and biases of the last layer have been updated;
After the update is completed, the results overwrite the parts of the SRAM where the original weights and biases were stored, and the updated biases and weights are written to DDR in DMA mode;
Step 2-2-3) compute the delta of the preceding layer in the same way, compute and update the weights and biases, and write the updated biases and weights to DDR in DMA mode;
Step 2-2-4) propagate layer by layer toward the front of the network, update the weights and biases of all layers of the neural network, and write them back to DDR, completing one training iteration of the neural network; an illustrative sketch of this update is given after these steps.
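As an illustration of steps 2-2-1 to 2-2-4: formula (2) itself is not reproduced in this text, so the Python sketch below assumes the standard gradient-descent update for a sigmoid network; the learning rate lr and all function and variable names are assumptions.

    import numpy as np

    def backward_update(acts, weights, biases, std, lr=0.01):
        """acts[k] is the output of layer k from forward propagation (acts[0] = input);
        weights[l] and biases[l] map layer l to layer l+1. Updates run from the last
        layer toward the first, mirroring steps 2-2-1 to 2-2-4."""
        # step 2-2-1: last-layer error delta (sigmoid derivative assumed)
        delta = (acts[-1] - std) * acts[-1] * (1.0 - acts[-1])
        for l in reversed(range(len(weights))):
            h_prev = acts[l]
            # delta of the preceding layer, taken before the weights are overwritten
            delta_prev = (weights[l].T @ delta) * h_prev * (1.0 - h_prev) if l > 0 else None
            # assumed form of formula (2): plain gradient descent on weights and biases
            weights[l] -= lr * np.outer(delta, h_prev)
            biases[l] -= lr * delta
            delta = delta_prev                        # steps 2-2-3/2-2-4: move toward the front
        return weights, biases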
In a further design of the encoding method of the stacked autoencoder based on a SIMD architecture, in step 1-5), if the total number of layers of the neural network is odd, the result of the last layer is read from the first part of the SRAM; if the total number of layers of the neural network is even, the result of the last layer is read from the fourth part of the SRAM. The advantages of the present invention are as follows:
The stacked autoencoder based on a SIMD architecture of the present invention places no limit on the number of neural network layers it supports, so it supports inference and training of large-scale neural networks; by ping-pong operation it overlaps part of the computation time with the memory-access time. It therefore has good practical significance and broad application prospects.
Description of the drawings
Fig. 1 is a schematic diagram of a single autoencoder in the stacked autoencoder algorithm.
Fig. 2 is a schematic diagram of multiple single autoencoders stacked to form the overall stacked autoencoder.
Fig. 3 is a flow chart of the encoding method of the stacked autoencoder based on a SIMD architecture.
Fig. 4 is a schematic diagram of the implementation of the computation in the inference part of the stacked autoencoder algorithm and in the forward-propagation part of training.
Fig. 5 is a schematic diagram of the storage scheme of the stacked autoencoder algorithm.
Detailed description of the embodiments
The technical solution of the present invention is described in detail below with reference to the accompanying drawings.
As shown in Fig. 1, the autoencoder of this embodiment is divided into an input layer, a hidden layer and an output layer. Multiple single autoencoders are stacked to form the stacked autoencoder shown in Fig. 2, which consists of one input layer, multiple hidden layers and one output layer; whether a Softmax classifier is needed at the end is determined by actual requirements.
The autoencoder mainly consists of a DMA interface module, a neural network inference module and a neural network training module. The present invention maximizes resource usage by ping-pong buffering the operation result of every layer of the neural network and the weights of every neuron of every layer, while moving data and consolidating calculation results according to the SRAM partitioning, which improves the computation speed of the algorithm.
An embodiment of the present invention is described in detail below; a cycle-accurate system-level model built in the SystemC language was used for verification. In the embodiment, the neural network has 7 layers, and the numbers of neurons per layer from front to back are 1024, 512, 256, 128, 256, 512 and 1024. Data such as the inputs, weights and biases of the neural network are 32-bit floating-point numbers in the IEEE 754 standard. The computing array consists of 4 PEs (Processing Elements, each containing 4 complex multipliers, 4 complex adders, 1 real adder, 1 real multiplier and 1 transcendental-function unit), corresponding to 32 banks; assuming each bank has a depth of 4K and a bit width of 64, one bank address stores 2 source data. The technical solution of the present invention is further introduced below with this embodiment and with reference to the accompanying drawings; a short arithmetic sketch of the implied on-chip capacity follows.
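A short arithmetic check (Python) of the on-chip storage implied by the assumed figures above; the grouping of banks per PE is not specified in the text and is not needed for the totals.

    BANKS = 32              # banks served by the 4-PE computing array
    BANK_DEPTH = 4 * 1024   # addresses per bank (assumed depth 4K)
    BANK_WIDTH = 64         # bits per bank address
    WORD_BITS = 32          # IEEE 754 single-precision source data

    words_per_addr = BANK_WIDTH // WORD_BITS      # 2 source data per bank address
    words_per_bank = BANK_DEPTH * words_per_addr  # 8192 floats per bank
    total_words = BANKS * words_per_bank          # 262144 floats (1 MiB) on chip
    print(words_per_addr, words_per_bank, total_words)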
The flow chart of the hardware implementation of the algorithm is shown in Fig. 3. Before the algorithm starts, the weights of all layers and between layers must first be transposed and stored in DDR, so that they can be used by training to update the weights. The detailed steps of training and inference are as follows:
The inference process of the stacked autoencoder algorithm is as follows:
S1: initialize the inputs x_j of the 1024 neurons of the first layer, the biases b_i, and the weights W_ij between the first neuron of the first layer and the 512 neurons of the second layer of the neural network. As shown in Fig. 5, the inputs x_j are stored in banks 0-7, the weights in banks 8-15, and the biases b_i in the constant memory.
S2: compute the output of the first neuron of the second layer according to a_i = Σ_j W_ij · x_j + b_i and h_i = s(a_i); the overall hardware structure of the multiply-accumulate is shown in Fig. 4, and the computation is performed by the 32-way parallel multiply-add tree structure. After the computation is completed, move the weights W_ij of the second neuron into the third part, bank_3.
S3: weights are moved in by ping-pong operation, and the output computation of the second layer of the neural network is completed and stored in the fourth part of the SRAM, bank_4.
S4: the output of the second layer of the neural network is used as the input of the third layer; compute the output of the third layer of the neural network and store it in the first part of the SRAM, bank_1, overwriting the previous contents.
S5: following this access and computation pattern, obtain the result of the last layer of the neural network, read the result from the SRAM and write it back to DDR in DMA mode (if the total number of layers of the neural network is odd, it is read from the first part of the SRAM, bank_1; if the total number of layers is even, it is read from the fourth part of the SRAM, bank_4). A software sketch of the ping-pong weight movement used in S2-S3 follows.
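The following Python sketch illustrates, under the same assumptions as the earlier sketches, the ping-pong movement of weight rows between the second and third SRAM regions (bank_2/bank_3) while the current neuron's output is being computed. The per-neuron prefetch granularity is an assumption.

    import numpy as np

    def layer_forward_pingpong(x, W, b):
        """Compute one layer neuron by neuron; the next neuron's weight row is
        'prefetched' into the alternate buffer, modeling bank_2/bank_3."""
        n_out = W.shape[0]
        buf = [W[0].copy(), None]            # bank_2 holds the first neuron's weights
        h = np.empty(n_out, dtype=W.dtype)
        for i in range(n_out):
            cur, nxt = i % 2, (i + 1) % 2
            if i + 1 < n_out:
                buf[nxt] = W[i + 1].copy()   # DMA prefetch overlaps with the computation
            a = buf[cur] @ x + b[i]          # 32-way multiply-add tree in the hardware
            h[i] = 1.0 / (1.0 + np.exp(-a))
        return h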
The training process of the stacked autoencoder algorithm is as follows:
The training phase of the algorithm is divided into forward propagation and backpropagation. The only difference between forward propagation and the inference phase is that the calculation result of every layer must be written back to DDR in DMA mode so that it can be used by backpropagation; backpropagation uses the gradient descent algorithm.
Forward propagation:
S1: initialize the inputs x_j of the first layer, the biases b_i, and the weights W_ij of the first neuron of the first layer.
S2: compute the output of the first neuron of the second layer according to a_i = Σ_j W_ij · x_j + b_i and h_i = s(a_i); the overall hardware structure of the multiply-accumulate is shown in Fig. 4, and the computation is performed by the 32-way parallel multiply-add tree structure. After the computation is completed, move the weights W_ij of the second neuron into the third part, bank_3, and compute the output result of the second neuron.
S3: weights are moved in by ping-pong operation until the outputs of the 512 neurons of the second layer of the neural network have been computed; they are stored in the fourth part of the SRAM, bank_4 (banks 24-31), and the data are written back to DDR in DMA mode.
S4: the output of the second layer of the neural network is used as the input of the third layer; compute the output of the third layer of the neural network and store it in the first part of the SRAM, bank_1 (banks 0-7), overwriting the previous contents.
S5: following this access and computation pattern, obtain the result of the last layer of the neural network, i.e. the seventh layer, read the result from the SRAM and write it back to DDR in DMA mode. The total number of layers of the neural network in this example is 7, an odd number, so the result is read from the first part of the SRAM, bank_1 (banks 0-7). A minimal sketch of this training-time forward pass, which keeps every layer's output for backpropagation, follows.
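For reference, a minimal Python sketch of this training-time forward pass; it differs from the inference sketch given earlier only in that every layer's output is kept (written back to DDR in the hardware) for later use by backpropagation. The names are illustrative.

    import numpy as np

    def forward_keep_activations(x, weights, biases):
        acts = [x]                                 # acts[0] is the network input
        for W, b in zip(weights, biases):
            a = W @ acts[-1] + b                   # 32-way multiply-add tree in the hardware
            acts.append(1.0 / (1.0 + np.exp(-a)))  # each layer's result is also written to DDR
        return acts                                # kept for the backward pass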
Backpropagation (gradient descent):
The label data is denoted Std and the delta is denoted delta.
S6: read the label data Std of the neural network from DDR in DMA mode and subtract it from the computed data of the last layer of the neural network, i.e. the seventh layer, to obtain the error delta of the last layer.
S7: read in the transposed weights W_ji of each neuron of the second-to-last layer of the neural network in DMA ping-pong mode, store them in the second and third parts of the SRAM (bank_2 and bank_3), and update the weights and biases of the last layer according to the bias and weight update method.
After the update is completed, the results overwrite the parts of the SRAM where the original weights and biases were stored, and the updated biases and weights are written to DDR in DMA mode.
S8: compute the delta of the preceding layer in the same way, compute and update the weights and biases, and write them to DDR in the same way.
S9: propagate layer by layer toward the front of the network, update the weights and biases of all layers of the neural network, and write them back to DDR, completing one training iteration of the neural network.
The present invention stores the inputs and weights of the stacked autoencoder algorithm in different regions of the partitioned SRAM, so the variables required by the computation can be accessed without conflict; through ping-pong operation and time-multiplexing of the computing resources, the computation process of the algorithm is executed quickly, greatly improving resource utilization and hardware execution speed. The implementation therefore has broad application prospects.
The foregoing is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto; any changes or substitutions that can readily be conceived by anyone skilled in the art within the technical scope disclosed by the present invention shall be covered by the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be defined by the scope of protection of the claims.

Claims (4)

1. A stacked autoencoder based on a SIMD architecture, based on a neural network, characterized by comprising:
a DMA interface module, which reads data from off-chip DDR in DMA mode, stores it in on-chip SRAM according to a partitioning scheme, and writes the final operation result back to DDR in DMA mode;
a neural network inference module, which uses the trained weights and biases to perform classification inference on new samples; and a neural network training module, which, after forward propagation of the training samples, back-propagates from the last layer of the neural network according to the gradient descent algorithm and updates the weights and biases of the neural network.
2. The stacked autoencoder based on a SIMD architecture according to claim 1, characterized in that the SRAM storing each layer of the neural network contains 4N source-data storage banks, and the SRAM is divided into four parts of N banks each, as follows:
the first part of the SRAM stores the input x_j;
the second and third parts of the SRAM store the weights W_ij;
the fourth part of the SRAM stores the calculation result of each layer of the neural network;
a constant memory stores the bias b_i.
3. An encoding method of the stacked autoencoder based on a SIMD architecture according to any one of claims 1-2, characterized by comprising an algorithm inference process and an algorithm training process, the algorithm inference process comprising:
Step 1-1) initialize the inputs x_j of all neurons of the first layer, the biases b_i, and the weights W_ij between the first neuron of the first layer and all neurons of the second layer of the neural network;
Step 1-2) compute the output of the first neuron of the second layer of the neural network according to formula (1); the multiply-accumulate is performed by a 32-way parallel multiply-add tree structure; after the computation is completed, move the weights W_ij of the second neuron into the third part of the SRAM;

    a_i = Σ_j W_ij · x_j + b_i,  h_i = s(a_i)    (1)

In formula (1), h_i denotes the calculation result of each layer of the neural network, a_i denotes the multiply-accumulate of the weights with the inputs, and s(·) denotes the sigmoid activation function;
Step 1-3) move in weights by ping-pong operation, complete the output computation of the second layer of the neural network, and store the calculation results in the fourth part of the SRAM;
Step 1-4) use the output of the second layer of the neural network as the input of the third layer, compute the output of the third layer of the neural network, and store it in the first part of the SRAM, overwriting the previous contents;
Step 1-5) following this access and computation pattern, obtain the result of the last layer of the neural network, read the result from the SRAM, and write it back to DDR in DMA mode;
The algorithm training process includes forward propagation and backpropagation, and the forward propagation includes the following steps:
Step 2-1-1) initialize the inputs x_j of the first layer, the biases b_i, and the weights W_ij of the first neuron of the first layer;
Step 2-1-2) compute the output of the first neuron of the second layer according to a_i = Σ_j W_ij · x_j + b_i and h_i = s(a_i); the multiply-accumulate is performed by the 32-way parallel multiply-add tree structure; after the computation is completed, move the weights W_ij of the second neuron into the third part of the SRAM and compute the output result of the second neuron;
Step 2-1-3) move in weights by ping-pong operation until the outputs of the 512 neurons of the second layer of the neural network have been computed, store them in the fourth part of the SRAM, and write the data back to DDR in DMA mode;
Step 2-1-4) use the output of the second layer of the neural network as the input of the third layer, compute the output of the third layer of the neural network, and store it in the first part of the SRAM, overwriting the previous contents;
Step 2-1-5) after completing the above steps, obtain the result of the last layer of the neural network, read it from the SRAM, and write it back to DDR in DMA mode;
In the backpropagation, the label data is denoted Std and the delta is denoted delta, and the backpropagation specifically comprises the following steps:
Step 2-2-1) read the label data Std of the neural network from DDR in DMA mode, and subtract it from the computed last-layer data of the neural network to obtain the error delta of the last layer of the neural network;
Step 2-2-2) read in the transposed weights W_ji of each neuron of the second-to-last layer of the neural network in DMA ping-pong mode, store the weights W_ji in the second and third parts of the SRAM, and update the biases and weights according to formula (2) until the weights and biases of the last layer have been updated;
After the update is completed, the results overwrite the parts of the SRAM where the original weights and biases were stored, and the updated biases and weights are written to DDR in DMA mode;
Step 2-2-3) compute the delta of the preceding layer in the same way, compute and update the weights and biases, and write the updated biases and weights to DDR in DMA mode;
Step 2-2-4) propagate layer by layer toward the front of the network, update the weights and biases of all layers of the neural network, and write them back to DDR, completing one training iteration of the neural network.
4. The encoding method of the stacked autoencoder based on a SIMD architecture according to claim 3, characterized in that in step 1-5), if the total number of layers of the neural network is odd, the result of the last layer is read from the first part of the SRAM; if the total number of layers of the neural network is even, the result of the last layer is read from the fourth part of the SRAM.
CN201910251530.6A 2019-03-29 2019-03-29 Stack type self-encoder based on SIMD architecture and encoding method Active CN109978143B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910251530.6A CN109978143B (en) 2019-03-29 2019-03-29 Stack type self-encoder based on SIMD architecture and encoding method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910251530.6A CN109978143B (en) 2019-03-29 2019-03-29 Stack type self-encoder based on SIMD architecture and encoding method

Publications (2)

Publication Number Publication Date
CN109978143A true CN109978143A (en) 2019-07-05
CN109978143B CN109978143B (en) 2023-07-18

Family

ID=67081767

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910251530.6A Active CN109978143B (en) 2019-03-29 2019-03-29 Stack type self-encoder based on SIMD architecture and encoding method

Country Status (1)

Country Link
CN (1) CN109978143B (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110138259A1 (en) * 2009-12-03 2011-06-09 Microsoft Corporation High Performance Digital Signal Processing In Software Radios
CN106991477A (en) * 2016-01-20 2017-07-28 南京艾溪信息科技有限公司 A kind of artificial neural network compression-encoding device and method
CN108446766A (en) * 2018-03-21 2018-08-24 北京理工大学 A kind of method of quick trained storehouse own coding deep neural network

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022013648A1 (en) * 2020-07-13 2022-01-20 International Business Machines Corporation Methods for detecting and monitoring bias in software application using artificial intelligence and devices thereof
GB2611981A (en) * 2020-07-13 2023-04-19 Ibm Methods for detecting and monitoring bias in software application using artificial intelligence and devices thereof
US11861513B2 (en) 2020-07-13 2024-01-02 International Business Machines Corporation Methods for detecting and monitoring bias in a software application using artificial intelligence and devices thereof
CN114202067A (en) * 2021-11-30 2022-03-18 山东产研鲲云人工智能研究院有限公司 Bandwidth optimization method for convolutional neural network accelerator and related equipment

Also Published As

Publication number Publication date
CN109978143B (en) 2023-07-18


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant