CN109086883A - Method and device for realizing sparse calculation based on deep learning accelerator - Google Patents
- Publication number
- CN109086883A (application CN201810803430.5A)
- Authority
- CN
- China
- Prior art keywords
- data
- feature vector
- input
- value
- input feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
The invention discloses a method and a device for realizing sparse calculation based on a deep learning accelerator. The method comprises the following steps: S1, starting the scoreboard processing operation when the input data from each channel contains a specified number of zeros; when the scoreboard processing operation is started, compressing the input data to filter out the zero-valued neurons and obtain compressed input data; S2, examining each input feature value in the compressed input data in channel order; if the value is judged to be data required by the computation of the current output feature value, storing it in a pre-configured data buffer, otherwise switching to another channel and examining the next input feature value; S3, sending the information of all input feature values in the data buffer to the multiplier array in the accelerator to execute the sparse calculation. The device comprises a control module and a scoreboard processing module. The method is simple to implement, low in cost, high in computational efficiency, and low in energy consumption.
Description
Technical field
The present invention relates to the field of deep learning accelerators, and more particularly to a method for realizing sparse calculation based on a deep learning accelerator.
Background technique
In neural networks and deep learning, better prediction and recognition results can be obtained through higher-performance hardware, larger labeled training datasets, or wider and deeper networks, but two major drawbacks follow. First, deeper and wider network models generate an enormous number of parameters and are therefore prone to overfitting; this problem is especially acute when labeled data is very limited, and avoiding overfitting then places very high demands on the small amount of labeled data and on training skill. Second, as network size grows, the amount of computation increases sharply and consumes more computing resources; in practical applications the computing budget is always limited, so efficiently allocating computing resources becomes ever more important as networks keep growing. The fundamental way to address both drawbacks is to convert full connections, and even general convolutions, into sparse connections. On the one hand, the connectivity of real biological neural systems is itself sparse; on the other hand, for a large-scale sparse neural network, one can analyze the statistical characteristics of activation values and cluster highly correlated outputs to construct an optimal network layer by layer; that is, a bloated network can be sparsified without loss of performance.
When selecting a model for a neural network, choosing the fewest features, or using only a subset of features, as the basis for classification can still yield good results. Research on the human brain also shows the characteristic of sparsity: in the visual pathway, many neurons respond only to specific stimuli such as color, texture, orientation, or scale, so a signal projected onto the basis formed by these neurons is sparse. In the field of signal processing, sparse representations over overcomplete bases, obtained by convex optimization of the L1-norm, have found more and more applications.
Heterogeneous accelerators offer an outstanding performance-to-power ratio, and accelerating neural network algorithms with deep learning accelerators is a current research hotspot; the key problem is how to implement a neural network processing system efficiently on the accelerator. Exploiting the fault tolerance of neural networks, feature-map values close to zero can be treated as zero, exposing the sparsity in the data and improving computational efficiency. Mainstream deep learning accelerators currently lack effective support for sparse networks: pruned weights must be padded back with zeros and then computed in the ordinary way, so the accelerator cannot benefit from sparsity. In hardware design, mask techniques are widely used at present: when the input of a data channel or memory is all zeros, it is simply not processed, which directly minimizes energy consumption but cannot mask the unnecessary clock cycles.
In deep learning algorithms, convolution operations and fully connected layers account for most of the computation, and an activation operation generally follows each convolution or fully connected layer. The most common activation is ReLU: a value greater than 0 is output as itself, while a value less than 0 is output as 0. A large number of zeros, roughly 40-60%, is therefore generated after the activation of convolutional and fully connected layers, and 0 multiplied by any number is still 0; if the zero operations could be removed, power consumption would drop greatly and computational performance would improve. However, some conventional accelerators currently use gating: if an input value is judged to be 0, the arithmetic unit is shut off and 0 is output automatically. Although this avoids the zero multiplication itself, it still wastes a beat period, still causes substantial wasted energy, and limits computational efficiency.
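The gating drawback described above can be illustrated with a small software sketch (an illustration only, not part of the patent): ReLU makes a large fraction of activations exactly zero, and a gated datapath still spends one beat per element, while a compressed stream spends beats only on the non-zeros. The `relu` helper and the cycle-counting model are illustrative assumptions.

```python
import random

def relu(x):
    """ReLU activation: values below zero become exactly 0."""
    return x if x > 0 else 0.0

# Hypothetical layer output before activation: roughly half the values
# fall below zero, so ReLU maps them to exact 0.
random.seed(0)
pre_activation = [random.uniform(-1.0, 1.0) for _ in range(1000)]
activated = [relu(v) for v in pre_activation]

zeros = sum(1 for v in activated if v == 0.0)
sparsity = zeros / len(activated)

# Gating skips the multiply for a zero but still consumes one beat per
# element; operating only on a compressed non-zero stream consumes one
# beat per non-zero element.
gating_cycles = len(activated)
compressed_cycles = len(activated) - zeros

print(f"sparsity after ReLU: {sparsity:.2f}")
print(f"beats with gating: {gating_cycles}, with compression: {compressed_cycles}")
```

Under this toy model, the beat count saved by compression equals exactly the number of zero activations that gating would otherwise spend idle beats on.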
Summary of the invention
The technical problem to be solved by the present invention is: in view of the technical problems of the prior art, the present invention provides a method for realizing sparse calculation based on a deep learning accelerator that is simple to implement, low in cost, high in computational efficiency, and low in energy consumption.
To solve the above technical problems, the technical solution proposed by the present invention is as follows:
A method for realizing sparse calculation based on a deep learning accelerator, the method comprising:
S1. starting the scoreboard processing operation when the input data from each channel contains a specified number of zeros; when the scoreboard processing operation is started, compressing the input data to filter out the zero-valued neurons in it, obtaining compressed input data, and proceeding to step S2;
S2. examining each input feature value in the compressed input data in channel order; if a target input feature value is judged to be data required by the computation of the current output feature value, storing the target input feature value in a pre-configured data buffer, otherwise switching to another channel and examining the next input feature value in the compressed input data, until all input feature values in the input data have been judged;
S3. sending the information of all input feature values in the data buffer to the multiplier array in the accelerator to execute the sparse calculation.
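Steps S1-S3 can be sketched as a small software model (an illustrative assumption, not the hardware implementation; the function and argument names are hypothetical):

```python
def sparse_pipeline(channels, needed_addresses, weights):
    """Software model of steps S1-S3.

    channels: per-channel lists of (true_address, value) input features.
    needed_addresses: true addresses required by the current output feature.
    weights: mapping from true_address to weight value.
    """
    # S1: compress each channel, filtering out zero-valued neurons.
    compressed = [[(a, v) for a, v in ch if v != 0] for ch in channels]

    # S2: walk the channels in order, keeping only the features that the
    # current output computation actually needs.
    buffer = []
    for ch in compressed:
        for addr, v in ch:
            if addr in needed_addresses:
                buffer.append((addr, v))

    # S3: send the buffered (address, value) pairs to the multiplier
    # array, which fetches the matching weight by true address.
    return sum(v * weights[addr] for addr, v in buffer)

# Toy usage: two channels; the output needs true addresses 0-2.
channels = [[(0, 1.0), (1, 0.0), (2, 3.0)], [(0, 0.0), (1, 2.0), (3, 4.0)]]
weights = {0: 0.5, 1: 0.5, 2: 1.0, 3: 1.0}
result = sparse_pipeline(channels, {0, 1, 2}, weights)
```

Note that the zero-valued entries never reach the multiply-accumulate loop, which is the point of the scheme.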
As a further improvement of the method of the present invention: in step S1 the input data is specifically compressed using a stride-length compression method.
As a further improvement of the method of the present invention: in step S1, specifically, the scoreboard processing operation is not started for the first-layer input of the deep learning network, and is started only after the ReLU activation operation has been performed.
As a further improvement of the method of the present invention: in step S2, specifically, the target input feature value and the true address corresponding to the target input feature value are stored in the data buffer.
As a further improvement of the method of the present invention: in step S3, specifically, the values of all input feature values in the data buffer and their corresponding true addresses are sent to the multiplier array; the multiplier array obtains the weight value corresponding to each input feature value according to its true address, performs the multiplication of the input feature value with the obtained weight value, and outputs the multiplication result.
As a further improvement of the method of the present invention: a first data buffer for storing the input feature values required for the computation and a second data buffer for storing the input feature values not required for the computation are pre-configured; in step S2, if the target input feature value is judged to be data required by the computation of the current output feature value, the target input feature value, its sparse address in the compressed input data, and its true address are stored in the first data buffer; otherwise they are stored in the second data buffer.
The present invention further provides a device for implementing the above method for realizing sparse calculation based on a deep learning accelerator, comprising:
a control module, for starting the scoreboard processing operation when the input data from each channel contains a specified number of zeros;
a scoreboard processing module, for executing the scoreboard processing operation, comprising a sequentially connected data input unit, compression unit, judging unit, data buffer, and transmission unit. The data input unit receives all input data and outputs it to the compression unit; the compression unit compresses the input data to filter out the zero-valued neurons in it, obtaining compressed input data; the judging unit examines each input feature value in the compressed input data in channel order, and if a target input feature value is judged to be data required by the computation of the current output feature value, stores the target input feature value in the data buffer; the transmission unit sends the information of all input feature values in the data buffer to the multiplier array in the accelerator to execute the required computation.
As a further improvement of the device of the present invention: the data buffer comprises a first data buffer for storing the input feature values required for the computation and a second data buffer for storing the input feature values not required for the computation; if the control module judges that the target input feature value is data required by the computation of the current output feature value, it stores the target input feature value, its sparse address in the compressed input data, and its true address in the first data buffer; otherwise it stores them in the second data buffer.
As a further improvement of the device of the present invention: the scoreboard processing module further comprises an address auto-increment unit connected to the control module, for obtaining the sparse address of each input feature value by auto-increment.
As a further improvement of the device of the present invention: the scoreboard processing module uses a ping-pong data processing method, i.e., while the input feature values that have completed processing are stored in one data storage area and transferred to the multiplier array through the transmission unit, the data input unit simultaneously receives new input feature values, which are judged by the judging unit and stored in the other data storage area.
Compared with the prior art, the advantages of the present invention are as follows:
1. In the method of the present invention for realizing sparse calculation based on a deep learning accelerator, the input feature values first undergo compression before being sent to the computing unit, filtering out the zero-valued neurons; each input feature value is then judged in turn, and if it is data required by the computation of the current output feature value, its information is stored in the designated data buffer. After all input feature values required by the current output feature computation have been selected, they are sent to the multiplier array in the accelerator for the sparse calculation. A deep learning accelerator realizing sparse operation on the basis of the scoreboard processing operation can fully exploit the sparse characteristics of convolution in deep learning to speed up the accelerator's computation.
2. In the method of the present invention, the multiplier array in the accelerator hardware computes only non-zero data, eliminating both the unnecessary zero multiplications and the invalid zero-multiply beats. This avoids the invalid zero-multiply operations in deep learning computation and also reduces the beats and computing resources they waste, so that power consumption is reduced by removing the invalid zero multiplications, and operation and accelerator efficiency are improved by avoiding their beats. The implementation is simple and versatile: it suffices to add the scoreboard processing operation, and non-sparse neural networks can still be processed with the ordinary calculation method.
Detailed description of the invention
Fig. 1 is the implementation process schematic diagram for the method that the present embodiment realizes sparse calculation based on deep learning accelerator.
Fig. 2 is the structural schematic diagram that the present embodiment realizes the device based on deep learning accelerator sparse calculation.
Fig. 3 is the realization principle schematic diagram based on deep learning accelerator sparse calculation in the specific embodiment of the invention.
Specific embodiments
The invention will be further described below with reference to the drawings and specific preferred embodiments, without thereby limiting the scope of protection of the invention.
As shown in Fig. 1, the steps of the method of the present embodiment for realizing sparse calculation based on a deep learning accelerator include:
S1. starting the scoreboard processing operation when the input data from each channel contains a specified number of zeros; when the scoreboard processing operation is started, compressing the input data to filter out the zero-valued neurons in it, obtaining compressed input data, and proceeding to step S2;
S2. examining each input feature value in the compressed input data in channel order; if a target input feature value is judged to be data required by the computation of the current output feature value, storing the target input feature value in a pre-configured data buffer, otherwise switching to another channel and examining the next input feature value in the compressed input data, until all input feature values in the input data have been judged;
S3. sending the information of all input feature values in the data buffer to the multiplier array in the accelerator to execute the sparse calculation.
In the present embodiment, compression is first performed before the input feature values are sent to the computing unit, filtering out the zero-valued neurons; each input feature value is then judged in turn, and the data required by the computation of the current output feature value is stored in the designated data buffer. After all input feature values required by the current output feature computation have been selected, they are sent to the multiplier array in the accelerator for the sparse calculation. A deep learning accelerator realizing sparse operation on the basis of the scoreboard processing operation can fully exploit the sparse characteristics of convolution in deep learning to speed up the accelerator's computation.
With sparse calculation realized by the above method of this embodiment, the multiplier array in the accelerator hardware computes only non-zero data, eliminating both the unnecessary zero multiplications and the invalid zero-multiply beats; while avoiding the invalid zero-multiply operations in deep learning computation, it also reduces the beats and computing resources they waste, so that power consumption is reduced and operation and accelerator efficiency are improved.
Specifically, in the present embodiment, when the accelerator works in sparse operation mode, the scoreboard processing operation is started and the sparse data produced by the scoreboard processing operation is passed to the arithmetic unit for computation; since the data after the scoreboard processing operation contains only non-zero values, everything the arithmetic unit receives is non-zero, and every beat of the whole arithmetic unit performs a significant operation. For non-sparse data, the scoreboard processing operation is not started; it is masked, and the input feature values are passed directly into the multiplier array for computation.
In step S1 of the present embodiment, the input data is specifically compressed using a stride-length compression method; this method allows the subsequent scoreboard processing operation to be executed directly, without decompression.
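One plausible reading of such a stride-length compression is that each non-zero value is stored together with its stride (distance) from the previous stored non-zero, so the true address can be rebuilt by accumulating strides and no separate decompression pass is needed. The sketch below is an assumption about the scheme, not the patented encoding:

```python
def stride_compress(values):
    """Hypothetical stride-length compression: keep only non-zero values,
    each paired with its stride (gap) from the previously kept element."""
    out, last = [], -1
    for addr, v in enumerate(values):
        if v != 0:
            out.append((addr - last, v))  # (stride, value)
            last = addr
    return out

def true_addresses(compressed):
    """Recover true addresses by accumulating strides -- the compressed
    stream can be consumed directly, with no decompression pass."""
    addrs, pos = [], -1
    for stride, _ in compressed:
        pos += stride
        addrs.append(pos)
    return addrs

# Usage: zeros vanish from the stream, yet addresses remain recoverable.
stream = stride_compress([0, 5, 0, 0, 7])
print(stream, true_addresses(stream))
```

The downstream judging logic can therefore walk the compressed stream element by element while still knowing each value's original position.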
In step S1 of the present embodiment, specifically, the scoreboard processing operation is not started for the first-layer input of the deep learning network; it is started only after the ReLU activation operation. At the first-layer input there are not many zeros in the data, so the computation cannot make use of the scoreboard circuit; in the subsequent neural network layers, a large amount of sparse data appears after ReLU activation, and the output results contain many zeros. Accordingly, at the first-layer input the present embodiment does not start the scoreboard processing operation: the input feature values are fed directly into the multiplier array for computation, without affecting the original operation, and after the computation completes, activation, pooling, normalization, and similar operations are performed as needed. After the ReLU activation operation has been performed, the scoreboard processing operation is started: the output of the previous layer is taken as input and stride-length compressed to filter out the many zeros it contains; each input feature value is then judged in turn, and the input feature values required by the output feature computation are stored in the data buffer. After this preliminary processing, the information of the input feature values in the data buffer is finally fed to the multiplier array for the sparse operation; when the multiplication completes, the multiplier array outputs the result, activation, pooling, normalization, and similar operations are performed, and the data is compressed again for the computation of the next layer.
In step S2 of the present embodiment, specifically, the target input feature value and its corresponding true address are stored in the data buffer, the true address being used to obtain the weight value corresponding to the input feature value. In step S3, specifically, the values of all input feature values in the data buffer and their corresponding true addresses are sent to the multiplier array. After receiving each transmitted input feature value and its corresponding true address, the multiplier array obtains the weight value corresponding to the input feature value according to the true address, performs the multiplication of the input feature value with the obtained weight value, and outputs the multiplication result.
The present embodiment pre-configures a first data buffer for storing the input feature values required for the computation and a second data buffer for storing the input feature values not required for the computation. In step S2, if the target input feature value is judged to be data required by the computation of the current output feature value, the target input feature value, its sparse address in the compressed input data, and its true address are stored in the first data buffer; otherwise they are stored in the second data buffer. Here the sparse address is the address of the sparse data (the input feature value) after compression, while the true address is the actual address of the input feature value. The sparse address can specifically be generated by an address auto-increment device: each time a piece of compressed sparse data is stored, the corresponding sparse address is obtained from the address auto-increment device.
As shown in Fig. 2, the device of the present embodiment implementing the above method for realizing sparse calculation based on a deep learning accelerator comprises:
a control module, for starting the scoreboard processing operation when the input data from each channel contains a specified number of zeros;
a scoreboard processing module, for executing the scoreboard processing operation, comprising a sequentially connected data input unit, compression unit, judging unit, data buffer, and transmission unit. The data input unit receives all input data and outputs it to the compression unit; the compression unit compresses the input data to filter out the zero-valued neurons in it, obtaining compressed input data; the judging unit examines each input feature value in the compressed input data in channel order, and if a target input feature value is judged to be data required by the computation of the current output feature value, stores the target input feature value in the data buffer; the transmission unit sends the information of all input feature values in the data buffer to the multiplier array in the accelerator to execute the required computation.
In a concrete application embodiment, a scoreboard module is set up in the whole accelerator, and the control module controls the first-layer input data of the deep learning network to pass through the scoreboard module unchanged into the multiplier array module. When the accelerator works in sparse operation mode, for example on data that has passed through a ReLU operation, the scoreboard module is started: it first compresses the sparse data with the stride-length compression method, then judges the compressed data and passes the sparse data required for the computation (the input feature values) to the arithmetic unit for computation. This ensures that the data computed in the whole multiplier array is data from which the zeros have been rejected; after the multiplier array completes the computation, the output feature values are output and processed further.
In the present embodiment, the data buffer comprises a first data buffer for storing the input feature values required for the computation and a second data buffer for storing the input feature values not required for the computation. If the control module judges that the target input feature value is data required by the computation of the current output feature value, it stores the target input feature value, its sparse address in the compressed input data, and its true address in the first data buffer; otherwise it stores them in the second data buffer.
In the present embodiment, the scoreboard processing module further comprises an address auto-increment unit connected to the control module, for obtaining the sparse address of each input feature value by auto-increment.
In the present embodiment, the scoreboard processing module uses a ping-pong data processing method: while the input feature values that have completed processing are stored in one data storage area and transferred to the multiplier array through the transmission unit, the data input unit simultaneously receives new input feature values, which are judged by the judging unit and stored in the other data storage area. The scoreboard circuit realizing the sparse operation in the accelerator adopts a ping-pong-like structure: while one part of the scoreboard circuit, having finished storing sparse data, feeds input data to the multiplier array, the other part receives input feature values from outside, judges whether each value is 0, and writes it into the scoreboard circuit.
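The ping-pong structure can be sketched in software as two banks that swap roles: one drains to the multiplier array while the other fills with newly judged inputs (a minimal sketch under assumed interfaces; the class and method names are hypothetical):

```python
class PingPongScoreboard:
    """Two storage banks swapping roles: while one bank drains to the
    multiplier array, the other fills with newly judged input values."""

    def __init__(self):
        self.banks = [[], []]
        self.fill = 0  # index of the bank currently being filled

    def accept(self, value):
        """Judging unit: zero-valued neurons are dropped, non-zeros kept."""
        if value != 0:
            self.banks[self.fill].append(value)

    def swap_and_drain(self):
        """Swap bank roles; return the bank to transmit to the multipliers."""
        drain = self.banks[self.fill]
        self.fill ^= 1
        self.banks[self.fill] = []  # fresh bank for incoming values
        return drain

# Usage: one bank drains while the other keeps accepting input.
pp = PingPongScoreboard()
for v in [1, 0, 2]:
    pp.accept(v)
batch = pp.swap_and_drain()  # [1, 2] goes to the multiplier array
pp.accept(5)                  # meanwhile the other bank fills
```

In hardware the two halves operate concurrently, so the transmission of one batch overlaps the judging of the next, hiding the judging latency.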
As shown in Fig. 3, in a concrete application embodiment the scoreboard processing module is realized by a scoreboard, which is connected to the accelerator's multiplier array through a transmission circuit; sparse addresses are obtained from an address auto-increment device, and the control module is realized by a control circuit connected to the scoreboard, the address auto-increment device, and the transmission circuit. The scoreboard is divided into a left area and a right area. The left side of the scoreboard stores the compressed sparse addresses transmitted in from external memory, the sparse values, and the true addresses corresponding to the sparse values. The control circuit judges whether each true address is an input value that the neural network layer needs in order to complete the current computation: if so, the sparse value and its true address are transferred to the right side of the scoreboard for storage; if not, the sparse address, sparse value, and true address remain stored on the left side of the scoreboard, the input address is switched to another input channel, and the above operations restart until the judgment of all input feature values is completed.
The above method of the present invention is further described below with a convolutional neural network algorithm having 4 input channels and a convolution kernel of size 5*5, taking as an example the processing of the first row of the input feature table of the four input channels.
Table 1 shows the original, unprocessed sparse input feature table:
Table 1: original sparse input feature table.
The data in Table 1 is stride-length compressed to filter out the zero values in it, giving the input feature table shown in Table 2. For ease of understanding, and because the stride-length compression algorithm is quite simple, the subscripts of the values in Table 2 do not show the real stride values but instead use the true address values.
Table 2: stride-length-compressed input feature table.
X0 | X2 | X3 | X7 | X8 |
Y0 | Y1 | Y5 | Y6 | Y8 |
Z1 | Z2 | Z6 | Z7 |    |
T3 | T4 | T5 | T6 |    |
The left side of the scoreboard storage is labeled ScoreBoard and the right side InputFeatureBuffer. First, the first value of the first input channel is taken out; this input feature value is judged to be one that should participate in the computation of the first output feature value, so the value and its true address are placed in the InputFeatureBuffer, the true address being used in the multiplier array to fetch the corresponding weight value.
Proceeding in this way, X7 is taken out at the 3rd beat; the judging circuit finds that X7 is not needed for computing the current output feature value, so the value is placed in the ScoreBoard.
The process then continues with the input and judgment of the input feature values of the second input channel. In the same way, the input and judgment of the input feature values corresponding to the current output feature value are completed by the 6th beat, and by the 12th period the access of the corresponding input feature values of all 4 input channels is completed. The values in the InputFeatureBuffer and their corresponding addresses are then sent to the multiplier array, which obtains the corresponding weight values according to the address values and completes the computation in the conventional way. The scoreboard storage proceeds as follows:
Through the above process, the values X0, X2, X3, Y0, Y1, Z1, Z2, T3, and T4 required for computing the output feature value are stored in the InputFeatureBuffer; they contain no zero values and are exactly the input feature values required for the computation, which effectively improves computational efficiency and reduces computational energy consumption.
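The walkthrough above can be reproduced with a short script over the Table 2 data (a software model only; it assumes the first output row needs true addresses 0-4 of each channel, and reads the fourth entry of channel Y as Y6):

```python
# Compressed input feature table (Table 2); each entry keeps its true
# address as the numeric suffix.
channels = {
    "X": [0, 2, 3, 7, 8],
    "Y": [0, 1, 5, 6, 8],
    "Z": [1, 2, 6, 7],
    "T": [3, 4, 5, 6],
}

# Assumption: the first row of a 5*5 convolution needs true addresses 0-4.
needed = set(range(5))

input_feature_buffer = []  # right side: values needed for this output
score_board = []           # left side: values held back for later

for name, addrs in channels.items():
    for addr in addrs:
        label = f"{name}{addr}"
        if addr in needed:
            input_feature_buffer.append(label)
        else:
            score_board.append(label)

print("InputFeatureBuffer:", input_feature_buffer)
print("ScoreBoard:", score_board)
```

Running this sketch yields the same nine features named in the walkthrough, with X7 among the values retained on the ScoreBoard side.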
The above are only preferred embodiments of the present invention and are not intended to limit the present invention in any form. Although the present invention has been disclosed above by way of preferred embodiments, they are not intended to limit it. Any simple amendments, equivalent changes, and modifications made to the above embodiments in accordance with the technical spirit of the present invention, without departing from the content of the technical solution of the present invention, shall fall within the scope of protection of the technical solution of the present invention.
Claims (10)
1. A method for implementing sparse computation based on a deep-learning accelerator, characterized in that the method comprises:
S1. starting a scoreboard processing operation when the input data from each channel contains a specified number of 0s; when the scoreboard processing operation is started, compressing the input data to filter out the 0-value neurons therein, obtaining compressed input data, and proceeding to step S2;
S2. obtaining each input feature value in the compressed input data in channel order in turn and judging it; if a target input feature value is judged to be data required for computing the current output feature value, storing the target input feature value in a preconfigured data buffer; otherwise switching to another channel to obtain the next input feature value in the compressed input data for judgment, until the judgment of all input feature values in the input data is completed;
S3. sending the information of all input feature values in the data buffer to the multiplier array in the accelerator to perform the sparse computation.
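Steps S1–S3 can be sketched as a small software model. This is an assumption-laden illustration, not the claimed hardware: the zero-count threshold is a parameter here, and the per-value "needed for the current output" judgment is simplified to keeping every nonzero value.

```python
def sparse_preprocess(channels, zero_threshold):
    # S1: start the scoreboard operation only when every channel's
    # input data contains at least the specified number of 0s.
    if not all(ch.count(0) >= zero_threshold for ch in channels):
        return None  # operation not started
    # Compression: filter out 0-value neurons, keeping (address, value).
    compressed = [[(i, v) for i, v in enumerate(ch) if v != 0]
                  for ch in channels]
    # S2: visit channels in order, switching channel after each value.
    buffer, cursors = [], [0] * len(compressed)
    remaining = sum(len(c) for c in compressed)
    ch = 0
    while remaining:
        if cursors[ch] < len(compressed[ch]):
            buffer.append((ch, *compressed[ch][cursors[ch]]))
            cursors[ch] += 1
            remaining -= 1
        ch = (ch + 1) % len(compressed)  # switch to another channel
    return buffer  # S3 would send these entries to the multiplier array

out = sparse_preprocess([[1, 0, 2], [0, 3, 0]], zero_threshold=1)
assert out == [(0, 0, 1), (1, 1, 3), (0, 2, 2)]
```

The round-robin channel switch means the buffer interleaves entries from different channels, matching the claim's "switch to another channel to obtain the next input feature value".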
2. The method for implementing sparse computation based on a deep-learning accelerator according to claim 1, characterized in that, in step S1, the input data is specifically compressed using a stride-length compression method.
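One plausible reading of "stride-length compression" is run-length style: store each nonzero value together with the number of zeros skipped since the previous nonzero value, so the real address can be reconstructed later. The claim does not define the scheme, so this sketch is an assumption.

```python
def stride_compress(values):
    """Encode a dense vector as (stride, value) pairs, where stride is
    the count of 0-value neurons skipped before each nonzero value."""
    out, stride = [], 0
    for v in values:
        if v == 0:
            stride += 1          # skip the 0-value neuron
        else:
            out.append((stride, v))
            stride = 0
    return out

def stride_decompress_addresses(pairs):
    """Recover (real_address, value) pairs from the stride encoding."""
    addr, result = -1, []
    for stride, v in pairs:
        addr += stride + 1       # advance past the skipped zeros
        result.append((addr, v))
    return result

pairs = stride_compress([0, 7, 0, 0, 4, 9])
assert pairs == [(1, 7), (2, 4), (0, 9)]
assert stride_decompress_addresses(pairs) == [(1, 7), (4, 4), (5, 9)]
```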
3. The method for implementing sparse computation based on a deep-learning accelerator according to claim 1, characterized in that, in step S1, the scoreboard processing operation is specifically not started for the input of the first layer of the deep-learning network, and is started after the ReLU activation operation has been performed.
4. The method for implementing sparse computation based on a deep-learning accelerator according to claim 1, 2 or 3, characterized in that, in step S2, the target input feature value and the real address corresponding to the target input feature value are specifically stored in the data buffer.
5. The method for implementing sparse computation based on a deep-learning accelerator according to claim 4, characterized in that, in step S3, the values of all input feature values in the data buffer and their corresponding real addresses are specifically sent to the multiplier array; the multiplier array obtains the weight value corresponding to each input feature value according to its real address, performs a multiplication of the input feature value with the correspondingly obtained weight value, and outputs the multiplication result.
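Claim 5's send-and-multiply step can be illustrated as follows. The weight table and entry layout are hypothetical; the point is only that the real address is the key that pairs each input feature value with its weight.

```python
def multiply_by_address(buffer_entries, weight_table):
    """For each (value, real_address) pair sent from the data buffer,
    fetch the weight by real address and output the product."""
    results = []
    for value, real_addr in buffer_entries:
        w = weight_table[real_addr]      # weight fetched by real address
        results.append((real_addr, value * w))
    return results

weights = {0: 2, 3: -1, 5: 4}            # hypothetical weight table
entries = [(7, 0), (4, 3), (9, 5)]       # (value, real_address) pairs
assert multiply_by_address(entries, weights) == [(0, 14), (3, -4), (5, 36)]
```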
6. The method for implementing sparse computation based on a deep-learning accelerator according to claim 1, 2 or 3, characterized in that a first data buffer for storing the input feature values required for the computation and a second data buffer for storing the input feature values not required for the computation are configured in advance; in step S2, if the target input feature value is judged to be data required for computing the current output feature value, the target input feature value, its sparse address in the compressed input data, and its real address are stored in the first data buffer; otherwise they are stored in the second data buffer.
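The two-buffer routing of claim 6 can be sketched as below. The entry layout (value, sparse address, real address) follows the claim; the `needed_real_addresses` criterion is a simplification standing in for the actual judgment.

```python
def route_entries(compressed, needed_real_addresses):
    """Route each compressed entry to the first buffer if it is needed
    for the current output feature value, otherwise to the second.
    Each entry keeps its value, sparse address and real address."""
    first_buffer, second_buffer = [], []
    for sparse_addr, (real_addr, value) in enumerate(compressed):
        entry = (value, sparse_addr, real_addr)
        if real_addr in needed_real_addresses:
            first_buffer.append(entry)   # needed for the computation
        else:
            second_buffer.append(entry)  # not needed for this output
    return first_buffer, second_buffer

compressed = [(1, 7), (4, 4), (5, 9)]    # (real_address, value) pairs
fb, sb = route_entries(compressed, needed_real_addresses={1, 5})
assert fb == [(7, 0, 1), (9, 2, 5)]
assert sb == [(4, 1, 4)]
```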
7. A device for implementing the method for implementing sparse computation based on a deep-learning accelerator according to any one of claims 1 to 6, characterized by comprising:
a control module, configured to start a scoreboard processing operation when the input data from each channel contains a specified number of 0s;
a scoreboard processing module, configured to execute the scoreboard processing operation, comprising a data input unit, a compression unit, a judgment unit, a data buffer and a sending unit connected in sequence; the data input unit receives all input data and outputs it to the compression unit; the compression unit compresses the input data to filter out the 0-value neurons therein, obtaining compressed input data; the judgment unit obtains each input feature value in the compressed input data in channel order in turn and judges it, and if a target input feature value is judged to be data required for computing the current output feature value, stores the target input feature value in the data buffer; the sending unit sends the information of all input feature values in the data buffer to the multiplier array in the accelerator to perform the required computation.
8. The device according to claim 7, characterized in that: the data buffer comprises a first data buffer for storing the input feature values required for the computation and a second data buffer for storing the input feature values not required for the computation; if the control module judges that the target input feature value is data required for computing the current output feature value, it stores the target input feature value, its sparse address in the compressed input data, and its real address in the first data buffer; otherwise it stores them in the second data buffer.
9. The device according to claim 8, characterized in that: the scoreboard processing module further comprises an address self-increment unit connected to the control module, configured to obtain the sparse address of each input feature value by self-increment.
10. The device according to claim 8 or 9, characterized in that: the scoreboard processing module uses a ping-pong data-processing scheme, in which the input feature values whose processing is nearly complete are stored in one data storage area and transferred to the multiplier array through the sending unit, while at the same time new input feature values are received by the data input unit, judged by the judgment unit, and stored in the other data storage area.
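The ping-pong scheme of claim 10 alternates two buffers so that draining to the multiplier array and filling with newly judged values overlap. A minimal software sketch, with the batch granularity chosen arbitrarily for illustration:

```python
def ping_pong(batches):
    """Fill one buffer with each incoming batch while draining the
    other; the roles of the two buffers swap every step."""
    buffers = [[], []]
    active = 0                    # buffer currently being filled
    sent = []
    for batch in batches:
        buffers[active].extend(batch)             # fill one buffer...
        draining = 1 - active
        if buffers[draining]:
            sent.append(list(buffers[draining]))  # ...drain the other
            buffers[draining].clear()
        active = draining                         # swap roles
    for b in buffers:             # flush whatever remains at the end
        if b:
            sent.append(list(b))
    return sent

assert ping_pong([[1, 2], [3], [4, 5]]) == [[1, 2], [3], [4, 5]]
```

The payoff is concurrency: in hardware, the transfer of one buffer's contents to the multiplier array can proceed in the same cycles as the judgment and storage of new input feature values into the other buffer.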
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810803430.5A CN109086883A (en) | 2018-07-20 | 2018-07-20 | Method and device for realizing sparse calculation based on deep learning accelerator |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109086883A true CN109086883A (en) | 2018-12-25 |
Family
ID=64838357
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810803430.5A Pending CN109086883A (en) | 2018-07-20 | 2018-07-20 | Method and device for realizing sparse calculation based on deep learning accelerator |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109086883A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109858622A (en) * | 2019-01-31 | 2019-06-07 | 福州瑞芯微电子股份有限公司 | The data of deep learning neural network carry circuit and method |
CN111931921A (en) * | 2020-10-13 | 2020-11-13 | 南京风兴科技有限公司 | Ping-pong storage method and device for sparse neural network |
CN112749782A (en) * | 2019-10-31 | 2021-05-04 | 上海商汤智能科技有限公司 | Data processing method and related product |
US11823060B2 (en) | 2020-04-29 | 2023-11-21 | HCL America, Inc. | Method and system for performing deterministic data processing through artificial intelligence |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101820543A (en) * | 2010-03-30 | 2010-09-01 | 北京蓝色星河软件技术发展有限公司 | Ping-pong structure fast data access method combined with direct memory access (DMA) |
CN107066239A (en) * | 2017-03-01 | 2017-08-18 | 智擎信息系统(上海)有限公司 | A kind of hardware configuration for realizing convolutional neural networks forward calculation |
CN107657581A (en) * | 2017-09-28 | 2018-02-02 | 中国人民解放军国防科技大学 | Convolutional neural network CNN hardware accelerator and acceleration method |
US20180174036A1 (en) * | 2016-12-15 | 2018-06-21 | DeePhi Technology Co., Ltd. | Hardware Accelerator for Compressed LSTM |
CN108280514A (en) * | 2018-01-05 | 2018-07-13 | 中国科学技术大学 | Sparse neural network acceleration system based on FPGA and design method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109086883A (en) | Method and device for realizing sparse calculation based on deep learning accelerator | |
CN106844294B (en) | Convolution algorithm chip and communication equipment | |
CN108108809B (en) | Hardware architecture for reasoning and accelerating convolutional neural network and working method thereof | |
CN109784489A (en) | Convolutional neural networks IP kernel based on FPGA | |
CN107832804A (en) | A kind of information processing method and Related product | |
CN108932548A (en) | A kind of degree of rarefication neural network acceleration system based on FPGA | |
CN107239824A (en) | Apparatus and method for realizing sparse convolution neutral net accelerator | |
CN106779060A (en) | A kind of computational methods of the depth convolutional neural networks for being suitable to hardware design realization | |
CN106951395A (en) | Towards the parallel convolution operations method and device of compression convolutional neural networks | |
CN109472356A (en) | A kind of accelerator and method of restructural neural network algorithm | |
CN108763159A (en) | To arithmetic accelerator before a kind of LSTM based on FPGA | |
CN107622305A (en) | Processor and processing method for neutral net | |
CN110163354A (en) | A kind of computing device and method | |
CN112529165B (en) | Deep neural network pruning method, device, terminal and storage medium | |
Que et al. | Optimizing reconfigurable recurrent neural networks | |
CN108304925A (en) | A kind of pond computing device and method | |
CN108647776A (en) | A kind of convolutional neural networks convolution expansion process circuit and method | |
WO2022112739A1 (en) | Activation compression method for deep learning acceleration | |
CN109214508A (en) | The system and method for signal processing | |
CN110163350A (en) | A kind of computing device and method | |
CN109145107A (en) | Subject distillation method, apparatus, medium and equipment based on convolutional neural networks | |
CN110119805A (en) | Convolutional neural networks algorithm based on echo state network classification | |
Que et al. | Recurrent neural networks with column-wise matrix–vector multiplication on FPGAs | |
Liu et al. | Kiwifruit leaf disease identification using improved deep convolutional neural networks | |
CN109948787B (en) | Arithmetic device, chip and method for neural network convolution layer |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20181225 |