CN107392305A - Method and computer-readable medium for implementing and executing a neural network - Google Patents
Method and computer-readable medium for implementing and executing a neural network Download PDF Info
- Publication number
- CN107392305A CN107392305A CN201710333745.3A CN201710333745A CN107392305A CN 107392305 A CN107392305 A CN 107392305A CN 201710333745 A CN201710333745 A CN 201710333745A CN 107392305 A CN107392305 A CN 107392305A
- Authority
- CN
- China
- Prior art keywords
- weight
- neural network
- reorder
- trained
- version
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Links
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
Abstract
A neural network is trained to generate feature maps and associated weights. Reordering is performed to produce a functionally equivalent network. The reordering may be performed to improve at least one of weight compression, load balancing, and execution. In one implementation, zero-valued weights are grouped, allowing them to be skipped during execution.
Description
Cross-reference to related applications
This application claims the benefit of U.S. Provisional Application No. 62/336,493, filed May 13, 2016, U.S. Application No. 15/421,423, filed January 31, 2017, and Korean Patent Application No. 10-2017-0048036, filed April 13, 2017, the contents of which are incorporated herein by reference.
Technical field
Embodiments of the invention generally relate to neural networks.
Background
An artificial neural network (NN) can be designed and trained to perform a wide range of functions. Example NN applications include image processing, speech recognition, data processing, and control, among others. An NN model may include a significant number of layers and parameters (weights). Processors with highly parallel architectures, such as graphics processing units (GPUs), can facilitate efficient implementation of large-scale NNs.
Brief description of the drawings
Fig. 1 is a block diagram showing reordering of the feature maps and weights of a neural network according to an embodiment.
Fig. 2 shows a portion of a neural network according to an embodiment.
Fig. 3 shows a portion of a neural network according to an embodiment.
Fig. 4 shows a method of reordering a neural network according to an embodiment.
Fig. 5 shows a method of executing a reordered neural network according to an embodiment.
Fig. 6 shows a method of reordering a neural network, including pruning, according to an embodiment.
Fig. 7 shows a method of executing a reordered neural network to skip zero-valued weights according to an embodiment.
Figs. 8A and 8B show reordering to improve load balancing according to an embodiment.
Figs. 9A and 9B show Huffman coding of weights according to an embodiment.
Fig. 10 shows mask stream decoding and value stream decoding in a neural network according to an embodiment.
Detailed description
Fig. 1 is a high-level block diagram according to an embodiment. In one embodiment, a neural network (NN) development framework 105 generates a set of weights for all layers of the network. In one embodiment, additional processing of the weights is performed offline on a computer system. In one embodiment, optional post-processing 110 is performed, which includes pruning, which eliminates many weights by setting them to zero (0), as described in more detail below. Feature-map reordering 115 is performed, which results in an equivalent network with reordered weights. The reordered weights are compressed 120. An optimized network 125 is compiled corresponding to the reordered version of the originally trained neural network. In one embodiment, a neural network using the compressed weights may be implemented to exploit parallel processing. In addition, a neural network using the compressed weights may be implemented such that the parallel processors need not process input weight values that are all zero.
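The offline pipeline of Fig. 1 (optional pruning, then compression of the weights) can be illustrated with a minimal sketch. The function names, the threshold value, and the mask-plus-values compression format below are illustrative assumptions, not the patent's actual scheme:

```python
import numpy as np

def prune(w, threshold):
    """Optional post-processing: clamp low-magnitude weights to zero."""
    p = w.copy()
    p[np.abs(p) < threshold] = 0.0
    return p

def compress(w):
    """Stand-in for entropy coding: store a zero mask plus packed non-zeros."""
    mask = w != 0
    return mask, w[mask]

def decompress(mask, vals):
    w = np.zeros(mask.shape, dtype=vals.dtype)
    w[mask] = vals
    return w

w = np.array([0.02, 0.9, -0.01, 0.5, 0.0, -0.7])
p = prune(w, threshold=0.05)     # pruning 110 sets small weights to zero
mask, vals = compress(p)         # compression 120 stores only non-zeros
assert np.array_equal(decompress(mask, vals), p)
assert len(vals) == 3            # only the non-zero weights are stored
```

The round trip through `compress`/`decompress` is lossless, while pruning itself is the lossy step performed before compression.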
Fig. 2 is a block diagram of an example of a portion of a neural network using compressed weights according to an embodiment. A memory (e.g., static random access memory (SRAM)) is provided to store the compressed weights and input feature maps (IFMs). In one embodiment, a control unit includes special control logic for controlling the parallel units, and a central processing unit (CPU) that works in combination to control the operation of the SRAM memory, the multiply-accumulate array (MAA) units, and the input data path (IDP) units. In many NNs, such as convolutional NNs, many computations may be implemented as operations based on calculations using the MAA units.
In one embodiment, each IDP unit receives compressed weights and input feature map data and outputs decompressed weights and IFM data to the MAA units. For example, each IDP may include at least one decompressor and a buffer to buffer input data. In one embodiment, the MAA accumulates results corresponding to output feature map (OFM) data and intermediate results. One or more units (labeled DRU in Fig. 2) may be provided to support additional processing functions on the outputs of the MAA units, such as scaling, adding a bias, applying an activation function, and pooling. In one embodiment, the MAA receives an IFM and a non-zero weight from each IDP.
In one embodiment, the number of IDPs is 8, although more generally a different number of IDPs may be used. In one embodiment, the IDP units run in parallel, each supplying one non-zero weight and a group of feature map values (a subset of an IFM) to the MAA computing units. In one embodiment, the input units iterate over subsets of the IFM and the corresponding weights over a sequence of cycles to generate a set of OFMs in parallel.
Fig. 3 illustrates in greater detail an example of some of the data flows feeding the MAA units according to an embodiment. For purposes of illustration, 8 parallel IDPs and 16 MAAs are shown. More generally, however, any number of units may be configured to support parallel processing. For example, with 8 SRAM units, each individual SRAM stores a portion (e.g., 1/8) of the weights. In one embodiment, a single IDP provides one non-zero weight to the MAAs and provides one IFM (e.g., a 4×4 block) to each of the MAAs.
Fig. 4 is a flowchart showing a method of generating reordered compressed NN weights according to an embodiment. The feature maps and weights of a trained neural network are received 403. Optional optimization 404 of the trained network may be performed. The feature maps and/or weights are reordered to generate a reordered version 405 of the trained neural network. After the reordering, the weights of the reordered version of the trained neural network may be compressed 407 and stored 409 (for example, in the memory of a neural network device, although more generally the compressed weights may be stored in a storage medium or memory unit).
The stored compressed weights may then be used to execute the neural network, as illustrated in the flowchart of Fig. 5. The compressed weights are read 505 and decompressed 510. The model of the neural network is executed 515 using the weights of the reordered version of the neural network.
NN training algorithms typically result in the feature maps of an NN layer being organized arbitrarily in memory. Consequently, the weights corresponding to the feature maps will generally also be organized arbitrarily in memory. This arbitrary organization can in turn affect compression and execution efficiency. One aspect of reordering is that a neural network has many functionally equivalent orderings. However, some of these functionally equivalent orderings may be selected to obtain a structure that yields a better compression ratio than others, and this can be exploited. As an illustration, suppose that feature maps 0 and 10 of a layer can be exchanged; as long as the corresponding weights of that layer are exchanged as well, there is no effect on the input/output relationship of the NN. The same weights are applied to the same inputs, and the results are summed to the same totals in both the original and the reordered network. However, the reordering may be selected to produce a structure that is better suited to compression and/or has advantages for execution. For example, the NN weights may be reordered so that similar weights are grouped together in memory. That is, after training the NN and before its weights are compressed, the feature maps of the NN and the associated weight values may be reordered.
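The feature-map exchange described above can be checked numerically. The sketch below uses a hypothetical two-layer fully connected network: permuting the hidden feature maps of layer 1, while applying the matching permutation to the input columns of layer 2, leaves the network's input/output relationship unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))   # layer 1: 3 inputs -> 4 feature maps
W2 = rng.standard_normal((2, 4))   # layer 2: 4 feature maps -> 2 outputs
x = rng.standard_normal(3)

relu = lambda v: np.maximum(v, 0.0)
y_orig = W2 @ relu(W1 @ x)

perm = [2, 0, 3, 1]                # reorder the hidden feature maps
W1r = W1[perm, :]                  # permute layer-1 output rows
W2r = W2[:, perm]                  # permute layer-2 input columns to match
y_reordered = W2r @ relu(W1r @ x)

assert np.allclose(y_orig, y_reordered)   # functionally equivalent network
```

Any permutation works; the one chosen offline is the one that best groups the weights for compression or load balancing.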
In one embodiment, the neural network reordering may be selected to introduce an ordering to the weights that improves the ability to compress them (i.e., reduces the amount of data representing the NN). By reordering the network layers, a selected ordering of the weights can be introduced to provide better weight compression. One option is to perform the reordering by introducing a structure to the weights that aids in compressing them. For example, the weights may be grouped or sorted by value. Another option is to perform the reordering based on the characteristics of the coding technique used for compression (such as Huffman coding or Golomb-Rice coding). As an example, the feature maps may be reordered so that the frequency distribution is sharper in specific local regions. Moreover, the reordering may be selected to improve prediction accuracy in the coding. As another example, the network feature maps may be reordered so that the weight values tend to increase, or so that the number of zero-valued weights increases.
In addition, by redistributing the non-zero weights, zero-valued weights can be skipped more effectively during network execution. One option is to perform a reordering that groups the zero-valued weights, allowing them to be skipped during execution.
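As a simplified illustration of this grouping (ignoring, for clarity, the constraint that weight moves must correspond to feature-map exchanges), packing the zero-valued weights together lets whole SIMD groups of weights be skipped at once:

```python
import numpy as np

def groups_skipped(w, group=4):
    """Count SIMD groups whose weights are all zero and can be skipped."""
    w = w.reshape(-1, group)
    return int(np.sum(np.all(w == 0, axis=1)))

w = np.array([0, 1, 0, 0,  2, 0, 0, 0,  0, 0, 3, 0], dtype=float)
# Scattered zeros: no group of four is entirely zero, so none are skipped.
assert groups_skipped(w) == 0

# A reordering that collects the zero-valued weights together...
w_sorted = np.concatenate([w[w != 0], w[w == 0]])
# ...lets two whole groups of four be skipped during execution.
assert groups_skipped(w_sorted) == 2
```

The group size of 4 is arbitrary here; the text below mentions groups of, e.g., 16, depending on implementation details.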
As another example, the weights may be reordered to create better load balancing during parallel processing of the neural network model. For example, a reordering may be performed such that each processing unit in the parallel processing is supplied more equal amounts (e.g., about the same number) of non-zero weights over a selected number of cycles.
In one embodiment, pruning of the network and clustering of selected weights are performed after network training. Clustering includes, for example, mapping a larger number of different weight values to a smaller number of weight values to improve compression. For example, 1,000 or more slightly different weights may be mapped to 32 weight values. Clustering is also sometimes referred to as quantization.
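A toy version of this clustering step might look as follows; the one-dimensional k-means below is an illustrative sketch, not the patent's quantizer:

```python
import numpy as np

def quantize(weights, n_clusters=4, iters=20, seed=0):
    """Toy k-means: map many distinct weights to a few centroid values."""
    rng = np.random.default_rng(seed)
    w = weights.ravel()
    centroids = rng.choice(w, n_clusters, replace=False)
    for _ in range(iters):
        labels = np.argmin(np.abs(w[:, None] - centroids[None, :]), axis=1)
        for k in range(n_clusters):
            if np.any(labels == k):
                centroids[k] = w[labels == k].mean()
    return labels.reshape(weights.shape), centroids

w = np.array([0.11, 0.09, 0.52, 0.48, -0.31, -0.29, 0.10, 0.50])
labels, centroids = quantize(w, n_clusters=3)
# Each weight is replaced by its cluster centroid, so at most
# 3 distinct values remain; only the small labels need be stored.
quantized = centroids[labels]
assert len(np.unique(quantized)) <= 3
```

After quantization, the weight tensor is representable as a short codebook (the centroids) plus per-weight indices, which is what the later Huffman coding of weight indices operates on.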
In one embodiment, low-magnitude weights are pruned (set to zero). In one embodiment, the pruning is performed without affecting network accuracy. In the pruning step, low-magnitude weights are clamped to zero. The network may then be retrained to adjust the remaining non-zero weights and regain any lost accuracy. That is, to offset the loss of accuracy, retraining may be performed to readjust some weights so that the overall network maintains the same or nearly the same accuracy while retaining the benefits of compression.
In one embodiment, pruning increases the percentage of zero-valued weights. This has potential advantages for both compression and execution. During execution in an end NN device, multiple weights may be applied in parallel in a SIMD manner in a given cycle (e.g., all parallel compute units apply weights, or all skip zero-valued weights). That is, zero-valued weights need not be applied during execution, because these weights have no effect. In some cases, pruning may result in a significant proportion of the weights ending up zero (e.g., about 60% to 95% or more), which in turn provides an opportunity to accelerate network execution.
In one embodiment, the zero-valued weights are grouped to improve execution. It may be difficult to eliminate the processing cycles of many zero-valued weights individually. However, multiple zero-valued weights can be skipped when they are grouped so that they are collected together in the same cycle. This can help speed up execution while also improving compression.
In addition to reordering the network and losslessly compressing the reordered weights, example embodiments may also utilize lossy compression, which may be omitted in other embodiments. In this case, together with the reordering, the weights are adjusted (e.g., by small adjustments) to improve compression.
Fig. 6 shows a method including pruning and retraining according to an embodiment. The feature maps and weights of a trained neural network are received 601.
The weights are pruned 610 to improve weight compression efficiency and reduce network computation cost. In one embodiment, the pruning is performed with a variable threshold. For example, the threshold may be selected based on a predetermined scale factor applied to a distance metric of the weights. In an example embodiment, the threshold is selected to be a value equal to about 20% of the L1 distance of each convolution kernel in a convolutional layer or of each weight vector in a fully connected layer. A different scale factor or a different distance metric may be used in alternative embodiments. In another example, the threshold may be found iteratively via dynamic programming, so as to maximize the number of zero values in each cluster generated by a rule satisfying a threshold constraint.
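A sketch of per-kernel threshold pruning follows. Reading the "20% of the L1 distance" criterion as a scale factor applied to the mean absolute weight of a kernel is an assumption made here for illustration:

```python
import numpy as np

def prune_kernel(kernel, scale=0.20):
    """Prune with a per-kernel threshold: scale times the mean absolute
    weight (one plausible reading of the L1 criterion; an assumption)."""
    threshold = scale * np.mean(np.abs(kernel))
    pruned = kernel.copy()
    pruned[np.abs(pruned) < threshold] = 0.0
    return pruned

k = np.array([[0.8, -0.02], [0.05, -0.9]])
# mean |w| = (0.8 + 0.02 + 0.05 + 0.9) / 4 = 0.4425; threshold ~ 0.0885
pk = prune_kernel(k)
assert pk[0, 1] == 0.0 and pk[1, 0] == 0.0   # small weights clamped to zero
assert pk[0, 0] == 0.8 and pk[1, 1] == -0.9  # large weights survive
```

Because the threshold is computed per kernel, kernels with uniformly small weights are not wiped out wholesale, which matches the variable-threshold idea above.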
The remaining weights are retrained 615. As shown in block 620, in some embodiments the pruning and retraining may optionally be repeated one or more times until a stop condition is met, such as completing a predetermined number of iterations.
Quantization of the weights, with optional retraining, may be performed 625. In an example embodiment, the clustering of the weights is carried out based on k-means clustering, where the centroid of each cluster is used to represent the weights included in that cluster.
The groups of quantized weights are reordered 630. As previously stated, the reordering may include reordering corresponding to swapping feature maps, or feature-map nodes in fully connected layers. However, the reordering may also include reordering the weights to improve compression. The reordering may include reordering into clusters and reordering based on row and column attributes. The groups of quantized weights within a cluster may also be selected to maximize the effectiveness of prediction. For example, the reordering may include a reordering such that cluster 0 is the most common and cluster 31 the least common. As one option, the columns may be reordered in increasing order into clusters of a selected number of columns (e.g., 16, depending on implementation details) to maximize the effectiveness of compression across some of the columns. In addition, the rows may be reordered within a group of columns to compress effectively in the row dimension iteratively. For example, the elements of row 1 are predicted to be the same as row 0 plus some small positive increment, and the increments are compressed. In alternative embodiments, a cluster may be any suitable number of columns. In alternative embodiments, clusters may be formed from any suitable elements (e.g., rows).
Deltas are computed 635 relative to predictions. For example, the differences between adjacent columns and/or rows within a cluster may be computed. Other transformations may be applied to a "base" column or row that is used to predict the other columns and rows. For example, suppose column 0 is chosen as the "base" column, and every other column in the group (e.g., 16 columns) is predicted by a different scale factor applied to the base column. For example, a row may be predicted as row 0 multiplied by a scale factor plus some delta. In some cases, the deltas can be very small.
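The base-row prediction can be sketched as follows: each row in a group is modeled as a scale factor times the base row plus a small delta (the least-squares fit for the scale factor is an illustrative choice, not the patent's method):

```python
import numpy as np

def encode_deltas(block):
    """Predict each row from row 0 ('base') times a per-row scale factor;
    store the base, the scales, and the small residual deltas."""
    base = block[0].astype(float)
    scales = block @ base / (base @ base)     # least-squares scale per row
    deltas = block - np.outer(scales, base)
    return base, scales, deltas

def decode(base, scales, deltas):
    return np.outer(scales, base) + deltas

base_row = np.array([1.0, -2.0, 0.5, 3.0, -1.0, 0.25])
block = np.vstack([base_row,
                   2.0 * base_row + 0.01,    # near-multiple of the base
                   -0.5 * base_row])          # exact multiple of the base
b, s, d = encode_deltas(block)
assert np.allclose(decode(b, s, d), block)   # encoding is invertible
assert np.max(np.abs(d)) < 0.02              # deltas stay small
```

When the rows really are near-multiples of the base, the delta stream is dominated by values close to zero, which is exactly what the entropy coding step below exploits.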
An optional adjustment of the deltas may be performed to improve compressibility 645, followed by retraining to mitigate accuracy loss. For example, to improve compressibility, the delta values may be adjusted up or down by small amounts. This adjustment constitutes the lossy part of the compression scheme.
The deltas and base predictions are then compressed 650. An encoding scheme, such as an entropy coding scheme, may be used. For example, Huffman coding may be used to represent the multiplicities of the deltas. Effective compression can be achieved by representing the most common deltas with as few bits as possible.
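A minimal Huffman construction illustrates why this works: code lengths grow with rarity, so a delta stream dominated by zeros compresses well. This is a generic textbook sketch, not the patent's encoder:

```python
import heapq
from collections import Counter

def huffman_code_lengths(symbols):
    """Build Huffman code lengths: frequent symbols get shorter codes."""
    freq = Counter(symbols)
    if len(freq) == 1:
        return {next(iter(freq)): 1}
    # Heap entries are (count, unique id, symbol list); the id breaks ties.
    heap = [(n, i, [s]) for i, (s, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    lengths = {s: 0 for s in freq}
    uid = len(heap)
    while len(heap) > 1:
        n1, _, s1 = heapq.heappop(heap)
        n2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:      # every merge adds one bit to these codes
            lengths[s] += 1
        heapq.heappush(heap, (n1 + n2, uid, s1 + s2))
        uid += 1
    return lengths

# A mostly-zero delta stream: the common value 0 gets the shortest code.
deltas = [0] * 12 + [1] * 3 + [2] * 2 + [5]
lengths = huffman_code_lengths(deltas)
assert lengths[0] == 1                 # most common symbol: 1 bit
assert lengths[5] >= lengths[1]        # rarer symbols: longer codes
```

With the 18 symbols above, the total cost is 12·1 + 3·2 + 2·3 + 1·3 = 27 bits instead of 36 bits at a fixed 2 bits per symbol.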
The compressed representation of the reordered model is then written to a data storage device 655.
Fig. 7 is a flowchart showing a method of execution that includes skipping zero-valued weights according to an embodiment. The compressed weights are read 705. The weights are decompressed 710. During execution of the neural network, weights are applied 715 in parallel in groups of a selected size (e.g., 16, depending on implementation details). When a cluster of (a group of) values has all of its weights set to zero, that cluster is skipped 720. Otherwise, convolutions and vector products are processed during execution as in conventional neural network execution.
In one embodiment, the manner in which zero values are handled depends in part on the layer type (e.g., convolutional layers versus fully connected layers). That is, the manner of skipping zero-valued weights depends on the layer type (which in turn corresponds to different mathematical operations, such as the vector product operation of a fully connected layer and the convolution operation of a convolutional layer). For example, zero-valued weights may be grouped to skip them more effectively in a fully connected layer computing a vector product. For a convolutional layer, however, the zero values may be distributed (spread) to aid load balancing across the parallel compute units. Because zero weights need not be combined in the convolution operation of a convolutional layer, the processing of zero values can be skipped. Consider an example of a convolutional layer with load balancing. In this example, each input unit finds the next non-zero weight for its input subset and moves to that weight. Thus, each input unit moves through its input data at a different rate, jumping from one non-zero weight to the next. If each input unit has about the same number of non-zero weights to apply within its input subset, then the system supports balancing and effectively skips the cycles that zero-valued weights would otherwise require.
Figs. 8A and 8B show an example of reordering in a convolutional layer to improve load balancing. Fig. 8A shows an example with two input units (input unit 1 and input unit 2). Input unit 1 processes feature map 1 with kernel 1 (where the * operation is convolution), and feature map 3 with kernel 3. Input unit 2 processes feature map 2 with kernel 2 and feature map 4 with kernel 4.
Fig. 8A shows an example without reordering, in which there is a large load imbalance. Input unit 1 needs 4 cycles to emit the four non-zero weights in kernel 1, then needs 3 cycles to emit the three non-zero weights in kernel 3, for a total of 7 cycles. Input unit 2 needs 5 cycles to emit the 5 non-zero weights in kernel 2, then needs 6 cycles to emit the non-zero weights in kernel 4, for a total of 11 cycles. Therefore, because of the load imbalance, 11 cycles are needed overall to process the four feature maps on the two input units.
Fig. 8B shows an example according to an embodiment, in which the reordering shuffles the IFMs in the network to obtain a more load-balanced equivalent network. Feature map 2 and feature map 3 are swapped by redefining the neural network, and the corresponding weight kernels are exchanged as well. Thus, feature map 3 is reordered as feature map 3' with corresponding kernel 3'. Feature map 2' and its corresponding kernel 2' are likewise reordered. In this example, the reordering results in greater load balance. Input unit 1 needs four cycles to emit the four non-zero weights in kernel 1, then needs 5 cycles to emit the non-zero weights of kernel 3', for a total of 9 cycles to process feature map 1 and feature map 3'. Input unit 2 needs three cycles to emit the three non-zero weights in kernel 2', and needs six cycles to emit the non-zero weights of kernel 4, for a total of 9 cycles. Therefore, in Fig. 8B, nine cycles are needed to process the four feature maps on the two input units.
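The cycle counts described for Figs. 8A and 8B can be reproduced with a one-line model: each input unit emits one non-zero weight per cycle, and the layer completes when the busiest unit finishes:

```python
def cycles(assignments):
    """Total cycles = the busiest unit's non-zero-weight count, where
    assignments[i] lists the non-zero counts of the kernels on unit i."""
    return max(sum(unit) for unit in assignments)

# Before reordering (Fig. 8A): unit 1 gets kernels 1 and 3 (4 + 3 non-zeros),
# unit 2 gets kernels 2 and 4 (5 + 6 non-zeros).
assert cycles([[4, 3], [5, 6]]) == 11

# After swapping feature maps 2 and 3 (Fig. 8B): unit 1 gets 4 + 5,
# unit 2 gets 3 + 6, so both finish in 9 cycles.
assert cycles([[4, 5], [3, 6]]) == 9
```

The total work (18 non-zero weights) is unchanged; only its distribution across units improves, which is why the reordered network is functionally equivalent yet faster.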
In one embodiment, hardware support is provided for on-the-fly load balancing. For example, offline processing may perform an optional reordering to obtain the IFM reordering and perform the OFM reordering. In one embodiment, remapping logic and a remapping table are supported to specify variable remappings performed during network hardware execution.
As previously discussed, reordering can result in an equivalent version of the same network, for example by exchanging the feature maps of different layers and exchanging the corresponding weights (e.g., exchanging maps 2 and 10 and exchanging the weights corresponding to maps 2 and 10). However, in one embodiment, the reordering includes producing an additional remapping table to assist the hardware in a neural processing unit. The remapping table may instruct the hardware to perform the exchange. For example, the remapping table may instruct the hardware computing output map 3 to exchange input maps 2 and 10.
As previously discussed, many different data compression algorithms can be used for the weights, such as, but not limited to, Huffman coding or any other suitable compression algorithm, such as Golomb-Rice coding. Compression performance can depend on the organization of the data to be compressed. For example, the compression may depend on making a prediction and representing the differences from the prediction using a variable number of bits. For example, more common values are compressed into fewer bits.
Figs. 9A and 9B show aspects of Huffman coding according to an embodiment of the invention. As shown in Fig. 9A, weight decoding can in principle be performed using a single shared Huffman table. For a sequence of output nodes (e.g., output nodes 0, 1...7), there is a group of weight indices. The weight indices have a usage distribution in which low indices are more common than high indices. A single Huffman table is used for the low indices used at higher frequency across the whole weight group. However, in Fig. 9A, suppose the weight index usage is distributed such that low indices are more common than high indices, but are more common in the left columns than in the right columns. In the example of Fig. 9A, each column of weight indices is in random order. For example, column O0 has a random index distribution corresponding to whatever came out of training, and for each column of weight indices in Fig. 9A, column O1 has a random index distribution, and so on.
Fig. 9B shows the use of Huffman coding for context-adaptive variable weight compression according to an embodiment. The columns (and/or rows) can be sorted by the frequency of low indices to generate a weight organization that allows two or more different Huffman tables to be used. For example, the distribution of weight index usage may be selected so that, compared with the right-hand columns, low indices are more common than high indices for the left-hand columns. In the example of Fig. 9B, the reordering moves low-valued weights to one side of the matrix and high values to the opposite side. After the reordering of the weight matrix, a group of Huffman tables is optimized for subsets of the nodes. For example, each table may correspond to a different group of nodes, where each table has a different frequency of low indices. For example, consider first the leftmost two columns. The weight index column for output node O0' has the most common low weight indices among the columns. The weight index column for output node O1' has an index distribution similar to the column to its left. The weight indices for the first two nodes (0' and 1') use a first Huffman table for nodes 0' and 1', corresponding to a very high frequency of low indices. Moving to the next two columns, the weight index column for output node 2' has low indices that are less common than in the columns to its left. The weight index column for output node 3' has a distribution similar to the column to its left. The weight indices for nodes 2' and 3' use a second Huffman table for nodes 2' and 3'. The ordering continues from left to right across the reordered output nodes, ending at output node 6', which has the least common low indices, and output node 7', whose weight index column has a distribution similar to that for output node 6'.
Fig. 10 shows an embodiment in which an IDP decompressor for Huffman or Golomb-Rice decoding includes a compressed weight mask stream decoder and a compressed weight value stream decoder. In one embodiment, a weight kernel is represented with a mask specifying the (pruned) weights and indices for the non-zero weights. A further lookup table (LUT) may be provided to support the decoding. In one embodiment, the output includes a zero mask buffer and a weight value buffer.
Exemplary embodiments may be deployed as an electronic device including a processor and a memory storing instructions. Furthermore, it should be understood that embodiments may be deployed as a standalone device or deployed across multiple devices in a distributed client-server networked system.
A non-limiting example of an execution environment for embodiments of the invention is a graphics processing unit (GPU). Although a GPU can provide substantial computing power for implementing an NN, it may be difficult to implement an NN on a device with limited memory and/or power. By clustering zero-valued weights so that they can be skipped more effectively, the example embodiments disclosed herein can achieve improved compression of the neural network weight parameters stored in the memory of a GPU, and provide improved efficiency of network execution.
Herein, where appropriate, a computer-readable non-transitory storage medium or media may include one or more semiconductor-based or other integrated circuits (ICs) (such as, for example, field-programmable gate arrays (FPGAs) or application-specific integrated circuits (ASICs)), hard disk drives (HDDs), hybrid hard drives (HHDs), optical discs, optical disc drives (ODDs), magneto-optical discs, magneto-optical drives, floppy disks, floppy disk drives (FDDs), magnetic tapes, solid-state drives (SSDs), RAM drives, secure digital cards or drives, any other suitable computer-readable non-transitory storage media, or any suitable combination of two or more of these. Where appropriate, a computer-readable non-transitory storage medium may be volatile, non-volatile, or a combination of volatile and non-volatile.
Herein, "or" is inclusive and not exclusive, unless expressly indicated otherwise or indicated otherwise by context. Therefore, herein, "A or B" means "A, B, or both", unless expressly indicated otherwise or indicated otherwise by context. Moreover, "and" is both joint and several, unless expressly indicated otherwise or indicated otherwise by context. Therefore, herein, "A and B" means "A and B, jointly or severally", unless expressly indicated otherwise or indicated otherwise by context.
The scope of this disclosure encompasses all changes, substitutions, variations, alterations, and modifications to the example embodiments described or illustrated herein that a person having ordinary skill in the art would comprehend. The scope of this disclosure is not limited to the example embodiments described or illustrated herein. Moreover, although this disclosure describes and illustrates respective embodiments herein as including particular components, elements, features, functions, operations, or steps, any of these embodiments may include any combination or permutation of any of the components, elements, features, functions, operations, or steps described or illustrated anywhere herein that a person having ordinary skill in the art would comprehend. Additionally, although this disclosure describes or illustrates particular embodiments as providing particular advantages, particular embodiments may provide some or all of these advantages, or none of them.
Although the present invention has been described in conjunction with specific embodiments, it should be understood that the invention is not intended to be limited to the described embodiments. On the contrary, it is intended to cover alternatives, modifications, and equivalents as may be included within the spirit and scope of the invention as defined by the appended claims. The invention may be practiced without some or all of these specific details. In addition, well-known features may not have been described in detail to avoid unnecessarily obscuring the invention. In accordance with the present invention, components, process steps, and/or data structures may be implemented using various types of operating systems, programming languages, computing platforms, computer programs, and/or computing devices. In addition, those of ordinary skill in the art will recognize that devices such as hardwired devices, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), and the like may also be used without departing from the scope and spirit of the inventive concepts disclosed herein. The present invention may also be tangibly embodied as a set of computer instructions stored on a computer-readable medium, such as a memory device.
Claims (20)
1. a kind of method for realizing neutral net, including:
Receive the data for housebroken neutral net, including characteristic pattern and weight;
The characteristic pattern and/or weight of the housebroken neutral net that reorders, to generate reordering for housebroken neutral net
Version;And
After execution is reordered, the weight of the version to reorder of housebroken neutral net is compressed.
2. according to the method for claim 1, wherein the packet reordering includes the characteristic pattern of the neutral net that reorders to reset
The weight of sequence neutral net.
3. according to the method for claim 1, wherein the packet reordering includes the weight for the neutral net that reorders, with quilt
Select to improve the structure of compression efficiency compared with the weight of received data.
4. according to the method for claim 1, wherein the packet reordering includes considers to reorder in weight based on load balancing
It is at least some with distribution of weights.
5. according to the method for claim 1, at least some weights are divided by weighted value wherein the packet reordering includes
Group.
6. according to the method for claim 5, wherein at least some null value weight is grouped.
7. The method of claim 1, further comprising, before the reordering, clustering the weights by mapping weights having a first number of different weight values to a second number of different weight values, wherein the second number is less than the first number.
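Claims 5 to 7 describe grouping weights by value and clustering many distinct weight values down to a smaller set before reordering. A minimal illustrative sketch of such clustering, here using a simple one-dimensional k-means (Lloyd's algorithm); the claims do not mandate any particular clustering algorithm:

```python
def cluster_weights(weights, num_clusters, iters=20):
    """Map a list of weights with many distinct values onto at most
    `num_clusters` representative values (1-D k-means / Lloyd's)."""
    lo, hi = min(weights), max(weights)
    # Initialize centroids evenly across the weight range.
    step = (hi - lo) / (num_clusters - 1)
    centroids = [lo + k * step for k in range(num_clusters)]
    assign = [0] * len(weights)
    for _ in range(iters):
        # Assign each weight to its nearest centroid.
        assign = [min(range(num_clusters),
                      key=lambda k: abs(w - centroids[k]))
                  for w in weights]
        # Move each centroid to the mean of its assigned weights.
        for k in range(num_clusters):
            members = [w for w, a in zip(weights, assign) if a == k]
            if members:
                centroids[k] = sum(members) / len(members)
    return [centroids[a] for a in assign]

weights = [0.11, 0.09, 0.52, 0.48, -0.31, -0.29, 0.0, 0.02]
clustered = cluster_weights(weights, num_clusters=4)
print(sorted(set(clustered)))  # at most 4 distinct values remain
```

After clustering, each weight takes one of a few representative values, which makes runs of equal (and especially zero) values longer and the subsequent compression step more effective.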
8. The method of claim 1, further comprising, before the compressing, reordering indexes of the input and output nodes of the reordered weights.
9. The method of claim 1, wherein the reordered version of the trained neural network is an equivalent version of the trained neural network.
10. The method of claim 1, wherein the reordering includes generating a remapping table for the neural network to implement a remapping of feature maps that realizes the reordered version of the trained neural network.
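Claims 1 to 3 and 10 together describe reordering weights into a structure chosen for better compression while recording a remapping table that keeps the reordered network functionally equivalent. A toy sketch under those assumptions (not the patented method): rows of a fully connected weight matrix are sorted so zero-heavy rows become contiguous, which helps run-length-style compression, and the returned permutation serves as the remapping table for the corresponding output feature maps:

```python
def reorder_rows_for_compression(weight_matrix):
    """Sort the rows of a weight matrix so rows with the most zeros
    come first, grouping zeros into long runs that compress well.
    Returns the reordered matrix plus the permutation, which acts as
    the remapping table for the output feature maps."""
    order = sorted(range(len(weight_matrix)),
                   key=lambda r: -weight_matrix[r].count(0.0))
    reordered = [weight_matrix[r] for r in order]
    return reordered, order  # `order` is the remapping table

W = [
    [0.5, 0.1, 0.0, 0.2],   # row 0: one zero
    [0.0, 0.0, 0.0, 0.0],   # row 1: all zero
    [0.3, 0.0, 0.0, 0.0],   # row 2: three zeros
    [0.0, 0.0, 0.7, 0.0],   # row 3: three zeros
]
Wr, remap = reorder_rows_for_compression(W)
print(remap)   # permutation to apply/undo at execution time
print(Wr[0])   # the all-zero row now leads the matrix
```

Because the permutation is recorded, execution can consume the reordered weights directly and remap the produced feature maps back, so the network remains equivalent to the trained one.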
11. A method of executing a neural network, comprising:
providing a model of the neural network, wherein the model corresponds to a reordered version of a trained neural network generated by reordering feature maps and/or weights of the trained neural network; and
executing the model of the neural network.
12. The method of claim 11, wherein executing the model includes skipping execution of groups having all-zero weights.
13. The method of claim 11, wherein executing the model includes skipping execution of distributed zero-value weights in a convolution pattern.
14. The method of claim 11, wherein the reordered version includes an ordering of the weights based on load balance conditions for execution on a group of parallel processing units.
15. The method of claim 11, wherein the model of the neural network is executed on a group of parallel processing units, and the reordered version has non-zero weight values distributed based on load balance conditions such that, for at least one convolutional layer, each parallel processing unit operates on approximately the same average number of non-zero weights per cycle over a number of cycles.
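Claims 14 and 15 describe ordering weights so that, across a group of parallel processing units, each unit sees roughly the same number of non-zero weights per cycle. A hypothetical greedy sketch of that idea, dealing weight rows to the unit with the fewest non-zeros so far (a least-loaded heuristic, not necessarily the patented scheme):

```python
def balance_rows(weight_rows, num_units):
    """Greedily assign weight rows to parallel units so the total
    non-zero count per unit stays roughly equal (least-loaded first)."""
    units = [[] for _ in range(num_units)]
    loads = [0] * num_units
    # Hand out the heaviest rows first for a tighter balance.
    for row in sorted(weight_rows,
                      key=lambda r: -sum(1 for w in r if w != 0)):
        u = loads.index(min(loads))  # pick the least-loaded unit
        units[u].append(row)
        loads[u] += sum(1 for w in row if w != 0)
    return units, loads

rows = [[1, 0, 2], [0, 0, 3], [4, 5, 6], [0, 7, 0], [8, 0, 0], [9, 1, 0]]
units, loads = balance_rows(rows, num_units=2)
print(loads)  # non-zero counts per unit come out nearly equal
```

With equal non-zero counts per unit, no unit idles while another works through a dense region, which is the load-balance condition the claims target.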
16. The method of claim 11, wherein the model includes a remapping table for the neural network to implement a remapping of feature maps that realizes the reordered version of the trained neural network.
17. The method of claim 16, wherein the remapping table is utilized by hardware during execution to perform feature map reordering.
18. The method of claim 11, wherein the reordered version is a network equivalent to the trained neural network or an optimized version of the trained neural network.
19. The method of claim 11, wherein the weights of the neural network are stored in a compressed format, and the method further comprises:
reading the compressed weights;
decompressing the compressed weights;
skipping execution of zero-value weights, including at least one of: skipping any cluster of weights in which all weights are zero for a fully connected layer, or skipping execution of scattered zero-value weights for a convolutional layer; and
performing execution for the neural network using the remaining decompressed weights.
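Claim 19 walks through execution with compressed weights: read, decompress, skip zero-value weights, and compute with the remainder. A simplified sketch under stated assumptions — run-length encoding of zero runs is used here as the compression format, which the claim does not specify:

```python
def compress_rle(weights):
    """Run-length encode zeros: a run of n zeros becomes ('Z', n);
    non-zero weights are stored as-is."""
    out, zeros = [], 0
    for w in weights:
        if w == 0:
            zeros += 1
        else:
            if zeros:
                out.append(('Z', zeros))
                zeros = 0
            out.append(w)
    if zeros:
        out.append(('Z', zeros))
    return out

def dot_skipping_zeros(compressed, inputs):
    """Multiply-accumulate over the decompressed weight stream,
    skipping the zero runs entirely (no multiplies issued for them)."""
    acc, i = 0.0, 0
    for tok in compressed:
        if isinstance(tok, tuple):   # ('Z', n): advance past n inputs
            i += tok[1]
        else:                        # non-zero weight: do the MAC
            acc += tok * inputs[i]
            i += 1
    return acc

w = [0.0, 0.0, 2.0, 0.0, 3.0, 0.0, 0.0, 1.0]
x = [1.0, 1.0, 4.0, 1.0, 5.0, 1.0, 1.0, 6.0]
c = compress_rle(w)
print(c)                         # [('Z', 2), 2.0, ('Z', 1), 3.0, ('Z', 2), 1.0]
print(dot_skipping_zeros(c, x))  # 29.0  (= 2*4 + 3*5 + 1*6)
```

Grouping zeros into runs (as the reordering claims arrange) is what makes this skipping cheap: a whole run costs one index increment instead of one multiply per weight.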
20. A computer-readable medium comprising a storage medium storing instructions that, when executed on a processor, implement a method, the method comprising:
receiving data for a trained neural network, including feature maps and weights;
reordering the feature maps and/or weights of the trained neural network to generate a reordered version of the trained neural network; and
after performing the reordering, compressing the weights of the reordered version of the trained neural network.
Applications Claiming Priority (6)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201662336493P | 2016-05-13 | 2016-05-13 | |
US62/336,493 | 2016-05-13 | ||
US15/421,423 US20180082181A1 (en) | 2016-05-13 | 2017-01-31 | Neural Network Reordering, Weight Compression, and Processing |
US15/421,423 | 2017-01-31 | ||
KR1020170048036A KR20170128080A (en) | 2016-05-13 | 2017-04-13 | Method and apparatus for implementing neural network |
KR10-2017-0048036 | 2017-04-13 |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107392305A true CN107392305A (en) | 2017-11-24 |
Family
ID=60338932
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710333745.3A Withdrawn CN107392305A (en) | 2016-05-13 | 2017-05-12 | Method and computer-readable medium for implementing and executing a neural network
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107392305A (en) |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108416425A (en) * | 2018-02-02 | 2018-08-17 | 浙江大华技术股份有限公司 | Convolution operation method and device |
CN108710906A (en) * | 2018-05-11 | 2018-10-26 | 北方民族大学 | Real-time point cloud model classification method based on lightweight network LightPointNet |
CN110084364A (en) * | 2018-01-25 | 2019-08-02 | 北京深鉴智能科技有限公司 | Deep neural network compression method and device |
WO2019177731A1 (en) * | 2018-03-13 | 2019-09-19 | Recogni Inc. | Cluster compression for compressing weights in neural networks |
CN111587436A (en) * | 2018-01-17 | 2020-08-25 | 昕诺飞控股有限公司 | System and method for object recognition using neural networks |
CN111937009A (en) * | 2018-04-05 | 2020-11-13 | Arm有限公司 | Systolic convolutional neural network |
WO2020259031A1 (en) * | 2019-06-27 | 2020-12-30 | 深圳市中兴微电子技术有限公司 | Data processing method and device, storage medium and electronic device |
TWI727641B (en) * | 2020-02-03 | 2021-05-11 | 華邦電子股份有限公司 | Memory apparatus and operation method thereof |
WO2022073160A1 (en) * | 2020-10-07 | 2022-04-14 | 浙江大学 | Encoding method, decoding method, encoder, decoder, and storage medium |
WO2022183345A1 (en) * | 2021-03-01 | 2022-09-09 | 浙江大学 | Encoding method, decoding method, encoder, decoder, and storage medium |
US11562235B2 (en) | 2020-02-21 | 2023-01-24 | International Business Machines Corporation | Activation function computation for neural networks |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5214746A (en) * | 1991-06-17 | 1993-05-25 | Orincon Corporation | Method and apparatus for training a neural network using evolutionary programming |
US20030033265A1 (en) * | 2001-08-10 | 2003-02-13 | Cabana David R. | Artificial neurons including weights that define maximal projections |
CN1470000A (en) * | 2000-09-11 | 2004-01-21 | Westerngeco Seismic Holdings Ltd | Neural net prediction of seismic streamer shape |
CN102521382A (en) * | 2011-12-21 | 2012-06-27 | 中国科学院自动化研究所 | Method for compressing video dictionary |
US20150006444A1 (en) * | 2013-06-28 | 2015-01-01 | Denso Corporation | Method and system for obtaining improved structure of a target neural network |
- 2017-05-12: CN CN201710333745.3A patent CN107392305A (en), status: not active (withdrawn)
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5214746A (en) * | 1991-06-17 | 1993-05-25 | Orincon Corporation | Method and apparatus for training a neural network using evolutionary programming |
CN1470000A (en) * | 2000-09-11 | 2004-01-21 | Westerngeco Seismic Holdings Ltd | Neural net prediction of seismic streamer shape |
US20030033265A1 (en) * | 2001-08-10 | 2003-02-13 | Cabana David R. | Artificial neurons including weights that define maximal projections |
CN102521382A (en) * | 2011-12-21 | 2012-06-27 | 中国科学院自动化研究所 | Method for compressing video dictionary |
US20150006444A1 (en) * | 2013-06-28 | 2015-01-01 | Denso Corporation | Method and system for obtaining improved structure of a target neural network |
Non-Patent Citations (3)
Title |
---|
余文芳: "BP Neural Network Implementation of Weight Allocation in Group Decision-Making", Microcomputer Information * |
白雅娟: "Weight Allocation Method for Evaluating Cab Human-Machine Interface Matching Based on Artificial Neural Networks", Journal of North University of China (Natural Science Edition) * |
胡彩萍: "Research on Ranking Evaluation Algorithms Based on BP Neural Networks and Their Application", China Master's Theses Full-text Database, Basic Sciences * |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111587436A (en) * | 2018-01-17 | 2020-08-25 | 昕诺飞控股有限公司 | System and method for object recognition using neural networks |
CN110084364A (en) * | 2018-01-25 | 2019-08-02 | 北京深鉴智能科技有限公司 | Deep neural network compression method and device |
CN110084364B (en) * | 2018-01-25 | 2021-08-27 | 赛灵思电子科技(北京)有限公司 | Deep neural network compression method and device |
CN108416425B (en) * | 2018-02-02 | 2020-09-29 | 浙江大华技术股份有限公司 | Convolution operation method and device |
CN108416425A (en) * | 2018-02-02 | 2018-08-17 | 浙江大华技术股份有限公司 | Convolution operation method and device |
WO2019177731A1 (en) * | 2018-03-13 | 2019-09-19 | Recogni Inc. | Cluster compression for compressing weights in neural networks |
US11468316B2 (en) | 2018-03-13 | 2022-10-11 | Recogni Inc. | Cluster compression for compressing weights in neural networks |
CN111937009A (en) * | 2018-04-05 | 2020-11-13 | Arm有限公司 | Systolic convolutional neural network |
CN108710906B (en) * | 2018-05-11 | 2022-02-11 | 北方民族大学 | Real-time point cloud model classification method based on lightweight network LightPointNet |
CN108710906A (en) * | 2018-05-11 | 2018-10-26 | 北方民族大学 | Real-time point cloud model classification method based on lightweight network LightPointNet |
JP2022538735A (en) * | 2019-06-27 | 2022-09-06 | 中▲興▼通▲訊▼股▲ふぇん▼有限公司 | Data processing method, device, storage medium and electronic equipment |
WO2020259031A1 (en) * | 2019-06-27 | 2020-12-30 | 深圳市中兴微电子技术有限公司 | Data processing method and device, storage medium and electronic device |
JP7332722B2 (en) | 2019-06-27 | 2023-08-23 | セインチップス テクノロジー カンパニーリミテッド | Data processing method, device, storage medium and electronic equipment |
TWI727641B (en) * | 2020-02-03 | 2021-05-11 | 華邦電子股份有限公司 | Memory apparatus and operation method thereof |
US11562235B2 (en) | 2020-02-21 | 2023-01-24 | International Business Machines Corporation | Activation function computation for neural networks |
WO2022073160A1 (en) * | 2020-10-07 | 2022-04-14 | 浙江大学 | Encoding method, decoding method, encoder, decoder, and storage medium |
WO2022183345A1 (en) * | 2021-03-01 | 2022-09-09 | 浙江大学 | Encoding method, decoding method, encoder, decoder, and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107392305A (en) | Method and computer-readable medium for implementing and executing a neural network | |
US20180082181A1 (en) | Neural Network Reordering, Weight Compression, and Processing | |
CN110378468B (en) | Neural network accelerator based on structured pruning and low bit quantization | |
CN109472350B (en) | Neural network acceleration system based on block-circulant sparse matrix | |
CN111062472B (en) | Sparse neural network accelerator based on structured pruning and acceleration method thereof | |
KR20170128080A (en) | Method and apparatus for implementing neural network | |
US10534839B2 (en) | Method for matrix by vector multiplication for use in artificial neural network | |
US10599935B2 (en) | Processing artificial neural network weights | |
JP6998968B2 (en) | Deep neural network execution method, execution device, learning method, learning device and program | |
Isenburg et al. | Lossless compression of predicted floating-point geometry | |
JP6869676B2 (en) | Information processing equipment, information processing methods and programs | |
Yang et al. | Legonet: Efficient convolutional neural networks with lego filters | |
KR20160142791A (en) | Method and apparatus for implementing neural network | |
CN104704825B (en) | The lossless compression of segmented image data | |
CN109451308A (en) | Video compression method and device, electronic equipment and storage medium | |
WO2019234794A1 (en) | Arithmetic method | |
JP2014087058A (en) | Encoder, decoder and method thereof | |
IT202000018043A1 (en) | ARTIFICIAL NEURAL NETWORK PROCESSES AND PROCESSING SYSTEMS | |
CN113642726A (en) | System and method for compressing activation data | |
TWI745697B (en) | Computing system and compressing method thereof for neural network parameters | |
KR20230155417A (en) | Sparse matrix multiplication in hardware | |
US10559093B2 (en) | Selecting encoding options | |
CN111626415B (en) | High-efficiency matrix data format suitable for artificial neural network | |
US20220076122A1 (en) | Arithmetic apparatus and arithmetic method | |
Kekre et al. | Vector quantized codebook optimization using modified genetic algorithm |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
WW01 | Invention patent application withdrawn after publication | Application publication date: 20171124 |