CN108416423B - Automatic threshold for neural network pruning and retraining - Google Patents


Info

Publication number
CN108416423B
CN108416423B (application CN201810100412.0A)
Authority
CN
China
Prior art keywords
pruning
neural network
layer
weights
layers
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810100412.0A
Other languages
Chinese (zh)
Other versions
CN108416423A (en)
Inventor
冀正平
约翰·韦克菲尔德·布拉泽斯
伊利亚·奥夫相尼科夫
沈恩寿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Samsung Electronics Co Ltd
Original Assignee
Samsung Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Samsung Electronics Co Ltd filed Critical Samsung Electronics Co Ltd
Publication of CN108416423A publication Critical patent/CN108416423A/en
Application granted granted Critical
Publication of CN108416423B publication Critical patent/CN108416423B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/082Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

An embodiment includes a method comprising: pruning layers of a neural network having a plurality of layers using a threshold; and repeating pruning the layers of the neural network using different thresholds until a pruning error of the pruned layers reaches a pruning error tolerance.

Description

Automatic threshold for neural network pruning and retraining
Cross Reference to Related Applications
The present application claims priority from U.S. provisional patent application No. 62/457,806, filed on February 10, 2017, the contents of which are incorporated herein by reference in their entirety for all purposes.
Technical Field
The present disclosure relates to pruning and retraining of neural networks, and in particular, pruning and retraining of neural networks using automatic thresholds.
Background
Deep learning architectures, particularly convolutional deep neural networks, have been used in the fields of Artificial Intelligence (AI) and computer vision. These architectures have been shown to produce strong results on tasks including visual object recognition, detection, and segmentation. However, these architectures may have a large number of parameters, resulting in higher computational loads and increased power consumption.
Disclosure of Invention
An embodiment includes a method comprising: pruning layers of a neural network having a plurality of layers using a threshold; and repeating pruning the layers of the neural network using different thresholds until a pruning error of the pruned layers reaches a pruning error tolerance.
An embodiment includes a method comprising: the following steps are repeated: pruning multiple layers of the neural network using the automatically determined threshold; and retraining the neural network using only weights remaining after pruning.
Embodiments include a system comprising: a memory; and a processor coupled to the memory and configured to: pruning layers of a neural network having a plurality of layers using a threshold; and repeating pruning the layers of the neural network using different thresholds until a pruning error of the pruned layers reaches a pruning error tolerance.
Drawings
FIGS. 1A-1B are flowcharts of techniques for automatically determining thresholds, according to some embodiments.
FIG. 2 is a flow diagram of a retraining operation according to some embodiments.
FIGS. 3A-3B are flowcharts of retraining operations according to some embodiments.
FIG. 4 is a flow diagram of a technique for automatically determining thresholds, pruning, and retraining according to some embodiments.
FIG. 5 is a set of curves illustrating a retraining operation according to some embodiments.
FIG. 6 is a graph including results of various neural networks after pruning and retraining according to some embodiments.
FIGS. 7A-7C are graphs including results of various techniques for pruning a neural network, according to some embodiments.
FIG. 8 illustrates a system according to some embodiments.
Detailed Description
Embodiments relate to pruning and retraining of neural networks, and in particular, pruning and retraining of neural networks using automatic thresholds. The following description is presented to enable one of ordinary skill in the art to make and use the embodiments and is provided in the context of a patent application and its requirements. Various modifications to the embodiments and the generic principles and features described herein will be readily apparent. The embodiments are described primarily in terms of specific methods, apparatuses, and systems provided in the detailed description.
However, the methods, apparatuses, and systems will operate effectively in other embodiments. Phrases such as "an embodiment," "one embodiment," and "another embodiment" may refer to the same or different embodiments as well as to multiple embodiments. Embodiments will be described with respect to systems and/or devices having particular components. However, the systems and/or devices may include more or fewer components than those shown, and variations in the arrangement and type of components may be made without departing from the scope of the disclosure. Embodiments will also be described in the context of specific methods having certain operations. However, the methods and systems may operate according to other methods with different and/or additional operations, and with operations in different orders and/or in parallel, that are not inconsistent with the embodiments. Thus, the embodiments are not intended to be limited to the specific embodiments shown, but are to be accorded the widest scope consistent with the principles and features described herein.
Embodiments are described in the context of a particular system or device having certain components. One of ordinary skill in the art will readily recognize that embodiments are consistent with the use of systems or devices having other and/or additional components and/or other features. Methods, apparatus, and systems may also be described in the context of a single element. However, one of ordinary skill in the art will readily recognize that: these methods and systems are consistent with using an architecture having multiple elements.
Those skilled in the art will appreciate that: terms used herein and particularly in the appended claims (e.g., the text of the appended claims) are generally intended to be "open" terms (e.g., the term "including" should be interpreted as "including but not limited to," the term "having" should be interpreted as "having at least," the term "including" should be interpreted as "including but not limited to," etc.). Those skilled in the art will further understand that: if a specific number of an introduced claim recitation is intended, such an intent will be explicitly recited in the claim, and in the absence of such recitation no such intent is present. For example, as an aid to understanding, the following appended claims may contain usage of the introductory phrases "at least one" and "one or more" to introduce claim recitations. However, even when the same claim includes the introductory phrases "one or more" or "at least one" and indefinite articles such as "a" or "an" (e.g., "a" and/or "an" should be interpreted to mean "at least one" or "one or more"), the use of such phrases should not be construed to imply that the introduction of a claim recitation by the indefinite articles "a" or "an" limits any particular claim containing such introduced claim recitation to examples containing only one such recitation; the same holds true for the use of definite articles used to introduce claim recitations. Further, in those instances where a convention analogous to "at least one of A, B or C, etc." is used, such a construction in general is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B or C" would include but not be limited to systems that have a alone, B alone, C alone, a and B together, a and C together, B and C together, and/or A, B and C together, etc.). 
Those skilled in the art will further understand that: whether in the specification, claims, or drawings, virtually any disjunctive word and/or phrase presenting two or more alternative terms should be understood to contemplate the possibilities including one of the terms, either of the terms, or both terms. For example, the phrase "a or B" will be understood to include the possibilities of "a" or "B" or "a and B".
In some embodiments, a neural network (such as a deep learning neural network) may be created with a reduced parameter size. As a result, the performance level of an image recognition task can be maintained while reducing the load on the neural network hardware. In some embodiments, to reduce the parameter size of the neural network, the neural network may be pruned to zero out many parameters. However, the problem is how to set a good threshold for each layer of the neural network so as to prune the network as much as possible while maintaining its original performance. For a neural network consisting of tens of layers, a brute-force search for thresholds may not be practical, especially considering that the threshold of one layer may depend on the other layers. In addition, pruning may require retraining the network to restore the original performance, and such a pruning process may take a significant amount of time to verify as effective. As described herein, in various embodiments, automatic selection of thresholds, along with a method of retraining the network, may be used to prune a neural network to reduce its parameters.
FIG. 1A is a flow diagram of a technique for automatically determining a threshold according to some embodiments. At 100, a threshold for pruning a layer of a neural network is initialized. In some embodiments, the threshold may be initialized to an extreme of the range of values, e.g., 0 or 1. In other embodiments, the threshold may be initialized to a particular value such as 0.2, 0.5, etc. In other embodiments, the threshold may be set to a random value. In other embodiments, the threshold may be determined using empirical rules. In some embodiments, the threshold may be set to a value that was automatically determined for another layer, such as a similarly situated layer or a layer of a similar or the same type. In some embodiments, the initial threshold may be the same for all layers of the neural network; however, in other embodiments, the threshold may differ for some layers or for each layer.
In 102, a layer of the neural network is pruned using the threshold. For example, the threshold is used to set some of the weights for that layer to zero. At 104, a pruning error is calculated for the pruned layer. The pruning error is a function of the weights before and after pruning.
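As a minimal sketch of steps 102 and 104, pruning a layer and measuring its pruning error might look as follows. The element-wise rule and the error measure here are assumptions, since Equations 1 and 2 are not reproduced in this text; the threshold is scaled by the standard deviation of the layer's weights, as described later in this document.

```python
import math

def prune_layer(weights, threshold):
    """Zero out every weight whose magnitude falls below the threshold
    scaled by the standard deviation of the layer's weights."""
    n = len(weights)
    mean = sum(weights) / n
    sigma = math.sqrt(sum((w - mean) ** 2 for w in weights) / n)
    cutoff = threshold * sigma
    return [0.0 if abs(w) < cutoff else w for w in weights]

def pruning_error(original, pruned):
    """One plausible error measure: total magnitude removed, averaged
    over the surviving weights. The text says only that the error is a
    function of the weights before and after pruning."""
    removed = sum(abs(o - p) for o, p in zip(original, pruned))
    survivors = sum(1 for p in pruned if p != 0.0)
    return removed / max(survivors, 1)
```

For example, pruning `[0.5, -0.01, 0.3, 0.02, -0.4]` with a threshold of 0.1 zeroes only the two smallest weights, and the error reflects how much magnitude was discarded relative to the three survivors.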
At 106, the pruning error (PE) is compared to a pruning error tolerance (PEA). In some embodiments, the pruning error may have reached the pruning error tolerance if it is equal to the pruning error tolerance. However, in other embodiments, the pruning error may have reached the pruning error tolerance if it is within a range that includes the pruning error tolerance. Alternatively, the pruning error tolerance may itself be a range of acceptable pruning errors. For example, a relatively small number may be used to define a range above and below a particular pruning error tolerance. If the pruning error and the pruning error tolerance are separated by less than that relatively small number, the pruning error is considered to have reached the pruning error tolerance.
If the pruning error has not reached the pruning error tolerance, the threshold is changed at 108. Changing the threshold may be performed in various ways. For example, the threshold may be changed by a fixed amount. In other embodiments, the threshold may be changed by an amount based on the difference between the pruning error and the pruning error tolerance. In other embodiments, the threshold may be changed by an amount based on the current threshold. In other embodiments, another threshold may be selected using a search technique such as a binary search or another type of search. The technique for changing the threshold may (but need not) be the same for all layers.
Regardless of how the change is made, after the threshold is changed at 108, the process is repeated by pruning the layer at 102 and calculating the pruning error at 104. The pruning error is again compared to the pruning error tolerance to determine whether it has been reached. Thus, pruning of the layer of the neural network is repeated using different thresholds until the pruning error reaches the pruning error tolerance.
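The loop of 102 through 108 can be sketched as follows. This is a hedged sketch: `prune_fn` and `error_fn` are hypothetical stand-ins for the pruning and error steps, and the halving step is only one of the threshold-update strategies mentioned above.

```python
def find_threshold(weights, prune_fn, error_fn, tolerance,
                   init_threshold=0.5, init_step=0.25,
                   theta=0.05, max_iters=50):
    """Search for a per-layer threshold: prune, measure the pruning
    error, and move the threshold until the error reaches the
    tolerance (within theta). The step is halved each iteration, a
    binary-search-like variant of the fixed-step update."""
    t, step = init_threshold, init_step
    pruned = prune_fn(weights, t)
    for _ in range(max_iters):
        err = error_fn(weights, pruned)
        if abs(err - tolerance) <= theta:   # error reached tolerance
            break
        # prune more aggressively when under tolerance, less when over
        t = t + step if err < tolerance else t - step
        step /= 2.0
        pruned = prune_fn(weights, t)
    return t, pruned
```

With a toy error function equal to the fraction of zeroed weights, the loop converges to a threshold that zeroes the requested fraction of the layer.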
FIG. 1B is a flow diagram of a technique for automatically determining a threshold according to some embodiments. In this embodiment, the pruning technique is similar to that of FIG. 1A; a description of similar operations is omitted. In some embodiments, a pruning error tolerance is initialized at 101. In other embodiments, a percentage or percentage range of non-zero weights may be initialized. In other embodiments, a combination of the threshold, the pruning error tolerance, and the percentage or percentage range of non-zero weights may be initialized. The pruning error tolerance and/or the percentage or percentage range of non-zero weights may be initialized using techniques similar to those described above for initializing the threshold.
At 110, after the pruning error reaches the pruning error tolerance, the percentage of non-zero weights is calculated and compared to an acceptable percentage or percentage range of non-zero weights. In this embodiment, the number of pruned weights is expressed as a percentage; however, in other embodiments, it may be represented in a different manner. Although the percentage of non-zero weights has been used as an example, in other embodiments the percentage of pruned weights may be used and compared to a corresponding range or value.
If the percentage is not within the range for that layer, the pruning error tolerance is changed at 112. The pruning error tolerance may be changed using techniques such as those described above for changing the threshold in 108; the technique may be the same as or different from the one used to change the threshold. The technique for changing the pruning error tolerance may (but need not) be the same for all layers.
After the pruning error tolerance has changed, the layer may be pruned again at 102. Subsequent operations may be performed until the percentage of pruned weights is within the acceptable range. At 114, the next layer may be processed similarly. Thus, this process may be repeated for each layer of the neural network.
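The outer loop of FIG. 1B can be sketched as follows, assuming a hypothetical `search_fn` stand-in for the inner threshold search of FIG. 1A; the fixed adjustment amount `tau` is an illustrative choice.

```python
def tune_tolerance(weights, search_fn, p_range,
                   tolerance=8, tau=1, max_iters=100):
    """Outer loop over the pruning-error tolerance: after the inner
    threshold search converges, check the fraction p of non-zero
    weights and move the tolerance down or up by tau until p falls
    inside the acceptable range."""
    lo, hi = p_range
    for _ in range(max_iters):
        threshold, pruned = search_fn(weights, tolerance)
        p = sum(1 for w in pruned if w != 0.0) / len(pruned)
        if lo <= p <= hi:
            break
        # too few survivors -> decrease the tolerance; too many -> increase it
        tolerance = tolerance - tau if p < lo else tolerance + tau
    return tolerance, threshold, pruned
```

A toy `search_fn` that simply zeroes weights below the tolerance shows the loop walking the tolerance down until the non-zero fraction lands in the target range.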
Techniques according to some embodiments allow the process to start from a single threshold and/or a single pruning error tolerance for all layers of the neural network. However, each layer will eventually have a threshold automatically determined for that particular layer. If a single fixed threshold were used for two or more layers, up to all layers, the threshold may not be optimal for one or more of those layers. Furthermore, since pruning techniques according to some embodiments focus on a single layer at a time, the threshold may be determined specifically for that layer.
In some embodiments, the percentage of non-zero weights may be a single control, or a single type of control, for pruning the neural network. As described above, the pruning error tolerance is changed until the percentage of non-zero weights is within the desired range. Similarly, the threshold is changed until the pruning error reaches the pruning error tolerance. Thus, by setting the percentage of non-zero weights, the pruning error tolerance and the threshold will be changed to achieve the desired percentage.
In some embodiments, the pruning error tolerance and/or the threshold may also be initialized. For example, the pruning error tolerance may be initialized to bias the result of the operation toward a particular side of the percentage range of non-zero weights, or toward a particular location within the range.
In some embodiments, the threshold may be determined as follows. The pruning error tolerance ε* is initialized. For each layer l, the threshold T_l is initialized using the techniques described above. Each weight w_i of layer l is then pruned using the threshold T_l. Equation 1 is an example of how the weights may be pruned.
In some embodiments, the threshold T_l may be scaled by a scaling factor. Here, the threshold T_l is scaled by σ(w), the standard deviation of all weights within the layer. However, in other embodiments, the threshold T_l may be scaled by a different scaling factor.
Once the layer is pruned, a pruning error ε is calculated. Equation 2 is an example of how the pruning error may be calculated.
Here, w_pruned is the vector of pruned weights and w is the vector of original weights before pruning. D(w) is the total length of w. Thus, the resulting pruning error ε is based on both the amount of error and the number of pruned weights.
The pruning error ε may be compared with the pruning error tolerance ε*. Equation 3 is an example of such a comparison.
|ε − ε*| > θ (3)
Here, θ is a number defining a range centered on the pruning error tolerance ε*. In some embodiments, θ is 0; however, in other embodiments, θ is a relatively small number. In still other embodiments, θ is a number defining the size of the range.
If the difference between the pruning error ε and the pruning error tolerance ε* is smaller than θ, the pruning error ε has reached the pruning error tolerance ε*. If not, the threshold T_l may be varied as described in Equation 4.
Here, ζ is a constant by which the threshold T_l may be changed. As described above, in other embodiments the threshold T_l may be varied in different ways. For example, ζ may be a value that is progressively decreased by a factor of 2 at each iteration. Regardless, once the threshold T_l is changed, pruning and the subsequent steps may be performed as described above using the updated threshold T_l.
If the pruning error ε has reached the pruning error tolerance ε*, the percentage of non-zero weights may be checked. Equation 5 is an example of calculating the percentage p.
The percentage p is then compared to a range of acceptable percentages. In some embodiments, the range of acceptable percentages may be the same for all layers; however, in other embodiments, the range may be different. In particular, the range may depend on the type of layer. For example, for convolutional layers, the percentage p may range between 0.2 and 0.9, while for other layers (e.g., fully connected layers) the range may be between 0.04 and 0.2.
If the percentage p is less than the lower end of the range for that layer, the pruning error tolerance ε* is decreased as in Equation 6. Similarly, if the percentage is greater than the upper end of the range for that layer, the pruning error tolerance ε* is increased as in Equation 7.
ε* = ε* − τ (6)
ε* = ε* + τ (7)
After the pruning error tolerance ε* has changed, the pruning may be repeated until the pruning error ε reaches the new pruning error tolerance ε*. In some embodiments, the threshold T_l from the previous iteration may be maintained; however, in other embodiments, the threshold T_l may be different, for example reinitialized to the original initial value or initialized according to an initialization algorithm. For example, the threshold T_l for the next iteration may be initialized based on the past threshold T_l but adjusted in a direction expected to reduce the number of pruning iterations needed to reach the new pruning error tolerance ε*.
The above technique may be repeated until the percentage p is within the acceptable range for the layer. The operation may then be repeated for other layers of the neural network. In some embodiments, the initial pruning error tolerance ε* and the initial threshold T_l for a layer may be selected with or without depending on previously pruned layers. For example, for two similarly situated layers, the later-pruned layer may use the resulting pruning error tolerance ε* and threshold T_l from the earlier-pruned layer.
Thus, by pruning according to the techniques described herein, in some embodiments the pruning threshold for each layer may be automatically determined. That is, the threshold may be determined so that the pruned layer retains a particular range of non-zero weights and/or satisfies a particular pruning error tolerance. The threshold may be different for one or more layers (including all layers), depending on the particular layer.
FIG. 2 is a flow diagram of a retraining operation according to some embodiments. In 200, various parameters may be initialized. For example, a base learning rate, a counter for the number of iterations, etc. may be initialized.
In 202, layers of the neural network are pruned using automatically determined thresholds. Specifically, the threshold for a layer may be automatically generated as described above. In some embodiments, all layers may be pruned; however, as will be described in further detail below, in some embodiments fewer than all of the layers may be pruned.
As a result of the pruning, only the non-zero weights of the neural network remain. At 204, the neural network is retrained using those non-zero weights. The pruning and retraining operations are repeated until the desired number of iterations is completed. For example, at 206, the number of iterations may be compared to the required number. If the number of iterations has not reached the desired number, the pruning and retraining may be repeated.
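The alternation of FIG. 2 can be sketched as follows. Here `retrain_fn` is a toy stand-in for a full retraining pass (an assumption, not the patent's training procedure); the point illustrated is that re-applying the prune step can zero weights that retraining pushed below the threshold, while previously zeroed weights stay zero.

```python
def prune_and_retrain(weights, threshold, retrain_fn, n_iters):
    """Alternate pruning and retraining of the surviving weights."""
    for _ in range(n_iters):
        # prune with the (automatically determined) threshold
        weights = [0.0 if abs(w) < threshold else w for w in weights]
        # retrain only the remaining non-zero weights
        weights = [retrain_fn(w) if w != 0.0 else 0.0 for w in weights]
    return weights
```

For example, with a toy update that halves each surviving weight, a weight that starts above the threshold can fall below it and be removed on the next pruning pass.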
FIGS. 3A-3B are flowcharts of retraining operations according to some embodiments. Referring to FIG. 3A, various parameters may be initialized in 300, similar to those described above in 200. In 302, the convolutional (CONV) layers are pruned using the thresholds automatically determined for those layers. Although convolutional layers have been used as an example, in other embodiments other subsets of layers, whether of different types or containing fewer than all of the layers, may be pruned.
At 304, the neural network is retrained using non-zero weights. In some embodiments, retraining continues for a particular number of iterations. In other embodiments, retraining continues until the retraining has covered all of the training sample set.
In 306, the number of iterations is compared to a threshold. If the number of iterations is less than the threshold, the pruning and retraining in 302 and 304 are repeated. Specifically, after the retraining in 304, when the pruning in 302 is performed again, some non-zero weights that previously survived the earlier pruning operation may have been reduced below the pruning threshold of the associated layer. Those weights may therefore be set to zero, and the remaining non-zero weights retrained in 304.
If the number of iterations has reached the threshold in 306, a set of layers in the neural network having a different type than the layers pruned in 302 is fixed in 308. That is, during the subsequent retraining in 304, the fixed layers are not retrained. In some embodiments, the fully connected (FC) and input (IP) layers are fixed. The pruning at 302 and retraining at 304 may be repeated until the desired number of iterations is completed at 310.
Referring to FIG. 3B, at 312, the layers pruned at 302 of FIG. 3A are fixed. In this example, the convolutional layers are the layers pruned at 302. Thus, the convolutional layers are fixed at 312.
At 314, the layers fixed at 308 are pruned using the automatically determined thresholds associated with those layers. In this example, these are the FC/IP layers that were fixed in step 308.
At 316, the retraining rate may be adjusted based on the pruning rate. In particular, since pruning reduces the number of weights, the dropout rate may be changed accordingly to suit the lower number of non-zero weights.
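The text does not give the adjustment formula. One commonly used heuristic from the pruning literature, shown here purely as an assumption, scales the dropout rate by the square root of the surviving-connection ratio:

```python
import math

def adjusted_dropout(base_rate, n_original, n_remaining):
    """Hypothetical adjustment: scale the dropout rate by the square
    root of the fraction of connections that survived pruning. The
    patent text says only that the rate changes to suit fewer
    non-zero weights; this formula is not taken from the patent."""
    return base_rate * math.sqrt(n_remaining / n_original)
```

For instance, if pruning keeps 25 of 100 connections, a base dropout rate of 0.5 would be reduced to 0.25 under this heuristic.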
At 318, the neural network is retrained. However, as noted at 312, the convolutional layers are fixed and therefore are not retrained. At 320, if the number of iterations has not been completed, the pruning and retraining at 314 and 318 are repeated.
In some embodiments, other than the initialization in 300, the remaining operations in FIG. 3A may not be performed. That is, operations may begin at 312, where the convolutional layers are fixed. In some embodiments, the convolutional layers may be pruned using their respective automatically determined thresholds before being fixed.
Although particular types of layers have been used as examples of layers that are pruned, retrained, and fixed, in other embodiments the types may be different. Further, in some embodiments, the first set of layers pruned at 302 and fixed at 312 and the second set of layers fixed at 308 and pruned at 314 may together form the entire set of layers. However, in other embodiments, the pruning, retraining, and fixing of other sets of layers need not follow the techniques used for the first or second sets. For example, a third set of layers may be pruned at 302 but not fixed at 312.
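The two-stage schedule of FIGS. 3A-3B can be sketched as a simple partition of the layers into frozen and trainable sets. The layer names and type strings below are illustrative only; in stage 1 the FC/IP layers are fixed while the CONV layers are pruned and retrained, and in stage 2 the roles swap.

```python
def frozen_layers(layers, stage):
    """Return (frozen, trainable) layer-name sets for the given stage.
    'layers' maps a layer name to its type string; names and types
    here are hypothetical, not taken from the patent."""
    if stage == 1:
        frozen = {n for n, kind in layers.items() if kind in ("FC", "IP")}
    else:
        frozen = {n for n, kind in layers.items() if kind == "CONV"}
    trainable = set(layers) - frozen
    return frozen, trainable
```

In a real framework, the frozen set would correspond to parameters excluded from gradient updates during the retraining passes.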
FIG. 4 is a flow diagram of a technique for automatically determining thresholds, pruning, and retraining according to some embodiments. In some embodiments, at 400, a pruning threshold is automatically determined. The threshold may be automatically determined as described above. In 402, those automatically determined thresholds are used to trim and retrain the neural network. In some embodiments, the threshold may be automatically determined in 400 by multiple iterations. After these iterations are completed, the resulting threshold is used to iteratively prune and retrain the neural network at 402.
FIG. 5 is a set of curves illustrating a retraining operation according to some embodiments. These graphs illustrate the pruning and retraining of the GoogLeNet neural network. Specifically, the graphs show the loss variation, top-1 accuracy, and top-5 accuracy. Two pruning operations are shown here: one before the first training iteration and a second after a certain number of training iterations have been performed. Although two pruning operations are shown, any number of pruning operations may be performed.
FIG. 6 is a graph including results of various neural networks after pruning and retraining according to some embodiments. In particular, the size of the weight parameters, top-1 accuracy, and top-5 accuracy are shown for various neural networks and for pruned versions of those networks. AlexNet, VGG16, SqueezeNet, and GoogLeNet are listed, as well as pruned versions of AlexNet and VGG16. The pruned GoogLeNet entry illustrates the training and inference networks of GoogLeNet pruned as described herein. As shown, GoogLeNet pruned as described herein has the smallest weight-parameter size while providing higher accuracy. Specifically, the pruned training and inference neural network is able to achieve top-5 accuracy in excess of 89% with the fewest weight parameters.
FIGS. 7A-7C are graphs including results of various techniques for pruning a neural network, according to some embodiments. These charts list the layers and sublayers of the GoogLeNet neural network and the results of various pruning techniques. Two examples of pre-fixed thresholds are shown, including the resulting total weights after pruning and the top-1 and top-5 performance. Another example illustrates the result of using thresholds generated by empirical rules. Finally, the last example shows the pruning results according to the embodiments described herein. These results show that pruning as described herein can achieve accuracy comparable to the unpruned network, with fewer weights than the pre-fixed thresholds. Compared with thresholds generated by empirical rules, the pruning described herein achieves similar or higher accuracy with a similar total number of weights. However, pruning as described herein may be performed without selection techniques or rules for pre-selecting the pruning threshold. That is, multiple iterations with pre-fixed thresholds need not be performed, and similar and/or better results may be achieved without the empirical information needed to generate such rules.
Fig. 8 illustrates a system according to some embodiments. The system 800 includes a processor 802 and a memory 804. The processor 802 may be a general purpose processor, a Digital Signal Processor (DSP), an application specific integrated circuit, a microcontroller, a programmable logic device, a discrete circuit, a combination of these, or the like. The processor 802 may include internal parts such as registers, cache memory, processing cores, etc., and may also include external interfaces such as address and data bus interfaces, interrupt interfaces, etc. Although only one processor 802 is shown in system 800, multiple processors 802 may be present. In addition, other interface devices, such as a logic chipset, hub, memory controller, communication interface, etc., may be part of the system 800 to connect the processor 802 to internal and external components.
The memory 804 may be any device capable of storing data. Here, one memory 804 is shown for system 800; however, any number of memories 804 may be included in system 800, including different types of memory. Examples of memory 804 include Dynamic Random Access Memory (DRAM) modules, Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM) according to various standards such as DDR, DDR2, DDR3, and DDR4, Static Random Access Memory (SRAM), nonvolatile memory such as flash memory, Spin-Transfer Torque Magnetoresistive Random Access Memory (STT-MRAM) or Phase Change RAM, magnetic or optical media, and the like.
The memory 804 may be configured to store code that, when executed on the processor 802, causes the system 800 to implement any or all of the techniques described herein. In some embodiments, the system 800 may be configured to receive an input 806, such as a neural network, an initial threshold, an initial pruning error tolerance, an acceptable pruning percentage range, and the like. The output 808 may include automatically determined thresholds, the pruned and retrained neural network, or other result information described above.
Although the method and system have been described in terms of particular embodiments, those of ordinary skill in the art will readily recognize that there could be variations to the embodiments disclosed and, therefore, any variations would be considered to be within the spirit and scope of the methods and systems disclosed herein. Accordingly, many modifications may be made by one of ordinary skill in the art without departing from the spirit and scope of the appended claims.

Claims (19)

1. A method of pruning a neural network for image recognition using a threshold, comprising:
pruning layers of a neural network having a plurality of layers using a threshold; and
repeating the pruning of the layers of the neural network using different thresholds until a pruning error of the pruned layers reaches a pruning error tolerance,
wherein the pruning error is calculated by dividing the magnitude of the difference between the length of the vector of weights of the layer from a previous iteration and the length of the vector of pruned weights of the layer from a current iteration by the magnitude of the difference between the total length of the vector of initial weights of the layer and the length of the vector of pruned weights of the layer from the current iteration.
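As a non-authoritative sketch of the iteration in claim 1 (and the threshold adjustment of claim 8), the search below reads the claim's vector "length" as an L2 norm and uses a bisection to pick each new threshold; both choices are assumptions, since the claim only requires repeating the pruning with different thresholds, and all names are hypothetical:

```python
import math

def l2(v):
    """L2 norm; the claim's vector 'length' is read as an L2 norm (assumption)."""
    return math.sqrt(sum(x * x for x in v))

def pruning_error(init_w, prev_w, pruned_w):
    """Pruning error per the 'wherein' clause of claim 1: |(||prev|| - ||pruned||)|
    divided by |(||init|| - ||pruned||)|."""
    num = abs(l2(prev_w) - l2(pruned_w))
    den = abs(l2(init_w) - l2(pruned_w))
    return num / den if den else 0.0

def prune(weights, threshold):
    """Magnitude pruning: zero every weight whose magnitude is below the threshold."""
    return [0.0 if abs(w) < threshold else w for w in weights]

def find_threshold(weights, tolerance, iters=40):
    """Repeat pruning with different thresholds until the pruning error reaches
    the tolerance: raise the threshold while the error is below the tolerance
    and lower it once the error exceeds it (claim 8)."""
    prev_w = weights
    lo, hi = 0.0, max(abs(w) for w in weights)
    threshold = 0.5 * (lo + hi)
    for _ in range(iters):
        pruned_w = prune(weights, threshold)
        if pruning_error(weights, prev_w, pruned_w) < tolerance:
            lo = threshold      # error below tolerance: raise the threshold
        else:
            hi = threshold      # error above tolerance: lower the threshold
        prev_w = pruned_w
        threshold = 0.5 * (lo + hi)
    return threshold
```

The bisection bounds (zero to the largest weight magnitude) are illustrative; any schedule of "different thresholds" that respects the increase/decrease rule of claim 8 would fit the claim language equally well.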
2. The method of claim 1, further comprising, for each layer of the neural network:
initializing the pruning error tolerance;
initializing the threshold; and
repeating the following steps until the percentage of pruned weights for the layer is within the range for the layer:
repeating the following steps until the pruning error reaches the pruning error tolerance:
pruning the layer of the neural network using the threshold;
calculating a pruning error of the pruned layer;
comparing the pruning error with the pruning error tolerance; and
changing the threshold in response to the comparison;
calculating the percentage of pruned weights of the pruned layer; and
changing the pruning error tolerance in response to the percentage of pruned weights.
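The nested iteration of claim 2 could be sketched roughly as follows: an inner loop grows the threshold until the pruning error reaches the current tolerance, and an outer loop then loosens or tightens the tolerance until the percentage of pruned weights lands in the layer's target range. The concrete update factors, the simplified error measure, and re-initializing the threshold on each outer pass are all illustrative assumptions, not the patent's exact procedure:

```python
import math

def tune_layer(weights, err_tol=0.05, pct_range=(0.2, 0.9),
               max_outer=30, max_inner=60):
    """Per-layer automatic threshold selection in the spirit of claim 2."""
    def norm(v):
        return math.sqrt(sum(x * x for x in v))

    w_norm = norm(weights) or 1.0
    pruned = list(weights)
    threshold = 1e-4
    for _ in range(max_outer):
        threshold = 1e-4              # re-initialized each pass (simplification)
        for _ in range(max_inner):
            pruned = [0.0 if abs(w) < threshold else w for w in weights]
            err = abs(w_norm - norm(pruned)) / w_norm  # simplified error (assumption)
            if err >= err_tol:        # pruning error reached the tolerance
                break
            threshold *= 1.5          # otherwise raise the threshold and retry
        pct = sum(1 for w in pruned if w == 0.0) / len(pruned)
        if pct < pct_range[0]:
            err_tol *= 2.0            # too few weights pruned: loosen tolerance
        elif pct > pct_range[1]:
            err_tol /= 2.0            # too many weights pruned: tighten it
        else:
            break                     # percentage within the layer's range
    return threshold, pruned
```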
3. The method of claim 1, further comprising:
repeating the pruning of the layer using different pruning error tolerances until the percentage of pruned weights for the layer is within a range for the layer.
4. The method of claim 3, wherein different types of layers of the neural network have different ranges for the percentage of pruned weights.
5. The method of claim 1, wherein pruning the layers of the neural network comprises: setting a weight to zero if the magnitude of the weight is less than the threshold.
6. The method of claim 1, wherein pruning the layers of the neural network comprises: setting a weight to zero if the magnitude of the weight is less than the threshold scaled by a scaling factor.
7. The method of claim 6, wherein the scaling factor is a standard deviation of the weights of the layer.
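Claims 5-7 describe the pruning step itself: a weight is zeroed when its magnitude falls below the threshold, optionally scaled by the standard deviation of the layer's weights. A minimal sketch (function and parameter names are hypothetical):

```python
import math

def prune_layer(weights, threshold, scale_by_std=False):
    """Zero each weight whose magnitude is below the threshold (claim 5);
    when scale_by_std is set, the threshold is scaled by the standard
    deviation of the layer's weights (claims 6 and 7)."""
    scale = 1.0
    if scale_by_std:
        mean = sum(weights) / len(weights)
        scale = math.sqrt(sum((w - mean) ** 2 for w in weights) / len(weights))
    cutoff = threshold * scale
    return [0.0 if abs(w) < cutoff else w for w in weights]
```

For example, `prune_layer([0.1, -2.0, 0.05, 1.5], 0.5)` zeroes the first and third weights and leaves the other two untouched.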
8. The method of claim 1, further comprising performing the following operations to produce the different thresholds:
increasing the threshold if the pruning error is less than the pruning error tolerance; and
decreasing the threshold if the pruning error is greater than the pruning error tolerance.
9. The method of claim 1, further comprising: performing the repeated pruning of the layers of the neural network for each layer of the neural network.
10. The method of claim 1, further comprising: after the pruning error of the pruned layer reaches the pruning error tolerance, iteratively pruning and retraining the neural network using the threshold.
11. A method of pruning a neural network for image recognition using a threshold, comprising:
the following steps are repeated:
pruning layers of a neural network having a plurality of layers using an automatically determined threshold until a pruning error of the pruned layers reaches a pruning error tolerance; and
retraining the neural network using only the weights remaining after pruning,
wherein the pruning error is calculated by dividing the magnitude of the difference between the length of the vector of weights of the layer from a previous iteration and the length of the vector of pruned weights of the layer from a current iteration by the magnitude of the difference between the total length of the vector of initial weights of the layer and the length of the vector of pruned weights of the layer from the current iteration.
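Claim 11's alternation of pruning and retraining, in which only surviving weights are updated, is commonly implemented by masking the weight updates. The sketch below uses plain gradient descent on a toy squared-error loss against fixed targets; the "training" here is purely illustrative and is not the patent's training procedure:

```python
def prune_and_retrain(weights, targets, threshold, rounds=3, steps=20, lr=0.1):
    """Alternate pruning with retraining in which only the weights that
    survive pruning are updated; zeroed weights stay zero (claim 11)."""
    for _ in range(rounds):
        # Prune: keep only weights at or above the threshold.
        mask = [1.0 if abs(w) >= threshold else 0.0 for w in weights]
        weights = [w * m for w, m in zip(weights, mask)]
        for _ in range(steps):
            # Gradient step on 0.5 * sum((w - t)^2), masked so that
            # pruned weights are never updated during retraining.
            weights = [w - lr * (w - t) * m
                       for w, t, m in zip(weights, targets, mask)]
    return weights
```

In a real framework the same effect is usually achieved by multiplying the gradient (or the weight tensor) by a fixed binary mask after every optimizer step.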
12. The method of claim 11, wherein pruning the layers of the neural network comprises: pruning layers of the neural network having a first type using the automatically determined threshold.
13. The method of claim 12, further comprising:
fixing weights of layers of the neural network having a second type different from the first type;
repeating the following steps:
pruning the layers having the first type; and
retraining the neural network using only weights remaining after pruning;
fixing weights of the layers of the neural network having the first type; and
repeating the following steps:
pruning the layers of the neural network having the second type; and
retraining the neural network using only weights remaining after pruning.
14. The method of claim 12, further comprising: fixing weights of layers of the neural network having a second type different from the first type.
15. The method of claim 14, further comprising:
fixing weights of the layers of the neural network having the first type; and
repeating the following steps:
pruning layers of the neural network having the second type; and
retraining the neural network using only weights remaining after pruning.
16. The method of claim 11, further comprising: generating the automatically determined threshold prior to retraining the neural network.
17. The method of claim 11, further comprising: adjusting a dropout rate for the retraining in response to a pruning rate of the pruning.
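Claim 17 couples the retraining dropout rate to how heavily the layer was pruned. One well-known heuristic from earlier magnitude-pruning literature, which is not necessarily the rule claimed here, scales dropout by the square root of the fraction of connections that survive:

```python
import math

def adjusted_dropout(original_dropout, kept_fraction):
    """Reduce dropout as pruning thins the layer: with fewer surviving
    connections, less regularization is needed. The sqrt scaling is an
    assumption borrowed from prior pruning work, not the patent's rule."""
    return original_dropout * math.sqrt(kept_fraction)
```

For instance, pruning away 75% of a layer's connections (a kept fraction of 0.25) would halve a 0.5 dropout rate to 0.25 for retraining.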
18. A system for pruning a neural network for image recognition using a threshold, comprising:
a memory; and
a processor coupled to the memory and configured to:
pruning layers of a neural network having a plurality of layers using a threshold; and
repeating the pruning of the layers of the neural network using different thresholds until a pruning error of a pruned layer reaches a pruning error tolerance,
wherein the pruning error is calculated by dividing the magnitude of the difference between the length of the vector of weights of the layer from a previous iteration and the length of the vector of pruned weights of the layer from a current iteration by the magnitude of the difference between the total length of the vector of initial weights of the layer and the length of the vector of pruned weights of the layer from the current iteration.
19. The system of claim 18, wherein the processor is further configured to, for each layer of the neural network:
initializing the pruning error tolerance;
initializing the threshold; and
repeating the following steps until the percentage of pruned weights for the layer is within the range for the layer:
repeating the following steps until the pruning error reaches the pruning error tolerance:
pruning the layer of the neural network using the threshold;
calculating a pruning error of the pruned layer;
comparing the pruning error with the pruning error tolerance; and
changing the threshold in response to the comparison;
calculating the percentage of pruned weights of the pruned layer; and
changing the pruning error tolerance in response to the percentage of pruned weights.
CN201810100412.0A 2017-02-10 2018-01-31 Automatic threshold for neural network pruning and retraining Active CN108416423B (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US201762457806P 2017-02-10 2017-02-10
US62/457,806 2017-02-10
US15/488,430 2017-04-14
US15/488,430 US10832135B2 (en) 2017-02-10 2017-04-14 Automatic thresholds for neural network pruning and retraining

Publications (2)

Publication Number Publication Date
CN108416423A CN108416423A (en) 2018-08-17
CN108416423B true CN108416423B (en) 2024-01-12

Family

ID=63104667

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810100412.0A Active CN108416423B (en) 2017-02-10 2018-01-31 Automatic threshold for neural network pruning and retraining

Country Status (3)

Country Link
US (2) US10832135B2 (en)
KR (1) KR102566480B1 (en)
CN (1) CN108416423B (en)

Families Citing this family (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10832135B2 (en) * 2017-02-10 2020-11-10 Samsung Electronics Co., Ltd. Automatic thresholds for neural network pruning and retraining
KR102413028B1 (en) * 2017-08-16 2022-06-23 에스케이하이닉스 주식회사 Method and device for pruning convolutional neural network
KR20190051697A (en) 2017-11-07 2019-05-15 삼성전자주식회사 Method and apparatus for performing devonvolution operation in neural network
US10776662B2 (en) * 2017-11-09 2020-09-15 Disney Enterprises, Inc. Weakly-supervised spatial context networks to recognize features within an image
JP6831347B2 (en) * 2018-04-05 2021-02-17 日本電信電話株式会社 Learning equipment, learning methods and learning programs
US20190378013A1 (en) * 2018-06-06 2019-12-12 Kneron Inc. Self-tuning model compression methodology for reconfiguring deep neural network and electronic device
US11010132B2 (en) * 2018-09-28 2021-05-18 Tenstorrent Inc. Processing core with data associative adaptive rounding
US11281974B2 (en) * 2018-10-25 2022-03-22 GM Global Technology Operations LLC Activation zero-bypass and weight pruning in neural networks for vehicle perception systems
US20210397962A1 (en) * 2018-10-31 2021-12-23 Nota, Inc. Effective network compression using simulation-guided iterative pruning
US11663001B2 (en) * 2018-11-19 2023-05-30 Advanced Micro Devices, Inc. Family of lossy sparse load SIMD instructions
KR20200066953A (en) 2018-12-03 2020-06-11 삼성전자주식회사 Semiconductor memory device employing processing in memory (PIM) and operating method for the same
KR102163498B1 (en) * 2018-12-24 2020-10-08 아주대학교산학협력단 Apparatus and method for pruning-retraining of neural network
CN109800859B (en) * 2018-12-25 2021-01-12 深圳云天励飞技术有限公司 Neural network batch normalization optimization method and device
CN109948795B (en) * 2019-03-11 2021-12-14 驭势科技(北京)有限公司 Method and device for determining network structure precision and delay optimization point
JP7150651B2 (en) * 2019-03-22 2022-10-11 株式会社日立ソリューションズ・テクノロジー Neural network model reducer
CN110276452A (en) * 2019-06-28 2019-09-24 北京中星微电子有限公司 Pruning method, device, equipment and the artificial intelligence chip of neural network model
CN110599458A (en) * 2019-08-14 2019-12-20 深圳市勘察研究院有限公司 Underground pipe network detection and evaluation cloud system based on convolutional neural network
CN111553480B (en) * 2020-07-10 2021-01-01 腾讯科技(深圳)有限公司 Image data processing method and device, computer readable medium and electronic equipment
KR20220045424A (en) * 2020-10-05 2022-04-12 삼성전자주식회사 Method and apparatus of compressing artificial neural network
KR102614909B1 (en) * 2021-03-04 2023-12-19 삼성전자주식회사 Neural network operation method and appratus using sparsification
CN113963175A (en) * 2021-05-13 2022-01-21 北京市商汤科技开发有限公司 Image processing method and device, computer equipment and storage medium
JP2023063944A (en) * 2021-10-25 2023-05-10 富士通株式会社 Machine learning program, method for machine learning, and information processing apparatus
US20230153625A1 (en) * 2021-11-17 2023-05-18 Samsung Electronics Co., Ltd. System and method for torque-based structured pruning for deep neural networks

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5787408A (en) * 1996-08-23 1998-07-28 The United States Of America As Represented By The Secretary Of The Navy System and method for determining node functionality in artificial neural networks
CN105352907A (en) * 2015-11-27 2016-02-24 南京信息工程大学 Infrared gas sensor based on radial basis network temperature compensation and detection method
CN105447498A (en) * 2014-09-22 2016-03-30 三星电子株式会社 A client device configured with a neural network, a system and a server system
CN105512725A (en) * 2015-12-14 2016-04-20 杭州朗和科技有限公司 Neural network training method and equipment
CN106127217A (en) * 2015-05-07 2016-11-16 西门子保健有限责任公司 The method and system that neutral net detects is goed deep into for anatomical object for approximation
CN106295800A (en) * 2016-07-28 2017-01-04 北京工业大学 A kind of water outlet total nitrogen TN intelligent detecting method based on recurrence Self organizing RBF Neural Network

Family Cites Families (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5288645A (en) * 1992-09-04 1994-02-22 Mtm Engineering, Inc. Hydrogen evolution analyzer
US6324532B1 (en) 1997-02-07 2001-11-27 Sarnoff Corporation Method and apparatus for training a neural network to detect objects in an image
NZ503882A (en) * 2000-04-10 2002-11-26 Univ Otago Artificial intelligence system comprising a neural network with an adaptive component arranged to aggregate rule nodes
US7031948B2 (en) * 2001-10-05 2006-04-18 Lee Shih-Jong J Regulation of hierarchic decisions in intelligent systems
WO2003079286A1 (en) * 2002-03-15 2003-09-25 Pacific Edge Biotechnology Limited Medical applications of adaptive learning systems using gene expression data
DE60217663T2 (en) * 2002-03-26 2007-11-22 Council Of Scientific And Industrial Research IMPROVED ARTIFICIAL NEURONAL NETWORK MODELS IN THE PRESENCE OF INSTRUMENT NOISE AND MEASUREMENT ERRORS
JP2005523533A (en) * 2002-04-19 2005-08-04 コンピュータ アソシエイツ シンク,インコーポレイテッド Processing mixed numeric and / or non-numeric data
WO2003094051A1 (en) * 2002-04-29 2003-11-13 Laboratory For Computational Analytics And Semiotics, Llc Sequence miner
US20080172214A1 (en) * 2004-08-26 2008-07-17 Strategic Health Decisions, Inc. System For Optimizing Treatment Strategies Using a Patient-Specific Rating System
US8700552B2 (en) * 2011-11-28 2014-04-15 Microsoft Corporation Exploiting sparseness in training deep neural networks
CN103136247B (en) * 2011-11-29 2015-12-02 阿里巴巴集团控股有限公司 Attribute data interval division method and device
US20140006471A1 (en) * 2012-06-27 2014-01-02 Horia Margarit Dynamic asynchronous modular feed-forward architecture, system, and method
US10055434B2 (en) 2013-10-16 2018-08-21 University Of Tennessee Research Foundation Method and apparatus for providing random selection and long-term potentiation and depression in an artificial network
US9730643B2 (en) * 2013-10-17 2017-08-15 Siemens Healthcare Gmbh Method and system for anatomical object detection using marginal space deep neural networks
WO2016010601A2 (en) * 2014-04-23 2016-01-21 The Florida State University Research Foundation, Inc. Adaptive nonlinear model predictive control using a neural network and input sampling
US9672474B2 (en) * 2014-06-30 2017-06-06 Amazon Technologies, Inc. Concurrent binning of machine learning data
US10650805B2 (en) 2014-09-11 2020-05-12 Nuance Communications, Inc. Method for scoring in an automatic speech recognition system
US11423311B2 (en) * 2015-06-04 2022-08-23 Samsung Electronics Co., Ltd. Automatic tuning of artificial neural networks
US10460230B2 (en) * 2015-06-04 2019-10-29 Samsung Electronics Co., Ltd. Reducing computations in a neural network
US20180082181A1 (en) * 2016-05-13 2018-03-22 Samsung Electronics, Co. Ltd. Neural Network Reordering, Weight Compression, and Processing
US10832135B2 (en) * 2017-02-10 2020-11-10 Samsung Electronics Co., Ltd. Automatic thresholds for neural network pruning and retraining

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5787408A (en) * 1996-08-23 1998-07-28 The United States Of America As Represented By The Secretary Of The Navy System and method for determining node functionality in artificial neural networks
CN105447498A (en) * 2014-09-22 2016-03-30 三星电子株式会社 A client device configured with a neural network, a system and a server system
CN106127217A (en) * 2015-05-07 2016-11-16 西门子保健有限责任公司 The method and system that neutral net detects is goed deep into for anatomical object for approximation
CN105352907A (en) * 2015-11-27 2016-02-24 南京信息工程大学 Infrared gas sensor based on radial basis network temperature compensation and detection method
CN105512725A (en) * 2015-12-14 2016-04-20 杭州朗和科技有限公司 Neural network training method and equipment
CN106295800A (en) * 2016-07-28 2017-01-04 北京工业大学 A kind of water outlet total nitrogen TN intelligent detecting method based on recurrence Self organizing RBF Neural Network

Also Published As

Publication number Publication date
KR102566480B1 (en) 2023-08-11
CN108416423A (en) 2018-08-17
US10832135B2 (en) 2020-11-10
US20180232640A1 (en) 2018-08-16
KR20180092810A (en) 2018-08-20
US20200410357A1 (en) 2020-12-31

Similar Documents

Publication Publication Date Title
CN108416423B (en) Automatic threshold for neural network pruning and retraining
US11948070B2 (en) Hardware implementation of a convolutional neural network
CN112085188B (en) Method for determining quantization parameter of neural network and related product
CN108170667B (en) Word vector processing method, device and equipment
US20190114742A1 (en) Image upscaling with controllable noise reduction using a neural network
CN110942483B (en) Function rapid convergence model construction method, device and terminal
CN112200132A (en) Data processing method, device and equipment based on privacy protection
CN111062897B (en) Image equalization method, terminal and storage medium
CN111723550A (en) Statement rewriting method, device, electronic device, and computer storage medium
US11886832B2 (en) Operation device and operation method
US10997497B2 (en) Calculation device for and calculation method of performing convolution
US9842647B1 (en) Programming of resistive random access memory for analog computation
WO2020125740A1 (en) Image reconstruction method and device, apparatus, and computer-readable storage medium
JP7360595B2 (en) information processing equipment
WO2020087254A1 (en) Optimization method for convolutional neural network, and related product
KR102203337B1 (en) Apparatus and method for m-estimation with trimmed l1 penalty
WO2022027242A1 (en) Neural network-based data processing method and apparatus, mobile platform, and computer readable storage medium
US20230131543A1 (en) Apparatus and method with multi-task processing
Zotov Algorithm for synthesizing optimal controllers of given complexity
CN115062765A (en) Task processing method and device, electronic equipment and storage medium
CN116306820A (en) Quantization training method, apparatus, device, and computer-readable storage medium
CN112202886A (en) Task unloading method, system, device and storage medium
Akyürek et al. Automatic Knot Adjustment Using Dolphin Echolocation Algorithm for B-Spline Curve Approximation
CN111130555A (en) Compressed sensing signal reconstruction method and system
Vega et al. Iterative Optimization

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant