CN113052307B - Memristor accelerator-oriented neural network model compression method and system


Info

Publication number
CN113052307B
Authority
CN
China
Prior art keywords
pruning
memristor
array
model
network
Prior art date
Legal status
Active
Application number
CN202110281982.6A
Other languages
Chinese (zh)
Other versions
CN113052307A (en)
Inventor
王琴
沈林耀
景乃锋
绳伟光
蒋剑飞
毛志刚
Current Assignee
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN202110281982.6A
Publication of CN113052307A
Application granted
Publication of CN113052307B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/082: Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a memristor accelerator-oriented neural network model compression method and system, relating to the technical field of memristor-based neural network accelerators. The method comprises the following steps: step 1: pruning an original network model through an array-aware regularized incremental pruning algorithm to obtain a memristor-array-friendly regularized sparse model; step 2: through a power-of-two quantization algorithm, reducing the ADC precision requirement and the number of low-resistance devices in the memristor array so as to lower the overall system power consumption. The method solves the problems of excessive hardware resource consumption and excessive power consumption of the ADC units and the computing array when the original model is mapped onto the memristor accelerator.

Description

Memristor accelerator-oriented neural network model compression method and system
Technical Field
The invention relates to the technical field of memristor-based neural network accelerators, in particular to a memristor accelerator-oriented neural network model compression method and system.
Background
With the continuous growth of hardware computing power, neural network technology has become one of the most active research directions. Neural network algorithms, including convolutional neural networks, have achieved remarkable results in areas such as image recognition, object detection, and semantic segmentation. Neural network applications are usually deployed in the form of edge computing; however, as network scale keeps increasing, CMOS dedicated neural network accelerators at the edge can no longer meet the growing storage and computing requirements, nor can they overcome the performance bottleneck and excessive power consumption caused by frequent data movement in an architecture that separates memory from computation.
In recent years, researchers have tried to break through the limitations of the traditional architecture that separates memory from computation and have increasingly focused on in-memory computing. The resistive random access memory (ReRAM) has great potential to fundamentally solve the problems caused by the conventional computing mechanism and architecture. The resistive random access memory, also called a memristor, offers low power consumption, a simple structure, high operating speed, and a variable and controllable resistance, and can realize various forms of computation such as Boolean logic and vector-matrix multiplication. The recently proposed memristor-based neural network accelerators provide an effective solution for reducing data movement, lowering storage requirements, and improving the forward inference capability of deep learning.
Although memristor neural network accelerators have great advantages in network forward inference, such accelerators still face problems when used for edge computing. First, mapping an original dense neural network model onto a memristor neural network accelerator still consumes a large amount of hardware resources; second, the memristor compute arrays and the analog-to-digital conversion units (ADCs) in a memristor accelerator system are power hungry.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide a memristor accelerator-oriented neural network model compression method and system, which solve the problems of excessive hardware resource consumption and excessive power consumption of the ADC units and the computing array when an original model is mapped onto a memristor accelerator.
The memristor accelerator-oriented neural network model compression method and system provided by the invention adopt the following scheme:
in a first aspect, a neural network model compression method for a memristor accelerator is provided, and the method includes:
pruning an original network model through an array-aware regularized incremental pruning algorithm to obtain a memristor-array-friendly regularized sparse model;
through a power-of-two quantization algorithm, reducing the ADC precision requirement and the number of low-resistance devices in the memristor array so as to lower the overall system power consumption.
Preferably, the array-aware regularized incremental pruning algorithm includes:
array awareness: adjusting the pruning granularity according to the actual memristor array size during network pruning;
combination of incremental pruning and layered sparsity: incremental pruning prunes the neural network model step by step and recovers the model accuracy by retraining; layered sparsity sets a different pruning-rate parameter for each network layer according to its position in the model, and the per-layer pruning parameters follow a low-high-low strategy over the pruning rate;
threshold calibration: the calibration scheme divides the L2 norm of each row by the number of valid columns in the row to achieve normalization.
Preferably, the incremental pruning comprises:
setting the pruning rate to a low initial value, so that the first pruning only removes a small number of weights at the granularity of memristor array rows;
recovering the previous model accuracy by retraining;
increasing the pruning rate by the pruning-rate step so as to prune the model further;
recovering the accuracy by retraining, and repeating the process of raising the pruning rate and recovering the accuracy until the accuracy of the whole model reaches the designed threshold target and the number of network training iterations reaches the set requirement (an illustrative sketch of this loop follows).
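The loop described above can be illustrated with a minimal, non-limiting Python sketch; it is not part of the claimed method, and the callables prune_rows, retrain and evaluate as well as the rate values are hypothetical placeholders for framework-specific routines and parameters:

def incremental_pruning(model, prune_rows, retrain, evaluate,
                        init_rate=0.1, rate_step=0.1, target_rate=0.8,
                        acc_target=0.90, max_rounds=10):
    # Incremental pruning: start from a low pruning rate, prune at memristor
    # array-row granularity, retrain to recover accuracy, then raise the rate
    # step by step until the accuracy target and the desired sparsity are met.
    rate = init_rate
    for _ in range(max_rounds):
        prune_rows(model, rate)                    # remove rows whose calibrated L2 norm falls below the threshold
        retrain(model)                             # recover the accuracy lost by this pruning pass
        if evaluate(model) >= acc_target and rate >= target_rate:
            break                                  # accuracy target met at the desired pruning rate
        rate = min(rate + rate_step, target_rate)  # raise the pruning rate for the next pass
    return model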
Preferably, the threshold calibration comprises:
converting the current layer to be pruned into general matrix multiplication form;
computing the L2 norm of each row at the granularity of memristor array rows and sorting the norms by size; obtaining the pruning L2-norm threshold from the pruning rate of the current layer in the layered sparsity configuration table, setting row weights below the threshold to 0, and keeping the remaining weights for retraining (an illustrative sketch follows).
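A minimal, non-limiting sketch of this row-granularity pruning with threshold calibration is given below. It assumes the layer weights are already in general-matrix-multiplication form (unrolled kernel elements along the rows, kernels along the columns) and that the matrix is tiled column-wise to an assumed physical array width array_cols; the division by the number of valid columns implements the calibration of the preceding paragraph:

import numpy as np

def prune_array_rows(weight_2d, prune_rate, array_cols):
    # Array-aware row-granularity pruning with threshold calibration.
    # weight_2d : GEMM-form weight matrix (unrolled kernel elements x kernels).
    # prune_rate: pruning rate of this layer from the layered sparsity table.
    # array_cols: physical column count of one memristor array (illustrative).
    rows, cols = weight_2d.shape
    norms, index = [], []
    for c0 in range(0, cols, array_cols):                   # one column tile per group of memristor arrays
        tile = weight_2d[:, c0:c0 + array_cols]
        valid = tile.shape[1]                               # columns actually occupied (last tile may be narrower)
        for r in range(rows):
            norms.append(np.linalg.norm(tile[r]) / valid)   # calibrated norm: L2 divided by valid columns
            index.append((r, c0, c0 + valid))
    norms = np.asarray(norms)
    k = int(prune_rate * norms.size)                        # number of array rows to clear
    if k > 0:
        threshold = np.sort(norms)[k - 1]                   # layer-wide pruning L2-norm threshold
        for n, (r, c0, c1) in zip(norms, index):
            if n <= threshold:
                weight_2d[r, c0:c1] = 0.0                   # clear one memristor-array row
    return weight_2d

Because a single threshold is derived for the whole layer, the number of rows cleared in each individual array is not fixed, which matches the behaviour described in the detailed description.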
Preferably, the power of two quantization algorithm includes:
grouping the weights of the current layer to be processed, where one group contains the weights to be quantized and the other group keeps floating-point weights that are not yet quantized and continue to participate in network retraining;
after quantization, recovering the accuracy lost to quantization through retraining;
increasing the grouping rate α and performing the quantization and retraining steps again until all weights of the current layer network are quantized to the 2^n form (an illustrative sketch of one round follows).
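One round of the grouped quantization can be sketched as follows; this is a non-limiting illustration, the function name, exponent range and tie handling are assumptions of the sketch, and the magnitude is snapped to a power of two by rounding its exponent:

import numpy as np

def quantize_group_pow2(weights, alpha, exp_min=-8, exp_max=-1):
    # Quantize the fraction `alpha` of weights with the largest magnitudes to
    # signed powers of two; the remaining weights stay in floating point so
    # they can still be updated during the subsequent retraining pass.
    flat = np.abs(weights).ravel()
    k = max(1, int(alpha * flat.size))
    cutoff = np.sort(flat)[-k]                      # magnitude cutoff defining this group
    mask = np.abs(weights) >= cutoff                # weights quantized in this round
    w = weights.copy()
    nz = mask & (weights != 0)                      # pruned (zero) weights are left at 0
    exps = np.clip(np.round(np.log2(np.abs(w[nz]))), exp_min, exp_max)
    w[nz] = np.sign(w[nz]) * np.power(2.0, exps)    # snap the magnitude to 2^n
    return w, mask                                  # mask marks the already-quantized entries

In the overall flow, alpha would be raised step by step (for example 0.25, 0.5, 0.75, 1.0), and after each call the weights outside the mask are retrained before the next, larger group is quantized.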
In a second aspect, a neural network model compression system for a memristor accelerator is provided, the system comprising:
module M1: pruning the original network model through an array-aware regularized incremental pruning algorithm to obtain a memristor-array-friendly regularized sparse model;
module M2: through a power-of-two quantization algorithm, reducing the ADC precision requirement and the number of low-resistance devices in the memristor array so as to lower the overall system power consumption.
Preferably, the array-aware regularized incremental pruning algorithm in the module M1 includes:
array awareness: adjusting the pruning granularity according to the actual memristor array size during network pruning;
combination of incremental pruning and layered sparsity: incremental pruning prunes the neural network model step by step and recovers the model accuracy by retraining; layered sparsity sets a different pruning-rate parameter for each network layer according to its position in the model, and the per-layer pruning parameters follow a low-high-low strategy over the pruning rate;
threshold calibration: the calibration scheme divides the L2 norm of each row by the number of valid columns in the row to achieve normalization.
Preferably, the incremental pruning comprises:
setting the pruning rate to a low initial value, so that the first pruning only removes a small number of weights at the granularity of memristor array rows;
recovering the previous model accuracy by retraining;
increasing the pruning rate by the pruning-rate step so as to prune the model further;
recovering the accuracy by retraining, and repeating the process of raising the pruning rate and recovering the accuracy until the accuracy of the whole model reaches the designed threshold target and the number of network training iterations reaches the set requirement.
Preferably, the threshold calibration comprises:
converting the current layer to be pruned into general matrix multiplication form;
computing the L2 norm of each row at the granularity of memristor array rows and sorting the norms by size; obtaining the pruning L2-norm threshold from the pruning rate of the current layer in the layered sparsity configuration table, setting row weights below the threshold to 0, and keeping the remaining weights for retraining.
Preferably, the power of two quantization algorithm in the module M2 includes:
grouping the weights of the current layer to be processed, where one group contains the weights to be quantized and the other group keeps floating-point weights that are not yet quantized and continue to participate in network retraining;
after quantization, recovering the accuracy lost to quantization through retraining;
increasing the grouping rate α and performing the quantization and retraining steps again until all weights of the current layer network are quantized to the 2^n form.
Compared with the prior art, the invention has the following beneficial effects:
1. the array-aware regularized incremental pruning algorithm obtains a memristor-array-friendly regularized sparse model while preserving model accuracy, saving memristor array resources;
2. the power-of-two quantization algorithm constrains the binary coding form of the weights to reduce the precision requirement of the ADC units in the accelerator system and the number of low-resistance memristor devices in the array, thereby reducing the overall power consumption.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a schematic diagram of the overall framework design of a neural network model compression method;
FIG. 2 is a schematic diagram of array-aware pruning;
FIG. 3 is a schematic diagram of pruning threshold determination;
FIG. 4 is a memristor array power consumption schematic;
FIG. 5 is a schematic diagram of the power-of-two quantization process.
Detailed Description
The present invention will be described in detail below with reference to specific embodiments. The following embodiments will help those skilled in the art to further understand the invention, but do not limit the invention in any way. It should be noted that various changes and modifications can be made by those skilled in the art without departing from the spirit of the invention, all of which fall within the scope of the present invention.
As shown in FIG. 1, an embodiment of the present invention provides a neural network model compression method for a memristor accelerator. Compared with a neural network accelerator in a conventional CMOS process, a memristor-based neural network accelerator can reduce the power consumption caused by data movement and eliminate the limitation imposed by bandwidth. By means of its crossbar array structure, the memristor neural network accelerator realizes highly parallel multiply-accumulate operations, greatly increasing the speed of neural network forward inference. However, existing memristor neural network accelerators suffer from excessive hardware resource consumption and excessive power consumption in the analog-to-digital conversion units and the memristor array units. The array-aware regularized incremental pruning algorithm provided by the invention prunes an original network model to obtain a memristor-array-friendly regularized sparse model, thereby reducing hardware resource usage.
The array-aware regularized incremental pruning algorithm mainly comprises three parts: array awareness, incremental pruning, and threshold calibration.
the array sensing is that the pruning granularity can be adjusted according to the actual array size of the memristor during network clipping, and the pruning visual angle of the existing channel-level pruning algorithm starts from the angle of an original three-dimensional network model, and the input channel or the whole convolution kernel of the same layer in a plurality of convolution kernels is pruned. When the pruning algorithm is adopted to realize the sparse neural network accelerator, the pruned memristors still span the same row or the same column of a plurality of rows no matter how the array scale changes, so the pruning algorithm is irrelevant to the array.
As shown in FIG. 2, in the array-aware pruning algorithm with row granularity, the concrete pruning effect is affected by the actual memristor array size, and arrays of different sizes produce different sparsity distributions. When the memristor array is small, the pruning granularity becomes finer and approaches fine-grained element-level pruning; when the array becomes larger, array-aware pruning adjusts the sparsity distribution according to the array size to again obtain a high-performance network model. However the memristor array size changes, a regularized distribution of zero values is always produced inside a single memristor array, so that weight rearrangement can save array resources.
Pruning a neural network model inevitably damages its performance, and retraining is a common means of recovering model accuracy. If the pruning rate is directly set too high, the remaining non-zero weights after a single pruning pass are difficult to update to a theoretically optimal value through gradient descent, retraining loses its effect, and the model accuracy cannot be restored to that of the original dense network. Incremental pruning addresses this problem: it first sets the pruning rate to a low initial value, so that only a small number of weights at the granularity of memristor array rows are removed in the first pruning pass, and the previous model accuracy is recovered by retraining. The pruning rate is then raised by the pruning-rate step to prune the model further, and the accuracy is again recovered by retraining; this process of raising the pruning rate and recovering accuracy is repeated until the accuracy of the whole model reaches the designed threshold target and the number of network training iterations reaches the set requirement.
Layered sparsity sets different pruning-rate parameters for each network layer according to its position in the model. The main rationale is that different network layers have different amounts of redundancy and different impacts on model performance. Shallow layers close to the image input are the key layers for learning explicit image features and determine the fitting capability of the model, so their parameter redundancy is low. The weights of the fully-connected layer close to the probability output directly influence the final classification accuracy, so these parameters are also particularly critical. The layered pruning technique adopted in this embodiment therefore sets the per-layer pruning parameters according to a low-high-low strategy. Combining incremental pruning with layered sparsity, the overall pruning rate is raised step by step while the per-layer pruning rates are raised differently according to the layered sparsity principle; an illustrative per-layer configuration is sketched after this paragraph.
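As a purely illustrative example (the layer names and rates below are hypothetical, not claimed parameters), a layered sparsity configuration following the low-high-low strategy for a small five-layer network might look like:

# Hypothetical per-layer target pruning rates following the low-high-low strategy.
layer_prune_rate = {
    "conv1": 0.3,   # shallow layer near the image input: low rate to preserve explicit features
    "conv2": 0.6,
    "conv3": 0.7,   # middle layers carry the most redundancy: highest rate
    "conv4": 0.6,
    "fc":    0.3,   # fully-connected layer near the output: low rate to protect classification accuracy
}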
The pruning principle adopted in this embodiment is to delete the rows with smaller L2 norms. FIG. 3 shows the pruning process: first, the current layer to be pruned is converted into general matrix multiplication form, that is, each single 3D convolution kernel is unrolled into one dimension in the vertical direction and the 3D convolution kernels are arranged in sequence in the horizontal direction. The L2 norm of each row is then computed at the granularity of memristor array rows and the norms are sorted by size. The pruning L2-norm threshold is obtained from the pruning rate of the current layer in the layered sparsity configuration table, row weights below the threshold are set to 0, and the remaining weights are kept for retraining. Because the L2 norms of the memristor array weights are not evenly distributed, the number of rows pruned in each memristor array is not fixed.
As can be seen from FIG. 3, when the number of convolution kernels is not a multiple of the memristor array size, the rightmost group of memristor arrays is not completely filled; likewise, when the total number of elements of a single 3D convolution kernel is not a multiple of the memristor array size, the bottom arrays are not completely filled. When performing array-aware row-granularity pruning, the L2 norms of rows in arrays that are not completely filled are computed at a disadvantage, so the partially filled rows in the rightmost arrays are biased toward being pruned. A similar problem exists for the bottom arrays when performing array-aware column-granularity pruning. The threshold therefore needs to be calibrated by dividing the L2 norm of each row by the number of valid columns in that row to achieve normalization.
To address the excessive power consumption of the ADC units and the memristor compute arrays in a memristor accelerator system, power-of-two quantization can reduce the ADC precision requirement and the number of low-resistance devices in the memristor array, lowering the overall system power consumption. ADC power consumption is determined by ADC precision and grows exponentially as the resolution increases, and the theoretical ADC precision required in a memristor accelerator is set by the DAC precision and the memristor device precision. The ADC precision actually required by memristor array computation is often smaller than this theoretical upper bound: the actual requirement is determined by the number of distinct current levels on the bit line, and when a large share of the weights along an array column are 0, the per-device multiplication results are always 0, the number of distinct current levels on the bit line is greatly reduced, and the ADC precision can be reduced accordingly.
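The dependence of the ADC requirement on the bit-line current levels can be made concrete for the simplest case; the bound below assumes 1-bit DAC inputs and single-bit memristor cells, which is an assumption of this illustration rather than the exact formula referred to above:

% With R rows driven by 1-bit inputs onto single-bit cells, a bit line can take at
% most R+1 distinct current sums, so the theoretical ADC resolution is bounded by
b_{\mathrm{ADC}} \le \lceil \log_2 (R + 1) \rceil .
% If only k of the R cells in a column are in the low-resistance state (the others
% represent weight 0), the number of distinct sums drops to k+1 and the requirement
% relaxes to
b_{\mathrm{ADC}} \le \lceil \log_2 (k + 1) \rceil .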
The power consumption of the memristor compute array varies dynamically with the actually mapped weights and the feature-map input data. An input value of 1, corresponding to a high voltage on a word line, is denoted HVL, and an input value of 0, corresponding to a low voltage on a word line, is denoted LVL. A memristor cell has two states, the low-resistance state LRS and the high-resistance state HRS, so with reference to FIG. 4 the power consumed during memristor array computation can be divided into four parts: (1) HVL on HRS, (2) HVL on LRS, (3) LVL on HRS, and (4) LVL on LRS. According to Ohm's law, device power consumption is positively related to the applied voltage and negatively related to the resistance. The highest power of the four parts is therefore the HVL-on-LRS case, i.e. a high voltage applied across a low-resistance device, while LVL on HRS consumes the least. Considering that LVL is typically 0 V in a practical memristor accelerator system, the power consumption of parts (3) and (4) is negligible. Reducing the number of HVL-on-LRS cases in the memristor array therefore reduces its power consumption; and since the feature map is input data that cannot be changed, the number of 1 bits in the weights must be reduced.
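A rough, non-limiting sketch of this decomposition is shown below; the voltage and resistance values are illustrative assumptions, and the function simply tallies the four input/state combinations cell by cell using Ohm's law:

import numpy as np

def estimate_array_power(input_bits, cell_is_lrs, v_high=0.2, v_low=0.0,
                         r_lrs=1.0e4, r_hrs=1.0e6):
    # input_bits : (R,) word-line input bits, 1 -> HVL, 0 -> LVL.
    # cell_is_lrs: (R, C) boolean map, True where a cell stores the low-resistance state.
    input_bits = np.asarray(input_bits)
    cell_is_lrs = np.asarray(cell_is_lrs, dtype=bool)
    v = np.where(input_bits == 1, v_high, v_low)[:, None]   # per-row word-line voltage
    r = np.where(cell_is_lrs, r_lrs, r_hrs)                 # per-cell resistance
    p = v ** 2 / r                                           # Ohm's law: P = V^2 / R for every cell
    hvl = np.broadcast_to(v > 0, p.shape)
    return {
        "HVL_on_LRS": float(p[hvl & cell_is_lrs].sum()),     # dominant term
        "HVL_on_HRS": float(p[hvl & ~cell_is_lrs].sum()),
        "LVL_total":  float(p[~hvl].sum()),                  # zero when LVL = 0 V
    }

Reducing the count of cells that fall into the HVL-on-LRS case, for example by cutting the number of 1 bits in the stored weights, directly shrinks the dominant term of this estimate.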
FIG. 5 shows the power-of-two quantization process of this embodiment. The scheme first groups the weights of the current layer to be processed: one group contains the weights to be quantized, and the other group keeps floating-point weights that are not yet quantized and continue to participate in network retraining. The grouping criterion is that weights with large numerical values have more influence on the model, i.e. the weights with large absolute values are quantized first according to the grouping rate α. After quantization, the accuracy lost to quantization is recovered by retraining. The grouping rate α is then raised and the quantization and retraining steps are performed again, until all weights of the current layer are quantized to the 2^n form.
After this processing, all non-zero floating-point weights of the originally regularized sparse network model become powers of two of the form 2^n. When the quantization precision is 8, the quantized range of the current layer is {±2^-1, …, ±2^-8, 0}. Dividing this range uniformly by its minimum magnitude 2^-8 scales the whole power-of-two quantization range to {±2^7, …, ±2^0, 0}. When the scaled quantization range is represented in binary code, the binary code of each element consists of 7 zero bits and 1 one bit. Mapping this set of scaled weights onto memristors, each weight therefore requires 7 HRS memristor devices and 1 LRS memristor device.
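The re-expression described above can be sketched as follows (quantization precision 8 is assumed as in the example; the function name is illustrative, and the sketch assumes signs are handled separately, for instance by a positive/negative array pair):

import numpy as np

def pow2_weights_to_codes(quant_w, exp_min=-8, exp_max=-1):
    # Map power-of-two weights in {+-2^-1, ..., +-2^-8, 0} to 8-bit one-hot codes.
    # Dividing by the minimum magnitude 2^exp_min scales the range to
    # {+-2^7, ..., +-2^0, 0}; each non-zero magnitude then has exactly one '1' bit,
    # i.e. 1 LRS cell and 7 HRS cells per weight when mapped onto memristors.
    n_bits = exp_max - exp_min + 1                        # 8 bits for quantization precision 8
    scaled = np.abs(np.asarray(quant_w)) / (2.0 ** exp_min)
    codes = np.zeros(scaled.shape + (n_bits,), dtype=np.uint8)
    nz = scaled > 0
    bit_pos = np.round(np.log2(scaled[nz])).astype(int)   # which single bit is set (0..7)
    codes[nz, n_bits - 1 - bit_pos] = 1                   # MSB-first one-hot code
    lrs_per_weight = codes.sum(axis=-1)                   # 1 for non-zero weights, 0 for pruned ones
    return codes, lrs_per_weight

Summing lrs_per_weight over a column then gives the number of low-resistance cells driving that bit line, which is what determines both the HVL-on-LRS power term and the remaining ADC precision requirement.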
Compared with conventional uniform quantization, power-of-two quantization greatly reduces the number of LRS devices used in the memristor array. When the number of LRS devices in the array drops sharply, each column contains a large number of HRS devices corresponding to 0 bit values; these devices perform the multiplication of a weight 0 with the feature-map data, the result is 0, and its physical expression is an extremely small current. When a column contains many HRS devices, the upper limit of the number of distinct current sums accumulated on the bit line is therefore reduced, which lowers the ADC precision requirement. Likewise, because the number of LRS devices is reduced, the proportion of HVL-on-LRS cases, which account for the vast majority of the power consumption, is reduced, while the proportions of the HVL-on-HRS and LVL-on-HRS cases rise. Overall, the power consumption of the memristor array is greatly reduced.
The embodiment of the invention provides a neural network model compression method for a memristor accelerator, wherein a memristor array-friendly regularized sparse model is obtained by an array-aware regularized incremental pruning algorithm under the condition of ensuring the accuracy of the model so as to save memristor array resources; through a power-of-two quantization algorithm, a weight binary coding form is constrained so as to reduce the precision requirement of an ADC unit in an accelerator system and the number of low-resistance memristor devices in an array, thereby reducing the overall power consumption.
Those skilled in the art will appreciate that, in addition to implementing the system and its various devices, modules, units provided by the present invention as pure computer readable program code, the system and its various devices, modules, units provided by the present invention can be fully implemented by logically programming method steps in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Therefore, the system and various devices, modules and units thereof provided by the present invention can be regarded as a hardware component, and the devices, modules and units included therein for implementing various functions can also be regarded as structures within the hardware component; means, modules, units for performing the various functions may also be regarded as structures within both software modules and hardware components for performing the method.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.

Claims (8)

1. A memristor accelerator-oriented neural network model compression method is characterized by comprising the following steps:
step 1: cutting an original network model through an array-aware regularized incremental pruning algorithm to obtain a memristor array-friendly regularized sparse model;
step 2: by means of a power quantization algorithm of two, the ADC precision requirement and the number of low-resistance devices in the memristor array are reduced so as to reduce the system power consumption overall;
the array-aware regularized incremental pruning algorithm in the step 1 comprises the following steps:
array awareness: adjusting the pruning granularity according to the actual memristor array size during network pruning;
combination of incremental pruning and layered sparsity: incremental pruning prunes the neural network model and recovers the model accuracy; layered sparsity sets different pruning-rate parameters for each network layer according to its position in the model, and the per-layer pruning parameters follow a low-high-low strategy over the pruning rate;
threshold calibration: the calibration scheme is to divide the L2 norm of each row by the number of valid columns in the row to achieve normalization.
2. The method of claim 1, wherein the incremental pruning comprises:
step 1-1: setting the pruning rate to an initial value, so that only a small number of weights at the granularity of memristor array rows are pruned in the first pruning;
step 1-2: recovering the previous model accuracy by retraining;
step 1-3: increasing the pruning rate by the pruning-rate step so as to prune the model further;
step 1-4: recovering the accuracy by retraining, and repeating the process of raising the pruning rate and recovering the accuracy until the accuracy of the whole model reaches the designed threshold target and the number of network training iterations reaches the set requirement.
3. The method of claim 1, wherein the threshold calibration comprises:
step 1-5: converting the current layer to be pruned into general matrix multiplication form;
step 1-6: computing the L2 norm of each row at the granularity of memristor array rows and sorting the norms by size; obtaining the pruning L2-norm threshold according to the pruning rate of the current layer in the layered sparsity configuration table, setting row weights below the threshold to 0, and keeping the remaining weights for retraining.
4. The method of claim 1, wherein the power-of-two quantization algorithm in step 2 comprises:
step 2-1: grouping the weights of the current layer to be processed, wherein one group contains the weights to be quantized and the other group keeps floating-point weights that are not yet quantized and participate in network retraining;
step 2-2: after quantization, recovering the accuracy lost to quantization through retraining;
step 2-3: increasing the grouping rate α, and performing the quantization and retraining steps again until all weights of the current layer network are quantized to the 2^n form.
5. A memristor accelerator-oriented neural network model compression system, the system comprising:
module M1: cutting an original network model through an array-aware regularized incremental pruning algorithm to obtain a memristor array-friendly regularized sparse model;
module M2: by means of a power quantization algorithm of two, the ADC precision requirement and the number of low-resistance devices in the memristor array are reduced so as to reduce the system power consumption overall;
wherein the array-aware regularized incremental pruning algorithm in the module M1 includes:
array awareness: adjusting the pruning granularity according to the actual memristor array size during network pruning;
combination of incremental pruning and layered sparsity: incremental pruning prunes the neural network model and recovers the model accuracy; layered sparsity sets different pruning-rate parameters for each network layer according to its position in the model, and the per-layer pruning parameters follow a low-high-low strategy over the pruning rate;
threshold calibration: the calibration scheme is to divide the L2 norm of each row by the number of valid columns in the row to achieve normalization.
6. The system of claim 5, wherein the incremental pruning comprises:
setting the pruning rate to an initial value, so that only a small number of weights at the granularity of memristor array rows are pruned in the first pruning;
recovering the previous model accuracy by retraining;
increasing the pruning rate by the pruning-rate step so as to prune the model further;
recovering the accuracy by retraining, and repeating the process of raising the pruning rate and recovering the accuracy until the accuracy of the whole model reaches the designed threshold target and the number of network training iterations reaches the set requirement.
7. The system of claim 5, wherein the threshold calibration comprises:
converting the current layer to be pruned into general matrix multiplication form;
computing the L2 norm of each row at the granularity of memristor array rows and sorting the norms by size; obtaining the pruning L2-norm threshold according to the pruning rate of the current layer in the layered sparsity configuration table, setting row weights below the threshold to 0, and keeping the remaining weights for retraining.
8. The system according to claim 5, wherein the power-of-two quantization algorithm in the module M2 comprises:
grouping the weights of the current layer to be processed, wherein one group contains the weights to be quantized and the other group keeps floating-point weights that are not yet quantized and participate in network retraining;
after quantization, recovering the accuracy lost to quantization through retraining;
increasing the grouping rate α, and performing the quantization and retraining steps again until all weights of the current layer network are quantized to the 2^n form.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110281982.6A CN113052307B (en) 2021-03-16 2021-03-16 Memristor accelerator-oriented neural network model compression method and system


Publications (2)

Publication Number Publication Date
CN113052307A CN113052307A (en) 2021-06-29
CN113052307B (en) 2022-09-06

Family

ID=76513119

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110281982.6A Active CN113052307B (en) 2021-03-16 2021-03-16 Memristor accelerator-oriented neural network model compression method and system

Country Status (1)

Country Link
CN (1) CN113052307B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113553293A (en) * 2021-07-21 2021-10-26 清华大学 Storage and calculation integrated device and calibration method thereof
CN115311506B (en) * 2022-10-11 2023-03-28 之江实验室 Image classification method and device based on quantization factor optimization of resistive random access memory

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109635936A (en) * 2018-12-29 2019-04-16 杭州国芯科技股份有限公司 A kind of neural networks pruning quantization method based on retraining
CN109711532A (en) * 2018-12-06 2019-05-03 东南大学 A kind of accelerated method inferred for hardware realization rarefaction convolutional neural networks
CN109791628A (en) * 2017-12-29 2019-05-21 清华大学 Neural network model splits' positions method, training method, computing device and system
CN110378468A (en) * 2019-07-08 2019-10-25 浙江大学 A kind of neural network accelerator quantified based on structuring beta pruning and low bit
CN110633747A (en) * 2019-09-12 2019-12-31 网易(杭州)网络有限公司 Compression method, device, medium and electronic device for target detector

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3657398A1 (en) * 2017-05-23 2020-05-27 Shanghai Cambricon Information Technology Co., Ltd Weight quantization method for a neural network and accelerating device therefor

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109791628A (en) * 2017-12-29 2019-05-21 清华大学 Neural network model splits' positions method, training method, computing device and system
CN109711532A (en) * 2018-12-06 2019-05-03 东南大学 A kind of accelerated method inferred for hardware realization rarefaction convolutional neural networks
CN109635936A (en) * 2018-12-29 2019-04-16 杭州国芯科技股份有限公司 A kind of neural networks pruning quantization method based on retraining
CN110378468A (en) * 2019-07-08 2019-10-25 浙江大学 A kind of neural network accelerator quantified based on structuring beta pruning and low bit
CN110633747A (en) * 2019-09-12 2019-12-31 网易(杭州)网络有限公司 Compression method, device, medium and electronic device for target detector

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"Learning the sparsity for ReRAM: mapping and pruning sparse neural network for ReRAM based accelerator";Jilan Lin 等;《ASPDAC "19: Proceedings of the 24th Asia and South Pacific Design Automation Conference》;20190131;第639-644页 *
"PattPIM: A Practical ReRAM-Based DNN Accelerator by Reusing Weight Pattern Repetitions";Yuhao Zhang 等;《2020 57th ACM/IEEE Design Automation Conference (DAC)》;20201009;第1-6页 *
"基于粗粒度数据流架构的稀疏卷积神经网络加速";吴欣欣 等;《计算机研究与发展》;20210228;第58卷(第7期);第1504-1517页 *

Also Published As

Publication number Publication date
CN113052307A (en) 2021-06-29

Similar Documents

Publication Publication Date Title
US20220374688A1 (en) Training method of neural network based on memristor and training device thereof
Sun et al. Fully parallel RRAM synaptic array for implementing binary neural network with (+ 1,− 1) weights and (+ 1, 0) neurons
CN113052307B (en) Memristor accelerator-oriented neural network model compression method and system
EP3389051B1 (en) Memory device and data-processing method based on multi-layer rram crossbar array
Chen et al. Technology-design co-optimization of resistive cross-point array for accelerating learning algorithms on chip
Zhang et al. Sign backpropagation: An on-chip learning algorithm for analog RRAM neuromorphic computing systems
EP3627401B1 (en) Method and device for training neural network
CN109635935B (en) Model adaptive quantization method of deep convolutional neural network based on modular length clustering
Ni et al. Distributed in-memory computing on binary RRAM crossbar
Meng et al. Structured pruning of RRAM crossbars for efficient in-memory computing acceleration of deep neural networks
US11544540B2 (en) Systems and methods for neural network training and deployment for hardware accelerators
CN110569962A (en) Convolution calculation accelerator based on 1T1R memory array and operation method thereof
US11562220B2 (en) Neural processing unit capable of reusing data and method thereof
Chen et al. A high-throughput and energy-efficient RRAM-based convolutional neural network using data encoding and dynamic quantization
Lin et al. Rescuing rram-based computing from static and dynamic faults
Peng et al. Inference engine benchmarking across technological platforms from CMOS to RRAM
US20230409892A1 (en) Neural processing unit being operated based on plural clock signals having multi-phases
Chen et al. WRAP: weight RemApping and processing in RRAM-based neural network accelerators considering thermal effect
US20220075601A1 (en) In-memory computing method and in-memory computing apparatus
Chang et al. T-eap: Trainable energy-aware pruning for nvm-based computing-in-memory architecture
CN113705784A (en) Neural network weight coding method based on matrix sharing and hardware system
Peng et al. Network Pruning Towards Highly Efficient RRAM Accelerator
Li et al. Memory saving method for enhanced convolution of deep neural network
Kim et al. SNPU: An energy-efficient spike domain deep-neural-network processor with two-step spike encoding and shift-and-accumulation unit
TWI798798B (en) In-memory computing method and in-memory computing apparatus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant