CN113052307B - Memristor accelerator-oriented neural network model compression method and system


Info

Publication number
CN113052307B
Authority
CN
China
Prior art keywords
pruning
memristor
array
model
network
Prior art date
Legal status
Active
Application number
CN202110281982.6A
Other languages
Chinese (zh)
Other versions
CN113052307A (en)
Inventor
王琴
沈林耀
景乃锋
绳伟光
蒋剑飞
毛志刚
Current Assignee
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN202110281982.6A
Publication of CN113052307A
Application granted
Publication of CN113052307B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/082: Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a memristor accelerator-oriented neural network model compression method and system, relating to the technical field of memristor-based neural network accelerators. The method comprises the following steps: step 1: pruning an original network model through an array-aware regularized incremental pruning algorithm to obtain a memristor-array-friendly regularized sparse model; step 2: through a power-of-two quantization algorithm, reducing the ADC precision requirement and the number of low-resistance devices in the memristor array so as to lower the overall system power consumption. The method solves the problems of excessive hardware resource consumption and excessive power consumption of the ADC units and the computing array when the original model is mapped onto the memristor accelerator.

Description

Memristor accelerator-oriented neural network model compression method and system
Technical Field
The invention relates to the technical field of memristor-based neural network accelerators, in particular to a memristor accelerator-oriented neural network model compression method and system.
Background
With the continuous growth of hardware computing power, neural network technology has become one of the most active research directions. Neural network algorithms, including convolutional neural networks, have achieved remarkable results in areas such as image recognition, object detection, and semantic segmentation. Neural network applications are usually deployed in the form of edge computing; however, as network scale keeps increasing, CMOS dedicated neural network accelerators at the edge can no longer meet the growing storage and computing requirements, nor can they overcome the performance bottleneck and excessive power consumption caused by frequent data movement in an architecture that separates memory from computation.
In recent years, researchers have tried to break through the limitations of the traditional architecture that separates memory from computation and have increasingly focused on in-memory computing. The resistive random access memory (ReRAM) has great potential to fundamentally solve the problems caused by the conventional computing mechanism and architecture. The resistive random access memory, also called a memristor, offers low power consumption, a simple structure, high operating speed, and a variable and controllable resistance, and can realize various forms of computation such as Boolean logic and vector-matrix multiplication. The recently proposed memristor-based neural network accelerators provide an effective solution for reducing data movement, lowering storage requirements, and improving the forward inference capability of deep learning.
Although memristor neural network accelerators have great advantages in network forward inference, such accelerators still face problems when used for edge computing. First, mapping an original dense neural network model onto a memristor neural network accelerator still consumes a large amount of hardware resources; second, the memristor compute arrays and the analog-to-digital conversion units (ADCs) in a memristor accelerator system are power hungry.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide a memristor accelerator-oriented neural network model compression method and system, which solve the problems of excessive hardware resource consumption and excessive power consumption of the ADC units and the computing array when an original model is mapped onto a memristor accelerator.
The memristor accelerator-oriented neural network model compression method and system provided by the invention adopt the following scheme:
in a first aspect, a neural network model compression method for a memristor accelerator is provided, and the method includes:
pruning an original network model through an array-aware regularized incremental pruning algorithm to obtain a memristor-array-friendly regularized sparse model;
through a power-of-two quantization algorithm, reducing the ADC precision requirement and the number of low-resistance devices in the memristor array so as to lower the overall system power consumption.
Preferably, the array-aware regularized incremental pruning algorithm includes:
array awareness: adjusting the pruning granularity according to the actual memristor array size during network pruning;
combination of incremental pruning and layered sparsity: incremental pruning prunes the neural network model step by step and recovers the model accuracy by retraining; layered sparsity sets a different pruning-rate parameter for each network layer according to its position in the model, and the per-layer pruning parameters follow a low-high-low strategy over the pruning rate;
threshold calibration: the calibration scheme divides the L2 norm of each row by the number of valid columns in the row to achieve normalization.
Preferably, the incremental pruning comprises:
setting the pruning rate to a low initial value, so that the first pruning only removes a small number of weights at the granularity of memristor array rows;
recovering the previous model accuracy by retraining;
increasing the pruning rate by the pruning-rate step so as to prune the model further;
recovering the accuracy by retraining, and repeating the process of raising the pruning rate and recovering the accuracy until the accuracy of the whole model reaches the designed threshold target and the number of network training iterations reaches the set requirement (an illustrative sketch of this loop follows).
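The loop described above can be illustrated with a minimal, non-limiting Python sketch; it is not part of the claimed method, and the callables prune_rows, retrain and evaluate as well as the rate values are hypothetical placeholders for framework-specific routines and parameters:

def incremental_pruning(model, prune_rows, retrain, evaluate,
                        init_rate=0.1, rate_step=0.1, target_rate=0.8,
                        acc_target=0.90, max_rounds=10):
    # Incremental pruning: start from a low pruning rate, prune at memristor
    # array-row granularity, retrain to recover accuracy, then raise the rate
    # step by step until the accuracy target and the desired sparsity are met.
    rate = init_rate
    for _ in range(max_rounds):
        prune_rows(model, rate)                    # remove rows whose calibrated L2 norm falls below the threshold
        retrain(model)                             # recover the accuracy lost by this pruning pass
        if evaluate(model) >= acc_target and rate >= target_rate:
            break                                  # accuracy target met at the desired pruning rate
        rate = min(rate + rate_step, target_rate)  # raise the pruning rate for the next pass
    return model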
Preferably, the threshold calibration comprises:
converting the current layer to be pruned into general matrix multiplication form;
computing the L2 norm of each row at the granularity of memristor array rows and sorting the norms by size; obtaining the pruning L2-norm threshold from the pruning rate of the current layer in the layered sparsity configuration table, setting row weights below the threshold to 0, and keeping the remaining weights for retraining (an illustrative sketch follows).
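A minimal, non-limiting sketch of this row-granularity pruning with threshold calibration is given below. It assumes the layer weights are already in general-matrix-multiplication form (unrolled kernel elements along the rows, kernels along the columns) and that the matrix is tiled column-wise to an assumed physical array width array_cols; the division by the number of valid columns implements the calibration of the preceding paragraph:

import numpy as np

def prune_array_rows(weight_2d, prune_rate, array_cols):
    # Array-aware row-granularity pruning with threshold calibration.
    # weight_2d : GEMM-form weight matrix (unrolled kernel elements x kernels).
    # prune_rate: pruning rate of this layer from the layered sparsity table.
    # array_cols: physical column count of one memristor array (illustrative).
    rows, cols = weight_2d.shape
    norms, index = [], []
    for c0 in range(0, cols, array_cols):                   # one column tile per group of memristor arrays
        tile = weight_2d[:, c0:c0 + array_cols]
        valid = tile.shape[1]                               # columns actually occupied (last tile may be narrower)
        for r in range(rows):
            norms.append(np.linalg.norm(tile[r]) / valid)   # calibrated norm: L2 divided by valid columns
            index.append((r, c0, c0 + valid))
    norms = np.asarray(norms)
    k = int(prune_rate * norms.size)                        # number of array rows to clear
    if k > 0:
        threshold = np.sort(norms)[k - 1]                   # layer-wide pruning L2-norm threshold
        for n, (r, c0, c1) in zip(norms, index):
            if n <= threshold:
                weight_2d[r, c0:c1] = 0.0                   # clear one memristor-array row
    return weight_2d

Because a single threshold is derived for the whole layer, the number of rows cleared in each individual array is not fixed, which matches the behaviour described in the detailed description.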
Preferably, the power of two quantization algorithm includes:
grouping the weights of the current layer to be processed, where one group contains the weights to be quantized and the other group keeps floating-point weights that are not yet quantized and continue to participate in network retraining;
after quantization, recovering the accuracy lost to quantization through retraining;
increasing the grouping rate α and performing the quantization and retraining steps again until all weights of the current layer network are quantized to the 2^n form (an illustrative sketch of one round follows).
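One round of the grouped quantization can be sketched as follows; this is a non-limiting illustration, the function name, exponent range and tie handling are assumptions of the sketch, and the magnitude is snapped to a power of two by rounding its exponent:

import numpy as np

def quantize_group_pow2(weights, alpha, exp_min=-8, exp_max=-1):
    # Quantize the fraction `alpha` of weights with the largest magnitudes to
    # signed powers of two; the remaining weights stay in floating point so
    # they can still be updated during the subsequent retraining pass.
    flat = np.abs(weights).ravel()
    k = max(1, int(alpha * flat.size))
    cutoff = np.sort(flat)[-k]                      # magnitude cutoff defining this group
    mask = np.abs(weights) >= cutoff                # weights quantized in this round
    w = weights.copy()
    nz = mask & (weights != 0)                      # pruned (zero) weights are left at 0
    exps = np.clip(np.round(np.log2(np.abs(w[nz]))), exp_min, exp_max)
    w[nz] = np.sign(w[nz]) * np.power(2.0, exps)    # snap the magnitude to 2^n
    return w, mask                                  # mask marks the already-quantized entries

In the overall flow, alpha would be raised step by step (for example 0.25, 0.5, 0.75, 1.0), and after each call the weights outside the mask are retrained before the next, larger group is quantized.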
In a second aspect, a neural network model compression system for a memristor accelerator is provided, the system comprising:
module M1: pruning the original network model through an array-aware regularized incremental pruning algorithm to obtain a memristor-array-friendly regularized sparse model;
module M2: through a power-of-two quantization algorithm, reducing the ADC precision requirement and the number of low-resistance devices in the memristor array so as to lower the overall system power consumption.
Preferably, the array-aware regularized incremental pruning algorithm in the module M1 includes:
array awareness: adjusting the pruning granularity according to the actual memristor array size during network pruning;
combination of incremental pruning and layered sparsity: incremental pruning prunes the neural network model step by step and recovers the model accuracy by retraining; layered sparsity sets a different pruning-rate parameter for each network layer according to its position in the model, and the per-layer pruning parameters follow a low-high-low strategy over the pruning rate;
threshold calibration: the calibration scheme divides the L2 norm of each row by the number of valid columns in the row to achieve normalization.
Preferably, the incremental pruning comprises:
setting the pruning rate to a low initial value, so that the first pruning only removes a small number of weights at the granularity of memristor array rows;
recovering the previous model accuracy by retraining;
increasing the pruning rate by the pruning-rate step so as to prune the model further;
recovering the accuracy by retraining, and repeating the process of raising the pruning rate and recovering the accuracy until the accuracy of the whole model reaches the designed threshold target and the number of network training iterations reaches the set requirement.
Preferably, the threshold calibration comprises:
converting the current layer to be pruned into general matrix multiplication form;
computing the L2 norm of each row at the granularity of memristor array rows and sorting the norms by size; obtaining the pruning L2-norm threshold from the pruning rate of the current layer in the layered sparsity configuration table, setting row weights below the threshold to 0, and keeping the remaining weights for retraining.
Preferably, the power of two quantization algorithm in the module M2 includes:
grouping the weights of the current layer to be processed, where one group contains the weights to be quantized and the other group keeps floating-point weights that are not yet quantized and continue to participate in network retraining;
after quantization, recovering the accuracy lost to quantization through retraining;
increasing the grouping rate α and performing the quantization and retraining steps again until all weights of the current layer network are quantized to the 2^n form.
Compared with the prior art, the invention has the following beneficial effects:
1. the array-aware regularized incremental pruning algorithm obtains a memristor-array-friendly regularized sparse model while preserving model accuracy, saving memristor array resources;
2. the power-of-two quantization algorithm constrains the binary coding form of the weights to reduce the precision requirement of the ADC units in the accelerator system and the number of low-resistance memristor devices in the array, thereby reducing the overall power consumption.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a schematic diagram of the overall framework design of a neural network model compression method;
FIG. 2 is a schematic diagram of array-aware pruning;
FIG. 3 is a schematic diagram of pruning threshold determination;
FIG. 4 is a memristor array power consumption schematic;
FIG. 5 is a schematic diagram of the power-of-two quantization process.
Detailed Description
The present invention will be described in detail below with reference to specific embodiments. The following embodiments will help those skilled in the art to further understand the invention, but do not limit the invention in any way. It should be noted that various changes and modifications can be made by those skilled in the art without departing from the spirit of the invention, all of which fall within the scope of the present invention.
As shown in FIG. 1, an embodiment of the present invention provides a neural network model compression method for a memristor accelerator. Compared with a neural network accelerator in a conventional CMOS process, a memristor-based neural network accelerator can reduce the power consumption caused by data movement and eliminate the limitation imposed by bandwidth. By means of its crossbar array structure, the memristor neural network accelerator realizes highly parallel multiply-accumulate operations, greatly increasing the speed of neural network forward inference. However, existing memristor neural network accelerators suffer from excessive hardware resource consumption and excessive power consumption in the analog-to-digital conversion units and the memristor array units. The array-aware regularized incremental pruning algorithm provided by the invention prunes an original network model to obtain a memristor-array-friendly regularized sparse model, thereby reducing hardware resource usage.
The array-aware regularized incremental pruning algorithm mainly comprises three parts: array awareness, incremental pruning, and threshold calibration.
the array sensing is that the pruning granularity can be adjusted according to the actual array size of the memristor during network clipping, and the pruning visual angle of the existing channel-level pruning algorithm starts from the angle of an original three-dimensional network model, and the input channel or the whole convolution kernel of the same layer in a plurality of convolution kernels is pruned. When the pruning algorithm is adopted to realize the sparse neural network accelerator, the pruned memristors still span the same row or the same column of a plurality of rows no matter how the array scale changes, so the pruning algorithm is irrelevant to the array.
As shown in FIG. 2, in the array-aware pruning algorithm with row granularity, the concrete pruning effect is affected by the actual memristor array size, and arrays of different sizes produce different sparsity distributions. When the memristor array is small, the pruning granularity becomes finer and approaches fine-grained element-level pruning; when the array becomes larger, array-aware pruning adjusts the sparsity distribution according to the array size to again obtain a high-performance network model. However the memristor array size changes, a regularized distribution of zero values is always produced inside a single memristor array, so that weight rearrangement can save array resources.
Pruning a neural network model inevitably damages its performance, and retraining is a common means of recovering model accuracy. If the pruning rate is directly set too high, the remaining non-zero weights after a single pruning pass are difficult to update to a theoretically optimal value through gradient descent, retraining loses its effect, and the model accuracy cannot be restored to that of the original dense network. Incremental pruning addresses this problem: it first sets the pruning rate to a low initial value, so that only a small number of weights at the granularity of memristor array rows are removed in the first pruning pass, and the previous model accuracy is recovered by retraining. The pruning rate is then raised by the pruning-rate step to prune the model further, and the accuracy is again recovered by retraining; this process of raising the pruning rate and recovering accuracy is repeated until the accuracy of the whole model reaches the designed threshold target and the number of network training iterations reaches the set requirement.
Layered sparsity sets different pruning-rate parameters for each network layer according to its position in the model. The main rationale is that different network layers have different amounts of redundancy and different impacts on model performance. Shallow layers close to the image input are the key layers for learning explicit image features and determine the fitting capability of the model, so their parameter redundancy is low. The weights of the fully-connected layer close to the probability output directly influence the final classification accuracy, so these parameters are also particularly critical. The layered pruning technique adopted in this embodiment therefore sets the per-layer pruning parameters according to a low-high-low strategy. Combining incremental pruning with layered sparsity, the overall pruning rate is raised step by step while the per-layer pruning rates are raised differently according to the layered sparsity principle; an illustrative per-layer configuration is sketched after this paragraph.
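As a purely illustrative example (the layer names and rates below are hypothetical, not claimed parameters), a layered sparsity configuration following the low-high-low strategy for a small five-layer network might look like:

# Hypothetical per-layer target pruning rates following the low-high-low strategy.
layer_prune_rate = {
    "conv1": 0.3,   # shallow layer near the image input: low rate to preserve explicit features
    "conv2": 0.6,
    "conv3": 0.7,   # middle layers carry the most redundancy: highest rate
    "conv4": 0.6,
    "fc":    0.3,   # fully-connected layer near the output: low rate to protect classification accuracy
}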
The pruning principle adopted in this embodiment is to delete the rows with smaller L2 norms. FIG. 3 shows the pruning process: first, the current layer to be pruned is converted into general matrix multiplication form, that is, each single 3D convolution kernel is unrolled into one dimension in the vertical direction and the 3D convolution kernels are arranged in sequence in the horizontal direction. The L2 norm of each row is then computed at the granularity of memristor array rows and the norms are sorted by size. The pruning L2-norm threshold is obtained from the pruning rate of the current layer in the layered sparsity configuration table, row weights below the threshold are set to 0, and the remaining weights are kept for retraining. Because the L2 norms of the memristor array weights are not evenly distributed, the number of rows pruned in each memristor array is not fixed.
As can be seen from FIG. 3, when the number of convolution kernels is not a multiple of the memristor array size, the rightmost group of memristor arrays is not completely filled; likewise, when the total number of elements of a single 3D convolution kernel is not a multiple of the memristor array size, the bottom arrays are not completely filled. When performing array-aware row-granularity pruning, the L2 norms of rows in arrays that are not completely filled are computed at a disadvantage, so the partially filled rows in the rightmost arrays are biased toward being pruned. A similar problem exists for the bottom arrays when performing array-aware column-granularity pruning. The threshold therefore needs to be calibrated by dividing the L2 norm of each row by the number of valid columns in that row to achieve normalization.
To address the excessive power consumption of the ADC units and the memristor compute arrays in a memristor accelerator system, power-of-two quantization can reduce the ADC precision requirement and the number of low-resistance devices in the memristor array, lowering the overall system power consumption. ADC power consumption is determined by ADC precision and grows exponentially as the resolution increases, and the theoretical ADC precision required in a memristor accelerator is set by the DAC precision and the memristor device precision. The ADC precision actually required by memristor array computation is often smaller than this theoretical upper bound: the actual requirement is determined by the number of distinct current levels on the bit line, and when a large share of the weights along an array column are 0, the per-device multiplication results are always 0, the number of distinct current levels on the bit line is greatly reduced, and the ADC precision can be reduced accordingly.
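The dependence of the ADC requirement on the bit-line current levels can be made concrete for the simplest case; the bound below assumes 1-bit DAC inputs and single-bit memristor cells, which is an assumption of this illustration rather than the exact formula referred to above:

% With R rows driven by 1-bit inputs onto single-bit cells, a bit line can take at
% most R+1 distinct current sums, so the theoretical ADC resolution is bounded by
b_{\mathrm{ADC}} \le \lceil \log_2 (R + 1) \rceil .
% If only k of the R cells in a column are in the low-resistance state (the others
% represent weight 0), the number of distinct sums drops to k+1 and the requirement
% relaxes to
b_{\mathrm{ADC}} \le \lceil \log_2 (k + 1) \rceil .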
The power consumption of the memristor compute array varies dynamically with the actually mapped weights and the feature-map input data. An input value of 1, corresponding to a high voltage on a word line, is denoted HVL, and an input value of 0, corresponding to a low voltage on a word line, is denoted LVL. A memristor cell has two states, the low-resistance state LRS and the high-resistance state HRS, so with reference to FIG. 4 the power consumed during memristor array computation can be divided into four parts: (1) HVL on HRS, (2) HVL on LRS, (3) LVL on HRS, and (4) LVL on LRS. According to Ohm's law, device power consumption is positively related to the applied voltage and negatively related to the resistance. The highest power of the four parts is therefore the HVL-on-LRS case, i.e. a high voltage applied across a low-resistance device, while LVL on HRS consumes the least. Considering that LVL is typically 0 V in a practical memristor accelerator system, the power consumption of parts (3) and (4) is negligible. Reducing the number of HVL-on-LRS cases in the memristor array therefore reduces its power consumption; and since the feature map is input data that cannot be changed, the number of 1 bits in the weights must be reduced.
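A rough, non-limiting sketch of this decomposition is shown below; the voltage and resistance values are illustrative assumptions, and the function simply tallies the four input/state combinations cell by cell using Ohm's law:

import numpy as np

def estimate_array_power(input_bits, cell_is_lrs, v_high=0.2, v_low=0.0,
                         r_lrs=1.0e4, r_hrs=1.0e6):
    # input_bits : (R,) word-line input bits, 1 -> HVL, 0 -> LVL.
    # cell_is_lrs: (R, C) boolean map, True where a cell stores the low-resistance state.
    input_bits = np.asarray(input_bits)
    cell_is_lrs = np.asarray(cell_is_lrs, dtype=bool)
    v = np.where(input_bits == 1, v_high, v_low)[:, None]   # per-row word-line voltage
    r = np.where(cell_is_lrs, r_lrs, r_hrs)                 # per-cell resistance
    p = v ** 2 / r                                           # Ohm's law: P = V^2 / R for every cell
    hvl = np.broadcast_to(v > 0, p.shape)
    return {
        "HVL_on_LRS": float(p[hvl & cell_is_lrs].sum()),     # dominant term
        "HVL_on_HRS": float(p[hvl & ~cell_is_lrs].sum()),
        "LVL_total":  float(p[~hvl].sum()),                  # zero when LVL = 0 V
    }

Reducing the count of cells that fall into the HVL-on-LRS case, for example by cutting the number of 1 bits in the stored weights, directly shrinks the dominant term of this estimate.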
FIG. 5 shows the power-of-two quantization process of this embodiment. The scheme first groups the weights of the current layer to be processed: one group contains the weights to be quantized, and the other group keeps floating-point weights that are not yet quantized and continue to participate in network retraining. The grouping criterion is that weights with large numerical values have more influence on the model, i.e. the weights with large absolute values are quantized first according to the grouping rate α. After quantization, the accuracy lost to quantization is recovered by retraining. The grouping rate α is then raised and the quantization and retraining steps are performed again, until all weights of the current layer are quantized to the 2^n form.
After this processing, all non-zero floating-point weights of the originally regularized sparse network model become powers of two of the form 2^n. When the quantization precision is 8, the quantized range of the current layer is {±2^-1, …, ±2^-8, 0}. Dividing this range uniformly by its minimum magnitude 2^-8 scales the whole power-of-two quantization range to {±2^7, …, ±2^0, 0}. When the scaled quantization range is represented in binary code, the binary code of each element consists of 7 zero bits and 1 one bit. Mapping this set of scaled weights onto memristors, each weight therefore requires 7 HRS memristor devices and 1 LRS memristor device.
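The re-expression described above can be sketched as follows (quantization precision 8 is assumed as in the example; the function name is illustrative, and the sketch assumes signs are handled separately, for instance by a positive/negative array pair):

import numpy as np

def pow2_weights_to_codes(quant_w, exp_min=-8, exp_max=-1):
    # Map power-of-two weights in {+-2^-1, ..., +-2^-8, 0} to 8-bit one-hot codes.
    # Dividing by the minimum magnitude 2^exp_min scales the range to
    # {+-2^7, ..., +-2^0, 0}; each non-zero magnitude then has exactly one '1' bit,
    # i.e. 1 LRS cell and 7 HRS cells per weight when mapped onto memristors.
    n_bits = exp_max - exp_min + 1                        # 8 bits for quantization precision 8
    scaled = np.abs(np.asarray(quant_w)) / (2.0 ** exp_min)
    codes = np.zeros(scaled.shape + (n_bits,), dtype=np.uint8)
    nz = scaled > 0
    bit_pos = np.round(np.log2(scaled[nz])).astype(int)   # which single bit is set (0..7)
    codes[nz, n_bits - 1 - bit_pos] = 1                   # MSB-first one-hot code
    lrs_per_weight = codes.sum(axis=-1)                   # 1 for non-zero weights, 0 for pruned ones
    return codes, lrs_per_weight

Summing lrs_per_weight over a column then gives the number of low-resistance cells driving that bit line, which is what determines both the HVL-on-LRS power term and the remaining ADC precision requirement.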
Compared with conventional uniform quantization, power-of-two quantization greatly reduces the number of LRS devices used in the memristor array. When the number of LRS devices in the array drops sharply, each column contains a large number of HRS devices corresponding to 0 bit values; these devices perform the multiplication of a weight 0 with the feature-map data, the result is 0, and its physical expression is an extremely small current. When a column contains many HRS devices, the upper limit of the number of distinct current sums accumulated on the bit line is therefore reduced, which lowers the ADC precision requirement. Likewise, because the number of LRS devices is reduced, the proportion of HVL-on-LRS cases, which account for the vast majority of the power consumption, is reduced, while the proportions of the HVL-on-HRS and LVL-on-HRS cases rise. Overall, the power consumption of the memristor array is greatly reduced.
The embodiment of the invention provides a neural network model compression method for a memristor accelerator, wherein a memristor array-friendly regularized sparse model is obtained by an array-aware regularized incremental pruning algorithm under the condition of ensuring the accuracy of the model so as to save memristor array resources; through a power-of-two quantization algorithm, a weight binary coding form is constrained so as to reduce the precision requirement of an ADC unit in an accelerator system and the number of low-resistance memristor devices in an array, thereby reducing the overall power consumption.
Those skilled in the art will appreciate that, in addition to implementing the system and its various devices, modules, units provided by the present invention as pure computer readable program code, the system and its various devices, modules, units provided by the present invention can be fully implemented by logically programming method steps in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Therefore, the system and various devices, modules and units thereof provided by the present invention can be regarded as a hardware component, and the devices, modules and units included therein for implementing various functions can also be regarded as structures within the hardware component; means, modules, units for performing the various functions may also be regarded as structures within both software modules and hardware components for performing the method.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.

Claims (8)

1. A memristor accelerator-oriented neural network model compression method is characterized by comprising the following steps:
step 1: cutting an original network model through an array-aware regularized incremental pruning algorithm to obtain a memristor array-friendly regularized sparse model;
step 2: by means of a power quantization algorithm of two, the ADC precision requirement and the number of low-resistance devices in the memristor array are reduced so as to reduce the system power consumption overall;
the array-aware regularized incremental pruning algorithm in the step 1 comprises the following steps:
array awareness: adjusting the pruning granularity according to the actual memristor array size during network pruning;
combination of incremental pruning and layered sparsity: incremental pruning prunes the neural network model and recovers the model accuracy; layered sparsity sets different pruning-rate parameters for each network layer according to its position in the model, and the per-layer pruning parameters follow a low-high-low strategy over the pruning rate;
threshold calibration: the calibration scheme is to divide the L2 norm of each row by the number of valid columns in the row to achieve normalization.
2. The method of claim 1, wherein the incremental pruning comprises:
step 1-1: setting the pruning rate to an initial value, so that only a small number of weights at the granularity of memristor array rows are pruned in the first pruning;
step 1-2: recovering the previous model accuracy by retraining;
step 1-3: increasing the pruning rate by the pruning-rate step so as to prune the model further;
step 1-4: recovering the accuracy by retraining, and repeating the process of raising the pruning rate and recovering the accuracy until the accuracy of the whole model reaches the designed threshold target and the number of network training iterations reaches the set requirement.
3. The method of claim 1, wherein the threshold calibration comprises:
step 1-5: converting the current layer to be pruned into general matrix multiplication form;
step 1-6: computing the L2 norm of each row at the granularity of memristor array rows and sorting the norms by size; obtaining the pruning L2-norm threshold according to the pruning rate of the current layer in the layered sparsity configuration table, setting row weights below the threshold to 0, and keeping the remaining weights for retraining.
4. The method of claim 1, wherein the power-of-two quantization algorithm in step 2 comprises:
step 2-1: grouping the weights of the current layer to be processed, wherein one group contains the weights to be quantized and the other group keeps floating-point weights that are not yet quantized and participate in network retraining;
step 2-2: after quantization, recovering the accuracy lost to quantization through retraining;
step 2-3: increasing the grouping rate α, and performing the quantization and retraining steps again until all weights of the current layer network are quantized to the 2^n form.
5. A memristor accelerator-oriented neural network model compression system, the system comprising:
module M1: cutting an original network model through an array-aware regularized incremental pruning algorithm to obtain a memristor array-friendly regularized sparse model;
module M2: by means of a power quantization algorithm of two, the ADC precision requirement and the number of low-resistance devices in the memristor array are reduced so as to reduce the system power consumption overall;
wherein the array-aware regularized incremental pruning algorithm in the module M1 includes:
array awareness: adjusting the pruning granularity according to the actual memristor array size during network pruning;
combination of incremental pruning and layered sparsity: incremental pruning prunes the neural network model and recovers the model accuracy; layered sparsity sets different pruning-rate parameters for each network layer according to its position in the model, and the per-layer pruning parameters follow a low-high-low strategy over the pruning rate;
threshold calibration: the calibration scheme is to divide the L2 norm of each row by the number of valid columns in the row to achieve normalization.
6. The system of claim 5, wherein the incremental pruning comprises:
setting the pruning rate to an initial value, so that only a small number of weights at the granularity of memristor array rows are pruned in the first pruning;
recovering the previous model accuracy by retraining;
increasing the pruning rate by the pruning-rate step so as to prune the model further;
recovering the accuracy by retraining, and repeating the process of raising the pruning rate and recovering the accuracy until the accuracy of the whole model reaches the designed threshold target and the number of network training iterations reaches the set requirement.
7. The system of claim 5, wherein the threshold calibration comprises:
converting the current layer to be pruned into general matrix multiplication form;
computing the L2 norm of each row at the granularity of memristor array rows and sorting the norms by size; obtaining the pruning L2-norm threshold according to the pruning rate of the current layer in the layered sparsity configuration table, setting row weights below the threshold to 0, and keeping the remaining weights for retraining.
8. The system according to claim 5, wherein the power-of-two quantization algorithm in the module M2 comprises:
grouping the weights of the current layer to be processed, wherein one group contains the weights to be quantized and the other group keeps floating-point weights that are not yet quantized and participate in network retraining;
after quantization, recovering the accuracy lost to quantization through retraining;
increasing the grouping rate α, and performing the quantization and retraining steps again until all weights of the current layer network are quantized to the 2^n form.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110281982.6A CN113052307B (en) 2021-03-16 2021-03-16 Memristor accelerator-oriented neural network model compression method and system


Publications (2)

Publication Number Publication Date
CN113052307A CN113052307A (en) 2021-06-29
CN113052307B (en) 2022-09-06

Family

ID=76513119

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110281982.6A Active CN113052307B (en) 2021-03-16 2021-03-16 Memristor accelerator-oriented neural network model compression method and system

Country Status (1)

Country Link
CN (1) CN113052307B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113553293A (en) * 2021-07-21 2021-10-26 清华大学 Storage and calculation integrated device and calibration method thereof
CN115311506B (en) * 2022-10-11 2023-03-28 之江实验室 Image classification method and device based on quantization factor optimization of resistive random access memory

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109635936A (en) * 2018-12-29 2019-04-16 杭州国芯科技股份有限公司 A kind of neural networks pruning quantization method based on retraining
CN109711532A (en) * 2018-12-06 2019-05-03 东南大学 A kind of accelerated method inferred for hardware realization rarefaction convolutional neural networks
CN109791628A (en) * 2017-12-29 2019-05-21 清华大学 Neural network model splits' positions method, training method, computing device and system
CN110378468A (en) * 2019-07-08 2019-10-25 浙江大学 A kind of neural network accelerator quantified based on structuring beta pruning and low bit
CN110633747A (en) * 2019-09-12 2019-12-31 网易(杭州)网络有限公司 Compression method, device, medium and electronic device for target detector

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3657398A1 (en) * 2017-05-23 2020-05-27 Shanghai Cambricon Information Technology Co., Ltd Weight quantization method for a neural network and accelerating device therefor

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109791628A (en) * 2017-12-29 2019-05-21 清华大学 Neural network model splits' positions method, training method, computing device and system
CN109711532A (en) * 2018-12-06 2019-05-03 东南大学 A kind of accelerated method inferred for hardware realization rarefaction convolutional neural networks
CN109635936A (en) * 2018-12-29 2019-04-16 杭州国芯科技股份有限公司 A kind of neural networks pruning quantization method based on retraining
CN110378468A (en) * 2019-07-08 2019-10-25 浙江大学 A kind of neural network accelerator quantified based on structuring beta pruning and low bit
CN110633747A (en) * 2019-09-12 2019-12-31 网易(杭州)网络有限公司 Compression method, device, medium and electronic device for target detector

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"Learning the sparsity for ReRAM: mapping and pruning sparse neural network for ReRAM based accelerator";Jilan Lin 等;《ASPDAC "19: Proceedings of the 24th Asia and South Pacific Design Automation Conference》;20190131;第639-644页 *
"PattPIM: A Practical ReRAM-Based DNN Accelerator by Reusing Weight Pattern Repetitions";Yuhao Zhang 等;《2020 57th ACM/IEEE Design Automation Conference (DAC)》;20201009;第1-6页 *
"基于粗粒度数据流架构的稀疏卷积神经网络加速";吴欣欣 等;《计算机研究与发展》;20210228;第58卷(第7期);第1504-1517页 *

Also Published As

Publication number Publication date
CN113052307A (en) 2021-06-29

Similar Documents

Publication Publication Date Title
US20220374688A1 (en) Training method of neural network based on memristor and training device thereof
Sun et al. Fully parallel RRAM synaptic array for implementing binary neural network with (+ 1,− 1) weights and (+ 1, 0) neurons
CN113052307B (en) Memristor accelerator-oriented neural network model compression method and system
EP3389051B1 (en) Memory device and data-processing method based on multi-layer rram crossbar array
Chen et al. Technology-design co-optimization of resistive cross-point array for accelerating learning algorithms on chip
Zhang et al. Sign backpropagation: An on-chip learning algorithm for analog RRAM neuromorphic computing systems
EP3627401B1 (en) Method and device for training neural network
CN109635935B (en) Model adaptive quantization method of deep convolutional neural network based on modular length clustering
Ni et al. Distributed in-memory computing on binary RRAM crossbar
Meng et al. Structured pruning of RRAM crossbars for efficient in-memory computing acceleration of deep neural networks
US11544540B2 (en) Systems and methods for neural network training and deployment for hardware accelerators
CN110569962A (en) Convolution calculation accelerator based on 1T1R memory array and operation method thereof
US11562220B2 (en) Neural processing unit capable of reusing data and method thereof
Chen et al. A high-throughput and energy-efficient RRAM-based convolutional neural network using data encoding and dynamic quantization
Lin et al. Rescuing rram-based computing from static and dynamic faults
Peng et al. Inference engine benchmarking across technological platforms from CMOS to RRAM
US20230409892A1 (en) Neural processing unit being operated based on plural clock signals having multi-phases
Chen et al. WRAP: weight RemApping and processing in RRAM-based neural network accelerators considering thermal effect
US20220075601A1 (en) In-memory computing method and in-memory computing apparatus
Chang et al. T-eap: Trainable energy-aware pruning for nvm-based computing-in-memory architecture
CN113705784A (en) Neural network weight coding method based on matrix sharing and hardware system
Peng et al. Network Pruning Towards Highly Efficient RRAM Accelerator
Li et al. Memory saving method for enhanced convolution of deep neural network
Kim et al. SNPU: An energy-efficient spike domain deep-neural-network processor with two-step spike encoding and shift-and-accumulation unit
TWI798798B (en) In-memory computing method and in-memory computing apparatus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant