CN116739049A - Network compression method and device and storage medium - Google Patents


Info

Publication number
CN116739049A
CN116739049A (application number CN202210194057.4A)
Authority
CN
China
Prior art keywords
parameter sharing
clustering
neural network
sharing range
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210194057.4A
Other languages
Chinese (zh)
Inventor
李文进 (Li Wenjin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd filed Critical Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN202210194057.4A
Publication of CN116739049A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The embodiment of the application provides a network compression method, a device and a storage medium, comprising the following steps: determining, from a neural network, a network layer on which weight clustering is to be performed, and acquiring the quantization granularity and the number of clusters corresponding to the network layer; determining the parameter sharing granularity corresponding to the network layer according to the quantization granularity; dividing the network layer into at least one parameter sharing range according to the parameter sharing granularity; allocating a set of quantization parameters to each parameter sharing range in the at least one parameter sharing range; and quantizing and clustering the weights in the at least one parameter sharing range based on the at least one set of quantization parameters and the number of clusters corresponding to the at least one parameter sharing range, so as to obtain a compressed neural network.

Description

Network compression method and device and storage medium
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a network compression method and apparatus, and a storage medium.
Background
Neural networks have been widely used for classification, recognition, noise reduction, super-resolution and similar tasks in fields such as computer vision and autonomous driving. However, their complex structure means that even inference consumes a large amount of power and computing resources, which severely limits the deployment of such technology on power- and resource-constrained mobile and embedded platforms; the network therefore has to undergo compression before deployment. A neural network may be compressed using a parameter sharing method.
Two parameter sharing approaches are common. The first clusters all individual weights layer by layer in a pre-trained neural network, obtains the center value of the cluster to which each weight belongs, replaces the original value of each weight with that cluster center value, then trains the cluster center values of all layers in the pre-trained neural network and performs subsequent processing. The second evenly divides each layer's weights into several weight groups according to a certain rule, treats each whole weight group as a vector in a high-dimensional space, clusters these vectors in that space, and uses the weight groups as the minimum update units in subsequent training. However, both of the above methods may result in low accuracy of the compressed neural network.
Disclosure of Invention
The embodiment of the application provides a network compression method, a network compression device and a storage medium, which can improve the precision of a compressed neural network.
The technical scheme of the application is realized as follows:
in a first aspect, an embodiment of the present application proposes a network compression method, where the method includes:
determining a network layer for weight clustering to be executed from a neural network, and acquiring quantization granularity and clustering quantity corresponding to the network layer;
according to the quantization granularity, determining the parameter sharing granularity corresponding to the network layer; dividing the network layer into at least one parameter sharing range according to the parameter sharing granularity;
respectively allocating a group of quantization parameters for each parameter sharing range in the at least one parameter sharing range; and quantizing and clustering weights in the at least one parameter sharing range based on at least one group of quantization parameters corresponding to the at least one parameter sharing range and the clustering quantity to obtain the compressed neural network.
In a second aspect, an embodiment of the present application proposes a network compression device, the device including:
the determining unit is used for determining a network layer for performing weight clustering from the neural network; according to the quantization granularity, determining the parameter sharing granularity corresponding to the network layer;
the acquisition unit is used for acquiring the quantization granularity and the clustering number corresponding to the network layer;
the dividing unit is used for dividing the network layer into at least one parameter sharing range according to the parameter sharing granularity;
an allocation unit for allocating a set of quantization parameters for each of the at least one parameter sharing range, respectively;
and the compression unit is used for quantizing and clustering the weights in the at least one parameter sharing range based on the at least one group of quantized parameters corresponding to the at least one parameter sharing range and the clustering quantity, so as to obtain the compressed neural network.
In a third aspect, an embodiment of the present application provides a network compression apparatus, including: a processor, a memory, and a communication bus; the processor implements the network compression method described above when executing the running program stored in the memory.
In a fourth aspect, an embodiment of the present application provides a storage medium having stored thereon a computer program which, when executed by a processor, implements the network compression method described above.
The embodiment of the application provides a network compression method, a device and a storage medium, wherein the method comprises the following steps: determining, from a neural network, a network layer on which weight clustering is to be performed, and acquiring the quantization granularity and the number of clusters corresponding to the network layer; determining the parameter sharing granularity corresponding to the network layer according to the quantization granularity; dividing the network layer into at least one parameter sharing range according to the parameter sharing granularity; allocating a set of quantization parameters to each parameter sharing range in the at least one parameter sharing range; and quantizing and clustering the weights in the at least one parameter sharing range based on the at least one set of quantization parameters and the number of clusters corresponding to the at least one parameter sharing range, so as to obtain a compressed neural network. By adopting this implementation scheme, the parameter sharing range of each network layer is determined using the quantization granularity of that network layer, so that, on the one hand, the parameter sharing strategy remains applicable in a quantization scenario, cluster failure after quantization is avoided, and the precision of the compressed neural network is thereby improved; on the other hand, finer-grained parameter sharing is performed within each network layer, which can further improve the precision of the compressed neural network.
Drawings
FIG. 1 is a schematic diagram of a method for parameter sharing;
fig. 2 is a flowchart of a network compression method according to an embodiment of the present application;
FIG. 3 is a flow chart of an exemplary network compression method according to an embodiment of the present application;
fig. 4 is a first schematic structural diagram of a network compression device according to an embodiment of the present application;
fig. 5 is a second schematic structural diagram of a network compression device according to an embodiment of the present application.
Detailed Description
In the field of neural network compression, parameter sharing refers to dividing weights or weight groups that are close in distance or similar in magnitude in the original neural network into a small number of clusters through some clustering algorithm, taking the mean of all weights assigned to the same cluster as that cluster's center value, and replacing the original values of all weights/weight groups in the cluster with it. For the compressed neural network, only the index of the cluster to which each weight belongs is stored at that weight's storage location, instead of the original weight, together with a code table stored at the original bit width in which each cluster center value is kept, thereby compressing the storage space and the memory read/write bandwidth. For example, the storage location of a weight belonging to the 0th cluster stores 0, and the 0th value is looked up from the code table during computation; the storage location of a weight belonging to the 1st cluster stores 1, and so on. For a weight tensor with n weights in one network layer, where each weight originally has a bit width of b bits, after clustering the weight tensor into k clusters the achieved compression ratio r is shown in formula (1):
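Based on the storage accounting just described, where n weights of b bits each are replaced by n index values of ⌈log2 k⌉ bits plus a code table of k center values of b bits each, the compression ratio can plausibly be written as:

    r = (n · b) / (n · ⌈log2 k⌉ + k · b)    (1)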
as to the subsequent training, all weights of the same cluster are updated synchronously, which is equivalent to updating only the central value of each cluster, but not the index value. I.e., the weight that originally belongs to the ith cluster still belongs to the ith cluster after training, but its corresponding ith code table value may be updated by training, therefore, the effective parameter quantity in the whole neural network is greatly reduced compared with the original neural network, the degree of freedom of network parameters is further reduced, the regularization effect is achieved, and certain help is provided for avoiding training and fitting. On the other hand, the data stored in the storage position of each weight is converted into an integer index value through a clustering algorithm by the weight of the floating point, so that the number of non-repeated data in the weight tensor is reduced, the regularity of the data is improved, the follow-up adoption of the existing lossless compression coding technology is facilitated, and the storage space and the transmission bandwidth requirement are further reduced.
The specific weight sharing flow is shown in fig. 1: the 32-bit floating-point weights are clustered, a 2-bit unsigned integer cluster index is stored at each weight's storage location, and a code table consisting of the cluster centers is generated; meanwhile, the gradients of the 32-bit floating-point weights are computed, the gradients of weights belonging to the same cluster are accumulated to obtain a gradient accumulation value for each cluster, and the cluster center values in the code table are then fine-tuned with these accumulated values, so that network training of the compressed neural network can continue.
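Purely as an illustration of this flow (not part of the patent text), a rough Python/NumPy sketch is given below; the function names, the simple 1-D k-means, and the learning rate are all assumptions:

    import numpy as np

    def weight_share_layer(weights, k=4, iters=20):
        """Cluster one layer's float32 weights into k clusters (simple 1-D k-means).
        Returns the per-weight cluster indices (what is stored in place of each weight)
        and the code table holding the k cluster center values."""
        flat = weights.astype(np.float32).ravel()
        centers = np.linspace(flat.min(), flat.max(), k)   # initialize over the weight range
        for _ in range(iters):
            idx = np.argmin(np.abs(flat[:, None] - centers[None, :]), axis=1)
            for c in range(k):
                if np.any(idx == c):
                    centers[c] = flat[idx == c].mean()
        return idx.reshape(weights.shape), centers

    def finetune_code_table(centers, idx, weight_grads, lr=1e-3):
        """Accumulate the gradients of all weights in the same cluster and use the
        accumulated value to fine-tune that cluster's center in the code table."""
        g, flat_idx = weight_grads.ravel(), idx.ravel()
        for c in range(len(centers)):
            mask = flat_idx == c
            if np.any(mask):
                centers[c] -= lr * g[mask].sum()
        return centers

In fig. 1 the stored indices are 2-bit unsigned integers, which is why k defaults to 4 in this sketch.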
Currently, for the parameter sharing method that clusters single weights layer by layer to obtain cluster center values, scenes with quantization requirements may quantize at different granularities according to the precision requirement, for example layer-by-layer quantization or channel-by-channel quantization. Channel-by-channel quantization applies different quantization parameters to each channel of each layer of a convolutional neural network. If clustering is still performed with the layer as the unit, all weights in the same layer are clustered together; after quantization, weights that originally belonged to the same cluster have different quantization parameters applied in different channels and thus become different values, so the code table becomes invalid. Therefore, for scenes with high precision requirements where channel-by-channel quantization is intended, this technique cannot be directly applied.
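A toy numerical illustration of this conflict (all values chosen arbitrarily): two weights share one cluster center of 0.30 but sit in channels with different quantization scales, so after quantization they no longer map back to a single shared value:

    center = 0.30                                    # shared cluster center for two weights
    scales = {"channel_0": 0.10, "channel_1": 0.07}  # hypothetical per-channel scales

    for ch, s in scales.items():
        q = round(center / s)                # quantized integer differs per channel
        print(ch, q, round(q * s, 2))        # channel_0: 3 -> 0.3, channel_1: 4 -> 0.28

A single code table entry can no longer represent both weights after dequantization, which is exactly the failure described above.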
Currently, for the parameter sharing method that evenly divides each layer's weights into several weight groups and clusters with the weight group as the unit, the concept of a weight group is introduced in an attempt to exploit the structural properties of deep neural networks. However, clustering in a high-dimensional space is less flexible than clustering single weights and cannot necessarily take care of every weight in each group. The premise of this technique is that the weight groups within the same layer of weights follow some statistical regularity and can form a limited number of clusters rather than being randomly distributed; in a physical sense this implies that the original weights contain a certain redundancy, so that several groups of weights extract similar features at the same time and are suitable to be merged through clustering. Recent models, however, increasingly emphasize end-side (on-device) inference capability, impose obvious constraints on parameter count, and instead pursue innovations in network structure such as residual connections and attention mechanisms. In this case the practical significance of the scheme is reduced, there is a risk of removing too much expressive power, and at the same compression rate the accuracy is not necessarily higher than that of a single-weight clustering scheme.
For a more complete understanding of the nature and the technical content of the embodiments of the present application, reference should be made to the following detailed description of embodiments of the application, taken in conjunction with the accompanying drawings, which are meant to be illustrative only and not limiting of the embodiments of the application.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the application only and is not intended to be limiting of the application.
In the following description, reference is made to "some embodiments", which describe a subset of all possible embodiments; it is to be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with one another without conflict. It should also be noted that the terms "first/second/third" in the embodiments of the present application are used merely to distinguish similar objects and do not represent a particular ordering of the objects; it is understood that "first/second/third" may be interchanged in a particular order or sequence, where allowed, so that the embodiments of the present application described herein can be practiced in an order other than that illustrated or described herein.
To solve the above problem, an embodiment of the present application provides a network compression method, as shown in fig. 2, which may include:
s101, determining a network layer for weight clustering from a neural network, and acquiring quantization granularity and clustering quantity corresponding to the network layer.
The network compression method provided by the embodiment of the application is suitable for neural network compression scenarios in fields such as image processing, speech, natural language processing, and control. The neural network may be a convolutional neural network (Convolutional Neural Network, CNN), a deep neural network (Deep Neural Network, DNN), a recurrent neural network (Recurrent Neural Network, RNN), or another neural network, which may be selected according to the actual situation; the embodiment of the present application is not specifically limited.
In the embodiment of the application, the neural network is a trained neural network, so the method further comprises, before step S101, a process of training an original neural network to obtain the neural network. It should be noted that training is considered complete when the neural network converges or the number of training steps reaches the maximum-step limit. The training here is training of each weight or each weight group in the original neural network.
It should be noted that the training applied to the original neural network may be floating-point or integer training, or quantization-aware training.
Further, in order to improve the accuracy of weight clustering, after the original neural network is trained, the neural network can be further processed to obtain a processed neural network, where the weight clustering accuracy corresponding to the processed neural network is greater than the weight clustering accuracy corresponding to the unprocessed neural network. This processing may be retraining based on spectrally relaxed K-means regularization, which adjusts the weight distribution of the neural network so that it becomes more concentrated, thereby improving the accuracy of weight clustering; any processing method that likewise improves the accuracy of weight clustering may also be used.
It should be noted that for applications with high precision requirements, any operation that severely changes the weight distribution should be avoided; in that case the above process of processing the neural network to obtain a processed neural network is not performed.
In the embodiment of the application, the original neural network may be a floating-point neural network model, a fixed-point neural network model with any bit width, where the bit width may be 16 bits, 12 bits, 11 bits, 10 bits, 8 bits, 6 bits, or 4 bits, or an integer neural network model; the specific choice may be made according to the actual situation, and the embodiment of the application is not specifically limited.
It should be noted that the network layers on which weight clustering is to be performed are some or all of the network layers in the neural network, and include, but are not limited to, convolution layers, fully connected layers, RNN recurrent units, and other network layers suitable for weight clustering.
In the embodiment of the present application, a corresponding quantization granularity is configured for each network layer in advance, where the quantization granularity may be layer-by-layer quantization, channel-by-channel quantization, or quantization according to other standards, and may specifically be selected according to actual situations, and the embodiment of the present application is not specifically limited.
It should be noted that the number of clusters corresponding to different network layers may be the same or different; the number of clusters corresponding to different network layers can be any reasonable positive integer or can be an integer power of 2, and the number can be specifically selected according to actual conditions.
In the embodiment of the application, the clustering quantity can be preconfigured, and can also be output after the neural network is processed. Further, the number of clusters can be fine-tuned by means such as sensitivity index analysis.
S102, determining parameter sharing granularity corresponding to a network layer according to the quantization granularity; and dividing the network layer into at least one parameter sharing range according to the parameter sharing granularity.
In the embodiment of the application, the quantization granularity is greater than or equal to the parameter sharing granularity. If the quantization granularity is layer-by-layer quantization, the parameter sharing granularity may be parameter sharing in a layer-by-layer range, or parameter sharing in a channel-by-channel range, or parameter sharing in a smaller range; and if the quantization granularity is channel-by-channel quantization, the parameter sharing granularity is parameter sharing in a channel-by-channel range or parameter sharing in a smaller range.
For example, for parameter sharing in a layer-by-layer range, all weights/weight groups in the layer participate in clustering together; for parameter sharing in a channel-by-channel range, all weights/weight groups in the same channel participate in clustering together, and weights/weight groups in different channels do not participate in clustering together.
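As an illustrative sketch (the [out_channels, in_channels, kh, kw] layout and the helper name are assumptions, not something specified by the patent), dividing a convolution layer's weight tensor into parameter sharing ranges at these two granularities could look like this:

    import numpy as np

    def split_into_sharing_ranges(weight, granularity="per_channel"):
        """Return the list of sub-tensors, each forming one parameter sharing range."""
        if granularity == "per_layer":
            return [weight]                                      # the whole layer is one range
        if granularity == "per_channel":
            return [weight[c] for c in range(weight.shape[0])]   # one range per output channel
        raise ValueError(f"unknown granularity: {granularity}")

    w = np.random.randn(16, 8, 3, 3).astype(np.float32)
    ranges = split_into_sharing_ranges(w, "per_channel")
    print(len(ranges), ranges[0].shape)   # 16 ranges, each of shape (8, 3, 3)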
In the embodiment of the application, if multiple parameter sharing granularities are compatible with the quantization granularity, the parameter sharing granularity used to divide the network layer can be determined based on rules of thumb or on certain quantitative or qualitative indicators. Such quantitative or qualitative indicators include, but are not limited to, numerical metrics, inference results on a data set, the network structure characteristics of the neural network, and the like.
S103, respectively distributing a group of quantization parameters for each parameter sharing range in at least one parameter sharing range; and quantizing and clustering weights in the at least one parameter sharing range based on the at least one group of quantized parameters and the clustering number corresponding to the at least one parameter sharing range, thereby obtaining the compressed neural network.
In the embodiment of the application, the quantization parameters may be the parameters of a linear transformation such as scaling and offset, for example a scaling coefficient and an offset value.
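A minimal sketch of what one such set of quantization parameters could be, assuming an ordinary asymmetric linear (scale and offset) quantization scheme; the function names and the 8-bit default are assumptions:

    import numpy as np

    def fit_quant_params(x, num_bits=8):
        """Derive a (scale, offset) pair for one parameter sharing range."""
        qmax = 2 ** num_bits - 1
        lo, hi = float(x.min()), float(x.max())
        scale = (hi - lo) / qmax if hi > lo else 1.0
        return scale, lo

    def quantize(x, scale, offset, num_bits=8):
        return np.clip(np.round((x - offset) / scale), 0, 2 ** num_bits - 1).astype(np.int32)

    def dequantize(q, scale, offset):
        return q.astype(np.float32) * scale + offset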
In the embodiment of the application, if the platform on which the neural network is deployed has operator-library and hardware-level support, a compression format of indices plus a code table is adopted. Specifically, the weights in the at least one parameter sharing range are quantized and clustered based on the at least one set of quantization parameters and the number of clusters, to obtain a set of index values corresponding to each parameter sharing range in the at least one parameter sharing range and one code table corresponding to that set of index values, where the cluster center values within the parameter sharing range are stored in the code table; the set of index values is then written into the storage locations of the weights in the corresponding parameter sharing range, yielding the compressed neural network. In this way, the corresponding cluster center value can be looked up, via the index value at a weight's storage location, from the code table of the corresponding parameter sharing range.
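Putting the earlier sketches together, one possible and purely illustrative realization of this index-plus-code-table format, reusing the hypothetical weight_share_layer, fit_quant_params, quantize and dequantize helpers and the ranges list sketched above, might be:

    def compress_range(range_weights, num_bits=8, k=4):
        """Quantize one parameter sharing range with its own quantization parameters,
        then cluster the dequantized weights into k clusters.
        Returns the per-weight index values, the range's code table, and its quant params."""
        scale, offset = fit_quant_params(range_weights, num_bits)
        wq = dequantize(quantize(range_weights, scale, offset, num_bits), scale, offset)
        idx, code_table = weight_share_layer(wq, k)
        return idx, code_table, (scale, offset)

    # one set of quantization parameters and one code table per parameter sharing range
    compressed_layer = [compress_range(r) for r in ranges]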
In the embodiment of the application, if the platform on which the neural network is deployed does not have operator-library and hardware-level support, the corresponding cluster center values are stored directly at the weights' storage locations. Specifically, the weights in the at least one parameter sharing range are quantized and clustered based on the at least one set of quantization parameters and the number of clusters, to obtain the cluster center values within each parameter sharing range of the at least one parameter sharing range; the cluster center values within each parameter sharing range are then written into the storage locations of the weights in the corresponding parameter sharing range, yielding the compressed neural network.
It should be noted that, the clustering algorithm applied to clustering the weights in the at least one parameter sharing range may be a K-means algorithm, or may be any other clustering algorithm, specifically may be selected according to the actual situation, and the embodiment of the present application is not limited specifically.
In the embodiment of the application, the clustering unit for clustering the weights in the at least one parameter sharing range may be a single weight, or may be a set of weights such as a weight bar, a filter, a feature map, etc., and may be specifically selected according to actual situations.
Further, after the compressed neural network is obtained, training is performed on the compressed neural network to obtain a trained neural network. Wherein the training of the compressed neural network is only directed to clustering center points therein.
In the embodiment of the application, a loss function comprising a parameter sharing loss term is firstly generated, and the compressed neural network is trained by utilizing the loss function to obtain the trained neural network. The other terms of the loss function may be the same as those used in the training process of the original neural network, or more loss terms may be selected or added in the loss function, or the loss terms in the loss function may be learned in the training process, and may be specifically selected according to the actual situation.
In an alternative embodiment, to reduce the loss of accuracy, the loss function may be implemented in a "teacher" (distillation-like) manner, computing the error between the representations of the compressed neural network and those of the original neural network and/or other models at the output layer and/or intermediate layers.
Illustratively, taking convolution as an example, the parameter sharing loss term is shown in formula (2):

    L_ws = ||conv(X, W) - conv(X, W_c)||_n    (2)

where L_ws is the parameter sharing loss term, X is the input feature map tensor, W is the weight tensor before clustering, W_c is the weight tensor after the original weight values have been replaced by their cluster center values, and ||·||_n denotes the Ln norm, where n can be any reasonable natural number (typically 2).
It can be appreciated that training the compressed neural network by the loss function including the parameter sharing loss term can reduce the influence of the parameter sharing on the network expression capability, thereby improving the accuracy.
In the embodiment of the application, the compressed neural network can be trained as shown in fig. 1: a gradient is first computed independently for each weight, then all gradients belonging to the same cluster are accumulated and the cluster center point is updated. Alternatively, provided the Deep Learning (DL) framework supports it, the computation graph can be modified so that each weight is redirected to the center point of its cluster, so that the center points directly participate in forward inference and back-propagation updates.
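For the second option, redirecting each weight to its cluster center inside the computation graph, a rough PyTorch-style sketch (the class name and layout are assumptions, not the patent's implementation) could be:

    import torch
    import torch.nn.functional as F

    class SharedWeightConv2d(torch.nn.Module):
        """Rebuilds the weight tensor from a trainable code table in the forward pass,
        so back-propagation accumulates gradients directly into the cluster centers."""
        def __init__(self, idx, code_table):
            super().__init__()
            self.register_buffer("idx", idx)                  # fixed integer cluster indices
            self.code_table = torch.nn.Parameter(code_table)  # trainable cluster center values

        def forward(self, x):
            w = self.code_table[self.idx]                     # gather centers into weight shape
            return F.conv2d(x, w)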
The data set required for training the original neural network, the data set required for the weight clustering preprocessing of the neural network, and the data set required for training the compressed neural network may each be the whole training set, a fixed part of the whole training set, a part of the training set randomly drawn each time, or another independent data set. The data sets used in multiple iterations of the same step may be the same or different, and the data sets used in different steps may also be the same or different.
Further, if there are subsequent processing steps such as further compression, model format conversion, or preprocessing for end-side (on-device) deployment, these steps may be entered after the trained neural network is obtained.
It can be understood that the quantization granularity of the network layer is utilized to determine the parameter sharing range of each network layer, so that on one hand, the parameter sharing strategy can be applied to a quantization scene, cluster failure after quantization is avoided, and the precision of the compressed neural network is further improved; on the other hand, the finer granularity parameter sharing is carried out on each network layer, so that the precision of the compressed neural network can be improved.
Based on the above embodiments, the embodiments of the present application provide a network compression method, as shown in fig. 3, which may include:
1. Train the original neural network until it converges or the number of training steps reaches the maximum-step threshold, obtaining the neural network;
2. determine, from the neural network, the network layers on which weight clustering is to be performed;
3. traverse each such network layer of the neural network and select the parameter sharing ranges layer by layer according to each network layer's quantization granularity;
4. allocate a set of quantization parameters to each parameter sharing range, and cluster and quantize the weights within each parameter sharing range to obtain the compressed neural network;
5. train the cluster center values in the compressed neural network until it converges or the number of training steps reaches the maximum-step threshold, obtaining the trained neural network;
6. perform post-processing on the trained neural network.
Based on the above embodiments, the embodiment of the present application provides a network compression device 1. As shown in fig. 4, the network compression apparatus 1 includes:
a determining unit 11, configured to determine a network layer for which weight clustering is to be performed from a neural network; according to the quantization granularity, determining the parameter sharing granularity corresponding to the network layer;
an obtaining unit 12, configured to obtain a quantization granularity and a cluster number corresponding to the network layer;
a dividing unit 13, configured to divide the network layer into at least one parameter sharing range according to the parameter sharing granularity;
an allocation unit 14 for allocating a set of quantization parameters for each of the at least one parameter sharing range, respectively;
and the compressing unit 15 is configured to quantize and cluster the weights in the at least one parameter sharing range based on the at least one set of quantization parameters corresponding to the at least one parameter sharing range and the number of clusters, to obtain a compressed neural network.
Optionally, the compressing unit 15 is specifically configured to quantize and cluster weights in the at least one parameter sharing range based on the at least one set of quantization parameters and the number of clusters, obtain a set of index values corresponding to each parameter sharing range in the at least one parameter sharing range and a code table corresponding to the set of index values, and store a cluster center value in the parameter sharing range in the code table; updating the group of index values to the storage positions of the weights in the corresponding parameter sharing range to obtain the compressed neural network;
and/or the compressing unit 15 is specifically configured to quantize and cluster the weights in the at least one parameter sharing range based on the at least one set of quantization parameters and the number of clusters, to obtain a cluster center value in each parameter sharing range in the at least one parameter sharing range; and respectively updating the clustering center value in each parameter sharing range to the storage position of the weight value in a corresponding parameter sharing range to obtain the compressed neural network.
Optionally, the apparatus further includes: a generating unit and a training unit;
the generating unit is used for generating a loss function comprising parameter sharing loss items;
the training unit is used for training the compressed neural network by utilizing the loss function to obtain a trained neural network.
Optionally, the apparatus further includes: a preprocessing unit;
the preprocessing unit is used for carrying out weight clustering preprocessing on the neural network to obtain a processed neural network, and the weight clustering accuracy corresponding to the processed neural network is larger than that corresponding to the neural network.
Optionally, the clustering number is output and/or preconfigured after the neural network is subjected to weight clustering pretreatment.
Optionally, the quantization granularity is greater than or equal to the parameter sharing granularity.
The network compression device provided by the embodiment of the application determines a network layer for weight clustering to be executed from a neural network, and acquires quantization granularity and clustering quantity corresponding to the network layer; according to the quantization granularity, determining the parameter sharing granularity corresponding to the network layer; dividing the network layer into at least one parameter sharing range according to the parameter sharing granularity; respectively allocating a group of quantization parameters for each parameter sharing range in at least one parameter sharing range; and quantizing and clustering weights in the at least one parameter sharing range based on the at least one group of quantized parameters and the clustering number corresponding to the at least one parameter sharing range, thereby obtaining the compressed neural network. Therefore, the network compression device provided by the embodiment utilizes the quantization granularity of the network layers to determine the parameter sharing range of each network layer, so that on one hand, the parameter sharing strategy can be applicable in a quantization scene, cluster failure after quantization is avoided, and the precision of the compressed neural network is further improved; on the other hand, the finer granularity parameter sharing is carried out on each network layer, so that the precision of the compressed neural network can be improved.
Fig. 5 is a second schematic diagram of the component structure of a network compression device 1 according to an embodiment of the present application. In practical application, based on the same disclosed concept as the above embodiment, as shown in fig. 5, the network compression device 1 of this embodiment includes: a processor 16, a memory 17 and a communication bus 18.
In a specific embodiment, the determining unit 11, the acquiring unit 12, the dividing unit 13, the allocating unit 14, the compressing unit 15, the generating unit, the training unit and the preprocessing unit may be implemented by a processor 16 located on the network compression device 1, where the processor 16 may be at least one of an application specific integrated circuit (ASIC, Application Specific Integrated Circuit), a digital signal processor (DSP, Digital Signal Processor), a digital signal processing device (DSPD, Digital Signal Processing Device), a programmable logic device (PLD, Programmable Logic Device), a field programmable gate array (FPGA, Field Programmable Gate Array), a CPU, a controller, a microcontroller, and a microprocessor. It will be appreciated that the electronics used to implement the above processor functions may be different for different devices, and this embodiment is not specifically limited.
In the embodiment of the present application, the communication bus 18 is used to implement connection communication between the processor 16 and the memory 17; the processor 16 implements the following network compression method when executing the running program stored in the memory 17:
determining a network layer for weight clustering to be executed from a neural network, and acquiring quantization granularity and clustering quantity corresponding to the network layer; according to the quantization granularity, determining the parameter sharing granularity corresponding to the network layer; dividing the network layer into at least one parameter sharing range according to the parameter sharing granularity; respectively allocating a group of quantization parameters for each parameter sharing range in the at least one parameter sharing range; and quantizing and clustering weights in the at least one parameter sharing range based on at least one group of quantization parameters corresponding to the at least one parameter sharing range and the clustering quantity to obtain the compressed neural network.
Further, the processor 16 is further configured to quantize and cluster weights in the at least one parameter sharing range based on the at least one set of quantization parameters and the number of clusters, obtain a set of index values corresponding to each parameter sharing range in the at least one parameter sharing range and a code table corresponding to the set of index values, and store a cluster center value in the one parameter sharing range in the code table; updating the group of index values to the storage positions of the weights in the corresponding parameter sharing range to obtain the compressed neural network; and/or, quantizing and clustering weights in the at least one parameter sharing range based on the at least one set of quantization parameters and the number of clusters to obtain a cluster center value in each parameter sharing range in the at least one parameter sharing range; and respectively updating the clustering center value in each parameter sharing range to the storage position of the weight value in a corresponding parameter sharing range to obtain the compressed neural network.
Further, the processor 16 is further configured to generate a loss function including a parameter sharing loss term; and training the compressed neural network by using the loss function to obtain a trained neural network.
Further, the above processor 16 is further configured to perform weight clustering preprocessing on the neural network to obtain a processed neural network, where the weight clustering accuracy corresponding to the processed neural network is greater than the weight clustering accuracy corresponding to the neural network.
Further, the clustering quantity is output and/or preconfigured after the neural network is subjected to weight clustering pretreatment.
Further, the quantization granularity is greater than or equal to the parameter sharing granularity.
An embodiment of the present application provides a storage medium having a computer program stored thereon. The computer-readable storage medium stores one or more programs, which can be executed by one or more processors and applied to a network compression device; when executed, the computer program implements the network compression method described above.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present disclosure may be embodied essentially or in a part contributing to the related art in the form of a software product stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk), including several instructions for causing an image display device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the method described in the embodiments of the present disclosure.
The foregoing description is only of the preferred embodiments of the present application, and is not intended to limit the scope of the present application.

Claims (10)

1. A method of network compression, the method comprising:
determining a network layer for weight clustering to be executed from a neural network, and acquiring quantization granularity and clustering quantity corresponding to the network layer;
according to the quantization granularity, determining the parameter sharing granularity corresponding to the network layer; dividing the network layer into at least one parameter sharing range according to the parameter sharing granularity;
respectively allocating a group of quantization parameters for each parameter sharing range in the at least one parameter sharing range; and quantizing and clustering weights in the at least one parameter sharing range based on at least one group of quantization parameters corresponding to the at least one parameter sharing range and the clustering quantity to obtain the compressed neural network.
2. The method of claim 1, wherein the quantizing and clustering weights within the at least one parameter sharing range based on the at least one set of quantization parameters and the number of clusters to obtain a compressed neural network, comprises:
quantizing and clustering weights in the at least one parameter sharing range based on the at least one group of quantization parameters and the clustering quantity to obtain a group of index values corresponding to each parameter sharing range in the at least one parameter sharing range and a code table corresponding to the group of index values, wherein a clustering center value in the parameter sharing range is stored in the code table; updating the group of index values to the storage positions of the weights in the corresponding parameter sharing range to obtain the compressed neural network;
and/or, quantizing and clustering weights in the at least one parameter sharing range based on the at least one set of quantization parameters and the number of clusters to obtain a cluster center value in each parameter sharing range in the at least one parameter sharing range; and respectively updating the clustering center value in each parameter sharing range to the storage position of the weight value in a corresponding parameter sharing range to obtain the compressed neural network.
3. The method according to claim 1 or 2, wherein the quantizing and clustering weights within the at least one parameter sharing range based on the at least one set of quantizing parameters and the number of clusters, after obtaining the compressed neural network, the method further comprises:
generating a loss function comprising a parameter sharing loss term;
and training the compressed neural network by using the loss function to obtain a trained neural network.
4. The method of claim 1, wherein prior to the obtaining the quantization granularity and the number of clusters corresponding to the network layer, the method further comprises:
and carrying out weight clustering pretreatment on the neural network to obtain a treated neural network, wherein the weight clustering accuracy corresponding to the treated neural network is greater than that of the neural network.
5. The method of claim 4, wherein the number of clusters is output after weight clustering preprocessing of the neural network, and/or is preconfigured.
6. The method of claim 1, wherein the quantization granularity is greater than or equal to the parameter sharing granularity.
7. A network compression apparatus, the apparatus comprising:
the determining unit is used for determining a network layer for performing weight clustering from the neural network; according to the quantization granularity, determining the parameter sharing granularity corresponding to the network layer;
the acquisition unit is used for acquiring the quantization granularity and the clustering number corresponding to the network layer;
the dividing unit is used for dividing the network layer into at least one parameter sharing range according to the parameter sharing granularity;
an allocation unit for allocating a set of quantization parameters for each of the at least one parameter sharing range, respectively;
and the compression unit is used for quantizing and clustering the weights in the at least one parameter sharing range based on the at least one group of quantized parameters corresponding to the at least one parameter sharing range and the clustering quantity, so as to obtain the compressed neural network.
8. The apparatus of claim 7, wherein:
the compression unit is specifically configured to quantize and cluster weights in the at least one parameter sharing range based on the at least one set of quantization parameters and the number of clusters, obtain a set of index values corresponding to each parameter sharing range in the at least one parameter sharing range and one code table corresponding to the set of index values, and store a cluster center value in the parameter sharing range in the one code table; updating the group of index values to the storage positions of the weights in the corresponding parameter sharing range to obtain the compressed neural network;
and/or the compression unit is specifically configured to quantize and cluster the weights in the at least one parameter sharing range based on the at least one set of quantization parameters and the number of clusters, so as to obtain a cluster center value in each parameter sharing range in the at least one parameter sharing range; and respectively updating the clustering center value in each parameter sharing range to the storage position of the weight value in a corresponding parameter sharing range to obtain the compressed neural network.
9. A network compression apparatus, the apparatus comprising: a processor, a memory, and a communication bus; the processor, when executing a memory-stored operating program, implements the method of any one of claims 1-6.
10. A storage medium having stored thereon a computer program which, when executed by a processor, implements the method of any of claims 1-6.
CN202210194057.4A 2022-03-01 2022-03-01 Network compression method and device and storage medium Pending CN116739049A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210194057.4A CN116739049A (en) 2022-03-01 2022-03-01 Network compression method and device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210194057.4A CN116739049A (en) 2022-03-01 2022-03-01 Network compression method and device and storage medium

Publications (1)

Publication Number Publication Date
CN116739049A (en) 2023-09-12

Family

ID=87915552

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210194057.4A Pending CN116739049A (en) 2022-03-01 2022-03-01 Network compression method and device and storage medium

Country Status (1)

Country Link
CN (1) CN116739049A (en)

Similar Documents

Publication Publication Date Title
Huang et al. Learning to prune filters in convolutional neural networks
JP2022066192A (en) Dynamic adaptation of deep neural networks
WO2019155064A1 (en) Data compression using jointly trained encoder, decoder, and prior neural networks
CN110969251B (en) Neural network model quantification method and device based on label-free data
JP2022501677A (en) Data processing methods, devices, computer devices, and storage media
JP2022501678A (en) Data processing methods, devices, computer devices, and storage media
JP2022501675A (en) Data processing methods, devices, computer devices, and storage media
Sakr et al. An analytical method to determine minimum per-layer precision of deep neural networks
CN109344893B (en) Image classification method based on mobile terminal
Huang et al. Task scheduling with optimized transmission time in collaborative cloud-edge learning
CN111079899A (en) Neural network model compression method, system, device and medium
CN111612134A (en) Neural network structure searching method and device, electronic equipment and storage medium
US11544542B2 (en) Computing device and method
CN111008694A (en) No-data model quantization compression method based on deep convolution countermeasure generation network
CN109144719A (en) Cooperation discharging method based on markov decision process in mobile cloud computing system
CN112200296A (en) Network model quantification method and device, storage medium and electronic equipment
CN116775807A (en) Natural language processing and model training method, equipment and storage medium
CN112949610A (en) Improved Elman neural network prediction method based on noise reduction algorithm
Pietron et al. Retrain or not retrain?-efficient pruning methods of deep cnn networks
CN110826692B (en) Automatic model compression method, device, equipment and storage medium
CN110263917B (en) Neural network compression method and device
WO2023087303A1 (en) Method and apparatus for classifying nodes of a graph
CN112200314A (en) Method and system for fast training HTM space pool based on microcolumn self recommendation
Sevakula et al. Fuzzy rule reduction using sparse auto-encoders
CN116739049A (en) Network compression method and device and storage medium

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination