CN114781615A - Two-stage quantization implementation method and device based on compressed neural network - Google Patents

Two-stage quantization implementation method and device based on compressed neural network

Info

Publication number
CN114781615A
CN114781615A (application CN202210458582.2A)
Authority
CN
China
Prior art keywords
neural network
layer
stage
clustering
quantization
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210458582.2A
Other languages
Chinese (zh)
Inventor
杨文鑫
支小莉
童维勤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology filed Critical University of Shanghai for Science and Technology
Priority to CN202210458582.2A priority Critical patent/CN114781615A/en
Publication of CN114781615A publication Critical patent/CN114781615A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/082Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention relates to a two-stage quantization implementation method and device based on a compressed neural network. The method comprises the following steps: (1) training the neural network model to convergence on a CPU or GPU according to the target task of the neural network; (2) setting the target sparsity of each layer of the neural network according to the compression ratio required by the target task, and pruning the weights stage by stage; (3) quantizing the weights of each layer of the neural network in two stages according to the quantization bit width preset by the target task, wherein the first stage is clustering, in which the weights of each layer are clustered with a clustering algorithm, and the second stage is scaling, in which the obtained clustering centers are scaled to fixed-point numbers according to the preset quantization bit width, and the final result stores the weight matrix of each layer in triplet form. The method reduces the computational and storage cost required by the model, improves the model's inference speed, and effectively compensates for the accuracy loss of conventional linear quantization at low bit widths.

Description

Two-stage quantization implementation method and device based on compressed neural network
Technical Field
The invention relates to the technical field of neural networks, in particular to a two-stage quantization implementation method and device based on a compressed neural network.
Background
Deep neural networks achieve high discrimination capability in complex applications such as image classification, object detection, speech synthesis and semantic segmentation. However, these models incur significant computational and storage costs, which makes them difficult to deploy on edge devices. For projects that can access powerful computing resources over a network connection, deploying large neural networks may not create a resource strain. However, for edge computing on embedded hardware platforms (e.g., smart sensors, wearable devices, autonomous and unmanned aircraft tracking), security, privacy and latency considerations require that inference be performed locally or at the network edge, and such computation is subject to strict area and power constraints.
To address the computational and storage costs of neural networks, researchers have proposed compressing and quantizing them. Clustering is a common neural network compression technique that is mainly used to compress the neural network model as far as possible: researchers store cluster labels in the weight matrix of the neural network to convert the floating-point weight matrix into fixed-point indices, but the cluster labels cannot be used for computation, and during inference the floating-point clustering centers are still used. The idea of converting floating-point weights into fixed-point weights by quantization was proposed in the 1990s, and a quantized neural network can use fixed-point arithmetic to accelerate inference. Linear quantization is a commonly used quantization method; it does not cause a significant drop in the inference accuracy of a neural network model at an 8-bit quantization bit width, but the network loses its inference capability when quantized below 6 bits with linear quantization. Current edge devices such as FPGAs and ASICs can accelerate inference with low-bit-width weights by customizing the bit widths of adders, multipliers and memories. Therefore, how to compress and quantize a model at the same time, and how to compensate for the accuracy loss of low-bit-width quantization, are the difficult points of current research.
Disclosure of Invention
Aiming at the problems in the prior art, the invention aims to provide a two-stage quantization implementation method and device based on a compressed neural network, so as to improve the inference accuracy of a neural network model under low-bit-width quantization while ensuring compression of the neural network.
In order to achieve the purpose, the technical scheme of the invention is realized as follows:
a two-stage quantization implementation method based on a compressed neural network comprises the following steps:
S1, training the neural network model to convergence on a CPU or GPU according to the target task of the neural network.
S2, setting the target sparsity of each layer of the neural network according to the compression ratio of the neural network required by the target task, and pruning the weights stage by stage.
S3, quantizing the weights of each layer of the neural network in two stages according to the quantization bit width preset by the target task, wherein the first stage is clustering, in which the weights of each layer of the neural network are clustered with a clustering algorithm; and the second stage is scaling, in which the obtained clustering centers are scaled to fixed-point numbers according to the quantization bit width preset by the target task, and the final result stores the weight matrix of each layer in triplet form.
Preferably, the setting of the target sparsity of each layer of the neural network according to the compression ratio of the neural network required by the target task in step S2 specifically includes: according to the compression ratio of the neural network required by the target task, setting the target sparsity of each layer of the neural network except the first layer; the target sparsity of each layer is determined by the type of the layer and the depth at which the layer is located in the network.
Preferably, the pruning of the weights stage by stage in step S2 specifically includes: determining the number M of weights to be pruned in each layer of the network at the current stage according to the target sparsity, the initial sparsity and the preset pruning frequency of each layer of the neural network, wherein M is a positive integer; selecting the M unpruned weights with the smallest magnitude as the weights to be pruned at the current stage; and pruning the weights of each layer of the neural network stage by stage.
Preferably, the quantifying the weight of each layer of the neural network in two stages in the step S3 specifically includes:
the first stage is as follows: determining the number of clustering centers according to the quantization bit width required by a target task, clustering the weight matrix subjected to pruning by using a particle swarm clustering algorithm to obtain clustering centers and clustering labels, and storing the obtained clustering centers in the weight matrix of each layer of the neural network;
and a second stage: quantizing the clustering centers of each layer with a linear quantization algorithm according to the quantization bit width required by the target task, and storing the weight matrix of each layer in triplet form as the final result.
Preferably, the first stage specifically includes: obtaining the number of clustering centers X according to the quantization bit width n preset by the target task, wherein X = 2^n − 1; and, since the number of clustering centers is known, obtaining the initial clustering centers of each layer's weights with a particle swarm algorithm, and then clustering the weights of each layer with the clustering algorithm to obtain the clustering centers and clustering labels.
Preferably, the second stage specifically includes: the input of the linear quantization algorithm is the X clustering centers of each layer of the neural network, and the whole weight matrix does not need to be input; and the weight matrix of each layer is stored in triplet form, storing for each weight its row index, its column index, and the clustering center obtained by clustering the weights.
The invention also provides a two-stage quantization device based on the compressed neural network, which comprises:
a training module: used for selecting the neural network model according to the target task and training it to convergence on a CPU or GPU;
a pruning module: used for setting the target sparsity of each layer of the network, for layers of different types and depths in the network, according to the compression ratio required by the neural network model, and pruning the weights of each layer stage by stage;
a clustering module: used for determining the number of clustering centers according to the quantization bit width required by the target task, clustering the weights of each layer of the pruned neural network model with a particle swarm clustering algorithm to obtain clustering centers and clustering labels, and storing the clustering centers in the weight matrix of each layer of the neural network;
a quantization module: used for quantizing the clustering centers of each layer with a linear quantization algorithm according to the quantization bit width required by the target task, storing the weight matrix of each layer in triplet form as the final result, and thereby quantizing and compressing the neural network.
Compared with the prior art, the invention has the following advantages:
the method and the device effectively reduce the storage cost and the calculation cost of the neural network model through pruning, particle swarm clustering and linear quantization. The invention provides a method for separating the quantization process into two stages, wherein the first stage uses a particle swarm clustering algorithm to cluster the weight of each layer of a neural network, and the second stage uses a linear quantization algorithm to scale the weight of each layer of the neural network to a fixed point number, so that the linear quantization only plays a scaling role, and the precision loss of the traditional linear quantization in the low-bit-width quantization process can be effectively compensated through the two-stage quantization of the weight of the neural network. And the pruned and quantized sparse matrix is compressed and stored in a triple form, so that the memory space occupied by the model can be effectively reduced.
Drawings
Fig. 1 is a flowchart of a two-stage quantization implementation method based on a compressed neural network according to an embodiment of the present invention.
Fig. 2 is a structural diagram of a two-stage quantization implementation apparatus based on a compressed neural network according to an embodiment of the present invention.
Fig. 3 is a flowchart of a clustering algorithm in a two-stage quantization implementation method based on a compressed neural network according to an embodiment of the present invention.
Fig. 4 is a flowchart of the particle swarm algorithm in a two-stage quantization implementation method based on a compressed neural network according to an embodiment of the present invention.
Detailed Description
Technical solutions and features in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the embodiments described below are only a part of the embodiments of the present invention, and not all of the embodiments. Other embodiments, which can be obtained by those skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
As shown in fig. 1, a two-stage quantization implementation method based on a compressed neural network provided by an embodiment of the present invention includes the following steps:
S1, training the neural network model to convergence on a CPU or GPU according to the target task of the neural network.
In this embodiment, the neural network model needs to be trained to convergence on the target task, so as to ensure that the model achieves the performance and accuracy required by the target task before compression and quantization.
S2, setting the target sparsity of each layer of the neural network according to the compression ratio of the neural network required by the target task, and pruning the weights stage by stage.
In this embodiment, the neural network model may include multiple layers of different types. According to the compression ratio of the neural network model required by the target task, a target sparsity is set for each layer except the first layer.
In one possible implementation, the sparsity of network targets of different types is different. For example, the target sparsity of the fully connected layer is 0.8, and the target sparsity of the convolutional layer is 0.6.
In one possible implementation, different depths of convolutional layers and target sparsity of convolutional layers are different. For example, the target sparsity of the first convolutional layer is 0.4, and the target sparsity of the second convolutional layer is 0.5.
In this embodiment, pruning of the weights is divided into a plurality of stages, and the weights selected at each stage are pruned stage by stage.
In one possible implementation, the target sparsity differs for each stage of each layer. S_p is the target sparsity at stage p; S_0 is the initial sparsity, typically 0; S_f is the final target sparsity; and Q is the preset total number of stages. The target sparsity of each layer at each stage is computed from these quantities as follows:
[Formula image BDA0003619625780000041: per-stage target sparsity S_p expressed in terms of S_0, S_f, p and Q; not reproduced here.]
In one possible implementation, each layer may or may not be pruned at a given stage, according to the set pruning frequency.
In one possible implementation, the number of weights to be pruned differs for each stage of each layer of the network. The weights to be pruned at the current stage are obtained from the target sparsity, the initial sparsity, the preset pruning frequency, and the portion of the layer's weights that has not yet been pruned, as sketched below.
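As an illustrative sketch, one pruning stage of a single layer could be implemented as follows. The helper names (stage_target_sparsity, prune_stage) and the linear interpolation of the per-stage target sparsity are assumptions made for illustration; the patent's own schedule formula (not reproduced above) may differ.

```python
import numpy as np

def stage_target_sparsity(s0, sf, p, Q):
    # Per-stage target sparsity, interpolated from the initial sparsity s0 to the
    # final sparsity sf over Q stages. Linear interpolation is assumed here.
    return sf + (s0 - sf) * (1.0 - p / Q)

def prune_stage(weights, mask, target_sparsity):
    # Prune the M smallest-magnitude weights that are still unpruned so that the
    # layer reaches this stage's target sparsity. `mask` is 1 for surviving weights.
    total = weights.size
    already_pruned = total - int(mask.sum())
    M = int(round(target_sparsity * total)) - already_pruned
    if M <= 0:
        return weights * mask, mask
    surviving = np.abs(weights[mask.astype(bool)])
    threshold = np.partition(surviving, M - 1)[M - 1]   # magnitude of the M-th smallest survivor
    new_mask = mask * (np.abs(weights) > threshold)
    return weights * new_mask, new_mask
```

Under the assumed linear schedule, a layer with final sparsity 0.6 pruned in Q = 4 stages would pass through 15%, 30%, 45% and 60% sparsity stage by stage.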
S3, quantizing the weights of each layer of the neural network in two stages according to the quantization bit width preset by the target task, wherein the first stage is clustering, in which the weights of each layer of the neural network are clustered with a clustering algorithm; and the second stage is scaling, in which the obtained clustering centers are scaled to fixed-point numbers according to the quantization bit width preset by the target task, and the final result stores the weight matrix of each layer in triplet form.
In this embodiment, the number of clustering centers is determined according to the quantization bit width required by the target task, and the weights of the pruned neural network model are clustered to obtain the clustering centers and clustering labels.
In one possible implementation, the weights of the neural network are to be quantized to n bits, and the number of clustering centers is 2^n − 1. For example, if the target task requires the neural network to be quantized to 8 bits, the number of clustering centers is 255.
In this embodiment, given the number of clustering centers, the weights of each layer of the neural network are clustered with a particle swarm clustering algorithm, which effectively increases the convergence rate of the clustering algorithm and avoids the situation where a traditional clustering algorithm falls into a local optimum and the final result fails to converge.
In one possible implementation, a particle swarm algorithm is applied to the weights of each layer of the neural network to obtain the initial clustering centers for the clustering algorithm. Here t denotes the current iteration and t+1 the next iteration; X_i and V_i are the position and velocity of particle i, respectively; ω is the inertia factor; c_1 and c_2 are learning factors, typically 2; pb_i is the individual best position of particle i, and gb is the historical (global) best position. The calculation process is as follows:
V_i(t+1) = ω·V_i(t) + c_1·random(0,1)·(pb_i(t) − X_i(t)) + c_2·random(0,1)·(gb − X_i(t))
X_i(t+1) = X_i(t) + V_i(t+1)
In one possible implementation, as shown in Fig. 4, the iteration ends when the number of iterations of the particle swarm algorithm exceeds the maximum number of iterations or the average change of the centroid update is smaller than a preset threshold, and the initial clustering centers of each layer's weights are output.
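In one possible realization, the particle swarm search described above could be sketched as follows; the particle encoding, the fitness function and the default parameters are illustrative assumptions rather than requirements of the method.

```python
import numpy as np

def pso_initial_centers(weights, X, n_particles=20, max_iter=100, tol=1e-6,
                        omega=0.5, c1=2.0, c2=2.0):
    # Particle swarm search for X initial cluster centers of one layer's weights.
    # Each particle's position is a candidate vector of X centers; its fitness is
    # the total distance from every weight to its nearest center (an assumed
    # fitness function, chosen for illustration).
    w = weights.ravel()
    rng = np.random.default_rng(0)
    pos = rng.uniform(w.min(), w.max(), size=(n_particles, X))   # particle positions X_i
    vel = np.zeros_like(pos)                                     # particle velocities V_i

    def fitness(centers):
        return np.abs(w[:, None] - centers[None, :]).min(axis=1).sum()

    pbest = pos.copy()                                           # individual best positions pb_i
    pbest_fit = np.array([fitness(p) for p in pos])
    gbest = pbest[pbest_fit.argmin()].copy()                     # historical best position gb
    prev_gbest = gbest.copy()
    for _ in range(max_iter):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = omega * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = pos + vel
        fit = np.array([fitness(p) for p in pos])
        improved = fit < pbest_fit
        pbest[improved], pbest_fit[improved] = pos[improved], fit[improved]
        gbest = pbest[pbest_fit.argmin()].copy()
        if np.abs(gbest - prev_gbest).mean() < tol:              # average centroid change
            break
        prev_gbest = gbest.copy()
    return gbest
```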
In this embodiment, the quantization is divided into two stages: the first stage is clustering, in which the weights of each layer of the neural network are clustered with a clustering algorithm; the second stage is scaling, in which the obtained clustering centers are scaled to fixed-point numbers according to the quantization bit width preset by the target task. By dividing the quantization into two stages, the accuracy loss of the linear quantization process can be compensated, so that the neural network model still achieves the accuracy required by the target task after being converted to fixed-point numbers.
In this embodiment, the weights of each layer are clustered starting from the obtained initial clustering centers of each layer of the neural network, so as to obtain the final clustering centers and clustering labels of each layer's weights.
In one possible implementation, the initial clustering centers of each layer obtained by the particle swarm algorithm are used as the initial centers of the clustering algorithm, and the Euclidean distance d is used as the distance measure between weights, where (M_1, M_2, ..., M_K) is the set of centroids, D is the dimension of the input data, and c is the current dimension; the data set is divided into K clusters and the centroids are updated iteratively, so that points within each cluster become closer to one another while points in different clusters move further apart. The calculation process is as follows:
d(Z_m, M_j) = sqrt( Σ_{c=1..D} (Z_{m,c} − M_{j,c})² )
M_j = (1 / |R_j|) · Σ_{x ∈ R_j} x
where Z_m denotes the m-th data vector; M_j denotes the j-th centroid vector; R_j denotes the set of samples assigned to the j-th centroid, and x denotes each sample in that set.
In one possible implementation, the iteration process is as shown in Fig. 3: iteration continues until the maximum number of iterations is reached or the centroids no longer change; the clustering centers and clustering labels of each layer's weights are then output, and the clustering centers are stored in the weight matrix of the neural network.
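A minimal sketch of this clustering stage is given below, under the assumption that the layer's weights are clustered one-dimensionally by standard k-means starting from the PSO-found centers; pso_initial_centers refers to the hypothetical helper sketched above.

```python
import numpy as np

def cluster_layer_weights(weights, X, max_iter=300):
    # First quantization stage: cluster one layer's weights into X clusters,
    # starting from the centers found by the PSO sketch above, and return the
    # cluster centers plus a label (center index) for every weight.
    w = weights.ravel()
    centers = pso_initial_centers(weights, X)             # hypothetical helper from the PSO sketch
    labels = np.zeros(w.shape, dtype=np.int64)
    for _ in range(max_iter):
        labels = np.abs(w[:, None] - centers[None, :]).argmin(axis=1)   # assign to nearest center
        new_centers = centers.copy()
        for j in range(X):
            members = w[labels == j]
            if members.size > 0:
                new_centers[j] = members.mean()            # centroid update M_j
        if np.allclose(new_centers, centers):              # stop when centroids no longer move
            break
        centers = new_centers
    return centers, labels.reshape(weights.shape)
```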
In this embodiment, the clustering centers of each layer are quantized with a linear quantization algorithm according to the number of bits required for quantization by the target task.
In one possible implementation, the clustering centers are quantized to fixed-point numbers using the maximum value and minimum value of the clustering centers stored in each layer's weight matrix, together with the fixed-point scaling range determined by the target task. The input of the linear quantization algorithm is only the X clustering centers of each layer of the neural network, and the whole weight matrix does not need to be input, which effectively improves quantization efficiency. For example, if the target task quantization bit width is 8 bits, the fixed-point scaling range is −127 to 127, and the calculation process is as follows:
S = (b − a) / (2^n − 1)
Q = round(R / S + Z)
where R denotes the floating-point matrix (here the clustering centers); a and b denote the minimum and maximum values of the current matrix; n is the quantization bit width of the target task; S denotes the scaling coefficient; Z denotes a constant (zero point) relating the values before and after quantization; and Q is the resulting fixed-point matrix.
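A sketch of this second (scaling) stage is given below, assuming standard min-max linear quantization of the X cluster centers to a symmetric fixed-point range; the exact conventions for S and Z are a reconstruction rather than a quotation from the patent.

```python
import numpy as np

def quantize_centers(centers, n_bits=8):
    # Second quantization stage: scale the X floating-point cluster centers of a
    # layer to fixed-point integers. Standard min-max linear quantization with a
    # symmetric range (e.g. -127..127 for 8 bits) is assumed.
    a, b = float(centers.min()), float(centers.max())
    q_max = 2 ** (n_bits - 1) - 1                        # 127 for 8 bits
    q_min = -q_max
    S = (b - a) / (2 ** n_bits - 1) if b > a else 1.0    # scaling coefficient S
    Z = round(q_min - a / S)                             # zero-point constant Z
    Q = np.clip(np.round(centers / S) + Z, q_min, q_max).astype(np.int32)
    return Q, S, Z
```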
In this embodiment, the weight matrix of the pruned neural network is a sparse matrix; the quantized clustering centers are stored in the weight matrix of each layer, and the sparse weight matrix is finally stored in triplet form, which improves the compression rate of the neural network model.
In one possible implementation, the weight matrix of each layer of the neural network is stored in triplet form: for each weight, its row index, its column index, and the clustering center obtained by clustering the weights are stored.
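The triplet storage of the final sparse, quantized weight matrix could look like the following sketch; storing (row, column, quantized center) for every surviving weight follows the description above, while the array layout and function names are illustrative assumptions.

```python
import numpy as np

def to_triplets(weight_matrix, labels, quantized_centers):
    # Store a pruned 2-D weight matrix in triplet form: for every non-zero weight
    # keep its row index, its column index, and the quantized cluster center it
    # was assigned to.
    rows, cols = np.nonzero(weight_matrix)
    values = quantized_centers[labels[rows, cols]]       # fixed-point center per surviving weight
    return np.stack([rows, cols, values], axis=1)        # shape (nnz, 3)

def from_triplets(triplets, shape, scale, zero_point):
    # Reconstruct a dense (dequantized) weight matrix from the stored triplets,
    # using R ~ S * (Q - Z) as the inverse of the assumed quantization above.
    dense = np.zeros(shape, dtype=np.float32)
    rows, cols, values = triplets[:, 0], triplets[:, 1], triplets[:, 2]
    dense[rows, cols] = (values - zero_point) * scale
    return dense
```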
As shown in fig. 2, an embodiment of the present invention further provides a two-stage quantization apparatus based on a compressed neural network, including:
the training module 21: and the neural network model is trained to be in a convergence state by using the CPU or the GPU according to the target task.
Pruning module 22: and the target sparsity of each layer is respectively set aiming at layers of different types and depths in the network according to the compression ratio required by the neural network model, and the weight of each layer is pruned stage by stage.
The clustering module 23: the method is used for determining the number of clustering centers according to the quantization bit width required by a target task, clustering the weight of the pruned neural network model to obtain the clustering centers and clustering labels, and storing the clustering centers in the weight matrix of each layer of the neural network.
The quantization module 24: and quantizing the clustering center of each layer by using a linear quantization algorithm according to the quantization bit width required by the target task, and storing the weight matrix of each layer in a triple form as a final result.
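Putting the four modules together, a device-level pipeline could be composed as in the following sketch; the class and method names are illustrative, and the per-layer helpers (stage_target_sparsity, prune_stage, cluster_layer_weights, quantize_centers, to_triplets) refer to the hypothetical functions sketched earlier.

```python
import numpy as np

class TwoStageQuantizer:
    # Illustrative composition of the training (21), pruning (22), clustering (23)
    # and quantization (24) modules into one compression pipeline.

    def __init__(self, n_bits, final_sparsity_per_layer, n_stages):
        self.n_bits = n_bits
        self.X = 2 ** n_bits - 1                          # number of cluster centers
        self.final_sparsity = final_sparsity_per_layer    # dict: layer name -> final sparsity
        self.n_stages = n_stages

    def compress(self, model_weights):
        # model_weights: dict mapping layer name -> 2-D numpy weight matrix of an
        # already-trained (converged) model. Returns per-layer triplet storage
        # together with the shape, scale and zero point needed to rebuild it.
        compressed = {}
        for name, W in model_weights.items():
            mask = np.ones_like(W)
            for p in range(1, self.n_stages + 1):                  # pruning module: stage by stage
                s_p = stage_target_sparsity(0.0, self.final_sparsity[name], p, self.n_stages)
                W, mask = prune_stage(W, mask, s_p)
            centers, labels = cluster_layer_weights(W, self.X)     # clustering module
            Q, S, Z = quantize_centers(centers, self.n_bits)       # quantization module
            compressed[name] = (to_triplets(W, labels, Q), W.shape, S, Z)
        return compressed
```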
The invention provides a two-stage quantization implementation method and device based on a compressed neural network, which effectively reduce the computational cost and storage cost of the neural network. Aiming at the problems that the cluster labels stored in the weight matrix by traditional clustering methods only serve to compress the network and cannot be used for actual computation, and that low-bit-width linear quantization loses inference capability, the invention provides a two-stage quantization method combining clustering and linear quantization.
The first stage clusters the weights of each layer of the neural network with a particle swarm clustering algorithm, and the second stage scales the weights of each layer to fixed-point numbers with a linear quantization algorithm, so that the linear quantization only plays a scaling role. By dividing the quantization into two stages, the accuracy loss of the linear quantization process can be effectively compensated, so that the neural network model still achieves the accuracy required by the target task after being converted to fixed-point numbers.
The method applies multi-stage unstructured pruning to the redundant neural network weights, and the resulting sparse matrix is compressed and stored in triplet form, thereby reducing the computational cost and storage cost of the neural network model at the same time.
Based on the two-stage quantization implementation method and device for a compressed neural network, this embodiment trains the neural network model to convergence on a CPU or GPU according to the target task of the neural network; sets the target sparsity of each layer of the neural network according to the compression ratio required by the target task and prunes the weights stage by stage; and quantizes the weights of each layer of the neural network in two stages according to the quantization bit width preset by the target task, wherein the first stage is clustering, in which the weights of each layer are clustered with a clustering algorithm, and the second stage is scaling, in which the obtained clustering centers are scaled to fixed-point numbers according to the preset quantization bit width and the final result stores the weight matrix of each layer in triplet form. The embodiment of the invention can reduce the computational cost and storage cost required by the model, improve its inference speed, and effectively compensate for the accuracy loss of conventional linear quantization at low bit widths.
Finally, it should be noted that: the foregoing is merely a preferred embodiment of the present invention and is not exhaustive; the technical solutions of the present invention are described for illustration only, and the protection scope of the present invention is not limited thereto. The terminology used herein was chosen to best explain the principles of the various embodiments. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (7)

1. A two-stage quantization implementation method based on a compressed neural network is characterized by comprising the following steps:
S1, training the neural network model to convergence on a CPU or GPU according to the target task of the neural network;
S2, setting the target sparsity of each layer of the neural network according to the compression ratio of the neural network required by the target task, and pruning the weights stage by stage;
S3, quantizing the weights of each layer of the neural network in two stages according to the quantization bit width preset by the target task; the first stage is clustering, in which the weights of each layer of the neural network are clustered with a clustering algorithm; and the second stage is scaling, in which the obtained clustering centers are scaled to fixed-point numbers according to the quantization bit width preset by the target task, and the final result stores the weight matrix of each layer in triplet form.
2. The method according to claim 1, wherein setting the target sparsity of each layer of the neural network according to the compression ratio of the neural network required by the target task in step S2 specifically includes: according to the compression ratio of the neural network required by the target task, setting the target sparsity of each layer of the neural network except the first layer; the target sparsity of each layer is determined by the type of the layer and the depth at which the layer is located in the network.
3. The method for implementing two-stage quantization based on a compressed neural network of claim 1, wherein pruning the weights stage by stage in step S2 specifically includes: determining the number M of weights to be pruned in each layer of the network at the current stage according to the target sparsity, the initial sparsity and the preset pruning frequency of each layer of the neural network, wherein M is a positive integer; selecting the M unpruned weights with the smallest magnitude as the weights to be pruned at the current stage; and pruning the weights of each layer of the neural network stage by stage.
4. The method according to claim 1, wherein the step S3 of quantizing the weights of each layer of the neural network in two stages specifically includes:
the first stage is as follows: determining the number of clustering centers according to the quantization bit width required by a target task, clustering the weight matrix subjected to pruning by using a particle swarm clustering algorithm to obtain clustering centers and clustering labels, and storing the obtained clustering centers in the weight matrix of each layer of the neural network;
and a second stage: quantizing the clustering centers of each layer with a linear quantization algorithm according to the quantization bit width required by the target task, and storing the weight matrix of each layer in triplet form as the final result.
5. The compressed neural network-based two-stage quantization implementation method of claim 4, wherein the first stage specifically comprises: obtaining the number of clustering centers X according to the quantization bit width n preset by the target task, wherein X = 2^n − 1; and, since the number of clustering centers is known, obtaining the initial clustering centers of each layer's weights with a particle swarm algorithm, and then clustering the weights of each layer with the clustering algorithm to obtain the clustering centers and clustering labels.
6. The method of claim 4, wherein the second stage comprises: the input of the linear quantization algorithm is the X clustering centers of each layer of the neural network, and the whole weight matrix does not need to be input; and the weight matrix of each layer is stored in triplet form, storing for each weight its row index, its column index, and the clustering center obtained by clustering the weights.
7. A two-stage quantization apparatus based on a compressed neural network, comprising:
a training module: used for selecting the neural network model according to the target task and training it to convergence on a CPU or GPU;
a pruning module: used for setting the target sparsity of each layer of the network, for layers of different types and depths in the network, according to the compression ratio required by the neural network model, and pruning the weights of each layer stage by stage;
a clustering module: used for determining the number of clustering centers according to the quantization bit width required by the target task, clustering the weights of each layer of the pruned neural network model with a particle swarm clustering algorithm to obtain clustering centers and clustering labels, and storing the clustering centers in the weight matrix of each layer of the neural network;
a quantization module: used for quantizing the clustering centers of each layer with a linear quantization algorithm according to the quantization bit width required by the target task, storing the weight matrix of each layer in triplet form as the final result, and thereby quantizing and compressing the neural network.
CN202210458582.2A 2022-04-24 2022-04-24 Two-stage quantization implementation method and device based on compressed neural network Pending CN114781615A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210458582.2A CN114781615A (en) 2022-04-24 2022-04-24 Two-stage quantization implementation method and device based on compressed neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210458582.2A CN114781615A (en) 2022-04-24 2022-04-24 Two-stage quantization implementation method and device based on compressed neural network

Publications (1)

Publication Number Publication Date
CN114781615A true CN114781615A (en) 2022-07-22

Family

ID=82433128

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210458582.2A Pending CN114781615A (en) 2022-04-24 2022-04-24 Two-stage quantization implementation method and device based on compressed neural network

Country Status (1)

Country Link
CN (1) CN114781615A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115357554A (en) * 2022-10-24 2022-11-18 浪潮电子信息产业股份有限公司 Graph neural network compression method and device, electronic equipment and storage medium



Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination