CN116432715A - Model compression method, device and readable storage medium - Google Patents

Model compression method, device and readable storage medium

Publication number: CN116432715A
Authority: CN (China)
Prior art keywords: model, convolution layer, target, convolution, layer
Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number: CN202310704493.6A
Other languages: Chinese (zh)
Other versions: CN116432715B (en)
Inventors: 谢旭, 艾国, 杨作兴
Current Assignee: Shenzhen MicroBT Electronics Technology Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original Assignee: Shenzhen MicroBT Electronics Technology Co Ltd
Application filed by Shenzhen MicroBT Electronics Technology Co Ltd; priority to CN202310704493.6A; application granted and published as CN116432715B
Current legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0495 Quantised networks; Sparse networks; Compressed networks
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The embodiment of the invention provides a model compression method, a device and a readable storage medium. The method comprises the following steps: acquiring a first model, wherein the first model is a floating point model obtained by training by using a training set; scaling the weight of each convolution layer in the first model to obtain a second model; quantizing the second model based on a preset quantization mode to obtain a third model; the third model is a fixed-point model; determining a target convolution layer in the third model, and correcting the shift times of the target convolution layer to obtain a target model; the target convolution layer refers to a convolution layer of which the shift times in the third model do not meet a preset range; the shift number of each convolution layer in the target model satisfies the predetermined range. The embodiment of the invention can save the hardware resources required by the fixed-point model obtained after quantization when the convolution calculation is carried out.

Description

Model compression method, device and readable storage medium
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to a method and apparatus for compressing a model, and a readable storage medium.
Background
With the development of deep learning technology, the deep neural network model is widely applied to various application scenarios, such as image processing, speech recognition, reasoning/prediction, knowledge expression, operation control, and the like.
In order to improve the performance of the deep neural network model, the parameter number and the calculation amount of the model are also increased sharply, so that great challenges are brought to the training and deployment of the model. Particularly in the aspect of model deployment, the deep neural network model is difficult to deploy on hardware equipment (such as mobile equipment) with limited resources due to huge parameter and calculation amount of the deep neural network model.
In order to reduce the consumption of the deep neural network model on hardware, the deep neural network model can be deployed on hardware equipment with limited resources, and floating point weights can be approximated to low-bit integers through a model quantization technology, so that the calculation process is completed under the low-bit representation. However, the shift operation of the fixed-point model obtained after quantization when performing convolution calculation still requires a large amount of hardware resources.
Disclosure of Invention
The embodiment of the invention provides a model compression method, a device and a readable storage medium, which can save hardware resources required by a fixed point model obtained after quantization when convolution calculation is performed.
In a first aspect, an embodiment of the present invention discloses a method for compressing a model, the method including:
acquiring a first model, wherein the first model is a floating point model obtained by training by using a training set;
scaling the weight of each convolution layer in the first model to obtain a second model;
quantizing the second model based on a preset quantization mode to obtain a third model; the third model is a fixed-point model;
determining a target convolution layer in the third model, and correcting the shift times of the target convolution layer to obtain a target model; the target convolution layer refers to a convolution layer of which the shift times in the third model do not meet a preset range; the shift number of each convolution layer in the target model satisfies the predetermined range.
In a second aspect, an embodiment of the present invention discloses a model compression apparatus, the apparatus including:
the model training module is used for acquiring a first model, wherein the first model is a floating point model obtained by training by using a training set;
the weight calibration module is used for scaling the weight of each convolution layer in the first model to obtain a second model;
the model quantization module is used for quantizing the second model based on a preset quantization mode to obtain a third model; the third model is a fixed-point model;
The numerical correction module is used for determining a target convolution layer in the third model and correcting the shift times of the target convolution layer to obtain a target model; the target convolution layer refers to a convolution layer of which the shift times in the third model do not meet a preset range; the shift number of each convolution layer in the target model satisfies the predetermined range.
In a third aspect, embodiments of the present invention disclose a machine-readable medium having instructions stored thereon, which when executed by one or more processors of an apparatus, cause the apparatus to perform a model compression method as described in one or more of the preceding.
The embodiment of the invention has the following advantages:
before the floating point model is quantized, the embodiment of the invention calibrates the weight range of each convolution layer of the floating point model (first model) so that the weight of each convolution layer is distributed in a relatively gentle interval, thereby avoiding the condition that the weight range of certain convolution layers fluctuates greatly, reducing the value range of shift times when the fixed point model obtained after quantization carries out convolution calculation, and further reducing hardware resources required by the fixed point model for carrying out convolution calculation. In addition, in the embodiment of the invention, the range of the weight of each convolution layer of the floating point model (the first model) is calibrated to obtain the second model, the second model is quantized to obtain the third model, and then the shift times of the target convolution layer in the third model are further corrected to ensure that the convolution calculation of the target convolution layer can be correctly executed. The shift times of each convolution layer of the finally obtained target model meet a preset range, and the preset range is the minimum value range of the shift times of each convolution layer of the fixed-point model for convolution calculation under the premise of ensuring the accuracy of the fixed-point model. Therefore, the embodiment of the invention can reduce redundancy of the fixed-point model as much as possible on the premise of ensuring the precision of the fixed-point model, thereby reducing hardware resources required by convolution calculation of the fixed-point model.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments of the present invention will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of steps of an embodiment of a model compression method in accordance with an embodiment of the present invention;
FIG. 2 is a schematic diagram of a network structure of a first model in an example of an embodiment of the invention;
fig. 3 is a schematic structural view of an embodiment of a model compressing apparatus according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The terms "first", "second" and the like in the description and in the claims are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate, so that the embodiments of the present invention may be implemented in sequences other than those illustrated or described herein. The objects identified by "first", "second", etc. are generally of one type, and the number of objects is not limited; for example, the first object may be one or more. Furthermore, the term "and/or" as used in the specification and claims to describe an association between associated objects means that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist together, or B exists alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. The term "plurality" in the embodiments of the present invention means two or more, and other quantifiers are similar.
Referring to fig. 1, a flowchart illustrating steps of an embodiment of a model compression method of the present invention may include the steps of:
step 101, acquiring a first model, wherein the first model is a floating point model obtained by training by using a training set;
102, scaling the weight of each convolution layer in the first model to obtain a second model;
step 103, quantizing the second model based on a preset quantization mode to obtain a third model; the third model is a fixed-point model;
104, determining a target convolution layer in the third model, and correcting the shift times of the target convolution layer to obtain a target model; the target convolution layer refers to a convolution layer of which the shift times in the third model do not meet a preset range; the shift number of each convolution layer in the target model satisfies the predetermined range.
The method provided by the invention can be applied to the field of artificial intelligence, such as intelligent manufacturing, intelligent transportation, smart home, intelligent medical treatment, intelligent security, automatic driving, safe city, and the like. Specifically, the method provided by the invention can be applied to fields that need to use (deep) neural networks, such as automatic driving, image classification, image segmentation, target detection, image retrieval, image semantic segmentation, image quality enhancement, image super-resolution, and natural language processing.
For example, using the method of the invention, a target model can be obtained that can detect targets such as pedestrians, vehicles, traffic signs, or lane lines; as another example, using the method of the present invention, a target model can be obtained that can identify targets such as faces, vehicles, or objects by analyzing an input image; and so on. The model mentioned in the embodiments of the present invention refers to a neural network model. The embodiments of the present invention do not limit the network structure of the neural network model. For example, the network structure of the neural network model may include a backbone network (Backbone), a neck (Neck), and an output head (Head). The backbone network (Backbone) is used to extract features of an input image to obtain multi-level (multi-scale) features of the image. The neck (Neck) is used to screen and fuse the multi-scale features to generate a more compact and expressive feature vector. The output head (Head) is used to transform the features into the prediction results that ultimately meet the needs of the task. For example, the prediction result finally output in an image classification task is a probability vector over the categories to which the input image may belong; the prediction result in a target detection task is the coordinates of all candidate target frames existing in the input image and the probabilities that the candidate target frames belong to the various categories; and the prediction module in an image segmentation task needs to output a category classification probability map at the image pixel level.
The network structure of the backbone network may include a plurality of stages (stages), each stage may include at least one block (block), the number of blocks in different stages may be different, and the super parameters (e.g., expansion coefficient, convolution kernel size, etc.) in each block may also be different. Wherein, the blocks may be composed of basic atoms in the convolutional neural network, including convolutional layers, pooling layers, fully-connected layers, or nonlinear activation layers, etc. A block may also be referred to as a base unit or base module.
Model quantization refers to converting a floating point model into a fixed-point model. The input of the floating point model and the weights of the model are floating point data. The input of the fixed-point model and the weights of the model are fixed-point data. For example, a float32 floating point model is quantized into an int8 fixed-point model.
The basic formula for quantization is as follows:
Q = round(R / S) + Z (1)
wherein Q represents the fixed point value after quantization, R represents the real floating point value, S represents the minimum scale that quantization can represent, namely the proportional relation between the floating point value and the fixed point value, and Z represents the fixed point value corresponding to the quantized floating point value 0. The floating point value refers to a specific value of the floating point number, and the fixed point value refers to a specific value of the fixed point number. S and Z are calculated as follows:
S = (R_max − R_min) / (Q_max − Q_min) (2)
wherein R_max represents the maximum floating point value, R_min represents the minimum floating point value, Q_max represents the maximum fixed point value, and Q_min represents the minimum fixed point value.
Z = Q_max − round(R_max / S) (3)
round is a rounding function; rounding up or rounding down can be chosen.
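As a concrete illustration of formulas (1)-(3), the following sketch computes the scale S and zero point Z for an int8 range and applies the quantization and dequantization mappings; the function and variable names are illustrative and are not taken from the patent.

```python
import numpy as np

def quant_params(r_min, r_max, q_min=-128, q_max=127):
    """Scale S and zero point Z mapping floats in [r_min, r_max] to int8 values."""
    s = (r_max - r_min) / (q_max - q_min)        # formula (2)
    z = int(round(q_max - r_max / s))            # formula (3)
    return s, z

def quantize(r, s, z, q_min=-128, q_max=127):
    """Formula (1): Q = round(R / S) + Z, clipped to the fixed-point range."""
    q = np.round(r / s) + z
    return np.clip(q, q_min, q_max).astype(np.int8)

def dequantize(q, s, z):
    """Inverse mapping used in formulas (5)-(7): R = S * (Q - Z)."""
    return s * (q.astype(np.float32) - z)

w = np.array([-0.8, 0.05, 0.6], dtype=np.float32)
s, z = quant_params(float(w.min()), float(w.max()))
print(quantize(w, s, z))                          # e.g. [-128   27  127]
print(dequantize(quantize(w, s, z), s, z))        # approximately recovers w
```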
Quantization of the model refers to quantization of the weights of the convolution kernels of each convolution layer in the model. The convolution calculation performed by a convolution layer is expressed as follows:
R_o = R_i * R_w + b (4)
wherein i, w and o represent input, weight and output, respectively; R is a floating point number; b is the original floating point offset (bias). According to formula (1) above:
R_i = S_i (Q_i − Z_i) (5)
R_w = S_w (Q_w − Z_w) (6)
R_o = S_o (Q_o − Z_o) (7)
In the above formulas (5)-(7), Q_i, Q_w and Q_o respectively represent the quantized fixed point numbers; S_i, S_w and S_o respectively represent the proportional relations between the floating point numbers and the fixed point numbers; and Z_i, Z_w and Z_o respectively represent the fixed point numbers corresponding to the quantized value 0 of the floating point numbers. From the above formula (7), it can be obtained:
Q_o = R_o / S_o + Z_o (8)
wherein Q_o represents the output result of the convolution calculation, which is the corresponding fixed point number after quantization.
Substituting the above formulas (4), (5) and (6) into the above formula (8), the quantization formula for the convolution calculation can be obtained as follows:
Q_o = (R_i * R_w + b) / S_o + Z_o
    = (S_i (Q_i − Z_i) * S_w (Q_w − Z_w) + b) / S_o + Z_o
    = (S_i S_w / S_o) * ((Q_i − Z_i)(Q_w − Z_w) + b / (S_i S_w)) + Z_o
    = A * (X + B) + Z_o (9)
In the above formula (9), A = S_i S_w / S_o, X = (Q_i − Z_i)(Q_w − Z_w), and B = b / (S_i S_w), where A is a floating point number and (X + B) is a fixed point number, so the convolution calculation is converted into a floating point number multiplied by a fixed point number. Because fixed-point reasoning of the fixed-point model requires that each step of calculation is a calculation among fixed point numbers, and any floating point number can be expressed in the form of a number multiplied by a power of 2, A can be further converted as follows:
A = A_f × 2^(−n) (10)
In the above formula (10), A is a floating point number, such as a floating point number of float or double type, and A_f is its factor normalized to the range [0.5, 1). For any floating point number A, it can then be approximated by a fixed point number (i.e., an integer) A_int:
A_f ∈ [0.5, 1) (11)
A_int = round(A_f × 2^k) (12)
A ≈ A_int × 2^(−(n + k)) (13)
wherein k is the fixed number of fractional bits used to represent A_f as an integer. Thus, the convolution calculation can be converted into a multiplication operation between fixed point numbers:
R_int = A × (X + B) (14)
In the above formula (14), R_int is a fixed point number, A is a floating point number, and (X + B) is a fixed point number. Substituting the above formula (13) into the above formula (14) gives:
R_int = (A_int × (X + B)) >> (n + k) (15)
wherein << denotes a left shift operation, the number of left shifts indicating the number of times the value is multiplied by 2, and >> denotes a right shift operation, the number of right shifts indicating the number of times the value is divided by 2; n represents the shift number of the convolution calculation of one convolution layer, and R_int represents the result of the convolution calculation. After the floating point model is quantized into the fixed-point model, the weight of each convolution layer in the fixed-point model is a fixed point number. For example, the second model of float32 is quantized to obtain a third model of int8. The convolution layers in the third model can perform convolution calculation by the above expression (15) to obtain the result of the convolution calculation.
In the above formula (15), the shift number n of the convolution calculation has 32 possible values (0 to 31), so that 32 sets of hardware-related logic are required to realize shift calculations of up to 32 different amounts, which still consumes considerable hardware resources. Here, hardware-related logic refers to the combinational logic and wiring resources of the hardware. In addition, there is some redundancy in a neural network model. The invention aims to reduce the redundancy of the neural network model as far as possible, on the premise of ensuring model accuracy, by reducing the value range of the shift number used in the convolution calculation of the fixed-point model, thereby reducing the hardware resources required for the convolution calculation of the fixed-point model. In one example, on the premise of ensuring model accuracy, the embodiment of the invention reduces the original value range of the shift number n of the convolution calculation to 6 to 25, so that only 20 sets of hardware-related logic are needed to realize at most 20 different shift amounts, and 12 sets of hardware-related logic can be saved.
The above formula (15) represents the result of the convolution calculation output by a certain convolution layer in the quantized model (fixed-point model). Taking int8 quantization as an example, the input, weight and output of the convolution layer are all int8 fixed point numbers. In implementations, to ensure data accuracy, the int8 fixed point numbers may be accumulated and stored using int32 bits; the above formula (15) represents the use of int32 bits to store the result of the convolution calculation. Of course, according to actual needs, the int8 fixed point numbers can also be stored using int16 bits, in which case the above formula (15) is modified accordingly:
[Formula (15)': the int16 variant of formula (15), in which the result of the convolution calculation is stored in int16 instead of int32 bits.]
The accuracy of the stored data may then be reduced, and the storage width may be selected as needed in practical applications.
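As an informal sketch of how formula (15) can be evaluated in integer arithmetic, the fragment below splits the floating point multiplier A into an integer multiplier and a shift, following the scheme described above; the choice of frac_bits (playing the role of k) and all names are assumptions made for illustration, not values fixed by the patent.

```python
import math
import numpy as np

def split_multiplier(a, frac_bits=15):
    """Express floating point a as a_int * 2**(-(n + frac_bits)) with a_f in [0.5, 1)."""
    a_f, exp = math.frexp(a)                     # a = a_f * 2**exp, 0.5 <= a_f < 1
    n = -exp                                     # formula (10): a = a_f * 2**(-n)
    a_int = int(round(a_f * (1 << frac_bits)))   # formula (12)
    return a_int, n

def requantize(acc, a, z_o, frac_bits=15):
    """acc is the integer accumulator (X + B); returns the int8 output Q_o per formula (9)."""
    a_int, n = split_multiplier(a, frac_bits)
    r_int = (acc * a_int) >> (n + frac_bits)     # formula (15): multiply, then right shift
    return int(np.clip(r_int + z_o, -128, 127))

# Example: accumulator 12345 scaled by A = 0.0123 with output zero point 3
print(requantize(12345, 0.0123, 3))
```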
In order to achieve the purpose of reducing hardware resources required by fixed-point model convolution computation, the model compression method provided by the embodiment of the invention can be divided into the following two stages: a model training phase and a parameter calibration phase.
The model training stage includes step 101 of obtaining a first model, where the first model is a floating point model obtained by training with a training set.
The training set is determined according to the target task. The target model finally obtained by the embodiment of the invention can be deployed on target equipment and can be used for executing target tasks. The target task is not limited by the embodiment of the invention, and may be an image classification task, a target detection task, an image segmentation task, or the like. Different training sets may be collected depending on the target task. Taking the target task as face detection as an example, the training set may include a plurality of collected pictures including face images. The training set can be used for training to obtain a neural network model for face detection, namely a first model.
It should be noted that, the network structure of the first model is not limited in the embodiment of the present invention.
Illustratively, the target task is face detection, and the embodiment of the invention adopts the network structure of CenterNet as the first model. Referring to fig. 2, a network architecture diagram of the first model in one example of the invention is shown. As shown in fig. 2, the network structure of the first model includes a backbone network (Backbone), a neck (Neck), and an output head (Head). Illustratively, the Backbone employs a lightweight ShuffleNet. The Neck employs an FPN (Feature Pyramid Network). The Head consists of three prediction heads, a Heatmap Head, an Offset Head, and a Size Head, which are used for outputting a thermodynamic diagram (Heat Map), a target position offset (Offset), and a target width and height (Height & Width), respectively. The position and size of the face are predicted according to the information output by the Head.
After training the network structure shown in fig. 2 using the collected training set, a floating point model (first model) after training is obtained.
The embodiment of the invention adopts CenterNet as a one-stage anchor-free detection model; compared with common models such as YOLOv3, no complex pre-selection boxes need to be set and no complex post-processing is required, which can reduce the pressure on the CPU (Central Processing Unit). In addition, CenterNet has no special structure, is easy to deploy, and enables true end-to-end inference.
In particular implementations, the number of shifts by which a convolution layer performs its convolution calculation (i.e., the n value of the convolution layer) is determined by the range of the weights of that convolution layer, where the n value refers to the n value in formula (15) above. If the range of the weights of a certain convolution layer fluctuates widely, the n value of that convolution layer may occur near the maximum or minimum value. For example, a given neural network model may be used for different target tasks. For a certain convolution layer in the neural network model (e.g., referred to as convolution layer 1), when the neural network model is used for a face detection task, the number of shifts by which convolution layer 1 performs its convolution calculation is 3 (i.e., the n value of convolution layer 1 is 3) under the influence of the weight range of convolution layer 1. When the neural network model is used for a vehicle detection task, the number of shifts by which convolution layer 1 performs its convolution calculation is 31 (i.e., the n value of convolution layer 1 is 31) under the influence of the weight range of convolution layer 1. The fluctuation of the weight range of convolution layer 1 causes its shift number to occur both near the maximum value 31 (n value of 31) and near the minimum value 0 (n value of 3), so the selection range of the shift number of convolution layer 1 is 3 to 31, and 29 sets of hardware-related logic are still needed; the predetermined range expected by the embodiment of the invention is 6 to 25, requiring only 20 sets of hardware-related logic. In this case, the range of the weights of convolution layer 1 fluctuates widely; by scaling the weights of convolution layer 1, the range of the scaled weights causes the n value of convolution layer 1 to fall within the predetermined range of 6 to 25.
The invention aims to enable the n value of each convolution layer of the target model to be in a fixed range, namely, the shift times of convolution calculation of each convolution layer of the target model meet a preset range.
In one example, a float32 floating point model is quantized into an int8 fixed-point model, and the predetermined range can be set to 6 to 25. Compared with the original range of 0 to 31, 12 sets of hardware-related logic can be saved without reducing model accuracy, thereby saving the hardware resources required by the fixed-point model obtained after quantization when performing convolution calculation.
After a floating point model (a first model) is obtained through training, the embodiment of the invention enters a parameter calibration stage, and the shift times of convolution calculation of each convolution layer of a target model are fixed within a preset range through the parameter calibration stage. The parameter calibration phase includes steps 102 to 104.
Specifically, after training to obtain the first model, the weight of each convolution layer in the first model is scaled to obtain the second model. Scaling the weight of each convolution layer in the first model aims to calibrate the distribution range of the weight of each convolution layer, so that the distribution range of the weight of each convolution layer in the second model obtained after calibration lies in a relatively gentle interval, and the condition that the weight range of certain convolution layers fluctuates widely can be avoided. Therefore, after the second model is quantized to obtain the third model, the shift numbers of the convolution calculations performed by the convolution layers in the third model are distributed in a relatively central region (the predetermined range in the embodiment of the invention), so that the hardware resources required for the convolution calculation can be reduced.
Further, the predetermined range may be determined according to the accuracy of quantization. For example, when the quantization precision is int8, the predetermined range may be 6 to 25. It will be appreciated that the predetermined range 6 to 25 is only an example of the present invention, and in a specific implementation, the predetermined range may be set according to actual requirements. For example, the predetermined range may be obtained from experimental data of a plurality of different target tasks, under which the plurality of different target tasks may each ensure that model accuracy is not degraded. For example, assuming that the predetermined range 6 to 25 is obtained by statistics of experimental data of 5 target tasks, the predetermined range can be narrowed by reducing the number of target tasks. For example, only one target task of vehicle detection is selected, and the predetermined range obtained by statistics is smaller than 6-25. Of course, if more target tasks are selected, the statistically derived predetermined range will be correspondingly expanded. It will be appreciated that in case the accuracy of quantization is either int4 or int16, a corresponding predetermined range may be set, respectively. In practical application, a corresponding preset range can be set according to different precision requirements of a practical scene. For example, in a scene where the accuracy requirement is higher, the predetermined range in the above example may be enlarged; in a scene where the accuracy requirement is lower, the predetermined range in the above example can be further narrowed.
Before the floating point model is quantized, the embodiment of the invention calibrates the weight range of each convolution layer of the floating point model, so that the weight of each convolution layer is distributed in a relatively gentle interval, and the value range of shift times when each convolution layer of the fixed point model obtained after quantization carries out convolution calculation can be reduced.
In an alternative embodiment of the present invention, the scaling the weight of each convolution layer in the first model may include: for a current convolutional layer in the first model, scaling the weight of the current convolutional layer based on the weight of a preceding convolutional layer of the current convolutional layer and the weight of the current convolutional layer.
In the neural network model, for two successive convolutional layers, the calculation of the latter convolutional layer is based on the output of the former convolutional layer. Therefore, when scaling the weight of the current convolution layer, the embodiment of the invention refers to the weight of the current convolution layer and the weight of the previous convolution layer at the same time. That is, for two consecutive convolutional layers, the scaling of the weights of the previous convolutional layer affects the latter convolutional layer. For example, for two consecutive convolution layers in the first model, if the weight of the previous convolution layer is larger (e.g., exceeds a preset value), the weight of the previous convolution layer may be reduced, and the weight of the subsequent convolution layer may be increased, so that the weight distribution ranges of the two consecutive convolution layers are more balanced.
In an alternative embodiment of the present invention, the scaling the weight of the current convolutional layer based on the weight of the previous convolutional layer and the weight of the current convolutional layer may include:
step S11, calculating an optimal scaling coefficient of each convolution kernel in the current convolution layer; the optimal scaling coefficient of the ith convolution kernel in the current convolution layer is obtained by calculation according to the weight of the ith convolution kernel in the previous convolution layer of the current convolution layer and the weight of the ith convolution kernel in the current convolution layer; i is 1-m, and m is the number of convolution kernels in the current convolution layer;
and step S12, scaling the weights of the m convolution kernels in the current convolution layer by utilizing the optimal scaling coefficients respectively corresponding to the m convolution kernels in the current convolution layer.
A convolution layer has scaling equivalence. That is, for a convolution layer, multiplying the weight of a convolution kernel in the convolution layer by a scaling coefficient is equivalent to multiplying the corresponding output of the convolution layer by the same scaling coefficient. The embodiment of the invention uses this scaling equivalence of convolution layers to calculate an optimal scaling coefficient for each convolution kernel of each convolution layer in the first model, and scales the weight of the convolution kernel using the optimal scaling coefficient of that convolution kernel.
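The scaling equivalence can be checked numerically. The sketch below uses two fully connected layers as stand-ins for 1x1 convolution layers (the description notes further below that fully connected layers are equivalent to convolution layers in this respect): scaling the output channels of the previous layer by positive factors and compensating in the matching input channels of the latter layer leaves the two-layer output unchanged when the intermediate activation is a ReLU. All names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
w1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)   # previous layer: 3 -> 4 channels
w2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)   # latter layer:   4 -> 2 channels
x = rng.normal(size=3)
relu = lambda t: np.maximum(t, 0.0)

lam = np.array([0.5, 2.0, 1.5, 0.8])                   # per-channel scaling coefficients
w1_s, b1_s = w1 * lam[:, None], b1 * lam               # scale previous-layer rows and bias
w2_s = w2 / lam[None, :]                               # compensate in latter-layer columns

y = w2 @ relu(w1 @ x + b1) + b2
y_s = w2_s @ relu(w1_s @ x + b1_s) + b2
print(np.allclose(y, y_s))                             # True: outputs are identical
```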
For two successive convolution layers in the first model, assume that the convolution calculation of the previous convolution layer is expressed as y_1 = W_1 * x + B_1, and the convolution calculation of the latter convolution layer is expressed as y_2 = W_2 * x + B_2. Then the convolution calculation of these two successive convolution layers can be expressed as follows:
y = W_2 * (W_1 * x + B_1) + B_2
  = (λ_2 W_2) * ((λ_1 W_1) * x + λ_1 B_1) + B_2, with λ_1 · λ_2 = 1 (16)
wherein W_1 represents the weight of the previous convolution layer and W_2 represents the weight of the latter convolution layer; B_1 represents the bias of the previous convolution layer and B_2 represents the bias of the latter convolution layer; x in W_1 * x represents the input of the previous convolution layer, and x in W_2 * x represents the input of the latter convolution layer. It will be appreciated that the input of the latter convolution layer is the output of the previous convolution layer. λ_1 represents the scaling coefficient of the previous convolution layer, and λ_2 represents the scaling coefficient of the latter convolution layer. λ_1 W_1 represents the new weight obtained by multiplying the weight W_1 by the scaling coefficient λ_1, and λ_2 W_2 represents the new weight obtained by multiplying the weight W_2 by the scaling coefficient λ_2.
As can be seen from equation (16) above, for two consecutive convolution layers in the first model, the calculation of the former convolution layer affects the latter convolution layer. Therefore, when scaling the weight of the current convolution layer, the embodiment of the invention refers to the weight of the current convolution layer and the weight of the previous convolution layer at the same time.
The embodiment of the invention does not limit the method of calculating the above scaling coefficients λ_1 and λ_2.
The calculation method of (2) is not limited. For example, for a convolution layer, embodiments of the present invention calculate a corresponding optimal scaling factor for each convolution kernel of the convolution layer, multiplying the weight of each convolution kernel in the convolution layer by the optimal scaling factor corresponding to the convolution kernel is equivalent to multiplying the output of the convolution layer by the optimal scaling factor. Further, for two successive convolution layers, embodiments of the present invention calculate the optimal scaling factor for the ith convolution kernel in the subsequent one of the convolution layers using the following equation:
λ_i = f(W_1^(i), W_2^(i)) (17)
wherein W_1^(i) represents the weight of the i-th convolution kernel of the previous one of the two successive convolution layers, W_2^(i) represents the weight of the i-th convolution kernel of the latter one of the two successive convolution layers, and λ_i represents the optimal scaling coefficient of the i-th convolution kernel of the latter one of the two successive convolution layers, calculated from W_1^(i) and W_2^(i).
For each convolution layer in the first model, calculating an optimal scaling coefficient corresponding to each convolution kernel in each convolution layer based on the above formula (17), and scaling the weight of the convolution kernel in each convolution layer by using the optimal scaling coefficient corresponding to each convolution kernel in each convolution layer.
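A hedged sketch of steps S11 and S12 is given below for a pair of successive convolution layers. The per-kernel coefficient used here (the square root of the ratio of the two kernels' weight ranges, which equalizes the two ranges) is only an assumption standing in for formula (17), whose exact expression is not reproduced in this text; the tensor layout and names are likewise illustrative.

```python
import numpy as np

def calibrate_pair(w_prev, b_prev, w_curr):
    """w_prev: (m, c_in, kh, kw) kernels of the previous layer;
    w_curr: (c_out, m, kh, kw) kernels of the current layer."""
    m = w_prev.shape[0]
    lam = np.empty(m)
    for i in range(m):
        r_prev = np.abs(w_prev[i]).max()      # weight range of kernel i, previous layer
        r_curr = np.abs(w_curr[:, i]).max()   # weight range of input channel i, current layer
        lam[i] = np.sqrt(r_curr / r_prev)     # assumed stand-in for formula (17)
    # Step S12: scale the previous layer's kernel i (and bias) by lam_i and compensate
    # in the current layer so the composite output is unchanged, as in formula (16).
    w_prev_s = w_prev * lam[:, None, None, None]
    b_prev_s = b_prev * lam
    w_curr_s = w_curr / lam[None, :, None, None]
    return w_prev_s, b_prev_s, w_curr_s, lam

rng = np.random.default_rng(1)
wp, bp = rng.normal(size=(4, 3, 3, 3)), rng.normal(size=4)
wc = rng.normal(size=(8, 4, 3, 3))
wp_s, bp_s, wc_s, lam = calibrate_pair(wp, bp, wc)
print(lam)
```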
It will be appreciated that, for the first convolution layer in the first model, since it has no previous convolution layer, the weight of the first convolution layer may not be scaled, or the weight of the first convolution layer may be scaled according to a preset scaling factor, or the weight of the first convolution layer may be scaled based on the weight of the second convolution layer, or the like, which is not a limitation of the embodiments of the present invention.
It should be noted that, the fully-connected layer constructed according to the convolution layer calculation is substantially equivalent to the convolution layer, and thus, the convolution layer described in the embodiment of the present invention may further include the fully-connected layer. In addition, an activation function such as a ReLU class may be connected after the convolutional layer. Because the activation functions of the full-connection layer and the ReLU class have telescopic equivalence, the embodiment of the invention scales the weight of each convolution layer in the first model, and the output results of the activation functions of the full-connection layer and the ReLU class are not affected.
According to the embodiment of the invention, the weights of every two continuous convolution layers in the first model are scaled, and when the weights of the current convolution layers are scaled, the weights of the current convolution layers and the weights of the previous convolution layers are referenced at the same time, so that the distribution range of the weights of each convolution layer after adjustment is in a relatively gentle interval, and the condition that the fluctuation of the range of the weights of certain convolution layers is large can be avoided. In addition, because the neural network is a series computing mode, the scale scaling of the current convolution layer refers to the weight of the previous convolution layer, so that the accuracy of the neural network can be ensured.
Scaling the weight of each convolution layer in the first model to obtain a second model, and then quantizing the second model based on a preset quantization mode to obtain a third model; the third model is a fixed-point model.
The embodiment of the invention does not limit the preset quantization mode. Illustratively, according to quantization granularity, the preset quantization mode may include layer-by-layer quantization and channel-by-channel quantization. Layer-by-layer quantization takes one layer (convolution, pooling, etc.) as the unit, and the weights of the entire layer share one set of scale S and zero point Z. Channel-by-channel quantization takes one channel as the unit, and each channel separately uses its own set of scale S and zero point Z.
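To make the two granularities concrete, the following sketch derives per-layer and per-channel scale/zero-point pairs for the same weight tensor; the helper names are illustrative and the int8 range is assumed.

```python
import numpy as np

def per_layer_params(w, q_min=-128, q_max=127):
    """One (S, Z) pair shared by the whole layer."""
    s = (w.max() - w.min()) / (q_max - q_min)
    z = int(round(q_max - w.max() / s))
    return s, z

def per_channel_params(w, q_min=-128, q_max=127):
    """One (S, Z) pair per output channel; w has shape (out_channels, ...)."""
    flat = w.reshape(w.shape[0], -1)
    s = (flat.max(axis=1) - flat.min(axis=1)) / (q_max - q_min)
    z = np.round(q_max - flat.max(axis=1) / s).astype(np.int32)
    return s, z

w = np.random.default_rng(2).normal(size=(8, 4, 3, 3)).astype(np.float32)
print(per_layer_params(w))     # a single scale and zero point
print(per_channel_params(w))   # 8 scales and 8 zero points
```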
The embodiment of the invention scales the weight of each convolution layer in the first model, and aims to calibrate the distribution range of the weight of each convolution layer in the first model, so that the distribution range of the weight of each convolution layer in the second model obtained after calibration is in a relatively gentle interval, and the condition that the fluctuation of the weight range of certain convolution layers is large can be avoided. Therefore, after the second model is quantized to obtain a third model, the shift times of convolution calculation of each convolution layer in the third model are distributed in a relatively middle area range (such as a preset range), and therefore hardware resources required by the convolution calculation can be reduced.
However, in practical applications, even after scaling, there may still be convolution layers whose weight ranges fluctuate widely, which in turn means that the shift numbers of the convolution calculations of these convolution layers cannot meet the predetermined range. Therefore, after quantizing the second model to obtain the third model, the embodiment of the present invention corrects the shift number of the target convolution layer in the third model. Specifically, a target convolution layer is determined in the third model, and the shift number of the target convolution layer is corrected to obtain the target model. The target convolution layer refers to a convolution layer whose shift number in the third model does not meet the predetermined range; after correction, the shift number of each convolution layer of the finally obtained target model meets the predetermined range.
In an alternative embodiment of the present invention, the determining the target convolution layer in the third model may include:
determining a first boundary value and a second boundary value corresponding to the preset range;
and for the current convolution layer in the third model, if the shift frequency of the current convolution layer is smaller than the first boundary value or larger than the second boundary value, determining that the current convolution layer is a target convolution layer.
For example, if the predetermined range is 6 to 25, the first boundary value is 6 and the second boundary value is 25. Each convolution layer in the third model may be traversed to determine if a number of shifts for each convolution layer is between the first boundary value and the second boundary value. And if the shift number of a certain convolution layer is smaller than the first boundary value or larger than the second boundary value, determining the convolution layer as a target convolution layer.
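A minimal sketch of this traversal is shown below; the dictionary-based representation of the per-layer shift numbers is purely illustrative.

```python
def find_target_layers(shift_counts, first_boundary=6, second_boundary=25):
    """shift_counts: {layer_name: n value}; returns the layers whose n is out of range."""
    return [name for name, n in shift_counts.items()
            if n < first_boundary or n > second_boundary]

print(find_target_layers({"conv1": 3, "conv2": 12, "conv3": 27}))  # ['conv1', 'conv3']
```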
And aiming at different preset quantization modes, the meanings of the shift times of the convolution layers in the third model are different. In an optional embodiment of the present invention, if the preset quantization mode is layer-by-layer quantization, the shift number of a certain convolution layer in the third model is the shift number of each weight sharing in the convolution layer; or if the preset quantization mode is channel-by-channel quantization, the shift times of a certain convolution layer in the third model comprise shift times corresponding to weights in the convolution layer respectively.
In implementations, a convolution layer may include multiple convolution kernels, each of which has a respective weight. If a layer-by-layer (per-layer) quantization scheme is employed, the weights in one convolution layer share one shift number (the shift number is also referred to as the n value in the embodiment of the present invention). If a channel-by-channel (per-channel) quantization scheme is used, each channel weight in one convolution layer corresponds to a respective shift number (n value). If there are 32 channels in one convolution layer, the convolution layer corresponds to 32 weights, and each weight corresponds to a respective n value; for example, the n values corresponding to the 32 weights are denoted as n1, n2, n3, ..., n32.
In an optional embodiment of the present invention, the correcting the shift number of the target convolutional layer may include:
if the preset quantization mode is layer-by-layer quantization, correcting the shift times of each weight sharing in the target convolution layer; or alternatively, the process may be performed,
if the preset quantization mode is channel-by-channel quantization, correcting the shift times corresponding to the target weight in the target convolution layer; the target weight refers to a weight in which the number of shifts in the target convolutional layer does not satisfy the predetermined range.
For different preset quantization modes, the meanings of the shift times of the convolution layers in the third model are different, so that the mode of correcting the shift times of the target convolution layers is also different.
If the preset quantization mode is layer-by-layer quantization, each weight in the target convolution layer shares a shift number, so that the shift number shared by each weight in the target convolution layer can be corrected. If the preset quantization mode is channel-by-channel quantization, each weight in the target convolution layer corresponds to a respective shift number, so that the shift number corresponding to the target weight in the target convolution layer can be corrected; the target weight refers to a weight in which the number of shifts in the target convolutional layer does not satisfy the predetermined range.
Further, in practical applications, when the channel-by-channel quantization mode is used, each channel weight corresponds to its own n value; for example, one convolution layer has 32 weights, and the 32 weights correspond to n1 to n32, 32 n values in total. If there are 10 target weights among the 32 weights, the n values corresponding to the 10 target weights (assumed to be n1 to n10) would need to be corrected separately, resulting in a large amount of calculation. In order to reduce the amount of calculation, the embodiment of the present invention unifies the n values corresponding to the weights in one convolution layer to obtain a single unified n value. For example, for one convolution layer, the n values corresponding to the weights in the convolution layer may be averaged, weighted-averaged, or their median taken, and the obtained result is used as the n value corresponding to the convolution layer.
For example, in the above example, an average value may be calculated for n values (such as n1 to n 32) corresponding to 32 weights in the convolution layer, and the obtained average value may be used as the n value corresponding to the convolution layer. Thus, when a layer-by-layer quantization or channel-by-channel quantization mode is adopted, each convolution layer corresponds to one n value, so as to reduce computing resources and storage resources.
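The unification step can be sketched as follows; the supported reduction modes mirror the examples given above (mean, weighted mean, median), and the function name is illustrative.

```python
import numpy as np

def unify_shift_counts(n_values, mode="mean", weights=None):
    """Collapse the per-channel shift counts of one convolution layer into a single n value."""
    n_values = np.asarray(n_values, dtype=np.float64)
    if mode == "mean":
        return int(round(float(n_values.mean())))
    if mode == "weighted_mean":
        return int(round(float(np.average(n_values, weights=weights))))
    if mode == "median":
        return int(round(float(np.median(n_values))))
    raise ValueError(f"unknown mode: {mode}")

print(unify_shift_counts([14, 15, 13, 16, 15]))  # 15
```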
In an optional embodiment of the present invention, the correcting the shift number of the target convolutional layer may include:
determining a correction error according to the shift times of the target convolution layer and the preset range;
the shift times of the target convolution layer are modified to be a first boundary value or a second boundary value of the preset range, and the approximate integer parameter in the convolution calculation of the target convolution layer is modified according to the correction error.
In a specific implementation, convolution layers whose shift numbers do not meet the predetermined range may still exist in the model after scaling. Because the convolution calculation in the embodiment of the present invention only uses the hardware-related logic of the predetermined range, the convolution calculation of a convolution layer whose shift number falls outside the predetermined range cannot operate normally. In order to avoid this problem, the embodiment of the present invention corrects the shift number of the target convolution layer in the third model.
Specifically, after the target convolutional layer is determined, the shift number of the target convolutional layer may be corrected by the following equation.
n' = 25 (correction error M = |n − 25|), if n > 25
n' = 6 (correction error N = |n − 6|), if n < 6 (19)
wherein n represents the n value corresponding to the target convolution layer; in the layer-by-layer quantization mode, this n value can be the n value shared by the weights in the target convolution layer, and in the channel-by-channel quantization mode, this n value may be obtained by uniformly calculating (e.g., averaging) the n values corresponding to the weights in the target convolution layer. M is the absolute value of the amount by which n exceeds 25, and N is the absolute value of the amount by which n falls below 6. After the correction of the above formula (19), the shift number of each convolution layer in the third model meets the predetermined range, that is, the n value of each convolution layer in the third model is between 6 and 25, and the third model obtained at this time is the target model.
In one example, assume that a certain target convolution layer has a shift number of 26, which is 1 more than the second boundary value, i.e., one bit more of right shift. According to the shift number 26 of the target convolution layer and the predetermined range 6 to 25, the correction error is determined to be one bit of right shift. The approximate integer parameter in the convolution calculation of the target convolution layer is then corrected according to the correction error so as to absorb the one extra bit of right shift.
The approximate integer parameter in the convolution calculation refers to A_int in the above formula (15). For the target convolution layer whose shift number exceeds the range by one bit of right shift, the above formula (15) for the convolution calculation can be modified as follows:
R_int = ((A_int >> 1) × (X + B)) >> (25 + k) (20)
the shift number of the target convolution layer is corrected to 25 by the correction of the above formula (20) from 26, and one bit which is originally shifted to the right is fused into A by the correction error to calculate, for example, the correction error is converted into the number of times of multiplying A by 2 or dividing A by 2. Wherein the number of times a times 2 or the number of times a divided by 2 is determined by the number of right-shifted bits or the number of right-shifted bits less indicated by the correction error. For example, if the correction error indicates one more right shift, a may be left shifted by one to ensure that the final result is unchanged.
It should be noted that the above formulas (19) and (20) take the case where the result of the convolution calculation is stored in int32 bits as an example.
The determining a correction error according to the shift number of the target convolution layer and the predetermined range may include: if the shift number of the target convolution layer is greater than the second boundary value of the predetermined range, taking the difference between the shift number of the target convolution layer and the second boundary value as the correction error; and if the shift number of the target convolution layer is smaller than the first boundary value of the predetermined range, taking the difference between the first boundary value and the shift number of the target convolution layer as the correction error. Further, if the shift number of the target convolution layer is greater than the second boundary value of the predetermined range, the shift number of the target convolution layer is modified to the second boundary value of the predetermined range; and if the shift number of the target convolution layer is smaller than the first boundary value of the predetermined range, the shift number of the target convolution layer is modified to the first boundary value of the predetermined range.
Correcting the approximate integer parameter in the convolution calculation of the target convolution layer according to the correction error refers to converting the correction error into a number of times that A_int is multiplied by 2 or divided by 2. For example, if the correction error is one bit of right shift, it can be converted into dividing A_int by 2.
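A hedged sketch of the full correction follows: the per-layer shift number is clamped to the predetermined range, and the correction error is folded into the integer multiplier A_int by shifting it the opposite way, so that the overall product of formula (15) is preserved (exactly when n is too small, approximately when n is too large). The boundary values 6 and 25 are taken from the example above; the names are illustrative.

```python
def correct_shift(a_int, n, first_boundary=6, second_boundary=25):
    """Return (corrected a_int, corrected n) with n inside [first_boundary, second_boundary]."""
    if n > second_boundary:
        excess = n - second_boundary       # correction error M: right shifts the hardware cannot perform
        return a_int >> excess, second_boundary
    if n < first_boundary:
        deficit = first_boundary - n       # correction error N: right shifts the hardware performs in excess
        return a_int << deficit, first_boundary
    return a_int, n

# Example matching the text: n = 26 exceeds the range by one, so A_int is halved instead.
print(correct_shift(25796, 26))   # (12898, 25)
print(correct_shift(25796, 3))    # (206368, 6)
```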
After correcting the shift number of each target convolution layer in the third model, a target model may be obtained. The target model can be deployed on target equipment and can execute target tasks.
Taking the target task as a vehicle detection task as an example, an image to be detected is input into the target model; features are first extracted by the Backbone, then the Neck extracts features of different scales from the features output by the backbone network and processes them, then the Head decodes the feature map from the output result of the Neck, and finally the position information of the vehicle in the image is output.
The embodiment of the invention does not limit the specific form of the target equipment. By way of example, the target device may be a User Equipment (UE), a mobile device, a User terminal, a cellular phone, a cordless phone, a personal digital assistant (Personal Digital Assistant, PDA), a handheld device, a computing device, an in-vehicle device, a wearable device, or the like.
In summary, the embodiment of the invention provides a model compression method, before a floating point model is quantized, the weight range of each convolution layer of the floating point model (a first model) is calibrated, so that the weight of each convolution layer is distributed in a relatively gentle interval, the condition that the range fluctuation of the weight of certain convolution layers is large can be avoided, the value range of shift times when the fixed point model obtained after quantization carries out convolution calculation can be reduced, and further hardware resources required by the fixed point model for carrying out convolution calculation can be reduced. In addition, in the embodiment of the invention, the range of the weight of each convolution layer of the floating point model (the first model) is calibrated to obtain the second model, the second model is quantized to obtain the third model, and then the shift times of the target convolution layer in the third model are further corrected to ensure that the convolution calculation of the target convolution layer can be correctly executed. The shift times of each convolution layer of the finally obtained target model meet a preset range, and the preset range is the minimum value range of the shift times of each convolution layer of the fixed-point model for convolution calculation under the premise of ensuring the accuracy of the fixed-point model. Therefore, the embodiment of the invention can reduce redundancy of the fixed-point model as much as possible on the premise of ensuring the precision of the fixed-point model, thereby reducing hardware resources required by convolution calculation of the fixed-point model.
It should be noted that, for simplicity of description, the method embodiments are shown as a series of acts, but it should be understood by those skilled in the art that the embodiments are not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred embodiments, and that the acts are not necessarily required by the embodiments of the invention.
Referring to fig. 3, there is shown a block diagram of an embodiment of a model compression device of the present invention, the device comprising:
the model training module 301 is configured to obtain a first model, where the first model is a floating point model obtained by training with a training set;
the weight calibration module 302 is configured to scale the weight of each convolution layer in the first model to obtain a second model;
the model quantization module 303 is configured to quantize the second model based on a preset quantization mode, so as to obtain a third model; the third model is a fixed-point model;
the numerical value correction module 304 is configured to determine a target convolution layer in the third model, and correct the shift number of the target convolution layer to obtain a target model; the target convolution layer refers to a convolution layer of which the shift times in the third model do not meet a preset range; the shift number of each convolution layer in the target model satisfies the predetermined range.
Optionally, the weight calibration module is specifically configured to scale, for a current convolutional layer in the first model, a weight of the current convolutional layer based on a weight of a preceding convolutional layer of the current convolutional layer and a weight of the current convolutional layer.
Optionally, the weight calibration module includes:
the coefficient calculation sub-module is used for calculating the optimal scaling coefficient of each convolution kernel in the current convolution layer; the optimal scaling coefficient of the ith convolution kernel in the current convolution layer is obtained by calculation according to the weight of the ith convolution kernel in the previous convolution layer of the current convolution layer and the weight of the ith convolution kernel in the current convolution layer; i is 1-m, and m is the number of convolution kernels in the current convolution layer;
and the weight scaling sub-module is used for scaling the weights of the m convolution kernels in the current convolution layer by utilizing the optimal scaling coefficients respectively corresponding to the m convolution kernels in the current convolution layer.
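For illustration only, a sketch of what these two sub-modules might compute, assuming each kernel's range is its maximum absolute weight and using the geometric-mean balance sqrt(r_prev_i / r_cur_i) as the coefficient; the patent's own optimal scaling coefficient may be computed differently.

```python
import numpy as np

def optimal_scales(w_prev: np.ndarray, w_cur: np.ndarray) -> np.ndarray:
    """One coefficient per kernel; kernel i is slice i along the first (output-channel) axis."""
    r_prev = np.abs(w_prev).reshape(w_prev.shape[0], -1).max(axis=1)  # range of kernel i in the previous layer
    r_cur = np.abs(w_cur).reshape(w_cur.shape[0], -1).max(axis=1)     # range of kernel i in the current layer
    return np.sqrt(r_prev / np.maximum(r_cur, 1e-12))  # >1 boosts small-range kernels, <1 damps large ones

def scale_current_layer(w_cur: np.ndarray, scales: np.ndarray) -> np.ndarray:
    # Broadcast coefficient i over every weight of kernel i.
    return w_cur * scales.reshape(-1, *([1] * (w_cur.ndim - 1)))
```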
Optionally, if the preset quantization mode is layer-by-layer quantization, the shift times of a certain convolution layer in the third model are a single shift number shared by all the weights in the convolution layer; or, if the preset quantization mode is channel-by-channel quantization, the shift times of a certain convolution layer in the third model comprise the shift times respectively corresponding to the weights in the convolution layer.
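As a hedged illustration of where these shift times can come from, assuming every quantization scale s is rewritten as A / 2**n (A the approximate integer parameter, n the shift times); the frexp-based derivation and the a_bits default are assumptions, not the patented formula.

```python
import math

def shift_times_of(scale: float, a_bits: int = 8) -> int:
    """Write scale ≈ A / 2**n with A representable in roughly a_bits bits; return n."""
    _, exp = math.frexp(scale)   # scale = mant * 2**exp, with 0.5 <= mant < 1 for scale > 0
    return a_bits - exp

def layer_shift_times(scales, per_channel: bool):
    if per_channel:
        return [shift_times_of(s) for s in scales]  # one shift count per weight channel
    return shift_times_of(scales[0])                # a single shift count shared by all weights in the layer
```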
Optionally, the numerical correction module includes:
the first correction submodule is used for correcting the shift number shared by all the weights in the target convolution layer if the preset quantization mode is layer-by-layer quantization; or,
the second correction submodule is used for correcting the shift times corresponding to the target weight in the target convolution layer if the preset quantization mode is channel-by-channel quantization; the target weight refers to a weight whose shift times in the target convolution layer do not satisfy the predetermined range.
Optionally, the numerical correction module includes:
the first determining submodule is used for determining a first boundary value and a second boundary value corresponding to the preset range;
and the second determining submodule is used for determining the current convolution layer in the third model as a target convolution layer if the shift times of the current convolution layer are smaller than the first boundary value or larger than the second boundary value.
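A minimal sketch of these two determining sub-modules, assuming layer-by-layer quantization so that each convolution layer carries a single shift count; the dictionary representation is an assumption for illustration.

```python
def find_target_layers(shift_times_per_layer: dict, first_boundary: int, second_boundary: int) -> list:
    """Return the names of convolution layers whose shift times fall outside the predetermined range."""
    return [name for name, n in shift_times_per_layer.items()
            if n < first_boundary or n > second_boundary]
```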
Optionally, the numerical correction module includes:
an error determination submodule, configured to determine a correction error according to the shift number of the target convolutional layer and the predetermined range;
and the numerical value correction sub-module is used for correcting the shift times of the target convolution layer to be a first boundary value or a second boundary value of the preset range, and correcting the approximate integer parameter in the convolution calculation of the target convolution layer according to the correction error.
Optionally, the predetermined range is determined according to the accuracy of quantization.
The embodiment of the invention provides a model compression device. Before the floating point model is quantized, the device calibrates the weight range of each convolution layer of the floating point model (the first model), so that the weights of each convolution layer are distributed over a relatively flat interval and large fluctuations in the weight ranges of individual convolution layers are avoided; this narrows the value range of the shift times used when the quantized fixed-point model performs convolution calculation, and thereby reduces the hardware resources required by the fixed-point model for convolution calculation. In addition, in the embodiment of the invention, the weight range of each convolution layer of the floating point model (the first model) is calibrated to obtain the second model, the second model is quantized to obtain the third model, and the shift times of the target convolution layers in the third model are then further corrected, so that the convolution calculation of each target convolution layer can be executed correctly. The shift times of every convolution layer of the finally obtained target model fall within a predetermined range, and this predetermined range is the smallest value range of shift times with which each convolution layer of the fixed-point model can perform convolution calculation while the accuracy of the fixed-point model is preserved. Therefore, the embodiment of the invention reduces the redundancy of the fixed-point model as far as possible while preserving its accuracy, and thus reduces the hardware resources required for the convolution calculation of the fixed-point model.
For the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points.
In this specification, each embodiment is described in a progressive manner, each embodiment focuses on its differences from the other embodiments, and identical or similar parts among the embodiments may be referred to one another.
The specific manner in which the various modules of the apparatus in the above embodiments perform their operations has been described in detail in the embodiments of the method and will not be elaborated here.
The embodiment of the present invention further provides a non-transitory computer readable storage medium. When the instructions in the storage medium are executed by a processor of an apparatus (a server or a terminal), the apparatus is enabled to perform the model compression method described in the embodiment corresponding to fig. 1, which is therefore not repeated here. The description of the beneficial effects of the same method is likewise omitted. For technical details not disclosed in the storage medium embodiment of the present application, refer to the description of the method embodiments of the present application.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It is to be understood that the invention is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the invention is limited only by the appended claims.
The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within the scope of the invention.
The foregoing has described in detail the model compression method, the model compression device and the machine-readable storage medium provided by the present invention. Specific examples are used herein to illustrate the principles and embodiments of the present invention, and the description of the above embodiments is only intended to help understand the method of the present invention and its core idea. Meanwhile, those skilled in the art may make changes to the specific embodiments and to the application scope in accordance with the idea of the present invention. In view of the above, the content of this specification should not be construed as limiting the present invention.

Claims (10)

1. A method of model compression, the method comprising:
obtaining a first model, wherein the first model is a floating point model obtained by training with a training set, and the training set is determined according to a target task;
scaling the weight of each convolution layer in the first model to obtain a second model;
quantizing the second model based on a preset quantization mode to obtain a third model; the third model is a fixed-point model;
determining a target convolution layer in the third model, and correcting the shift times of the target convolution layer to obtain a target model; the target convolution layer refers to a convolution layer of which the shift times in the third model do not meet a predetermined range; the shift number of each convolution layer in the target model satisfies the predetermined range.
2. The method of claim 1, wherein scaling the weights of each convolution layer in the first model comprises:
for a current convolutional layer in the first model, scaling the weight of the current convolutional layer based on the weight of a preceding convolutional layer of the current convolutional layer and the weight of the current convolutional layer.
3. The method of claim 2, wherein the scaling the weights of the current convolutional layer based on the weights of a previous convolutional layer to the current convolutional layer and the weights of the current convolutional layer comprises:
calculating an optimal scaling factor of each convolution kernel in the current convolution layer; the optimal scaling coefficient of the ith convolution kernel in the current convolution layer is obtained by calculation according to the weight of the ith convolution kernel in the previous convolution layer of the current convolution layer and the weight of the ith convolution kernel in the current convolution layer; i is 1-m, and m is the number of convolution kernels in the current convolution layer;
and scaling the weights of the m convolution kernels in the current convolution layer by utilizing the optimal scaling coefficients respectively corresponding to the m convolution kernels in the current convolution layer.
4. The method of claim 1, wherein if the preset quantization mode is layer-by-layer quantization, the shift times of a certain convolution layer in the third model are a single shift number shared by all the weights in the convolution layer; or, if the preset quantization mode is channel-by-channel quantization, the shift times of a certain convolution layer in the third model comprise the shift times respectively corresponding to the weights in the convolution layer.
5. The method of claim 4, wherein correcting the number of shifts of the target convolutional layer comprises:
if the preset quantization mode is layer-by-layer quantization, correcting the shift number shared by all the weights in the target convolution layer; or,
if the preset quantization mode is channel-by-channel quantization, correcting the shift times corresponding to the target weight in the target convolution layer; the target weight refers to a weight whose shift times in the target convolution layer do not satisfy the predetermined range.
6. The method of claim 1, wherein said determining a target convolutional layer in the third model comprises:
determining a first boundary value and a second boundary value corresponding to the predetermined range;
and for the current convolution layer in the third model, if the shift times of the current convolution layer are smaller than the first boundary value or larger than the second boundary value, determining that the current convolution layer is a target convolution layer.
7. The method of claim 1, wherein said correcting the number of shifts of the target convolutional layer comprises:
determining a correction error according to the shift times of the target convolution layer and the predetermined range;
and correcting the shift times of the target convolution layer to a first boundary value or a second boundary value of the predetermined range, and correcting the approximate integer parameter in the convolution calculation of the target convolution layer according to the correction error.
8. A method according to any one of claims 1 to 7, wherein the predetermined range is determined based on the accuracy of the quantization.
9. A model compression apparatus, the apparatus comprising:
the model training module is used for acquiring a first model, wherein the first model is a floating point model obtained by training with a training set, and the training set is determined according to a target task;
the weight calibration module is used for scaling the weight of each convolution layer in the first model to obtain a second model;
the model quantization module is used for quantizing the second model based on a preset quantization mode to obtain a third model; the third model is a fixed-point model;
the numerical correction module is used for determining a target convolution layer in the third model and correcting the shift times of the target convolution layer to obtain a target model; the target convolution layer refers to a convolution layer of which the shift times in the third model do not meet a predetermined range; the shift number of each convolution layer in the target model satisfies the predetermined range.
10. A machine readable storage medium having instructions stored thereon, which when executed by one or more processors of an apparatus, cause the apparatus to perform the model compression method of any of claims 1 to 8.
CN202310704493.6A 2023-06-14 2023-06-14 Model compression method, device and readable storage medium Active CN116432715B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310704493.6A CN116432715B (en) 2023-06-14 2023-06-14 Model compression method, device and readable storage medium

Publications (2)

Publication Number Publication Date
CN116432715A true CN116432715A (en) 2023-07-14
CN116432715B CN116432715B (en) 2023-11-10

Family

ID=87085928

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310704493.6A Active CN116432715B (en) 2023-06-14 2023-06-14 Model compression method, device and readable storage medium

Country Status (1)

Country Link
CN (1) CN116432715B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11630982B1 (en) * 2018-09-14 2023-04-18 Cadence Design Systems, Inc. Constraint-based dynamic quantization adjustment for fixed-point processing
US10872295B1 (en) * 2019-09-19 2020-12-22 Hong Kong Applied Science and Technology Institute Company, Limited Residual quantization of bit-shift weights in an artificial neural network
US20230075514A1 (en) * 2020-04-14 2023-03-09 Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. Concept for a representation of neural network parameters
WO2022041188A1 (en) * 2020-08-31 2022-03-03 深圳市大疆创新科技有限公司 Accelerator for neural network, acceleration method and device, and computer storage medium
CN114444679A (en) * 2020-11-06 2022-05-06 山东产研鲲云人工智能研究院有限公司 Method and system for quantizing binarization input model and computer readable storage medium
CN112508125A (en) * 2020-12-22 2021-03-16 无锡江南计算技术研究所 Efficient full-integer quantization method of image detection model
CN114418121A (en) * 2022-01-25 2022-04-29 Oppo广东移动通信有限公司 Model training method, object processing method and device, electronic device and medium
CN115761830A (en) * 2022-09-09 2023-03-07 平安科技(深圳)有限公司 Face recognition model quantitative training method, device, equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LI GUOFU: "Research and Application of Model Quantization Methods in Computer Vision", China Master's Theses Full-text Database (Information Science and Technology), no. 1, pages 138-1047 *

Also Published As

Publication number Publication date
CN116432715B (en) 2023-11-10

Similar Documents

Publication Publication Date Title
CN110555508B (en) Artificial neural network adjusting method and device
CN109949255B (en) Image reconstruction method and device
CN110555450B (en) Face recognition neural network adjusting method and device
WO2019238029A1 (en) Convolutional neural network system, and method for quantifying convolutional neural network
WO2023040510A1 (en) Image anomaly detection model training method and apparatus, and image anomaly detection method and apparatus
CN111401550A (en) Neural network model quantification method and device and electronic equipment
CN112149797B (en) Neural network structure optimization method and device and electronic equipment
EP4087239A1 (en) Image compression method and apparatus
CN110780938B (en) Computing task unloading method based on differential evolution in mobile cloud environment
CN112508125A (en) Efficient full-integer quantization method of image detection model
TW202141358A (en) Method and apparatus for image restoration, storage medium and terminal
CN109766800B (en) Construction method of mobile terminal flower recognition model
CN111489322A (en) Method and device for adding sky filter to static picture
CN113095254A (en) Method and system for positioning key points of human body part
CN112581392A (en) Image exposure correction method, system and storage medium based on bidirectional illumination estimation and fusion restoration
CN114972850A (en) Distribution inference method and device for multi-branch network, electronic equipment and storage medium
CN116402679A (en) Lightweight infrared super-resolution self-adaptive reconstruction method
CN112614110B (en) Method and device for evaluating image quality and terminal equipment
CN116432715B (en) Model compression method, device and readable storage medium
CN111937011A (en) Method and equipment for determining weight parameters of neural network model
CN115983349A (en) Method and device for quantizing convolutional neural network, electronic device and storage medium
CN115375715A (en) Target extraction method and device, electronic equipment and storage medium
US11689693B2 (en) Video frame interpolation method and device, computer readable storage medium
CN114820363A (en) Image processing method and device
CN114298291A (en) Model quantization processing system and model quantization processing method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant