CN117787378A - Model compression method, device, equipment and storage medium - Google Patents

Model compression method, device, equipment and storage medium

Info

Publication number
CN117787378A
Authority
CN
China
Prior art keywords
model
compressed
pruning
determining
characteristic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311615248.4A
Other languages
Chinese (zh)
Inventor
杨启航
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Communication Technology Co Ltd
Original Assignee
Inspur Communication Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Communication Technology Co Ltd filed Critical Inspur Communication Technology Co Ltd
Priority to CN202311615248.4A
Publication of CN117787378A
Legal status: Pending

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The application relates to the field of data processing and provides a model compression method, device, equipment and storage medium, wherein the method comprises the following steps: constructing a characteristic model based on at least one characteristic parameter of a target platform; performing feature analysis on the model to be compressed to obtain a feature analysis result; determining a pruning mode based on the characteristic model and the feature analysis result; and pruning the model to be compressed based on the pruning mode and performing quantization compression on the pruned model. By dynamically selecting and applying a suitable compression method according to the resource limits and requirements of the target platform, the compression process becomes more flexible and adaptive, and customized optimization can be performed for different platforms and application scenarios.

Description

Model compression method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of data processing, and in particular, to a method, apparatus, device, and storage medium for model compression.
Background
With the rapid development of artificial intelligence technology, algorithm models are widely applied in various fields. However, large-scale complex models often occupy a large amount of memory and computing resources, and deploying and running these models on resource-constrained devices and platforms is a challenging task. Therefore, compressing algorithm models has become an important research direction.
At present, existing model compression methods, such as weight pruning, quantization and low-rank decomposition, are static and cannot adapt to the requirements of different platforms and application scenarios, so the model compression process lacks flexibility and adaptability.
Disclosure of Invention
The application provides a model compression method, device, equipment and storage medium to solve the problem that existing model compression methods cannot adapt to the requirements of different platforms and application scenarios, leaving the compression process without flexibility or adaptability.
The application provides a model compression method, which comprises the following steps:
constructing a characteristic model based on at least one characteristic parameter of the target platform;
performing feature analysis on the model to be compressed to obtain a feature analysis result;
determining a pruning mode based on the characteristic model and the characteristic analysis result;
and pruning the model to be compressed based on the pruning mode, and performing quantization compression on the pruned model to be compressed.
In one embodiment, the determining a pruning manner based on the characteristic model and the feature analysis result includes:
analyzing the characteristic model to determine resource limitation information of the target platform;
determining the score of each model parameter in the model to be compressed based on the feature analysis result; the score characterizes the importance of the model parameters;
and determining the pruning mode based on the resource limit information and the scores of the model parameters.
In one embodiment, the determining the score of each model parameter in the model to be compressed based on the feature analysis result includes:
determining a gradient value of each model parameter in the model to be compressed for a loss function based on the feature analysis result, and determining a score of each model parameter based on the gradient value; or,
determining a norm value of each model parameter in the model to be compressed based on the feature analysis result, and determining a score of each model parameter based on the norm value; or,
and determining the contribution degree of each model parameter in the model to be compressed in the model output based on the feature analysis result, and determining the score of each model parameter based on the contribution degree.
In one embodiment, pruning the model to be compressed based on the pruning mode includes:
if the pruning mode is weight threshold pruning, deleting the model parameters with the score smaller than a first threshold;
if the pruning mode is channel pruning, deleting the channel with the weight value smaller than a second threshold value;
if the pruning mode is model structure pruning, deleting a model structure with a weight value smaller than a third threshold value, wherein the model structure at least comprises a convolution layer, a full connection layer and neurons.
In one embodiment, the performing quantization compression on the pruned model to be compressed includes:
adopting a sparse matrix to represent and store the pruned model to be compressed, and training the pruned model to be compressed;
determining a quantization compression mode based on a model architecture or model complexity of the model to be compressed;
and carrying out quantization compression on the trained model to be compressed based on the quantization compression mode.
In one embodiment, the training the pruned model to be compressed includes:
determining a loss function of the pruned model to be compressed;
and training the pruned model to be compressed based on the loss function to adjust model parameters of the pruned model to be compressed.
In an embodiment, after pruning the model to be compressed based on the pruning mode and performing quantization compression on the pruned model to be compressed, the method further includes:
and deploying the quantized and compressed model to the target platform, and processing input data by adopting the quantized and compressed model.
The application provides a model compression device, including:
the characteristic model construction module is used for constructing a characteristic model based on at least one characteristic parameter of the target platform;
the feature analysis module is used for carrying out feature analysis on the model to be compressed to obtain a feature analysis result;
the pruning mode determining module is used for determining pruning modes based on the characteristic model and the characteristic analysis result;
and the quantization compression module is used for pruning the model to be compressed based on the pruning mode and carrying out quantization compression on the pruned model to be compressed.
The application also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the model compression method as described in any one of the above when executing the program.
The present application also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a model compression method as described in any of the above.
The model compression method, device, equipment and storage medium construct a characteristic model based on at least one characteristic parameter of a target platform; perform feature analysis on the model to be compressed to obtain a feature analysis result; determine a pruning mode based on the characteristic model and the feature analysis result; and prune the model to be compressed based on the pruning mode and perform quantization compression on the pruned model. By dynamically selecting and applying a suitable compression method according to the resource limits and requirements of the target platform, the compression process becomes more flexible and adaptive, and customized optimization can be performed for different platforms and application scenarios.
Drawings
For a clearer description of the present application or of the prior art, the drawings that are used in the description of the embodiments or of the prior art will be briefly described, it being apparent that the drawings in the description below are some embodiments of the present application, and that other drawings may be obtained from these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow diagram of a model compression method provided herein;
FIG. 2 is a schematic structural view of a model compression device provided herein;
fig. 3 is a schematic structural diagram of an electronic device provided in the present application.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the present application more apparent, the technical solutions in the present application will be clearly and completely described below with reference to the drawings in the present application, and it is apparent that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
The model compression method, apparatus, device and storage medium of the present application are described below in conjunction with fig. 1-3.
Specifically, the present application provides a model compression method, and referring to fig. 1, fig. 1 is a schematic flow chart of the model compression method provided in the present application.
The model compression method provided by the embodiment of the application comprises the following steps:
step 100, constructing a characteristic model based on at least one characteristic parameter of a target platform;
at least one characteristic parameter of the target platform, such as parameters of hardware configuration, storage capacity, computing resources and the like, is acquired, and then a characteristic model is constructed based on the acquired characteristic parameter. The target platform refers to a platform needing to deploy a model. The characteristic model refers to a model or set of parameters describing characteristics of the target platform, including characteristic information of various hardware and software aspects, such as hardware configuration, computing resources, storage capacity, network bandwidth, power consumption, and the like.
Alternatively, the characteristic model may take the form of a structured data set containing information such as the names, values, or ranges of various characteristics, used to evaluate and compare different hardware platforms and to guide the design and deployment of the model. The specific form and content of the characteristic model depend on the application scenario and requirements. For example, in a deep learning task, the characteristic model may include information such as the model number, core count, clock frequency, memory size, video memory size and storage capacity of the hardware device, as well as performance-related indicators such as computing speed and memory bandwidth.
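As an illustration only, such a characteristic model can be held as a small structured record. The following Python sketch shows one convenient representation; the field names (device_name, num_cores, memory_mb and so on) and the example values are illustrative assumptions and are not prescribed by this application.

```python
# A minimal sketch of a target-platform characteristic model as structured data.
# All field names and values below are illustrative assumptions.
from dataclasses import dataclass, asdict

@dataclass
class PlatformCharacteristics:
    device_name: str              # hardware model identifier
    num_cores: int                # number of processor cores
    clock_ghz: float              # main clock frequency
    memory_mb: int                # RAM available to the model
    storage_mb: int               # persistent storage budget
    gflops: float                 # rough compute throughput
    memory_bandwidth_gbps: float  # memory bandwidth

# Example: a hypothetical embedded board used as the deployment target.
target = PlatformCharacteristics(
    device_name="edge-board-a",
    num_cores=4,
    clock_ghz=1.8,
    memory_mb=2048,
    storage_mb=512,
    gflops=30.0,
    memory_bandwidth_gbps=12.8,
)
print(asdict(target))  # the characteristic model as a plain dictionary
```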
Alternatively, if a model needs to be deployed in the device, the characteristic parameters of the device may be collected and then the characteristic model of the device may be constructed.
Step 200, performing feature analysis on the model to be compressed to obtain a feature analysis result;
When the model to be compressed is to be deployed, feature analysis needs to be performed on it to obtain a feature analysis result. The model to be compressed may be a pre-trained neural network model or another algorithm model. For example, feature analysis covering the model structure, parameter distribution, hierarchical connections and the like is performed to evaluate the importance of each parameter or layer of the model, so as to determine the parameters or layers that contribute most to model performance.
For example, gradient importance assessment: during training, gradient information can be used to evaluate parameter importance. The importance score of a parameter is obtained by computing and normalizing the gradient of the loss function with respect to that parameter; a larger gradient indicates that the parameter has a larger influence on model performance. For each layer, the average gradient of the layer's parameters may be used as the evaluation index, and other more specific gradient information, such as weight gradients or bias gradients, may also be considered.
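A minimal sketch of this kind of gradient-based scoring is given below, assuming PyTorch, a toy two-layer network and synthetic data; using the mean absolute gradient per parameter tensor as the score is one reasonable choice among those described above, not a form fixed by this application.

```python
# Sketch: score each parameter tensor by the mean absolute gradient of the loss.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
inputs, labels = torch.randn(8, 16), torch.randint(0, 4, (8,))

loss = nn.CrossEntropyLoss()(model(inputs), labels)
loss.backward()  # populates p.grad for every parameter

scores = {}
for name, p in model.named_parameters():
    if p.grad is not None:
        # Mean absolute gradient of the tensor serves as its importance score.
        scores[name] = p.grad.abs().mean().item()

for name, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {s:.6f}")
```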
Parameter sensitivity analysis: the importance of a parameter is evaluated by slightly varying its value and observing the change in model performance; if a small change in a parameter causes a significant drop in performance, that parameter can be considered to contribute greatly to the model. Sensitivity analysis may also be performed by perturbing parameter values one by one, for example increasing or decreasing them, and observing the resulting change in model performance.
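A rough sketch of such perturbation-based sensitivity analysis follows, again assuming PyTorch and synthetic data; the perturbation size and the per-tensor (rather than per-element) granularity are illustrative assumptions.

```python
# Sketch: perturb each parameter tensor slightly and measure the change in loss.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
inputs, labels = torch.randn(8, 16), torch.randint(0, 4, (8,))
criterion = nn.CrossEntropyLoss()

@torch.no_grad()
def eval_loss():
    return criterion(model(inputs), labels).item()

baseline = eval_loss()
eps = 1e-2
sensitivity = {}
for name, p in model.named_parameters():
    with torch.no_grad():
        p.add_(eps)               # perturb the whole tensor upward
    sensitivity[name] = abs(eval_loss() - baseline)
    with torch.no_grad():
        p.sub_(eps)               # restore the original values

for name, s in sorted(sensitivity.items(), key=lambda kv: -kv[1]):
    print(f"{name}: loss change {s:.6f}")
```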
Layer connection analysis: this is used to evaluate the contribution of each layer to model performance; by observing the connection pattern and information transfer between different layers, the layers with a larger influence on performance can be identified. For example, the connections between layers may be analyzed by visualizing the structure of the neural network and observing the paths along which information flows, from which the layers that play a key role in the model can be deduced.
Step 300, determining a pruning mode based on the characteristic model and the characteristic analysis result;
and determining a pruning mode based on the characteristic model and the characteristic analysis result, wherein the pruning mode comprises sparsity pruning, L1/L2 regularization pruning, parameter importance pruning, group sparsity pruning, dynamic pruning and the like.
Step 400, pruning the model to be compressed based on the pruning mode, and performing quantization compression on the pruned model to be compressed.
After the pruning mode is determined, pruning is carried out on the model to be compressed by adopting the pruning mode. Specifically, if the pruning mode is weight threshold pruning, deleting model parameters with scores smaller than a first threshold; if the pruning mode is channel pruning, deleting the channel with the weight value smaller than the second threshold value; if the pruning mode is model structure pruning, deleting the model structure with the weight value smaller than the third threshold value, wherein the model structure at least comprises a convolution layer, a full connection layer and neurons.
For example, pruning based on weight thresholds: a first threshold is set according to the weights of the model parameters, and the parameters whose scores fall below the threshold are removed, thereby reducing the size of the model.
Pruning based on channel importance: by calculating the importance index, such as norm, gradient, etc., of each channel, a second threshold of the channel is set, and channels with smaller contributions (i.e. weight values smaller than the second threshold) are removed, so as to reduce the computational complexity.
Pruning in combination with the model structure: the characteristics of the model structure, such as a convolution layer, a full connection layer, neurons and the like, are considered, and a targeted pruning strategy is designed so as to reduce the size and the calculation cost of the model to the greatest extent. For example, for a convolutional layer, the model size and computational complexity may be reduced by removing unimportant convolutional kernels (i.e., model structures with weight values less than a third threshold), while unimportant neurons in the network may be removed, i.e., the number of connections to inputs and outputs is reduced to reduce the model size.
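The sketch below illustrates the first two pruning modes on a toy PyTorch layer pair; the thresholds (a fixed magnitude threshold for weights, the median channel norm for channels) are illustrative assumptions, not values given by this application.

```python
# Sketch: weight-threshold pruning and channel pruning by masking.
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3)
fc = nn.Linear(8, 4)

# 1) Weight-threshold pruning: zero out weights whose magnitude (used here as
#    the score) falls below a first threshold.
first_threshold = 0.05
with torch.no_grad():
    mask = (fc.weight.abs() >= first_threshold).float()
    fc.weight.mul_(mask)

# 2) Channel pruning: rank output channels of the conv layer by the L1 norm of
#    their filters and zero the channels below a second threshold (median here).
with torch.no_grad():
    channel_norms = conv.weight.abs().sum(dim=(1, 2, 3))  # one L1 norm per channel
    second_threshold = channel_norms.median()
    keep = (channel_norms >= second_threshold).float()
    conv.weight.mul_(keep.view(-1, 1, 1, 1))
    conv.bias.mul_(keep)

print("zeroed fc weights:", int((fc.weight == 0).sum()))
print("kept channels:", int(keep.sum().item()), "of", conv.out_channels)
```

Structure pruning of whole convolution kernels or neurons follows the same masking pattern, applied at the level of entire filters or rows of a fully connected layer.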
Further, after pruning, the size of the model is further reduced by applying a quantization compression technique, which reduces storage requirements by lowering the representation precision of the parameters, for example by converting floating-point parameters into fixed-point parameters.
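One simple form of such precision reduction is symmetric per-tensor int8 quantization, sketched below as an assumption; production toolchains typically add calibration and per-channel scales.

```python
# Sketch: map float32 weights to int8 with a single symmetric scale per tensor.
import numpy as np

def quantize_int8(weights: np.ndarray):
    max_abs = float(np.abs(weights).max())
    scale = max_abs / 127.0 if max_abs > 0 else 1.0  # avoid a zero scale
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(64, 64).astype(np.float32)
q, scale = quantize_int8(w)
error = float(np.abs(w - dequantize(q, scale)).mean())
print(f"storage: {w.nbytes} -> {q.nbytes} bytes, mean abs error {error:.5f}")
```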
Further, the quantized and compressed model is deployed to the target platform, which uses it to process input data. For example, the compressed model is converted into a form suitable for the target platform: when deploying to a mobile device, the model needs to be converted into a TensorFlow Lite or Core ML format, which not only reduces the model size but also improves the running speed and efficiency of the model on the mobile end. When using the model for inference, input data is fed into the model, which processes the data and computes the predicted output; the input data may be a single sample or a batch of samples.
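As one concrete and merely illustrative deployment path, a Keras model can be converted to TensorFlow Lite with default optimizations and run through the TFLite interpreter; the toy architecture and file name below are assumptions.

```python
# Sketch: convert a small Keras model to TensorFlow Lite and run one inference.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(32,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(4),
])

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enables weight quantization
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)

# Inference with the TFLite interpreter on a single sample.
interpreter = tf.lite.Interpreter(model_content=tflite_model)
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]
interpreter.set_tensor(inp["index"], tf.random.normal([1, 32]).numpy())
interpreter.invoke()
print(interpreter.get_tensor(out["index"]).shape)
```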
According to the model compression method provided by the embodiment of the application, a characteristic model is built based on at least one characteristic parameter of the target platform; feature analysis is performed on the model to be compressed to obtain a feature analysis result; a pruning mode is determined based on the characteristic model and the feature analysis result; the model to be compressed is pruned based on the pruning mode, and quantization compression is performed on the pruned model. By dynamically selecting and applying a suitable compression method according to the resource limits and requirements of the target platform, the compression process becomes more flexible and adaptive, and customized optimization can be performed for different platforms and application scenarios.
Based on the above embodiment, the determining a pruning manner based on the characteristic model and the feature analysis result includes:
step 311, analyzing the characteristic model to determine resource limitation information of the target platform;
step 312, determining the score of each model parameter in the model to be compressed based on the feature analysis result; the score characterizes the importance of the model parameters;
and step 313, determining the pruning mode based on the resource limitation information and the scores of the model parameters.
And comprehensively analyzing the characteristic model of the target platform, including performance indexes, storage capacity, calculation resources and the like of hardware equipment, so as to determine resource limit information required by the pruned model and provide a basis for the subsequent pruning scheme design. Wherein the resource constraint information includes computing resources such as computing power of a processor, the number of parallel processing units, and the like; storage capacity, such as storage space required by the model on the target platform; memory limitations, such as the amount of memory required by the model at run-time; bandwidth limitations, such as the bandwidth required by the model during data transmission.
The score of each model parameter in the model to be compressed is determined based on the feature analysis result; specifically, the gradient value of each model parameter with respect to the loss function is determined based on the feature analysis result, and the score of each parameter is determined from its gradient value. For example, the magnitude of the gradient of the loss function with respect to each parameter in the model to be compressed is computed; the larger the gradient, the greater the parameter's influence on the model, and the higher its score.
For example, the gradient value of each model parameter in the model to be compressed with respect to the loss function is calculated by back propagation, as follows:
1) Forward propagation: the input data is passed forward through the neural network to obtain the model's predicted output.
2) Loss function calculation: the predicted output of the model is compared with the actual labels, and the value of the loss function is computed.
3) Back propagation: starting from the loss function, the gradient value of each model parameter with respect to the loss is calculated using the chain rule. Specifically, for each model parameter the gradient value is computed as follows:
3.1) Initialization: the gradient values of the model parameters are initialized to 0.
3.2) Back propagation: starting from the loss function, gradient values are computed layer by layer along the connections of the neural network.
3.3) Output layer: the gradient values of the output-layer parameters are computed from the gradient of the loss function with respect to the output.
3.4) Hidden layers: according to the chain rule, the gradient value passed down from the previous layer is multiplied by the derivative of the current layer's activation function to obtain the gradient values of the current layer's parameters, and the gradients continue to be propagated backward until the input layer is reached.
3.5) Gradient accumulation: the parameter gradient values of each sample are accumulated to obtain the final gradient values.
Optionally, a norm value of each model parameter in the model to be compressed is determined based on the feature analysis result, and a score of each model parameter is determined based on the norm value. For example, the norm (e.g., L1 norm or L2 norm) of each model parameter is calculated; the larger the norm, the higher the importance of the model parameter and the higher its score.
For example, the L1 norm is the sum of the absolute values of the elements of a vector. For a model parameter (w), its L1 norm is obtained by summing the absolute values of its elements. A vector (v) has a single L1 norm; for a matrix (M), each row or column can be treated as a vector and the L1 norms of those vectors are summed.
The L2 norm is the square root of the sum of the squares of the elements of a vector. For a parameter (w), its L2 norm is obtained by summing the squares of its elements and taking the square root. Likewise, a vector (v) has a single L2 norm; for a matrix (M), each row or column can be treated as a vector and the L2 norms of those vectors are summed.
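The sketch below computes both norms for each parameter tensor of a toy PyTorch model; scoring at per-tensor granularity is an illustrative assumption.

```python
# Sketch: L1 and L2 norms of each parameter tensor as importance scores.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

for name, p in model.named_parameters():
    l1 = p.detach().abs().sum().item()           # sum of absolute values
    l2 = p.detach().pow(2).sum().sqrt().item()   # square root of the sum of squares
    print(f"{name}: L1={l1:.4f}  L2={l2:.4f}")
```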
Optionally, based on the feature analysis result, determining a contribution degree of each model parameter in the model to be compressed in the model output, and based on the contribution degree, determining a score of each model parameter, wherein the higher the contribution degree is, the higher the score is. For example, a gradient input method or a gradient weighting method may be used to calculate the contribution of each model parameter in the model output:
Gradient-input method: the contribution of a model parameter to the model output is calculated as the product of the parameter's gradient and the corresponding input, with the following specific steps: for a parameter (w), forward propagation is first performed to obtain the model's predicted output. The gradient of the loss function with respect to the model output, i.e. the gradient at the output layer, is then calculated. For each parameter (w), the product of its gradient and the corresponding input value is computed; a large gradient multiplied by a large input value indicates that the parameter contributes strongly to the model output.
Gradient-weighting method: the contribution of a model parameter to the model output is calculated as the product of the parameter's gradient and the corresponding weight, with the following specific steps: for a parameter (w), forward propagation is first performed to obtain the model's predicted output. The gradient of the loss function with respect to the model output, i.e. the gradient at the output layer, is then calculated. For each parameter (w), the product of its gradient and the corresponding weight is computed; a large gradient multiplied by a large weight indicates that the parameter contributes strongly to the model output.
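A minimal sketch of the gradient-times-input idea for a single linear layer follows; scoring by the absolute value of the product, and restricting to one layer, are illustrative assumptions.

```python
# Sketch: gradient * input and gradient * weight as contribution scores.
import torch
import torch.nn as nn

layer = nn.Linear(16, 4)
x = torch.randn(8, 16, requires_grad=True)
labels = torch.randint(0, 4, (8,))

loss = nn.CrossEntropyLoss()(layer(x), labels)
loss.backward()

grad_times_weight = (layer.weight.grad * layer.weight.detach()).abs()  # per-weight score
grad_times_input = (x.grad * x.detach()).abs().mean(dim=0)             # per-input-feature score

print("top-3 weight contributions:", grad_times_weight.flatten().topk(3).values)
print("top-3 input contributions:", grad_times_input.topk(3).values)
```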
Further, pruning modes are determined based on the resource limitation information and the scores of the model parameters. For example, assuming that the target platform is a mobile device with limited computing resources, after the model feature analysis, the weight of the convolution kernel a is found to have a smaller influence on the model, in which case a pruning manner based on the convolution kernel may be selected. Alternatively, assuming that the target platform is an embedded system with limited storage capacity, after model feature analysis, the contribution of the channel a in the output result is found to be smaller, and in this case, a pruning manner based on the channel may be selected.
According to the embodiment of the application, the proper parameter pruning scheme is designed according to the characteristic model and the characteristic analysis result so as to meet the resource limit of the target platform and ensure the performance and accuracy of the model.
Based on the above embodiment, the performing quantization compression on the pruned model to be compressed includes:
step 321, storing the pruned model to be compressed by adopting sparse matrix representation, and training the pruned model to be compressed;
step 322, determining a quantization compression mode based on the model architecture or the model complexity of the model to be compressed;
and step 323, performing quantization compression on the trained model to be compressed based on the quantization compression mode.
For the pruned model to be compressed, a sparse matrix representation may be used to store and transfer model parameters. In a neural network model, pruning produces a large number of zero elements, which can be regarded as blank positions in a sparse matrix, so the model parameters can be stored and transferred using a sparse matrix representation. The sparse matrix representation exploits the many zero elements among the parameters for compression, storing only the non-zero elements and their index information, which further reduces the storage space of the model. Besides reducing storage space, the sparse matrix representation can also bring performance advantages during computation: in sparse matrix multiplication, the number of multiplications can be reduced by exploiting the zero elements, thereby accelerating the computation.
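A small sketch of such storage with SciPy's CSR format is given below; the matrix size and the crude magnitude-based pruning used to create zeros are illustrative assumptions.

```python
# Sketch: store a pruned weight matrix in CSR form (non-zeros + index info only).
import numpy as np
from scipy import sparse

dense = np.random.randn(512, 512).astype(np.float32)
dense[np.abs(dense) < 1.3] = 0.0          # crude pruning, roughly 80% zeros

csr = sparse.csr_matrix(dense)
sparse_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes
print(f"non-zeros: {csr.nnz}, dense {dense.nbytes} B vs sparse {sparse_bytes} B")

# Sparse matrix-vector product only touches the stored non-zero entries.
v = np.random.randn(512).astype(np.float32)
y = csr @ v
print(y.shape)
```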
The pruned model may lose some performance, so its performance is recovered by fine-tuning or retraining. Specifically, a loss function of the pruned model to be compressed is determined, and the pruned model is then trained based on this loss function to adjust its model parameters. It can be understood that the loss function measures the difference between the model's predictions and the true labels; during fine-tuning or retraining, the loss function is optimized to adjust the pruned model's parameters so that the model better adapts to the new task or data.
For example, determining a loss function based on task type, such as cross entropy loss function, softmax loss function, etc., assuming the task type is a classification task; assuming that the task type is a regression task, adopting a mean square error loss function; assuming that the task type is a target detection task, bounding box regression loss, classification loss, etc. are employed. And then training the pruned model to be compressed based on the defined loss function to adjust the model parameters of the pruned model to be compressed.
Optionally, during fine-tuning a smaller learning rate is required to avoid over-adjusting the model, since the pruning operation has already removed parameters from it. Meanwhile, a regularization method can be used to control the complexity of the model and prevent overfitting.
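A bare-bones fine-tuning loop along these lines might look as follows; the learning rate, weight-decay regularizer, step count and synthetic data are illustrative assumptions, and the loss would be chosen per task as described above.

```python
# Sketch: fine-tune a (pruned) model with a small learning rate and weight decay.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
inputs, labels = torch.randn(64, 16), torch.randint(0, 4, (64,))

criterion = nn.CrossEntropyLoss()  # classification-task loss, as an example
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, weight_decay=1e-5)

for step in range(100):
    optimizer.zero_grad()
    loss = criterion(model(inputs), labels)
    loss.backward()
    optimizer.step()
    if step % 20 == 0:
        print(f"step {step}: loss {loss.item():.4f}")
```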
A quantization compression mode is determined based on the model architecture or model complexity of the model to be compressed, and the trained model to be compressed is then quantized accordingly. Different model architectures suit different quantization compression modes. For example, for densely connected structures such as convolutional neural networks (CNNs), fixed-point quantization or mixed-precision quantization is more applicable; for sparsely connected structures such as recurrent neural networks (RNNs), matrix-decomposition quantization is more applicable.
The higher the model complexity, the more difficult quantization compression becomes, so a suitable quantization compression method needs to be selected according to the specific model complexity. For example, when using mixed-precision quantization, a division point between high precision and low precision must be chosen; an unsuitable choice may degrade model performance.
Optionally, different application scenarios have different requirements on the accuracy and the size of the model, so that the quantization compression mode needs to be selected according to the specific application scenario. For example, for low power consumption devices or mobile end applications, the requirements of size and calculation amount are high, and fixed point number quantization or mixed precision quantization can be selected; for high performance computing or cloud applications, methods such as floating point number quantization may be selected.
Optionally, when selecting the quantization compression method, experiments and tests may be performed first, and different quantization compression methods may be evaluated and compared, for example, using some evaluation indexes, such as model size, performance, accuracy, etc., to evaluate and compare different compression methods, and finally, the most suitable quantization compression method is selected.
When the embodiment of the application is used for carrying out quantization compression, a proper quantization compression method is selected according to specific conditions, and parameter adjustment and optimization are carried out on the quantization compression method so as to achieve the best compression effect.
For a further analytical description of the model compression method proposed in the present application, reference is made to the following examples.
This embodiment addresses the challenges of deploying and running large-scale complex models on resource-constrained devices or platforms. Through the adaptive compression method, the size and computation of the model can be reduced, saving storage space and computing resources and improving the performance and efficiency of the model in a constrained environment.
The embodiment of the application specifically provides a pruning algorithm-based adaptive algorithm model compression method, wherein the aim of the adaptive algorithm model compression method comprises the following aspects:
resource saving: large-scale complex algorithmic models typically occupy a significant amount of memory and computing resources. On resource-constrained devices and platforms, compressing the model may reduce the size and computational effort of the model, thereby saving storage space and reducing computational costs.
Platform adaptability: different devices and platforms have different hardware configurations and resource limitations. The adaptive algorithm model compression method can select and apply a proper compression method according to the characteristics and requirements of the platform so as to achieve the best compression effect and performance. This ensures that the compressed model can be run efficiently on a specific device and platform.
Performance retention: in the process of model compression, it is important to maintain reasonable model performance. The self-adaptive algorithm model compression method can comprehensively consider the characteristics of the model and the limitation of a target platform, and selects a proper compression method so as to keep the performance of the compressed model as much as possible and avoid the performance degradation of the model caused by excessive compression.
Flexibility and customization: the self-adaptive algorithm model compression method has the characteristics of flexibility and customization. According to the requirements of different platforms and application scenes, different compression methods can be dynamically selected and applied to achieve the best compression effect and performance, so that a customized model compression solution can be provided for various different devices and platforms.
The core idea of the adaptive algorithm model compression method is to prune, fine-tune and quantize the parameters of a complex model according to the resource limits and requirements of the target platform, thereby reducing the size of the algorithm model, its computation and its storage footprint, while ensuring that the model still retains acceptable performance and can run inference on the target platform. The core pruning and quantization methods fall into three subclasses: quantization and binarization, network pruning, and structured matrices. Quantization compresses the original network by reducing the number of bits needed to represent each weight; weights can be quantized to 16-bit, 8-bit, 4-bit or even 1-bit, and 8-bit parameter quantization can achieve substantial acceleration while losing only a small fraction of accuracy. Pruning significantly reduces the storage and computation costs of DNN models by removing connections with little impact, mainly covering weight pruning (removing unimportant connection weights) and neuron pruning (directly removing redundant neurons). Structured parameter matrices can reduce memory cost, and matrix-vector and gradient computations on them can greatly speed up training.
The self-adaptive algorithm model compression method based on the pruning algorithm specifically comprises the following steps:
step 1: target platform information is collected. The system collects key information such as hardware configuration, storage capacity, computing resources and the like of the target platform, and establishes a characteristic model of the target platform.
Step 2: and analyzing the model characteristics. The system performs feature analysis on the model to be compressed, including features of model structure, parameter distribution, hierarchical connection and the like, and needs to evaluate importance of each parameter or layer of the model to determine the parameter or layer with great contribution to the performance of the model.
Step 3: and (5) parameter pruning. Pruning is carried out on parameters in the model to be compressed according to the characteristic model of the target platform and the characteristic analysis result of the model to be compressed. For example, according to the pruning proportion or pruning threshold, the parameters with smaller weights or less importance are set to zero or removed, so as to reduce the size of the model and reduce the calculation amount.
Step 4: sparse matrix representation. For pruned models to be compressed, sparse matrix representations may be used to store and transfer model parameters. The sparse matrix representation uses a large number of zero elements in the parameters to compress, only stores non-zero elements and index information related to the non-zero elements, and further reduces the storage space of the model.
Step 5: fine tuning or retraining. Some performance may be lost in the pruned model to be compressed, so fine tuning or retraining is performed to improve the performance of the model. During fine tuning, the pruned model to be compressed is retrained for a period of time using a smaller learning rate to restore the previous performance level.
Step 6: and (5) quantizing and compressing. After pruning, the size of the model is further reduced by applying quantization compression techniques. Wherein quantization compression reduces storage requirements by reducing the accuracy of representation of parameters, such as converting floating point parameters to fixed point parameters.
Step 7: deployment and reasoning. The model to be compressed which is pruned and quantitatively compressed can be deployed on a target platform for reasoning. In the reasoning process, the compressed model to be compressed is used for processing input data and calculating prediction output.
According to the self-adaptive algorithm model compression method based on the pruning algorithm, which is provided by the embodiment of the application, the proper compression method is dynamically selected and applied according to the resource limit and the demand of the target platform, so that the compression process has higher flexibility and adaptability, and customized optimization can be performed for different platforms and application scenes. By comprehensively considering the characteristics of the model features and the characteristics of the target platform, the system can compress the model and simultaneously maintain reasonable performance, and avoid the performance degradation of the model caused by excessive compression. The size and the calculated amount of the model are effectively reduced, and the storage space and the calculation resources are saved, so that the algorithm model can be better adapted to equipment and platforms with limited resources. The method can be used in combination with different model compression technologies, can be expanded and updated according to new compression methods and platform requirements, and has higher flexibility and expandability.
Fig. 2 is a schematic structural diagram of a model compression device provided in the present application, and referring to fig. 2, an embodiment of the present application provides a model compression device, which includes a characteristic model building module 201, a feature analysis module 202, a pruning manner determining module 203, and a quantization compression module 204.
A characteristic model construction module 201, configured to construct a characteristic model based on at least one characteristic parameter of the target platform;
the feature analysis module 202 is used for carrying out feature analysis on the model to be compressed to obtain a feature analysis result;
the pruning mode determining module 203 is configured to determine a pruning mode based on the characteristic model and the feature analysis result;
and the quantization compression module 204 is configured to prune the model to be compressed based on the pruning mode, and perform quantization compression on the pruned model to be compressed.
The model compression device provided by the embodiment of the application constructs a characteristic model based on at least one characteristic parameter of a target platform; performs feature analysis on the model to be compressed to obtain a feature analysis result; determines a pruning mode based on the characteristic model and the feature analysis result; and prunes the model to be compressed based on the pruning mode and performs quantization compression on the pruned model. By dynamically selecting and applying a suitable compression method according to the resource limits and requirements of the target platform, the compression process becomes more flexible and adaptive, and customized optimization can be performed for different platforms and application scenarios.
In one embodiment, the pruning method determining module 203 specifically includes:
analyzing the characteristic model to determine resource limitation information of the target platform;
determining the score of each model parameter in the model to be compressed based on the feature analysis result; the score characterizes the importance of the model parameters;
and determining the pruning mode based on the resource limit information and the scores of the model parameters.
In one embodiment, the pruning method determining module 203 specifically includes:
determining a gradient value of each model parameter in the model to be compressed for a loss function based on the feature analysis result, and determining a score of each model parameter based on the gradient value; or,
determining a norm value of each model parameter in the model to be compressed based on the feature analysis result, and determining a score of each model parameter based on the norm value; or,
and determining the contribution degree of each model parameter in the model to be compressed in the model output based on the feature analysis result, and determining the score of each model parameter based on the contribution degree.
In one embodiment, the quantization compression module 204 specifically includes:
if the pruning mode is weight threshold pruning, deleting the model parameters with the score smaller than a first threshold;
if the pruning mode is channel pruning, deleting the channel with the weight value smaller than a second threshold value;
if the pruning mode is model structure pruning, deleting a model structure with a weight value smaller than a third threshold value, wherein the model structure at least comprises a convolution layer, a full connection layer and neurons.
In one embodiment, the quantization compression module 204 specifically includes:
adopting a sparse matrix to represent and store the pruned model to be compressed, and training the pruned model to be compressed;
determining a quantization compression mode based on a model architecture or model complexity of the model to be compressed;
and carrying out quantization compression on the trained model to be compressed based on the quantization compression mode.
In one embodiment, the quantization compression module 204 specifically includes:
determining a loss function of the pruned model to be compressed;
and training the pruned model to be compressed based on the loss function to adjust model parameters of the pruned model to be compressed.
In one embodiment, quantization compression module 204 further comprises:
and deploying the quantized and compressed model to the target platform, and processing input data by adopting the quantized and compressed model.
Fig. 3 illustrates a schematic diagram of the physical structure of an electronic device. As shown in fig. 3, the electronic device may include: a processor 310, a communication interface 320, a memory 330 and a communication bus 340, wherein the processor 310, the communication interface 320 and the memory 330 communicate with each other via the communication bus 340. The processor 310 may invoke logic instructions in the memory 330 to perform a model compression method comprising:
constructing a characteristic model based on at least one characteristic parameter of the target platform;
performing feature analysis on the model to be compressed to obtain a feature analysis result;
determining a pruning mode based on the characteristic model and the characteristic analysis result;
and pruning the model to be compressed based on the pruning mode, and performing quantization compression on the pruned model to be compressed.
Further, the logic instructions in the memory 330 described above may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on such understanding, the technical solution of the present application may be embodied essentially, or in the part contributing to the prior art, in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
In another aspect, the present application also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, is implemented to perform a model compression method provided by the above methods, the method comprising:
constructing a characteristic model based on at least one characteristic parameter of the target platform;
performing feature analysis on the model to be compressed to obtain a feature analysis result;
determining a pruning mode based on the characteristic model and the characteristic analysis result;
and pruning the model to be compressed based on the pruning mode, and performing quantization compression on the pruned model to be compressed.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement this without inventive effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and are not limiting thereof; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions.

Claims (10)

1. A method of model compression, comprising:
constructing a characteristic model based on at least one characteristic parameter of the target platform;
performing feature analysis on the model to be compressed to obtain a feature analysis result;
determining a pruning mode based on the characteristic model and the characteristic analysis result;
and pruning the model to be compressed based on the pruning mode, and performing quantization compression on the pruned model to be compressed.
2. The model compression method according to claim 1, wherein the determining a pruning manner based on the characteristic model and the feature analysis result includes:
analyzing the characteristic model to determine resource limitation information of the target platform;
determining the score of each model parameter in the model to be compressed based on the feature analysis result; the score characterizes the importance of the model parameters;
and determining the pruning mode based on the resource limit information and the scores of the model parameters.
3. The method for compressing a model according to claim 2, wherein determining the score of each model parameter in the model to be compressed based on the feature analysis result comprises:
determining a gradient value of each model parameter in the model to be compressed for a loss function based on the feature analysis result, and determining a score of each model parameter based on the gradient value; or,
determining a norm value of each model parameter in the model to be compressed based on the feature analysis result, and determining a score of each model parameter based on the norm value; or,
and determining the contribution degree of each model parameter in the model to be compressed in the model output based on the feature analysis result, and determining the score of each model parameter based on the contribution degree.
4. The method for compressing a model according to claim 2, wherein pruning the model to be compressed based on the pruning method comprises:
if the pruning mode is weight threshold pruning, deleting the model parameters with the score smaller than a first threshold;
if the pruning mode is channel pruning, deleting the channel with the weight value smaller than a second threshold value;
if the pruning mode is model structure pruning, deleting a model structure with a weight value smaller than a third threshold value, wherein the model structure at least comprises a convolution layer, a full connection layer and neurons.
5. The method for compressing models according to claim 1, wherein the performing quantization compression on the pruned model to be compressed includes:
adopting a sparse matrix to represent and store the pruned model to be compressed, and training the pruned model to be compressed;
determining a quantization compression mode based on a model architecture or model complexity of the model to be compressed;
and carrying out quantization compression on the trained model to be compressed based on the quantization compression mode.
6. The method for compressing models according to claim 5, wherein training the pruned model to be compressed comprises:
determining a loss function of the pruned model to be compressed;
and training the pruned model to be compressed based on the loss function to adjust model parameters of the pruned model to be compressed.
7. The method for compressing a model according to claim 1, wherein after pruning the model to be compressed based on the pruning mode and performing quantization compression on the pruned model to be compressed, the method further comprises:
and deploying the quantized and compressed model to the target platform, and processing input data by adopting the quantized and compressed model.
8. A model compression device, characterized by comprising:
the characteristic model construction module is used for constructing a characteristic model based on at least one characteristic parameter of the target platform;
the feature analysis module is used for carrying out feature analysis on the model to be compressed to obtain a feature analysis result;
the pruning mode determining module is used for determining pruning modes based on the characteristic model and the characteristic analysis result;
and the quantization compression module is used for pruning the model to be compressed based on the pruning mode and carrying out quantization compression on the pruned model to be compressed.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the model compression method according to any one of claims 1 to 7 when executing the program.
10. A non-transitory computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the model compression method according to any one of claims 1 to 7.
CN202311615248.4A 2023-11-29 2023-11-29 Model compression method, device, equipment and storage medium Pending CN117787378A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311615248.4A CN117787378A (en) 2023-11-29 2023-11-29 Model compression method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311615248.4A CN117787378A (en) 2023-11-29 2023-11-29 Model compression method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117787378A true CN117787378A (en) 2024-03-29

Family

ID=90380630

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311615248.4A Pending CN117787378A (en) 2023-11-29 2023-11-29 Model compression method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117787378A (en)


Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination