CN113780535A - Model training method and system applied to edge equipment - Google Patents

Model training method and system applied to edge equipment

Info

Publication number
CN113780535A
CN113780535A
Authority
CN
China
Prior art keywords
model
training
parameters
convolutional layer
target model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111137522.2A
Other languages
Chinese (zh)
Inventor
李瑞轩
辜希武
高鑫
李玉华
王号召
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN202111137522.2A priority Critical patent/CN113780535A/en
Publication of CN113780535A publication Critical patent/CN113780535A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061Partitioning or combining of resources
    • G06F9/5072Grid computing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Abstract

The invention discloses a model training method and system applied to edge equipment, belonging to the field of model compression and transfer learning and comprising the following steps: loading a pre-trained original model onto an edge device, and identifying a residual block formed by sequentially connecting a Point-wise convolutional layer, a Depth-wise convolutional layer and a Point-wise convolutional layer; adding a corresponding light architecture on the basis of the residual block to convert the original model into a target model, the light architecture consisting of a Group-wise convolutional layer and a Point-wise convolutional layer; training the target model with a target task data set, and compressing the shared parameters and then the non-shared parameters of the target model in a two-step compression mode, thereby completing the training and compression of the target model; the two-step compression is a conventional two-step compression, a two-step compression based on activation values, or a two-step compression based on compression training. The invention improves model training efficiency on edge devices and reduces the storage space and computational complexity occupied during both training and inference.

Description

Model training method and system applied to edge equipment
Technical Field
The invention belongs to the field of model compression and transfer learning, and particularly relates to a model training method and system applied to edge equipment.
Background
With the proliferation of smart devices, personal intelligent devices sit idle most of the time. Migrating training tasks from the cloud to the edge makes full use of the computing power of edge devices. The amount of data at the edge is much smaller than in the cloud, so the time required to train a model is typically short. Reducing the data that must be uploaded improves training efficiency, and because sensitive personal data need not be uploaded to the cloud, privacy concerns are also well addressed.
Storage and computational resources on edge devices are limited, and effectively training neural network models on resource-constrained devices requires solving two problems. First, hardware development no longer follows Moore's law and has gradually plateaued, while model sizes and data scales keep growing; to reduce the computing load on the cloud, training must move from cloud platforms to edge platforms, yet idle personal devices (edge devices) have far fewer hardware resources than the cloud, so the problem of being unable to train ever-larger models on resource-constrained edge devices must be solved. Second, data on an edge device is poorly distributed and lacks variety, while a good model generally requires more training data, so the problem of poor model performance on edge devices must also be solved.
For the first problem, an existing well-performing model can be compressed using model compression technology: a huge, redundant parameter framework is compressed into a compact framework with few parameters, so that the compressed model can be trained on the edge device. Current model compression algorithms mainly comprise model quantization, model pruning (or model sparsification) and model structure design.
For the second problem, transfer learning is a good solution. Transfer learning helps improve model performance: the model is first trained on a larger data set, then migrated to a smaller data set and retrained (generally only fine-tuning is needed), finally achieving the purpose of improving the model training effect. Migrating a model trained on a large dataset such as ImageNet to a small dataset such as CIFAR-10 is generally more effective than training on the small dataset alone. Transfer learning methods mainly comprise two categories: data-based methods and model-based methods.
Data-based transfer learning focuses on adjusting and transforming the data and then applying the modified data to the model. Model-based transfer learning improves the training effect through control strategies applied in the training stage, mainly comprising model control strategies and parameter control strategies. The main idea of a model control strategy is that a model trained on the source data set can help the target model train on the target data set; a parameter control strategy instead focuses on the parameters of the model.
Therefore, studying how to effectively train neural network models on resource-limited devices, so as to solve the problems that large models train inefficiently or cannot be trained at all (insufficient training memory) on edge devices, is of great significance to the development of artificial intelligence.
Disclosure of Invention
Aiming at the defects and improvement requirements of the prior art, the invention provides a model training method and a model training system applied to edge equipment, and aims to solve the problems that a large model is poor in training efficiency and cannot be trained on the edge equipment with limited resources.
To achieve the above object, according to one aspect of the present invention, there is provided a model training method applied to an edge device, including:
loading a pre-trained original model in edge equipment, and identifying a residual block formed by sequentially connecting a Point-wise convolutional layer, a Depth-wise convolutional layer and a Point-wise convolutional layer; adding a corresponding light framework on the basis of the residual block to convert the original model into a target model; the light framework comprises a Group-wise convolutional layer and a Point-wise convolutional layer which are connected;
training a target model by using a target task data set, and compressing shared parameters and non-shared parameters in the target model in sequence in a two-step compression mode, thereby completing the training and compression of the target model;
the shared parameters are parameters belonging to the original model in the target model, and the unshared parameters are parameters belonging to the light architecture in the target model.
The bottleneck for main-memory usage during training is the intermediate data generated by the model, so reducing the size of the data produced by model computation (i.e., the activation values in the model) is the key to reducing the memory (video memory) actually required for training. After the pre-trained original model is loaded onto the edge device, a light architecture is introduced on the basis of each residual block, specifically a two-layer structure formed by connecting a Group-wise convolutional layer and a Point-wise convolutional layer. The light architecture is a lightweight module that exploits efficient structural design strategies such as grouped convolution and 1 x 1 convolution; compared with the three-layer structure of the residual block, it directly reduces the output data of the middle layer and generates fewer parameters and less intermediate data. By introducing the light architecture into the original model and converting the original model into the target model, the method helps reduce the computing and storage resources required during training and optimizes the training process of the new model on the edge device.
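The saving described above can be illustrated with a rough parameter count. The sketch below is not from the patent: the channel widths, the 6x expansion ratio and the grouping factor are assumed values, chosen only to show why a two-layer Group-wise + Point-wise block generates fewer parameters (and less intermediate data) than a three-layer bottleneck residual block.

```python
def residual_block_params(c_in, c_mid, c_out, k=3):
    """Parameters of a PW -> DW -> PW bottleneck residual block."""
    pw1 = 1 * 1 * c_in * c_mid        # point-wise expansion
    dw = k * k * c_mid                # depth-wise: one k x k filter per channel
    pw2 = 1 * 1 * c_mid * c_out       # point-wise projection
    return pw1 + dw + pw2

def light_block_params(c_in, c_out, k=3, groups=4):
    """Parameters of a GW -> PW two-layer light block (assumed shape)."""
    gw = k * k * (c_in // groups) * c_in   # grouped k x k convolution
    pw = 1 * 1 * c_in * c_out              # point-wise projection
    return gw + pw

# Illustrative case: 64 -> 64 channels, 6x bottleneck expansion, 4 groups.
res = residual_block_params(64, 6 * 64, 64)
light = light_block_params(64, 64)
```

With these illustrative numbers the light block holds roughly a quarter of the parameters of the residual block, and its intermediate activations shrink similarly because the wide middle expansion layer disappears.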
Parameter sharing here refers to the model-based parameter sharing method of transfer learning, which helps a model train quickly on a target data set and improves training performance by reusing learned features. The invention adopts parameter sharing, taking the parameters of the original model within the target model as the shared parameters, so that the model augmented with the light architecture can use the original network to retain the learned features while using the light architecture for task-specific training on the target task; that is, only the light architecture is trained, and the parameter-shared network is frozen to preserve the original features. After parameter sharing, the shared parameters are not updated by gradient back-propagation during training: the parameters of the original network within the model share the parameters of the corresponding network in the source model (i.e., the non-light-architecture part), and the shared network only provides feature extraction and no longer participates in training.
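The freeze-and-train behaviour described above can be sketched in a few lines of plain Python (a hypothetical toy optimizer, not the patent's implementation): shared parameters are skipped during the update, so only the light-architecture parameters move.

```python
def sgd_step(params, grads, frozen, lr=0.1):
    """Apply one SGD update, leaving frozen (shared) parameters untouched."""
    return {
        name: (value if name in frozen else value - lr * grads[name])
        for name, value in params.items()
    }

# Hypothetical names: 'res.*' comes from the original model, 'light.*' is new.
params = {"res.pw1": 1.0, "res.dw": 2.0, "light.gw": 0.5, "light.pw": -0.3}
frozen = {"res.pw1", "res.dw"}            # parameter-shared, frozen network
grads = {name: 1.0 for name in params}    # dummy gradients for illustration
updated = sgd_step(params, grads, frozen)
```

In a real framework the same effect is usually achieved by disabling gradient tracking on the shared layers, so their gradients are never even computed during back-propagation.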
The light-architecture approach essentially adds structure to the network, so the model uses more resources during forward propagation. The idea of two-step compression comes mainly from the parameter sharing method of transfer learning: when a model is migrated from a source task to a target task, the values of the shared parameters are not trained on the target task. The shared parameters only participate in computation as a fixed parameter matrix during the feature extraction stage (i.e., forward propagation), and their gradients are never computed during back-propagation; therefore, with a reasonable compression method, the compressed shared parameters have little influence on the computed activation values, the overall model performance can still reach the pre-compression level, and shift operations can be applied to the compressed parameters to accelerate computation. Once the model is trained, the non-shared full-precision parameters are no longer updated, and at this point the remaining uncompressed parameters (networks) can be compressed in the same way as the shared parameters, further compressing the model. Because the modified model is compressed both in the training stage and in the post-training stage, the storage space and computational complexity occupied during training and during inference after training are greatly reduced.
Generally speaking, the invention introduces a light framework corresponding to the residual block into the pre-trained original model, converts the original model into the target model, takes the parameters belonging to the original model in the target model as shared parameters, does not update in the training process, can help reduce the resources such as computing resources and storage required in the training process, optimizes the training process of the new model on the edge equipment, and improves the training efficiency of the model.
In some alternative embodiments, the two-step compression comprises:
before training the target model, compressing the shared parameters;
and after the training of the target model is finished, compressing the unshared parameters.
This two-step compression is the conventional two-step compression: the shared parameters are compressed before training, and the unshared parameters are compressed only after model training is finished. It is computationally simple and guarantees a certain compression efficiency and model accuracy.
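A minimal sketch of this conventional schedule, using a hypothetical symmetric uniform quantizer (the patent does not fix a particular quantization formula, so the quantizer and the numbers below are illustrative assumptions):

```python
def quantize(weights, bits=8):
    """Snap each weight to the nearest of 2^bits symmetric uniform levels."""
    qmax = 2 ** (bits - 1) - 1                     # e.g. 127 for 8 bits
    scale = max(abs(w) for w in weights) / qmax
    if scale == 0:
        return list(weights)
    return [round(w / scale) * scale for w in weights]

# Step 1: compress the shared (frozen) parameters before training starts.
shared_q = quantize([0.8, -0.31, 0.05])
# ... train only the light-architecture (unshared) parameters ...
# Step 2: compress the unshared parameters once training has finished.
unshared_q = quantize([0.12, -0.9, 0.44])
```

Because the shared parameters are frozen, quantizing them before training never interferes with gradient back-propagation; the unshared parameters stay full-precision until the very end.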
In some alternative embodiments, the two-step compression comprises:
before training the target model, compressing the shared parameters; using the compressed shared parameters to compute the activation values of the parts of the target model that belong to the original model and do not border the light architecture, compressing these activation values, and then passing the compressed activation values into the next layer of the network;
and after the training of the target model is finished, compressing the unshared parameters.
The two-step compression method is an activation value-based two-step compression method, and when the shared parameter is compressed, the activation value is also compressed, so that the compression efficiency of the model can be further improved.
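A toy sketch of this variant (plain Python, with hypothetical one-weight "layers"; the per-tensor quantizer is an assumed scheme, not the patent's): every activation produced inside the frozen shared part is quantized before it reaches the next layer.

```python
def quantize_vec(xs, bits=8):
    """Symmetric uniform quantization of a vector of values."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(x) for x in xs) / qmax
    return list(xs) if scale == 0 else [round(x / scale) * scale for x in xs]

def shared_forward(x, frozen_weights, bits=8):
    """Forward through frozen shared 'layers' (each a single scalar weight
    here); the activation A_k is quantized before layer k+1 consumes it.
    Activations at the light-architecture border would be left untouched."""
    for w in frozen_weights:
        x = quantize_vec([w * xi for xi in x], bits)
    return x

out = shared_forward([1.0, -0.5, 0.25], [2.0, 0.5])
```

Quantizing the activations shrinks exactly the intermediate data that the description identifies as the training-memory bottleneck, at the cost of a small rounding error in the features passed forward.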
In some alternative embodiments, the two-step compression comprises:
before training the target model, compressing the shared parameters;
in each round of training of the target model, firstly compressing the non-shared parameters, and performing forward propagation by using the compressed non-shared parameters so as to correct the uncompressed non-shared parameters by using the compressed non-shared parameters; the gradient is calculated using the unquantized but corrected unshared parameters and propagated backwards to update the unshared parameters.
The two-step compression method is based on compression training, and the two-step compression method compresses the unshared parameters in the training process, so that the precision of the target model after the training is finished can be further improved.
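A toy numeric sketch of this compression-aware training loop (plain Python, a single linear "layer" with a squared-error loss; the quantizer, learning rate and data are illustrative assumptions): the forward pass uses the quantized unshared weights, the full-precision copy is first corrected to the quantized values, and the gradient then updates that corrected copy.

```python
def quantize_vec(xs, bits=8):
    """Symmetric uniform quantization of a vector of values."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(x) for x in xs) / qmax
    return list(xs) if scale == 0 else [round(x / scale) * scale for x in xs]

def training_round(w, x, target, lr=0.01):
    """One round: quantize the unshared weights, forward with the quantized
    copy, correct the full-precision weights to the quantized values, then
    back-propagate on the corrected copy for L = (y - target)^2."""
    w_q = quantize_vec(w)                              # compress before forward
    y = sum(wq * xi for wq, xi in zip(w_q, x))         # forward pass
    grad = [2 * (y - target) * xi for xi in x]         # dL/dw_i
    return [wc - lr * g for wc, g in zip(w_q, grad)]   # update corrected copy

w = [0.0, 0.0]
for _ in range(300):
    w = training_round(w, [1.0, 0.5], target=1.0)
```

Training against the quantized forward pass lets the weights adapt to the rounding error they will suffer after the final compression, which is why this variant tends to preserve accuracy better than quantizing only once at the end.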
Further, in the two-step compression, the compression mode for the shared parameter and/or the unshared parameter is quantization.
Model quantization is used as a model compression algorithm, and storage resources required by the model are reduced by quantizing the model; according to the method, the parameters in the target model are compressed by adopting a quantization method in the two-step compression, so that the calculation complexity of model training can be further reduced while the compression efficiency is ensured.
Further, the light architecture also comprises a down-sampling module before the Group-wise convolutional layer and an up-sampling module after the Point-wise convolutional layer.
The down-sampling module reduces the resolution of the feature map and thus the size of the intermediate data output by the module, and the up-sampling module, matched with it, makes the output dimensionality of the light-architecture layer match the input dimensionality of the next layer.
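The memory effect of the down-sampling module is straightforward arithmetic; the sketch below (an assumed 2x2 pooling and illustrative tensor sizes, not values from the patent) shows how the intermediate activation inside the light block shrinks while the up-sampling restores the spatial size expected by the next layer.

```python
def activation_elements(h, w, channels):
    """Number of elements in an h x w x channels activation tensor."""
    return h * w * channels

def light_block_inner_elements(h, w, channels, pool=2):
    """Activation size between the pooled Group-wise conv and the up-sampler:
    a pool x pool down-sampling divides each spatial dimension by `pool`."""
    return activation_elements(h // pool, w // pool, channels)

outside = activation_elements(32, 32, 64)          # what neighbouring layers see
inside = light_block_inner_elements(32, 32, 64)    # what the light block stores
```

With 2x2 pooling the intermediate activation holds a quarter of the elements, which is precisely the kind of training-time memory reduction the description targets.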
According to another aspect of the present invention, there is provided a model training system applied to an edge device, including: a light architecture combination module and a two-step compression module;
the light framework combined module is used for loading a pre-trained original model in the edge device, identifying a residual block formed by sequentially connecting a Point-wise convolutional layer, a Depth-wise convolutional layer and a Point-wise convolutional layer, and adding a corresponding light framework on the basis of the residual block to convert the original model into a target model; the light framework comprises a Group-wise convolutional layer and a Point-wise convolutional layer which are connected;
the two-step compression module is used for training a target model by using a target task data set and sequentially compressing shared parameters and non-shared parameters in the target model in a two-step compression mode so as to finish the training and compression of the target model; the shared parameter is a parameter belonging to the original model in the target model, and the unshared parameter is a parameter belonging to the light architecture in the target model.
Generally, by the above technical solution conceived by the present invention, the following beneficial effects can be obtained:
(1) according to the method, a two-layer light architecture corresponding to the three-layer residual block is introduced into the pre-trained original model, the original model is converted into a target model, and the parameters belonging to the original model within the target model are used as shared parameters, which helps reduce the computing and storage resources required during training, optimizes the training process of the new model on the edge device, and improves the training efficiency of the model.
(2) According to the method, three different modes of performing two-step compression on the target model are provided according to the quantization objects and quantization manners: the conventional two-step compression method, the two-step compression method based on activation values, and the two-step compression method based on compression training. The conventional two-step compression method compresses only the shared parameters before training and the unshared parameters after model training is finished; it is computationally simple and guarantees a certain compression efficiency and model accuracy. The two-step quantization method based on activation values also quantizes the activation values when compressing the shared parameters, which further improves the compression efficiency of the model. The two-step compression based on compression training compresses the unshared parameters during training, which further improves the accuracy of the target model after training is finished.
Drawings
Fig. 1 is a flowchart of a model training method applied to an edge device according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a light architecture according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating parameter sharing according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating the quantization of a shared parameter in a two-step quantization according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the conventional two-step quantization method provided by an embodiment of the present invention, wherein (a) shows quantizing the shared parameters in the target model, and (b) shows quantizing the non-shared parameters in the target model;
FIG. 6 is a diagram illustrating the two-step quantization based on activation values for the shared parameter and the activation values;
fig. 7 is a schematic diagram of quantizing an unshared parameter in two-step quantization based on quantization training according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
In the present application, the terms "first," "second," and the like (if any) in the description and the drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
In order to solve the problems of poor training efficiency and insufficient training memory for large models on resource-limited edge equipment, the invention provides a model training method and system applied to edge equipment. The overall idea is as follows: a light architecture is added on the basis of a pre-trained original model to convert the original model into a target model; the parameters of the original model are used as shared parameters; and the target model is trained with a target task data set. In this way, transfer learning allows the target model to be better applied to the target scenario, parameter sharing reduces the computing and storage resources required during training, the training process of the new model on the edge device is optimized, and model training efficiency is improved. On this basis, the shared parameters and then the non-shared parameters of the target model are compressed in a two-step compression mode, greatly reducing the storage space and computational complexity occupied by the model both during training and after training is finished.
The model quantization is one of model compression, and the computation complexity is lower than that of other model compression methods such as model pruning and model structure design, so that, without loss of generality, in the following embodiments, the technical solution of the present invention is explained by taking a quantization method as an example of a specific model compression method.
The following are examples.
Example 1:
a model training method applied to an edge device, as shown in fig. 1, includes:
loading a pre-trained original model in edge equipment;
identifying a residual block formed by sequentially connecting a Point-wise convolutional layer, a Depth-wise convolutional layer and a Point-wise convolutional layer; adding a corresponding light framework on the basis of the residual block to convert the original model into a target model; the light framework comprises a Group-wise convolutional layer and a Point-wise convolutional layer which are connected; fig. 2 is a schematic diagram of a light architecture with two layers added on the basis of a three-layer residual block, and as shown in fig. 2, the light architecture added in the original model in this embodiment further includes a down-sampling module before the Group-wise convolutional layer and an up-sampling module after the Point-wise convolutional layer; the down-sampling module can reduce the resolution of the feature map and reduce the size of intermediate data output by the module, and the up-sampling module is matched with the down-sampling module, so that the influence on the input dimensionality of the next layer can be avoided; optionally, in this embodiment, the downsampling module is a pooling layer, and downsampling is implemented using a pooling operation;
taking the parameters belonging to the original model within the target model as the shared parameters, and initializing these shared parameters of the newly generated target model with the pre-trained original model parameter values through parameter sharing, so that the newly generated target model inherits the rich feature extraction capability of the pre-trained original model; initialization typically uses a model loading method, such as the torch.load method in PyTorch; correspondingly, the parameters belonging to the light architecture in the target model are taken as the unshared parameters, as shown in fig. 3;
training the target model with the target task data set, and compressing the shared parameters and then the non-shared parameters of the target model in a two-step compression mode, thereby completing the training and compression of the target model; depending on specific application requirements, the target task executed by the target model can be a pattern recognition task such as face recognition or license plate recognition, an image processing task, a natural language processing task, and the like, and when the target model is trained, the samples in the target task data set are adapted to the target task; by training the target model with the target task data set, the model can be well adapted to executing the target task with only fine-tuning.
In this embodiment, the target model is deeply compressed by two-step quantization, so that the model consumes fewer computing resources while retaining its feature extraction capability. The first quantization of the two-step quantization quantizes the shared parameters; because the shared parameters do not participate in training, this quantization does not affect gradient back-propagation during training, as shown in fig. 4. As an optional implementation, the two-step quantization adopted in this embodiment is the conventional two-step quantization, as shown in fig. 5, and comprises:
before the training of the target model, the shared parameters are compressed, as shown in (a) of fig. 5, wherein W_s and W_n respectively represent a shared parameter and a non-shared parameter, Q_Ws represents the quantized shared parameter, A_k-1 and A_k respectively represent the activation values of the (k-1)-th layer and the k-th layer, and Quantization represents the quantization operation;
after the training of the target model is finished, the unshared parameters are compressed, as shown in (b) of fig. 5, wherein Q_Wn represents the quantized non-shared parameter;
the two-step quantization method shown in fig. 5 is a conventional two-step quantization method, and only quantizes shared parameters and unshared parameters after model training is finished.
Generally, in this embodiment, a light framework corresponding to a residual block is introduced into an original model which is pre-trained, the original model is converted into a target model, parameters in the target model which belong to the original model are used as shared parameters, and the parameters are not updated in the training process, so that the reduction of resources such as computing resources and storage required in the training process can be facilitated, the training process of a new model on edge equipment is optimized, and the model training efficiency is improved. Therefore, the embodiment can effectively solve the problems that the training efficiency of the large model on the edge device with limited resources is poor and the large model cannot be trained.
Example 2:
a model training method applied to an edge device, which is similar to embodiment 1, except that in this embodiment, a mode of performing two-step quantization on a model is specifically a two-step quantization method based on an activation value, and specifically includes:
before the target model is trained, the shared parameters are compressed, and the compressed shared parameters are used to compute the activation values of the parts of the target model that belong to the original model and do not border the light architecture; these activation values are compressed and then passed into the next layer of the network, as shown in fig. 6, wherein Q_Ak represents the quantized activation value A_k;
After the target model training is finished, the non-shared parameters are compressed; this step quantizes the unshared parameters in the same way as the conventional two-step quantization method.
In the two-step quantization method based on activation values adopted by this embodiment, the activation values are quantized together with the shared parameters, which further improves the compression efficiency of the model. Note that the activation values at the border between the shared-parameter part of the network (namely the original model) and the unshared-parameter part (namely the light architecture) contain full-precision activation values generated by the unshared-parameter network, and are therefore not quantized.
Example 3:
a model training method applied to an edge device, which is similar to embodiment 1, but differs in that in this embodiment, a mode of performing two-step quantization on a model is specifically a two-step quantization method based on quantization training, and specifically includes:
before training the target model, the shared parameters are compressed; this step is consistent with the shared-parameter quantization in the conventional two-step quantization: it quantizes the shared parameters, reduces the model size and accelerates the forward-propagation calculation;
in each round of training of the target model, firstly compressing the non-shared parameters, and performing forward propagation by using the compressed non-shared parameters so as to correct the uncompressed non-shared parameters by using the compressed non-shared parameters; the gradient is calculated using the unquantized but corrected unshared parameters and propagated backwards to update the unshared parameters, as shown in fig. 7, wherein "Assignment" indicates that the full-precision (corrected) parameters are recalculated according to the quantized parameters for temporarily participating in the gradient descent process.
The two-step quantization based on quantization training adopted by the embodiment quantizes the unshared parameters in the training process, and can further improve the accuracy of the target model after the training is finished.
Example 4:
a model training system for application to an edge device, comprising: a light architecture combination module and a two-step compression module;
the light framework combined module is used for loading a pre-trained original model in the edge device, identifying a residual block formed by sequentially connecting a Point-wise convolutional layer, a Depth-wise convolutional layer and a Point-wise convolutional layer, and adding a corresponding light framework on the basis of the residual block to convert the original model into a target model; the light framework comprises a Group-wise convolutional layer and a Point-wise convolutional layer which are connected;
the two-step compression module is used to train the target model with a target task data set and to compress, in a two-step compression mode, first the shared parameters and then the unshared parameters in the target model, thereby completing the training and compression of the target model; the shared parameters are the parameters of the target model that belong to the original model, and the unshared parameters are the parameters of the target model that belong to the light architecture;
in this embodiment, the two-step compression mode adopted by the two-step compression module may be any one of the conventional two-step quantization method, the activation-value-based two-step quantization method, and the quantization-training-based two-step quantization method; for the detailed implementation of each module, reference may be made to the description in the method embodiments above, which is not repeated here.
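The relationship between the recognised residual block (Point-wise, Depth-wise, Point-wise) and the added light architecture (Group-wise plus Point-wise) can be sketched at the level of weight counts. The channel width of 64 and the choice of 16 groups for the Group-wise layer are illustrative assumptions, not values fixed by the embodiments:

```python
def conv_params(c_in, c_out, k, groups=1):
    """Weight count of a k x k convolution whose channels are split into `groups`."""
    assert c_in % groups == 0 and c_out % groups == 0
    return (c_in // groups) * (c_out // groups) * k * k * groups

c = 64  # channel width (assumed for illustration)

# Residual block recognised by the combination module: PW -> DW -> PW.
residual_block = (conv_params(c, c, 1)              # Point-wise (1x1)
                  + conv_params(c, c, 3, groups=c)  # Depth-wise (groups == channels)
                  + conv_params(c, c, 1))           # Point-wise (1x1)

# Light architecture branch added on the basis of that block: GW -> PW.
light_branch = (conv_params(c, c, 3, groups=16)     # Group-wise (16 groups, assumed)
                + conv_params(c, c, 1))             # Point-wise (1x1)
```

A Depth-wise layer is the extreme case of a Group-wise layer (one channel per group), so the added branch trades a few more weights per spatial kernel for cross-channel mixing within each group, while the trailing Point-wise layer mixes across groups.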
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (7)

1. A model training method applied to edge equipment is characterized by comprising the following steps:
loading a pre-trained original model on an edge device, and identifying a residual block formed by sequentially connecting a Point-wise convolutional layer, a Depth-wise convolutional layer, and a Point-wise convolutional layer; adding a corresponding light architecture on the basis of the residual block to convert the original model into a target model; the light architecture comprising a Group-wise convolutional layer and a Point-wise convolutional layer which are connected;
training the target model by using a target task data set, and compressing shared parameters and non-shared parameters in the target model in sequence in a two-step compression mode, so as to finish the training and compression of the target model;
the shared parameter is a parameter belonging to the original model in the target model, and the unshared parameter is a parameter belonging to the light architecture in the target model.
2. The model training method applied to the edge device according to claim 1, wherein the two-step compression comprises:
compressing the shared parameters prior to training the target model;
and after the target model is trained, compressing the unshared parameters.
3. The model training method applied to the edge device according to claim 1, wherein the two-step compression comprises:
before the target model is trained, compressing the shared parameters, compressing, with the compressed shared parameters, the activation values of the portions of the target model that belong to the original model and do not border the light architecture, and then passing the compressed activation values to the next network layer;
and after the target model is trained, compressing the unshared parameters.
4. The model training method applied to the edge device according to claim 1, wherein the two-step compression comprises:
compressing the shared parameters prior to training the target model;
in each round of training of the target model, the non-shared parameters are compressed, and the compressed non-shared parameters are used for forward propagation, so that the compressed non-shared parameters are used for correcting the uncompressed non-shared parameters; gradients are calculated using unquantized but corrected unshared parameters and propagated backwards to update the unshared parameters.
5. The model training method applied to the edge device according to any one of claims 1 to 4, wherein in the two-step compression, the compression manner for the shared parameter and/or the unshared parameter is quantization.
6. The model training method applied to the edge device according to any one of claims 1 to 4, wherein the light architecture further comprises a down-sampling module before the Group-wise convolutional layer and an up-sampling module after the Point-wise convolutional layer.
7. A model training system for use with an edge device, comprising: a light architecture combination module and a two-step compression module;
the light architecture combination module is used for loading a pre-trained original model on an edge device, identifying a residual block formed by sequentially connecting a Point-wise convolutional layer, a Depth-wise convolutional layer, and a Point-wise convolutional layer, and adding a corresponding light architecture on the basis of the residual block to convert the original model into a target model; the light architecture comprises a Group-wise convolutional layer and a Point-wise convolutional layer which are connected;
the two-step compression module is used for training the target model by using a target task data set and sequentially compressing shared parameters and non-shared parameters in the target model in a two-step compression mode so as to complete the training and compression of the target model; the shared parameter is a parameter belonging to the original model in the target model, and the unshared parameter is a parameter belonging to the light architecture in the target model.
CN202111137522.2A 2021-09-27 2021-09-27 Model training method and system applied to edge equipment Pending CN113780535A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111137522.2A CN113780535A (en) 2021-09-27 2021-09-27 Model training method and system applied to edge equipment


Publications (1)

Publication Number Publication Date
CN113780535A 2021-12-10

Family

ID=78853856


Country Status (1)

Country Link
CN (1) CN113780535A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106355248A (en) * 2016-08-26 2017-01-25 深圳先进技术研究院 Deep convolution neural network training method and device
CN109961003A (en) * 2018-12-26 2019-07-02 安徽继远软件有限公司 A kind of airborne auxiliary inspection device of embedded transmission line of electricity based on FPGA
WO2019169816A1 (en) * 2018-03-09 2019-09-12 中山大学 Deep neural network for fine recognition of vehicle attributes, and training method thereof
CN110929603A (en) * 2019-11-09 2020-03-27 北京工业大学 Weather image identification method based on lightweight convolutional neural network


Similar Documents

Publication Publication Date Title
CN111079781B (en) Lightweight convolutional neural network image recognition method based on low rank and sparse decomposition
CN111178518A (en) Software and hardware cooperative acceleration method based on FPGA
CN111369430B (en) Mobile terminal portrait intelligent background replacement method based on mobile deep learning engine
CN113379604B (en) Pruning quantization compression method, system and medium for super-resolution network
CN115759237A (en) End-to-end deep neural network model compression and heterogeneous conversion system and method
CN113011571A (en) INT8 offline quantization and integer inference method based on Transformer model
CN114943335A (en) Layer-by-layer optimization method of ternary neural network
US20240095522A1 (en) Neural network generation device, neural network computing device, edge device, neural network control method, and software generation program
CN113792621B (en) FPGA-based target detection accelerator design method
CN113962388A (en) Neural network channel pruning method based on hardware accelerated sensing
CN117151178A (en) FPGA-oriented CNN customized network quantification acceleration method
CN113780535A (en) Model training method and system applied to edge equipment
CN110378466B (en) Neural network difference-based quantization method and system
CN116644783A (en) Model training method, object processing method and device, electronic equipment and medium
CN111797991A (en) Deep network model compression system, method and device
WO2021238734A1 (en) Method for training neural network, and related device
CN115564987A (en) Training method and application of image classification model based on meta-learning
CN115016932A (en) Embedded distributed deep learning model resource elastic scheduling method
CN113919479B (en) Method for extracting data features and related device
Li et al. Designing Efficient Shortcut Architecture for Improving the Accuracy of Fully Quantized Neural Networks Accelerator
CN113033653B (en) Edge-cloud cooperative deep neural network model training method
CN117521752A (en) Neural network acceleration method and system based on FPGA
CN115906917B (en) Neural network model deployment method and device based on model algorithm division
CN111967580B (en) Low-bit neural network training method and system based on feature migration
US20240020887A1 (en) Conditional variational auto-encoder-based online meta-learned image compression

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination