CN114897155A - Integrated model data-free compression method for satellite - Google Patents

Integrated model data-free compression method for satellite

Info

Publication number
CN114897155A
CN114897155A
Authority
CN
China
Prior art keywords
model
compressed
data
satellite
integrated
Prior art date
Legal status
Pending
Application number
CN202210328123.2A
Other languages
Chinese (zh)
Inventor
胡晗
郝志伟
徐冠宇
安建平
Current Assignee
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT
Priority to CN202210328123.2A
Publication of CN114897155A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses an integrated model data-free compression method for a satellite, belonging to the field of deep learning for satellite communication. To address the problems that deploying a neural network model on a satellite consumes large amounts of storage and computing resources and that the original training data cannot be obtained, the invention trains a generator model to synthesize substitute data by establishing a min-max optimization objective, and uses the generated data to compress multiple historical version models at the satellite end into a lightweight model with a multi-branch structure; it then trains an attention model to dynamically aggregate the prediction results of all branches, using the small amount of labeled data available when the satellite end updates its model. The invention compresses the multiple historical version models produced during satellite-side model updating without any training data, greatly reducing the storage space and floating-point operations required by the model at the cost of a small loss in accuracy, and saving precious on-board storage and computing resources.

Description

Integrated model data-free compression method for satellite
Technical Field
The invention relates to a neural network model compression method, in particular to an integrated model data-free compression method for a satellite, and belongs to the field of deep learning for satellite communication.
Background
In recent years, Deep Neural Networks (DNNs) have become one of the most common machine learning models, achieving performance close to or exceeding that of human experts on many tasks in computer vision and natural language processing. With the development of aerospace technology and growing national defense requirements, deploying DNN models on satellites so that intelligent inference tasks are completed at the satellite end is becoming an urgent need. In this scenario, the DNN model is typically trained on the ground and then deployed remotely on the satellite. Since the distribution of the input data usually drifts slowly over time, the satellite must update the model periodically with newly acquired observation data to maintain its performance. The models at different update stages can be regarded as different individuals, and aggregating the predictions of these different models usually yields a significant accuracy improvement. Because this prediction scheme resembles ensemble learning, we refer to the collection of these models as an integrated (ensemble) model. However, the satellite end is usually constrained by limited storage and computing resources and can hardly afford the overhead of running inference with every model on each input, so the integrated model must first be compressed before any subsequent inference.
Common model compression techniques include quantization, pruning, and knowledge distillation (KD); what they have in common is that they all require the original training data. Quantization and pruning require fine-tuning the compressed model on the training data, while KD requires training a small model from scratch on it. On the satellite side, however, the original training data held on the ground is often unavailable because of transmission overhead or the security issues of cross-domain transmission, so these methods are difficult to apply.
Some recent papers propose KD-based data-free model compression methods; however, these methods are designed for the case of a single model to be compressed and cannot be applied to compressing an integrated model in a data-free setting. In addition, a small amount of real labeled data collected when the model is updated on the satellite is available, and a sound method is also needed to exploit this data to improve the performance of the compressed model.
Disclosure of Invention
In view of the fact that the computing and storage resources of the satellite end are limited and it is difficult to run every model on each input, the main purpose of the invention is to provide an integrated model data-free compression method for a satellite: with a prediction-aggregation mechanism for the scenario where only a small amount of data is available, it realizes data-free compression of the integrated model, improves the utilization and inference capability of the model, and saves on-board storage and computing resources.
The purpose of the invention is achieved by the following technical solution:
the invention discloses an integrated model data-free compression method for a satellite. It trains a generator model to synthesize substitute data by establishing a min-max optimization objective, and uses the generated data to compress multiple historical version models at the satellite end into a lightweight model with a multi-branch structure; it then trains an attention model to dynamically aggregate the prediction results of all branches, using the small amount of labeled data available when the satellite end updates its model. The multiple historical version models produced during satellite-side model updating are compressed without any data, greatly reducing the required storage space and floating-point operations at the cost of a small loss in accuracy, and saving precious on-board storage and computing resources.
The invention discloses an integrated model data-free compression method for a satellite, which specifically comprises the following steps:
Step one: preparing the integrated model and the labeled data available at the current stage;
after the satellite end has updated its local model several times, it holds DNN models of several versions together with a small amount of labeled data collected during the updates. These models will jointly serve as the integrated model and be compressed in the subsequent steps, and the labeled data will be used to further improve the performance of the compressed model.
Step two: determining the structure of the compressed model according to the number of integrated models, preserving the performance gain brought by the integrated model;
the compressed model consists of two parts, namely a main network part and a plurality of branch network parts, wherein each branch network is directly connected with the main network. After the satellite terminal counts the number of the models to be compressed, the number of the branch networks is set to be the number, the branch networks are sequentially paired with the models to be compressed one by one, and each branch network is used for learning the feature representation of the corresponding model to be compressed. For input data, firstly, features of the input data are extracted by a main network, then the features are transmitted to each branch network, finally, each branch network gives respective prediction at the tail end of the branch network, and the average of the prediction results is used as the final output of a compressed model. Because the compressed model has a plurality of branch network structures which are isolated from each other, and each branch can learn the feature representation of the model to be compressed corresponding to the branch after training, the model after final compression has the feature representations of all the models to be compressed, and the performance gain brought by the integrated model is reserved.
Step three: training the target model with the data-free integrated model compression method, ensuring that the compressed model retains the performance gain of the integrated model;
given the compressed model, the generator model synthesis data is used at the satellite side as a substitute for the original training data, and the training of the compressed model is completed using these substitute data. The generator model consists of DNN, which is input as a stochastic vector that obeys gaussian distribution, and output as synthesized surrogate data. The compressed model needs to maintain an output result similar to the model to be compressed on the substitute data, specifically, each branch of the compressed model needs to have a similar output to its corresponding model to be compressed, which is expressed as:
Figure BDA0003572143550000021
wherein G is the generator model, S 0 For the backbone network of the compressed model, S n For the nth branch network of the compressed model, T n Is a reaction with S n Corresponding to be compressedThe model is a model of a human body,
Figure BDA0003572143550000022
is a measure of T n And S n The function of the difference between the outputs may be the L-P norm or KL divergence, and z is a random vector that follows a multivariate Gaussian distribution.
In each iteration, the generator model needs to synthesize samples that are as difficult as possible, so as to drive the compressed model to learn the feature representations of the models to be compressed on hard samples. Specifically, hard samples are defined as samples on which the branches of the compressed model and the corresponding models to be compressed produce different outputs, so the objective function of the generator model is defined as:

$$\mathcal{L}_{G} = -\mathbb{E}_{z}\Big[\sum_{n=1}^{N} \mathcal{D}\big(T_n(G(z)),\; S_n(S_0(G(z)))\big)\Big] = -\mathcal{L}_{S}$$

Throughout training, the generator model and the compressed model are trained simultaneously, and the whole forms a min-max optimization problem:

$$\min_{S_0,\{S_n\}}\; \max_{G}\; \mathbb{E}_{z}\Big[\sum_{n=1}^{N} \mathcal{D}\big(T_n(G(z)),\; S_n(S_0(G(z)))\big)\Big]$$
furthermore, in order to make the surrogate samples as similar as possible to the original training data, the output of the normalization layer constraint generator model in the model to be compressed should also be used. Specifically, the generator model should also ensure that the output has a similar mean and variance representation as the original training data at the normalization layer of the model to be compressed:
Figure BDA0003572143550000033
wherein mu n,l (x) And σ n,l (x) Respectively representing the mean and variance of the activation values at the i normalization layer when the n model to be compressed takes x as input,
Figure BDA0003572143550000034
and
Figure BDA0003572143550000035
respectively representing the mean and variance of the activation values at the i normalization layer of the n-th model to be compressed with the original training data as input, and the values are intrinsic parameters of the model to be compressed. After the optimization problem is solved by using random gradient descent, the compression of the integrated model is completed, and the compressed model is ensured to have the performance gain of the integrated model.
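The following sketch computes the branch-alignment term with the KL divergence standing in for $\mathcal{D}$, and the normalization-layer term via forward hooks on the teachers' BatchNorm2d layers. It is a minimal sketch, assuming the `MultiBranchStudent` interface from the step-two sketch; the helper names are illustrative, and the caller is expected to keep the teacher parameters frozen.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def branch_alignment_loss(teachers, student, x):
        """Sum over n of D(T_n(x), S_n(S_0(x))), with D = KL divergence
        (an L-p norm between the outputs would fit the text equally well)."""
        _, branch_outs = student(x)
        loss = x.new_zeros(())
        for t_n, s_n in zip(teachers, branch_outs):
            t_out = t_n(x)  # teacher parameters are frozen by the caller
            loss = loss + F.kl_div(F.log_softmax(s_n, dim=1),
                                   F.softmax(t_out, dim=1),
                                   reduction="batchmean")
        return loss

    def bn_statistics_loss(teacher, x):
        """Penalize the gap between the statistics the synthetic batch x induces
        at each BatchNorm layer and that layer's stored running statistics."""
        terms, hooks = [], []

        def make_hook(bn):
            def hook(module, inputs, output):
                a = inputs[0]
                mu = a.mean(dim=(0, 2, 3))                  # batch mean per channel
                var = a.var(dim=(0, 2, 3), unbiased=False)  # batch variance per channel
                terms.append(torch.norm(mu - bn.running_mean, 2) +
                             torch.norm(var - bn.running_var, 2))
            return hook

        for m in teacher.modules():
            if isinstance(m, nn.BatchNorm2d):
                hooks.append(m.register_forward_hook(make_hook(m)))
        teacher(x)          # populate `terms` through the hooks
        for h in hooks:
            h.remove()
        return torch.stack(terms).sum()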
Step four: training a prediction result aggregation model with the labeled data to further improve accuracy;
the satellite terminal collects a small amount of labeled data in the updating process of the model at all times, and the data are used for training the attention model to dynamically aggregate the prediction results of all branches of the compressed model, so that the accuracy is further improved compared with the method of taking the average as the prediction result. The dynamic aggregation process of branch prediction results based on the attention mechanism is expressed as follows:
Figure BDA0003572143550000036
wherein attn (S (x), q) is the prediction result after aggregation, S (x) is the matrix formed by all branch outputs of the compressed model when x is input, q is the trainable query vector, S (-) is the attention scoring function, and the calculation is generally carried out by using a dot product model, namely S (S (X) n (x),q)=S n (x) T q is calculated. The objective function of the query vector q is expressed as follows:
Figure BDA0003572143550000037
wherein
Figure BDA0003572143550000038
Indicating that a set of training samples in the raw data can be acquired,
Figure BDA0003572143550000039
to representA corresponding set of supervisory signals is set up,
Figure BDA00035721435500000310
to evaluate the loss function of the difference between the aggregated results and the tags. The solution of the query vector is also completed by a random gradient descent algorithm.
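A short sketch of the dot-product aggregation, assuming each branch output $S_n(x)$ is a vector (e.g. class logits) and the query vector $q$ has the same dimension; all names are illustrative.

    import torch

    def attn_aggregate(branch_outs, q):
        """attn(S(x), q) = sum_n softmax(s(S_n(x), q)) * S_n(x), with the
        dot-product score s(S_n(x), q) = S_n(x)^T q."""
        S = torch.stack(branch_outs, dim=1)       # (batch, N, out_dim)
        scores = S.matmul(q)                      # (batch, N) dot-product scores
        w = torch.softmax(scores, dim=1)          # per-sample branch weights
        return (w.unsqueeze(-1) * S).sum(dim=1)   # weighted sum over branches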
Step five: deploying the obtained compressed model at the satellite end to execute the preset intelligent inference tasks, improving the utilization and inference capability of the model and saving on-board storage and computing resources.
Advantageous effects
1. Compared with directly deploying the original DNN models on a satellite, the method compresses the models that must run on the satellite even though the original training data is hard to obtain, greatly saving the satellite's limited storage and computing resources;
2. The integrated model data-free compression method for a satellite disclosed by the invention yields a compressed model with a multi-branch, ensemble-like structure. Compared with compressing each model to be compressed into a separate small model, it achieves a higher compression rate; compared with compressing the integrated model into a single-branch model, it preserves as much as possible the performance gain brought by the diverse feature representations in the integrated model while shrinking its size, achieving a higher compression rate with a smaller loss in accuracy;
3. Compared with traditional aggregation methods for integrated model predictions such as averaging or voting, the branch-aggregation weights in the method adapt to each input sample, so the aggregated result has higher prediction accuracy.
Drawings
Fig. 1 is a schematic flowchart of an integrated model data-free compression method for a satellite according to the present invention.
FIG. 2 is an example of a compressed model structure with multiple branches in an embodiment of the invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and embodiments, together with the technical problems solved and the advantages obtained. It should be noted that the described embodiments are intended only to facilitate understanding of the present invention and do not limit it in any way.
Example 1
This embodiment describes the application of the disclosed integrated model data-free compression method for a satellite in a scenario where several historically updated models exist at the satellite end and the original training data is unavailable. The specific implementation steps are as follows:
Step one, preparing the available integrated models and labeled data;
the satellite terminal updates the model with a plurality of historical versions through the past model, and in the embodiment, 3 ResNet34 models trained on the CIFAR100 data set by different random initializations are used as the plurality of historical version models. In training, each ResNet34 model was trained 200 rounds on a CIFAR100 dataset with the batch size set to 256 and the model parameters optimized by a batch stochastic gradient descent with momentum method. Setting the initial learning rate to 0.1, and attenuating the learning rate by 10 times after the 80 th iteration and the 120 th iteration respectively, setting the momentum coefficient to 0.9, setting the weight attenuation coefficient to 5 multiplied by 10 -4 . Marking the trained models as T respectively n Where n ∈ {1, 2, 3 }.
The available labeled data is simulated by randomly drawing 1% of the samples in the CIFAR100 dataset; denote this data set as $\mathcal{X}$ and the corresponding set of supervision signals as $\mathcal{Y}$.
Step two, determining a compressed model structure;
each branching network in this embodiment is comprised of a series of separable convolution modules. The compressed model structure comprises 1 main network and a plurality of branch networks, and the number of the branch networks is equal to the number of the available integrated models at the satellite end. The backbone network uses a ResNet18 model, and is divided into 4 modules at the point where the feature diagram size changes; each branch network is also divided into 4 modules, wherein each module is composed of two groups of separable convolution layers-two-dimensional batch normalization layer-ReLU activation function. The output of the mth module of the backbone network is connected to the input of the mth module of each of the branch networks, where m is {1, 2, 3, 4 }. For the input data, the backbone network first extracts 4 sets of features at different levels from it, and then these features are fed into each of the branch networks. For the mth module of the nth branch network, when n is 1, it only takes the output of the backbone network module 1 as input; when n ≠ 1, which takes as input both the output of the m-1 th module of the nth branch network and the output of the backbone network module m, these two sets of input data are first spliced together in the channel dimension and then fed into the mth module of the nth branch network. Fig. 2 shows the structure of the compressed model with multiple branches used in the present embodiment.
Step three, training a target model by using a data-free integrated model compression method;
this step entails first synthesizing surrogate data using the generator model, and then simultaneously training the compressed model in an antagonistic manner based on these surrogate data, with the training of the generator model and the compressed model alternating in an antagonistic manner. The training algorithm is as follows:
[Algorithm 1: alternating adversarial training of the generator model and the compressed model; rendered as an image in the original publication.]
referring to the above algorithm, the parameters of the generator model and the compressed model were optimized using a batch stochastic gradient descent with momentum method with a total number of iterations set to 15000, a batch size set to 256, an initial learning rate set to 0.1, at 5000 th and 10 th, respectivelyAfter 000 iterations, the learning rate is attenuated by 10 times, the momentum coefficient is set to 0.9, and the weight attenuation coefficient is set to 5 multiplied by 10 -4
Step four, training the prediction result aggregation model with the labeled data;
this step trains a query vector using the available label data, aggregates the prediction results using the query vector and the attention mechanism. The training algorithm is as follows:
[Algorithm 2: training the query vector for attention-based aggregation; rendered as an image in the original publication.]
reference is made to the algorithm described above. The query vector was trained by Adam optimizer for 30 rounds on available annotation data with batch size set to 128, learning rate set to 0.001, and first and second order momentum coefficients set to 0.9 and 0.999, respectively.
Step five, deploying the compressed model at the satellite end for executing prediction tasks.
In this embodiment, when the number of available integration models is 3, the following results are obtained:
[Table: comparison of the original integrated model, the compressed model of this method, and a knowledge-distillation baseline that uses the original training data, in terms of parameter count, floating-point operations and accuracy; rendered as an image in the original publication.]
it can be seen that, in the case that the original training data is unavailable, the compressed model has reduced parameters by 72% compared with the original integrated model, and the number of floating point operations required is reduced by 82%, at the cost of only requiring less than 3% precision loss; compared with a knowledge distillation method which needs an original training data compression model, the integrated model data-free compression method provided by the invention can also realize the precision similar to that of the original training data compression model at the cost of only increasing a small number of model parameters and floating point operation amount.
Therefore, the integrated model data-free compression method for a satellite disclosed by the invention can substantially compress the models to run on a satellite whose original training data is hard to obtain, at the cost of a small loss in accuracy, achieving performance similar to traditional, data-dependent model compression methods. Compared with directly deploying the original DNN models on the satellite, the model compressed by the method requires less storage space and fewer floating-point operations, greatly saving the satellite's limited storage and computing resources.
The above detailed description further explains the objects, technical solutions and advantages of the present invention. It should be understood that the above is only a specific embodiment of the present invention and is not intended to limit its scope; any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention shall fall within its protection scope.

Claims (5)

1. An integrated model data-free compression method for a satellite, characterized by comprising the following steps:
Step one: preparing the integrated model and the labeled data available at the current stage;
Step two: determining the structure of the compressed model according to the number of integrated models, preserving the performance gain brought by the integrated model;
Step three: training the target model with the data-free integrated model compression method, ensuring that the compressed model retains the performance gain of the integrated model;
Step four: training a prediction result aggregation model with the labeled data to further improve accuracy;
Step five: deploying the obtained compressed model at the satellite end to execute the preset intelligent inference tasks, improving the utilization and inference capability of the model and saving on-board storage and computing resources.
2. The integrated model data-free compression method for a satellite according to claim 1, characterized in that step one is implemented as follows:
after the satellite end has updated its local model several times, it holds DNN models of several versions together with a small amount of labeled data collected during the updates; these models jointly serve as the integrated model to be compressed in the subsequent steps, and the labeled data is used to further improve the performance of the compressed model.
3. The integrated model data-free compression method for a satellite according to claim 1, characterized in that step two is implemented as follows:
the compressed model consists of two parts, a backbone network and several branch networks, each branch network being directly connected to the backbone; after the satellite end counts the models to be compressed, it sets the number of branch networks to that count and pairs the branches one-to-one with the models to be compressed, each branch learning the feature representation of its corresponding model; for an input sample, the backbone first extracts features, which are then passed to each branch network; each branch produces its own prediction at its end, and the average of these predictions is taken as the final output of the compressed model; because the compressed model contains several mutually isolated branch networks and, after training, each branch has learned the feature representation of its corresponding model to be compressed, the final compressed model carries the feature representations of all the models to be compressed and retains the performance gain brought by the integrated model.
4. The integrated model data-free compression method for a satellite according to claim 1, characterized in that step three is implemented as follows:
given the compressed model, a generator model is used at the satellite end to synthesize data as a substitute for the original training data, and the training of the compressed model is completed with these substitute data; the generator model is a DNN whose input is a random vector obeying a Gaussian distribution and whose output is the synthesized substitute data; the compressed model must keep its outputs on the substitute data close to those of the models to be compressed, and specifically each branch of the compressed model must produce outputs similar to those of its corresponding model to be compressed, expressed as:

$$\mathcal{L}_{S} = \mathbb{E}_{z}\Big[\sum_{n=1}^{N} \mathcal{D}\big(T_n(G(z)),\; S_n(S_0(G(z)))\big)\Big]$$

where $G$ is the generator model, $S_0$ is the backbone network of the compressed model, $S_n$ is the $n$-th branch network of the compressed model, $T_n$ is the model to be compressed corresponding to $S_n$, $\mathcal{D}$ is a function measuring the difference between the outputs of $T_n$ and $S_n$ (an $L_p$ norm or the KL divergence may be used), and $z$ is a random vector obeying a multivariate Gaussian distribution;
in each iteration, the generator model needs to synthesize samples that are as difficult as possible, so as to drive the compressed model to learn the feature representations of the models to be compressed on hard samples; specifically, hard samples are defined as samples on which the branches of the compressed model and the corresponding models to be compressed produce different outputs, so the objective function of the generator model is defined as:

$$\mathcal{L}_{G} = -\mathcal{L}_{S}$$

throughout training, the generator model and the compressed model are trained simultaneously, and the whole forms a min-max optimization problem:

$$\min_{S_0,\{S_n\}}\; \max_{G}\; \mathbb{E}_{z}\Big[\sum_{n=1}^{N} \mathcal{D}\big(T_n(G(z)),\; S_n(S_0(G(z)))\big)\Big]$$

furthermore, to make the substitute samples as similar as possible to the original training data, the normalization layers in the models to be compressed should also be used to constrain the generator's output; specifically, the generator model should ensure that its output exhibits, at each normalization layer of the models to be compressed, a mean and variance similar to those of the original training data, expressed as:

$$\mathcal{L}_{BN} = \sum_{n=1}^{N}\sum_{l} \Big( \big\| \mu_{n,l}(G(z)) - \hat{\mu}_{n,l} \big\|_2 + \big\| \sigma^2_{n,l}(G(z)) - \hat{\sigma}^2_{n,l} \big\|_2 \Big)$$

where $\mu_{n,l}(x)$ and $\sigma^2_{n,l}(x)$ denote the mean and variance of the activations at the $l$-th normalization layer of the $n$-th model to be compressed when $x$ is its input, and $\hat{\mu}_{n,l}$ and $\hat{\sigma}^2_{n,l}$ denote the mean and variance of the activations at the $l$-th normalization layer of the $n$-th model to be compressed when the original training data is the input, the latter being intrinsic parameters of the model to be compressed; after the optimization problem is solved with stochastic gradient descent, the compression of the integrated model is complete, and the compressed model is guaranteed to retain the performance gain of the integrated model.
5. The integrated model data-free compression method for a satellite according to claim 1, characterized in that step four is implemented as follows:
the satellite end collects a small amount of labeled data during each model update, and this data is used to train an attention model that dynamically aggregates the prediction results of all branches of the compressed model, further improving accuracy over taking the average as the prediction; the attention-based dynamic aggregation of branch predictions is expressed as:

$$\mathrm{attn}(S(x), q) = \sum_{n=1}^{N} \mathrm{softmax}_n\big(s(S_n(x), q)\big)\, S_n(x)$$

where $\mathrm{attn}(S(x), q)$ is the aggregated prediction, $S(x)$ is the matrix formed by all branch outputs of the compressed model for input $x$, $q$ is a trainable query vector, and $s(\cdot)$ is the attention scoring function, usually computed with the dot-product model $s(S_n(x), q) = S_n(x)^{\mathsf T} q$; the objective function of the query vector $q$ is:

$$\min_{q}\; \mathbb{E}_{(x,y)\sim(\mathcal{X},\mathcal{Y})}\Big[\ell\big(\mathrm{attn}(S(x), q),\; y\big)\Big]$$

where $\mathcal{X}$ denotes the set of available training samples from the original data, $\mathcal{Y}$ denotes the corresponding set of supervision signals, and $\ell$ is the loss function evaluating the difference between the aggregated result and the label; the solution of the query vector is likewise completed by a stochastic gradient descent algorithm.
CN202210328123.2A 2022-03-30 2022-03-30 Integrated model data-free compression method for satellite Pending CN114897155A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210328123.2A CN114897155A (en) 2022-03-30 2022-03-30 Integrated model data-free compression method for satellite

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210328123.2A CN114897155A (en) 2022-03-30 2022-03-30 Integrated model data-free compression method for satellite

Publications (1)

Publication Number Publication Date
CN114897155A true CN114897155A (en) 2022-08-12

Family

ID=82716189

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210328123.2A Pending CN114897155A (en) 2022-03-30 2022-03-30 Integrated model data-free compression method for satellite

Country Status (1)

Country Link
CN (1) CN114897155A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113190344A (en) * 2021-03-26 2021-07-30 中国科学院软件研究所 Method and device for dynamic reconfiguration and deployment of neural network for software-defined satellite
CN115994590A (en) * 2023-03-23 2023-04-21 浪潮电子信息产业股份有限公司 Data processing method, system, equipment and storage medium based on distributed cluster
CN115994590B (en) * 2023-03-23 2023-07-14 浪潮电子信息产业股份有限公司 Data processing method, system, equipment and storage medium based on distributed cluster
CN117311998A (en) * 2023-11-30 2023-12-29 卓世未来(天津)科技有限公司 Large model deployment method and system
CN117311998B (en) * 2023-11-30 2024-03-05 卓世未来(天津)科技有限公司 Large model deployment method and system

Similar Documents

Publication Publication Date Title
CN114897155A (en) Integrated model data-free compression method for satellite
CN112364779A (en) Underwater sound target identification method based on signal processing and deep-shallow network multi-model fusion
CN112819136A (en) Time sequence prediction method and system based on CNN-LSTM neural network model and ARIMA model
CN110969251A (en) Neural network model quantification method and device based on label-free data
CN113988449B (en) Wind power prediction method based on transducer model
CN114157539B (en) Data-aware dual-drive modulation intelligent identification method
CN116362325A (en) Electric power image recognition model lightweight application method based on model compression
CN113886460A (en) Low-bandwidth distributed deep learning method
CN115829027A (en) Comparative learning-based federated learning sparse training method and system
CN112766360A (en) Time sequence classification method and system based on time sequence bidimensionalization and width learning
CN106526565A (en) Single-bit spatial spectrum estimation method based on support vector machine
CN114494783A (en) Pre-training method based on dynamic graph neural network
CN113627685A (en) Wind power generator power prediction method considering wind power on-grid load limitation
CN113313198A (en) Cutter wear prediction method based on multi-scale convolution neural network
CN116563587A (en) Method and system for embedded clustering of depth of graph convolution structure based on slimed-Wasserstein distance
CN111797991A (en) Deep network model compression system, method and device
CN113703482B (en) Task planning method based on simplified attention network in large-scale unmanned aerial vehicle cluster
CN114004353A (en) Optical neural network chip construction method and system for reducing number of optical devices
CN112885378B (en) Speech emotion recognition method and device and storage medium
CN115062754A (en) Radar target identification method based on optimized capsule
CN114091668A (en) Neural network pruning method and system based on micro-decision maker and knowledge distillation
CN114595816A (en) Neural network model training method based on edge calculation
CN114549962A (en) Garden plant leaf disease classification method
CN113962262A (en) Radar signal intelligent sorting method based on continuous learning
CN106772223A (en) A kind of single-bit Estimation of Spatial Spectrum method that logic-based is returned

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination