CN115035595A - 3D model compression method based on spatio-temporal information transfer knowledge distillation technology - Google Patents

3D model compression method based on spatio-temporal information transfer knowledge distillation technology

Info

Publication number
CN115035595A
Authority
CN
China
Prior art keywords: model, layer, distillation, resnet, characteristic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210624609.0A
Other languages
Chinese (zh)
Inventor
郭竞
康金龙
梁伟
姚丽娜
张靖
史哲
朱文娟
卫毅
韩枫
许鹏飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwest University
Original Assignee
Northwest University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwest University filed Critical Northwest University
Priority to CN202210624609.0A priority Critical patent/CN115035595A/en
Publication of CN115035595A publication Critical patent/CN115035595A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/7715 Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items

Abstract

The invention discloses a 3D model compression method based on a spatio-temporal information transfer knowledge distillation technology. The method adopts a multi-layer feature distillation module (MFDM) as the core element of a 3D model to compress a behavior recognition network. A public data set containing various action classes and many video segments is used as the experimental data set, and frames extracted at equal intervals from every behavior video segment in the data set are used respectively as the input of the teacher model and the student model in the spatio-temporal feature distillation method for feature extraction. The features of each layer are extracted, a multi-level feature distillation algorithm is applied to them, and the multi-layer spatio-temporal feature transfer loss is calculated. The last-layer features are fed into the classifier to obtain logistic regression probabilities, and loss functions are calculated from these probabilities together with the soft labels generated by the teacher model and the true labels. Finally, all parameters of the student model are updated by back-propagation according to the overall loss function, and the parameters of the multi-layer feature distillation module are updated.

Description

3D model compression method based on spatio-temporal information transfer knowledge distillation technology
Technical Field
The invention belongs to the field of model compression, and particularly relates to a 3D model compression method based on a spatio-temporal information transfer knowledge distillation technology.
Background
Behavior recognition is an active research direction in computer vision. Most behavior recognition models work on video data, and a higher recognition rate generally requires a more complex model, which in turn lowers training efficiency; this is a common problem in deep learning. Algorithms developed in the laboratory increasingly pursue accuracy, so training efficiency and model complexity matter less there. In practical applications, however, when a model must be deployed on a device, the computational and storage resources of the device are limited, for example in automatic driving and intelligent robots, which place strict requirements on both the accuracy and the computational cost of the model. The low training efficiency and high space complexity of highly complex behavior recognition models are therefore the main problems to be addressed.
Compared with image-based algorithm models, video-based models are more complex, because video has an additional temporal dimension relative to images and, at the same resolution, the data volume of video is far higher than that of images. Existing knowledge distillation algorithms, including KD and FitNets, target two-dimensional convolutional feature extraction networks such as VGG and ResNet; however, the parameter count and module complexity of a three-dimensional feature extraction network are higher than those of a two-dimensional one, so compression is needed even more. Meanwhile, three-dimensional feature extraction networks are widely used not only in behavior recognition but also in recent three-dimensional object detection and tracking: when the input data is a point cloud, feature extraction with three-dimensional convolution has become a common approach, and the most widespread application scenario of three-dimensional object detection is automatic driving. For practical applications, a small, accurate model is far easier to bring into real products.
Disclosure of Invention
Aiming at the defects existing in the prior art, the invention aims to provide a 3D model compression method based on the spatio-temporal information transfer knowledge distillation technology.
In order to realize the task, the invention adopts the following technical solution:
A 3D model compression method based on spatio-temporal information transfer knowledge distillation technology, characterized in that a multi-layer feature distillation module (MFDM) is used as the core element of a 3D model to compress a behavior recognition network, specifically comprising the following steps:
S1: a public data set containing various action classes and many video segments is used as the experimental data set, and frames extracted at equal intervals from every behavior video segment in the data set are used respectively as the input of the teacher model and the student model in the spatio-temporal feature distillation method for feature extraction;
S2: the features of each layer are extracted, a multi-level feature distillation algorithm is applied to the features of each layer, and the multi-layer spatio-temporal feature transfer loss is calculated;
S3: the features of the last layer are fed into the classifier to obtain logistic regression probabilities, and loss functions are calculated from these probabilities together with the soft labels generated by the teacher model and the true labels;
S4: finally, all parameters of the student model are updated by back-propagation according to the overall loss function, and the parameters of the multi-layer feature distillation module are updated.
Specifically, 3D Resnet-18, 3D Resnet-34 and 3D Resnet-50 are used respectively as backbones of the student network to extract features; 3D Resnet-34, 3D Resnet-50 and 3D Resnet-101 are used respectively as backbones of the teacher model to extract features.
Further, the multi-layer feature distillation module considers both the transfer of spatial features and the transfer of temporal features, and employs a multi-layer context loss to control the transfer of features, referred to as the multi-layer spatio-temporal feature transfer loss.
The 3D model compression method based on the spatio-temporal information transfer knowledge distillation technology compresses a behavior recognition network. The multi-layer feature distillation module splits the features in time and space and compares them separately, so that the output features of each layer of the student model are brought as close as possible to the output features of the corresponding layer of the teacher model and the features are transferred more effectively. A spatio-temporal feature transfer loss function is adopted, and the parameters of the student model are updated by back-propagation together with the soft-label loss and the classifier loss. This realizes model compression while improving the training efficiency and recognition accuracy of the model, promotes the trend of models toward lightweight designs, and accelerates the application of behavior recognition algorithms in real life.
Drawings
FIG. 1 is a schematic diagram of a 3D network structure adopted by a 3D model compression method of the spatiotemporal information transfer knowledge distillation technology.
FIG. 2 is a schematic diagram of a multi-layer characteristic distillation module.
The invention will be described in further detail with reference to the figures and examples.
Detailed Description
The design idea of this application is as follows. Behavior recognition is an active direction in computer vision, and most behavior recognition models work on video data; a higher recognition rate requires a more complex model, which in turn lowers training efficiency, a problem common in deep learning. In practical applications, when a model must be deployed on a device, the computational and storage resources of the device are limited, for example in automatic driving and intelligent robots, which place strict requirements on both the accuracy and the computational cost of the model. Highly complex behavior recognition models therefore suffer from low training efficiency and high space complexity.
Secondly, since video has one more time dimension than images, compressing a model that processes video requires not only learning the spatial features but also considering the retention of the temporal features. The behavior recognition model uses a three-dimensional convolutional network as its feature extraction network. Unlike conventional knowledge distillation methods, which consider only the transfer of spatial features when the data are images, for video the module considers both the transfer of spatial features and the transfer of temporal features. For these reasons, this application employs a network framework built on multi-layer feature distillation modules.
Moreover, in knowledge distillation for behavior recognition, information transfer requires extracting the temporal feature extraction capability of the teacher model, a task related to, but distinct from, the extraction of temporal features from video in behavior recognition itself.
Therefore, drawing on the idea of the Temporal Excitation and Aggregation (TEA) mechanism, the behavior recognition network is compressed, specifically with a knowledge distillation method, promoting the trend of models toward lightweight designs and accelerating the application of behavior recognition algorithms in real life.
The features output by a layer of the teacher model are used as a standard and compared with the features output by the corresponding layer of the student model during training. The smaller the difference between the output features, the closer the feature extraction capability of the student model at that layer is to that of the teacher model, so that spatio-temporal information at different scales is transferred.
This embodiment provides a 3D model compression method based on a spatio-temporal information transfer knowledge distillation technology, which adopts a multi-layer feature distillation module (MFDM) as the core element of a 3D model to compress a behavior recognition network, and specifically includes the following steps:
S1: the UCF101 public data set, containing various action classes and many video segments, is used as the experimental data set, and frames extracted at equal intervals from every video segment in the data set are used respectively as the input of the teacher model and the student model in the spatio-temporal feature distillation method;
S2: the features of each layer are extracted, a multi-level feature distillation algorithm is applied to the features of each layer, and the multi-layer spatio-temporal feature transfer loss is calculated;
S3: the features of the last layer are fed into the classifier to obtain logistic regression probabilities, and loss functions are calculated from these probabilities together with the soft labels generated by the teacher model and the true labels;
S4: finally, all parameters of the student model are updated by back-propagation according to the overall loss function, and the parameters of the multi-layer feature distillation module are updated.
In this embodiment, the multi-layer feature distillation module (MFDM) considers both the transfer of spatial features and the transfer of temporal features, and controls the transfer of features with a multi-layer context loss, called the multi-layer spatio-temporal feature transfer loss (STTL). A schematic diagram of the multi-layer feature distillation module is shown in Fig. 2.
The following is a specific implementation process:
s1: and acquiring an experimental data set, and preprocessing the data.
The public data set UCF101 is used as the experimental data set; UCF101 focuses on human actions in video. It consists of 101 action classes with a total duration of about 27 hours, comprising 13320 action videos. All data are real videos uploaded by various users, so part of the data is affected by camera motion and cluttered backgrounds. The actions fall mainly into 5 categories: human-object interaction, body motion, human-human interaction, playing musical instruments, and sports.
S1.1: firstly, frame sampling is carried out on a video band, and due to different video lengths, the method uses a specific frame as the length of a default whole video and uses the specific video frame to represent the whole video as an input frame of a network.
S1.2: the number of the behavior categories of the training set videos is 101, the official training set is composed of 9537 sections of videos, the official training set is divided again in the experiment, 9337 sections of videos are divided in the training set, and 200 sections of videos are divided in the verification set. The test set has official split01 as the test set, and the number of videos is 3783.
S2: and extracting the characteristics of each layer, performing a multilevel characteristic distillation algorithm on the characteristics of each layer, and calculating the transfer loss of the multilayer space-time characteristics.
S2.1: conv of the three-dimensional space-time feature learning module represents that convolution operation is adopted, and convolution with a specific step is applied to each feature map. A group of super parameters alpha, beta, gamma and mu are set, and the feature map is converted into alpha T multiplied by beta H multiplied by gamma W and mu C through three-dimensional convolution. And setting the convolution size of the ith layer of the student model as W multiplied by H multiplied by T, and converting the output characteristics of the student model and the teacher model into alpha multiplied by beta H multiplied by gamma W and mu C through convolution with specific steps by C. The 3D network model is schematically shown in fig. 1.
S2.2: the 3D Resnet-18, 3D Resnet-34 and 3D Resnet-50 are respectively used as backsbones of the student network to extract features. Subscript S represents student model, video data X is input into student model S, let Y s (x) wherein Y s And representing the feature diagram of the student model finally output by the backbone network. Since the entire network S can be divided into different parts, where S i Characteristic Y representing the output of the i-layer network, and therefore of the backbone network s Can also be represented by Y s =S n …Δ…S i Δ…S 2 ΔS 1 (X). Wherein, Y s Output feature graph representing a student model, S i ΔS i-1 Denotes S i (S i-1 (X)), S is obtained by using the characteristic diagram of the i-1 layer output as the input of the i layer i The output characteristic map of (1).
S2.3: the characteristics are extracted by taking 3D Resnet-34, 3D Resnet-50 and 3D Resnet-101 as backsbones of the teacher model respectively. The teacher model is also denoted by the subscript t as the teacher model, similarly to the student model.
S2.4: feature extraction capability in the teacher model is effectively transferred to the student model. Since knowledge migration is required for the middle feature extraction layer, the i-th layer network S can be divided into i Is extracted and is set as F i The output characteristic matrix of the middle layer of the whole backbone network can be set as F ═ (F ═ F) 1 ,F 2 ,…,F i ,…,F n )。
S2.5: monolayer knowledge distillation can be expressed as:
Figure BDA0003676387930000061
Figure BDA0003676387930000062
the characteristic diagram of the i layer of the student model is put into the same characteristic space,
Figure BDA0003676387930000063
and (4) placing the feature diagram of the i layer of the teacher model into the same feature space. Since L of each layer is composed of a plurality of L2,
Figure BDA0003676387930000064
d represents the distance function of the two feature maps. Where M is the number representing the placement of features into the same feature space. The multilayer knowledge distillation adopted by the method can be expressed as:
Figure BDA0003676387930000065
wherein I represents the distillation is requiredThe set of layer numbers corresponds to the loss function of the entire multi-layer signature distillation module.
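A sketch of the multi-layer spatio-temporal feature transfer loss follows. Using adapter modules such as the FeatureAdapter above for $u_m(\cdot)$ and an MSE distance for $D$ are assumptions consistent with the formulas; the exact distance and mappings are not fixed here:

```python
import torch.nn.functional as F

def sttl_loss(student_feats, teacher_feats, student_adapters, teacher_adapters):
    """Sum over the distilled layers i in I: map the student's and teacher's
    i-th feature maps into the common space and compare them with an L2
    (MSE) distance; the teacher side is detached so only the student and
    adapter parameters receive gradients."""
    total = 0.0
    for f_s, f_t, u_s, u_t in zip(student_feats, teacher_feats,
                                  student_adapters, teacher_adapters):
        total = total + F.mse_loss(u_s(f_s), u_t(f_t).detach())
    return total
```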
S3: putting the characteristics of the last layer into a classifier, classifying to obtain a logistic regression probability, and calculating a loss function by respectively using the logistic regression probability and a soft label and a real label generated by a teacher model;
s3.1: in neural networks, typically softmax is used as the output layer, outputting the probability z for each class i Then by summing with other logistic regression probability values
Figure BDA0003676387930000066
By contrast, the calculated logistic regression probability for each class can be converted to a probability q i And then:
Figure BDA0003676387930000071
where T is a temperature that is typically set to 1, using a higher value for T results in a weaker probability distribution over the classes.
S3.2: the output after the full connection layer is Z s =FCL(Y s ),Z s And expressing the logistic regression probability value output by the backbone network through the full connection layer.
S3.3: the network model also combines "soft tag" losses, which are cross-entropies with soft tags, which may be referred to simply as "soft targets," and "hard tag" losses, which are cross-entropies with correct tags, which may be referred to as "hard targets. To obtain a better result when the "soft target" is used together with the "hard target", it is necessary to reduce the weight of the second objective function, and it is necessary to target the score of the category to which the data does not belong, which requires increasing the weight of the "soft target".
S4: and finally, updating all parameters of the student model through back propagation according to the overall loss function, and updating the parameters of the multilayer characteristic distillation module.
S4.1: back propagation updates all parameters of the student model.
S4.2: parameters of the multi-layer signature distillation module are updated.
S4.3: and obtaining the accuracy after the model is converged.
To verify the effectiveness of the 3D model compression method based on the spatio-temporal information transfer knowledge distillation technology of this embodiment, the applicant also performed several comparative experiments, mainly on three-dimensional convolutional networks, analyzing the difference in efficiency and accuracy between training the model alone and training with the spatio-temporal feature distillation method. The results show that the spatio-temporal feature distillation method not only realizes model compression but also improves the training efficiency and recognition accuracy of the network model.
1. The experimental environment was set up as follows:
the experiment adopts python language to realize the algorithm, the version of python is 3.6, the specific code is written by using a deep learning framework pytorch, and the important python used in the experiment includes: 1.1.0 version of pyrrch, 1.4. version of ffmpeg, 1.3.17 version of mmcv, 4.5 version of opencv-python, 1.14.0 version of tensorboard. The experimental equipment comprises three video cards, more models need to be trained, a single GPU trains one model, the experimental equipment can simultaneously train 3 models, the Batchsize of each model is the same and is 64 videos, 9337 videos are taken as a training set for one epoch, the remaining 200 videos in the training set are taken as a verification set, generally 80 minutes are required for training one epoch, the total epoch frequency of each model is set to be 150, an optimizer is SGD, the learning rate is 0.01, moment is 0.09, and weight _ decade is 0.0005.
2. Network model training
The main process of network model training comprises the following steps:
(1) First, an ablation experiment is performed, mainly to analyze the multi-layer feature distillation module. The knowledge distillation scheme used by the method consists of three important loss functions, corresponding to the hard-label loss, the soft-label loss, and the STTL (spatio-temporal feature transfer loss); each is trained separately in this section. The backbone network adopted by the model is 3D Resnet, abbreviated Res3D.
(2) Researching the information transfer capability of different teacher models to the same student model: experiments were performed with 3D Resnet-18, 3D Resnet-34, and 3D Resnet-50 as student models, and 3D Resnet-34, 3D Resnet-50, and 3D Resnet-101 as teacher models, respectively.
(3) Training a model: the proposed model compression scheme is compared to existing methods.
3. Network performance evaluation
The evaluation indexes adopted are the loss value and the accuracy: if the prediction result matches the true label, i.e. the classification is correct, the prediction is counted as correct; otherwise it is wrong.
the following is compared to behavioral models common in recent years:
table 1: spatio-temporal feature distillation method ablation experiments on UCF101 data set
As shown in Table 1, in this experiment 3D Resnet-18 is the student model and 3D Resnet-50 the teacher model. The student model trained alone reaches an accuracy of 84.4% after convergence. When knowledge distillation uses only the teacher's soft labels, or only the spatio-temporal feature transfer loss, the accuracy improves over training alone, which shows that the 3D model compression method based on spatio-temporal information transfer knowledge distillation of this embodiment (hereinafter, the proposed design) effectively transfers the feature extraction capability of the teacher model to the student model. The table also shows that training with the proposed spatio-temporal feature transfer loss achieves higher accuracy than training with soft labels only, indicating that the proposed spatio-temporal feature transfer has stronger information transfer capability than soft labels; the student model obtained with the knowledge distillation method combining both achieves the highest classification accuracy, demonstrating the effectiveness of the proposed design.
Table 2: test results of different student models on UCF101 using different teacher models
As can be seen from Table 2, as the network depth increases, the time consumed to detect a single video increases, because a deeper network has more internal parameters and the test data takes longer to pass through the network. Within a certain range, a deeper network has a stronger ability to extract features from video data, and its guiding effect as a teacher model on the student model becomes more pronounced, so more information is transferred to the student model through the multi-layer feature distillation modules and the accuracy improves more.
Table 3: comparison table of knowledge distillation algorithm
In this experiment the proposed model compression scheme is compared with existing methods. Research applying knowledge distillation to behavior recognition is scarce; among methods aimed at improving student model accuracy, Table 3 shows that the multi-level spatio-temporal distillation module designed here achieves higher accuracy than STDDCN. The proposed multi-layer feature distillation module does not only target the feature map finally output by the model; it also compares the feature map of each layer of the student model with the corresponding layer of the teacher model to obtain a spatio-temporal feature transfer loss, so that the feature map of each layer of the student model approximates the corresponding layer of the teacher model, which explains the superiority of the proposed multi-layer spatio-temporal distillation module.
4. Conclusion
This experiment mainly performed several comparative experiments between the 3D model compression method based on the spatio-temporal information transfer knowledge distillation technology and existing three-dimensional convolutional networks, analyzing the difference in efficiency and accuracy between training the model alone and training with the spatio-temporal feature distillation method. The results show that the spatio-temporal feature distillation method not only realizes model compression but also improves the model in training efficiency and recognition accuracy.
In summary, the 3D model compression method based on the spatio-temporal information transfer knowledge distillation technology provided in this embodiment compresses a behavior recognition network, specifically with a spatio-temporal feature distillation method. Several comparative experiments with existing three-dimensional convolutional networks, analyzing the difference in efficiency and accuracy between training alone and training with the spatio-temporal feature distillation method, show that the spatio-temporal feature distillation method not only realizes model compression but also improves the model in training efficiency and recognition accuracy.
Finally, it should be noted that the above embodiments are only for illustrating and understanding the technical solutions of the present application, and the invention is not limited to them. Those of ordinary skill in the art will understand that simple modifications, substitutions or additions may be made to the technical features of the technical solutions of the present application, and such simple modifications, substitutions or additions shall fall within the protective scope of the present application.

Claims (3)

1. A 3D model compression method based on spatio-temporal information transfer knowledge distillation technology, characterized in that a multi-layer feature distillation module (MFDM) is used as the core element of a 3D model to compress a behavior recognition network, specifically comprising the following steps:
S1: a public data set containing various action classes and many video segments is used as the experimental data set, and frames extracted at equal intervals from every behavior video segment in the data set are used respectively as the input of the teacher model and the student model in the spatio-temporal feature distillation method for feature extraction;
S2: the features of each layer are extracted, a multi-level feature distillation algorithm is applied to the features of each layer, and the multi-layer spatio-temporal feature transfer loss is calculated;
S3: the features of the last layer are fed into the classifier to obtain logistic regression probabilities, and loss functions are calculated from these probabilities together with the soft labels generated by the teacher model and the true labels;
S4: finally, all parameters of the student model are updated by back-propagation according to the overall loss function, and the parameters of the multi-layer feature distillation module are updated.
2. The method of claim 1, wherein:
3D Resnet-18, 3D Resnet-34 and 3D Resnet-50 are respectively used as backbones of the student network to extract features;
and 3D Resnet-34, 3D Resnet-50 and 3D Resnet-101 are respectively used as backbones of the teacher model to extract features.
3. The model compression method of claim 1, wherein the multi-layer feature distillation module considers both the transfer of spatial features and the transfer of temporal features, and employs a multi-layer context loss to control the transfer of features, referred to as the multi-layer spatio-temporal feature transfer loss.
CN202210624609.0A 2022-06-02 2022-06-02 3D model compression method based on spatio-temporal information transfer knowledge distillation technology Pending CN115035595A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210624609.0A CN115035595A (en) 2022-06-02 2022-06-02 3D model compression method based on spatio-temporal information transfer knowledge distillation technology

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210624609.0A CN115035595A (en) 2022-06-02 2022-06-02 3D model compression method based on spatio-temporal information transfer knowledge distillation technology

Publications (1)

Publication Number Publication Date
CN115035595A true CN115035595A (en) 2022-09-09

Family

ID=83122485

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210624609.0A Pending CN115035595A (en) 2022-06-02 2022-06-02 3D model compression method based on spatio-temporal information transfer knowledge distillation technology

Country Status (1)

Country Link
CN (1) CN115035595A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117523549A (en) * 2024-01-04 2024-02-06 南京邮电大学 Three-dimensional point cloud object identification method based on deep and wide knowledge distillation
CN117523549B (en) * 2024-01-04 2024-03-29 南京邮电大学 Three-dimensional point cloud object identification method based on deep and wide knowledge distillation

Similar Documents

Publication Publication Date Title
Hsu et al. Progressive domain adaptation for object detection
Fernando et al. Self-supervised video representation learning with odd-one-out networks
Ma et al. TS-LSTM and temporal-inception: Exploiting spatiotemporal dynamics for activity recognition
Ge et al. An attention mechanism based convolutional LSTM network for video action recognition
CN110263912B (en) Image question-answering method based on multi-target association depth reasoning
Yao et al. Describing videos by exploiting temporal structure
CN110516536B (en) Weak supervision video behavior detection method based on time sequence class activation graph complementation
Paul et al. Robust visual tracking by segmentation
CN112528780B (en) Video motion segmentation by hybrid temporal adaptation
Zhao et al. ME-PLAN: A deep prototypical learning with local attention network for dynamic micro-expression recognition
CN110705490B (en) Visual emotion recognition method
Gao et al. A novel multiple-view adversarial learning network for unsupervised domain adaptation action recognition
CN114332670A (en) Video behavior recognition method and device, computer equipment and storage medium
CN113033276B (en) Behavior recognition method based on conversion module
Bilkhu et al. Attention is all you need for videos: Self-attention based video summarization using universal transformers
CN114511912A (en) Cross-library micro-expression recognition method and device based on double-current convolutional neural network
CN113822125A (en) Processing method and device of lip language recognition model, computer equipment and storage medium
CN112926675A (en) Multi-view multi-label classification method for depth incompletion under dual deficiency of view angle and label
CN111126155B (en) Pedestrian re-identification method for generating countermeasure network based on semantic constraint
CN115035595A (en) 3D model compression method based on spatio-temporal information transfer knowledge distillation technology
CN110942463B (en) Video target segmentation method based on generation countermeasure network
CN112329604A (en) Multi-modal emotion analysis method based on multi-dimensional low-rank decomposition
CN116089874A (en) Emotion recognition method and device based on ensemble learning and migration learning
CN114925232B (en) Cross-modal time domain video positioning method under text segment question-answering framework
Zhao et al. Context-aware and part alignment for visible-infrared person re-identification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination