WO2021023202A1 - Self-distillation training method and device for convolutional neural network, and scalable dynamic prediction method - Google Patents

Self-distillation training method and device for convolutional neural network, and scalable dynamic prediction method Download PDF

Info

Publication number
WO2021023202A1
WO2021023202A1 (PCT application No. PCT/CN2020/106995)
Authority
WO
WIPO (PCT)
Prior art keywords
classifier
neural network
convolutional neural
layer
shallow
Prior art date
Application number
PCT/CN2020/106995
Other languages
French (fr)
Chinese (zh)
Inventor
马恺声
张林峰
Original Assignee
交叉信息核心技术研究院(西安)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 交叉信息核心技术研究院(西安)有限公司
Publication of WO2021023202A1 publication Critical patent/WO2021023202A1/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/12 - Computing arrangements based on biological models using genetic models
    • G06N3/126 - Evolutionary algorithms, e.g. genetic algorithms or genetic programming

Definitions

  • The invention relates to the training of convolutional neural networks, and in particular to a self-distillation training method and device for convolutional neural networks, and a scalable dynamic prediction method.
  • Convolutional neural networks have been widely deployed in various application scenarios. To extend their use to areas where accuracy is critical, researchers have studied ways to improve accuracy through deeper or wider network structures, which bring an exponential growth in computation and storage costs and therefore delay the response time.
  • Knowledge distillation (KD) is one of the common compression methods; its inspiration comes from the transfer of knowledge from a teacher to a student.
  • The key strategy is to position a compact student model as an approximation of an over-parameterized teacher model. The student model can thereby obtain significant performance improvements, sometimes even surpassing the teacher model.
  • The implementation of knowledge distillation includes two steps: the first step trains a large teacher model, and the second step distills knowledge from the teacher model into the student model. However, it has the following problems. The first problem is the inefficiency of knowledge transfer, meaning that the student model rarely exploits all the knowledge of the teacher model; an outstanding student model that surpasses its teacher is still rare.
  • Another problem is how to design and train an appropriate teacher model.
  • The existing distillation framework requires a great deal of effort and experimentation to find the best structure of the teacher model, which takes a relatively long time.
  • The third problem is that the teacher model and the student model each work in their own way, and knowledge transfer flows between different models; this requires building multiple models, which is cumbersome and yields low accuracy.
  • In the prior art, the proposed self-distillation training method enables efficient training, but the accuracy of the classifiers during self-distillation is low, and each classifier cannot automatically separate its own features, which impairs classifier function and thus reduces the accuracy of the training method.
  • Neural networks have advantages in handling non-linear problems that other methods cannot match.
  • Predictive control is well suited to constrained operation at process limits; combining neural networks with predictive control therefore exploits their respective advantages and provides a good solution to the control of non-linear, time-varying, strongly constrained, large-lag industrial processes, so convolutional neural networks are widely used in the field of prediction. In the prior art, predictions based on convolutional neural networks must consider both response speed and the confidence of the prediction results; to satisfy different prediction requirements, the algorithms of multiple models are stored at the same time, and different models are swapped in for different response-speed and accuracy requirements. A vacuum period is formed during the switching process, which brings security risks to real applications.
  • In view of the problems in the prior art, the present invention provides a self-distillation training method and device for a convolutional neural network, and a scalable dynamic prediction method.
  • The design is reasonable, efficient, and simple; the self-distillation-trained model lies in a flatter region of the loss landscape, and the optimization of its parameters is more robust.
  • A self-distillation training method for a convolutional neural network includes the following steps.
  • Step 1: According to the depth and original structure of the target convolutional neural network, divide its convolutional layers into n parts at set depth intervals, where n is a positive integer and n ≥ 2; the nth part is the deepest part, and the remaining parts are shallow parts.
  • Step 2: Set a shallow classifier after each shallow part for classification, and set the deepest classifier after the deepest part for classification.
  • Each shallow classifier consists of a bottleneck layer, a fully connected layer, and a softmax layer arranged in sequence for classification.
  • The deepest classifier consists of a fully connected layer and a softmax layer arranged in sequence for classification.
  • The classifier-specific features of each shallow classifier are obtained by the following attention module:
  • AttentionMaps(W_conv, W_deconv, F) = σ(φ(ψ(F, W_conv), W_deconv))
  • where ψ and φ respectively denote the convolution function of the convolutional layer used for down-sampling and the deconvolution function of the deconvolutional layer used for up-sampling, F denotes the input feature, σ denotes the sigmoid function, W_conv denotes the weights of the convolutional layer, and W_deconv denotes the weights of the deconvolutional layer.
  • Step 3: During training, the deepest part is regarded as the teacher model, and all shallow parts with their corresponding classifiers are trained as student models by distilling from the deepest part, thereby realizing self-distillation training of the convolutional neural network; a minimal code sketch of such a divided network is given below.
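  • The following PyTorch-style sketch shows one way such a divided backbone with auxiliary classifiers could be assembled; all class and argument names (`ShallowClassifier`, `SelfDistillationNet`, `backbone_parts`) are illustrative assumptions, not the patent's code, and the attention module described later is omitted here for brevity.

```python
# A minimal sketch, assuming a PyTorch backbone already split into n sequential
# parts (e.g. the four ResBlock stages of ResNet50). Names are illustrative only.
import torch
import torch.nn as nn

class ShallowClassifier(nn.Module):
    """Bottleneck + fully connected layer; softmax is applied inside the loss.

    The bottleneck is assumed to project the shallow feature map to the same
    channel count and spatial size as the deepest feature map, so that the L2
    hint loss can be computed directly.
    """
    def __init__(self, in_channels, out_channels, out_size, num_classes):
        super().__init__()
        self.bottleneck = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(out_size),   # spatial alignment with the deepest part
        )
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(out_channels, num_classes)

    def forward(self, x):
        feat = self.bottleneck(x)                     # hint feature F_i
        logits = self.fc(self.pool(feat).flatten(1))  # classification logits
        return feat, logits

class SelfDistillationNet(nn.Module):
    def __init__(self, backbone_parts, shallow_classifiers, deepest_classifier):
        super().__init__()
        self.parts = nn.ModuleList(backbone_parts)         # n parts; part n is the teacher
        self.shallow = nn.ModuleList(shallow_classifiers)  # one classifier per shallow part
        self.deepest = deepest_classifier                  # pooling + FC of the deepest part

    def forward(self, x):
        feats, logits = [], []
        for part, clf in zip(self.parts[:-1], self.shallow):
            x = part(x)
            f, y = clf(x)                                  # shallow (student) outputs
            feats.append(f)
            logits.append(y)
        x = self.parts[-1](x)                              # deepest (teacher) part
        feats.append(x)                                    # teacher feature map F_C
        logits.append(self.deepest(x))                     # teacher logits
        return feats, logits
```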
  • A scalable dynamic prediction method for a convolutional neural network, wherein the convolutional neural network is a scalable convolutional neural network obtained by any of the self-distillation training methods described above; the scalable dynamic prediction method includes the following steps.
  • Step 1: Set a threshold for each shallow classifier and for the deepest classifier.
  • Step 2: From shallow to deep, compare the confidence of each classifier's prediction with its threshold; if the confidence of the current classifier's prediction is greater than the threshold of that classifier, the prediction is considered successful; otherwise, the next deeper classifier continues to predict, up to the last classifier, as sketched in the code below. As the depth increases, the prediction accuracy increases layer by layer.
  • Step 3: Subject to the required prediction confidence, select the shallowest prediction result or the prediction result with the best accuracy as the output of the scalable dynamic prediction, according to the prediction demand.
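  • A minimal sketch of the threshold-controlled early-exit inference described above, reusing the `SelfDistillationNet` interface sketched earlier; taking the maximum softmax probability as the confidence and the function name `scalable_predict` are assumptions for illustration, not the patent's implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def scalable_predict(model, x, thresholds):
    """Threshold-controlled early exit over classifiers ordered shallow-to-deep.

    `model(x)` is assumed to return (features, logits) for every classifier, as
    in the SelfDistillationNet sketch above; `thresholds` holds one confidence
    threshold per classifier, with the last entry set to 0.0 so the deepest
    classifier always answers when the shallower ones decline. A single input
    image (batch size 1) is assumed. For simplicity this sketch evaluates all
    classifiers; a real deployment would stop the forward pass at the accepted
    classifier to actually save computation.
    """
    _, all_logits = model(x)
    for logits, tau in zip(all_logits, thresholds):
        probs = F.softmax(logits, dim=1)
        confidence, prediction = probs.max(dim=1)
        if confidence.item() >= tau:       # this classifier's prediction is accepted
            return prediction, confidence
    return prediction, confidence          # fall back to the deepest classifier
```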
  • The present invention also provides a self-distillation training device for a convolutional neural network, including a memory for storing a computer program and a processor for implementing the steps of the self-distillation training method of the convolutional neural network described above when executing the computer program.
  • Compared with the prior art, the present invention has the following beneficial technical effects:
  • The self-distillation training method of the convolutional neural network of the present invention significantly enhances the performance of the convolutional neural network, that is, improves its accuracy, by reducing the size of the network rather than expanding it.
  • Unlike traditional knowledge distillation, which is a method of knowledge transfer between networks that drives a student network to approximate the softmax output of a pre-trained teacher network, the self-distillation framework proposed here distills knowledge within the network itself: the network is first divided into several parts, and the knowledge in the deeper parts of the network is then squeezed into the shallow parts.
  • On the basis that the output of each shallow classifier is available, the scalable dynamic prediction method of the present invention can dynamically adjust the trade-off between prediction accuracy and response speed by adjusting the thresholds appropriately, and can efficiently schedule the multiple classifiers in the network. The ability to dynamically adjust the response speed of the model after deployment greatly improves the flexibility of the convolutional neural network in prediction applications; when switching operating points, only the thresholds need to be modified and the model itself does not change, which avoids a vacuum period during switching and brings a safety guarantee to real applications.
  • Further, in the scalable dynamic prediction, an automated threshold search is realized by a genetic algorithm, which further improves the acceleration effect of the neural network and thereby achieves a synergistic improvement of acceleration and accuracy.
  • FIG. 1 is a schematic diagram of the comparison of training complexity, training time and accuracy between traditional distillation and distillation of the present invention for the CIFAR100 data set.
  • Figure 2 is a schematic diagram of the self-distillation method for ResNet described in the example of the present invention.
  • Figure 3 shows the accuracy of the classifiers trained by different methods in the example of the present invention.
  • Fig. 4 is a diagram showing the relationship between the amount of calculation of the scalable network and the accuracy in the example of the present invention.
  • Fig. 5 is a diagram showing the relationship between the amount of parameters of the scalable network and the accuracy in the example of the present invention.
  • FIG. 6 is a diagram showing the relationship between the speed-up ratio and the accuracy of the scalable dynamic prediction in the scalable dynamic prediction method described in the example of the present invention.
  • Fig. 7 shows the visualization results of attention maps of different classifiers in the scalable neural network in the example of the present invention.
  • Fig. 8 is a schematic diagram of the number of classifications completed by each classifier obtained by the prediction method in the example of the present invention on different data sets.
  • The present invention proposes a self-distillation training method for convolutional neural networks, which achieves the highest possible accuracy when training compact models and overcomes the shortcomings of traditional distillation.
  • In traditional distillation, the first step is to train a large teacher model, and the second step is to distill knowledge from the teacher model into the student model.
  • In contrast, the one-step self-distillation framework provided by the method of the present invention directs the distilled knowledge to the student model itself.
  • The proposed self-distillation not only requires less training time (from 26.98 hours to 5.87 hours on CIFAR100, a 4.6-fold reduction), but also achieves higher accuracy (from 79.33% with traditional distillation on ResNet50 to 81.04%).
  • In addition, the accuracy of the shallow classifiers is improved, which enhances their performance.
  • The present invention can be used in any system based on a convolutional neural network, such as image classification systems, face recognition systems, object detection systems, and image semantic segmentation systems.
  • The training method described in the present invention can be used to improve the performance of such systems; it provides not only high accuracy but also high speed, and can improve speed and accuracy synergistically.
  • A comparison of the accuracy of four methods for training the shallow classifiers of ResNet50 on CIFAR100 is provided.
  • The observations show that as the classifier becomes shallower, its prediction accuracy drops rapidly: the shallowest classifier and the sub-shallow classifier lose 13% and 8% accuracy, respectively.
  • Although the self-distillation algorithm is significantly improved compared with the deep supervision algorithm and the separate training method, it still cannot meet the needs of practical applications.
  • The accuracy of the separately trained networks is better than that of the self-distillation and deep supervision algorithms, which shows that in the latter's shared-backbone structure there is negative interaction between the different classifiers. Because the features the backbone network can provide are limited by the number of channels, the features corresponding to different classifiers become entangled, and it is almost impossible for each classifier to automatically separate its own features from the mixed features.
  • Therefore, an attention layer is used to extract classifier-specific features from the shared backbone network, so that each classifier can learn how to obtain the features it needs from the backbone.
  • A simplified attention layer is adopted, which includes a convolutional layer for down-sampling and a deconvolutional layer for up-sampling.
  • The attention layer is followed by a sigmoid activation to obtain an attention map with values between 0 and 1.
  • The dot product of the attention map and the original feature is then taken to generate the classifier-specific feature. The forward computation can be formulated as:
  • AttentionMaps(W_conv, W_deconv, F) = σ(φ(ψ(F, W_conv), W_deconv))
  • where ψ and φ represent the convolution and deconvolution functions respectively, F represents the input feature, and σ represents the sigmoid function. Note that the batch normalization and ReLU activation functions after the convolution and deconvolution layers are omitted here.
  • The scalable neural network enables different classifiers to extract suitable features from the backbone network through the attention layer, which greatly improves the prediction accuracy of the shallow classifiers. By visualizing the attention maps output by the attention layers, the feature-selection process of the neural network can be observed.
  • Figure 7 shows the output of the attention layers for two images. The leftmost picture is the input image; the six images on the right show, from left to right, the outputs of the attention layers of the three classifiers from shallow to deep. The first row is the heat-map representation of the attention map, and the second row is the input image after a dot-product operation with the attention map used as a mask.
  • Position of attention: in the heat maps, the positions of the shark and the cat have higher values, which means that the different classifiers place their main attention on the most informative positions in the input picture, that is, the bodies of the shark and the cat, while the background and other irrelevant elements are ignored. This shows that even a shallow classifier has the ability to judge the importance of each pixel.
  • Granularity of attention: the attention of different classifiers also differs. As shown in Figure 7, the shallow classifiers pay more attention to the contours of the shark and the cat, that is, to local and high-frequency information, while the deep classifier pays more attention to the body and texture, that is, to global and low-frequency information. This pattern is consistent with the information-processing mechanism of biological visual systems: as the network becomes deeper, the receptive field of the neural network keeps growing, which gives the deep classifiers the ability to focus on global features in the attention layer.
  • The self-distillation method of the present invention is depicted in FIG. 2.
  • Self-distillation training is carried out through the following steps to build the self-distillation framework. First, on any computer capable of running text-editing software, the original neural network is modified: the target convolutional neural network is divided into several shallow parts according to its depth and original structure; for example, ResNet50 is divided into 4 parts according to its ResBlocks. Second, again by modifying the original neural network, a classifier is set after each shallow part; each classifier combines a bottleneck layer and a fully connected layer, which are used only during training and can be removed at inference.
  • The main reasons for adding the bottleneck layer are to reduce the interference between the shallow classifiers and to add the L2 loss from the hints.
  • Training can be performed on NVIDIA graphics cards, Intel high-performance CPUs, or Google TPU chips.
  • All shallow parts with their corresponding classifiers are trained as student models by distilling from the deepest part, which can conceptually be regarded as the teacher model.
  • ResNet is divided into four parts according to depth; after each part, an additional bottleneck layer and a fully connected layer are set to form the classifiers. All classifiers can be used independently, at different accuracies and corresponding response times; as shown in Figure 2, each classifier is trained under three kinds of supervision, and the parts below the dotted line can be removed during inference.
  • The three kinds of supervision are: supervision from the labels (loss source 1), supervision from distillation (loss source 2), and supervision from the hints (loss source 3); their corresponding flows are shown in the figure.
  • Loss source 1: cross-entropy loss from the labels, applied not only to the deepest classifier but also to all shallow classifiers. It is computed from the labels of the training data set and the output of the softmax layer of each classifier. In this way, the knowledge hidden in the training data set is introduced directly from the labels to all classifiers.
  • Loss source 2: the Kullback-Leibler (KL) divergence loss guided by the teacher model.
  • Through the KL divergence, the self-distillation framework transfers the knowledge of the teacher model, via its deepest classifier, to each shallow classifier.
  • Loss source 3: the L2 loss from the hints, obtained by computing the L2 distance between the feature maps of the deepest classifier and of each shallow classifier. With the help of the L2 loss, the implicit knowledge in the feature maps is introduced into the bottleneck layer of each shallow classifier, which induces the feature maps in the bottleneck layers of all classifiers to fit the feature map of the deepest classifier.
  • The trained convolutional neural network produced by the proposed self-distillation contains multiple classifiers, denoted θ_{i/C}, i = 1, ..., C, where C is the number of classifiers in the convolutional neural network, and a softmax layer is set after each classifier.
  • q_i ∈ R^M denotes the softened class-probability distribution output by classifier θ_{i/C}, computed from that classifier's logits z_i as q_i = softmax(z_i / T).
  • T is the distillation temperature hyper-parameter, usually set to 1; the larger its value, the smoother the predicted probability distribution.
  • The above neural network is trained with self-distillation on an NVIDIA graphics card, an Intel high-performance CPU, or a Google TPU chip.
  • The supervision of each classifier θ_{i/C}, except the deepest classifier θ_C, comes from three sources, balanced by two hyper-parameters α and λ, which control the proportions of the KL-divergence loss and the feature (hint) loss; for the deepest classifier, α and λ are zero.
  • The first source is the cross-entropy loss computed from q_i and the label y, where q_i is the output of the softmax layer of classifier θ_{i/C} and CrossEntropy is the cross-entropy function.
  • The second source is the Kullback-Leibler divergence between q_i and q_C. The goal is to make each shallow classifier approximate the deepest classifier, which constitutes the supervision from distillation; here q_i is the output of the softmax layer of classifier θ_{i/C}, q_C is the output of the softmax layer of the deepest classifier, α is the hyper-parameter controlling the proportion of the KL-divergence loss, and KL denotes the Kullback-Leibler divergence.
  • The final supervision comes from the hints of the deepest classifier.
  • A hint is defined as the output of a hidden layer of the teacher model, and its purpose is to guide the learning of the student model. It works by reducing the distance between the feature map of each shallow classifier and the feature map of the deepest classifier. However, because feature maps at different depths have different sizes, additional layers must be added to align them. Instead of convolutional layers, the present invention uses a bottleneck architecture, which shows a positive effect on model performance.
  • F_i and F_C denote the features in classifier θ_{i/C} and the features in the deepest classifier θ_C, respectively.
  • The loss function of the entire neural network is composed of the loss functions of all classifiers and can be written as:
  • loss = Σ_{i=1}^{C} ( (1 − α)·CrossEntropy(q_i, y) + α·KL(q_i, q_C) + λ·||F_i − F_C||₂² )
  • where q_i is the output of the softmax layer of each classifier θ_{i/C}; the training set consists of N samples from M categories, with the corresponding label set {y_i}, y_i ∈ {1, 2, ..., M}; CrossEntropy is the cross-entropy function; KL is the Kullback-Leibler divergence; q_C is the output of the softmax layer of the deepest classifier θ_C; and F_i and F_C denote the features in each classifier θ_{i/C} and in the deepest classifier θ_C, respectively.
  • α and λ are hyper-parameters controlling the proportions of the KL-divergence loss and the feature loss; for the deepest classifier, α and λ are zero.
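  • A minimal PyTorch-style sketch of the combined loss above; the default hyper-parameter values, the detaching of the teacher's signals, and the T² scaling of the KL term are common practice and assumptions here, not values specified in the patent.

```python
import torch
import torch.nn.functional as F

def self_distillation_loss(feats, logits, labels, alpha=0.3, lam=0.03, T=1.0):
    """Total loss over all classifiers.

    feats  : list of feature maps [F_1, ..., F_C]  (F_C from the deepest classifier)
    logits : list of logits       [z_1, ..., z_C]  (z_C from the deepest classifier)
    labels : ground-truth class indices y
    alpha, lam : weights of the KL and hint terms (zero for the deepest classifier);
    T : distillation temperature. Hint features are assumed shape-aligned with F_C.
    """
    teacher_logits = logits[-1]
    teacher_feat = feats[-1].detach()                        # teacher signals detached (assumption)
    loss = F.cross_entropy(teacher_logits, labels)           # deepest classifier: label loss only
    soft_teacher = F.softmax(teacher_logits.detach() / T, dim=1)
    for f, z in zip(feats[:-1], logits[:-1]):                # shallow (student) classifiers
        ce = F.cross_entropy(z, labels)                      # loss source 1: labels
        kl = F.kl_div(F.log_softmax(z / T, dim=1),           # loss source 2: distillation
                      soft_teacher, reduction="batchmean") * (T * T)
        hint = F.mse_loss(f, teacher_feat)                   # loss source 3: L2 (mean-squared) hint
        loss = loss + (1.0 - alpha) * ce + alpha * kl + lam * hint
    return loss
```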
  • The advantages of the self-distillation training method for convolutional neural networks proposed by the present invention are shown by comparing it with deeply supervised networks and previous distillation methods.
  • The present invention dispenses with the additional teacher model required by previous distillation methods and provides an adaptive-depth architecture for the time-accuracy trade-off at runtime.
  • The specific experimental results on five convolutional neural networks and two data sets are as follows.
  • CIFAR100: the CIFAR100 data set consists of small (32x32 pixel) RGB images in 100 categories and contains 50K training images and 10K test images. The kernel sizes and strides of the neural networks are adjusted to fit the small image size.
  • ImageNet: the ImageNet2012 classification data set consists of 1000 categories based on WordNet, each depicted by thousands of images, which are resized to 256x256 pixel RGB images. Note that the reported ImageNet accuracy is computed on the validation set.
  • Table 1: the accuracy of the different classifiers of the self-distillation algorithm on the CIFAR100 data set.
  • Table 2: the accuracy of the different classifiers of the self-distillation algorithm on the ImageNet data set.
  • Table 3 compares the results of self-distillation with those of five traditional distillation methods on the CIFAR100 data set, focusing on the accuracy improvement of each method when the student models have the same amount of computation and storage. From Table 3 we draw the following observations: (i) all distillation methods outperform the directly trained student network; (ii) although self-distillation uses no additional teacher, it is still superior to most other distillation methods.
  • A significant advantage of the self-distillation framework is that it requires no additional teacher.
  • Traditional distillation first needs to design and train an over-parameterized teacher model; designing a high-quality teacher model requires many experiments to find the best depth and architecture, and training an over-parameterized teacher model takes much longer.
  • The convolutional neural network trained by the present invention uses the newly added layers (the parts below the dotted line in Figure 2) only during training; they exert no influence during inference. Optionally keeping these parts at inference provides another option for dynamic inference on energy-constrained edge devices, adapting the depth of inference in a scalable way.
  • A popular solution for accelerating convolutional neural networks is to design a scalable network, meaning that the depth or width of the neural network can be changed dynamically according to application requirements; for example, in scenarios where response time matters more than accuracy, some layers or channels can be discarded at runtime for acceleration.
  • As can be observed in Table 5: (i) with classifier 3/4, three of the four neural networks outperform their baselines, with an average speed-up of 1.2 times, and with classifier 2/4 a speed-up of 3.16 times can be achieved at an accuracy loss of 3.3%; (ii) since the different classifiers share one backbone network, the ensemble of the three deepest classifiers raises the average accuracy by 0.67% at a computational overhead of only 0.05%.
  • The self-distillation method itself is analyzed further below.
  • The following analyzes the advantages of the self-distillation method from the perspectives of flat minima, gradients, and discriminating features.
  • The self-distillation method of the present invention is a training technique for improving model performance, rather than a method for compressing or accelerating a model.
  • The self-distillation provided by the present invention is a method of knowledge transfer within a model and has broad application prospects.
  • The self-distillation method of the present invention helps the trained model, that is, the convolutional neural network, converge to a flat minimum that generalizes well; self-distillation prevents the model from encountering the vanishing-gradient problem; and the deeper classifiers used in self-distillation extract more discriminating features.
  • The present invention provides a scalable dynamic prediction method for a convolutional neural network, in which each classifier is first given a corresponding threshold. If the confidence of the current classifier's prediction is greater than its threshold, the prediction of that classifier is considered successful; otherwise, the next deeper classifier continues to predict, up to the last classifier.
  • The scalable dynamic prediction mechanism sets thresholds only for the first three shallow classifiers; the prediction of the deepest classifier is taken as the final result. Since most of the computation of a shallow classifier is part of the computation of the deeper classifiers, such gradually deepening dynamic prediction brings almost no extra computation.
  • Threshold-controlled scalable dynamic prediction introduces another problem, namely how to select appropriate thresholds for the different classifiers.
  • A suitable threshold is very important: (1) a lower threshold lets most predictions be completed by the shallow classifiers, which effectively reduces response time but also lowers prediction accuracy; (2) likewise, a higher threshold lets most predictions be completed by the deep classifiers, which achieves higher prediction accuracy but lengthens response time; (3) by adjusting the thresholds reasonably, the trade-off between prediction accuracy and response speed can be tuned dynamically. To further explore the room for acceleration and accuracy improvement, the present invention further uses a genetic algorithm to optimize the thresholds.
  • The genetic algorithm obtains the optimal solution, or an approximation of it, for a formulated optimization goal by simulating how biological individuals in nature survive, are eliminated, and reproduce.
  • The main process includes: (1) gene initialization, that is, randomly generating a certain number of individuals with different genes as the first generation; (2) computing environmental fitness, that is, for each individual, computing its fitness to the environment as determined by its genes, where this computation is defined by the optimization goal; (3) elimination, that is, removing the individuals that are poorly suited to the environment based on the result of the previous step; (4) crossover, that is, cross-combining the genes of the surviving individuals to simulate reproduction and obtain the next generation of individuals; (5) gene mutation, that is, changing the genes of the surviving and newly generated individuals with a certain probability to prevent the optimization from falling into a local optimum. Through multiple iterations of this process, the genetic algorithm finds the optimal or a near-optimal solution for the optimization goal. A generic sketch of this loop is given below.
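  • The following is a schematic sketch of the genetic-algorithm loop just described (initialization, fitness evaluation, elimination, crossover, mutation); the function and parameter names (`genetic_search`, `fitness`, `survive_ratio`) are placeholders supplied for illustration, not part of the patent.

```python
import random

def genetic_search(fitness, gene_length, population=50, generations=100,
                   survive_ratio=0.5, mutation_rate=0.05):
    """Generic GA over binary gene sequences; `fitness` maps a gene list to a score."""
    # (1) initialize genes: random binary individuals form the first generation
    pop = [[random.randint(0, 1) for _ in range(gene_length)] for _ in range(population)]
    for _ in range(generations):
        # (2) environmental fitness determined by the optimization goal
        scored = sorted(pop, key=fitness, reverse=True)
        # (3) elimination of individuals poorly suited to the environment
        survivors = scored[: int(population * survive_ratio)]
        # (4) crossover: breed children from pairs of survivors
        children = []
        while len(survivors) + len(children) < population:
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, gene_length)
            children.append(a[:cut] + b[cut:])
        pop = survivors + children
        # (5) mutation with a small probability to escape local optima
        for individual in pop:
            for i in range(gene_length):
                if random.random() < mutation_rate:
                    individual[i] = 1 - individual[i]
    return max(pop, key=fitness)
```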
  • The threshold search problem is modeled as an optimization problem solved by the genetic algorithm.
  • The optimization goals are a fast response speed of the neural network model and a high prediction accuracy.
  • The solution being optimized corresponds to the thresholds of the shallow classifiers in the scalable network.
  • In the process of using the genetic algorithm to solve the threshold search problem, it is necessary to define the mapping between genes and thresholds, and to compute the environmental fitness from the speed-up ratio and the accuracy of the scalable network.
  • The decoding relationship can be as follows: S(n) denotes the value at the nth position of the gene sequence, the decoded value of the i-th gene segment gives the threshold corresponding to the i-th classifier, and N denotes the length of the gene sequence; in a gene sequence, the greater the number of "1"s, the lower the resulting threshold.
  • The acceleration ratio is the ratio of the response speed of the scalable dynamic prediction to the response speed of the original scalable convolutional neural network, and measures the acceleration effect.
  • Accuracy and baseline denote, respectively, the prediction accuracy of the scalable dynamic prediction and the prediction accuracy of the original scalable convolutional neural network.
  • A balance factor weighs response acceleration against prediction accuracy. A hedged sketch of the corresponding decoding and fitness computation is given below.
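  • The sketch below shows one possible gene-to-threshold decoding and fitness function consistent with the description above; the segment length, the linear mapping (more "1"s giving a lower threshold), the additive fitness form, and the `evaluate` callback are all illustrative assumptions, not the patent's exact formulas.

```python
def decode_thresholds(gene, num_classifiers=3, bits_per_threshold=8):
    """Map a binary gene sequence to one threshold per shallow classifier.

    More '1's in a segment produce a lower threshold; the segment length and
    the linear mapping are illustrative assumptions.
    """
    thresholds = []
    for i in range(num_classifiers):
        segment = gene[i * bits_per_threshold:(i + 1) * bits_per_threshold]
        thresholds.append(1.0 - sum(segment) / len(segment))
    return thresholds

def threshold_fitness(gene, evaluate, baseline_accuracy, beta=1.0):
    """Environmental fitness combining acceleration and accuracy.

    `evaluate(thresholds)` is assumed to run scalable dynamic prediction on a
    validation set and return (acceleration_ratio, accuracy); `baseline_accuracy`
    is the accuracy of the original network and `beta` is the balance factor
    between response acceleration and prediction accuracy.
    """
    acceleration_ratio, accuracy = evaluate(decode_thresholds(gene))
    return acceleration_ratio + beta * (accuracy - baseline_accuracy)
```

  • This fitness could be plugged into the generic `genetic_search` loop sketched earlier, e.g. `genetic_search(lambda g: threshold_fitness(g, evaluate, baseline_accuracy), gene_length=24)`.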
  • The benefits of the scalable dynamic prediction method are not only a higher acceleration effect than static acceleration, but also the ability to adjust the response speed of the model dynamically after deployment, which makes applications extremely flexible.
  • The model can use a lower threshold to guarantee a higher processing frame rate.
  • The model can use a higher threshold to obtain the best prediction accuracy.
  • This method only needs to modify the thresholds when switching operating points, without changing the model, which avoids a vacuum period during switching and brings a safety guarantee to real applications.
  • Compared with static acceleration methods, the scalable dynamic prediction method not only has a higher acceleration ratio but is also more reliable.
  • The accuracy requirement on the compressed neural network model is often one of the most important evaluation criteria for neural network compression algorithms.
  • The compression and acceleration of neural networks are often accompanied by a decrease in accuracy; such results are unacceptable in some safety-related application scenarios, such as autonomous driving and security systems.
  • With the scalable dynamic prediction method, even if the accuracy of all shallow classifiers is lower than that of the original scalable convolutional neural network model, reasonable classifier scheduling can still be achieved through lower thresholds while maintaining the original accuracy of the neural network.
  • The experimental results of the scalable dynamic prediction method of the convolutional neural network of the present invention on the CIFAR100 data set are shown in Figures 4 and 5, which plot the relationship between the computation amount, the parameter amount, and the prediction accuracy of seven different deep neural networks on CIFAR100.
  • The horizontal axis represents the number of multiply-add operations required for prediction by the deep neural network.
  • The vertical axis represents its prediction accuracy.
  • The dashed lines and dots of each gray level correspond to the same deep neural network.
  • Marker points of the same shape on a dashed line represent the experimental results of the four (or three) classifiers of the same scalable network, and marker points of the same shape off the dashed line represent the comparison results of the original model without the scalable network.
  • The second shallow classifier of the scalable convolutional neural network can exceed the original model in prediction accuracy.
  • A statically run scalable network can achieve 2.17 times acceleration and 3.20 times compression.
  • Each neural network increases its prediction accuracy by 4.05% at the cost of only 4.4% additional computation.
  • The ensemble of the prediction results of all classifiers can improve the accuracy by 1.11%.
  • The accuracy of the shallow classifiers is improved substantially, which is mainly brought about by the attention layers in the shallow classifiers.
  • The deeper the neural network, the greater its performance improvement.
  • This enhancement trend is most obvious in the shallowest and sub-shallow classifiers.
  • The first two shallow classifiers of ResNet18 differ in accuracy by more than 5%.
  • The accuracy of the sub-deep classifier is almost the same as that of the deepest classifier.
  • The accuracy of the sub-deep classifier may even be higher than that of the deepest classifier; this phenomenon may be caused by the relatively simple classification task of the CIFAR100 data set.
  • The scalable network achieves an accuracy increase of more than 1%.
  • Table 7 shows the experimental results of the scalable convolutional neural network on the CIFAR10 data set. The overall trend is the same as on CIFAR100: all convolutional neural networks achieve a significant improvement in accuracy. Among all the network structures in the experiment, the average increase is 0.98%, the highest is 1.28% on VGG16 (BN), and the lowest is 0.71% on ResNet18.
  • The absolute accuracy increase on the CIFAR10 data set is slightly lower than that on CIFAR100.
  • The main reason for this phenomenon is that the accuracy of the original networks on CIFAR10 is already very high; because a neural network trained by the traditional method can already achieve a high prediction accuracy, further improving it is more difficult than on the CIFAR100 data set.
  • Table 7: the accuracy of the different classifiers of the scalable convolutional neural network on the CIFAR10 data set.
  • Table 8 shows the accuracy of each classifier of ResNet networks of three different depths on the ImageNet data set. The trend is roughly the same as on CIFAR100, but the following differences remain:
  • on average, each network increases its prediction accuracy by 1.26%;
  • the effect is most obvious on ResNet50, with an increase of 1.41%, and least obvious on ResNet101, with an increase of 1.08%; this result is worse than the result on the CIFAR100 data set.
  • FIG. 6 shows the relationship between the accuracy and the speed-up ratio of each neural network obtained by dynamic scalable prediction under different threshold schemes on CIFAR100 and ImageNet.
  • The horizontal axis represents the acceleration ratio of the model.
  • The vertical axis represents the prediction accuracy of the model.
  • Points of the same color represent experimental results for the same network on the same data set.
  • The squares in the range x > 1 indicate the experimental results corresponding to the searched threshold schemes.
  • The final acceleration effect depends directly on the number of classifications completed by each classifier in the scalable neural network. If a large proportion of the classification decisions are completed by the shallow classifiers, the acceleration of the whole neural network is very pronounced; if a large proportion are completed by the deep classifiers, the response speed of the system is almost the same as that of the original network. By counting the number of decisions made by classifiers at different depths, the acceleration effect of the system can be assessed accurately.
  • With the same threshold scheme and the same neural network (ResNet50), the prediction behavior of the four classifiers on different data sets is compared.
  • The labels 1/4 to 4/4 on the horizontal axis denote the four classifiers from shallow to deep, and the value on the vertical axis denotes the ratio of the number of predictions completed by that classifier to the total number of predictions.
  • The number of predictions completed by the different classifiers on the same data set can be used to judge the redundancy of different network layers.
  • The numbers of predictions completed by the sub-deep classifier and the deepest classifier are close to zero, which shows that the network parts belonging to these two classifiers play a small role in the overall classification.
  • The sum of the prediction counts of the first two shallow classifiers is close to 100%, indicating that the network parts belonging to these two classifiers play a major role in the classification task and have little redundancy, so they are not suitable for further compression or acceleration.
  • The numbers of predictions completed by the different classifiers on different data sets can be used as a measure of the difficulty of those data sets.
  • The easiest way to compare the difficulty of different data sets is to directly compare the prediction accuracy that the same network can achieve on each data set.
  • However, the accuracy of a classification task is also affected by the number of categories.
  • Because the number of categories differs between data sets, this measurement is affected accordingly and underestimates the difficulty of classification tasks with fewer categories.
  • Depth scalability provides another way of thinking: comparing the difficulty of different data sets by comparing the number of samples classified by the shallow classifiers.
  • The present invention also provides a self-distillation training device for a convolutional neural network, including a memory for storing a computer program and a processor for implementing the steps of the self-distillation training method of the convolutional neural network described above when executing the computer program.
  • The present invention also provides a computer storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps of the self-distillation training method of the convolutional neural network described above are realized.
  • The present invention also provides a scalable dynamic prediction device for a convolutional neural network, including a memory for storing a computer program and a processor for implementing the steps of the scalable dynamic prediction method of the convolutional neural network described above when executing the computer program.
  • The present invention also provides another computer storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps of the scalable dynamic prediction method of the convolutional neural network described above are realized.
  • The embodiments of the present invention may be provided as methods, systems, or computer program products. Therefore, the present invention may take the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
  • These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device, and the instruction device implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions may also be loaded onto a computer or other programmable data processing equipment, so that a series of operation steps are executed on the computer or other programmable equipment to produce computer-implemented processing, and the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physiology (AREA)
  • Genetics & Genomics (AREA)
  • Image Analysis (AREA)

Abstract

The present invention provides a self-distillation training method for a convolutional neural network, for significantly improving the performance of a convolutional neural network by reducing, rather than expanding, the size of the network. When knowledge is distilled within the network itself, the network is first divided into several parts; then, the knowledge in the deeper parts of the network is squeezed into the shallow parts. Without sacrificing response time, self-distillation greatly improves the performance of the convolutional neural network, achieving an average accuracy improvement of 2.65%; the improvement ranges from a minimum of 0.61% on ResNeXt to a maximum of 4.07% on VGG19. Combined with the enhanced extraction of shallow-classifier features by an attention layer, the accuracy of the shallow classifiers is significantly improved; thus, a convolutional neural network with multiple outputs can be regarded as multiple convolutional neural networks, and the output of each shallow classifier can be used according to different needs.

Description

Self-distillation training method and device for a convolutional neural network, and scalable dynamic prediction method
Technical field
The invention relates to the training of convolutional neural networks, and in particular to a self-distillation training method and device for convolutional neural networks, and a scalable dynamic prediction method.
Background
Convolutional neural networks have been widely deployed in various application scenarios. To extend their use to areas where accuracy is critical, researchers have studied ways to improve accuracy through deeper or wider network structures, which bring an exponential growth in computation and storage costs and therefore delay the response time.
With the help of convolutional neural networks, applications such as image classification, object detection, and semantic segmentation are developing at an unprecedented speed. However, in applications that do not tolerate errors, such as autonomous driving and medical image analysis, prediction and analysis accuracy must be improved further while response times must be shortened, which poses great challenges for current convolutional neural networks. Methods in the prior art focus either on performance improvement or on reducing computational resources in order to reduce response time. On the one hand, ResNet 150 or even larger ResNet 1000 networks have been proposed, which gain a very limited performance margin at a large computational cost. On the other hand, given a predefined performance loss relative to a best-effort network, various techniques have been proposed to reduce computation and storage so as to match the constraints imposed by hardware; such techniques include lightweight network design, pruning, and quantization, among which knowledge distillation (KD) is one of the feasible ways to achieve model compression.
As one of the common compression methods, knowledge distillation is inspired by the transfer of knowledge from a teacher to a student. Its key strategy is to position a compact student model as an approximation of an over-parameterized teacher model, so that the student model can obtain significant performance improvements, sometimes even surpassing the teacher model. By replacing the over-parameterized teacher model with a compact student model, high compression and fast acceleration can be achieved. The implementation of knowledge distillation includes two steps: the first step trains a large teacher model, and the second step distills knowledge from the teacher model into the student model. However, it has the following problems. The first problem is the inefficiency of knowledge transfer: the student model rarely exploits all the knowledge of the teacher model, and an outstanding student model that surpasses its teacher is still rare. Another problem is how to design and train an appropriate teacher model: the existing distillation framework requires a great deal of effort and experimentation to find the best structure of the teacher model, which takes a relatively long time. The third problem is that the teacher model and the student model each work in their own way and knowledge transfer flows between different models, which requires building multiple models, is cumbersome, and yields low accuracy.
In the prior art, the proposed self-distillation training method enables efficient training, but the accuracy of the classifiers during self-distillation is low, and each classifier cannot automatically separate its own features, which impairs classifier function and thus reduces the accuracy of the training method.
At the same time, neural networks have advantages in handling non-linear problems that other methods cannot match, and predictive control is well suited to constrained operation at process limits; combining neural networks with predictive control therefore exploits their respective advantages and provides a good solution to the control of non-linear, time-varying, strongly constrained, large-lag industrial processes, so convolutional neural networks are widely used in the field of prediction. In the prior art, predictions based on convolutional neural networks must consider both response speed and the confidence of the prediction results; to satisfy different prediction requirements, the algorithms of multiple models are stored at the same time, and different models are swapped in for different response-speed and accuracy requirements, which creates a vacuum period during switching and brings security risks to real applications.
Summary of the invention
In view of the problems in the prior art, the present invention provides a self-distillation training method and device for a convolutional neural network, and a scalable dynamic prediction method; the design is reasonable, efficient, and simple, the self-distillation-trained model lies in a flatter region of the loss landscape, and the optimization of its parameters is more robust.
The present invention is realized through the following technical solutions:
A self-distillation training method for a convolutional neural network includes the following steps.
Step 1: According to the depth and original structure of the target convolutional neural network, divide its convolutional layers into n parts at set depth intervals, where n is a positive integer and n ≥ 2; the nth part is the deepest part, and the remaining parts are shallow parts.
Step 2: Set a shallow classifier after each shallow part for classification, and set the deepest classifier after the deepest part for classification; each shallow classifier consists of a bottleneck layer, a fully connected layer, and a softmax layer arranged in sequence, and the deepest classifier consists of a fully connected layer and a softmax layer arranged in sequence.
The classifier-specific features of each shallow classifier are obtained by the following attention module:
AttentionMaps(W_conv, W_deconv, F) = σ(φ(ψ(F, W_conv), W_deconv))
where ψ and φ respectively denote the convolution function of the convolutional layer used for down-sampling and the deconvolution function of the deconvolutional layer used for up-sampling, F denotes the input feature, σ denotes the sigmoid function, W_conv denotes the weights of the convolutional layer, and W_deconv denotes the weights of the deconvolutional layer.
Step 3: During training, the deepest part is regarded as the teacher model, and all shallow parts with their corresponding classifiers are trained as student models by distilling from the deepest part, thereby realizing self-distillation training of the convolutional neural network.
A scalable dynamic prediction method for a convolutional neural network, wherein the convolutional neural network is a scalable convolutional neural network obtained by any of the self-distillation training methods described above; the scalable dynamic prediction method includes the following steps.
Step 1: Set a threshold for each shallow classifier and for the deepest classifier.
Step 2: From shallow to deep, compare the confidence of each classifier's prediction with its threshold; if the confidence of the current classifier's prediction is greater than the threshold of that classifier, the prediction is considered successful; otherwise, the next deeper classifier continues to predict, up to the last classifier. As the depth increases, the prediction accuracy increases layer by layer.
Step 3: Subject to the required prediction confidence, select the shallowest prediction result or the prediction result with the best accuracy as the output of the scalable dynamic prediction, according to the prediction demand.
The present invention also provides a self-distillation training device for a convolutional neural network, including a memory for storing a computer program and a processor for implementing the steps of the self-distillation training method of the convolutional neural network described above when executing the computer program.
Compared with the prior art, the present invention has the following beneficial technical effects:
The self-distillation training method of the convolutional neural network of the present invention significantly enhances the performance of the convolutional neural network, that is, improves its accuracy, by reducing the size of the network rather than expanding it. Unlike traditional knowledge distillation, which is a method of knowledge transfer between networks that drives a student network to approximate the softmax output of a pre-trained teacher network, the self-distillation framework proposed here distills knowledge within the network itself: the network is first divided into several parts, and the knowledge in the deeper parts is then squeezed into the shallow parts. Without sacrificing response time, self-distillation greatly improves the performance of the convolutional neural network, achieving an average accuracy improvement of 2.65%, ranging from a minimum of 0.61% on ResNeXt to a maximum of 4.07% on VGG19. Combined with the enhanced extraction of shallow-classifier features by the attention layer, the accuracy of the shallow classifiers is significantly improved, so that a convolutional neural network with multiple outputs can be regarded as multiple convolutional neural networks and the output of each shallow classifier can be used according to different needs.
On the basis that the output of each shallow classifier is available, the scalable dynamic prediction method of the present invention can dynamically adjust the trade-off between prediction accuracy and response speed by adjusting the thresholds appropriately, and can efficiently schedule the multiple classifiers in the network. The ability to dynamically adjust the response speed of the model after deployment greatly improves the flexibility of the convolutional neural network in prediction applications; when switching operating points, only the thresholds need to be modified and the model itself does not change, which avoids a vacuum period during switching and brings a safety guarantee to real applications.
Further, in the scalable dynamic prediction, an automated threshold search is realized by a genetic algorithm, which further improves the acceleration effect of the neural network and thereby achieves a synergistic improvement of acceleration and accuracy.
Description of the drawings
Figure 1 is a schematic comparison of training complexity, training time and accuracy between traditional distillation and the distillation of the present invention on the CIFAR100 data set.

Figure 2 is a schematic diagram of the self-distillation method applied to ResNet described in the example of the present invention.

Figure 3 shows the accuracy of classifiers trained with different methods in the example of the present invention.

Figure 4 shows the relationship between the amount of computation and the accuracy of the scalable network described in the example of the present invention.

Figure 5 shows the relationship between the number of parameters and the accuracy of the scalable network described in the example of the present invention.

Figure 6 shows the relationship between the speed-up ratio and the accuracy of scalable dynamic prediction in the scalable dynamic prediction method described in the example of the present invention.

Figure 7 shows visualizations of the attention maps of different classifiers in the scalable neural network described in the example of the present invention.

Figure 8 is a schematic diagram of the number of classifications completed by each classifier, obtained with the prediction method described in the example of the present invention on different data sets.
Detailed description
The present invention is described in further detail below in conjunction with specific embodiments, which are intended to explain rather than limit the present invention.
As shown in Figure 1, the present invention proposes a self-distillation training method for convolutional neural networks that achieves the highest possible accuracy when training compact models and overcomes the shortcomings of traditional distillation. Instead of the two steps of traditional distillation, namely first training a large teacher model and then distilling knowledge from the teacher model into the student model, the one-step self-distillation framework provided by the method of the present invention directs training at the student model itself. The proposed self-distillation not only requires less training time (on CIFAR100, from 26.98 hours down to 5.87 hours, a 4.6-fold reduction), but also achieves higher accuracy (on ResNet50, from 79.33% with traditional distillation to 81.04%). To make the method more useful in real application scenarios, the present invention further enhances performance by improving the accuracy of the shallow classifiers. The present invention can be used in any system based on convolutional neural networks, such as image classification systems, face recognition systems, object detection systems and image semantic segmentation systems. Applying the training method of the present invention when training the neural networks required by such systems improves their performance, offering both high accuracy and high speed, with speed and accuracy improved jointly.
Figure 3 provides an accuracy comparison of four methods for training shallow classifiers in ResNet50 on CIFAR100. The x-axis is the depth of the classifier, where x = 5 indicates the ensemble of all classifiers, and the y-axis is the Top-1 accuracy on CIFAR100. It can be observed that the prediction accuracy of the classifiers drops rapidly as the network becomes shallower: the shallowest and second-shallowest classifiers drop by 13% and 8%, respectively. Although the self-distillation algorithm clearly improves on the deeply supervised algorithm and on individually trained networks, it still cannot meet the needs of practical applications. Moreover, in the results for the third classifier, the individually trained network is more accurate than both the self-distillation and the deep supervision algorithms, which indicates that in the shared-backbone structure used by the latter two, there is negative interaction between the different classifiers. Because the features the backbone network can capture are limited by the number of channels, the features corresponding to different classifiers are mixed together, and it is almost impossible for each classifier to separate its own features from the mixture automatically.
To solve this problem and further enhance the performance of the shallow classifiers, attention layers are used to obtain classifier-specific features from the shared backbone network, so that each classifier can learn how to extract the features it needs from the backbone.

To ensure that the attention layers introduce no extra computation or storage cost, we propose a simplified attention layer consisting of a convolutional layer for down-sampling and a deconvolutional layer for up-sampling, followed by a sigmoid activation so that the attention map takes values between 0 and 1. The attention map is then combined with the original features through a dot-product (element-wise) operation to produce classifier-specific features. Its forward computation can be formulated as:
AttentionMaps(W_conv, W_deconv, F) = σ(φ(ψ(F, W_conv), W_deconv))
where ψ and φ denote the convolution and deconvolution functions respectively, F denotes the input feature, and σ denotes the sigmoid function. Note that the batch normalization and ReLU activation functions after the convolutional and deconvolutional layers are omitted here.
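A sketch of the simplified attention layer formulated above is given below in PyTorch: one strided convolution ψ for down-sampling, one transposed convolution φ for up-sampling, a sigmoid σ producing an attention map in (0, 1), and an element-wise product with the original feature map. The kernel sizes and strides are illustrative assumptions; as stated in the text, the batch normalization and ReLU after the two layers are omitted.

import torch
import torch.nn as nn

class SimplifiedAttention(nn.Module):
    """AttentionMaps(W_conv, W_deconv, F) = sigmoid(deconv(conv(F))), applied as a mask."""
    def __init__(self, channels: int):
        super().__init__()
        self.down = nn.Conv2d(channels, channels, kernel_size=3, stride=2, padding=1)          # psi
        self.up = nn.ConvTranspose2d(channels, channels, kernel_size=4, stride=2, padding=1)   # phi

    def forward(self, feature: torch.Tensor) -> torch.Tensor:
        attention_map = torch.sigmoid(self.up(self.down(feature)))  # values in (0, 1)
        return attention_map * feature                              # classifier-specific feature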
Experimental results show that, as shown in Figure 2, the attention layers in SCAN bring a significant accuracy improvement to the shallow classifiers. For example, compared with self-distillation without attention layers, accuracy gains of 5.46%, 4.13% and 5.16% are observed for the shallow classifiers of ResNet50 on CIFAR100.
By means of the attention layers, the scalable neural network allows different classifiers to extract suitable features from the backbone network, which greatly improves the prediction accuracy of the shallow classifiers. The feature-selection process of the neural network can therefore be observed by visualizing the attention maps output by the attention layers. Figure 7 shows the attention-layer outputs for two images. The leftmost picture is the input image. Among the six images on the right, from left to right are the outputs of the attention layers of the three classifiers from shallow to deep; the first row shows the heat-map representation of the attention maps, and the second row shows the input image after a dot-product operation with the attention map as a mask.

Location of attention: in the heat maps, the values are higher at the locations of the shark and the cat, which shows that the different classifiers all concentrate their attention on the most informative regions of the input image, i.e. the bodies of the shark and the cat, while ignoring the background and other irrelevant elements. This shows that even the shallow classifiers are able to judge the importance of each pixel.

Granularity of attention: the attention of the different classifiers also differs. As shown in Figure 7, the shallow classifiers attend more to the contours of the shark and the cat, i.e. to local, high-frequency information, whereas the deep classifiers attend more to the body and texture, i.e. to global, low-frequency information. This is consistent with the information-processing mechanism of neural networks: as the network becomes deeper, its receptive field keeps growing, which gives the attention layers of the deep classifiers the ability to focus on global features.
As a basis, the self-distillation method of the present invention is depicted in Figure 2. Self-distillation training is performed and the self-distillation framework is built through the following steps. First, on any computer that can run text-editing software, the original neural network is modified: according to its depth and original structure, the target convolutional neural network is divided into several shallow parts; for example, ResNet50 is divided into four parts according to its ResBlocks. Second, again by modifying the original network, a classifier is set after each shallow part; each such classifier combines a bottleneck layer and a fully connected layer that are used only during training and can be removed at inference. The main consideration for adding the bottleneck layer is to mitigate the mutual influence between the shallow classifiers and to add the L2 loss from hints. During training of the neural network on NVIDIA GPUs, Intel high-performance CPUs or Google TPU chips, all shallow parts with their corresponding classifiers are trained as student models by distillation from the deepest part, which can conceptually be regarded as the teacher model.

As shown in Figure 2, taking ResNet as an example, ResNet has been divided into four parts according to depth, and an additional bottleneck layer and fully connected layer are set after each part to construct multiple classifiers; all classifiers can be used independently, with different accuracies and correspondingly different response times. As shown in Figure 2, each classifier is trained under three kinds of supervision, and the parts below the dashed line can be removed at inference. The three kinds of supervision are: supervision from the labels, corresponding to loss source 1; supervision from distillation, corresponding to loss source 2; and supervision from hints, corresponding to loss source 3; their respective flows are shown in the figure. A structural sketch of this arrangement is given below.
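The arrangement in Figure 2 can be sketched structurally as follows. The backbone sections, channel counts and the pooling-based classifier heads are simplified placeholders (the actual heads use the bottleneck layer described below), intended only to show how one shared backbone feeds several independently usable classifiers.

import torch.nn as nn

class SelfDistillationNetwork(nn.Module):
    def __init__(self, sections, section_channels, num_classes):
        super().__init__()
        self.sections = nn.ModuleList(sections)            # backbone parts, shallow -> deep
        self.heads = nn.ModuleList([
            nn.Sequential(                                  # simplified stand-in for bottleneck + FC
                nn.Conv2d(c, section_channels[-1], kernel_size=1),
                nn.AdaptiveAvgPool2d(1),
                nn.Flatten(),
                nn.Linear(section_channels[-1], num_classes),
            )
            for c in section_channels
        ])

    def forward(self, x):
        logits, features = [], []
        for section, head in zip(self.sections, self.heads):
            x = section(x)
            features.append(x)       # raw feature maps; the actual method aligns them to the deepest map via the bottleneck for the hint loss
            logits.append(head(x))   # one classifier per part; each usable on its own
        return logits, features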
To improve the performance of the student models, three losses are introduced during the training process:

Loss source 1: the cross-entropy loss from the labels, applied not only to the deepest classifier but to all shallow classifiers as well. It is computed from the labels of the training data set and the output of each classifier's softmax layer. In this way, the knowledge hidden in the training data set is introduced directly from the labels into all classifiers.

Loss source 2: the KL (Kullback-Leibler) divergence loss under the guidance of the teacher model. The KL divergence is computed between the softmax outputs of the student and teacher models and introduced into the softmax layer of each shallow classifier. By introducing the KL divergence, the self-distillation framework brings in the influence of the teacher model and passes the knowledge of the deepest classifier to each shallow classifier.

Loss source 3: the L2 loss from hints. It is obtained by computing the L2 loss between the feature maps of the deepest classifier and those of each shallow classifier. With the help of the L2 loss, the implicit knowledge in the feature maps is introduced into the bottleneck layer of each shallow classifier, which induces the feature maps in all the shallow classifiers' bottleneck layers to fit the feature map of the deepest classifier.

Accordingly, all the newly added layers, i.e. the parts below the dashed line in Figure 2, are applied only during training; they have no effect during inference. Keeping these parts during inference provides a further option for dynamic inference on energy-constrained edge devices.
Specifically, the calculation of the self-distillation method of the present invention is as follows.
Given N samples from M categories X = {x_i, i = 1, ..., N}, the corresponding label set is denoted Y = {y_i, i = 1, ..., N}, y_i ∈ {1, 2, ..., M}. The classifiers in the trained convolutional neural network (the proposed self-distillation has multiple classifiers throughout the network) are denoted Θ = {θ_{i/C}, i = 1, ..., C}, where C is the number of classifiers in the convolutional neural network, and a softmax layer is set after each classifier:

q_i^c = exp(z_i^c / T) / Σ_{j=1}^{M} exp(z_j^c / T)   (1)

where z_i^c is the output of the fully connected layer (FC) of the c-th classifier for the i-th category, q_i^c is the probability of class i produced by classifier θ_{c/C} (the full vector lies in R^M), and T is the temperature hyperparameter of distillation, usually set to 1; the larger its value, the flatter the resulting probability distribution.
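As a small illustration, formula (1) is a temperature-scaled softmax; in the sketch below, z is the fully connected output of one classifier and T the distillation temperature (the variable names are assumptions for illustration).

import torch

def soft_targets(z: torch.Tensor, T: float = 1.0) -> torch.Tensor:
    # Larger T yields a flatter probability distribution, as noted above.
    return torch.softmax(z / T, dim=-1)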
The above neural network is trained by self-distillation on NVIDIA GPUs, Intel high-performance CPUs or Google TPU chips. The supervision of each classifier θ_{i/C}, except the deepest classifier θ_C, comes from three sources. Two hyperparameters α and λ are used to balance them; α and λ control the proportions of the KL-divergence loss function and the feature loss function, and for the deepest classifier α and λ are zero.
(1 - α)·CrossEntropy(q^i, y)   (2)
As in formula (2), the first source is the cross-entropy loss computed from q^i and the label Y, where q^i denotes the output of the softmax layer of classifier θ_{i/C} and CrossEntropy is the cross-entropy function.
α·KL(q^i, q^C)   (3)
As in formula (3), the second source is the Kullback-Leibler divergence between q^i and q^C. The aim is to make the shallow classifiers approximate the deepest classifier, which represents the supervision from distillation. Here q^i denotes the output of the softmax layer of classifier θ_{i/C}, q^C denotes the output of the softmax layer of the deepest classifier, α is the hyperparameter controlling the proportion of the KL-divergence loss, and KL is the Kullback-Leibler divergence.
λ·||F_i - F_C||_2^2   (4)
As in formula (4), the final supervision comes from the hints of the deepest classifier. A hint is defined as the output of the teacher model's hidden layers, and its purpose is to guide the learning of the student model. It works by reducing the distance between the feature maps in the shallow classifiers and the feature map in the deepest classifier. However, because feature maps at different depths have different sizes, extra layers should be added to align them. Instead of using convolutional layers, the present invention uses a bottleneck architecture, which has a positive effect on model performance. F_i and F_C denote the features in classifier θ_{i/C} and in the deepest classifier θ_C, respectively.
In summary, the loss function of the whole neural network is composed of the loss functions of all the classifiers, and can be written as:
loss = Σ_{i=1}^{C} ( (1 - α)·CrossEntropy(q^i, y) + α·KL(q^i, q^C) + λ·||F_i - F_C||_2^2 )   (5)
where q^i denotes the output of the softmax layer of each classifier θ_{i/C}; the training set consists of N samples from M categories X = {x_i, i = 1, ..., N} with the corresponding label set Y = {y_i, i = 1, ..., N}, y_i ∈ {1, 2, ..., M}; CrossEntropy is the cross-entropy function; KL is the Kullback-Leibler divergence; q^C is the output of the softmax layer of the deepest classifier θ_C; F_i and F_C denote the features in classifier θ_{i/C} and in the deepest classifier θ_C, respectively; and α and λ are the hyperparameters controlling the proportions of the KL-divergence loss and the feature loss, which are zero for the deepest classifier.
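A sketch of formula (5) in PyTorch is given below, assuming the per-classifier logits and feature maps are available in shallow-to-deep order and that the feature maps have already been aligned to the deepest classifier's feature map (e.g. by the bottleneck layers). The default values of α, λ and T are placeholders, not values prescribed by the present disclosure.

import torch.nn.functional as F

def self_distillation_loss(logits, features, labels, alpha=0.3, lam=0.03, T=1.0):
    deepest_logits, deepest_feature = logits[-1], features[-1]
    loss = F.cross_entropy(deepest_logits, labels)          # deepest classifier: alpha = lambda = 0
    for logit, feature in zip(logits[:-1], features[:-1]):
        loss = loss + (1 - alpha) * F.cross_entropy(logit, labels)              # source 1: labels
        loss = loss + alpha * F.kl_div(                                         # source 2: distillation
            F.log_softmax(logit / T, dim=1),
            F.softmax(deepest_logits.detach() / T, dim=1),
            reduction="batchmean",
        )
        loss = loss + lam * F.mse_loss(feature, deepest_feature.detach())       # source 3: hint (L2)
    return loss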
The self-distillation training method for convolutional neural networks proposed by the present invention demonstrates its advantages through comparison with deeply supervised nets and with previous distillation methods. The present invention dispenses with the extra teacher model required by previous distillation methods and provides an adaptive-depth architecture for the run-time trade-off between time and accuracy. The experimental results on five convolutional neural networks and two data sets are as follows.

We evaluated self-distillation on five convolutional neural networks (ResNet, WideResNet, Pyramid ResNet, ResNeXt, VGG) and two data sets (CIFAR100, ImageNet). Learning-rate decay, an L2 regularizer and simple data augmentation were used during training. All experiments were implemented in PyTorch on GPU devices.
1.1. Benchmark data sets
CIFAR100: the CIFAR100 data set consists of small (32x32 pixel) RGB images in 100 categories, with 50K images in the training set and 10K images in the test set. The kernel size and stride of the neural networks are adjusted to fit the small image size.

ImageNet: the ImageNet 2012 classification data set consists of 1000 categories based on WordNet, each depicted by thousands of images. We resize them to 256x256 pixel RGB images. Note that the reported ImageNet accuracy is computed on the validation set.
1.2. Comparison with standard training
The experimental results on CIFAR100 and ImageNet are shown in Table 1 and Table 2, respectively. The ensemble result is obtained by simply adding the weighted softmax outputs of all the classifiers (a small sketch of this is given below). It is observed that (i) all neural networks benefit significantly from self-distillation, with an average gain of 2.65% on CIFAR100 and 2.02% on ImageNet; (ii) the deeper the neural network, the larger the gain, e.g. 4.05% for ResNet101 and 2.58% for ResNet18; (iii) in general, the naive ensemble works well on CIFAR100 but has a smaller, sometimes negative, effect on ImageNet, probably because the accuracy of the shallow classifiers drops more sharply than on CIFAR100; (iv) the depth of the classifier plays a more critical role on ImageNet, indicating that there is less redundancy in the neural networks for complex tasks.
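The naive ensemble used for Tables 1 and 2 can be sketched as a weighted sum of the classifiers' softmax outputs; the uniform default weights below are an assumption, since the text only states that weighted softmax outputs are added together.

def ensemble_prediction(softmax_outputs, weights=None):
    # softmax_outputs: list of probability tensors, one per classifier
    weights = weights or [1.0] * len(softmax_outputs)
    combined = sum(w * p for w, p in zip(weights, softmax_outputs))
    return combined.argmax(dim=1)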
Table 1. Accuracy of the different classifiers with the self-distillation algorithm on the CIFAR100 data set.
Table 2. Accuracy of the different classifiers with the self-distillation algorithm on the ImageNet data set.
1.3. Comparison with distillation
Table 3 compares the results of self-distillation with those of five traditional distillation methods on the CIFAR100 data set. Here we focus on the accuracy improvement of each method when the student models have the same amounts of computation and storage. From Table 3 we draw the following observations: (i) all the distillation methods outperform the directly trained student networks; (ii) although self-distillation has no extra teacher, it still outperforms most of the other distillation methods.

A notable advantage of the self-distillation framework is that it requires no extra teacher. In contrast, traditional distillation first has to design and train an over-parameterized teacher model; designing a high-quality teacher model requires extensive experimentation to find the best depth and architecture, and training the over-parameterized teacher takes much longer. These problems are avoided entirely in self-distillation, where both the teacher and student models are sub-parts of the network itself. As depicted in Figure 1, compared with other distillation methods a 4.6-fold speed-up of training time can be achieved by self-distillation.
Table 3. Comparison of the accuracy of the self-distillation algorithm with traditional distillation algorithms.
1.4. Comparison with deeply supervised nets
The main difference between deeply supervised nets and self-distillation is that self-distillation trains the shallow classifiers by distillation from the deepest classifier rather than from the labels. The advantage can be seen in the experiments shown in Table 4, which compares the accuracy of each classifier in ResNet trained on CIFAR100 with deep supervision or with self-distillation. The observations can be summarized as follows: (i) self-distillation outperforms deep supervision for every classifier; (ii) the shallow classifiers benefit more from self-distillation.
Table 4. Comparison of the proposed method with the deeply supervised algorithm on the CIFAR100 data set.
The reason for this phenomenon is easy to understand. In self-distillation, (i) an additional bottleneck layer is added to detect classifier-specific features, which avoids conflicts between the shallow classifiers and the deepest classifier; (ii) the shallow classifiers are trained by distillation rather than from the labels, which improves performance; (iii) better shallow classifiers obtain more discriminative features, which in turn enhances the performance of the deeper classifiers.
1.5. The convolutional neural network trained by the present invention applies all the newly added layers (the parts below the dashed line in Figure 2) only during training; they have no effect during inference. Keeping these parts during inference provides a further option for dynamic inference on energy-constrained edge devices, and can be used as a scalable depth that adapts the inference.
In the prior art, a popular solution for accelerating convolutional neural networks is to design a scalable network, meaning that the depth or width of the neural network can be changed dynamically according to application requirements. For example, in scenarios where response time matters more than accuracy, some layers or channels can be dropped at run time for acceleration.

With a shared backbone network, an adaptive accuracy-acceleration trade-off at inference time becomes possible on resource-constrained edge devices, which means that classifiers of different depths can be used automatically in an application according to the dynamic accuracy requirements of the real world. As can be observed in Table 5: (i) with classifier 3/4, three of the four neural networks outperform their baselines, with an average speed-up of 1.2x; with classifier 2/4, a speed-up of 3.16x can be achieved at an accuracy loss of 3.3%; (ii) since the different classifiers share one backbone network, the ensemble of the three deepest classifiers raises the average accuracy by 0.67% at a computational cost of only 0.05%.
Table 5. Comparison of the proposed method with the deeply supervised algorithm on the CIFAR100 data set.
After the advantages of the self-distillation method have been established by comparison with other methods, the method itself is analyzed further, below, from the perspectives of flat minima, gradients and discriminative features.

The self-distillation method of the present invention is a training technique for improving model performance, not a method for compressing or accelerating a model. Unlike most previous research, which focuses on knowledge transfer between different models, the self-distillation provided by the present invention is a knowledge-transfer method within a single model and has broad application prospects. The self-distillation method of the present invention helps the trained model, i.e. the convolutional neural network, converge to flat minima that are inherently more generalizable. Self-distillation prevents the model from suffering from the vanishing-gradient problem, and the deeper classifiers used in self-distillation extract more discriminative features.
On the basis of the convolutional neural network trained by the above self-distillation, a scalable dynamic prediction method is realized through the control of thresholds.

The higher the confidence of a deep neural network's prediction (the maximum value of the softmax output), the more likely the prediction is correct. The present invention proposes a scalable dynamic prediction method for convolutional neural networks in which each classifier is first given a corresponding threshold. If the confidence of the current classifier's prediction is greater than this threshold, the prediction of that classifier is regarded as successful; otherwise, a deeper classifier continues the prediction, up to the last classifier. When the prediction of the deepest classifier is better than the ensemble of multiple classifiers, the scalable dynamic prediction mechanism sets thresholds only for the first three shallow classifiers and takes the prediction of the deepest classifier as the final result. Since the computation of most of the shallow classifiers is part of the computation of the deep classifiers, this progressively deepening dynamic prediction introduces almost no extra computation.
However, scalable dynamic prediction based on threshold control introduces another problem: how to choose appropriate thresholds for the different classifiers. Suitable thresholds are crucial: (1) a low threshold makes most predictions finish at the shallow classifiers, which effectively reduces response time but also lowers prediction accuracy; (2) conversely, a high threshold makes the vast majority of predictions finish at the deep classifiers, which yields higher prediction accuracy but longer response times; (3) by adjusting the thresholds appropriately, the trade-off between prediction accuracy and response speed can be tuned dynamically. To further exploit the room for acceleration and accuracy improvement, the present invention further uses a genetic algorithm to search for optimal thresholds.

A genetic algorithm obtains the optimal solution, or an approximation of it, for a given optimization objective by simulating the survival, elimination and reproduction of biological individuals in nature. Its main steps are: (1) gene initialization, i.e. randomly generating a number of individuals with different genes as the first generation; (2) computing the environmental fitness, i.e. for each individual, computing how well suited it is to the environment as determined by its genes, a computation defined by the optimization objective; (3) elimination, i.e. removing individuals unsuited to the environment according to the results of the previous step; (4) crossover, i.e. cross-pairing the genes of the individuals remaining after elimination to simulate reproduction and obtain the next generation; (5) mutation, i.e. changing the genes of surviving and newly generated individuals with a certain probability to prevent the optimization from falling into a local optimum. By iterating these steps many times, the genetic algorithm finds an optimal or near-optimal solution to the optimization objective.

In the scalable network, the threshold-search problem is modeled as an optimization problem solved by a genetic algorithm: the optimization objectives are a fast model response and a high prediction accuracy, and the solution is the set of thresholds for the shallow classifiers of the scalable network. To solve the threshold-search problem with a genetic algorithm, the mapping between genes and thresholds must be defined, and the environmental fitness must be computed from the speed-up ratio and accuracy of the scalable network.
First, the decoding relation from a gene to a threshold in the genetic algorithm is defined. A gene in the genetic algorithm is a binary code sequence. During the iterations of the genetic algorithm, a gene must be decoded into its corresponding threshold in order to compute its fitness to the environment. To avoid thresholds that are too small, which would make the accuracy too low, the lower bound of the threshold is set to 0.70. The decoding relation can be written as follows.
σ_i = 1 - 0.3·( Σ_{n=1}^{N} S(n) ) / N
where S(n) denotes the value of the n-th bit of the gene sequence, σ_i denotes the threshold corresponding to the i-th gene, and N denotes the length of the gene sequence. The more "1"s in the gene sequence, the lower the threshold.

Second, the measure of how well a gene fits the environment is defined. Since the objectives of the algorithm include both response speed and prediction accuracy, the definition of the environmental fitness should also include both indicators, as shown in the following formula.
fitness = acceleration ratio + γ·(accuracy - baseline)
where fitness denotes the environmental fitness of each gene; acceleration ratio is the speed-up ratio, i.e. the ratio of the prediction speed of the scalable dynamic prediction to that of the original scalable convolutional neural network, reflecting the acceleration brought by the dynamic scalable prediction; accuracy and baseline denote the prediction accuracy of the scalable dynamic prediction and of the original scalable convolutional neural network, respectively; and γ is a factor balancing response acceleration against prediction accuracy. By adjusting γ dynamically, multiple threshold schemes with different speed-up ratios and different accuracies can be obtained.
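A sketch of the gene decoding and fitness evaluation is given below. The decoding rule follows the relation reconstructed above (lower bound 0.70, more "1" bits giving a lower threshold), and the fitness follows the formula just stated; the evaluate routine, which would run the scalable network on a validation set and return its speed-up ratio and accuracy for a given threshold vector, is a placeholder assumption.

from typing import Callable, List, Sequence, Tuple

def decode_gene(gene: Sequence[int]) -> float:
    """Map a binary gene sequence to a classifier threshold in [0.70, 1.0]."""
    return 1.0 - 0.3 * sum(gene) / len(gene)

def fitness(genes: List[Sequence[int]],
            evaluate: Callable[[List[float]], Tuple[float, float]],
            baseline: float,
            gamma: float) -> float:
    thresholds = [decode_gene(g) for g in genes]       # one gene per shallow classifier
    acceleration_ratio, accuracy = evaluate(thresholds)
    return acceleration_ratio + gamma * (accuracy - baseline)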
The benefit of the scalable dynamic prediction method is not only a higher acceleration than static acceleration, but also the ability to adjust the response speed of a deployed model dynamically, which is crucial for application flexibility. For example, in autonomous-driving applications, when the vehicle travels at high speed the model can use lower thresholds to guarantee a higher processing frame rate, while at low speed it can use higher thresholds to obtain the best prediction accuracy. Compared with traditional algorithms that store multiple models simultaneously, this method only needs to modify thresholds rather than replace the model when switching, which avoids a model vacuum period during switching and provides a safety guarantee for real-world applications.

Compared with static acceleration methods, the scalable dynamic prediction method not only achieves a higher speed-up but is also more reliable. The requirement on the accuracy of a compressed neural network model is often one of the most important criteria for evaluating a compression algorithm; however, compressing and accelerating a neural network is usually accompanied by a drop in accuracy, which is unacceptable in safety-related application scenarios such as autonomous driving and security systems. In the scalable dynamic prediction method, even if the accuracy of every shallow classifier is lower than that of the original scalable convolutional neural network, reasonable classifier scheduling can still be achieved with lower thresholds and the original accuracy of the network can be maintained.
The experimental results of the scalable dynamic prediction method for convolutional neural networks of the present invention on the CIFAR100 data set are as follows. Figures 4 and 5 show the relationship between the amount of computation, the number of parameters and the prediction accuracy of seven deep neural networks of different depths on CIFAR100. The horizontal axis is the number of multiply-add operations required for the deep neural network's prediction, and the vertical axis is its prediction accuracy. The dashed lines and points of each gray level correspond to the same deep neural network; marker points of the same shape on a dashed line are the experimental results of the four (or three) depth-wise classifiers of the same scalable network, while marker points of the same shape off the dashed lines are the comparison results of the original models without the scalable network.

It can be seen that, on the CIFAR100 data set:

(1) In all cases, the second-shallowest classifier of the scalable convolutional neural network exceeds the original model in prediction accuracy. (2) Without any loss of accuracy, a statically run scalable network achieves a 2.17x acceleration and a 3.20x compression. (3) Compared with the comparison results of the original models, each neural network improves its prediction accuracy by 4.05% on average at the cost of only 4.4% additional computation. (4) The ensemble prediction of all classifiers improves the accuracy by 1.11% over the deepest classifier. (5) Within the same deep neural network, the accuracy gains of the shallow classifiers are much larger than those of the deep classifiers, which is mainly brought by the attention layers in the shallow classifiers. (6) Overall, the deeper the neural network, the larger its performance improvement.
At the same time, Table 6 lists the accuracy of the different classifiers of the scalable convolutional neural networks on the CIFAR100 data set; the accuracy of each classifier of each network in the CIFAR100 experiments serves as a numerical complement to the analysis of Figures 4 and 5.

Table 6. Accuracy of the different classifiers of the scalable neural networks on the CIFAR100 data set.
From Table 6: (1) In the experiments on all network structures, even the shallowest classifier of the scalable neural network is already very close to the accuracy of the original model; on average the shallowest classifier of each network is 2.8% below the original model, with the largest gap of 5.25% on ResNet18 and the smallest gap of only 0.19% on WRN44-8. (2) In all network structures, the second-shallowest classifier of the scalable neural network already exceeds the original model; on average it is 1.8% higher than the original model, with the largest improvement of 2.52% on WRN44-8 and the smallest of 0.65% on ResNet18. (3) Across all network structures, the deeper the classifier in the scalable neural network, the higher its accuracy in general; this trend is most pronounced between the shallowest and second-shallowest classifiers, e.g. the first two shallow classifiers of ResNet18 differ by more than 5% in accuracy, whereas the accuracies of the second-deepest and deepest classifiers are almost the same, and in some cases (ResNet152) the second-deepest classifier is even more accurate than the deepest one, probably because the classification task of CIFAR100 is relatively simple. (4) By simply ensembling the prediction results of the multiple classifiers, the scalable network gains more than 1% in accuracy. (5) From the viewpoint of static compression and acceleration, the accuracy of a ResNet18 trained as a scalable neural network already exceeds that of a ResNet152 trained by the traditional method; in application scenarios, replacing the ResNet152 model with the ResNet18 model achieves 5.33x parameter compression and 6.27x acceleration.
Table 7 shows the experimental results of the scalable convolutional neural networks on the CIFAR10 data set. The overall trend is the same as on CIFAR100: all convolutional neural networks achieve a clear accuracy improvement; across all the network structures tested, the average improvement is 0.98%, with a maximum of 1.28% on VGG16(BN) and a minimum of 0.71% on ResNet18.

The absolute accuracy gains on the CIFAR10 data set are slightly lower than on CIFAR100, mainly because the accuracy of the original networks on CIFAR10 is already very high; since the neural networks trained by the traditional method already achieve a high prediction accuracy, further improving the accuracy is more difficult than on the CIFAR100 data set.

Table 7. Accuracy of the different classifiers of the scalable convolutional neural networks on the CIFAR10 data set.
Table 8 shows the accuracy of each classifier in ResNet networks of three different depths on the ImageNet data set. The trend is roughly the same as on CIFAR100, but with the following differences:

(1) On average, each network improves its prediction accuracy by 1.26%, with the largest improvement of 1.41% on ResNet50 and the smallest of 1.08% on ResNet101; this result is worse than on the CIFAR100 data set.

(2) Unlike the results on CIFAR100, the accuracy on ImageNet changes greatly as the classifier position in the network becomes deeper. In the three neural networks tested, the prediction accuracy of the deep classifiers is significantly higher than that of the shallow classifiers. This shows that depth is crucial for the ImageNet data set, and that the parameter redundancy of these networks is far smaller than that of networks trained on CIFAR10 and CIFAR100, most likely because ImageNet classification is more difficult.

(3) Although the accuracy of the deepest classifier improves over the original model, none of the shallow classifiers exceeds the original model. Consequently, simply replacing the original model with a shallow classifier brings acceleration and compression but cannot maintain the accuracy of the original model, so static compression and acceleration methods that directly replace a large model with a small one cannot be used on the ImageNet data set. The scalable dynamic prediction method proposed here solves this problem through reasonable scheduling of multiple classifiers.

Since the accuracy of the shallow classifiers on ImageNet cannot exceed that of the original model, the model-ensemble approach used on the CIFAR100 and CIFAR10 data sets brings no additional accuracy gain. Experimental results show that even more sophisticated ensembling, such as weighted ensemble algorithms, yields no benefit in classification accuracy, so these results are omitted from Table 8.

Table 8. Accuracy of the different classifiers of the scalable networks on the ImageNet data set.
Figure 6 shows the relationship between the accuracy and the speed-up ratio of each neural network under dynamic scalable prediction with different threshold schemes on CIFAR100 and ImageNet. The horizontal axis is the speed-up ratio of the model and the vertical axis is its prediction accuracy. Points of the same color are experimental results for the same network on the same data set; squares in the range x > 1 are the results corresponding to the searched threshold schemes, and triangles on the line x = 1 are the results of the original models.

From Figure 6: (1) On the CIFAR100 data set, without any loss of accuracy, ResNet18, ResNet50 and ResNet152 achieve speed-ups of about 2.5x, 4.4x and 6.2x respectively, which is clearly better than the static acceleration obtained by simple classifier replacement. (2) On the ImageNet data set, without any loss of accuracy, ResNet50 and ResNet101 achieve speed-ups of 1.5x and 2.5x respectively. (3) On the same data set, the deeper the neural network, the more pronounced the acceleration; for example, on ImageNet the acceleration of ResNet101 is clearly better than that of ResNet50, and on CIFAR100 the acceleration of ResNet152 is better than that of ResNet50, which in turn is better than that of ResNet18. (4) Observing the trend of each curve, the speed-up ratio and the accuracy show a clear negative correlation, and judging from the derivative, the accuracy drops faster as the speed-up ratio rises. This phenomenon is caused by a weakness of threshold control: experiments show that although threshold-controlled dynamic scalable prediction requires no extra computation, low thresholds can lead to decisions going out of control, i.e. some decisions exceed the threshold but the final classification is wrong, lowering the overall accuracy of the model.
With the prediction method of the present invention, the final acceleration depends directly on how many classifications are completed by each classifier in the scalable neural network. If most classification decisions are made by the shallow classifiers, the acceleration of the whole network is very pronounced; if most decisions are made by the deep classifiers, the system responds almost as slowly as the original network. By counting the number of decisions made by classifiers at different depths, the acceleration of the system can therefore be characterized accurately.
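As a minimal sketch of this counting, assuming hypothetical per-classifier exit counts and cumulative compute costs (the numbers below are placeholders, not measured values), the expected cost of the scalable network is the exit-fraction-weighted sum of the cumulative cost up to each exit, and the speedup is the full-network cost divided by that expectation:

def estimated_speedup(exit_counts, cost_up_to_exit):
    """Estimate the average speedup of threshold-controlled prediction.

    exit_counts: number of samples answered by each classifier, shallow to deep.
    cost_up_to_exit: cumulative compute cost (e.g. relative FLOPs) of running
        the network up to and including each classifier; the last entry is the
        cost of the full original network.
    """
    total = sum(exit_counts)
    expected_cost = sum(n / total * c for n, c in zip(exit_counts, cost_up_to_exit))
    return cost_up_to_exit[-1] / expected_cost

# Placeholder numbers in the spirit of Figure 8: most samples exit early.
counts = [6200, 3100, 500, 200]   # decisions completed per classifier
costs = [1.0, 1.9, 3.1, 4.1]      # relative cumulative cost per exit
print(f"estimated speedup: {estimated_speedup(counts, costs):.2f}x")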
Figure 8 shows the prediction behavior of the four classifiers on different datasets under the same threshold scheme and the same network (ResNet50). On the horizontal axis, 1/4 to 4/4 denote the four classifiers from shallow to deep; the vertical axis gives the fraction of all predictions made by each classifier.
Figure 8 shows that on the CIFAR10 and CIFAR100 datasets more than 60% of the images can be predicted by the shallowest classifier and more than 90% can be handled by the first two classifiers, which is consistent with the high speedups observed on the CIFAR datasets. On the ImageNet dataset, only about 20% of the images can be predicted by the shallowest classifier and nearly half must be classified by the two deepest classifiers, which explains the comparatively modest acceleration on ImageNet. These observations suggest two potential applications of the depth-scalable network: 1. measuring the redundancy of a neural network; 2. measuring the difficulty of different datasets.
First, the number of predictions made by each classifier on the same dataset indicates the redundancy of the corresponding network layers. For example, in the CIFAR10 and CIFAR100 statistics, the number of predictions completed by the second-deepest and the deepest classifiers is close to zero, which shows that the parts of the network behind these two classifiers contribute little to the overall classification and are highly redundant; they are therefore suitable for further compression by pruning, quantization, and similar algorithms. In contrast, the prediction counts of the first two shallow classifiers sum to nearly one hundred percent, indicating that the parts of the network they sit on are essential to the classification task, have little redundancy, and are not suitable for further compression or acceleration.
Second, the number of predictions made by each classifier on different datasets can serve as a measure of dataset difficulty. The simplest way to compare the difficulty of datasets is to compare the accuracy the same network achieves on each of them; however, classification accuracy is also affected by the number of classes, and since different datasets have different numbers of classes, this measure is biased and tends to underestimate the difficulty of tasks with few classes. Depth scalability offers an alternative: compare the difficulty of datasets by the number of samples that can be classified by the shallow classifiers.
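The following small sketch illustrates this comparison; the dataset names and exit distributions are placeholders rather than measured statistics. Datasets are ranked by the fraction of samples that the shallow classifiers fail to resolve under a fixed network and threshold scheme:

def difficulty_score(exit_fractions, shallow_exits=2):
    """Fraction of samples NOT resolved by the first `shallow_exits` classifiers."""
    return 1.0 - sum(exit_fractions[:shallow_exits])

# Placeholder exit distributions (shallow -> deep) for three datasets.
exit_stats = {
    "dataset_A": [0.65, 0.28, 0.05, 0.02],
    "dataset_B": [0.61, 0.30, 0.06, 0.03],
    "dataset_C": [0.20, 0.32, 0.25, 0.23],
}
for name, fractions in sorted(exit_stats.items(),
                              key=lambda kv: difficulty_score(kv[1])):
    print(f"{name}: difficulty {difficulty_score(fractions):.2f}")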
The present invention also provides a self-distillation training device for a convolutional neural network, comprising a memory for storing a computer program and a processor for implementing, when executing the computer program, the steps of the self-distillation training method for a convolutional neural network described above.
The present invention also provides a computer storage medium storing a computer program which, when executed by a processor, implements the steps of the self-distillation training method for a convolutional neural network described above.
The present invention also provides a scalable dynamic prediction device for a convolutional neural network, comprising a memory for storing a computer program and a processor for implementing, when executing the computer program, the steps of the scalable dynamic prediction method for a convolutional neural network described above.
The present invention also provides another computer storage medium storing a computer program which, when executed by a processor, implements the steps of the scalable dynamic prediction method for a convolutional neural network described above.
Those skilled in the art should understand that the embodiments of the present invention may be provided as a method, a system, or a computer program product. Therefore, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing equipment to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing equipment produce a device for realizing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing equipment to operate in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device that realizes the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
These computer program instructions may also be loaded onto a computer or other programmable data processing equipment, so that a series of operational steps are performed on the computer or other programmable equipment to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable equipment provide steps for realizing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
Finally, it should be noted that the above embodiments are only intended to illustrate, not to limit, the technical solutions of the present invention. Although the present invention has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that the specific implementations of the present invention may still be modified or equivalently replaced, and any modification or equivalent replacement that does not depart from the spirit and scope of the present invention shall fall within the protection scope of the claims of the present invention.

Claims (11)

  1. A self-distillation training method for a convolutional neural network, characterized in that it comprises the following steps:
    Step 1: according to the depth and original structure of the target convolutional neural network, divide the convolutional layers of the target convolutional neural network into n parts at a set depth interval, where n is a positive integer and n ≥ 2; the n-th part is the deepest part and the remaining parts are shallow parts;
    Step 2: set a shallow classifier after each shallow part for classification, and set the deepest classifier after the deepest part for classification; each shallow classifier comprises a bottleneck layer, a fully connected layer, and a softmax layer arranged in sequence for classification, and the deepest classifier comprises a fully connected layer and a softmax layer arranged in sequence for classification;
    the specific features of the shallow classifier are obtained by the following attention module:
    AttentionMaps(W_conv, W_deconv, F) = σ(φ(ψ(F, W_conv)), W_deconv)
    where ψ and φ denote respectively the convolution function of the convolutional layer used for down-sampling and the deconvolution function of the deconvolution layer used for up-sampling, F denotes the input features, σ denotes the Sigmoid function, W_conv denotes the weights of the convolutional layer, and W_deconv denotes the weights of the deconvolution layer;
    Step 3: during training, the deepest part is regarded as the teacher model, and all shallow parts with their corresponding classifiers are trained as student models by distilling from the deepest part, thereby realizing self-distillation training of the convolutional neural network.
  2. The self-distillation training method for a convolutional neural network according to claim 1, characterized in that in step 3, during training, the following three losses are introduced to improve the performance of the student models:
    a cross-entropy loss from the labels: the cross-entropy loss is computed from the labels of the training dataset and the output of the softmax layer of each classifier, and is introduced into all classifiers;
    a KL-divergence loss under the guidance of the teacher model: the KL divergence is computed from the softmax-layer outputs of each student model and the teacher model, and is introduced into the softmax layer of each shallow classifier accordingly;
    an L2 loss from hints: the L2 loss between the feature maps of the deepest classifier and of each shallow classifier is computed and introduced into the bottleneck layer of each shallow classifier accordingly.
  3. The self-distillation training method for a convolutional neural network according to claim 2, characterized in that, specifically, the cross-entropy loss from the labels is given by the following formula:
    (1 − α)·CrossEntropy(q_i, y)
    where q_i denotes the output of the softmax layer of each classifier θ_i/C; the training set is given as N samples X = {x_1, x_2, …, x_N} from M classes, and the corresponding label set is denoted Y = {y_1, y_2, …, y_N}; α is the hyperparameter controlling the proportion of the KL-divergence loss function, KL is the Kullback-Leibler divergence, α of the deepest classifier is zero, and CrossEntropy is the cross-entropy function.
  4. The self-distillation training method for a convolutional neural network according to claim 2, characterized in that, specifically, the KL-divergence loss under the guidance of the teacher model is given by the following formula:
    α·KL(q_i, q_C)
    where α is the hyperparameter controlling the proportion of the KL-divergence loss function, KL is the Kullback-Leibler divergence, q_i denotes the output of the softmax layer of each classifier θ_i/C, q_C is the output of the softmax layer of the deepest classifier θ_C, and α of the deepest classifier is zero.
  5. The self-distillation training method for a convolutional neural network according to claim 2, characterized in that, specifically, the L2 loss from hints is given by the following formula:
    λ·||F_i − F_C||₂²
    where F_i and F_C denote respectively the features in each classifier θ_i/C and the features in the deepest classifier θ_C, λ is the hyperparameter controlling the proportion of the feature loss function, and λ of the deepest classifier is zero.
  6. The self-distillation training method for a convolutional neural network according to claim 2, characterized in that, during training, the loss function of the entire convolutional neural network is composed of the loss functions of all classifiers and is expressed by the following formula:
    loss = Σ_i [ (1 − α)·CrossEntropy(q_i, y) + α·KL(q_i, q_C) + λ·||F_i − F_C||₂² ], summed over all classifiers i = 1, …, C
    where q_i denotes the output of the softmax layer of each classifier θ_i/C; the training set is given as N samples X = {x_1, x_2, …, x_N} from M classes, and the corresponding label set is denoted Y = {y_1, y_2, …, y_N}; CrossEntropy is the cross-entropy function; KL is the Kullback-Leibler divergence; q_C is the output of the softmax layer of the deepest classifier θ_C; F_i and F_C denote respectively the features in each classifier θ_i/C and the features in the deepest classifier θ_C; α and λ are the hyperparameters controlling the proportions of the KL-divergence loss function and of the feature loss function, and α and λ of the deepest classifier are zero.
  7. The self-distillation training method for a convolutional neural network according to claim 1, characterized in that the shallow classifiers, each comprising a bottleneck layer, a fully connected layer, and a softmax layer arranged in sequence, can be removed at inference.
  8. A scalable dynamic prediction method for a convolutional neural network, characterized in that the convolutional neural network is a scalable convolutional neural network obtained by the self-distillation training method according to any one of claims 1-7, and the scalable dynamic prediction method comprises the following steps:
    Step 1: set thresholds for all shallow classifiers and for the deepest classifier respectively;
    Step 2: from shallow to deep, compare the confidence of each classifier's prediction with its threshold; if the confidence of the current classifier's prediction is greater than the threshold of the current classifier, the prediction of the current classifier is considered successful; otherwise the prediction continues with a deeper classifier, up to the classifier of the last part; as the depth increases, the prediction accuracy improves layer by layer;
    Step 3: provided the prediction-confidence requirement is satisfied, select, according to the prediction demand, either the prediction of the shallowest classifier or the prediction with the best accuracy as the output of the scalable dynamic prediction.
  9. The scalable dynamic prediction method for a convolutional neural network according to claim 8, characterized in that in step 1 the threshold of each classifier is searched and optimized by a genetic algorithm; the optimization objectives are a fast response of the convolutional neural network model and a high prediction accuracy, and the optimized solution is the set of thresholds corresponding to the shallow classifiers of the scalable convolutional neural network;
    Step 1.1: define the mapping between genes and thresholds by the following decoding relation from a gene in the genetic algorithm to a threshold:
    Figure PCTCN2020106995-appb-100007
    where τ is the lower bound of the threshold, S(n) denotes the value at the n-th position of the gene sequence, σ denotes the threshold corresponding to the i-th gene, and N denotes the length of the gene sequence; the more "1"s a gene sequence contains, the lower the threshold;
    Step 1.2: from the speedup ratio and the prediction accuracy of the scalable convolutional neural network, define the environmental fitness as:
    fitness = acceleration ratio + γ·(accuracy − baseline)
    where fitness denotes the environmental fitness of each gene; acceleration ratio is the speedup ratio, i.e., the ratio of the prediction response speed of the scalable dynamic prediction to that of the original scalable convolutional neural network; accuracy and baseline denote respectively the prediction accuracy of the scalable dynamic prediction and that of the original scalable convolutional neural network; and γ is a factor balancing response acceleration against prediction accuracy;
    Step 1.3: with the above definitions, search the thresholds with the genetic algorithm:
    first, randomly initialize the genes representing the thresholds;
    second, compute the fitness of all genes to the environment, retain genes with high fitness with a larger probability, and eliminate genes with low fitness with a crossover probability;
    then, pair the retained genes two by two to obtain new genes;
    the above process is performed iteratively, and the thresholds represented by the finally obtained gene with the highest environmental fitness are the thresholds after the optimized search.
  10. The scalable dynamic prediction method for a convolutional neural network according to claim 8, characterized in that, when the prediction of the deepest classifier is better than the ensemble of multiple classifier models, thresholds are set only for the first three shallow classifiers and the prediction of the deepest classifier is taken as the final result.
  11. A self-distillation training device for a convolutional neural network, characterized in that it comprises a memory for storing a computer program and a processor for implementing, when executing the computer program, the steps of the self-distillation training method for a convolutional neural network according to any one of claims 1-7.
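To make the attention module of claim 1 and the combined training loss of claims 2-6 concrete, the following is a minimal PyTorch-style sketch. It assumes PyTorch is available; the layer sizes, the stride used for down- and up-sampling, the multiplication of the attention map with the input features, the detaching of the teacher outputs, and the hyperparameter values are illustrative assumptions rather than the patented implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionModule(nn.Module):
    """Sketch of AttentionMaps(W_conv, W_deconv, F) = sigma(deconv(conv(F)))."""
    def __init__(self, channels: int, stride: int = 2):
        super().__init__()
        # psi: convolution for down-sampling; phi: deconvolution for up-sampling.
        self.down = nn.Conv2d(channels, channels, kernel_size=3,
                              stride=stride, padding=1)
        self.up = nn.ConvTranspose2d(channels, channels, kernel_size=3,
                                     stride=stride, padding=1, output_padding=1)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # Assumes an even spatial size so down-/up-sampling shapes match.
        attention = torch.sigmoid(self.up(self.down(feat)))
        return attention * feat  # re-weight the input features

def self_distillation_loss(logits_list, feature_list, labels,
                           alpha: float = 0.3, lam: float = 0.03) -> torch.Tensor:
    """Sum over classifiers of (1-alpha)*CE + alpha*KL(q_i, q_C) + lam*L2(F_i, F_C).

    logits_list / feature_list are ordered from the shallowest classifier to the
    deepest one; the deepest classifier is the teacher (alpha = lam = 0 for it).
    """
    teacher_logits = logits_list[-1].detach()
    teacher_feat = feature_list[-1].detach()
    total = F.cross_entropy(logits_list[-1], labels)   # deepest: label loss only
    for logits, feat in zip(logits_list[:-1], feature_list[:-1]):
        ce = F.cross_entropy(logits, labels)
        kl = F.kl_div(F.log_softmax(logits, dim=1),
                      F.softmax(teacher_logits, dim=1),
                      reduction="batchmean")
        hint = F.mse_loss(feat, teacher_feat)
        total = total + (1.0 - alpha) * ce + alpha * kl + lam * hint
    return total

In this sketch the deepest classifier's outputs are detached so that the teacher is not updated by the distillation terms; this is one common design choice, and the claims do not prescribe this detail.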