WO2021023202A1 - Self-distillation training method and device for convolutional neural network, and scalable dynamic prediction method - Google Patents

Self-distillation training method and device for convolutional neural network, and scalable dynamic prediction method Download PDF

Info

Publication number
WO2021023202A1
WO2021023202A1 (PCT application No. PCT/CN2020/106995)
Authority
WO
WIPO (PCT)
Prior art keywords
classifier
neural network
convolutional neural
layer
shallow
Prior art date
Application number
PCT/CN2020/106995
Other languages
French (fr)
Chinese (zh)
Inventor
马恺声
张林峰
Original Assignee
交叉信息核心技术研究院(西安)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 交叉信息核心技术研究院(西安)有限公司
Publication of WO2021023202A1 publication Critical patent/WO2021023202A1/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/12 - Computing arrangements based on biological models using genetic models
    • G06N3/126 - Evolutionary algorithms, e.g. genetic algorithms or genetic programming

Definitions

  • The invention relates to the training of convolutional neural networks, and in particular to a self-distillation training method and device for convolutional neural networks, and a scalable dynamic prediction method.
  • Convolutional neural networks have been widely deployed in various application scenarios. To extend their use to areas where accuracy is critical, researchers have studied ways to improve accuracy through deeper or wider network structures, which bring an exponential growth in computation and storage costs and therefore delay the response time.
  • Knowledge distillation (KD) is one of the common compression methods; its inspiration comes from the transfer of knowledge from a teacher to a student.
  • The key strategy is to position a compact student model as an approximation of an over-parameterized teacher model. The student model can thereby obtain significant performance improvements, sometimes even surpassing the teacher model.
  • The implementation of knowledge distillation includes two steps: the first step trains a large teacher model, and the second step distills knowledge from the teacher model into the student model. However, it has the following problems. The first problem is the inefficiency of knowledge transfer, meaning that the student model rarely exploits all the knowledge of the teacher model; an outstanding student model that surpasses its teacher is still rare.
  • Another problem is how to design and train an appropriate teacher model.
  • The existing distillation framework requires a great deal of effort and experimentation to find the best structure of the teacher model, which takes a relatively long time.
  • The third problem is that the teacher model and the student model each work in their own way, and knowledge transfer flows between different models; this requires building multiple models, which is cumbersome and yields low accuracy.
  • In the prior art, the proposed self-distillation training method enables efficient training, but the accuracy of the classifiers during self-distillation is low, and each classifier cannot automatically separate its own features, which impairs classifier function and thus reduces the accuracy of the training method.
  • Neural networks have advantages in handling non-linear problems that other methods cannot match.
  • Predictive control is well suited to constrained operation at process limits; combining neural networks with predictive control therefore exploits their respective advantages and provides a good solution to the control of non-linear, time-varying, strongly constrained, large-lag industrial processes, so convolutional neural networks are widely used in the field of prediction. In the prior art, predictions based on convolutional neural networks must consider both response speed and the confidence of the prediction results; to satisfy different prediction requirements, the algorithms of multiple models are stored at the same time, and different models are swapped in for different response-speed and accuracy requirements. A vacuum period is formed during the switching process, which brings security risks to real applications.
  • In view of the problems in the prior art, the present invention provides a self-distillation training method and device for a convolutional neural network, and a scalable dynamic prediction method.
  • The design is reasonable, efficient, and simple; the self-distillation-trained model lies in a flatter region of the loss landscape, and the optimization of its parameters is more robust.
  • A self-distillation training method for a convolutional neural network includes the following steps.
  • Step 1: According to the depth and original structure of the target convolutional neural network, divide its convolutional layers into n parts at set depth intervals, where n is a positive integer and n ≥ 2; the nth part is the deepest part, and the remaining parts are shallow parts.
  • Step 2: Set a shallow classifier after each shallow part for classification, and set the deepest classifier after the deepest part for classification.
  • Each shallow classifier consists of a bottleneck layer, a fully connected layer, and a softmax layer arranged in sequence for classification.
  • The deepest classifier consists of a fully connected layer and a softmax layer arranged in sequence for classification.
  • The classifier-specific features of each shallow classifier are obtained by the following attention module:
  • AttentionMaps(W_conv, W_deconv, F) = σ(φ(ψ(F, W_conv), W_deconv))
  • where ψ and φ respectively denote the convolution function of the convolutional layer used for down-sampling and the deconvolution function of the deconvolutional layer used for up-sampling, F denotes the input feature, σ denotes the sigmoid function, W_conv denotes the weights of the convolutional layer, and W_deconv denotes the weights of the deconvolutional layer.
  • Step 3: During training, the deepest part is regarded as the teacher model, and all shallow parts with their corresponding classifiers are trained as student models by distilling from the deepest part, thereby realizing self-distillation training of the convolutional neural network; a minimal code sketch of such a divided network is given below.
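  • The following PyTorch-style sketch shows one way such a divided backbone with auxiliary classifiers could be assembled; all class and argument names (`ShallowClassifier`, `SelfDistillationNet`, `backbone_parts`) are illustrative assumptions, not the patent's code, and the attention module described later is omitted here for brevity.

```python
# A minimal sketch, assuming a PyTorch backbone already split into n sequential
# parts (e.g. the four ResBlock stages of ResNet50). Names are illustrative only.
import torch
import torch.nn as nn

class ShallowClassifier(nn.Module):
    """Bottleneck + fully connected layer; softmax is applied inside the loss.

    The bottleneck is assumed to project the shallow feature map to the same
    channel count and spatial size as the deepest feature map, so that the L2
    hint loss can be computed directly.
    """
    def __init__(self, in_channels, out_channels, out_size, num_classes):
        super().__init__()
        self.bottleneck = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(out_size),   # spatial alignment with the deepest part
        )
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(out_channels, num_classes)

    def forward(self, x):
        feat = self.bottleneck(x)                     # hint feature F_i
        logits = self.fc(self.pool(feat).flatten(1))  # classification logits
        return feat, logits

class SelfDistillationNet(nn.Module):
    def __init__(self, backbone_parts, shallow_classifiers, deepest_classifier):
        super().__init__()
        self.parts = nn.ModuleList(backbone_parts)         # n parts; part n is the teacher
        self.shallow = nn.ModuleList(shallow_classifiers)  # one classifier per shallow part
        self.deepest = deepest_classifier                  # pooling + FC of the deepest part

    def forward(self, x):
        feats, logits = [], []
        for part, clf in zip(self.parts[:-1], self.shallow):
            x = part(x)
            f, y = clf(x)                                  # shallow (student) outputs
            feats.append(f)
            logits.append(y)
        x = self.parts[-1](x)                              # deepest (teacher) part
        feats.append(x)                                    # teacher feature map F_C
        logits.append(self.deepest(x))                     # teacher logits
        return feats, logits
```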
  • A scalable dynamic prediction method for a convolutional neural network, wherein the convolutional neural network is a scalable convolutional neural network obtained by any of the self-distillation training methods described above; the scalable dynamic prediction method includes the following steps.
  • Step 1: Set a threshold for each shallow classifier and for the deepest classifier.
  • Step 2: From shallow to deep, compare the confidence of each classifier's prediction with its threshold; if the confidence of the current classifier's prediction is greater than the threshold of that classifier, the prediction is considered successful; otherwise, the next deeper classifier continues to predict, up to the last classifier, as sketched in the code below. As the depth increases, the prediction accuracy increases layer by layer.
  • Step 3: Subject to the required prediction confidence, select the shallowest prediction result or the prediction result with the best accuracy as the output of the scalable dynamic prediction, according to the prediction demand.
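  • A minimal sketch of the threshold-controlled early-exit inference described above, reusing the `SelfDistillationNet` interface sketched earlier; taking the maximum softmax probability as the confidence and the function name `scalable_predict` are assumptions for illustration, not the patent's implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def scalable_predict(model, x, thresholds):
    """Threshold-controlled early exit over classifiers ordered shallow-to-deep.

    `model(x)` is assumed to return (features, logits) for every classifier, as
    in the SelfDistillationNet sketch above; `thresholds` holds one confidence
    threshold per classifier, with the last entry set to 0.0 so the deepest
    classifier always answers when the shallower ones decline. A single input
    image (batch size 1) is assumed. For simplicity this sketch evaluates all
    classifiers; a real deployment would stop the forward pass at the accepted
    classifier to actually save computation.
    """
    _, all_logits = model(x)
    for logits, tau in zip(all_logits, thresholds):
        probs = F.softmax(logits, dim=1)
        confidence, prediction = probs.max(dim=1)
        if confidence.item() >= tau:       # this classifier's prediction is accepted
            return prediction, confidence
    return prediction, confidence          # fall back to the deepest classifier
```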
  • The present invention also provides a self-distillation training device for a convolutional neural network, including a memory for storing a computer program and a processor for implementing the steps of the self-distillation training method of the convolutional neural network described above when executing the computer program.
  • Compared with the prior art, the present invention has the following beneficial technical effects:
  • The self-distillation training method of the convolutional neural network of the present invention significantly enhances the performance of the convolutional neural network, that is, improves its accuracy, by reducing the size of the network rather than expanding it.
  • Unlike traditional knowledge distillation, which is a method of knowledge transfer between networks that drives a student network to approximate the softmax output of a pre-trained teacher network, the self-distillation framework proposed here distills knowledge within the network itself: the network is first divided into several parts, and the knowledge in the deeper parts of the network is then squeezed into the shallow parts.
  • On the basis that the output of each shallow classifier is available, the scalable dynamic prediction method of the present invention can dynamically adjust the trade-off between prediction accuracy and response speed by adjusting the thresholds appropriately, and can efficiently schedule the multiple classifiers in the network. The ability to dynamically adjust the response speed of the model after deployment greatly improves the flexibility of the convolutional neural network in prediction applications; when switching operating points, only the thresholds need to be modified and the model itself does not change, which avoids a vacuum period during switching and brings a safety guarantee to real applications.
  • Further, in the scalable dynamic prediction, an automated threshold search is realized by a genetic algorithm, which further improves the acceleration effect of the neural network and thereby achieves a synergistic improvement of acceleration and accuracy.
  • FIG. 1 is a schematic diagram of the comparison of training complexity, training time and accuracy between traditional distillation and distillation of the present invention for the CIFAR100 data set.
  • Figure 2 is a schematic diagram of the self-distillation method for ResNet described in the example of the present invention.
  • Figure 3 shows the accuracy of the classifiers trained by different methods in the example of the present invention.
  • Fig. 4 is a diagram showing the relationship between the amount of calculation of the scalable network and the accuracy in the example of the present invention.
  • Fig. 5 is a diagram showing the relationship between the amount of parameters of the scalable network and the accuracy in the example of the present invention.
  • FIG. 6 is a diagram showing the relationship between the speed-up ratio and the accuracy of the scalable dynamic prediction in the scalable dynamic prediction method described in the example of the present invention.
  • Fig. 7 shows the visualization results of attention maps of different classifiers in the scalable neural network in the example of the present invention.
  • Fig. 8 is a schematic diagram of the number of classifications completed by each classifier obtained by the prediction method in the example of the present invention on different data sets.
  • The present invention proposes a self-distillation training method for convolutional neural networks, which achieves the highest possible accuracy when training compact models and overcomes the shortcomings of traditional distillation.
  • In traditional distillation, the first step is to train a large teacher model, and the second step is to distill knowledge from the teacher model into the student model.
  • In contrast, the one-step self-distillation framework provided by the method of the present invention directs the distilled knowledge to the student model itself.
  • The proposed self-distillation not only requires less training time (from 26.98 hours to 5.87 hours on CIFAR100, a 4.6-fold reduction), but also achieves higher accuracy (from 79.33% with traditional distillation on ResNet50 to 81.04%).
  • In addition, the accuracy of the shallow classifiers is improved, which enhances their performance.
  • The present invention can be used in any system based on a convolutional neural network, such as image classification systems, face recognition systems, object detection systems, and image semantic segmentation systems.
  • The training method described in the present invention can be used to improve the performance of such systems; it provides not only high accuracy but also high speed, and can improve speed and accuracy synergistically.
  • A comparison of the accuracy of four methods for training the shallow classifiers of ResNet50 on CIFAR100 is provided.
  • The observations show that as the classifier becomes shallower, its prediction accuracy drops rapidly: the shallowest classifier and the sub-shallow classifier lose 13% and 8% accuracy, respectively.
  • Although the self-distillation algorithm is significantly improved compared with the deep supervision algorithm and the separate training method, it still cannot meet the needs of practical applications.
  • The accuracy of the separately trained networks is better than that of the self-distillation and deep supervision algorithms, which shows that in the latter's shared-backbone structure there is negative interaction between the different classifiers. Because the features the backbone network can provide are limited by the number of channels, the features corresponding to different classifiers become entangled, and it is almost impossible for each classifier to automatically separate its own features from the mixed features.
  • Therefore, an attention layer is used to extract classifier-specific features from the shared backbone network, so that each classifier can learn how to obtain the features it needs from the backbone.
  • A simplified attention layer is adopted, which includes a convolutional layer for down-sampling and a deconvolutional layer for up-sampling.
  • The attention layer is followed by a sigmoid activation to obtain an attention map with values between 0 and 1.
  • The dot product of the attention map and the original feature is then taken to generate the classifier-specific feature. The forward computation can be formulated as:
  • AttentionMaps(W_conv, W_deconv, F) = σ(φ(ψ(F, W_conv), W_deconv))
  • where ψ and φ represent the convolution and deconvolution functions respectively, F represents the input feature, and σ represents the sigmoid function. Note that the batch normalization and ReLU activation functions after the convolution and deconvolution layers are omitted here.
  • The scalable neural network enables different classifiers to extract suitable features from the backbone network through the attention layer, which greatly improves the prediction accuracy of the shallow classifiers. By visualizing the attention maps output by the attention layers, the feature-selection process of the neural network can be observed.
  • Figure 7 shows the output of the attention layers for two images. The leftmost picture is the input image; the six images on the right show, from left to right, the outputs of the attention layers of the three classifiers from shallow to deep. The first row is the heat-map representation of the attention map, and the second row is the input image after a dot-product operation with the attention map used as a mask.
  • Position of attention: in the heat maps, the positions of the shark and the cat have higher values, which means that the different classifiers place their main attention on the most informative positions in the input picture, that is, the bodies of the shark and the cat, while the background and other irrelevant elements are ignored. This shows that even a shallow classifier has the ability to judge the importance of each pixel.
  • Granularity of attention: the attention of different classifiers also differs. As shown in Figure 7, the shallow classifiers pay more attention to the contours of the shark and the cat, that is, to local and high-frequency information, while the deep classifier pays more attention to the body and texture, that is, to global and low-frequency information. This pattern is consistent with the information-processing mechanism of biological visual systems: as the network becomes deeper, the receptive field of the neural network keeps growing, which gives the deep classifiers the ability to focus on global features in the attention layer.
  • The self-distillation method of the present invention is depicted in FIG. 2.
  • Self-distillation training is carried out through the following steps to build the self-distillation framework. First, on any computer capable of running text-editing software, the original neural network is modified: the target convolutional neural network is divided into several shallow parts according to its depth and original structure; for example, ResNet50 is divided into 4 parts according to its ResBlocks. Second, again by modifying the original neural network, a classifier is set after each shallow part; each classifier combines a bottleneck layer and a fully connected layer, which are used only during training and can be removed at inference.
  • The main reasons for adding the bottleneck layer are to reduce the interference between the shallow classifiers and to add the L2 loss from the hints.
  • Training can be performed on NVIDIA graphics cards, Intel high-performance CPUs, or Google TPU chips.
  • All shallow parts with their corresponding classifiers are trained as student models by distilling from the deepest part, which can conceptually be regarded as the teacher model.
  • ResNet is divided into four parts according to depth; after each part, an additional bottleneck layer and a fully connected layer are set to form the classifiers. All classifiers can be used independently, at different accuracies and corresponding response times; as shown in Figure 2, each classifier is trained under three kinds of supervision, and the parts below the dotted line can be removed during inference.
  • The three kinds of supervision are: supervision from the labels (loss source 1), supervision from distillation (loss source 2), and supervision from the hints (loss source 3); their corresponding flows are shown in the figure.
  • Loss source 1: cross-entropy loss from the labels, applied not only to the deepest classifier but also to all shallow classifiers. It is computed from the labels of the training data set and the output of the softmax layer of each classifier. In this way, the knowledge hidden in the training data set is introduced directly from the labels to all classifiers.
  • Loss source 2: the Kullback-Leibler (KL) divergence loss guided by the teacher model.
  • Through the KL divergence, the self-distillation framework transfers the knowledge of the teacher model, via its deepest classifier, to each shallow classifier.
  • Loss source 3: the L2 loss from the hints, obtained by computing the L2 distance between the feature maps of the deepest classifier and of each shallow classifier. With the help of the L2 loss, the implicit knowledge in the feature maps is introduced into the bottleneck layer of each shallow classifier, which induces the feature maps in the bottleneck layers of all classifiers to fit the feature map of the deepest classifier.
  • The trained convolutional neural network produced by the proposed self-distillation contains multiple classifiers, denoted θ_{i/C}, i = 1, ..., C, where C is the number of classifiers in the convolutional neural network, and a softmax layer is set after each classifier.
  • q_i ∈ R^M denotes the softened class-probability distribution output by classifier θ_{i/C}, computed from that classifier's logits z_i as q_i = softmax(z_i / T).
  • T is the distillation temperature hyper-parameter, usually set to 1; the larger its value, the smoother the predicted probability distribution.
  • The above neural network is trained with self-distillation on an NVIDIA graphics card, an Intel high-performance CPU, or a Google TPU chip.
  • The supervision of each classifier θ_{i/C}, except the deepest classifier θ_C, comes from three sources, balanced by two hyper-parameters α and λ, which control the proportions of the KL-divergence loss and the feature (hint) loss; for the deepest classifier, α and λ are zero.
  • The first source is the cross-entropy loss computed from q_i and the label y, where q_i is the output of the softmax layer of classifier θ_{i/C} and CrossEntropy is the cross-entropy function.
  • The second source is the Kullback-Leibler divergence between q_i and q_C. The goal is to make each shallow classifier approximate the deepest classifier, which constitutes the supervision from distillation; here q_i is the output of the softmax layer of classifier θ_{i/C}, q_C is the output of the softmax layer of the deepest classifier, α is the hyper-parameter controlling the proportion of the KL-divergence loss, and KL denotes the Kullback-Leibler divergence.
  • The final supervision comes from the hints of the deepest classifier.
  • A hint is defined as the output of a hidden layer of the teacher model, and its purpose is to guide the learning of the student model. It works by reducing the distance between the feature map of each shallow classifier and the feature map of the deepest classifier. However, because feature maps at different depths have different sizes, additional layers must be added to align them. Instead of convolutional layers, the present invention uses a bottleneck architecture, which shows a positive effect on model performance.
  • F_i and F_C denote the features in classifier θ_{i/C} and the features in the deepest classifier θ_C, respectively.
  • The loss function of the entire neural network is composed of the loss functions of all classifiers and can be written as:
  • loss = Σ_{i=1}^{C} ( (1 − α)·CrossEntropy(q_i, y) + α·KL(q_i, q_C) + λ·||F_i − F_C||₂² )
  • where q_i is the output of the softmax layer of each classifier θ_{i/C}; the training set consists of N samples from M categories, with the corresponding label set {y_i}, y_i ∈ {1, 2, ..., M}; CrossEntropy is the cross-entropy function; KL is the Kullback-Leibler divergence; q_C is the output of the softmax layer of the deepest classifier θ_C; and F_i and F_C denote the features in each classifier θ_{i/C} and in the deepest classifier θ_C, respectively.
  • α and λ are hyper-parameters controlling the proportions of the KL-divergence loss and the feature loss; for the deepest classifier, α and λ are zero.
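  • A minimal PyTorch-style sketch of the combined loss above; the default hyper-parameter values, the detaching of the teacher's signals, and the T² scaling of the KL term are common practice and assumptions here, not values specified in the patent.

```python
import torch
import torch.nn.functional as F

def self_distillation_loss(feats, logits, labels, alpha=0.3, lam=0.03, T=1.0):
    """Total loss over all classifiers.

    feats  : list of feature maps [F_1, ..., F_C]  (F_C from the deepest classifier)
    logits : list of logits       [z_1, ..., z_C]  (z_C from the deepest classifier)
    labels : ground-truth class indices y
    alpha, lam : weights of the KL and hint terms (zero for the deepest classifier);
    T : distillation temperature. Hint features are assumed shape-aligned with F_C.
    """
    teacher_logits = logits[-1]
    teacher_feat = feats[-1].detach()                        # teacher signals detached (assumption)
    loss = F.cross_entropy(teacher_logits, labels)           # deepest classifier: label loss only
    soft_teacher = F.softmax(teacher_logits.detach() / T, dim=1)
    for f, z in zip(feats[:-1], logits[:-1]):                # shallow (student) classifiers
        ce = F.cross_entropy(z, labels)                      # loss source 1: labels
        kl = F.kl_div(F.log_softmax(z / T, dim=1),           # loss source 2: distillation
                      soft_teacher, reduction="batchmean") * (T * T)
        hint = F.mse_loss(f, teacher_feat)                   # loss source 3: L2 (mean-squared) hint
        loss = loss + (1.0 - alpha) * ce + alpha * kl + lam * hint
    return loss
```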
  • The advantages of the self-distillation training method for convolutional neural networks proposed by the present invention are shown by comparing it with deeply supervised networks and previous distillation methods.
  • The present invention dispenses with the additional teacher model required by previous distillation methods and provides an adaptive-depth architecture for the time-accuracy trade-off at runtime.
  • The specific experimental results on five convolutional neural networks and two data sets are as follows.
  • CIFAR100: the CIFAR100 data set consists of small (32x32 pixel) RGB images in 100 categories and contains 50K training images and 10K test images. The kernel sizes and strides of the neural networks are adjusted to fit the small image size.
  • ImageNet: the ImageNet2012 classification data set consists of 1000 categories based on WordNet, each depicted by thousands of images, which are resized to 256x256 pixel RGB images. Note that the reported ImageNet accuracy is computed on the validation set.
  • Table 1: the accuracy of the different classifiers of the self-distillation algorithm on the CIFAR100 data set.
  • Table 2: the accuracy of the different classifiers of the self-distillation algorithm on the ImageNet data set.
  • Table 3 compares the results of self-distillation with those of five traditional distillation methods on the CIFAR100 data set, focusing on the accuracy improvement of each method when the student models have the same amount of computation and storage. From Table 3 we draw the following observations: (i) all distillation methods outperform the directly trained student network; (ii) although self-distillation uses no additional teacher, it is still superior to most other distillation methods.
  • A significant advantage of the self-distillation framework is that it requires no additional teacher.
  • Traditional distillation first needs to design and train an over-parameterized teacher model; designing a high-quality teacher model requires many experiments to find the best depth and architecture, and training an over-parameterized teacher model takes much longer.
  • The convolutional neural network trained by the present invention uses the newly added layers (the parts below the dotted line in Figure 2) only during training; they exert no influence during inference. Optionally keeping these parts at inference provides another option for dynamic inference on energy-constrained edge devices, adapting the depth of inference in a scalable way.
  • A popular solution for accelerating convolutional neural networks is to design a scalable network, meaning that the depth or width of the neural network can be changed dynamically according to application requirements; for example, in scenarios where response time matters more than accuracy, some layers or channels can be discarded at runtime for acceleration.
  • As can be observed in Table 5: (i) with classifier 3/4, three of the four neural networks outperform their baselines, with an average speed-up of 1.2 times, and with classifier 2/4 a speed-up of 3.16 times can be achieved at an accuracy loss of 3.3%; (ii) since the different classifiers share one backbone network, the ensemble of the three deepest classifiers raises the average accuracy by 0.67% at a computational overhead of only 0.05%.
  • The self-distillation method itself is analyzed further below.
  • The following analyzes the advantages of the self-distillation method from the perspectives of flat minima, gradients, and discriminating features.
  • The self-distillation method of the present invention is a training technique for improving model performance, rather than a method for compressing or accelerating a model.
  • The self-distillation provided by the present invention is a method of knowledge transfer within a model and has broad application prospects.
  • The self-distillation method of the present invention helps the trained model, that is, the convolutional neural network, converge to a flat minimum that generalizes well; self-distillation prevents the model from encountering the vanishing-gradient problem; and the deeper classifiers used in self-distillation extract more discriminating features.
  • The present invention provides a scalable dynamic prediction method for a convolutional neural network, in which each classifier is first given a corresponding threshold. If the confidence of the current classifier's prediction is greater than its threshold, the prediction of that classifier is considered successful; otherwise, the next deeper classifier continues to predict, up to the last classifier.
  • The scalable dynamic prediction mechanism sets thresholds only for the first three shallow classifiers; the prediction of the deepest classifier is taken as the final result. Since most of the computation of a shallow classifier is part of the computation of the deeper classifiers, such gradually deepening dynamic prediction brings almost no extra computation.
  • Threshold-controlled scalable dynamic prediction introduces another problem, namely how to select appropriate thresholds for the different classifiers.
  • A suitable threshold is very important: (1) a lower threshold lets most predictions be completed by the shallow classifiers, which effectively reduces response time but also lowers prediction accuracy; (2) likewise, a higher threshold lets most predictions be completed by the deep classifiers, which achieves higher prediction accuracy but lengthens response time; (3) by adjusting the thresholds reasonably, the trade-off between prediction accuracy and response speed can be tuned dynamically. To further explore the room for acceleration and accuracy improvement, the present invention further uses a genetic algorithm to optimize the thresholds.
  • The genetic algorithm obtains the optimal solution, or an approximation of it, for a formulated optimization goal by simulating how biological individuals in nature survive, are eliminated, and reproduce.
  • The main process includes: (1) gene initialization, that is, randomly generating a certain number of individuals with different genes as the first generation; (2) computing environmental fitness, that is, for each individual, computing its fitness to the environment as determined by its genes, where this computation is defined by the optimization goal; (3) elimination, that is, removing the individuals that are poorly suited to the environment based on the result of the previous step; (4) crossover, that is, cross-combining the genes of the surviving individuals to simulate reproduction and obtain the next generation of individuals; (5) gene mutation, that is, changing the genes of the surviving and newly generated individuals with a certain probability to prevent the optimization from falling into a local optimum. Through multiple iterations of this process, the genetic algorithm finds the optimal or a near-optimal solution for the optimization goal. A generic sketch of this loop is given below.
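  • The following is a schematic sketch of the genetic-algorithm loop just described (initialization, fitness evaluation, elimination, crossover, mutation); the function and parameter names (`genetic_search`, `fitness`, `survive_ratio`) are placeholders supplied for illustration, not part of the patent.

```python
import random

def genetic_search(fitness, gene_length, population=50, generations=100,
                   survive_ratio=0.5, mutation_rate=0.05):
    """Generic GA over binary gene sequences; `fitness` maps a gene list to a score."""
    # (1) initialize genes: random binary individuals form the first generation
    pop = [[random.randint(0, 1) for _ in range(gene_length)] for _ in range(population)]
    for _ in range(generations):
        # (2) environmental fitness determined by the optimization goal
        scored = sorted(pop, key=fitness, reverse=True)
        # (3) elimination of individuals poorly suited to the environment
        survivors = scored[: int(population * survive_ratio)]
        # (4) crossover: breed children from pairs of survivors
        children = []
        while len(survivors) + len(children) < population:
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, gene_length)
            children.append(a[:cut] + b[cut:])
        pop = survivors + children
        # (5) mutation with a small probability to escape local optima
        for individual in pop:
            for i in range(gene_length):
                if random.random() < mutation_rate:
                    individual[i] = 1 - individual[i]
    return max(pop, key=fitness)
```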
  • The threshold search problem is modeled as an optimization problem solved by the genetic algorithm.
  • The optimization goals are a fast response speed of the neural network model and a high prediction accuracy.
  • The solution being optimized corresponds to the thresholds of the shallow classifiers in the scalable network.
  • In the process of using the genetic algorithm to solve the threshold search problem, it is necessary to define the mapping between genes and thresholds, and to compute the environmental fitness from the speed-up ratio and the accuracy of the scalable network.
  • The decoding relationship can be as follows: S(n) denotes the value at the nth position of the gene sequence, the decoded value of the i-th gene segment gives the threshold corresponding to the i-th classifier, and N denotes the length of the gene sequence; in a gene sequence, the greater the number of "1"s, the lower the resulting threshold.
  • The acceleration ratio is the ratio of the response speed of the scalable dynamic prediction to the response speed of the original scalable convolutional neural network, and measures the acceleration effect.
  • Accuracy and baseline denote, respectively, the prediction accuracy of the scalable dynamic prediction and the prediction accuracy of the original scalable convolutional neural network.
  • A balance factor weighs response acceleration against prediction accuracy. A hedged sketch of the corresponding decoding and fitness computation is given below.
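  • The sketch below shows one possible gene-to-threshold decoding and fitness function consistent with the description above; the segment length, the linear mapping (more "1"s giving a lower threshold), the additive fitness form, and the `evaluate` callback are all illustrative assumptions, not the patent's exact formulas.

```python
def decode_thresholds(gene, num_classifiers=3, bits_per_threshold=8):
    """Map a binary gene sequence to one threshold per shallow classifier.

    More '1's in a segment produce a lower threshold; the segment length and
    the linear mapping are illustrative assumptions.
    """
    thresholds = []
    for i in range(num_classifiers):
        segment = gene[i * bits_per_threshold:(i + 1) * bits_per_threshold]
        thresholds.append(1.0 - sum(segment) / len(segment))
    return thresholds

def threshold_fitness(gene, evaluate, baseline_accuracy, beta=1.0):
    """Environmental fitness combining acceleration and accuracy.

    `evaluate(thresholds)` is assumed to run scalable dynamic prediction on a
    validation set and return (acceleration_ratio, accuracy); `baseline_accuracy`
    is the accuracy of the original network and `beta` is the balance factor
    between response acceleration and prediction accuracy.
    """
    acceleration_ratio, accuracy = evaluate(decode_thresholds(gene))
    return acceleration_ratio + beta * (accuracy - baseline_accuracy)
```

  • This fitness could be plugged into the generic `genetic_search` loop sketched earlier, e.g. `genetic_search(lambda g: threshold_fitness(g, evaluate, baseline_accuracy), gene_length=24)`.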
  • The benefits of the scalable dynamic prediction method are not only a higher acceleration effect than static acceleration, but also the ability to adjust the response speed of the model dynamically after deployment, which makes applications extremely flexible.
  • The model can use a lower threshold to guarantee a higher processing frame rate.
  • The model can use a higher threshold to obtain the best prediction accuracy.
  • This method only needs to modify the thresholds when switching operating points, without changing the model, which avoids a vacuum period during switching and brings a safety guarantee to real applications.
  • Compared with static acceleration methods, the scalable dynamic prediction method not only has a higher acceleration ratio but is also more reliable.
  • The accuracy requirement on the compressed neural network model is often one of the most important evaluation criteria for neural network compression algorithms.
  • The compression and acceleration of neural networks are often accompanied by a decrease in accuracy; such results are unacceptable in some safety-related application scenarios, such as autonomous driving and security systems.
  • With the scalable dynamic prediction method, even if the accuracy of all shallow classifiers is lower than that of the original scalable convolutional neural network model, reasonable classifier scheduling can still be achieved through lower thresholds while maintaining the original accuracy of the neural network.
  • The experimental results of the scalable dynamic prediction method of the convolutional neural network of the present invention on the CIFAR100 data set are shown in Figures 4 and 5, which plot the relationship between the computation amount, the parameter amount, and the prediction accuracy of seven different deep neural networks on CIFAR100.
  • The horizontal axis represents the number of multiply-add operations required for prediction by the deep neural network.
  • The vertical axis represents its prediction accuracy.
  • The dashed lines and dots of each gray level correspond to the same deep neural network.
  • Marker points of the same shape on a dashed line represent the experimental results of the four (or three) classifiers of the same scalable network, and marker points of the same shape off the dashed line represent the comparison results of the original model without the scalable network.
  • The second shallow classifier of the scalable convolutional neural network can exceed the original model in prediction accuracy.
  • A statically run scalable network can achieve 2.17 times acceleration and 3.20 times compression.
  • Each neural network increases its prediction accuracy by 4.05% at the cost of only 4.4% additional computation.
  • The ensemble of the prediction results of all classifiers can improve the accuracy by 1.11%.
  • The accuracy of the shallow classifiers is improved substantially, which is mainly brought about by the attention layers in the shallow classifiers.
  • The deeper the neural network, the greater its performance improvement.
  • This enhancement trend is most obvious in the shallowest and sub-shallow classifiers.
  • The first two shallow classifiers of ResNet18 differ in accuracy by more than 5%.
  • The accuracy of the sub-deep classifier is almost the same as that of the deepest classifier.
  • The accuracy of the sub-deep classifier may even be higher than that of the deepest classifier; this phenomenon may be caused by the relatively simple classification task of the CIFAR100 data set.
  • The scalable network achieves an accuracy increase of more than 1%.
  • Table 7 shows the experimental results of the scalable convolutional neural network on the CIFAR10 data set. The overall trend is the same as on CIFAR100: all convolutional neural networks achieve a significant improvement in accuracy. Among all the network structures in the experiment, the average increase is 0.98%, the highest is 1.28% on VGG16 (BN), and the lowest is 0.71% on ResNet18.
  • The absolute accuracy increase on the CIFAR10 data set is slightly lower than that on CIFAR100.
  • The main reason for this phenomenon is that the accuracy of the original networks on CIFAR10 is already very high; because a neural network trained by the traditional method can already achieve a high prediction accuracy, further improving it is more difficult than on the CIFAR100 data set.
  • Table 7: the accuracy of the different classifiers of the scalable convolutional neural network on the CIFAR10 data set.
  • Table 8 shows the accuracy of each classifier of ResNet networks of three different depths on the ImageNet data set. The trend is roughly the same as on CIFAR100, but the following differences remain:
  • on average, each network increases its prediction accuracy by 1.26%;
  • the effect is most obvious on ResNet50, with an increase of 1.41%, and least obvious on ResNet101, with an increase of 1.08%; this result is worse than the result on the CIFAR100 data set.
  • FIG. 6 shows the relationship between the accuracy and the speed-up ratio of each neural network obtained by dynamic scalable prediction under different threshold schemes on CIFAR100 and ImageNet.
  • The horizontal axis represents the acceleration ratio of the model.
  • The vertical axis represents the prediction accuracy of the model.
  • Points of the same color represent experimental results for the same network on the same data set.
  • The squares in the range x > 1 indicate the experimental results corresponding to the searched threshold schemes.
  • The final acceleration effect depends directly on the number of classifications completed by each classifier in the scalable neural network. If a large proportion of the classification decisions are completed by the shallow classifiers, the acceleration of the whole neural network is very pronounced; if a large proportion are completed by the deep classifiers, the response speed of the system is almost the same as that of the original network. By counting the number of decisions made by classifiers at different depths, the acceleration effect of the system can be assessed accurately.
  • With the same threshold scheme and the same neural network (ResNet50), the prediction behavior of the four classifiers on different data sets is compared.
  • The labels 1/4 to 4/4 on the horizontal axis denote the four classifiers from shallow to deep, and the value on the vertical axis denotes the ratio of the number of predictions completed by that classifier to the total number of predictions.
  • The number of predictions completed by the different classifiers on the same data set can be used to judge the redundancy of different network layers.
  • The numbers of predictions completed by the sub-deep classifier and the deepest classifier are close to zero, which shows that the network parts belonging to these two classifiers play a small role in the overall classification.
  • The sum of the prediction counts of the first two shallow classifiers is close to 100%, indicating that the network parts belonging to these two classifiers play a major role in the classification task and have little redundancy, so they are not suitable for further compression or acceleration.
  • The numbers of predictions completed by the different classifiers on different data sets can be used as a measure of the difficulty of those data sets.
  • The easiest way to compare the difficulty of different data sets is to directly compare the prediction accuracy that the same network can achieve on each data set.
  • However, the accuracy of a classification task is also affected by the number of categories.
  • Because the number of categories differs between data sets, this measurement is affected accordingly and underestimates the difficulty of classification tasks with fewer categories.
  • Depth scalability provides another way of thinking: comparing the difficulty of different data sets by comparing the number of samples classified by the shallow classifiers.
  • The present invention also provides a self-distillation training device for a convolutional neural network, including a memory for storing a computer program and a processor for implementing the steps of the self-distillation training method of the convolutional neural network described above when executing the computer program.
  • The present invention also provides a computer storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps of the self-distillation training method of the convolutional neural network described above are realized.
  • The present invention also provides a scalable dynamic prediction device for a convolutional neural network, including a memory for storing a computer program and a processor for implementing the steps of the scalable dynamic prediction method of the convolutional neural network described above when executing the computer program.
  • The present invention also provides another computer storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps of the scalable dynamic prediction method of the convolutional neural network described above are realized.
  • The embodiments of the present invention may be provided as methods, systems, or computer program products. Therefore, the present invention may take the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
  • These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device, and the instruction device implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions may also be loaded onto a computer or other programmable data processing equipment, so that a series of operation steps are executed on the computer or other programmable equipment to produce computer-implemented processing, and the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physiology (AREA)
  • Genetics & Genomics (AREA)
  • Image Analysis (AREA)

Abstract

The present invention provides a self-distillation training method for a convolutional neural network, for significantly improving the performance of a convolutional neural network by reducing, rather than expanding, the size of the network. When knowledge is distilled within the network itself, the network is first divided into several parts; then, the knowledge in the deeper parts of the network is squeezed into the shallow parts. Without sacrificing response time, self-distillation greatly improves the performance of the convolutional neural network, achieving an average accuracy improvement of 2.65%; the improvement ranges from a minimum of 0.61% on ResNeXt to a maximum of 4.07% on VGG19. Combined with the enhanced extraction of shallow-classifier features by an attention layer, the accuracy of the shallow classifiers is significantly improved; thus, a convolutional neural network with multiple outputs can be regarded as multiple convolutional neural networks, and the output of each shallow classifier can be used according to different needs.

Description

Self-distillation training method and device for a convolutional neural network, and scalable dynamic prediction method
Technical field
The invention relates to the training of convolutional neural networks, and in particular to a self-distillation training method and device for convolutional neural networks, and a scalable dynamic prediction method.
Background
Convolutional neural networks have been widely deployed in various application scenarios. To extend their use to areas where accuracy is critical, researchers have studied ways to improve accuracy through deeper or wider network structures, which bring an exponential growth in computation and storage costs and therefore delay the response time.
With the help of convolutional neural networks, applications such as image classification, object detection, and semantic segmentation are developing at an unprecedented speed. However, in applications that do not tolerate errors, such as autonomous driving and medical image analysis, prediction and analysis accuracy must be improved further while response times must be shortened, which poses great challenges for current convolutional neural networks. Methods in the prior art focus either on performance improvement or on reducing computational resources in order to reduce response time. On the one hand, ResNet 150 or even larger ResNet 1000 networks have been proposed, which gain a very limited performance margin at a large computational cost. On the other hand, given a predefined performance loss relative to a best-effort network, various techniques have been proposed to reduce computation and storage so as to match the constraints imposed by hardware; such techniques include lightweight network design, pruning, and quantization, among which knowledge distillation (KD) is one of the feasible ways to achieve model compression.
As one of the common compression methods, knowledge distillation is inspired by the transfer of knowledge from a teacher to a student. Its key strategy is to position a compact student model as an approximation of an over-parameterized teacher model, so that the student model can obtain significant performance improvements, sometimes even surpassing the teacher model. By replacing the over-parameterized teacher model with a compact student model, high compression and fast acceleration can be achieved. The implementation of knowledge distillation includes two steps: the first step trains a large teacher model, and the second step distills knowledge from the teacher model into the student model. However, it has the following problems. The first problem is the inefficiency of knowledge transfer: the student model rarely exploits all the knowledge of the teacher model, and an outstanding student model that surpasses its teacher is still rare. Another problem is how to design and train an appropriate teacher model: the existing distillation framework requires a great deal of effort and experimentation to find the best structure of the teacher model, which takes a relatively long time. The third problem is that the teacher model and the student model each work in their own way and knowledge transfer flows between different models, which requires building multiple models, is cumbersome, and yields low accuracy.
In the prior art, the proposed self-distillation training method enables efficient training, but the accuracy of the classifiers during self-distillation is low, and each classifier cannot automatically separate its own features, which impairs classifier function and thus reduces the accuracy of the training method.
At the same time, neural networks have advantages in handling non-linear problems that other methods cannot match, and predictive control is well suited to constrained operation at process limits; combining neural networks with predictive control therefore exploits their respective advantages and provides a good solution to the control of non-linear, time-varying, strongly constrained, large-lag industrial processes, so convolutional neural networks are widely used in the field of prediction. In the prior art, predictions based on convolutional neural networks must consider both response speed and the confidence of the prediction results; to satisfy different prediction requirements, the algorithms of multiple models are stored at the same time, and different models are swapped in for different response-speed and accuracy requirements, which creates a vacuum period during switching and brings security risks to real applications.
Summary of the invention
In view of the problems in the prior art, the present invention provides a self-distillation training method and device for a convolutional neural network, and a scalable dynamic prediction method; the design is reasonable, efficient, and simple, the self-distillation-trained model lies in a flatter region of the loss landscape, and the optimization of its parameters is more robust.
The present invention is realized through the following technical solutions:
A self-distillation training method for a convolutional neural network includes the following steps.
Step 1: According to the depth and original structure of the target convolutional neural network, divide its convolutional layers into n parts at set depth intervals, where n is a positive integer and n ≥ 2; the nth part is the deepest part, and the remaining parts are shallow parts.
Step 2: Set a shallow classifier after each shallow part for classification, and set the deepest classifier after the deepest part for classification; each shallow classifier consists of a bottleneck layer, a fully connected layer, and a softmax layer arranged in sequence, and the deepest classifier consists of a fully connected layer and a softmax layer arranged in sequence.
The classifier-specific features of each shallow classifier are obtained by the following attention module:
AttentionMaps(W_conv, W_deconv, F) = σ(φ(ψ(F, W_conv), W_deconv))
where ψ and φ respectively denote the convolution function of the convolutional layer used for down-sampling and the deconvolution function of the deconvolutional layer used for up-sampling, F denotes the input feature, σ denotes the sigmoid function, W_conv denotes the weights of the convolutional layer, and W_deconv denotes the weights of the deconvolutional layer.
Step 3: During training, the deepest part is regarded as the teacher model, and all shallow parts with their corresponding classifiers are trained as student models by distilling from the deepest part, thereby realizing self-distillation training of the convolutional neural network.
A scalable dynamic prediction method for a convolutional neural network, wherein the convolutional neural network is a scalable convolutional neural network obtained by any of the self-distillation training methods described above; the scalable dynamic prediction method includes the following steps.
Step 1: Set a threshold for each shallow classifier and for the deepest classifier.
Step 2: From shallow to deep, compare the confidence of each classifier's prediction with its threshold; if the confidence of the current classifier's prediction is greater than the threshold of that classifier, the prediction is considered successful; otherwise, the next deeper classifier continues to predict, up to the last classifier. As the depth increases, the prediction accuracy increases layer by layer.
Step 3: Subject to the required prediction confidence, select the shallowest prediction result or the prediction result with the best accuracy as the output of the scalable dynamic prediction, according to the prediction demand.
The present invention also provides a self-distillation training device for a convolutional neural network, including a memory for storing a computer program and a processor for implementing the steps of the self-distillation training method of the convolutional neural network described above when executing the computer program.
Compared with the prior art, the present invention has the following beneficial technical effects:
The self-distillation training method of the convolutional neural network of the present invention significantly enhances the performance of the convolutional neural network, that is, improves its accuracy, by reducing the size of the network rather than expanding it. Unlike traditional knowledge distillation, which is a method of knowledge transfer between networks that drives a student network to approximate the softmax output of a pre-trained teacher network, the self-distillation framework proposed here distills knowledge within the network itself: the network is first divided into several parts, and the knowledge in the deeper parts is then squeezed into the shallow parts. Without sacrificing response time, self-distillation greatly improves the performance of the convolutional neural network, achieving an average accuracy improvement of 2.65%, ranging from a minimum of 0.61% on ResNeXt to a maximum of 4.07% on VGG19. Combined with the enhanced extraction of shallow-classifier features by the attention layer, the accuracy of the shallow classifiers is significantly improved, so that a convolutional neural network with multiple outputs can be regarded as multiple convolutional neural networks and the output of each shallow classifier can be used according to different needs.
On the basis that the output of each shallow classifier is available, the scalable dynamic prediction method of the present invention can dynamically adjust the trade-off between prediction accuracy and response speed by adjusting the thresholds appropriately, and can efficiently schedule the multiple classifiers in the network. The ability to dynamically adjust the response speed of the model after deployment greatly improves the flexibility of the convolutional neural network in prediction applications; when switching operating points, only the thresholds need to be modified and the model itself does not change, which avoids a vacuum period during switching and brings a safety guarantee to real applications.
Further, in the scalable dynamic prediction, an automated threshold search is realized by a genetic algorithm, which further improves the acceleration effect of the neural network and thereby achieves a synergistic improvement of acceleration and accuracy.
Description of the drawings
Figure 1 is a schematic comparison of training complexity, training time and accuracy between traditional distillation and the distillation of the present invention on the CIFAR100 data set.

Figure 2 is a schematic diagram of the self-distillation method applied to ResNet described in the example of the present invention.

Figure 3 shows the accuracy of classifiers trained with different methods in the example of the present invention.

Figure 4 shows the relationship between the amount of computation and the accuracy of the scalable network described in the example of the present invention.

Figure 5 shows the relationship between the number of parameters and the accuracy of the scalable network described in the example of the present invention.

Figure 6 shows the relationship between the speed-up ratio and the accuracy of scalable dynamic prediction in the scalable dynamic prediction method described in the example of the present invention.

Figure 7 shows visualizations of the attention maps of different classifiers in the scalable neural network described in the example of the present invention.

Figure 8 is a schematic diagram of the number of classifications completed by each classifier, obtained with the prediction method described in the example of the present invention on different data sets.
Detailed description
The present invention is described in further detail below in conjunction with specific embodiments, which are intended to explain rather than limit the present invention.
As shown in Figure 1, the present invention proposes a self-distillation training method for convolutional neural networks that achieves the highest possible accuracy when training compact models and overcomes the shortcomings of traditional distillation. Instead of the two steps of traditional distillation, namely first training a large teacher model and then distilling knowledge from the teacher model into the student model, the one-step self-distillation framework provided by the method of the present invention directs training at the student model itself. The proposed self-distillation not only requires less training time (on CIFAR100, from 26.98 hours down to 5.87 hours, a 4.6-fold reduction), but also achieves higher accuracy (on ResNet50, from 79.33% with traditional distillation to 81.04%). To make the method more useful in real application scenarios, the present invention further enhances performance by improving the accuracy of the shallow classifiers. The present invention can be used in any system based on convolutional neural networks, such as image classification systems, face recognition systems, object detection systems and image semantic segmentation systems. Applying the training method of the present invention when training the neural networks required by such systems improves their performance, offering both high accuracy and high speed, with speed and accuracy improved jointly.
Figure 3 provides an accuracy comparison of four methods for training shallow classifiers in ResNet50 on CIFAR100. The x-axis is the depth of the classifier, where x = 5 indicates the ensemble of all classifiers, and the y-axis is the Top-1 accuracy on CIFAR100. It can be observed that the prediction accuracy of the classifiers drops rapidly as the network becomes shallower: the shallowest and second-shallowest classifiers drop by 13% and 8%, respectively. Although the self-distillation algorithm clearly improves on the deeply supervised algorithm and on individually trained networks, it still cannot meet the needs of practical applications. Moreover, in the results for the third classifier, the individually trained network is more accurate than both the self-distillation and the deep supervision algorithms, which indicates that in the shared-backbone structure used by the latter two, there is negative interaction between the different classifiers. Because the features the backbone network can capture are limited by the number of channels, the features corresponding to different classifiers are mixed together, and it is almost impossible for each classifier to separate its own features from the mixture automatically.
To solve this problem and further enhance the performance of the shallow classifiers, attention layers are used to obtain classifier-specific features from the shared backbone network, so that each classifier can learn how to extract the features it needs from the backbone.

To ensure that the attention layers introduce no extra computation or storage cost, we propose a simplified attention layer consisting of a convolutional layer for down-sampling and a deconvolutional layer for up-sampling, followed by a sigmoid activation so that the attention map takes values between 0 and 1. The attention map is then combined with the original features through a dot-product (element-wise) operation to produce classifier-specific features. Its forward computation can be formulated as:
AttentionMaps(W_conv, W_deconv, F) = σ(φ(ψ(F, W_conv), W_deconv))
where ψ and φ denote the convolution and deconvolution functions respectively, F denotes the input feature, and σ denotes the sigmoid function. Note that the batch normalization and ReLU activation functions after the convolutional and deconvolutional layers are omitted here.
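A sketch of the simplified attention layer formulated above is given below in PyTorch: one strided convolution ψ for down-sampling, one transposed convolution φ for up-sampling, a sigmoid σ producing an attention map in (0, 1), and an element-wise product with the original feature map. The kernel sizes and strides are illustrative assumptions; as stated in the text, the batch normalization and ReLU after the two layers are omitted.

import torch
import torch.nn as nn

class SimplifiedAttention(nn.Module):
    """AttentionMaps(W_conv, W_deconv, F) = sigmoid(deconv(conv(F))), applied as a mask."""
    def __init__(self, channels: int):
        super().__init__()
        self.down = nn.Conv2d(channels, channels, kernel_size=3, stride=2, padding=1)          # psi
        self.up = nn.ConvTranspose2d(channels, channels, kernel_size=4, stride=2, padding=1)   # phi

    def forward(self, feature: torch.Tensor) -> torch.Tensor:
        attention_map = torch.sigmoid(self.up(self.down(feature)))  # values in (0, 1)
        return attention_map * feature                              # classifier-specific feature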
Experimental results show that, as shown in Figure 2, the attention layers in SCAN bring a significant accuracy improvement to the shallow classifiers. For example, compared with self-distillation without attention layers, accuracy gains of 5.46%, 4.13% and 5.16% are observed for the shallow classifiers of ResNet50 on CIFAR100.
By means of the attention layers, the scalable neural network allows different classifiers to extract suitable features from the backbone network, which greatly improves the prediction accuracy of the shallow classifiers. The feature-selection process of the neural network can therefore be observed by visualizing the attention maps output by the attention layers. Figure 7 shows the attention-layer outputs for two images. The leftmost picture is the input image. Among the six images on the right, from left to right are the outputs of the attention layers of the three classifiers from shallow to deep; the first row shows the heat-map representation of the attention maps, and the second row shows the input image after a dot-product operation with the attention map as a mask.

Location of attention: in the heat maps, the values are higher at the locations of the shark and the cat, which shows that the different classifiers all concentrate their attention on the most informative regions of the input image, i.e. the bodies of the shark and the cat, while ignoring the background and other irrelevant elements. This shows that even the shallow classifiers are able to judge the importance of each pixel.

Granularity of attention: the attention of the different classifiers also differs. As shown in Figure 7, the shallow classifiers attend more to the contours of the shark and the cat, i.e. to local, high-frequency information, whereas the deep classifiers attend more to the body and texture, i.e. to global, low-frequency information. This is consistent with the information-processing mechanism of neural networks: as the network becomes deeper, its receptive field keeps growing, which gives the attention layers of the deep classifiers the ability to focus on global features.
As a basis, the self-distillation method of the present invention is depicted in Figure 2. Self-distillation training is performed and the self-distillation framework is built through the following steps. First, on any computer that can run text-editing software, the original neural network is modified: according to its depth and original structure, the target convolutional neural network is divided into several shallow parts; for example, ResNet50 is divided into four parts according to its ResBlocks. Second, again by modifying the original network, a classifier is set after each shallow part; each such classifier combines a bottleneck layer and a fully connected layer that are used only during training and can be removed at inference. The main consideration for adding the bottleneck layer is to mitigate the mutual influence between the shallow classifiers and to add the L2 loss from hints. During training of the neural network on NVIDIA GPUs, Intel high-performance CPUs or Google TPU chips, all shallow parts with their corresponding classifiers are trained as student models by distillation from the deepest part, which can conceptually be regarded as the teacher model.

As shown in Figure 2, taking ResNet as an example, ResNet has been divided into four parts according to depth, and an additional bottleneck layer and fully connected layer are set after each part to construct multiple classifiers; all classifiers can be used independently, with different accuracies and correspondingly different response times. As shown in Figure 2, each classifier is trained under three kinds of supervision, and the parts below the dashed line can be removed at inference. The three kinds of supervision are: supervision from the labels, corresponding to loss source 1; supervision from distillation, corresponding to loss source 2; and supervision from hints, corresponding to loss source 3; their respective flows are shown in the figure. A structural sketch of this arrangement is given below.
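The arrangement in Figure 2 can be sketched structurally as follows. The backbone sections, channel counts and the pooling-based classifier heads are simplified placeholders (the actual heads use the bottleneck layer described below), intended only to show how one shared backbone feeds several independently usable classifiers.

import torch.nn as nn

class SelfDistillationNetwork(nn.Module):
    def __init__(self, sections, section_channels, num_classes):
        super().__init__()
        self.sections = nn.ModuleList(sections)            # backbone parts, shallow -> deep
        self.heads = nn.ModuleList([
            nn.Sequential(                                  # simplified stand-in for bottleneck + FC
                nn.Conv2d(c, section_channels[-1], kernel_size=1),
                nn.AdaptiveAvgPool2d(1),
                nn.Flatten(),
                nn.Linear(section_channels[-1], num_classes),
            )
            for c in section_channels
        ])

    def forward(self, x):
        logits, features = [], []
        for section, head in zip(self.sections, self.heads):
            x = section(x)
            features.append(x)       # raw feature maps; the actual method aligns them to the deepest map via the bottleneck for the hint loss
            logits.append(head(x))   # one classifier per part; each usable on its own
        return logits, features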
To improve the performance of the student models, three losses are introduced during the training process:

Loss source 1: the cross-entropy loss from the labels, applied not only to the deepest classifier but to all shallow classifiers as well. It is computed from the labels of the training data set and the output of each classifier's softmax layer. In this way, the knowledge hidden in the training data set is introduced directly from the labels into all classifiers.

Loss source 2: the KL (Kullback-Leibler) divergence loss under the guidance of the teacher model. The KL divergence is computed between the softmax outputs of the student and teacher models and introduced into the softmax layer of each shallow classifier. By introducing the KL divergence, the self-distillation framework brings in the influence of the teacher model and passes the knowledge of the deepest classifier to each shallow classifier.

Loss source 3: the L2 loss from hints. It is obtained by computing the L2 loss between the feature maps of the deepest classifier and those of each shallow classifier. With the help of the L2 loss, the implicit knowledge in the feature maps is introduced into the bottleneck layer of each shallow classifier, which induces the feature maps in all the shallow classifiers' bottleneck layers to fit the feature map of the deepest classifier.

Accordingly, all the newly added layers, i.e. the parts below the dashed line in Figure 2, are applied only during training; they have no effect during inference. Keeping these parts during inference provides a further option for dynamic inference on energy-constrained edge devices.
Specifically, the calculation of the self-distillation method of the present invention is as follows.
Given N samples from M categories X = {x_i, i = 1, ..., N}, the corresponding label set is denoted Y = {y_i, i = 1, ..., N}, y_i ∈ {1, 2, ..., M}. The classifiers in the trained convolutional neural network (the proposed self-distillation has multiple classifiers throughout the network) are denoted Θ = {θ_{i/C}, i = 1, ..., C}, where C is the number of classifiers in the convolutional neural network, and a softmax layer is set after each classifier:

q_i^c = exp(z_i^c / T) / Σ_{j=1}^{M} exp(z_j^c / T)   (1)

where z_i^c is the output of the fully connected layer (FC) of the c-th classifier for the i-th category, q_i^c is the probability of class i produced by classifier θ_{c/C} (the full vector lies in R^M), and T is the temperature hyperparameter of distillation, usually set to 1; the larger its value, the flatter the resulting probability distribution.
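As a small illustration, formula (1) is a temperature-scaled softmax; in the sketch below, z is the fully connected output of one classifier and T the distillation temperature (the variable names are assumptions for illustration).

import torch

def soft_targets(z: torch.Tensor, T: float = 1.0) -> torch.Tensor:
    # Larger T yields a flatter probability distribution, as noted above.
    return torch.softmax(z / T, dim=-1)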
The above neural network is trained by self-distillation on NVIDIA GPUs, Intel high-performance CPUs or Google TPU chips. The supervision of each classifier θ_{i/C}, except the deepest classifier θ_C, comes from three sources. Two hyperparameters α and λ are used to balance them; α and λ control the proportions of the KL-divergence loss function and the feature loss function, and for the deepest classifier α and λ are zero.
(1 - α)·CrossEntropy(q^i, y)   (2)
As in formula (2), the first source is the cross-entropy loss computed from q^i and the label Y, where q^i denotes the output of the softmax layer of classifier θ_{i/C} and CrossEntropy is the cross-entropy function.
α·KL(q^i, q^C)   (3)
As in formula (3), the second source is the Kullback-Leibler divergence between q^i and q^C. The aim is to make the shallow classifiers approximate the deepest classifier, which represents the supervision from distillation. Here q^i denotes the output of the softmax layer of classifier θ_{i/C}, q^C denotes the output of the softmax layer of the deepest classifier, α is the hyperparameter controlling the proportion of the KL-divergence loss, and KL is the Kullback-Leibler divergence.
λ·||F_i - F_C||_2^2   (4)
As in formula (4), the final supervision comes from the hints of the deepest classifier. A hint is defined as the output of the teacher model's hidden layers, and its purpose is to guide the learning of the student model. It works by reducing the distance between the feature maps in the shallow classifiers and the feature map in the deepest classifier. However, because feature maps at different depths have different sizes, extra layers should be added to align them. Instead of using convolutional layers, the present invention uses a bottleneck architecture, which has a positive effect on model performance. F_i and F_C denote the features in classifier θ_{i/C} and in the deepest classifier θ_C, respectively.
In summary, the loss function of the whole neural network is composed of the loss functions of all the classifiers, and can be written as:
loss = Σ_{i=1}^{C} ( (1 - α)·CrossEntropy(q^i, y) + α·KL(q^i, q^C) + λ·||F_i - F_C||_2^2 )   (5)
where q^i denotes the output of the softmax layer of each classifier θ_{i/C}; the training set consists of N samples from M categories X = {x_i, i = 1, ..., N} with the corresponding label set Y = {y_i, i = 1, ..., N}, y_i ∈ {1, 2, ..., M}; CrossEntropy is the cross-entropy function; KL is the Kullback-Leibler divergence; q^C is the output of the softmax layer of the deepest classifier θ_C; F_i and F_C denote the features in classifier θ_{i/C} and in the deepest classifier θ_C, respectively; and α and λ are the hyperparameters controlling the proportions of the KL-divergence loss and the feature loss, which are zero for the deepest classifier.
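A sketch of formula (5) in PyTorch is given below, assuming the per-classifier logits and feature maps are available in shallow-to-deep order and that the feature maps have already been aligned to the deepest classifier's feature map (e.g. by the bottleneck layers). The default values of α, λ and T are placeholders, not values prescribed by the present disclosure.

import torch.nn.functional as F

def self_distillation_loss(logits, features, labels, alpha=0.3, lam=0.03, T=1.0):
    deepest_logits, deepest_feature = logits[-1], features[-1]
    loss = F.cross_entropy(deepest_logits, labels)          # deepest classifier: alpha = lambda = 0
    for logit, feature in zip(logits[:-1], features[:-1]):
        loss = loss + (1 - alpha) * F.cross_entropy(logit, labels)              # source 1: labels
        loss = loss + alpha * F.kl_div(                                         # source 2: distillation
            F.log_softmax(logit / T, dim=1),
            F.softmax(deepest_logits.detach() / T, dim=1),
            reduction="batchmean",
        )
        loss = loss + lam * F.mse_loss(feature, deepest_feature.detach())       # source 3: hint (L2)
    return loss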
The self-distillation training method for convolutional neural networks proposed by the present invention demonstrates its advantages through comparison with deeply supervised nets and with previous distillation methods. The present invention dispenses with the extra teacher model required by previous distillation methods and provides an adaptive-depth architecture for the run-time trade-off between time and accuracy. The experimental results on five convolutional neural networks and two data sets are as follows.

We evaluated self-distillation on five convolutional neural networks (ResNet, WideResNet, Pyramid ResNet, ResNeXt, VGG) and two data sets (CIFAR100, ImageNet). Learning-rate decay, an L2 regularizer and simple data augmentation were used during training. All experiments were implemented in PyTorch on GPU devices.
1.1. Benchmark data sets
CIFAR100: the CIFAR100 data set consists of small (32x32 pixel) RGB images in 100 categories, with 50K images in the training set and 10K images in the test set. The kernel size and stride of the neural networks are adjusted to fit the small image size.

ImageNet: the ImageNet 2012 classification data set consists of 1000 categories based on WordNet, each depicted by thousands of images. We resize them to 256x256 pixel RGB images. Note that the reported ImageNet accuracy is computed on the validation set.
1.2. Comparison with standard training
The experimental results on CIFAR100 and ImageNet are shown in Table 1 and Table 2, respectively. The ensemble result is obtained by simply adding the weighted softmax outputs of all the classifiers (a small sketch of this is given below). It is observed that (i) all neural networks benefit significantly from self-distillation, with an average gain of 2.65% on CIFAR100 and 2.02% on ImageNet; (ii) the deeper the neural network, the larger the gain, e.g. 4.05% for ResNet101 and 2.58% for ResNet18; (iii) in general, the naive ensemble works well on CIFAR100 but has a smaller, sometimes negative, effect on ImageNet, probably because the accuracy of the shallow classifiers drops more sharply than on CIFAR100; (iv) the depth of the classifier plays a more critical role on ImageNet, indicating that there is less redundancy in the neural networks for complex tasks.
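The naive ensemble used for Tables 1 and 2 can be sketched as a weighted sum of the classifiers' softmax outputs; the uniform default weights below are an assumption, since the text only states that weighted softmax outputs are added together.

def ensemble_prediction(softmax_outputs, weights=None):
    # softmax_outputs: list of probability tensors, one per classifier
    weights = weights or [1.0] * len(softmax_outputs)
    combined = sum(w * p for w, p in zip(weights, softmax_outputs))
    return combined.argmax(dim=1)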
Table 1. Accuracy of the different classifiers with the self-distillation algorithm on the CIFAR100 data set.
Table 2. Accuracy of the different classifiers with the self-distillation algorithm on the ImageNet data set.
1.3. Comparison with distillation
Table 3 compares the results of self-distillation with those of five traditional distillation methods on the CIFAR100 data set. Here we focus on the accuracy improvement of each method when the student models have the same amounts of computation and storage. From Table 3 we draw the following observations: (i) all the distillation methods outperform the directly trained student networks; (ii) although self-distillation has no extra teacher, it still outperforms most of the other distillation methods.

A notable advantage of the self-distillation framework is that it requires no extra teacher. In contrast, traditional distillation first has to design and train an over-parameterized teacher model; designing a high-quality teacher model requires extensive experimentation to find the best depth and architecture, and training the over-parameterized teacher takes much longer. These problems are avoided entirely in self-distillation, where both the teacher and student models are sub-parts of the network itself. As depicted in Figure 1, compared with other distillation methods a 4.6-fold speed-up of training time can be achieved by self-distillation.
Table 3. Comparison of the accuracy of the self-distillation algorithm with traditional distillation algorithms.
1.4. Comparison with deeply supervised nets
The main difference between deeply supervised nets and self-distillation is that self-distillation trains the shallow classifiers by distillation from the deepest classifier rather than from the labels. The advantage can be seen in the experiments shown in Table 4, which compares the accuracy of each classifier in ResNet trained on CIFAR100 with deep supervision or with self-distillation. The observations can be summarized as follows: (i) self-distillation outperforms deep supervision for every classifier; (ii) the shallow classifiers benefit more from self-distillation.
Table 4. Comparison of the proposed method with the deeply supervised algorithm on the CIFAR100 data set.
The reason for this phenomenon is easy to understand. In self-distillation, (i) an additional bottleneck layer is added to detect classifier-specific features, which avoids conflicts between the shallow classifiers and the deepest classifier; (ii) the shallow classifiers are trained by distillation rather than from the labels, which improves performance; (iii) better shallow classifiers obtain more discriminative features, which in turn enhances the performance of the deeper classifiers.
1.5. The convolutional neural network trained by the present invention applies all the newly added layers (the parts below the dashed line in Figure 2) only during training; they have no effect during inference. Keeping these parts during inference provides a further option for dynamic inference on energy-constrained edge devices, and can be used as a scalable depth that adapts the inference.
In the prior art, a popular solution for accelerating convolutional neural networks is to design a scalable network, meaning that the depth or width of the neural network can be changed dynamically according to application requirements. For example, in scenarios where response time matters more than accuracy, some layers or channels can be dropped at run time for acceleration.

With a shared backbone network, an adaptive accuracy-acceleration trade-off at inference time becomes possible on resource-constrained edge devices, which means that classifiers of different depths can be used automatically in an application according to the dynamic accuracy requirements of the real world. As can be observed in Table 5: (i) with classifier 3/4, three of the four neural networks outperform their baselines, with an average speed-up of 1.2x; with classifier 2/4, a speed-up of 3.16x can be achieved at an accuracy loss of 3.3%; (ii) since the different classifiers share one backbone network, the ensemble of the three deepest classifiers raises the average accuracy by 0.67% at a computational cost of only 0.05%.
Table 5. Comparison of the proposed method with the deeply supervised algorithm on the CIFAR100 data set.
After the advantages of the self-distillation method have been established by comparison with other methods, the method itself is analyzed further, below, from the perspectives of flat minima, gradients and discriminative features.

The self-distillation method of the present invention is a training technique for improving model performance, not a method for compressing or accelerating a model. Unlike most previous research, which focuses on knowledge transfer between different models, the self-distillation provided by the present invention is a knowledge-transfer method within a single model and has broad application prospects. The self-distillation method of the present invention helps the trained model, i.e. the convolutional neural network, converge to flat minima that are inherently more generalizable. Self-distillation prevents the model from suffering from the vanishing-gradient problem, and the deeper classifiers used in self-distillation extract more discriminative features.
On the basis of the convolutional neural network trained by the above self-distillation, a scalable dynamic prediction method is realized through the control of thresholds.

The higher the confidence of a deep neural network's prediction (the maximum value of the softmax output), the more likely the prediction is correct. The present invention proposes a scalable dynamic prediction method for convolutional neural networks in which each classifier is first given a corresponding threshold. If the confidence of the current classifier's prediction is greater than this threshold, the prediction of that classifier is regarded as successful; otherwise, a deeper classifier continues the prediction, up to the last classifier. When the prediction of the deepest classifier is better than the ensemble of multiple classifiers, the scalable dynamic prediction mechanism sets thresholds only for the first three shallow classifiers and takes the prediction of the deepest classifier as the final result. Since the computation of most of the shallow classifiers is part of the computation of the deep classifiers, this progressively deepening dynamic prediction introduces almost no extra computation.
However, scalable dynamic prediction based on threshold control introduces another problem: how to choose appropriate thresholds for the different classifiers. Suitable thresholds are crucial: (1) a low threshold makes most predictions finish at the shallow classifiers, which effectively reduces response time but also lowers prediction accuracy; (2) conversely, a high threshold makes the vast majority of predictions finish at the deep classifiers, which yields higher prediction accuracy but longer response times; (3) by adjusting the thresholds appropriately, the trade-off between prediction accuracy and response speed can be tuned dynamically. To further exploit the room for acceleration and accuracy improvement, the present invention further uses a genetic algorithm to search for optimal thresholds.

A genetic algorithm obtains the optimal solution, or an approximation of it, for a given optimization objective by simulating the survival, elimination and reproduction of biological individuals in nature. Its main steps are: (1) gene initialization, i.e. randomly generating a number of individuals with different genes as the first generation; (2) computing the environmental fitness, i.e. for each individual, computing how well suited it is to the environment as determined by its genes, a computation defined by the optimization objective; (3) elimination, i.e. removing individuals unsuited to the environment according to the results of the previous step; (4) crossover, i.e. cross-pairing the genes of the individuals remaining after elimination to simulate reproduction and obtain the next generation; (5) mutation, i.e. changing the genes of surviving and newly generated individuals with a certain probability to prevent the optimization from falling into a local optimum. By iterating these steps many times, the genetic algorithm finds an optimal or near-optimal solution to the optimization objective.

In the scalable network, the threshold-search problem is modeled as an optimization problem solved by a genetic algorithm: the optimization objectives are a fast model response and a high prediction accuracy, and the solution is the set of thresholds for the shallow classifiers of the scalable network. To solve the threshold-search problem with a genetic algorithm, the mapping between genes and thresholds must be defined, and the environmental fitness must be computed from the speed-up ratio and accuracy of the scalable network.
First, the decoding relation from a gene to a threshold in the genetic algorithm is defined. A gene in the genetic algorithm is a binary code sequence. During the iterations of the genetic algorithm, a gene must be decoded into its corresponding threshold in order to compute its fitness to the environment. To avoid thresholds that are too small, which would make the accuracy too low, the lower bound of the threshold is set to 0.70. The decoding relation can be written as follows.
σ_i = 1 - 0.3·( Σ_{n=1}^{N} S(n) ) / N
where S(n) denotes the value of the n-th bit of the gene sequence, σ_i denotes the threshold corresponding to the i-th gene, and N denotes the length of the gene sequence. The more "1"s in the gene sequence, the lower the threshold.

Second, the measure of how well a gene fits the environment is defined. Since the objectives of the algorithm include both response speed and prediction accuracy, the definition of the environmental fitness should also include both indicators, as shown in the following formula.
fitness = acceleration ratio + γ·(accuracy - baseline)
where fitness denotes the environmental fitness of each gene; acceleration ratio is the speed-up ratio, i.e. the ratio of the prediction speed of the scalable dynamic prediction to that of the original scalable convolutional neural network, reflecting the acceleration brought by the dynamic scalable prediction; accuracy and baseline denote the prediction accuracy of the scalable dynamic prediction and of the original scalable convolutional neural network, respectively; and γ is a factor balancing response acceleration against prediction accuracy. By adjusting γ dynamically, multiple threshold schemes with different speed-up ratios and different accuracies can be obtained.
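A sketch of the gene decoding and fitness evaluation is given below. The decoding rule follows the relation reconstructed above (lower bound 0.70, more "1" bits giving a lower threshold), and the fitness follows the formula just stated; the evaluate routine, which would run the scalable network on a validation set and return its speed-up ratio and accuracy for a given threshold vector, is a placeholder assumption.

from typing import Callable, List, Sequence, Tuple

def decode_gene(gene: Sequence[int]) -> float:
    """Map a binary gene sequence to a classifier threshold in [0.70, 1.0]."""
    return 1.0 - 0.3 * sum(gene) / len(gene)

def fitness(genes: List[Sequence[int]],
            evaluate: Callable[[List[float]], Tuple[float, float]],
            baseline: float,
            gamma: float) -> float:
    thresholds = [decode_gene(g) for g in genes]       # one gene per shallow classifier
    acceleration_ratio, accuracy = evaluate(thresholds)
    return acceleration_ratio + gamma * (accuracy - baseline)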
The benefit of the scalable dynamic prediction method is not only a higher acceleration than static acceleration, but also the ability to adjust the response speed of a deployed model dynamically, which is crucial for application flexibility. For example, in autonomous-driving applications, when the vehicle travels at high speed the model can use lower thresholds to guarantee a higher processing frame rate, while at low speed it can use higher thresholds to obtain the best prediction accuracy. Compared with traditional algorithms that store multiple models simultaneously, this method only needs to modify thresholds rather than replace the model when switching, which avoids a model vacuum period during switching and provides a safety guarantee for real-world applications.

Compared with static acceleration methods, the scalable dynamic prediction method not only achieves a higher speed-up but is also more reliable. The requirement on the accuracy of a compressed neural network model is often one of the most important criteria for evaluating a compression algorithm; however, compressing and accelerating a neural network is usually accompanied by a drop in accuracy, which is unacceptable in safety-related application scenarios such as autonomous driving and security systems. In the scalable dynamic prediction method, even if the accuracy of every shallow classifier is lower than that of the original scalable convolutional neural network, reasonable classifier scheduling can still be achieved with lower thresholds and the original accuracy of the network can be maintained.
The experimental results of the scalable dynamic prediction method for convolutional neural networks of the present invention on the CIFAR100 data set are as follows. Figures 4 and 5 show the relationship between the amount of computation, the number of parameters and the prediction accuracy of seven deep neural networks of different depths on CIFAR100. The horizontal axis is the number of multiply-add operations required for the deep neural network's prediction, and the vertical axis is its prediction accuracy. The dashed lines and points of each gray level correspond to the same deep neural network; marker points of the same shape on a dashed line are the experimental results of the four (or three) depth-wise classifiers of the same scalable network, while marker points of the same shape off the dashed lines are the comparison results of the original models without the scalable network.

It can be seen that, on the CIFAR100 data set:

(1) In all cases, the second-shallowest classifier of the scalable convolutional neural network exceeds the original model in prediction accuracy. (2) Without any loss of accuracy, a statically run scalable network achieves a 2.17x acceleration and a 3.20x compression. (3) Compared with the comparison results of the original models, each neural network improves its prediction accuracy by 4.05% on average at the cost of only 4.4% additional computation. (4) The ensemble prediction of all classifiers improves the accuracy by 1.11% over the deepest classifier. (5) Within the same deep neural network, the accuracy gains of the shallow classifiers are much larger than those of the deep classifiers, which is mainly brought by the attention layers in the shallow classifiers. (6) Overall, the deeper the neural network, the larger its performance improvement.
At the same time, Table 6 lists the accuracy of the different classifiers of the scalable convolutional neural networks on the CIFAR100 data set; the accuracy of each classifier of each network in the CIFAR100 experiments serves as a numerical complement to the analysis of Figures 4 and 5.

Table 6. Accuracy of the different classifiers of the scalable neural networks on the CIFAR100 data set.
From Table 6: (1) In the experiments on all network structures, even the shallowest classifier of the scalable neural network is already very close to the accuracy of the original model; on average the shallowest classifier of each network is 2.8% below the original model, with the largest gap of 5.25% on ResNet18 and the smallest gap of only 0.19% on WRN44-8. (2) In all network structures, the second-shallowest classifier of the scalable neural network already exceeds the original model; on average it is 1.8% higher than the original model, with the largest improvement of 2.52% on WRN44-8 and the smallest of 0.65% on ResNet18. (3) Across all network structures, the deeper the classifier in the scalable neural network, the higher its accuracy in general; this trend is most pronounced between the shallowest and second-shallowest classifiers, e.g. the first two shallow classifiers of ResNet18 differ by more than 5% in accuracy, whereas the accuracies of the second-deepest and deepest classifiers are almost the same, and in some cases (ResNet152) the second-deepest classifier is even more accurate than the deepest one, probably because the classification task of CIFAR100 is relatively simple. (4) By simply ensembling the prediction results of the multiple classifiers, the scalable network gains more than 1% in accuracy. (5) From the viewpoint of static compression and acceleration, the accuracy of a ResNet18 trained as a scalable neural network already exceeds that of a ResNet152 trained by the traditional method; in application scenarios, replacing the ResNet152 model with the ResNet18 model achieves 5.33x parameter compression and 6.27x acceleration.
Table 7 shows the experimental results of the scalable convolutional neural networks on the CIFAR10 data set. The overall trend is the same as on CIFAR100: all convolutional neural networks achieve a clear accuracy improvement; across all the network structures tested, the average improvement is 0.98%, with a maximum of 1.28% on VGG16(BN) and a minimum of 0.71% on ResNet18.

The absolute accuracy gains on the CIFAR10 data set are slightly lower than on CIFAR100, mainly because the accuracy of the original networks on CIFAR10 is already very high; since the neural networks trained by the traditional method already achieve a high prediction accuracy, further improving the accuracy is more difficult than on the CIFAR100 data set.

Table 7. Accuracy of the different classifiers of the scalable convolutional neural networks on the CIFAR10 data set.
Table 8 shows the accuracy of each classifier in ResNet networks of three different depths on the ImageNet data set. The trend is roughly the same as on CIFAR100, but with the following differences:

(1) On average, each network improves its prediction accuracy by 1.26%, with the largest improvement of 1.41% on ResNet50 and the smallest of 1.08% on ResNet101; this result is worse than on the CIFAR100 data set.

(2) Unlike the results on CIFAR100, the accuracy on ImageNet changes greatly as the classifier position in the network becomes deeper. In the three neural networks tested, the prediction accuracy of the deep classifiers is significantly higher than that of the shallow classifiers. This shows that depth is crucial for the ImageNet data set, and that the parameter redundancy of these networks is far smaller than that of networks trained on CIFAR10 and CIFAR100, most likely because ImageNet classification is more difficult.

(3) Although the accuracy of the deepest classifier improves over the original model, none of the shallow classifiers exceeds the original model. Consequently, simply replacing the original model with a shallow classifier brings acceleration and compression but cannot maintain the accuracy of the original model, so static compression and acceleration methods that directly replace a large model with a small one cannot be used on the ImageNet data set. The scalable dynamic prediction method proposed here solves this problem through reasonable scheduling of multiple classifiers.

Since the accuracy of the shallow classifiers on ImageNet cannot exceed that of the original model, the model-ensemble approach used on the CIFAR100 and CIFAR10 data sets brings no additional accuracy gain. Experimental results show that even more sophisticated ensembling, such as weighted ensemble algorithms, yields no benefit in classification accuracy, so these results are omitted from Table 8.

Table 8. Accuracy of the different classifiers of the scalable networks on the ImageNet data set.
Figure 6 shows the relationship between the accuracy and the speed-up ratio of each neural network under dynamic scalable prediction with different threshold schemes on CIFAR100 and ImageNet. The horizontal axis is the speed-up ratio of the model and the vertical axis is its prediction accuracy. Points of the same color are experimental results for the same network on the same data set; squares in the range x > 1 are the results corresponding to the searched threshold schemes, and triangles on the line x = 1 are the results of the original models.

From Figure 6: (1) On the CIFAR100 data set, without any loss of accuracy, ResNet18, ResNet50 and ResNet152 achieve speed-ups of about 2.5x, 4.4x and 6.2x respectively, which is clearly better than the static acceleration obtained by simple classifier replacement. (2) On the ImageNet data set, without any loss of accuracy, ResNet50 and ResNet101 achieve speed-ups of 1.5x and 2.5x respectively. (3) On the same data set, the deeper the neural network, the more pronounced the acceleration; for example, on ImageNet the acceleration of ResNet101 is clearly better than that of ResNet50, and on CIFAR100 the acceleration of ResNet152 is better than that of ResNet50, which in turn is better than that of ResNet18. (4) Observing the trend of each curve, the speed-up ratio and the accuracy show a clear negative correlation, and judging from the derivative, the accuracy drops faster as the speed-up ratio rises. This phenomenon is caused by a weakness of threshold control: experiments show that although threshold-controlled dynamic scalable prediction requires no extra computation, low thresholds can lead to decisions going out of control, i.e. some decisions exceed the threshold but the final classification is wrong, lowering the overall accuracy of the model.
With the prediction method of the present invention, the final acceleration depends directly on how many classifications are completed by each classifier in the scalable neural network. If most classification decisions are made by the shallow classifiers, the acceleration of the whole network is very pronounced; if most decisions are made by the deep classifiers, the system responds almost as slowly as the original network. By counting the number of decisions made by classifiers at different depths, the acceleration of the system can therefore be characterized accurately.
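As a minimal sketch of this counting, assuming hypothetical per-classifier exit counts and cumulative compute costs (the numbers below are placeholders, not measured values), the expected cost of the scalable network is the exit-fraction-weighted sum of the cumulative cost up to each exit, and the speedup is the full-network cost divided by that expectation:

def estimated_speedup(exit_counts, cost_up_to_exit):
    """Estimate the average speedup of threshold-controlled prediction.

    exit_counts: number of samples answered by each classifier, shallow to deep.
    cost_up_to_exit: cumulative compute cost (e.g. relative FLOPs) of running
        the network up to and including each classifier; the last entry is the
        cost of the full original network.
    """
    total = sum(exit_counts)
    expected_cost = sum(n / total * c for n, c in zip(exit_counts, cost_up_to_exit))
    return cost_up_to_exit[-1] / expected_cost

# Placeholder numbers in the spirit of Figure 8: most samples exit early.
counts = [6200, 3100, 500, 200]   # decisions completed per classifier
costs = [1.0, 1.9, 3.1, 4.1]      # relative cumulative cost per exit
print(f"estimated speedup: {estimated_speedup(counts, costs):.2f}x")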
Figure 8 shows the prediction behavior of the four classifiers on different datasets under the same threshold scheme and the same network (ResNet50). On the horizontal axis, 1/4 to 4/4 denote the four classifiers from shallow to deep; the vertical axis gives the fraction of all predictions made by each classifier.
Figure 8 shows that on the CIFAR10 and CIFAR100 datasets more than 60% of the images can be predicted by the shallowest classifier and more than 90% can be handled by the first two classifiers, which is consistent with the high speedups observed on the CIFAR datasets. On the ImageNet dataset, only about 20% of the images can be predicted by the shallowest classifier and nearly half must be classified by the two deepest classifiers, which explains the comparatively modest acceleration on ImageNet. These observations suggest two potential applications of the depth-scalable network: 1. measuring the redundancy of a neural network; 2. measuring the difficulty of different datasets.
First, the number of predictions made by each classifier on the same dataset indicates the redundancy of the corresponding network layers. For example, in the CIFAR10 and CIFAR100 statistics, the number of predictions completed by the second-deepest and the deepest classifiers is close to zero, which shows that the parts of the network behind these two classifiers contribute little to the overall classification and are highly redundant; they are therefore suitable for further compression by pruning, quantization, and similar algorithms. In contrast, the prediction counts of the first two shallow classifiers sum to nearly one hundred percent, indicating that the parts of the network they sit on are essential to the classification task, have little redundancy, and are not suitable for further compression or acceleration.
Second, the number of predictions made by each classifier on different datasets can serve as a measure of dataset difficulty. The simplest way to compare the difficulty of datasets is to compare the accuracy the same network achieves on each of them; however, classification accuracy is also affected by the number of classes, and since different datasets have different numbers of classes, this measure is biased and tends to underestimate the difficulty of tasks with few classes. Depth scalability offers an alternative: compare the difficulty of datasets by the number of samples that can be classified by the shallow classifiers.
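The following small sketch illustrates this comparison; the dataset names and exit distributions are placeholders rather than measured statistics. Datasets are ranked by the fraction of samples that the shallow classifiers fail to resolve under a fixed network and threshold scheme:

def difficulty_score(exit_fractions, shallow_exits=2):
    """Fraction of samples NOT resolved by the first `shallow_exits` classifiers."""
    return 1.0 - sum(exit_fractions[:shallow_exits])

# Placeholder exit distributions (shallow -> deep) for three datasets.
exit_stats = {
    "dataset_A": [0.65, 0.28, 0.05, 0.02],
    "dataset_B": [0.61, 0.30, 0.06, 0.03],
    "dataset_C": [0.20, 0.32, 0.25, 0.23],
}
for name, fractions in sorted(exit_stats.items(),
                              key=lambda kv: difficulty_score(kv[1])):
    print(f"{name}: difficulty {difficulty_score(fractions):.2f}")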
The present invention also provides a self-distillation training device for a convolutional neural network, comprising a memory for storing a computer program and a processor for implementing, when executing the computer program, the steps of the self-distillation training method for a convolutional neural network described above.
The present invention also provides a computer storage medium storing a computer program which, when executed by a processor, implements the steps of the self-distillation training method for a convolutional neural network described above.
The present invention also provides a scalable dynamic prediction device for a convolutional neural network, comprising a memory for storing a computer program and a processor for implementing, when executing the computer program, the steps of the scalable dynamic prediction method for a convolutional neural network described above.
The present invention also provides another computer storage medium storing a computer program which, when executed by a processor, implements the steps of the scalable dynamic prediction method for a convolutional neural network described above.
Those skilled in the art should understand that the embodiments of the present invention may be provided as a method, a system, or a computer program product. Therefore, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing equipment to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing equipment produce a device for realizing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing equipment to operate in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device that realizes the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
These computer program instructions may also be loaded onto a computer or other programmable data processing equipment, so that a series of operational steps are performed on the computer or other programmable equipment to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable equipment provide steps for realizing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
Finally, it should be noted that the above embodiments are only intended to illustrate, not to limit, the technical solutions of the present invention. Although the present invention has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that the specific implementations of the present invention may still be modified or equivalently replaced, and any modification or equivalent replacement that does not depart from the spirit and scope of the present invention shall fall within the protection scope of the claims of the present invention.

Claims (11)

  1. A self-distillation training method for a convolutional neural network, characterized in that it comprises the following steps:
    Step 1: according to the depth and original structure of the target convolutional neural network, divide the convolutional layers of the target convolutional neural network into n parts at a set depth interval, where n is a positive integer and n ≥ 2; the n-th part is the deepest part and the remaining parts are shallow parts;
    Step 2: set a shallow classifier after each shallow part for classification, and set the deepest classifier after the deepest part for classification; each shallow classifier comprises a bottleneck layer, a fully connected layer, and a softmax layer arranged in sequence for classification, and the deepest classifier comprises a fully connected layer and a softmax layer arranged in sequence for classification;
    the specific features of the shallow classifier are obtained by the following attention module:
    AttentionMaps(W_conv, W_deconv, F) = σ(φ(ψ(F, W_conv)), W_deconv)
    where ψ and φ denote respectively the convolution function of the convolutional layer used for down-sampling and the deconvolution function of the deconvolution layer used for up-sampling, F denotes the input features, σ denotes the Sigmoid function, W_conv denotes the weights of the convolutional layer, and W_deconv denotes the weights of the deconvolution layer;
    Step 3: during training, the deepest part is regarded as the teacher model, and all shallow parts with their corresponding classifiers are trained as student models by distilling from the deepest part, thereby realizing self-distillation training of the convolutional neural network.
  2. The self-distillation training method for a convolutional neural network according to claim 1, characterized in that in step 3, during training, the following three losses are introduced to improve the performance of the student models:
    a cross-entropy loss from the labels: the cross-entropy loss is computed from the labels of the training dataset and the output of the softmax layer of each classifier, and is introduced into all classifiers;
    a KL-divergence loss under the guidance of the teacher model: the KL divergence is computed from the softmax-layer outputs of each student model and the teacher model, and is introduced into the softmax layer of each shallow classifier accordingly;
    an L2 loss from hints: the L2 loss between the feature maps of the deepest classifier and of each shallow classifier is computed and introduced into the bottleneck layer of each shallow classifier accordingly.
  3. The self-distillation training method for a convolutional neural network according to claim 2, characterized in that, specifically, the cross-entropy loss from the labels is given by the following formula:
    (1 − α)·CrossEntropy(q_i, y)
    where q_i denotes the output of the softmax layer of each classifier θ_i/C; the training set is given as N samples X = {x_1, x_2, …, x_N} from M classes, and the corresponding label set is denoted Y = {y_1, y_2, …, y_N}; α is the hyperparameter controlling the proportion of the KL-divergence loss function, KL is the Kullback-Leibler divergence, α of the deepest classifier is zero, and CrossEntropy is the cross-entropy function.
  4. The self-distillation training method for a convolutional neural network according to claim 2, characterized in that, specifically, the KL-divergence loss under the guidance of the teacher model is given by the following formula:
    α·KL(q_i, q_C)
    where α is the hyperparameter controlling the proportion of the KL-divergence loss function, KL is the Kullback-Leibler divergence, q_i denotes the output of the softmax layer of each classifier θ_i/C, q_C is the output of the softmax layer of the deepest classifier θ_C, and α of the deepest classifier is zero.
  5. The self-distillation training method for a convolutional neural network according to claim 2, characterized in that, specifically, the L2 loss from hints is given by the following formula:
    λ·||F_i − F_C||₂²
    where F_i and F_C denote respectively the features in each classifier θ_i/C and the features in the deepest classifier θ_C, λ is the hyperparameter controlling the proportion of the feature loss function, and λ of the deepest classifier is zero.
  6. The self-distillation training method for a convolutional neural network according to claim 2, characterized in that, during training, the loss function of the entire convolutional neural network is composed of the loss functions of all classifiers and is expressed by the following formula:
    loss = Σ_i [ (1 − α)·CrossEntropy(q_i, y) + α·KL(q_i, q_C) + λ·||F_i − F_C||₂² ], summed over all classifiers i = 1, …, C
    where q_i denotes the output of the softmax layer of each classifier θ_i/C; the training set is given as N samples X = {x_1, x_2, …, x_N} from M classes, and the corresponding label set is denoted Y = {y_1, y_2, …, y_N}; CrossEntropy is the cross-entropy function; KL is the Kullback-Leibler divergence; q_C is the output of the softmax layer of the deepest classifier θ_C; F_i and F_C denote respectively the features in each classifier θ_i/C and the features in the deepest classifier θ_C; α and λ are the hyperparameters controlling the proportions of the KL-divergence loss function and of the feature loss function, and α and λ of the deepest classifier are zero.
  7. The self-distillation training method for a convolutional neural network according to claim 1, characterized in that the shallow classifiers, each comprising a bottleneck layer, a fully connected layer, and a softmax layer arranged in sequence, can be removed at inference.
  8. A scalable dynamic prediction method for a convolutional neural network, characterized in that the convolutional neural network is a scalable convolutional neural network obtained by the self-distillation training method according to any one of claims 1-7, and the scalable dynamic prediction method comprises the following steps:
    Step 1: set thresholds for all shallow classifiers and for the deepest classifier respectively;
    Step 2: from shallow to deep, compare the confidence of each classifier's prediction with its threshold; if the confidence of the current classifier's prediction is greater than the threshold of the current classifier, the prediction of the current classifier is considered successful; otherwise the prediction continues with a deeper classifier, up to the classifier of the last part; as the depth increases, the prediction accuracy improves layer by layer;
    Step 3: provided the prediction-confidence requirement is satisfied, select, according to the prediction demand, either the prediction of the shallowest classifier or the prediction with the best accuracy as the output of the scalable dynamic prediction.
  9. The scalable dynamic prediction method for a convolutional neural network according to claim 8, characterized in that in step 1 the threshold of each classifier is searched and optimized by a genetic algorithm; the optimization objectives are a fast response of the convolutional neural network model and a high prediction accuracy, and the optimized solution is the set of thresholds corresponding to the shallow classifiers of the scalable convolutional neural network;
    Step 1.1: define the mapping between genes and thresholds by the following decoding relation from a gene in the genetic algorithm to a threshold:
    Figure PCTCN2020106995-appb-100007
    where τ is the lower bound of the threshold, S(n) denotes the value at the n-th position of the gene sequence, σ denotes the threshold corresponding to the i-th gene, and N denotes the length of the gene sequence; the more "1"s a gene sequence contains, the lower the threshold;
    Step 1.2: from the speedup ratio and the prediction accuracy of the scalable convolutional neural network, define the environmental fitness as:
    fitness = acceleration ratio + γ·(accuracy − baseline)
    where fitness denotes the environmental fitness of each gene; acceleration ratio is the speedup ratio, i.e., the ratio of the prediction response speed of the scalable dynamic prediction to that of the original scalable convolutional neural network; accuracy and baseline denote respectively the prediction accuracy of the scalable dynamic prediction and that of the original scalable convolutional neural network; and γ is a factor balancing response acceleration against prediction accuracy;
    Step 1.3: with the above definitions, search the thresholds with the genetic algorithm:
    first, randomly initialize the genes representing the thresholds;
    second, compute the fitness of all genes to the environment, retain genes with high fitness with a larger probability, and eliminate genes with low fitness with a crossover probability;
    then, pair the retained genes two by two to obtain new genes;
    the above process is performed iteratively, and the thresholds represented by the finally obtained gene with the highest environmental fitness are the thresholds after the optimized search.
  10. The scalable dynamic prediction method for a convolutional neural network according to claim 8, characterized in that, when the prediction of the deepest classifier is better than the ensemble of multiple classifier models, thresholds are set only for the first three shallow classifiers and the prediction of the deepest classifier is taken as the final result.
  11. A self-distillation training device for a convolutional neural network, characterized in that it comprises a memory for storing a computer program and a processor for implementing, when executing the computer program, the steps of the self-distillation training method for a convolutional neural network according to any one of claims 1-7.
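To make the attention module of claim 1 and the combined training loss of claims 2-6 concrete, the following is a minimal PyTorch-style sketch. It assumes PyTorch is available; the layer sizes, the stride used for down- and up-sampling, the multiplication of the attention map with the input features, the detaching of the teacher outputs, and the hyperparameter values are illustrative assumptions rather than the patented implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionModule(nn.Module):
    """Sketch of AttentionMaps(W_conv, W_deconv, F) = sigma(deconv(conv(F)))."""
    def __init__(self, channels: int, stride: int = 2):
        super().__init__()
        # psi: convolution for down-sampling; phi: deconvolution for up-sampling.
        self.down = nn.Conv2d(channels, channels, kernel_size=3,
                              stride=stride, padding=1)
        self.up = nn.ConvTranspose2d(channels, channels, kernel_size=3,
                                     stride=stride, padding=1, output_padding=1)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # Assumes an even spatial size so down-/up-sampling shapes match.
        attention = torch.sigmoid(self.up(self.down(feat)))
        return attention * feat  # re-weight the input features

def self_distillation_loss(logits_list, feature_list, labels,
                           alpha: float = 0.3, lam: float = 0.03) -> torch.Tensor:
    """Sum over classifiers of (1-alpha)*CE + alpha*KL(q_i, q_C) + lam*L2(F_i, F_C).

    logits_list / feature_list are ordered from the shallowest classifier to the
    deepest one; the deepest classifier is the teacher (alpha = lam = 0 for it).
    """
    teacher_logits = logits_list[-1].detach()
    teacher_feat = feature_list[-1].detach()
    total = F.cross_entropy(logits_list[-1], labels)   # deepest: label loss only
    for logits, feat in zip(logits_list[:-1], feature_list[:-1]):
        ce = F.cross_entropy(logits, labels)
        kl = F.kl_div(F.log_softmax(logits, dim=1),
                      F.softmax(teacher_logits, dim=1),
                      reduction="batchmean")
        hint = F.mse_loss(feat, teacher_feat)
        total = total + (1.0 - alpha) * ce + alpha * kl + lam * hint
    return total

In this sketch the deepest classifier's outputs are detached so that the teacher is not updated by the distillation terms; this is one common design choice, and the claims do not prescribe this detail.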