CN116452863A

CN116452863A - Class center knowledge distillation method for remote sensing image scene classification

Info

Publication number: CN116452863A
Application number: CN202310328849.0A
Authority: CN
Inventors: 刘智; 刘潇; 芮杰; 王淑香; 林雨准; 金飞; 左溪冰; 李美霖
Original assignee: Information Engineering University of PLA Strategic Support Force
Current assignee: Information Engineering University of PLA Strategic Support Force
Priority date: 2023-03-30
Filing date: 2023-03-30
Publication date: 2023-07-18

Abstract

The invention belongs to the technical field of remote sensing image scene classification, and discloses a class center knowledge distillation method for remote sensing image scene classification. The invention has good generalization capability, and the extracted features have good intra-class compactness and inter-class dispersion.

Description

Class center knowledge distillation method for remote sensing image scene classification

Technical Field

The invention relates to the technical field of remote sensing image scene classification, in particular to a class center knowledge distillation method for remote sensing image scene classification.

Background

In recent years, advances in remote sensing imaging technology have greatly improved the spatial resolution of remote sensing images. Images now express target details far better, which supports the practical demands of intelligent knowledge extraction and information mining. High-resolution remote sensing image scene classification is an important research area within the intelligent interpretation of remote sensing data, and can support environmental monitoring, urban planning, resource surveys and similar applications.

Remote sensing image scene classification assigns a scene class to each image patch based on global semantic information. Traditional methods based on handcrafted feature descriptors either classify global descriptors such as texture features directly, or encode local descriptors such as the scale-invariant feature transform (SIFT) with schemes such as bag of visual words to represent the whole scene. Because handcrafted features have limited representational power, these methods are poorly suited to complex scene images, and researchers turned to unsupervised learning methods such as sparse coding. However, unsupervised learning methods cannot make full use of the class information in the data. In recent years, with the rapid development of deep learning, convolutional neural networks have been widely applied to scene classification thanks to their strong feature extraction capability, and have achieved great success.

Deep learning algorithms have developed rapidly and now dominate remote sensing image scene classification. However, high-performance network models usually have a large number of trainable parameters, high computational cost and large resource consumption. Such computational complexity and storage requirements are difficult to meet on embedded mobile devices and in on-orbit satellite processing, so model compression is imperative. Model compression aims to simplify the model while compensating for the loss of accuracy, and is an inevitable trend for practical deployment. The prevailing approaches are network pruning, parameter quantization and knowledge distillation. Network pruning deletes redundant parameters according to designed criteria, and parameter quantization replaces the original floating-point parameters with low-bit-width ones; both simplify the model parameters. Knowledge distillation instead transfers implicit knowledge from a complex teacher network to a lightweight student network, so that the lightweight network approaches the performance of the complex network and the model structure is compressed.

The idea of knowledge distillation can be traced back to Buciluǎ et al. [Buciluǎ C, Caruana R, Niculescu-Mizil A. Model compression [C]// Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM, 2006: 535-541. DOI:10.1145/1150402.1150464], who proposed that a complex ensemble model can be converted into a simple neural network by model compression. Ba et al. [Ba J, Caruana R. Do deep nets really need to be deep? [C]// Advances in Neural Information Processing Systems, 2014, 27. DOI:10.48550/arXiv.1312.6184] verified this experimentally and proposed that a small model can mimic a large one by minimizing the L2 loss between the logits of the two models. However, these fully connected layer outputs, which have not passed through the softmax function, are unconstrained and may contain noise during model training and testing. To address this, Hinton et al. [Hinton G, Vinyals O, Dean J. Distilling the knowledge in a neural network [EB/OL]. 2015: arXiv:1503.02531. https://arxiv.org/abs/1503.02531] proposed using "soft targets": the output class probabilities are softened by a softmax function with a temperature coefficient, and the class probability distribution is fitted with the KL divergence. This process of transferring knowledge from a complex teacher model to a simple student model was the first to be defined as knowledge distillation. As knowledge distillation has been explored further, the kinds of knowledge being distilled have been greatly enriched. According to knowledge type, existing methods can be categorized into knowledge distillation based on responses, features, instance relations, and relations between network layers.
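For illustration only, the soft-target idea described above can be sketched in plain Python. This is a minimal sketch and not code from the cited works; the logit values and the temperature T = 4 are hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (T > 1 softens the output)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

teacher_logits = [6.0, 2.0, 1.0]   # hypothetical teacher outputs
student_logits = [4.0, 3.0, 0.5]   # hypothetical student outputs
T = 4.0                            # hypothetical temperature

p_teacher = softmax_with_temperature(teacher_logits, T)
p_student = softmax_with_temperature(student_logits, T)
soft_target_loss = kl_divergence(p_teacher, p_student)
```

A higher temperature flattens both distributions, so the student also sees how the teacher ranks the non-target classes.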

Knowledge distillation has achieved remarkable results in computer vision and other fields as an efficient means of model compression. However, there is relatively little research on knowledge distillation in remote sensing image scene classification. Chen et al. [Chen G Z, Zhang X D, Tan X L, et al. Training small networks for scene classification of remote sensing images via knowledge distillation [J]. Remote Sensing, 2018, 10(5): 719. DOI:10.3390/rs10050719] first introduced classical knowledge distillation into remote sensing image scene classification; matching the softmax-layer outputs of the deep and shallow networks effectively improves the performance of the shallow network. Yang et al. [Yang H B, Chi Y X, Wang J G. Knowledge distillation method for remote sensing satellite image classification based on pruning network [J]. Application Research of Computers, 2021, 38(8): 2469-2473. DOI:10.19734/j.issn.1001-3695.2020.07.0387] introduced knowledge distillation to compensate for the accuracy lost when a model is compressed by pruning. Zhao et al. [Zhao H R, Sun X, Gao F, et al. Pair-wise similarity knowledge distillation for RSI scene classification [J]. Remote Sensing, 2022, 14(10): 2483. DOI:10.3390/rs14102483] introduced pairwise similarity knowledge distillation and mixed samples with different labels using the mixup technique, improving student network accuracy by additionally transferring knowledge about the similarity correlations between virtual samples. However, most of these methods apply existing knowledge distillation algorithms directly and neglect the challenges of high intra-class diversity and low inter-class separability in scene classification tasks. They lose the discriminative information carried by the intra-class diversity and inter-class similarity of scene data, which lowers the classification accuracy of the student network to some extent and yields only mediocre compression results.

Disclosure of Invention

The invention provides a class center knowledge distillation method for the remote sensing image scene classification task, which compresses a heavyweight network into a lightweight network. The overall framework comprises three parts: teacher network fine-tuning, teacher-student network distillation, and student network prediction. To enable the lightweight network to cope with the high intra-class diversity and low inter-class separability of scene classification tasks, a new knowledge distillation loss function is designed. It efficiently transfers the strong feature extraction capability of the teacher network by constraining the distance between the same-class feature distribution centers extracted by the teacher and student networks, so that the features extracted by the student network are compact within classes and dispersed between classes. The invention evaluates the proposed method against existing knowledge distillation methods based on responses, features, instance relations and relations between network layers on four public remote sensing image scene classification datasets, and the experimental results demonstrate the effectiveness of the class center knowledge distillation method.

In order to achieve the above purpose, the present invention adopts the following technical scheme:

a class center knowledge distillation method for remote sensing image scene classification, comprising the following steps:

firstly, feeding a remote sensing image scene classification dataset into a pre-trained teacher network for parameter fine-tuning;

then, condensing same-class feature distributions into knowledge, extracting this knowledge from the fine-tuned teacher network, and guiding the distillation training of the student network with the class center knowledge of the intermediate hidden layers of the network;

and finally, using the trained student network on its own for remote sensing image scene prediction.

Further, the method further comprises the following steps:

during the teacher-student network distillation stage, the student network is additionally supervised with the ground-truth labels.

Further, in the network training stage, data augmentation is performed using random flipping and Gaussian blur with a random radius; in the testing stage, the test data are not augmented.

Further, condensing same-class feature distributions into knowledge and extracting this knowledge from the fine-tuned teacher network comprises:

based on the designed class center distillation loss function, transferring the feature extraction capability of the teacher network to the student network, the knowledge transfer being completed by constraining the distance between the same-class feature distribution centers extracted by the teacher network and the student network: for the output features of one or more specific layers of the teacher-student network, the instances are grouped by label, the center of the feature distributions of the instances sharing a label is computed, and the student model learns the class feature knowledge extracted by the teacher model by minimizing the distance between the teacher's and the student's centers for each class.

Further, the class center distillation loss function is:

$$L_{CC} = \sum_{i=1}^{M} d\left(c_T^{i}, c_S^{i}\right)$$

where $c_T^{i}$ and $c_S^{i}$ denote the cluster centers of the teacher network and the student network corresponding to the i-th class center distance, M denotes the number of class center distances, $d(c_T, c_S)$ denotes the distance between the cluster centers $c_T$ and $c_S$, N denotes the number of instances with the same label, $k(\cdot,\cdot)$ denotes the kernel function that projects feature vectors into a higher-dimensional or infinite-dimensional feature space, $C_T$ and $C_S$ denote the numbers of channels of the teacher network and the student network respectively, $f_T^{i\cdot}$ and $(f_T^{i\cdot})^{\top}$ denote the feature map of the i-th channel of the teacher network and its transpose, and $f_S^{j\cdot}$ and $(f_S^{j\cdot})^{\top}$ denote the feature map of the j-th channel of the student network and its transpose.

Further, during training, the overall loss function is:

$$L = L_{CE} + \lambda \sum L_{CC}$$

where

$$L_{CE} = -\log p_{y_{true}}, \qquad p_i = \frac{\exp(z_i)}{\sum_{j}\exp(z_j)}$$

and $p_i$ denotes the predicted probability that a sample belongs to class i, $z_i$ denotes the logit of class i, $y_{true}$ denotes the ground-truth label, $\sum$ denotes summation over all matched layers selected by the invention, and $\lambda$ is a hyper-parameter that balances the two loss terms.

Further, the teacher network is ResNet-50 and the student network is ResNet-18 or MobileNetV2.

Compared with the prior art, the invention has the beneficial effects that:

Drawings

FIG. 1 shows the model framework of a class center knowledge distillation method for remote sensing image scene classification in an embodiment of the invention;

FIG. 2 is an exemplary sample of 4 remote sensing image scene classification datasets according to an embodiment of the invention;

FIG. 3 shows the accuracy confusion matrices (%) of the comparative experiments of the invention on the RSC11 dataset, wherein (a) corresponds to the separately trained student network, (b) to the fine-tuned teacher network, (c) to the student network trained with the improved NST method, and (d) to the student network trained with the method of the invention;

FIG. 4 shows test error curves according to an embodiment of the invention, wherein (a) compares the method of the invention with separate training and (b) compares the method of the invention with the improved NST method;

FIG. 5 shows feature scatter plots of the RSSCN7 dataset visualized with the t-SNE algorithm, in which the red circles mark regions with obvious differences in the compactness of the feature clusters, wherein (a) corresponds to the separately trained student network, (b) to the improved NST method, and (c) to the method of the invention;

FIG. 6 shows heat maps of the output feature layers according to an embodiment of the invention, wherein (a) corresponds to the separately trained student network, (b) to the fine-tuned teacher network, (c) to the improved NST method, and (d) to the method of the invention.

Detailed Description

The invention is further described below through specific embodiments with reference to the accompanying drawings:

the model framework of the class center knowledge distillation method for remote sensing image scene classification provided by the invention is shown in FIG. 1 and consists of two streams, a teacher network and a student network. Firstly, a remote sensing image scene classification dataset is fed into a teacher network pre-trained on a large-scale dataset for parameter fine-tuning. Then, same-class feature distributions are condensed into knowledge, this knowledge is extracted from the fine-tuned teacher network, and the class center knowledge of the intermediate hidden layers of the network guides the distillation training of the student network; the distillation loss function is described in detail in section 1.1. In addition, during the teacher-student distillation stage the student network is also supervised with the ground-truth labels; the overall loss function is described in section 1.2. Finally, the trained student network is used on its own for remote sensing image scene prediction. The specific steps of the model framework are listed in section 1.3.

1.1 Distillation loss function

Remote sensing image scene classification faces the challenges of high intra-class diversity and low inter-class separability. Existing knowledge distillation methods ignore the discriminative information carried by the intra-class diversity and inter-class similarity of scene data, so the feature extraction capability of the teacher model is not learned well. The invention therefore designs a class center knowledge distillation loss function for remote sensing image scene classification. Class-related knowledge is transferred by constraining the distance between the same-class feature distribution centers extracted by the teacher and student networks, with the expectation that the features extracted by the student model have the same good intra-class compactness and inter-class dispersion as those of the teacher model.

1.1.1 Neuron selectivity transfer algorithm

The neuron selectivity transfer (Neuron Selectivity Transfer, NST) algorithm achieves knowledge transfer by minimizing the maximum mean discrepancy between the neuron selectivity feature distributions of the teacher and student networks. The maximum mean discrepancy (Maximum Mean Discrepancy, MMD) measures the difference between the probability distributions of data samples. Specifically, two distributions that are not linearly separable are mapped into a high-dimensional reproducing kernel Hilbert space (Reproducing Kernel Hilbert Space, RKHS), where they become linearly separable, and the distance is computed there, as shown in equation (1).

$$\mathrm{MMD}^2(X, Y) = \left\| \frac{1}{n}\sum_{i=1}^{n}\phi\left(x^{i}\right) - \frac{1}{m}\sum_{j=1}^{m}\phi\left(y^{j}\right) \right\|_2^2 \tag{1}$$

where the samples $x^{i}$ and $y^{j}$ come from the sample sets $X = \{x^{i}\}_{i=1}^{n}$ and $Y = \{y^{j}\}_{j=1}^{m}$ respectively, and $\phi(\cdot)$ denotes an explicit mapping function.

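The NST algorithm regards the activation at each spatial location as one feature $f^{ij}$. The feature map $f^{i\cdot}$ of all positions in a channel is flattened into a vector of dimension 1×HW and regarded as one sample, and the samples of all channels constitute a distribution $F \in \mathbb{R}^{C \times HW}$. By constraining the maximum mean discrepancy between the two distributions $F_T$ and $F_S$, the features of the output layers of the teacher and student networks are matched, as shown in formula (2).

$$L_{MMD^2}(F_T, F_S) = \left\| \frac{1}{C_T}\sum_{i=1}^{C_T}\phi\left(f_T^{i\cdot}\right) - \frac{1}{C_S}\sum_{j=1}^{C_S}\phi\left(f_S^{j\cdot}\right) \right\|_2^2 \tag{2}$$

where $C_T$ and $C_S$ denote the numbers of channels of the teacher network and the student network respectively.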
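The flattening step and the distance between the two channel-sample distributions can be sketched as follows. This is a minimal sketch, assuming the identity map φ(x) = x purely for readability (the method itself works in a kernel space); the toy feature maps are hypothetical.

```python
def flatten_channels(feature_map):
    """Turn a C x H x W nested list into C samples, each a flat list of length H*W."""
    return [[v for row in channel for v in row] for channel in feature_map]

def mean_vector(samples):
    """Element-wise mean of equally long vectors."""
    n = len(samples)
    return [sum(column) / n for column in zip(*samples)]

def linear_mmd2(samples_t, samples_s):
    """Squared MMD with the identity map phi(x) = x (illustrative choice only)."""
    mu_t, mu_s = mean_vector(samples_t), mean_vector(samples_s)
    return sum((a - b) ** 2 for a, b in zip(mu_t, mu_s))

# Hypothetical toy feature maps: the teacher has 3 channels, the student has 2,
# and both share the same 2 x 2 spatial size.
teacher_map = [[[1.0, 0.0], [0.0, 1.0]],
               [[0.5, 0.5], [0.5, 0.5]],
               [[0.0, 1.0], [1.0, 0.0]]]
student_map = [[[1.0, 0.0], [0.0, 1.0]],
               [[0.0, 1.0], [1.0, 0.0]]]

f_t = flatten_channels(teacher_map)   # 3 samples of dimension 4
f_s = flatten_channels(student_map)   # 2 samples of dimension 4
loss = linear_mmd2(f_t, f_s)
```

Because each channel is one sample of dimension HW, the teacher and student may have different channel counts as long as their spatial sizes agree.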

Building on this method, the invention further condenses same-class feature distributions into knowledge. Instances of the same class have similar feature distributions, which form a cluster when mapped into a high-dimensional space. The intra-class compactness and inter-class dispersion of the features extracted by the teacher model are embodied in the information of these clusters, which is characterized as class feature knowledge for the student model to learn. Specifically, for the output features of one or more specific layers of the teacher-student network, the instances are grouped by label, the center of the feature distributions of the instances sharing a label is computed, and the student model learns the class feature knowledge extracted by the teacher model by minimizing the distance between the teacher's and the student's centers for each class.

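1.1.2 Class center knowledge distillation loss

The invention denotes the feature map of a specific output layer in the network as $F \in \mathbb{R}^{C \times HW}$. For an instance k, the teacher and student networks generate the feature maps $(F_T)^k$ and $(F_S)^k$ respectively, which can be regarded as two feature distributions $(F_T)^k \in \mathbb{R}^{C_T \times HW}$ and $(F_S)^k \in \mathbb{R}^{C_S \times HW}$ and are mapped into the reproducing kernel Hilbert space as $\phi((F_T)^k)$ and $\phi((F_S)^k)$. The cluster centers $c_T$ and $c_S$ are then computed over the feature distributions of the N instances with the same label, and the distance $d(c_T, c_S)$ between the two is calculated, as shown in formula (3).

$$c_T = \frac{1}{N}\sum_{k=1}^{N}\phi\left((F_T)^k\right), \qquad c_S = \frac{1}{N}\sum_{k=1}^{N}\phi\left((F_S)^k\right), \qquad d(c_T, c_S) = \left\| c_T - c_S \right\|_2^2 \tag{3}$$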

Note that the feature maps of the corresponding output layers of the teacher and student networks must have the same spatial dimension H×W; if the feature map sizes do not match, interpolation is required.
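The text does not state which interpolation is used. As one possible sketch, assuming nearest-neighbour sampling (an illustrative choice, not the method of the invention), a student channel can be resized to the teacher's spatial size as follows; `resize_nearest` and the toy values are hypothetical.

```python
def resize_nearest(channel, out_h, out_w):
    """Resize a 2-D list (H x W) to out_h x out_w by nearest-neighbour sampling."""
    in_h, in_w = len(channel), len(channel[0])
    return [
        [channel[min(in_h - 1, (r * in_h) // out_h)][min(in_w - 1, (c * in_w) // out_w)]
         for c in range(out_w)]
        for r in range(out_h)
    ]

student_channel = [[1.0, 2.0],
                   [3.0, 4.0]]                     # hypothetical 2 x 2 student channel
matched = resize_nearest(student_channel, 4, 4)    # resized to a 4 x 4 teacher-sized map
```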

The mapping function $\phi(\cdot)$ is expensive to compute explicitly, and the kernel trick $k(x, y) = \langle \phi(x), \phi(y) \rangle = \phi(x)^{\top}\phi(y)$ can be used to simplify the calculation. The samples are normalized with the L2 norm, $f^{i\cdot} / \| f^{i\cdot} \|_2$, to ensure that they are compared on the same scale; $d(c_T, c_S)$ is then restated as shown in formula (4).

Wherein: $k(\cdot,\cdot)$ denotes the kernel function that projects feature vectors into a higher-dimensional or infinite-dimensional feature space. A polynomial kernel function is used, $k(x, y) = (x^{\top}y + c)^{d}$, where c=0 and d=2. $C_T$ and $C_S$ denote the numbers of channels of the teacher network and the student network respectively, $f_T^{i\cdot}$ and $(f_T^{i\cdot})^{\top}$ denote the feature map of the i-th channel of the teacher network and its transpose, and $f_S^{j\cdot}$ and $(f_S^{j\cdot})^{\top}$ denote the feature map of the j-th channel of the student network and its transpose.
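The kernel trick can be checked numerically for the polynomial kernel with c = 0 and d = 2. In this sketch the explicit map `explicit_phi` (all pairwise products of the coordinates) and the sample vectors are illustrative assumptions; the point is only that the kernel value equals the inner product of the explicitly mapped vectors.

```python
import math

def l2_normalize(x):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]

def poly_kernel(x, y, c=0.0, d=2):
    """Polynomial kernel k(x, y) = (x . y + c) ** d."""
    return (sum(a * b for a, b in zip(x, y)) + c) ** d

def explicit_phi(x):
    """Explicit feature map for c = 0, d = 2: all pairwise products x_a * x_b."""
    return [a * b for a in x for b in x]

x = l2_normalize([1.0, 2.0, 2.0])   # hypothetical flattened channel
y = l2_normalize([2.0, 0.0, 1.0])   # hypothetical flattened channel

via_kernel = poly_kernel(x, y)
via_mapping = sum(a * b for a, b in zip(explicit_phi(x), explicit_phi(y)))
```

The explicit map has HW squared coordinates, while the kernel needs only one HW-dimensional dot product, which is why the kernel form is preferred.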

Then the M class center distances are summed to obtain the class center knowledge distillation loss $L_{CC}$ of a specific output layer, as shown in formula (5).

$$L_{CC} = \sum_{i=1}^{M} d\left(c_T^{i}, c_S^{i}\right) \tag{5}$$

Wherein $c_T^{i}$ and $c_S^{i}$ denote the cluster centers of the teacher network and the student network corresponding to the i-th class center distance.
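A minimal sketch of a class center loss for one output layer is given below. It assumes that each class center is the mean kernel embedding of all L2-normalized channel samples of the instances sharing a label, and that the squared distance between centers is expanded with the polynomial kernel; this expansion is an assumption made for illustration and may differ in detail from formula (4). All names and toy values are hypothetical.

```python
import math
from collections import defaultdict

def l2_normalize(x):
    norm = math.sqrt(sum(v * v for v in x)) or 1.0   # guard against all-zero channels
    return [v / norm for v in x]

def poly_kernel(x, y, c=0.0, d=2):
    return (sum(a * b for a, b in zip(x, y)) + c) ** d

def mean_kernel(a_samples, b_samples):
    """Average kernel value over all pairs drawn from the two sample lists."""
    total = sum(poly_kernel(a, b) for a in a_samples for b in b_samples)
    return total / (len(a_samples) * len(b_samples))

def center_distance(teacher_samples, student_samples):
    """Squared distance between two mean embeddings, expanded with the kernel trick."""
    return (mean_kernel(teacher_samples, teacher_samples)
            + mean_kernel(student_samples, student_samples)
            - 2.0 * mean_kernel(teacher_samples, student_samples))

def class_center_loss(teacher_feats, student_feats, labels):
    """teacher_feats[k] / student_feats[k]: list of flattened channels for instance k."""
    pooled_t, pooled_s = defaultdict(list), defaultdict(list)
    for f_t, f_s, label in zip(teacher_feats, student_feats, labels):
        pooled_t[label].extend(l2_normalize(ch) for ch in f_t)
        pooled_s[label].extend(l2_normalize(ch) for ch in f_s)
    # Sum the center distances over all classes present in the batch.
    return sum(center_distance(pooled_t[c], pooled_s[c]) for c in pooled_t)

# Hypothetical batch: 3 instances, teacher with 2 channels, student with 1, HW = 2.
teacher_feats = [[[1.0, 0.0], [0.0, 1.0]],
                 [[1.0, 1.0], [1.0, 0.0]],
                 [[0.0, 2.0], [0.0, 1.0]]]
student_feats = [[[1.0, 0.0]],
                 [[1.0, 1.0]],
                 [[0.0, 1.0]]]
labels = ["farm", "farm", "port"]

loss = class_center_loss(teacher_feats, student_feats, labels)
```

Pooling the channels of all same-label instances is equivalent to averaging per-instance mean embeddings only when every instance contributes the same number of channels, which holds for a fixed network layer.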

The invention distills the feature maps of several intermediate output layers of the network, as shown in FIG. 1; the selection of the output layers is described in the experimental setup.

1.2 Overall loss function

During training, the standard cross-entropy loss function forces the student model to match the ground-truth labels, which helps improve the performance of the student model, as shown in formula (6).

$$L_{CE} = -\log p_{y_{true}}, \qquad p_i = \frac{\exp(z_i)}{\sum_{j}\exp(z_j)} \tag{6}$$

Wherein $p_i$ denotes the predicted probability that a sample belongs to class i, $z_i$ denotes the logit of class i, and $y_{true}$ denotes the ground-truth label.

Thus, the overall objective function during training, which comprises the class center knowledge distillation loss and the standard cross-entropy loss, can be expressed as

$$L = L_{CE} + \lambda \sum L_{CC} \tag{7}$$

Note that $\sum$ denotes summation over all matched layers selected by the invention, and $\lambda$ is the hyper-parameter that balances the two loss terms.
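How the two terms combine can be sketched as follows, assuming the cross-entropy term is added to λ times the sum of the per-layer distillation losses. λ = 50 and the four matched layers follow the experimental settings described in this document; the logits and per-layer loss values are hypothetical.

```python
import math

def cross_entropy(logits, true_class):
    """Standard softmax cross-entropy for one sample, computed stably via log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[true_class]

def total_loss(logits, true_class, layer_distill_losses, lam=50.0):
    """Cross-entropy plus lambda times the distillation losses summed over matched layers."""
    return cross_entropy(logits, true_class) + lam * sum(layer_distill_losses)

logits = [2.0, 0.5, -1.0]                        # hypothetical student logits
layer_losses = [0.004, 0.002, 0.001, 0.0005]     # hypothetical losses of four matched layers
loss = total_loss(logits, 0, layer_losses)
```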

1.3 Algorithm procedure

Firstly, the teacher model is trained with a standard supervised learning strategy. To give the teacher model a good feature representation capability, a pre-training and fine-tuning mechanism is adopted: the teacher model is pre-trained on the ImageNet dataset in advance and then fine-tuned on the remote sensing scene classification dataset. The fine-tuned teacher network then guides the training of the student network, which learns the feature extraction capability of the teacher network through the class center distillation loss while also being supervised by the ground-truth labels. Finally, the prediction performance of the student network is tested on its own. The whole training process is as follows:

to verify the effect of the invention, the following experiments were performed:

2 Experiments and analysis

2.1 Datasets

Comprehensive experiments are carried out on four mainstream remote sensing image scene classification datasets: RSC11, UC Merced Land-Use (UCM), RSSCN7 and Aerial Image Dataset (AID). Details of the datasets are given in Table 1.

TABLE 1 remote sensing image scene classification dataset

Some images were randomly drawn from the datasets as example samples, as shown in FIG. 2. The left two columns of FIG. 2 show that the scene classification datasets have high intra-class diversity, for example in cultivated land, residential areas, grassland and tourist attractions. The right four columns of FIG. 2 show that some scene classes in the datasets are highly similar and difficult to distinguish, such as highways, interchanges and railways in the RSC11 dataset; the various subdivided residential areas and buildings in the UCM dataset; industrial and residential areas, and farmland and grassland, in the RSSCN7 dataset; and deserts and bare land, and lakes and parks, in the AID dataset. This poses a significant challenge for a lightweight classification network with a compact structure.

2.2 Experimental setup

Network structure. The invention adopts the ResNet and MobileNet families as the backbones of the teacher-student network. The teacher network is ResNet-50, and the student network is ResNet-18 or MobileNetV2, in order to study knowledge distillation performance when the teacher and student networks belong to the same family and to different families. The invention applies class center knowledge distillation to the feature maps of four intermediate output layers of the network; the network structures and the feature sizes of each output layer are listed in Table 2.

Table 2 teacher-student network structure and output layer characteristic information

Wherein F denotes the feature map of an intermediate output layer selected for distillation, and $C \times N \times M$ indicates that the feature map has a size of N×M and C channels.

Experimental configuration. The invention carries out the comprehensive experiments in a PyTorch environment on an NVIDIA Tesla 4. In the training stage, random flipping and Gaussian blur with a random radius are used for data augmentation; in the testing stage, the test data are not augmented. Teacher network fine-tuning uses a batch size of 64, an initial learning rate of 1e-4 with exponential decay, stochastic gradient descent (SGD) with a momentum of 0.9 as the optimizer, and 160 iterations. Both the separate training and the knowledge distillation of the student network use a batch size of 32, an initial learning rate of 0.05, again with exponential decay, SGD with a momentum of 0.9 as the optimizer, and 240 iterations. All training uses an early stopping strategy: training is terminated if the validation loss has not decreased for 30 consecutive iterations.
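The learning rate schedule and the early stopping rule can be sketched in plain Python. The decay factor `gamma` is not given in the text and is hypothetical here, as are the validation losses; treating each iteration as one epoch is also an assumption of this sketch.

```python
def exponential_lr(initial_lr, gamma, epoch):
    """Exponentially decayed learning rate: initial_lr * gamma ** epoch."""
    return initial_lr * (gamma ** epoch)

class EarlyStopping:
    """Signal a stop when the validation loss has not improved for `patience` steps."""
    def __init__(self, patience=30):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one validation loss; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=30)
val_losses = [1.0, 0.8, 0.7] + [0.75] * 40    # hypothetical validation losses
stopped_at = None
for epoch, val_loss in enumerate(val_losses):
    lr = exponential_lr(0.05, gamma=0.98, epoch=epoch)   # gamma is a hypothetical value
    if stopper.step(val_loss):
        stopped_at = epoch
        break
```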

Hyper-parameter settings. The invention sets the balance factor λ to 50 and uses 20 epochs of linear warm-up. These settings were obtained experimentally. Within a suitable range λ has little influence on the overall accuracy, but the accuracy tends to decrease as λ increases. A likely reason is that a large λ produces a large initial loss, so linear warm-up over a certain period is used to reduce the initial loss; 20 epochs of linear warm-up were finally found to improve the accuracy effectively.
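One way to realize the linear warm-up is sketched below. It assumes that the warm-up is applied to the balance factor λ itself, ramping it linearly to 50 over the first 20 epochs; the text does not state exactly which quantity is warmed up, so `warmup_lambda` is a hypothetical helper.

```python
def warmup_lambda(epoch, target=50.0, warmup_epochs=20):
    """Linearly ramp the distillation weight up to `target` over `warmup_epochs` epochs.

    `epoch` is zero-based; from epoch `warmup_epochs - 1` onward the full weight is used.
    """
    if epoch >= warmup_epochs:
        return target
    return target * (epoch + 1) / warmup_epochs

schedule = [warmup_lambda(e) for e in range(25)]
```

With this schedule the distillation term starts at one twentieth of its final weight, which keeps the initial loss small.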

2.3 Experimental results and analysis

2.3.1 Effectiveness experiments

To verify the effectiveness of the class center knowledge distillation method, experiments with different teacher-student architectures and training ratios are first carried out on the classical remote sensing image scene classification dataset UCM, and the method is compared with eight advanced knowledge distillation methods: KD, DKD, NST, VID [Ahn S, Hu S X, Damianou A, et al. Variational information distillation for knowledge transfer [C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2020: 9155-9163. DOI:10.1109/CVPR.2019.00938], KDSVD, ReviewKD, RKD and SP. For ease of analysis, these methods are grouped into four categories according to knowledge type. The experimental results are shown in Table 3; the method provided by the invention achieves the best accuracy.

In the experiments where the teacher and student networks belong to the same family, the class center knowledge distillation method improves the overall classification accuracy by more than 5% compared with training the student network alone. At a training ratio of 80% it improves on the most advanced DKD method by 1.43%, reaching an overall accuracy of 97.14%; at a training ratio of 60% it improves on the most accurate ReviewKD method by 2.74%, reaching an overall accuracy of 95.36%.

In the experiments where the teacher and student networks belong to different families, the accuracy is also improved. The distillation results with MobileNetV2 as the student model show that the method achieves accuracies of 94.76% and 92.62% at training ratios of 80% and 60% respectively, which is 3.33% and 2.14% higher than training alone, and 0.24% and 0.48% higher than the second-best distillation method.

The experimental results show that the method provided by the invention improves markedly on separate training and outperforms current advanced knowledge distillation methods. The method can therefore effectively extract valuable information from the teacher model and convert it into knowledge that is transferred to the student model. The two groups of experiments also show that the lower the training ratio, the larger the accuracy improvement of the method and the more obvious the advantage of class center knowledge distillation.

Table 3 Overall accuracy (%)

Note that Baseline refers to the result of training the student network alone. The highest accuracy among all methods is shown in bold and the second highest is underlined. The results are averaged over 10 experiments.

Table 4 compares the sizes of the network models before and after compression. The computational complexity of a model is measured in floating point operations (FLOPs), its storage size is measured in number of parameters (Parameters), and the compression ratio is the ratio of the student network (after model compression) to the teacher network (before model compression). The results show that, on the UCM dataset at a training ratio of 60% and relative to the ResNet-50 model, the ResNet-18 model achieves a compression ratio of 47.52% and compensates for 5.36% of accuracy, while the MobileNetV2 model achieves a compression ratio of 9.55% and reduces the computational complexity to 7.6% of the original.

Table 4 comparison of network model sizes

2.3.2 Applicability experiments

To verify the general applicability of the class center knowledge distillation method, the invention carries out supplementary experiments on RSC11, a small dataset with few classes and few samples; RSSCN7, a multi-scale dataset with few classes and many samples; and AID, a large and complex dataset with many classes and an uneven number of samples per class. Table 5 shows the results for the same-family teacher-student network at a training ratio of 60%. The proposed method achieves accuracies of 94.37% and 94.10% on the RSC11 and AID datasets respectively, which is 4.23% and 5.12% higher than training alone, and 2.42% and 1.07% higher than the second-best knowledge distillation method. Notably, current advanced knowledge distillation methods perform poorly on the RSSCN7 dataset: compared with training the student network alone, their accuracy decreases rather than increases, and only the RKD method maintains the accuracy, whereas the method provided by the invention achieves an improvement of 2.95% and reaches an accuracy of 91.70%. It can also be observed that on RSC11 and RSSCN7, the two datasets with fewer classes, the proposed distillation method not only improves significantly on separate training but even exceeds the accuracy of the pre-trained and fine-tuned teacher model. This indicates that the class center knowledge distillation method transfers more specialized knowledge than other distillation methods; the specific reasons are analyzed later.

Table 5 overall accuracy of multiple data sets under 60% training rate and isomorphic teacher-student network conditions (%)

Note that the teacher network was res net-50, showing the pre-trained-tuned classification results, the student network was res net-18, showing the results without distillation training alone. The highest precision results among all methods are indicated in bold, the next highest in underline, and italicized bold means that teacher classification precision is exceeded. The results were averaged over 10 experiments.

In summary, across three groups of experiments covering teacher-student networks with different architectures, different training ratios, and datasets of different scales, the proposed method performs well: its accuracy exceeds both that of independently training the student network and that of the advanced knowledge distillation methods, which demonstrates good generalization capability.

2.3.3 comparative experiments

To verify the superiority of the proposed method, the invention carries out a comparative experiment on the RSC11 dataset between the proposed method and the NST distillation method, and analyzes the results together with those of the fine-tuned ResNet-50 teacher network and the independently trained ResNet-18 student network. To ensure that only one variable changes, the NST distillation method is modified to match the same output-layer features as the method of the invention, so that the effectiveness of class center knowledge can be studied in isolation.

The overall classification accuracy is shown in Table 6: the proposed method improves on the improved NST method by 1.811%. An accuracy confusion matrix is then computed from the experimental results, so that the classification accuracy of each class, together with its misclassifications and omissions, can be inspected visually. Element (i, j) of the confusion matrix is the proportion of test samples with label i that are classified as class j, relative to the total number of test images; the result is shown in fig. 3.
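The confusion matrix described above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function names and the toy labels are assumptions made for this example and do not come from the patent's experiments. `confusion_matrix` follows the definition in the text (each entry is divided by the total number of test images), while `per_class_accuracy` additionally normalizes each row by its own total, which is one common way to read off the accuracy of an individual class.

```python
from typing import List, Sequence


def confusion_matrix(labels: Sequence[int], preds: Sequence[int],
                     num_classes: int) -> List[List[float]]:
    """Entry (i, j): share of all test images that have label i and prediction j."""
    counts = [[0] * num_classes for _ in range(num_classes)]
    for y, p in zip(labels, preds):
        counts[y][p] += 1
    total = len(labels)
    return [[c / total for c in row] for row in counts]


def overall_accuracy(matrix: List[List[float]]) -> float:
    """Sum of the diagonal, i.e. the share of correctly classified images."""
    return sum(matrix[i][i] for i in range(len(matrix)))


def per_class_accuracy(matrix: List[List[float]]) -> List[float]:
    """Diagonal entry divided by its row total (0.0 for a class with no samples)."""
    result = []
    for i, row in enumerate(matrix):
        row_total = sum(row)
        result.append(row[i] / row_total if row_total > 0 else 0.0)
    return result


# Toy labels and predictions, made up purely for illustration.
labels = [0, 0, 1, 1, 2, 2, 2, 2]
preds = [0, 1, 1, 1, 2, 2, 0, 2]
cm = confusion_matrix(labels, preds, num_classes=3)
print(overall_accuracy(cm))    # 0.75
print(per_class_accuracy(cm))  # [0.5, 1.0, 0.75]
```

Off-diagonal entries in row i show which classes absorb the misclassified samples of class i, which is how the misclassification and omission patterns are read from the matrix.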

Table 6 Results of the comparative experiment on the RSC11 dataset

As can be seen from fig. 3, the method of the invention improves the accuracy of every class: ports and high buildings improve by 7.69% and 6.80% over the improved NST distillation method, and the accuracy of the other classes improves by 1-4%. The classes with relatively low classification accuracy are highways, interchanges and railways. Notably, for these three classes the accuracy of the proposed method is higher than that of the separately trained student model and even greatly exceeds that of the teacher model.

In addition, the test error curves (fig. 4) show that the curve of the proposed method (blue line) converges faster than that of training alone (orange line). The curve of the improved NST distillation method (red line) also converges quickly, but then oscillates and drifts upward, which indicates that the NST distillation loss is strongly affected by random samples and that the trained model risks overfitting. The proposed class center distillation loss, by contrast, handles noise well: the loss value converges quickly and remains stable.

2.3.4 visual analysis

The invention uses the t-SNE algorithm to visualize the feature extraction capability of the models on the challenging RSSCN7 dataset. The t-SNE algorithm reduces the dimensionality of high-dimensional data and can represent high-dimensional features in a two-dimensional visual space. The high-dimensional features extracted by each model are visualized with t-SNE to assess whether the proposed method effectively addresses intra-class diversity and inter-class similarity. As shown in fig. 5, compared with the student model trained alone and with the improved NST method, the features extracted by the proposed method form more compact clusters within each class, while the clusters of different classes are more widely separated; this is most evident for the three class clusters circled in red. The features extracted by the method thus have good intra-class compactness and inter-class discreteness, and effectively address the intra-class diversity and inter-class similarity of remote sensing image scenes.
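The t-SNE plot is a qualitative check. As a hedged numerical companion, the sketch below computes two simple quantities from labelled feature vectors: the mean distance of each feature to its own class center (smaller means more compact classes) and the mean distance between class centers (larger means better separated classes). These two measures, the function names, and the toy data are assumptions introduced for illustration; the patent itself reports only the visualization.

```python
import math
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def class_centers(features: Sequence[Vector],
                  labels: Sequence[int]) -> Dict[int, List[float]]:
    """Mean feature vector of every class."""
    groups = defaultdict(list)
    for feature, label in zip(features, labels):
        groups[label].append(feature)
    return {label: [sum(col) / len(vectors) for col in zip(*vectors)]
            for label, vectors in groups.items()}


def compactness_and_separation(features: Sequence[Vector],
                               labels: Sequence[int]) -> Tuple[float, float]:
    """Return (mean distance to own class center, mean distance between centers)."""
    centers = class_centers(features, labels)
    if len(centers) < 2:
        raise ValueError("at least two classes are needed to measure separation")
    intra = sum(math.dist(f, centers[y])
                for f, y in zip(features, labels)) / len(features)
    pairs = list(combinations(centers.values(), 2))
    inter = sum(math.dist(a, b) for a, b in pairs) / len(pairs)
    return intra, inter


# Toy two-dimensional features, made up purely for illustration.
features = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labels = [0, 0, 1, 1]
print(compactness_and_separation(features, labels))  # (1.0, 10.0)
```

Comparing these two numbers for features produced by different models gives a rough, model-agnostic indication of the trend that the two-dimensional plot shows visually.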

To further analyze what the student network has learned, the invention visualizes the four intermediate network output layers that are designed to be matched during distillation. It uses the Grad-CAM method of Selvaraju et al. [Selvaraju R R, Cogswell M, Das A, et al. Grad-CAM: Visual explanations from deep networks via gradient-based localization [C]//2017 IEEE International Conference on Computer Vision (ICCV). IEEE, 2017: 618-626. DOI: 10.1109/ICCV.2017.74] to draw heat maps that display the regions and feature information on which the intermediate layers focus, thereby exploring the feature extraction capability of the student network. As shown in fig. 6, the heat map values visualize the attention of the model: a high value means that the model attends to that region, and the darker the color of a region in the heat map, the higher the attention. Reading the figure horizontally shows that different output layers emphasize different things, and that the deeper the layer, the more abstract the features it focuses on; for example, feature layer 1 focuses on edge features, while feature layer 4 focuses on semantic scene features. Compared with training alone and with the improved NST method, the proposed method combines the attention regions of the teacher and student networks and transfers the feature extraction capability of the teacher network well. Across the heat maps of all output layers, the proposed method attends to a larger receptive field and is less likely to mispredict by paying excessive attention to features that are similar between classes. For example, the original image shows an industrial area; the improved NST method mispredicts it as a residential area because it pays excessive attention to the houses in the image and is strongly affected by noise, whereas the features extracted by the proposed method cope effectively with inter-class similarity.
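The core of the Grad-CAM heat map computation can be sketched as follows. Each channel of a layer's output is weighted by the spatial average of the gradient of the class score with respect to that channel, the weighted channels are summed, and negative values are clipped by a ReLU. This is a minimal plain-Python sketch under stated assumptions: the activations and gradients are passed in as already computed nested lists (in practice they would come from a deep learning framework's automatic differentiation), and the function name, the final scaling to [0, 1], and the toy values are choices made for this example.

```python
from typing import List

Map2D = List[List[float]]


def grad_cam(activations: List[Map2D], gradients: List[Map2D]) -> Map2D:
    """Combine K feature maps (each H x W) into one heat map, Grad-CAM style."""
    height, width = len(activations[0]), len(activations[0][0])
    # Channel weight: spatial average of the class-score gradient for that channel.
    weights = [sum(map(sum, grad)) / (height * width) for grad in gradients]
    heat = [[0.0] * width for _ in range(height)]
    for weight, act in zip(weights, activations):
        for r in range(height):
            for c in range(width):
                heat[r][c] += weight * act[r][c]
    # ReLU keeps only the regions that increase the class score.
    heat = [[max(value, 0.0) for value in row] for row in heat]
    peak = max(max(row) for row in heat)
    # Scale to [0, 1] for display; an all-zero map is returned unchanged.
    return [[value / peak for value in row] for row in heat] if peak > 0 else heat


# Two toy 2 x 2 channels: the first supports the class, the second opposes it.
activations = [[[1.0, 0.0], [0.0, 0.0]],
               [[0.0, 0.0], [0.0, 2.0]]]
gradients = [[[1.0, 1.0], [1.0, 1.0]],
             [[-1.0, -1.0], [-1.0, -1.0]]]
print(grad_cam(activations, gradients))  # [[1.0, 0.0], [0.0, 0.0]]
```

In the toy example, the region that is active only in the negatively weighted channel is suppressed by the ReLU, so only the top-left cell remains highlighted.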

The invention provides a class center knowledge distillation method for remote sensing image scene classification tasks, which obtains a high-performance lightweight network deployable on edge computing devices through two training steps: fine-tuning the teacher network and teacher-student network distillation. The class center distillation loss function designed by the invention efficiently transfers the strong feature extraction capability of the complex network by matching the centers of the same-class feature distributions extracted by the teacher and student networks, so that the lightweight network can cope with the challenges of high intra-class diversity and low inter-class separability in scene classification tasks. The invention carries out a series of comprehensive experiments on four public remote sensing image scene classification benchmark datasets to evaluate the effectiveness of the class center knowledge distillation method. The conclusions of the experiments are summarized as follows:

(1) The invention carries out effectiveness experiments on the classical UCM high-resolution land-use dataset under two training ratios (60% and 80%) and two teacher-student pairs (ResNet-50 with ResNet-18, and ResNet-50 with MobileNet-V2), and compares the method with eight existing advanced knowledge distillation methods of four types. The experimental results show that the class center knowledge distillation method performs best with both isomorphic and heterogeneous teacher-student networks, and that the lower the training ratio, the larger the accuracy gain. Applicability experiments are then carried out on RSC11 (a small dataset with few classes and few samples), RSSCN7 (a multi-scale dataset with few classes and many samples) and AID (a large, complex dataset with many classes and uneven samples), and the results show that the method has good generalization capability.

(2) The invention also carries out a comparative experiment against the improved NST algorithm. The confusion matrices show that the class center knowledge distillation method remains superior on the challenging classes: it preserves the classification capability of the student network while also learning the knowledge of the teacher network. The test error curves show that the distillation loss function handles noise better, so that the loss value converges quickly and remains stable, which verifies the superiority of the method. In addition, the invention visualizes the feature extraction capability of the models with the t-SNE algorithm and draws Grad-CAM heat maps to visualize the attention regions of the output layers; the results show that the features extracted by the method have good intra-class compactness and inter-class discreteness.

In summary, the class center knowledge distillation method provided by the invention improves the classification accuracy of a compact network and performs best among the compared distillation methods.
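The idea of matching same-class feature centers, as summarized above, can be illustrated with a deliberately simplified sketch. It assumes a linear kernel, under which matching the centers of two feature distributions reduces to the squared distance between their mean vectors, and it assumes that teacher and student features have already been brought to the same dimension. The loss defined in the invention uses a kernel function k(·, ·) and is not reproduced here; the function names and toy features below are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence

Vector = Sequence[float]


def class_means(features: Sequence[Vector],
                labels: Sequence[int]) -> Dict[int, List[float]]:
    """Mean feature vector (class center) of every class in the batch."""
    groups = defaultdict(list)
    for feature, label in zip(features, labels):
        groups[label].append(feature)
    return {label: [sum(col) / len(vectors) for col in zip(*vectors)]
            for label, vectors in groups.items()}


def class_center_loss(teacher_feats: Sequence[Vector],
                      student_feats: Sequence[Vector],
                      labels: Sequence[int]) -> float:
    """Squared distance between teacher and student class centers, averaged over classes.

    Simplification: linear kernel only, and equal feature dimensions are assumed.
    """
    teacher_centers = class_means(teacher_feats, labels)
    student_centers = class_means(student_feats, labels)
    total = 0.0
    for label, t_center in teacher_centers.items():
        s_center = student_centers[label]
        total += sum((t - s) ** 2 for t, s in zip(t_center, s_center))
    return total / len(teacher_centers)


# Toy features for one batch of four samples in two classes (illustrative only).
labels = [0, 0, 1, 1]
teacher = [(1.0, 0.0), (3.0, 0.0), (0.0, 4.0), (0.0, 6.0)]
student = [(1.0, 0.0), (1.0, 0.0), (0.0, 5.0), (0.0, 5.0)]
print(class_center_loss(teacher, student, labels))  # 0.5
```

Because only the per-class means enter this simplified loss, individual noisy samples influence it less than they would a loss that matches every sample separately, which is consistent with the stable convergence reported for the class center loss.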

The foregoing is merely illustrative of the preferred embodiments of this invention, and it will be appreciated by those skilled in the art that changes and modifications may be made without departing from the principles of this invention, and it is intended to cover such modifications and changes as fall within the true scope of the invention.

Claims

1. A class center knowledge distillation method for remote sensing image scene classification is characterized by comprising the following steps:

2. The remote sensing image scene classification oriented class center knowledge distillation method as claimed in claim 1, further comprising:

3. The remote sensing image scene classification oriented class center knowledge distillation method according to claim 1, wherein data enhancement is performed by using random flipping and Gaussian blur with a random radius in the network training stage, and the test data is not enhanced in the test stage.

4. The remote sensing image scene classification oriented class center knowledge distillation method according to claim 1, wherein condensing the similar feature distribution into knowledge, extracting knowledge from the adjusted teacher network comprises:

5. The remote sensing image scene classification oriented class center knowledge distillation method according to claim 4, wherein the class center distillation loss function is:

where the two cluster-center symbols respectively represent the cluster centers of the teacher network and of the student network for the i-th class; the class center distance represents the distance between these two cluster centers; N represents the number of instances with the same label; k(·, ·) represents the kernel function that projects the feature vectors into a higher- or infinite-dimensional feature space; C_T and C_S respectively represent the numbers of channels of the teacher network and of the student network; and the remaining symbols respectively represent the feature map and corresponding transpose of the i-th channel of the teacher network, and the feature map and corresponding transpose of the j-th channel of the student network.

6. The remote sensing image scene classification oriented class center knowledge distillation method according to claim 5, wherein in the training process, an overall loss function is adopted as follows:

in the method, in the process of the invention,

7. The remote sensing image scene classification oriented class center knowledge distillation method according to claim 1, wherein said teacher network adopts ResNet-50, and said student network adopts ResNet-18 or MobileNet-V2.