CN116665068A - Mixed knowledge decoupling knowledge distillation algorithm for remote sensing target detection - Google Patents
Mixed knowledge decoupling knowledge distillation algorithm for remote sensing target detection
- Publication number
- CN116665068A (application CN202310521321.5A)
- Authority
- CN
- China
- Prior art keywords
- knowledge
- distillation
- model
- student
- layer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000001514 detection method Methods 0.000 title claims abstract description 60
- 238000013140 knowledge distillation Methods 0.000 title claims abstract description 41
- 238000004821 distillation Methods 0.000 claims abstract description 68
- 238000000034 method Methods 0.000 claims abstract description 26
- 230000006870 function Effects 0.000 claims description 53
- 230000008447 perception Effects 0.000 claims description 43
- 238000013138 pruning Methods 0.000 claims description 28
- 230000004927 fusion Effects 0.000 claims description 22
- 238000013508 migration Methods 0.000 claims description 13
- 230000005012 migration Effects 0.000 claims description 13
- 239000013598 vector Substances 0.000 claims description 10
- 230000008569 process Effects 0.000 claims description 9
- 238000012545 processing Methods 0.000 claims description 9
- 235000019580 granularity Nutrition 0.000 claims description 8
- 238000012216 screening Methods 0.000 claims description 8
- 230000035945 sensitivity Effects 0.000 claims description 7
- 230000009466 transformation Effects 0.000 claims description 7
- 238000000605 extraction Methods 0.000 claims description 6
- 230000004807 localization Effects 0.000 claims description 4
- 238000007781 pre-processing Methods 0.000 claims description 4
- 230000000694 effects Effects 0.000 claims description 3
- 238000009499 grossing Methods 0.000 claims description 3
- 230000007246 mechanism Effects 0.000 claims description 3
- 238000010606 normalization Methods 0.000 claims description 3
- 238000005457 optimization Methods 0.000 claims description 3
- 238000013461 design Methods 0.000 description 3
- 238000004364 calculation method Methods 0.000 description 2
- 238000012986 modification Methods 0.000 description 2
- 230000004048 modification Effects 0.000 description 2
- 238000011160 research Methods 0.000 description 2
- 230000009286 beneficial effect Effects 0.000 description 1
- 238000013135 deep learning Methods 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 238000010586 diagram Methods 0.000 description 1
- 230000003993 interaction Effects 0.000 description 1
- 238000005259 measurement Methods 0.000 description 1
- 230000003278 mimic effect Effects 0.000 description 1
- 238000000926 separation method Methods 0.000 description 1
- 238000006467 substitution reaction Methods 0.000 description 1
- 238000012360 testing method Methods 0.000 description 1
- 238000012549 training Methods 0.000 description 1
- 238000012546 transfer Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/10—Terrestrial scenes
- G06V20/13—Satellite images
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/082—Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/766—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using regression, e.g. by projecting features on hyperplanes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/778—Active pattern-learning, e.g. online learning of image or video features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/70—Labelling scene content, e.g. deriving syntactic or semantic representations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Computing Systems (AREA)
- Multimedia (AREA)
- Software Systems (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Computational Linguistics (AREA)
- Biomedical Technology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biophysics (AREA)
- Data Mining & Analysis (AREA)
- Molecular Biology (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Astronomy & Astrophysics (AREA)
- Remote Sensing (AREA)
- Image Processing (AREA)
- Geophysics And Detection Of Objects (AREA)
- Investigating Or Analysing Materials By Optical Means (AREA)
Abstract
The application discloses a knowledge distillation algorithm of mixed knowledge decoupling for remote sensing target detection, the method comprising the following steps: S1, constructing a remote sensing target detection model as the teacher model for knowledge distillation; S2, performing lightweight processing on the model to form the student model for knowledge distillation; S3, predicting bounding-box information and calculating the target detection loss; S4, the teacher model guiding the student model to decouple different types of knowledge; S5, distilling semantic knowledge at the output level and calculating the output-layer distillation loss; S6, cross-distilling semantic features and localization features of different layers and calculating the cross-feature distillation loss; and S7, calculating the total loss and optimizing the student model. The application solves the problem that remote sensing detection models with large parameter counts and complex structures are difficult to deploy to satellites and other edge devices; it not only lightens the remote sensing detector but also improves its performance.
Description
Technical Field
The application relates to the field of remote sensing image detection, and in particular to a knowledge distillation algorithm of mixed knowledge decoupling for remote sensing target detection.
Background
Remote sensing image target detection is one of the basic tasks of satellite image processing; it aims to extract the category and position information of targets from remote sensing images.
In recent years, with the rapid development of deep learning technology, research on remote sensing target detection has made significant breakthroughs and detection accuracy has greatly improved. However, high-precision detection algorithms rely on complex network models, and their deployment on satellites or other edge devices is hampered by enormous computational complexity and memory requirements. Research on lightweight remote sensing target detection algorithms therefore has great practical significance, and lightweight deep-network techniques are an effective means of improving the deployment feasibility of network models.
Model lightweighting techniques can reduce the number of parameters and the computation of a model, but they also affect its accuracy; the challenge is to balance the detection performance and the inference speed of the model. Knowledge distillation is an emerging lightweighting method built on a teacher-student training structure: the features learned by a complex network with strong learning ability are distilled and transferred to a network with few parameters and weaker learning ability, yielding a lightweight network that is both fast and capable.
At present, knowledge distillation is widely studied in image classification, target detection and related fields, but knowledge distillation for remote sensing target detection has not been fully studied. Existing methods can be divided by distillation stage into knowledge distillation of intermediate features (feature distillation) and knowledge distillation of predicted outputs (logit distillation). Both modes migrate knowledge equally from every distillation region, yet not all regions actually play the same role in knowledge migration. In addition, feature distillation usually requires the student to mimic teacher features at the same level, which isolates the feature migration processes of different levels and ignores the guiding value of the teacher's shallow features for the student's deep features.
Disclosure of Invention
In order to solve these problems, the application provides a knowledge distillation algorithm of mixed knowledge decoupling for remote sensing target detection, the specific scheme of which is as follows:
A knowledge distillation algorithm of mixed knowledge decoupling for remote sensing target detection comprises the following steps:
S1, constructing a remote sensing target detection model as the teacher model for knowledge distillation;
S2, performing lightweight processing on the model of step S1 to form the student model for knowledge distillation;
S3, the student model of step S2 predicting bounding-box information and calculating the target detection loss;
S4, the teacher model guiding the student model to decouple different types of knowledge: under the guidance of the real annotation data, generating a semantic perception mask and a positioning perception mask for knowledge migration according to the output of the teacher model;
S5, distilling semantic knowledge at the output level under the guidance of the semantic perception mask from step S4, and calculating the loss value of output-layer distillation;
S6, under the guidance of the semantic perception mask and positioning perception mask from step S4, cross-distilling semantic features and localization features of different layers and calculating the loss value of cross-feature distillation; the feature distillation processes between teacher and student at different layers are adaptively fused, so that shallow teacher features can guide deep student features;
and S7, calculating a total loss function value and optimizing the student model.
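For orientation, steps S3 to S7 can be read as one training iteration. The following is a minimal sketch of that loop, assuming a PyTorch-style detector; every name here (student, teacher, localization_masks, masked_logit_loss, fuse_features, masked_feature_loss) is a placeholder rather than part of the claimed scheme, and the helpers are sketched in the detailed description below.

```python
import torch

def distillation_step(images, targets, student, teacher, optimizer,
                      coeffs=(1.0, 1.0, 0.01, 0.01)):
    """One hypothetical training iteration covering steps S3-S7."""
    # S3: student forward pass and detection losses (hypothetical APIs)
    s_feats, s_cls, s_reg = student(images)
    l_cls, l_reg = student.detection_losses(s_cls, s_reg, targets)

    # S4: frozen teacher forward pass and knowledge-decoupling masks
    with torch.no_grad():
        t_feats, t_cls, t_reg = teacher(images)
    m_se = [c.sigmoid().max(dim=1).values for c in t_cls]   # formula (9)
    m_lo = localization_masks(t_reg, targets)               # formulas (10)-(11)

    # S5: masked output-layer (logit) distillation, formula (12)
    l_logit = masked_logit_loss(t_cls, s_cls, m_se, T=10.0)

    # S6: cross-feature distillation on fused student features, (13)-(17)
    h_stu = fuse_features(s_feats)
    l_feat = masked_feature_loss(t_feats, h_stu, m_se, m_lo)

    # S7: total loss, formula (18), and optimization
    a, b, g, lam = coeffs
    loss = a * l_cls + b * l_reg + g * l_logit + lam * l_feat
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```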
Preferably, in step S2, a pruning technique is adopted to perform light-weight processing on the remote sensing target detection model; and judging the importance of the channel by using the weight of the BN layer in the student model, and pruning the channel for all modules in the backbone network of the student model according to the importance of the channel.
Preferably, the steps of the lightweight processing specifically include:
S21, constructing the backbone network corresponding to the student model and loading the model parameters;
S22, traversing all BN layers in the student backbone network and recording the corresponding weights and channel counts; according to a manually set pruning rate θ, setting the post-pruning channel count to θ times the original count and recording it;
S23, traversing all BN layers and sorting their weights in descending order; screening the important weights to be retained according to the post-pruning channel counts and generating the pruning mask corresponding to each BN layer;
S24, traversing all modules in the student backbone network, screening the weights of the convolution, linear and BN layers along the appropriate dimension according to the indication of the pruning masks, and discarding the unwanted weights;
S25, constructing a lightweight backbone network according to the pruned channel counts; saving the pruned weights to the network and generating a student model file for initializing the student model.
Preferably, the step S3 specifically includes the following steps:
S31, preprocessing the input image, including unifying the image size and normalization, and finally converting it into tensor form;
S32, initializing the student model by loading the pruned lightweight network as the initial backbone network;
S33, inputting the image tensor into the student model, whose backbone and neck networks extract image features from shallow to deep, obtaining image features of different granularities $F^{stu} = \{F_k^{stu}\}_{k=1}^{n}$, the specific operation being as in formula (1):

$$F_k^{stu} = (S_k \circ S_{k-1} \circ \cdots \circ S_1)(X), \quad k = 1,\dots,n \tag{1}$$

where $(S_1, S_2, \dots, S_n)$ are the n stages of the student feature-extraction network, n is the number of features of different granularity, $\circ$ denotes function composition, and X is the image tensor;
S34, the multi-granularity student features $F^{stu}$ predict bounding-box information through the detection head of the student model, giving category prediction scores $P_{cls}^{stu}$ and regression prediction values $P_{reg}^{stu}$, as in formulas (2) and (3):

$$P_{cls,k}^{stu} = S_{cls}(F_k^{stu}) \tag{2}$$
$$P_{reg,k}^{stu} = S_{reg}(F_k^{stu}) \tag{3}$$

where $S_{cls}$ and $S_{reg}$ denote the category prediction layer and regression prediction layer of the student model, respectively;
S35, according to the real box labels Y and the regression targets Δ generated by the student detection head, calculating the classification loss value and regression loss value respectively, as in formulas (4) and (5):

$$L_{cls} = \mathcal{L}_{cls}(P_{cls}^{stu}, Y) \tag{4}$$
$$L_{reg} = \mathcal{L}_{reg}(P_{reg}^{stu}, \Delta) \tag{5}$$

where $\mathcal{L}_{cls}$ denotes the classification loss function and $\mathcal{L}_{reg}$ the regression loss function.
Preferably, the step S4 of generating the semantic perception mask and locating the perception mask specifically includes the steps of:
S41, loading the pre-trained teacher model and freezing the gradients of all its parameters so that they do not back-propagate;
S42, extracting image features from shallow to deep in the backbone and neck networks of the teacher model to obtain teacher image features of different granularities $F^{tea} = \{F_k^{tea}\}_{k=1}^{n}$, as in formula (6):

$$F_k^{tea} = (T_k \circ T_{k-1} \circ \cdots \circ T_1)(X), \quad k = 1,\dots,n \tag{6}$$

where $(T_1, T_2, \dots, T_n)$ are the n stages of the teacher feature-extraction network;
S43, the multi-granularity teacher image features $F^{tea}$ predict bounding-box information through the detection head of the teacher model, giving category prediction scores $P_{cls}^{tea}$ and regression prediction values $P_{reg}^{tea}$, as in formulas (7) and (8):

$$P_{cls,k}^{tea} = T_{cls}(F_k^{tea}) \tag{7}$$
$$P_{reg,k}^{tea} = T_{reg}(F_k^{tea}) \tag{8}$$

where $T_{cls}$ and $T_{reg}$ denote the category prediction layer and regression prediction layer of the teacher model, respectively;
S44, using the bounding-box information predicted by the teacher model to mine the boundary between semantic knowledge and localization knowledge, generating a semantic perception mask $M^{se}$ and a positioning perception mask $M^{lo}$ to capture the sensitivity of each distillation region to semantic knowledge and localization knowledge.
Preferably, the step S44 specifically includes the following steps:
S441, for each element of the teacher image features $F_k^{tea}$, calculating the maximum of all category prediction scores as the semantic perception mask $M_k^{se}$, as in formula (9):

$$M_k^{se}(i,j) = \max_{c \in \{c_1,\dots,c_K\}} P_{cls,k}^{tea}(i,j,c) \tag{9}$$

where K is the total number of categories and $(c_1, \dots, c_K)$ represent all target categories;
S442, for the teacher feature map $F_k^{tea}$, encoding the anchor boxes A into the corresponding predicted regression boxes $\hat{B}$ according to the regression prediction values $P_{reg}^{tea}$, then calculating the IoU between the predicted regression boxes and the real boxes GT and using it as the positioning perception mask $M_k^{lo}$, as in formulas (10) and (11):

$$\hat{B} = \mathrm{decode}(A, P_{reg}^{tea}) \tag{10}$$
$$M_k^{lo}(i,j) = \max_{m=1,\dots,M} \mathrm{IoU}(\hat{B}_m(i,j), GT) \tag{11}$$

where M represents the number of predicted boxes at each position and decode is the box decoding function.
Preferably, under the guidance of the semantic perception mask $M^{se}$, the output-layer distillation loss value in step S5 is calculated as in formula (12):

$$L_{logit} = \sum_{k=1}^{n} \frac{1}{\sum_{i,j} M_k^{se}(i,j)} \sum_{i=1}^{H_k} \sum_{j=1}^{W_k} M_k^{se}(i,j)\, \mathcal{L}_{logit}\big(p_k^{tea}(i,j),\, p_k^{stu}(i,j),\, T\big) \tag{12}$$

where $H_k$, $W_k$ denote the size of the k-layer feature map, $p_k^{tea}(i,j)$ is the probability vector at position (i, j) of the k-layer teacher feature map, $p_k^{stu}(i,j)$ is the probability vector at position (i, j) of the k-layer student feature map, $\mathcal{L}_{logit}$ is the logit distillation loss function measuring the similarity between the student predictions and the soft labels, T is the smoothing factor, $M_k^{se}(i,j)$ is the (i, j) element value of the k-layer semantic perception mask, and $\sum_{i,j} M_k^{se}(i,j)$ is the sum of all its element values.
Preferably, the cross-feature distillation process in step S6 comprises the steps of:
S61, sorting the student features $F^{stu}$ of different granularities in descending order of receptive field to obtain $\tilde{F}^{stu} = (\tilde{F}_1^{stu}, \dots, \tilde{F}_n^{stu})$, and fusing and updating them iteratively in sequence;

specifically, first initializing the fusion feature by applying a feature transformation, $H_1 = \phi(\tilde{F}_1^{stu})$; secondly, performing feature fusion: at iteration t, because the size of the predecessor fusion feature $H_{t-1}$ is inconsistent with the current feature $\tilde{F}_t^{stu}$, the two features are aligned by interpolation and the fusion feature of iteration t is obtained by weighted summation; after n−1 iterations, the feature order is reversed to finally obtain the multi-layer fusion features $H^{stu} = \{H_k^{stu}\}_{k=1}^{n}$, as in formulas (13) and (14):

$$H_1 = \phi(\tilde{F}_1^{stu}) \tag{13}$$
$$H_t = w_t^{1} \cdot \mathcal{I}(H_{t-1}) + w_t^{2} \cdot \phi(\tilde{F}_t^{stu}), \quad t = 2,\dots,n \tag{14}$$

where $\phi$ is the feature transformation layer, $w_t^{1}$ and $w_t^{2}$ are the fusion weights at iteration t, and $\mathcal{I}$ is an interpolation function;
S62, introducing the semantic perception mask $M^{se}$ and the positioning perception mask $M^{lo}$ to perform feature distillation separately for semantic knowledge and localization knowledge on the feature map; in addition, the student features used in the distillation process are the fusion features $H^{stu}$. The loss functions are given by formulas (15), (16) and (17):

$$L_{feat}^{se} = W_{se} \sum_{k=1}^{n} \frac{1}{\sum_{i,j} M_k^{se}(i,j)} \sum_{i=1}^{H_k} \sum_{j=1}^{W_k} M_k^{se}(i,j)\, \mathcal{L}_{feat}\big(F_k^{tea}(i,j),\, H_k^{stu}(i,j)\big) \tag{15}$$

$$L_{feat}^{lo} = W_{lo} \sum_{k=1}^{n} \frac{1}{\sum_{i,j} M_k^{lo}(i,j)} \sum_{i=1}^{H_k} \sum_{j=1}^{W_k} M_k^{lo}(i,j)\, \mathcal{L}_{feat}\big(F_k^{tea}(i,j),\, H_k^{stu}(i,j)\big) \tag{16}$$

$$L_{feat} = L_{feat}^{se} + L_{feat}^{lo} \tag{17}$$

where $M_k^{se}(i,j)$ and $M_k^{lo}(i,j)$ are the (i, j) element values of the k-layer semantic and positioning perception masks, $\sum_{i,j} M_k^{se}(i,j)$ and $\sum_{i,j} M_k^{lo}(i,j)$ are the sums of their element values, $\mathcal{L}_{feat}$ is a conventional feature distillation loss function, and $W_{se}$ and $W_{lo}$ represent the coefficients of the semantic feature distillation loss and the localization feature distillation loss, respectively.
Preferably, in step S7, the total loss function value includes classification loss, regression loss, cross-feature distillation loss, and output layer distillation loss of the remote sensing target detection task.
Preferably, the student model optimization in step S7 includes the steps of:
S71, calculating the gradient of the total loss function with respect to the student model parameters by back-propagation;
S72, updating the student model parameters along the gradient direction;
the total loss function value is calculated as in equation (18):
$$L_{total} = \alpha L_{cls} + \beta L_{reg} + \gamma L_{logit} + \lambda L_{feat} \tag{18}$$
where α, β, γ and λ denote the coefficients of the corresponding loss terms.
The application has the beneficial effects that:
the application solves the problem that the remote sensing detection model with large parameter and complex parameter is difficult to be deployed to the satellite and other edge equipment. The detection precision of the remote sensing detector is ensured by capturing the sensitivity of the distillation area to different types of knowledge and establishing the connection mode between different levels of feature migration, and meanwhile, the model parameter quantity, the calculation amount and the reasoning time are reduced. The method not only realizes the light weight of the remote sensing detector, but also improves the performance of the detector.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic flow chart of a knowledge distillation algorithm of mixed knowledge decoupling for remote sensing target detection.
Fig. 2 is an input picture provided in an embodiment of the present application.
Fig. 3 is a diagram illustrating the perception of semantic information and positioning information.
FIG. 4 shows the detection results before distillation provided in the embodiment of the application.
FIG. 5 shows the detection results after distillation provided in the embodiment of the application.
Note: the rectangular boxes in fig. 4 and fig. 5 are added manually to facilitate comparison of the detection results before and after distillation.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
As shown in fig. 1, a knowledge distillation algorithm for hybrid knowledge decoupling for remote sensing target detection includes the following steps:
S1, constructing a remote sensing target detection model as the teacher model for knowledge distillation.
The teacher model may be any of a number of mainstream remote sensing target detection models, for example single-stage methods such as Rotated RetinaNet and Rotated ATSS.
Specifically, the application takes Rotated RetinaNet as the base model, the backbone network adopts ResNet50, and the teacher model and the student model share the same detection framework.
S2: compared with the student model, the teacher model has stronger representation capability and a larger model size. To obtain a compact student model, the remote sensing target detection model of step S1 is lightened by pruning to form the student model for knowledge distillation; the importance of each channel is judged from the BN-layer weights of the student model, and channel pruning is performed on all modules in the student backbone network according to that importance.
The step of light weight treatment specifically comprises the following steps:
S21, constructing the backbone network corresponding to the student model and loading the model parameters; specifically, the ResNet50 network $R_O$ is constructed and its model parameters are loaded.
S22, traversing all BN layers in a student backbone network, and recording the corresponding weight and channel number; and changing the number of the channels after pruning into original theta times according to the manually set pruning rate theta, and recording the number of the channels after pruning.
Specifically, all BN layers of $R_O$ are traversed, recording the weight $W_i^{BN}$ and channel count $C_i^O$ of each layer (i = 1,…,N); according to the manually set pruning rate θ, the post-pruning channel count is set to $C_i^P = \theta \cdot C_i^O$ and recorded, where N is the number of BN layers contained in $R_O$ and θ takes the value 0.8.
S23, traversing all BN layers, and sorting the weights of the BN layers in a descending order; and screening important weights to be reserved according to the number of the channels after pruning, and generating pruning masks corresponding to each BN layer.
All BN layers are traversed to screen the important channels and generate the pruning masks. For the i-th BN layer, its weights $W_i^{BN}$ are sorted in descending order, and a mask $m_i$ indicating whether each channel is important is generated by screening on the weight values, as in formula (1'):

$$m_i(j) = \begin{cases} 1, & W_i^{BN}(j) \text{ ranks within the top } C_i^P \text{ weights} \\ 0, & \text{otherwise} \end{cases} \tag{1'}$$

After N traversals, a list $\{m_i\}_{i=1}^{N}$ storing the pruning masks is finally obtained.
S24, traversing all modules in the student backbone network, in particular traversing R O Is provided. And screening weights corresponding to the convolution layer, the linear layer and the BN layer in a certain dimension according to the indication of the pruning mask, and discarding the unwanted weights.
Specifically, for the i-th convolution layer, the weight is a tensor of size $C_i^{out} \times C_i^{in} \times K_i^{CONV} \times K_i^{CONV}$. The weight pruning process is: the weight tensor is indexed in dimension 0 according to the pruning mask of its output channels and in dimension 1 according to the pruning mask of its input channels, finally giving a tensor of size $C_i^{P,out} \times C_i^{P,in} \times K_i^{CONV} \times K_i^{CONV}$, where $K_i^{CONV}$ denotes the convolution kernel size of the i-th convolution layer.
The linear layer in ResNet50 is the class prediction layer, whose weight is a tensor of size $N\_C \times C^{in}$. The weight pruning process is: the weight tensor is indexed in dimension 1 according to the pruning mask to filter out the unimportant linear-layer weights, where N_C represents the total number of categories in the dataset.
S25, constructing a lightweight backbone network according to the number of the pruned channels; loading the pruned weights into the network and generating a student model file for initializing the student model.
Specifically, a new ResNet50 network $R_P$ is created according to the pruned channel counts $C^P$; all pruned weights of $R_O$ are loaded into $R_P$, and all weights of $R_P$ are saved into a model file to facilitate direct loading and use.
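As an illustration of the channel screening in steps S22 and S23, a minimal PyTorch sketch might generate the pruning masks as follows; the re-indexing of convolution, linear and BN weights in S24 then follows these masks. All names are illustrative, not taken from the patent.

```python
import torch
import torch.nn as nn

def bn_pruning_masks(backbone: nn.Module, theta: float = 0.8):
    """Generate a per-BN-layer channel mask keeping the top theta fraction
    of channels, ranked by the magnitude of the BN scale weights."""
    masks = []
    for module in backbone.modules():
        if isinstance(module, nn.BatchNorm2d):
            w = module.weight.detach().abs()           # BN scale weights
            c_pruned = max(1, int(theta * w.numel()))  # post-pruning channel count
            keep = torch.zeros_like(w, dtype=torch.bool)
            # indices of the largest weights (descending sort) are kept
            keep[w.argsort(descending=True)[:c_pruned]] = True
            masks.append(keep)
    return masks
```

Each mask would then index dimension 0 of the following convolution's weight (and dimension 1 of the next layer's weight), as described in S24.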
S3, predicting the boundary box information by the student model in the step S2 and calculating target detection loss. Specifically, the input image is first preprocessed to be converted into a tensor of 3×1024×1024. And secondly, obtaining class scores and boundary box information by the image tensor through the student model, and respectively calculating a class loss function value and a regression loss function value according to the remote sensing target detection loss function.
The method specifically comprises the following steps:
S31, preprocessing the input image, including unifying the image size and normalization, and finally converting it into tensor form;
specifically, the main step of preprocessing the input image includes unifying the image sizes to 1024×1024×3, then normalizing, the normalized mean value is (123.675,116.28,103.53), the variance is (58.395,57.12,57.375), and converting the normalized mean value into tensors with the sizes of 3×1024×1024. In order to accelerate the processing speed, batch processing is introduced, the number of image tensor dimensions is expanded, and the final dimension is Batch x 3 x 1024, wherein Batch represents the number of images in Batch processing.
S32, an initial chemo-model is used for loading a pruned lightweight network as an initial backbone network; specifically, R after pruning is loaded P The network serves as the initial backbone network.
S33, inputting the image tensor into a student model, and extracting image features from shallow to deep in a trunk network and a neck network in the student model to obtain 5 image features with different granularitiesThe specific operation process is as formula (1):
wherein ,(S1 ,S 2 ,…,S 5 ) Representing 5 phases of the student feature extraction network,representing function nesting, X representing image tensors; />Represents a tensor of size Batch x 256 x 128->Represents a tensor of size Batch by 256 by 64,/->Represents a tensor of size Batch x 256 x 32,/->Represents a tensor of size Batch 16 x 256>Represents a tensor of size Batch x 256 x 8.
S34, multi-granularity student feature F stu Predicting boundary box information through a detection head in a student model to obtain category prediction scoresAnd regression prediction value-> Specific operation formula (2) and formula (3):
wherein ,Scls and Sreg Respectively represent student modelsA class prediction layer and a regression prediction layer in the model;represents a tensor of size Batch 135 x 128->Represents a tensor of size Batch 135 64, ∈64>Represents a tensor of size Batch 135 x 32,/->Represents a tensor of size Batch 135 16, ∈16>Represents a tensor of size Batch 135 x 8,/->Represents a tensor of size Batch 45 x 128->Represents a tensor of size Batch 45 64, ->Represents a tensor of size Batch x 45 x 32,/->Represents a tensor of size Batch 45 16, ∈16>Represents a tensor of size Batch x 45 x 8.
S35, respectively calculating a classification loss function value and a regression loss function value according to the real frame label Y and the regression target delta generated by the student detection head, wherein the specific operation is as shown in the formula (4) and the formula (5):
wherein ,representing the classification Loss function Focal Loss, +.>The regression Loss function L1Loss is shown.
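As a sketch of formulas (4) and (5), assuming torchvision's sigmoid focal loss stands in for ℒ_cls and that label assignment by the detection head has already produced aligned targets:

```python
import torch
import torch.nn.functional as F
from torchvision.ops import sigmoid_focal_loss

def detection_losses(cls_logits, cls_targets, reg_preds, reg_targets):
    """cls_logits/cls_targets: float tensors of equal shape (one-hot targets);
    reg_preds/reg_targets: box regression tensors of equal shape."""
    # formula (4): Focal Loss between class predictions and assigned labels Y
    l_cls = sigmoid_focal_loss(cls_logits, cls_targets, reduction='mean')
    # formula (5): L1 loss between regression predictions and targets Delta
    l_reg = F.l1_loss(reg_preds, reg_targets)
    return l_cls, l_reg
```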
S4, the teacher model guides the student model to decouple knowledge of different types; namely, under the guidance of the real annotation data, a semantic perception mask and a positioning perception mask for knowledge migration are generated according to the output of the teacher model.
As shown in fig. 3, since each element on the feature map perceives different types of knowledge differently, the contribution of each element to each knowledge migration process must be distinguished, and a teacher model with strong representation and knowledge perception capabilities can guide the student model to decouple the different types of knowledge. The application therefore designs a knowledge decoupling module that uses the category prediction scores and regression prediction values of the teacher model to capture the semantic knowledge and localization knowledge on the feature map, thereby guiding the student model to migrate different types of knowledge in a targeted manner.
The step S4 of generating the semantic perception mask and locating the perception mask specifically includes the following steps:
S41, loading the pre-trained teacher model and freezing the gradients of all its parameters so that they do not back-propagate;
S42, extracting image features from shallow to deep in the backbone network ResNet50 and neck network FPN of the teacher model to obtain teacher image features of different granularities $F^{tea} = \{F_k^{tea}\}_{k=1}^{5}$, as in formula (6):

$$F_k^{tea} = (T_k \circ T_{k-1} \circ \cdots \circ T_1)(X), \quad k = 1,\dots,5 \tag{6}$$

where $(T_1, T_2, \dots, T_5)$ represent the five stages of the teacher feature-extraction network; $F_1^{tea}, \dots, F_5^{tea}$ are tensors of size Batch×256×128×128, Batch×256×64×64, Batch×256×32×32, Batch×256×16×16 and Batch×256×8×8.
S43, multi-granularity teacher image feature F tea Predicting boundary box information through a detection head in a teacher model to obtain category prediction scoresAnd regression prediction value-> The specific operation is as formula (7) and formula (8):
wherein ,Tcls and Treg Respectively representing a category prediction layer and a regression prediction layer in the teacher model;represents a tensor of size Batch 135 x 128->Represents a tensor of size Batch 135 64, ∈64>Represents a tensor of size Batch 135 x 32,/->Represents a tensor of size Batch 135 16, ∈16>Indicating that the size is Batch 135 x 8,/i>Represents a tensor of size Batch 45 x 128->Represents a tensor of size Batch 45 64, ->Represents a tensor of size Batch x 45 x 32,/->Represents a tensor of size Batch 45 16, ∈16>Represents a tensor of size Batch x 45 x 8.
S44, utilizing boundary frame information predicted by the teacher model to mine boundaries between semantic knowledge and positioning knowledge, and generating a semantic perception maskLocating perceptual masksTo capture sensitivity of the distilled region to semantic knowledge, positional knowledge.
The step S44 specifically includes the following steps:
S441, for each element of the teacher image features $F_k^{tea}$, calculating the maximum of all category prediction scores as the semantic perception mask $M_k^{se}$, as in formula (9):

$$M_k^{se}(i,j) = \max_{c \in \{c_1,\dots,c_K\}} P_{cls,k}^{tea}(i,j,c) \tag{9}$$

where K is the total number of categories and $(c_1, \dots, c_K)$ represent all target categories;
S442, for the teacher feature map $F_k^{tea}$, decoding the anchor boxes A into the corresponding predicted regression boxes $\hat{B}$ according to the regression prediction values $P_{reg}^{tea}$, then calculating the IoU between the predicted regression boxes and the real boxes GT and using it as the positioning perception mask $M_k^{lo}$. Specifically, the regression feature $P_{reg,i}^{tea}$ of size Batch×45×H_i×W_i is reshaped into a tensor of size Batch×(H_i·W_i·9)×5; the anchor boxes A generated by the detection head are then decoded with the regression predictions into prediction boxes $\hat{B}$ of size Batch×(H_i·W_i·9)×5. Since each image contains a different number of real boxes, Batch iterations are performed to calculate the IoU between the predicted boxes and the real boxes and to take the maximum, yielding a tensor of size H_i×W_i per iteration; these are finally stacked into a tensor of size Batch×H_i×W_i. The specific operation is as in formulas (10) and (11):

$$\hat{B} = \mathrm{decode}(A, P_{reg}^{tea}) \tag{10}$$
$$M_k^{lo}(i,j) = \max_{m=1,\dots,M} \mathrm{IoU}(\hat{B}_m(i,j), GT) \tag{11}$$

where M represents the number of predicted boxes at each position and decode is the box decoding function.
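The two masks of formulas (9) to (11) could be computed per feature level as follows. This is a sketch: decode is passed in as the detector's own box-decoding function, and torchvision's axis-aligned box_iou stands in for the rotated IoU an oriented detector would actually use; all other names are illustrative.

```python
import torch
from torchvision.ops import box_iou  # stand-in; oriented boxes need rotated IoU

def semantic_mask(cls_scores: torch.Tensor) -> torch.Tensor:
    """cls_scores: (K*A, H, W) class logits -> (H, W) semantic mask, formula (9)."""
    return cls_scores.sigmoid().max(dim=0).values

def localization_mask(reg_pred, anchors, gt_boxes, decode, H, W, A=9):
    """reg_pred: (5*A, H, W) regression map -> (H, W) mask, formulas (10)-(11)."""
    deltas = reg_pred.permute(1, 2, 0).reshape(H * W * A, 5)
    pred_boxes = decode(anchors, deltas)             # formula (10)
    # IoU against all GT boxes; axis-aligned stand-in drops the angle parameter
    ious = box_iou(pred_boxes[:, :4], gt_boxes)      # (H*W*A, num_gt)
    best = ious.max(dim=1).values                    # best-matching GT per box
    return best.view(H, W, A).max(dim=2).values      # formula (11): max over anchors
```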
S5, distilling semantic knowledge from the output level under the guidance of the positioning perception mask in the step S4, and calculating a loss function value of output layer distillation.
The basic idea of output-layer knowledge distillation (logit distillation) is to take the output of the teacher model as supervision information and continually optimize, during distillation, the distance between the student output and the soft labels provided by the teacher. The conventional logit distillation process is shown in formula (2'):

$$L_{logit} = \sum_{k=1}^{5} \frac{1}{H_k W_k} \sum_{i=1}^{H_k} \sum_{j=1}^{W_k} \mathcal{L}_{logit}\big(p_k^{tea}(i,j),\, p_k^{stu}(i,j),\, T\big) \tag{2'}$$

where $H_k$, $W_k$ denote the size of the k-layer feature map; $p_k^{tea}(i,j)$ (k ∈ {1,2,3,4,5}) represents the probability vector at position (i, j) of the k-layer teacher class probability map and $p_k^{stu}(i,j)$ the probability vector at position (i, j) of the k-layer student class probability map; $\mathcal{L}_{logit}$ is the logit distillation loss function measuring the similarity between the student predictions and the soft labels; and T is the smoothing factor.
However, this distillation pattern migrates semantic knowledge equally from every distillation region, while in fact not all regions contribute equally to semantic migration. Therefore, to enhance regions with strong semantic sensitivity and suppress regions with weak semantic sensitivity, a semantic mask is introduced that gives each distillation region a different weight: on the basis of the conventional logit distillation model, the semantic mask $M^{se}$ assigns each element of the class probability map its own weight, so that each element contributes its due distillation value.
Specifically, the output-layer distillation loss value is calculated as in formula (12):

$$L_{logit} = \sum_{k=1}^{5} \frac{1}{\sum_{i,j} M_k^{se}(i,j)} \sum_{i=1}^{H_k} \sum_{j=1}^{W_k} M_k^{se}(i,j)\, \mathcal{L}_{logit}\big(p_k^{tea}(i,j),\, p_k^{stu}(i,j),\, T\big) \tag{12}$$

where $M_k^{se}(i,j)$ is the (i, j) element value of the k-layer semantic perception mask, $\sum_{i,j} M_k^{se}(i,j)$ is the sum of all its element values, $\mathcal{L}_{logit}$ is the KL divergence, and T takes the value 10.
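Formula (12) is a per-position KL divergence weighted by the semantic mask; a minimal sketch with T = 10 as stated above (the tensor layouts are assumptions):

```python
import torch
import torch.nn.functional as F

def masked_logit_loss(t_logits, s_logits, se_masks, T: float = 10.0):
    """t_logits/s_logits: lists of (B, C, H, W) class maps, one per FPN level k;
    se_masks: list of (B, H, W) semantic masks. Implements formula (12)."""
    loss = 0.0
    for t, s, m in zip(t_logits, s_logits, se_masks):
        p_t = F.softmax(t / T, dim=1)                 # teacher soft labels
        log_p_s = F.log_softmax(s / T, dim=1)         # student log-probabilities
        # per-position KL divergence, summed over the class dimension
        kl = (p_t * (p_t.clamp_min(1e-8).log() - log_p_s)).sum(dim=1)  # (B, H, W)
        # weight by the semantic mask and normalize by its element sum
        loss = loss + (m * kl).sum() / m.sum().clamp_min(1e-8)
    return loss
```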
S6, under the guidance of the semantic perception mask and the positioning perception mask in the step S4, the semantic features and the positioning features of different layers are subjected to cross distillation, and a loss function value of cross feature distillation is calculated; the characteristic distillation process between teachers and students in different layers is adaptively fused, so that the guidance effect of the characteristics of shallow teachers on the characteristics of deep students is exerted.
The basic idea of feature-layer knowledge distillation, i.e. feature distillation, is to make the student features mimic the teacher features, continually optimizing the gap between them during distillation, as in formula (3'):

$$L_{feat} = \sum_{k=1}^{5} \frac{1}{H_k W_k} \sum_{i=1}^{H_k} \sum_{j=1}^{W_k} \mathcal{L}_{feat}\big(F_k^{tea}(i,j),\, F_k^{stu}(i,j)\big) \tag{3'}$$

where $F_k^{tea}(i,j)$ (k ∈ {1,2,3,4,5}) represents the feature vector at position (i, j) of the k-layer teacher feature map, $F_k^{stu}(i,j)$ the feature vector at position (i, j) of the k-layer student feature map, and $\mathcal{L}_{feat}$ the feature distillation loss measuring the degree of similarity between the teacher and student features.
Traditional feature distillation requires the student to mimic teacher features of the same granularity, which isolates migration between features of different granularities; in fact, the shallow features of the teacher model also have guiding value for the deep features of the student model. Furthermore, feature distillation is essentially knowledge migration between corresponding elements of the student and teacher feature maps, and treating the role of every element as equal confuses the contributions of different distillation regions. Moreover, since remote sensing target detection combines a classification task and a regression task, each distillation region contributes differently to semantic knowledge migration and localization knowledge migration.
The application therefore designs a multi-layer feature interaction module that iteratively fuses student features of different granularities into fusion features, enriching the information available at the feature distillation stage and realizing cross-feature distillation; and it introduces the semantic mask and positioning mask generated from the teacher's predictions to distinguish each distillation region's sensitivity to semantic knowledge and localization knowledge, making semantic feature distillation and localization feature distillation targeted.
Specifically, the cross-feature distillation process comprises the steps of:
S61, sorting the student features $F^{stu}$ of different granularities in descending order of receptive field to obtain $\tilde{F}^{stu} = (\tilde{F}_1^{stu}, \dots, \tilde{F}_5^{stu})$, and fusing and updating them iteratively in sequence.

First, the fusion feature is initialized by applying a feature transformation, $H_1 = \phi(\tilde{F}_1^{stu})$; secondly, feature fusion is performed: at iteration t, because the size of the predecessor fusion feature $H_{t-1}$ is inconsistent with the current feature $\tilde{F}_t^{stu}$, the two features are aligned by interpolation and the fusion feature of iteration t is obtained by weighted summation. After 5−1 = 4 iterations, the feature order is reversed to finally obtain the multi-layer fusion features $H^{stu} = \{H_k^{stu}\}_{k=1}^{5}$, as in formulas (13) and (14):

$$H_1 = \phi(\tilde{F}_1^{stu}) \tag{13}$$
$$H_t = w_t^{1} \cdot \mathcal{I}(H_{t-1}) + w_t^{2} \cdot \phi(\tilde{F}_t^{stu}), \quad t = 2,\dots,5 \tag{14}$$

where $\phi$ is the feature transformation layer, $w_t^{1}$ and $w_t^{2}$ are the fusion weights at iteration t, and $\mathcal{I}$ is a bilinear interpolation function; $H_{t-1}$ has size Batch×256×H_{t−1}×W_{t−1}, $\tilde{F}_t^{stu}$ has size Batch×256×H_t×W_t, and $H_t$ has size Batch×256×H_t×W_t.
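Formulas (13) and (14) could be realized as below, assuming φ is a 1×1 convolution and the fusion weights are learnable scalars; the patent fixes neither choice, so both are assumptions of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureFusion(nn.Module):
    """Iteratively fuse student features sorted by descending receptive field."""
    def __init__(self, channels: int = 256, n_levels: int = 5):
        super().__init__()
        self.phi = nn.ModuleList(nn.Conv2d(channels, channels, 1)
                                 for _ in range(n_levels))      # feature transform
        self.w = nn.Parameter(torch.full((n_levels, 2), 0.5))   # fusion weights

    def forward(self, feats_desc):
        # feats_desc: deepest (smallest) feature first
        h = self.phi[0](feats_desc[0])                           # formula (13)
        fused = [h]
        for t, f in enumerate(feats_desc[1:], start=1):
            up = F.interpolate(h, size=f.shape[-2:], mode='bilinear',
                               align_corners=False)              # align sizes
            h = self.w[t, 0] * up + self.w[t, 1] * self.phi[t](f)  # formula (14)
            fused.append(h)
        return fused[::-1]   # reverse back to shallow-to-deep order
```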
S62, introducing a semantic perception mask based on a traditional distillation model se And a location awareness mask lo Feature distillation is respectively carried out on semantic knowledge and positioning knowledge on the feature map, and in addition, student features used in the distillation process are fusion features H stu The loss function is specifically shown in formula (15), formula (16) and formula (17):
wherein ,values of (i, j) elements are masked for k-layer semantic perception, < >>Values of the element of the perceptual mask (i, j) are located for the k-layer, < >>Sum of element values of k-layer semantic perception mask, < +.>Sum of element values of the localization perceptual mask for k layers,/->Is L1 Low, W se and Wlo The coefficients corresponding to the semantic feature distillation and the locating feature distillation loss are respectively represented, and are 1.
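The masked feature distillation of formulas (15) to (17) then weights a per-position L1 gap by each mask; a sketch with W_se = W_lo = 1 as stated above:

```python
import torch

def masked_feature_loss(t_feats, h_feats, se_masks, lo_masks,
                        w_se: float = 1.0, w_lo: float = 1.0):
    """t_feats: teacher features per level (B, C, H, W); h_feats: fused student
    features; se_masks/lo_masks: (B, H, W). Implements formulas (15)-(17)."""
    l_se, l_lo = 0.0, 0.0
    for t, h, m_se, m_lo in zip(t_feats, h_feats, se_masks, lo_masks):
        gap = (t - h).abs().sum(dim=1)                 # per-position L1 distance
        l_se = l_se + (m_se * gap).sum() / m_se.sum().clamp_min(1e-8)  # (15)
        l_lo = l_lo + (m_lo * gap).sum() / m_lo.sum().clamp_min(1e-8)  # (16)
    return w_se * l_se + w_lo * l_lo                   # formula (17)
```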
And S7, calculating a total loss function value and optimizing the student model. The total loss function value is calculated as in equation (18):
$$L_{total} = \alpha L_{cls} + \beta L_{reg} + \gamma L_{logit} + \lambda L_{feat} \tag{18}$$
where α, β, γ and λ denote the coefficients of the corresponding loss terms; here their values are 1 and 0.01, respectively.
Specifically, student model optimization includes the steps of:
S71, calculating the gradient of the total loss function with respect to the student model parameters by back-propagation;
S72, updating the student model parameters along the gradient direction.
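Steps S71 and S72 together with formula (18) reduce to the standard backward/step pair; a sketch reusing the loss terms from the earlier snippets (the coefficient names and loss variables are placeholders assumed from those sketches):

```python
# S7: total loss (formula 18) and a single optimization step (S71-S72)
loss_total = (alpha * l_cls + beta * l_reg
              + gamma * l_logit + lam * l_feat)
optimizer.zero_grad()
loss_total.backward()   # S71: back-propagate gradients to student parameters
optimizer.step()        # S72: update student parameters along the gradient
```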
The knowledge distillation method provided by the application mainly solves the problem that remote sensing detection models with large parameter counts and complex structures are difficult to deploy to satellites and other edge devices; it not only lightens the remote sensing detector but also improves its performance.
The knowledge distillation method of mixed knowledge decoupling for remote sensing target detection provided by the application has been described in detail above; specific examples are used herein to illustrate the principle and implementation of the application, and the above examples are only intended to help in understanding the method and its core idea. Likewise, those skilled in the art will appreciate that the application may be practiced with variations in specific details and application range, which are not to be construed as limitations on the application.
The previous description of the disclosure is provided to enable any person skilled in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the spirit or scope of the disclosure. Thus, the disclosure is not intended to be limited to the examples and designs described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application.
Claims (10)
1. A knowledge distillation algorithm of mixed knowledge decoupling for remote sensing target detection, characterized by comprising the following steps:
S1, constructing a remote sensing target detection model as the teacher model for knowledge distillation;
S2, performing lightweight processing on the model of step S1 to form the student model for knowledge distillation;
S3, the student model of step S2 predicting bounding-box information and calculating the target detection loss;
S4, the teacher model guiding the student model to decouple different types of knowledge: under the guidance of the real annotation data, generating a semantic perception mask and a positioning perception mask for knowledge migration according to the output of the teacher model;
S5, distilling semantic knowledge at the output level under the guidance of the semantic perception mask from step S4, and calculating the loss value of output-layer distillation;
S6, under the guidance of the semantic perception mask and positioning perception mask from step S4, cross-distilling semantic features and localization features of different layers and calculating the loss value of cross-feature distillation; the feature distillation processes between teacher and student at different layers are adaptively fused, so that shallow teacher features can guide deep student features;
and S7, calculating a total loss function value and optimizing the student model.
2. The knowledge distillation algorithm for hybrid knowledge decoupling for remote sensing target detection of claim 1, wherein: in the step S2, a pruning technology is adopted to carry out light weight treatment on the remote sensing target detection model; and judging the importance of the channel by using the weight of the BN layer in the student model, and pruning the channel for all modules in the backbone network of the student model according to the importance of the channel.
3. The knowledge distillation algorithm for hybrid knowledge decoupling for remote sensing target detection according to claim 2, wherein the step of lightweight processing specifically comprises:
S21, constructing the backbone network corresponding to the student model and loading the model parameters;
S22, traversing all BN layers in the student backbone network and recording the corresponding weights and channel counts; according to a manually set pruning rate θ, setting the post-pruning channel count to θ times the original count and recording it;
S23, traversing all BN layers and sorting their weights in descending order; screening the important weights to be retained according to the post-pruning channel counts and generating the pruning mask corresponding to each BN layer;
S24, traversing all modules in the student backbone network, screening the weights of the convolution, linear and BN layers along the appropriate dimension according to the indication of the pruning masks, and discarding the unwanted weights;
S25, constructing a lightweight backbone network according to the pruned channel counts; saving the pruned weights to the network and generating a student model file for initializing the student model.
4. The knowledge distillation algorithm for hybrid knowledge decoupling for remote sensing target detection according to claim 1, wherein step S3 specifically comprises the steps of:
S31, preprocessing the input image, including unifying the image size and normalization, and finally converting it into tensor form;
S32, initializing the student model by loading the pruned lightweight network as the initial backbone network;
S33, inputting the image tensor into the student model, whose backbone and neck networks extract image features from shallow to deep, obtaining image features of different granularities $F^{stu} = \{F_k^{stu}\}_{k=1}^{n}$, the specific operation being as in formula (1):

$$F_k^{stu} = (S_k \circ S_{k-1} \circ \cdots \circ S_1)(X), \quad k = 1,\dots,n \tag{1}$$

where $(S_1, S_2, \dots, S_n)$ are the n stages of the student feature-extraction network, n is the number of features of different granularity, $\circ$ denotes function composition, and X is the image tensor;
S34, the multi-granularity student features $F^{stu}$ predict bounding-box information through the detection head of the student model, giving category prediction scores $P_{cls}^{stu}$ and regression prediction values $P_{reg}^{stu}$, as in formulas (2) and (3):

$$P_{cls,k}^{stu} = S_{cls}(F_k^{stu}) \tag{2}$$
$$P_{reg,k}^{stu} = S_{reg}(F_k^{stu}) \tag{3}$$

where $S_{cls}$ and $S_{reg}$ denote the category prediction layer and regression prediction layer of the student model, respectively;
S35, according to the real box labels Y and the regression targets Δ generated by the student detection head, calculating the classification loss value and regression loss value respectively, as in formulas (4) and (5):

$$L_{cls} = \mathcal{L}_{cls}(P_{cls}^{stu}, Y) \tag{4}$$
$$L_{reg} = \mathcal{L}_{reg}(P_{reg}^{stu}, \Delta) \tag{5}$$

where $\mathcal{L}_{cls}$ denotes the classification loss function and $\mathcal{L}_{reg}$ the regression loss function.
5. The knowledge distillation algorithm for hybrid knowledge decoupling for remote sensing target detection according to claim 1, wherein the step S4 of generating a semantic perception mask and a location perception mask specifically comprises the steps of:
S41, loading the pre-trained teacher model and freezing the gradients of all its parameters so that they do not back-propagate;
S42, extracting image features from shallow to deep in the backbone and neck networks of the teacher model to obtain teacher image features of different granularities $F^{tea} = \{F_k^{tea}\}_{k=1}^{n}$, as in formula (6):

$$F_k^{tea} = (T_k \circ T_{k-1} \circ \cdots \circ T_1)(X), \quad k = 1,\dots,n \tag{6}$$

where $(T_1, T_2, \dots, T_n)$ are the n stages of the teacher feature-extraction network;
S43, the multi-granularity teacher image features $F^{tea}$ predict bounding-box information through the detection head of the teacher model, giving category prediction scores $P_{cls}^{tea}$ and regression prediction values $P_{reg}^{tea}$, as in formulas (7) and (8):

$$P_{cls,k}^{tea} = T_{cls}(F_k^{tea}) \tag{7}$$
$$P_{reg,k}^{tea} = T_{reg}(F_k^{tea}) \tag{8}$$

where $T_{cls}$ and $T_{reg}$ denote the category prediction layer and regression prediction layer of the teacher model, respectively;
S44, using the bounding-box information predicted by the teacher model to mine the boundary between semantic knowledge and localization knowledge, generating a semantic perception mask $M^{se}$ and a positioning perception mask $M^{lo}$ to capture the sensitivity of each distillation region to semantic knowledge and localization knowledge.
6. The knowledge distillation algorithm for hybrid knowledge decoupling for remote sensing target detection as claimed in claim 5, wherein said step S44 specifically comprises the steps of:
S441, for each element of the teacher image features $F_k^{tea}$, calculating the maximum of all category prediction scores as the semantic perception mask $M_k^{se}$, as in formula (9):

$$M_k^{se}(i,j) = \max_{c \in \{c_1,\dots,c_K\}} P_{cls,k}^{tea}(i,j,c) \tag{9}$$

where K is the total number of categories and $(c_1, \dots, c_K)$ represent all target categories;
S442, for the teacher feature map $F_k^{tea}$, encoding the anchor boxes A into the corresponding predicted regression boxes $\hat{B}$ according to the regression prediction values $P_{reg}^{tea}$, then calculating the IoU between the predicted regression boxes and the real boxes GT and using it as the positioning perception mask $M_k^{lo}$, as in formulas (10) and (11):

$$\hat{B} = \mathrm{decode}(A, P_{reg}^{tea}) \tag{10}$$
$$M_k^{lo}(i,j) = \max_{m=1,\dots,M} \mathrm{IoU}(\hat{B}_m(i,j), GT) \tag{11}$$

where M represents the number of predicted boxes at each position and decode is the box decoding function.
7. The knowledge distillation algorithm for hybrid knowledge decoupling for remote sensing target detection as claimed in claim 1, wherein, under the guidance of the semantic perception mask $M^{se}$, the output-layer distillation loss value in step S5 is calculated as in formula (12):

$$L_{logit} = \sum_{k=1}^{n} \frac{1}{\sum_{i,j} M_k^{se}(i,j)} \sum_{i=1}^{H_k} \sum_{j=1}^{W_k} M_k^{se}(i,j)\, \mathcal{L}_{logit}\big(p_k^{tea}(i,j),\, p_k^{stu}(i,j),\, T\big) \tag{12}$$

where $H_k$, $W_k$ denote the size of the k-layer feature map, $p_k^{tea}(i,j)$ is the probability vector at position (i, j) of the k-layer teacher feature map, $p_k^{stu}(i,j)$ is the probability vector at position (i, j) of the k-layer student feature map, $\mathcal{L}_{logit}$ is the logit distillation loss function measuring the similarity between the student predictions and the soft labels, T is the smoothing factor, $M_k^{se}(i,j)$ is the (i, j) element value of the k-layer semantic perception mask, and $\sum_{i,j} M_k^{se}(i,j)$ is the sum of all its element values.
8. The knowledge distillation algorithm for hybrid knowledge decoupling for remote sensing target detection according to claim 1, wherein the cross-feature distillation process in step S6 comprises the steps of:
S61, sorting the student features $F^{stu}$ of different granularities in descending order of receptive field to obtain $\tilde{F}^{stu} = (\tilde{F}_1^{stu}, \dots, \tilde{F}_n^{stu})$, and fusing and updating them iteratively in sequence;

specifically, first initializing the fusion feature by applying a feature transformation, $H_1 = \phi(\tilde{F}_1^{stu})$; secondly, performing feature fusion: at iteration t, because the size of the predecessor fusion feature $H_{t-1}$ is inconsistent with the current feature $\tilde{F}_t^{stu}$, the two features are aligned by interpolation and the fusion feature of iteration t is obtained by weighted summation; after n−1 iterations, the feature order is reversed to finally obtain the multi-layer fusion features $H^{stu} = \{H_k^{stu}\}_{k=1}^{n}$, as in formulas (13) and (14):

$$H_1 = \phi(\tilde{F}_1^{stu}) \tag{13}$$
$$H_t = w_t^{1} \cdot \mathcal{I}(H_{t-1}) + w_t^{2} \cdot \phi(\tilde{F}_t^{stu}), \quad t = 2,\dots,n \tag{14}$$

where $\phi$ is the feature transformation layer, $w_t^{1}$ and $w_t^{2}$ are the fusion weights at iteration t, and $\mathcal{I}$ is an interpolation function;
S62, introducing the semantic perception mask $M^{se}$ and the positioning perception mask $M^{lo}$ to perform feature distillation separately for semantic knowledge and localization knowledge on the feature map; in addition, the student features used in the distillation process are the fusion features $H^{stu}$. The loss functions are given by formulas (15), (16) and (17):

$$L_{feat}^{se} = W_{se} \sum_{k=1}^{n} \frac{1}{\sum_{i,j} M_k^{se}(i,j)} \sum_{i=1}^{H_k} \sum_{j=1}^{W_k} M_k^{se}(i,j)\, \mathcal{L}_{feat}\big(F_k^{tea}(i,j),\, H_k^{stu}(i,j)\big) \tag{15}$$

$$L_{feat}^{lo} = W_{lo} \sum_{k=1}^{n} \frac{1}{\sum_{i,j} M_k^{lo}(i,j)} \sum_{i=1}^{H_k} \sum_{j=1}^{W_k} M_k^{lo}(i,j)\, \mathcal{L}_{feat}\big(F_k^{tea}(i,j),\, H_k^{stu}(i,j)\big) \tag{16}$$

$$L_{feat} = L_{feat}^{se} + L_{feat}^{lo} \tag{17}$$

where $M_k^{se}(i,j)$ and $M_k^{lo}(i,j)$ are the (i, j) element values of the k-layer semantic and positioning perception masks, $\sum_{i,j} M_k^{se}(i,j)$ and $\sum_{i,j} M_k^{lo}(i,j)$ are the sums of their element values, $\mathcal{L}_{feat}$ is a conventional feature distillation loss function, and $W_{se}$ and $W_{lo}$ represent the coefficients of the semantic feature distillation loss and the localization feature distillation loss, respectively.
9. The knowledge distillation algorithm for hybrid knowledge decoupling for remote sensing target detection of claim 1, wherein: in step S7, the total loss function value includes classification loss, regression loss, cross feature distillation loss, and output layer distillation loss of the remote sensing target detection task.
10. The knowledge distillation algorithm for mixed knowledge decoupling for remote sensing target detection as claimed in claim 1, wherein the student model optimization in step S7 comprises the steps of:
S71, calculating the gradient of the total loss function with respect to the student model parameters by back-propagation;
S72, updating the student model parameters along the gradient direction;
the total loss function value is calculated as in equation (18):
$$L_{total} = \alpha L_{cls} + \beta L_{reg} + \gamma L_{logit} + \lambda L_{feat} \tag{18}$$
where α, β, γ and λ denote the coefficients of the corresponding loss terms.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310521321.5A CN116665068A (en) | 2023-05-10 | 2023-05-10 | Mixed knowledge decoupling knowledge distillation algorithm for remote sensing target detection |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310521321.5A CN116665068A (en) | 2023-05-10 | 2023-05-10 | Mixed knowledge decoupling knowledge distillation algorithm for remote sensing target detection |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116665068A true CN116665068A (en) | 2023-08-29 |
Family
ID=87712726
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310521321.5A Pending CN116665068A (en) | 2023-05-10 | 2023-05-10 | Mixed knowledge decoupling knowledge distillation algorithm for remote sensing target detection |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116665068A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117521848A (en) * | 2023-11-10 | 2024-02-06 | 中国科学院空天信息创新研究院 | Remote sensing basic model light-weight method and device for resource-constrained scene |
CN117521848B (en) * | 2023-11-10 | 2024-05-28 | 中国科学院空天信息创新研究院 | Remote sensing basic model light-weight method and device for resource-constrained scene |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20230215166A1 (en) | Few-shot urban remote sensing image information extraction method based on meta learning and attention | |
CN113792113A (en) | Visual language model obtaining and task processing method, device, equipment and medium | |
CN109902293A (en) | A kind of file classification method based on part with global mutually attention mechanism | |
Guo et al. | An ensemble learning framework for convolutional neural network based on multiple classifiers | |
CN110717553A (en) | Traffic contraband identification method based on self-attenuation weight and multiple local constraints | |
CN107636691A (en) | Method and apparatus for identifying the text in image | |
CN114912612A (en) | Bird identification method and device, computer equipment and storage medium | |
CN114332578A (en) | Image anomaly detection model training method, image anomaly detection method and device | |
CN113591988B (en) | Knowledge cognitive structure analysis method, system, computer equipment, medium and terminal | |
CN113065520B (en) | Multi-mode data-oriented remote sensing image classification method | |
CN113469186A (en) | Cross-domain migration image segmentation method based on small amount of point labels | |
CN116665068A (en) | Mixed knowledge decoupling knowledge distillation algorithm for remote sensing target detection | |
US20210056353A1 (en) | Joint representation learning from images and text | |
Qiu et al. | Multitask learning for human settlement extent regression and local climate zone classification | |
CN112396091A (en) | Social media image popularity prediction method, system, storage medium and application | |
CN115858388A (en) | Test case priority ordering method and device based on variation model mapping chart | |
CN115130591A (en) | Cross supervision-based multi-mode data classification method and device | |
CN117217368A (en) | Training method, device, equipment, medium and program product of prediction model | |
CN112420125A (en) | Molecular attribute prediction method and device, intelligent equipment and terminal | |
CN105787045A (en) | Precision enhancing method for visual media semantic indexing | |
CN112861601A (en) | Method for generating confrontation sample and related equipment | |
CN116737876A (en) | Education device for assisting scientific popularization and application service | |
CN109559345B (en) | Garment key point positioning system and training and positioning method thereof | |
CN115019180B (en) | SAR image ship target detection method, electronic device and storage medium | |
CN116467930A (en) | Transformer-based structured data general modeling method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |