CN116665068A - Mixed knowledge decoupling knowledge distillation algorithm for remote sensing target detection - Google Patents


Info

Publication number
CN116665068A
CN116665068A
Authority
CN
China
Prior art keywords
knowledge
distillation
model
student
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310521321.5A
Other languages
Chinese (zh)
Inventor
钱付兰
洪嘉成
张崇浩
陈海
赵姝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui University
Original Assignee
Anhui University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui University filed Critical Anhui University
Priority to CN202310521321.5A priority Critical patent/CN116665068A/en
Publication of CN116665068A publication Critical patent/CN116665068A/en
Pending legal-status Critical Current

Classifications

    • G06V20/13: Satellite images (G06V20/10 Terrestrial scenes; G06V20/00 Scenes; scene-specific elements)
    • G06N3/0464: Convolutional networks [CNN, ConvNet] (G06N3/04 Architecture; G06N3/02 Neural networks)
    • G06N3/082: Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections (G06N3/08 Learning methods)
    • G06V10/764: Image or video recognition using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V10/766: Image or video recognition using pattern recognition or machine learning, using regression, e.g. by projecting features on hyperplanes
    • G06V10/778: Active pattern-learning, e.g. online learning of image or video features
    • G06V10/806: Fusion, i.e. combining data from various sources at the sensor, preprocessing, feature extraction or classification level, of extracted features
    • G06V10/82: Image or video recognition using pattern recognition or machine learning, using neural networks
    • G06V20/70: Labelling scene content, e.g. deriving syntactic or semantic representations
    • G06V2201/07: Target detection (indexing scheme relating to image or video recognition or understanding)


Abstract

The application discloses a mixed-knowledge-decoupling knowledge distillation algorithm for remote sensing target detection, the method comprising the following steps: S1, constructing a remote sensing target detection model as the teacher model for knowledge distillation; S2, applying lightweight processing to the model to form the student model for knowledge distillation; S3, predicting bounding box information and calculating the target detection loss; S4, the teacher model guides the student model to decouple different types of knowledge; S5, distilling semantic knowledge at the output level and calculating the output-layer distillation loss; S6, cross-distilling semantic features and localization features of different layers and calculating the cross-feature distillation loss; and S7, calculating the total loss and optimizing the student model. The application addresses the difficulty of deploying remote sensing detection models with large parameter counts and complex structures on satellites and other edge devices. The method not only makes the remote sensing detector lightweight, but also improves its performance.

Description

Mixed knowledge decoupling knowledge distillation algorithm for remote sensing target detection
Technical Field
The application relates to the field of remote sensing image detection, in particular to a knowledge distillation algorithm for mixed knowledge decoupling for remote sensing target detection.
Background
Target detection in remote sensing images is one of the basic tasks of satellite image processing; its aim is to extract the category and position information of targets from remote sensing images.
In recent years, with the rapid development of deep learning, research on remote sensing target detection has made significant breakthroughs, and detection accuracy has improved greatly. However, high-accuracy detection algorithms rely on complex network models, and their deployment on satellites or other edge devices is hampered by enormous computational complexity and memory requirements. Research on lightweight remote sensing target detection algorithms therefore has great practical significance, and lightweighting deep network models is an effective means of improving their deployment feasibility.
Model lightweighting techniques can reduce the number of model parameters and the amount of computation, but they also affect model accuracy, so a balance must be struck between detection performance and inference speed. Knowledge distillation, an emerging model lightweighting method, uses a teacher-student training structure: the features learned by a complex network with strong learning ability are distilled and transferred to a network with few parameters and weaker learning ability, yielding a lightweight network that is both fast and capable.
At present, knowledge distillation has been widely studied in fields such as image classification and target detection, but knowledge distillation for remote sensing target detection has not been fully explored. Existing knowledge distillation methods can be divided according to the distillation stage into knowledge distillation of intermediate features (feature distillation) and knowledge distillation of predicted outputs (logit distillation). Both modes migrate knowledge equally for every distillation region, yet not all regions actually play the same role in knowledge migration. In addition, deep feature distillation usually requires the student to mimic teacher features of the same level, which separates the feature migration processes of different levels and ignores the guiding value of the teacher's shallow features for the student's deep features.
Disclosure of Invention
In order to solve the above problems, the application provides a knowledge distillation algorithm for mixed knowledge decoupling for remote sensing target detection; the specific scheme is as follows:
A knowledge distillation algorithm for mixed knowledge decoupling for remote sensing target detection comprises the following steps:
S1, constructing a remote sensing target detection model as the teacher model for knowledge distillation;
S2, applying lightweight processing to the model from step S1 to form the student model for knowledge distillation;
S3, the student model from step S2 predicts bounding box information and the target detection loss is calculated;
S4, the teacher model guides the student model to decouple different types of knowledge; under the guidance of the ground-truth annotations, a semantic-aware mask and a localization-aware mask for knowledge migration are generated from the output of the teacher model;
S5, under the guidance of the semantic-aware mask from step S4, semantic knowledge is distilled at the output level and the output-layer distillation loss is calculated;
S6, under the guidance of the semantic-aware mask and the localization-aware mask from step S4, semantic features and localization features of different layers are cross-distilled and the cross-feature distillation loss is calculated; the feature distillation processes between teacher and student layers of different depths are adaptively fused, so that shallow teacher features can guide deep student features;
and S7, calculating the total loss function value and optimizing the student model.
Preferably, in step S2, a pruning technique is adopted to lightweight the remote sensing target detection model; channel importance is judged from the weights of the BN layers in the student model, and channel pruning is applied to all modules in the student backbone network according to channel importance.
Preferably, the lightweight processing specifically comprises:
S21, constructing the backbone network corresponding to the student model and loading the model parameters;
S22, traversing all BN layers in the student backbone network and recording their weights and channel counts; according to the manually set pruning rate θ, the channel count after pruning becomes θ times the original and is recorded;
S23, traversing all BN layers and sorting their weights in descending order; the important weights to be retained are selected according to the post-pruning channel count, generating a pruning mask for each BN layer;
S24, traversing all modules in the student backbone network, selecting the weights of convolution layers, linear layers and BN layers along the relevant dimension according to the pruning masks, and discarding the unneeded weights;
S25, constructing a lightweight backbone network according to the post-pruning channel counts, saving the pruned weights into the network, and generating a student model file for initializing the student model.
Preferably, step S3 specifically comprises the following steps:
S31, preprocessing the input image, including unifying the image size and normalizing, and finally converting it into tensor form;
S32, initializing the student model by loading the pruned lightweight network as the initial backbone network;
S33, inputting the image tensor into the student model; the backbone and neck networks of the student model extract image features from shallow to deep, yielding student features of different granularities $F^{stu} = \{F^{stu}_1, F^{stu}_2, \dots, F^{stu}_n\}$, as in formula (1):
$F^{stu} = (S_n \circ S_{n-1} \circ \cdots \circ S_1)(X)$  (1)
wherein $(S_1, S_2, \dots, S_n)$ denotes the n stages of the student feature-extraction network, n is the number of feature granularities, $\circ$ denotes function composition, and X is the image tensor;
S34, the multi-granularity student features $F^{stu}$ are passed through the detection head of the student model to predict bounding box information, yielding class prediction scores $Y^{stu}_{cls}$ and regression predictions $Y^{stu}_{reg}$, as in formulas (2) and (3):
$Y^{stu}_{cls} = S_{cls}(F^{stu})$  (2)
$Y^{stu}_{reg} = S_{reg}(F^{stu})$  (3)
wherein $S_{cls}$ and $S_{reg}$ respectively denote the class prediction layer and the regression prediction layer of the student model;
S35, the classification loss and the regression loss are calculated from the ground-truth labels Y and the regression targets Δ generated by the student detection head, as in formulas (4) and (5):
$L_{cls} = \mathcal{L}_{cls}(Y^{stu}_{cls}, Y)$  (4)
$L_{reg} = \mathcal{L}_{reg}(Y^{stu}_{reg}, \Delta)$  (5)
wherein $\mathcal{L}_{cls}$ denotes the classification loss function and $\mathcal{L}_{reg}$ the regression loss function.
Preferably, generating the semantic-aware mask and the localization-aware mask in step S4 specifically comprises the following steps:
S41, loading the pre-trained teacher model and freezing all of its parameters so that no gradients are backpropagated;
S42, the backbone and neck networks of the teacher model extract image features from shallow to deep, yielding teacher features of different granularities $F^{tea} = \{F^{tea}_1, \dots, F^{tea}_n\}$, as in formula (6):
$F^{tea} = (T_n \circ T_{n-1} \circ \cdots \circ T_1)(X)$  (6)
wherein $(T_1, T_2, \dots, T_n)$ denotes the n stages of the teacher feature-extraction network;
S43, the multi-granularity teacher features $F^{tea}$ are passed through the detection head of the teacher model to predict bounding box information, yielding class prediction scores $Y^{tea}_{cls}$ and regression predictions $Y^{tea}_{reg}$, as in formulas (7) and (8):
$Y^{tea}_{cls} = T_{cls}(F^{tea})$  (7)
$Y^{tea}_{reg} = T_{reg}(F^{tea})$  (8)
wherein $T_{cls}$ and $T_{reg}$ respectively denote the class prediction layer and the regression prediction layer of the teacher model;
S44, the bounding box information predicted by the teacher model is used to mine the boundary between semantic knowledge and localization knowledge, generating a semantic-aware mask $M_{se}$ and a localization-aware mask $M_{lo}$ that capture the sensitivity of each distillation region to semantic and localization knowledge.
Preferably, step S44 specifically comprises the following steps:
S441, for each element of the teacher class score map $Y^{tea}_{cls}$, the maximum over all class prediction scores is taken as the semantic-aware mask $M_{se}$, as in formula (9):
$M^{k}_{se}(i,j) = \max_{c \in \{c_1, \dots, c_K\}} Y^{tea}_{cls,k}(i,j,c)$  (9)
wherein K is the total number of classes and $(c_1, \dots, c_i, \dots, c_K)$ denotes all target categories;
S442, for the teacher feature map $F^{tea}$, the anchor boxes A are encoded into the corresponding predicted regression boxes $B^{tea}$ according to the regression predictions $Y^{tea}_{reg}$; the IoU between the predicted regression boxes and the ground-truth boxes GT is then calculated and used as the localization-aware mask $M_{lo}$, as in formulas (10) and (11):
$B^{tea} = \mathrm{decode}(A, Y^{tea}_{reg})$  (10)
$M^{k}_{lo}(i,j) = \max_{m=1,\dots,M} \mathrm{IoU}(B^{tea}_m, GT)\big|_{(i,j)}$  (11)
wherein M denotes the number of predicted boxes and decode is the bounding-box decoding function.
Preferably, under the guidance of the semantic-aware mask $M_{se}$, the output-layer distillation loss in step S5 is calculated as in formula (12):
$L_{logit} = \sum_{k} \frac{1}{\sum_{i,j} M^{k}_{se}(i,j)} \sum_{i=1}^{H_k} \sum_{j=1}^{W_k} M^{k}_{se}(i,j)\, \mathcal{L}_{logit}\big(p^{tea}_{k,i,j}, p^{stu}_{k,i,j}, T\big)$  (12)
wherein $H_k$ and $W_k$ denote the size of the k-layer feature map, $p^{tea}_{k,i,j}$ the probability vector at position (i,j) of the k-layer teacher feature map, $p^{stu}_{k,i,j}$ that of the k-layer student feature map, $\mathcal{L}_{logit}$ the logit distillation loss measuring the similarity between student predictions and soft labels, T the smoothing factor, $M^{k}_{se}(i,j)$ the (i,j) element of the k-layer semantic-aware mask, and $\sum_{i,j} M^{k}_{se}(i,j)$ the sum of all its element values.
Preferably, the cross-feature distillation process in step S6 comprises the following steps:
S61, the student features $F^{stu}$ of different granularities are sorted in descending order of receptive field size to obtain $\hat{F}^{stu} = \{\hat{F}^{stu}_1, \dots, \hat{F}^{stu}_n\}$, which are then fused and updated iteratively;
specifically, the fused feature is first initialized by a feature transformation of $\hat{F}^{stu}_1$, giving $H_1$; feature fusion is then performed. At iteration t, the preceding fused feature $H_{t-1}$ and the current feature $\hat{F}^{stu}_t$ differ in size, so the two are aligned by interpolation, and the fused feature $H_t$ of iteration t is obtained by weighted summation. After n-1 iterations, the feature order is reversed, finally yielding the multi-layer fused features $H^{stu} = \{H^{stu}_1, \dots, H^{stu}_n\}$, as in formulas (13) and (14):
$H_1 = \phi(\hat{F}^{stu}_1)$  (13)
$H_t = w^{1}_{t}\, \phi(\hat{F}^{stu}_t) + w^{2}_{t}\, \mathrm{Interp}(H_{t-1}), \quad t = 2, \dots, n$  (14)
wherein φ is a feature-transformation layer, $w^{1}_{t}$ and $w^{2}_{t}$ are the fusion weights at iteration t, and Interp is an interpolation function;
S62, the semantic-aware mask $M_{se}$ and the localization-aware mask $M_{lo}$ are introduced to perform feature distillation separately for semantic knowledge and localization knowledge on the feature map; in addition, the student features used in the distillation process are the fused features $H^{stu}$. The loss functions are given in formulas (15), (16) and (17):
$L_{se} = W_{se} \sum_{k} \frac{1}{\sum_{i,j} M^{k}_{se}(i,j)} \sum_{i,j} M^{k}_{se}(i,j)\, \mathcal{L}_{feat}\big(F^{tea}_k(i,j), H^{stu}_k(i,j)\big)$  (15)
$L_{lo} = W_{lo} \sum_{k} \frac{1}{\sum_{i,j} M^{k}_{lo}(i,j)} \sum_{i,j} M^{k}_{lo}(i,j)\, \mathcal{L}_{feat}\big(F^{tea}_k(i,j), H^{stu}_k(i,j)\big)$  (16)
$L_{feat} = L_{se} + L_{lo}$  (17)
wherein $M^{k}_{se}(i,j)$ and $M^{k}_{lo}(i,j)$ are the (i,j) elements of the k-layer semantic-aware and localization-aware masks, $\sum_{i,j} M^{k}_{se}(i,j)$ and $\sum_{i,j} M^{k}_{lo}(i,j)$ the sums of their element values, $\mathcal{L}_{feat}$ a conventional feature distillation loss function, and $W_{se}$ and $W_{lo}$ the coefficients of the semantic and localization feature distillation losses.
Preferably, in step S7, the total loss function value includes classification loss, regression loss, cross-feature distillation loss, and output layer distillation loss of the remote sensing target detection task.
Preferably, the student model optimization in step S7 includes the steps of:
s71, calculating the gradient of the total loss function on the student model parameters by using a back propagation mechanism;
s72, updating student model parameters along the gradient direction;
the total loss function value is calculated as in equation (18):
$L_{total} = \alpha L_{cls} + \beta L_{reg} + \gamma L_{logit} + \lambda L_{feat}$  (18)
wherein, alpha, beta, gamma and lambda respectively represent the corresponding coefficients of the loss of each part.
The application has the beneficial effects that:
the application solves the problem that the remote sensing detection model with large parameter and complex parameter is difficult to be deployed to the satellite and other edge equipment. The detection precision of the remote sensing detector is ensured by capturing the sensitivity of the distillation area to different types of knowledge and establishing the connection mode between different levels of feature migration, and meanwhile, the model parameter quantity, the calculation amount and the reasoning time are reduced. The method not only realizes the light weight of the remote sensing detector, but also improves the performance of the detector.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic flow chart of a knowledge distillation algorithm of mixed knowledge decoupling for remote sensing target detection.
Fig. 2 is an input picture provided in an embodiment of the present application.
Fig. 3 is a diagram illustrating the perception of semantic information and localization information.
Fig. 4 shows the detection results before distillation provided in an embodiment of the application.
Fig. 5 shows the detection results after distillation provided in an embodiment of the application.
Note that: the rectangular boxes in fig. 4 and 5 are artificially added to facilitate comparison of the detection results before and after distillation.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
As shown in fig. 1, a knowledge distillation algorithm for hybrid knowledge decoupling for remote sensing target detection includes the following steps:
s1, constructing a remote sensing target detection model as a teacher model for knowledge distillation.
The teacher model may be any of several mainstream remote sensing target detection models, for example single-stage methods such as Rotated RetinaNet and Rotated ATSS.
Specifically, the application takes Rotated RetinaNet as the base model, with ResNet50 as the backbone network; the teacher model and the student model share the same detection framework.
S2: compared with the student model, the teacher model has stronger representation ability and a larger model size. To obtain a compact student model, the remote sensing target detection model from step S1 is lightweighted using a pruning technique to form the student model for knowledge distillation; channel importance is judged from the weights of the BN layers in the student model, and channel pruning is applied to all modules in the student backbone network according to channel importance.
The lightweight processing specifically comprises the following steps:
S21, constructing the backbone network corresponding to the student model and loading the model parameters. Specifically, a ResNet50 network $R_O$ is constructed and its model parameters are loaded.
S22, traversing all BN layers in the student backbone network and recording their weights and channel counts; according to the manually set pruning rate θ, the channel count after pruning becomes θ times the original and is recorded.
Specifically, all BN layers of $R_O$ are traversed, recording each layer's weights $W^{BN}_i$ and channel count $C^{O}_i$; according to the manually set pruning rate θ, the post-pruning channel count becomes θ times the original and is recorded as $C^{P}_i$, wherein N is the number of BN layers contained in $R_O$ and θ is set to 0.8.
S23, traversing all BN layers and sorting their weights in descending order; the important weights to be retained are selected according to the post-pruning channel count, generating a pruning mask for each BN layer.
That is, all BN layers are traversed, important channels are screened, and pruning masks are generated. For the i-th BN layer, its weights $W^{BN}_i$ are sorted in descending order; the channels whose weights rank among the top $C^{P}_i$ values are retained, generating a mask $mask_i$ indicating whether each channel is important, as in formula (1'):
$mask_i(c) = 1$ if $W^{BN}_i(c)$ ranks among the top $C^{P}_i$ values, and 0 otherwise  (1')
After N traversals, a list of pruning masks $\{mask_1, \dots, mask_N\}$ is finally obtained.
S24, traversing all modules in the student backbone network, i.e., all modules of $R_O$; the weights of convolution layers, linear layers and BN layers are selected along the relevant dimension according to the pruning masks, and the unneeded weights are discarded.
Specifically, for the i-th convolution layer, the weight is a tensor of size $C^{O}_{out} \times C^{O}_{in} \times K^{CONV}_i \times K^{CONV}_i$. The weight pruning process is as follows: the weight tensor is indexed along dimension 0 according to the output-channel pruning mask and along dimension 1 according to the input-channel pruning mask, finally yielding a tensor of size $C^{P}_{out} \times C^{P}_{in} \times K^{CONV}_i \times K^{CONV}_i$, wherein $K^{CONV}_i$ denotes the convolution kernel size of the i-th convolution layer.
The linear layer in ResNet50 is the class prediction layer, whose weight is a tensor of size $N\_C \times C^{O}$. Its pruning process: the weight tensor is indexed along dimension 1 according to the pruning mask to filter out unimportant linear-layer weights, wherein $N\_C$ denotes the total number of categories of the dataset.
S25, constructing a lightweight backbone network according to the post-pruning channel counts, loading the pruned weights into the network, and generating a student model file for initializing the student model.
Specifically, a new ResNet50 network $R_P$ is created according to the channel counts $C^{P}$; all retained weights of $R_O$ are loaded into $R_P$, and all weights of $R_P$ are saved into a model file for convenient direct loading and use.
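As a rough illustration of steps S21 to S25, the following PyTorch-style sketch shows how BN-weight-based pruning masks might be generated and applied. The function names, the module traversal, and the use of the absolute BN scale as the channel-importance score are assumptions for illustration, not the application's exact implementation.

```python
import torch
import torch.nn as nn

def build_bn_prune_masks(backbone: nn.Module, theta: float = 0.8):
    """For every BN layer, keep the theta fraction of channels whose BN scale
    weights (|gamma|) are largest, as a proxy for channel importance."""
    masks = {}
    for name, module in backbone.named_modules():
        if isinstance(module, nn.BatchNorm2d):
            gamma = module.weight.detach().abs()            # per-channel importance
            n_keep = max(1, int(theta * gamma.numel()))     # channels kept after pruning
            order = torch.argsort(gamma, descending=True)   # sort weights descending
            mask = torch.zeros_like(gamma, dtype=torch.bool)
            mask[order[:n_keep]] = True                     # mark important channels
            masks[name] = mask
    return masks

def prune_conv_weight(w: torch.Tensor, out_mask: torch.Tensor, in_mask: torch.Tensor):
    """A conv weight of shape (C_out, C_in, k, k) is indexed along dim 0 with the
    output-channel mask and along dim 1 with the input-channel mask (step S24)."""
    return w[out_mask][:, in_mask]
```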
S3, the student model from step S2 predicts bounding box information and the target detection loss is calculated. Specifically, the input image is first preprocessed and converted into a tensor of size 3×1024×1024; the image tensor is then passed through the student model to obtain class scores and bounding box information, and the classification and regression loss values are calculated according to the remote sensing target detection loss functions.
The method specifically comprises the following steps:
s31, preprocessing an input image, including unifying the size and normalization of the image, and finally converting the input image into a tensor form;
specifically, the main step of preprocessing the input image includes unifying the image sizes to 1024×1024×3, then normalizing, the normalized mean value is (123.675,116.28,103.53), the variance is (58.395,57.12,57.375), and converting the normalized mean value into tensors with the sizes of 3×1024×1024. In order to accelerate the processing speed, batch processing is introduced, the number of image tensor dimensions is expanded, and the final dimension is Batch x 3 x 1024, wherein Batch represents the number of images in Batch processing.
S32, initializing the student model by loading the pruned lightweight network as the initial backbone network; specifically, the pruned network $R_P$ is loaded as the initial backbone network.
S33, the image tensor is input into the student model, and the backbone and neck networks extract image features from shallow to deep, yielding 5 image features of different granularities $F^{stu} = \{F^{stu}_1, \dots, F^{stu}_5\}$, as in formula (1):
$F^{stu} = (S_5 \circ S_4 \circ \cdots \circ S_1)(X)$  (1)
wherein $(S_1, S_2, \dots, S_5)$ denotes the 5 stages of the student feature-extraction network, $\circ$ denotes function composition, and X is the image tensor; $F^{stu}_1$ is a tensor of size Batch×256×128×128, $F^{stu}_2$ of size Batch×256×64×64, $F^{stu}_3$ of size Batch×256×32×32, $F^{stu}_4$ of size Batch×256×16×16, and $F^{stu}_5$ of size Batch×256×8×8.
S34, the multi-granularity student features $F^{stu}$ are passed through the detection head of the student model to predict bounding box information, yielding class prediction scores $Y^{stu}_{cls}$ and regression predictions $Y^{stu}_{reg}$, as in formulas (2) and (3):
$Y^{stu}_{cls} = S_{cls}(F^{stu})$  (2)
$Y^{stu}_{reg} = S_{reg}(F^{stu})$  (3)
wherein $S_{cls}$ and $S_{reg}$ respectively denote the class prediction layer and the regression prediction layer of the student model; the class predictions are tensors of size Batch×135×128×128, Batch×135×64×64, Batch×135×32×32, Batch×135×16×16 and Batch×135×8×8, and the regression predictions are tensors of size Batch×45×128×128, Batch×45×64×64, Batch×45×32×32, Batch×45×16×16 and Batch×45×8×8.
S35, the classification loss and the regression loss are calculated from the ground-truth labels Y and the regression targets Δ generated by the student detection head, as in formulas (4) and (5):
$L_{cls} = \mathcal{L}_{cls}(Y^{stu}_{cls}, Y)$  (4)
$L_{reg} = \mathcal{L}_{reg}(Y^{stu}_{reg}, \Delta)$  (5)
wherein $\mathcal{L}_{cls}$ denotes the Focal Loss classification loss function and $\mathcal{L}_{reg}$ the L1 Loss regression loss function.
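Formulas (4) and (5) instantiate $\mathcal{L}_{cls}$ as Focal Loss and $\mathcal{L}_{reg}$ as L1 Loss; a hedged sketch using standard library implementations follows. The anchor-matching step that produces `cls_targets`, `reg_targets` and `pos_mask` is assumed to exist and is not part of the application's text.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import sigmoid_focal_loss

def detection_loss(cls_logits, reg_preds, cls_targets, reg_targets, pos_mask):
    """cls_logits/cls_targets: Batch x N x K; reg_preds/reg_targets: Batch x N x 5;
    pos_mask: Batch x N boolean mask of positive anchors (matching assumed done)."""
    num_pos = pos_mask.sum().clamp(min=1)
    # Formula (4): Focal Loss between student class scores and targets.
    l_cls = sigmoid_focal_loss(cls_logits, cls_targets, alpha=0.25, gamma=2.0,
                               reduction="sum") / num_pos
    # Formula (5): L1 loss on the regression deltas of positive anchors only.
    l_reg = F.l1_loss(reg_preds[pos_mask], reg_targets[pos_mask], reduction="mean")
    return l_cls, l_reg
```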
S4, the teacher model guides the student model to decouple knowledge of different types; namely, under the guidance of the real annotation data, a semantic perception mask and a positioning perception mask for knowledge migration are generated according to the output of the teacher model.
As shown in fig. 3, each element on the feature map perceives different types of knowledge differently, so the contribution of each element to the different knowledge migration processes must be distinguished. A teacher model with strong representation and knowledge-perception abilities can guide the student model to decouple different types of knowledge. The application therefore designs a knowledge decoupling module that uses the class prediction scores and regression predictions of the teacher model to capture semantic knowledge and localization knowledge on the feature map, guiding the student model to migrate different types of knowledge in a targeted manner.
Generating the semantic-aware mask and the localization-aware mask in step S4 specifically comprises the following steps:
S41, loading the pre-trained teacher model and freezing all of its parameters so that no gradients are backpropagated;
S42, the backbone network ResNet50 and the neck network FPN of the teacher model extract image features from shallow to deep, yielding teacher features of different granularities $F^{tea} = \{F^{tea}_1, \dots, F^{tea}_5\}$, as in formula (6):
$F^{tea} = (T_5 \circ T_4 \circ \cdots \circ T_1)(X)$  (6)
wherein $(T_1, T_2, \dots, T_5)$ denotes the five stages of the teacher feature-extraction network; $F^{tea}_1$ is a tensor of size Batch×256×128×128, $F^{tea}_2$ of size Batch×256×64×64, $F^{tea}_3$ of size Batch×256×32×32, $F^{tea}_4$ of size Batch×256×16×16, and $F^{tea}_5$ of size Batch×256×8×8.
S43, the multi-granularity teacher features $F^{tea}$ are passed through the detection head of the teacher model to predict bounding box information, yielding class prediction scores $Y^{tea}_{cls}$ and regression predictions $Y^{tea}_{reg}$, as in formulas (7) and (8):
$Y^{tea}_{cls} = T_{cls}(F^{tea})$  (7)
$Y^{tea}_{reg} = T_{reg}(F^{tea})$  (8)
wherein $T_{cls}$ and $T_{reg}$ respectively denote the class prediction layer and the regression prediction layer of the teacher model; the class predictions are tensors of size Batch×135×128×128 down to Batch×135×8×8 and the regression predictions tensors of size Batch×45×128×128 down to Batch×45×8×8, matching the student outputs.
S44, the bounding box information predicted by the teacher model is used to mine the boundary between semantic knowledge and localization knowledge, generating a semantic-aware mask $M_{se}$ and a localization-aware mask $M_{lo}$ that capture the sensitivity of each distillation region to semantic and localization knowledge.
Step S44 specifically comprises the following steps:
S441, for each element of the teacher class score map $Y^{tea}_{cls}$, the maximum over all class prediction scores is taken as the semantic-aware mask $M_{se}$, as in formula (9):
$M^{k}_{se}(i,j) = \max_{c \in \{c_1, \dots, c_K\}} Y^{tea}_{cls,k}(i,j,c)$  (9)
wherein K is the total number of classes and $(c_1, \dots, c_i, \dots, c_K)$ denotes all target categories;
S442, for the teacher feature map $F^{tea}$, the anchor boxes A are decoded into the corresponding predicted regression boxes $B^{tea}$ according to the regression predictions $Y^{tea}_{reg}$; the IoU between the predicted boxes and the ground-truth boxes GT is then calculated and used as the localization-aware mask $M_{lo}$. Specifically, the regression output of size Batch×45×$H_i$×$W_i$ is reshaped into a tensor of size Batch×($H_i \cdot W_i \cdot 9$)×5, and the anchor boxes A generated by the detection head are decoded into predicted boxes $B^{tea}$ of size Batch×($H_i \cdot W_i \cdot 9$)×5 according to the regression predictions. Since each image contains a different number of ground-truth boxes, Batch iterations are performed to compute the IoU between predicted and ground-truth boxes and take the maximum, producing a tensor of size $H_i \cdot W_i$ per iteration; these are finally stacked into a tensor of size Batch×$H_i$×$W_i$. The specific operations are as in formulas (10) and (11):
$B^{tea} = \mathrm{decode}(A, Y^{tea}_{reg})$  (10)
$M^{k}_{lo}(i,j) = \max_{m=1,\dots,M} \mathrm{IoU}(B^{tea}_m, GT)\big|_{(i,j)}$  (11)
wherein M denotes the number of predicted boxes and decode is the bounding-box decoding function.
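A sketch of this knowledge-decoupling step for a single FPN level might look as follows; `decode_boxes` and `box_iou` stand in for the rotated-box decoding and IoU routines, which the application does not specify, and the anchor/class channel layout is an assumption.

```python
import torch

def semantic_mask(teacher_cls: torch.Tensor) -> torch.Tensor:
    """Formula (9): teacher_cls is Batch x (9*K) x H x W class logits; the mask is
    the maximum class score over all anchors and classes at each position."""
    b, _, h, w = teacher_cls.shape
    return teacher_cls.sigmoid().view(b, -1, h, w).amax(dim=1)   # Batch x H x W

def localization_mask(teacher_reg, anchors, gt_boxes, decode_boxes, box_iou):
    """Formulas (10)-(11): decode anchors with the teacher regression output,
    then take each position's best IoU against the ground-truth boxes."""
    b, _, h, w = teacher_reg.shape
    deltas = teacher_reg.permute(0, 2, 3, 1).reshape(b, h * w * 9, 5)
    masks = []
    for i in range(b):                                  # per-image GT counts differ
        boxes = decode_boxes(anchors, deltas[i])        # (H*W*9) x 5 predicted boxes
        iou = box_iou(boxes, gt_boxes[i])               # (H*W*9) x num_gt
        best = iou.max(dim=1).values.view(h * w, 9).amax(dim=1)  # max over GT, anchors
        masks.append(best.view(h, w))
    return torch.stack(masks)                           # Batch x H x W
```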
S5, under the guidance of the semantic-aware mask from step S4, semantic knowledge is distilled at the output level and the output-layer distillation loss is calculated.
The basic idea of output-layer knowledge distillation (logit distillation) is to take the output of the teacher model as supervision information and continuously optimize, during distillation, the distance between the student output and the soft labels provided by the teacher. The conventional logit distillation process is shown in formula (2'):
$L_{logit} = \frac{1}{H_k W_k} \sum_{i=1}^{H_k} \sum_{j=1}^{W_k} \mathcal{L}_{logit}\big(p^{tea}_{k,i,j}, p^{stu}_{k,i,j}, T\big)$  (2')
wherein $H_k$ and $W_k$ denote the size of the k-layer feature map; $p^{tea}_{k,i,j}$ (k ∈ {1,2,3,4,5}) denotes the probability vector at position (i,j) of the k-layer teacher class probability map and $p^{stu}_{k,i,j}$ (k ∈ {1,2,3,4,5}) that of the k-layer student class probability map; $\mathcal{L}_{logit}$ is the logit distillation loss, measuring the similarity between student predictions and soft labels; and T is the smoothing factor.
However, this distillation pattern migrates semantic knowledge equally for every distillation region. In fact, not all regions contribute equally to semantic migration. Therefore, to enhance regions with strong semantic sensitivity and suppress regions with weak semantic sensitivity, the application introduces a semantic mask that gives each distillation region a different weight.
On top of the conventional logit distillation paradigm, the semantic mask $M_{se}$ assigns each element on the class probability map a different weight so that each element exerts its due distillation value.
Specifically, the output-layer distillation loss is calculated as in formula (12):
$L_{logit} = \sum_{k} \frac{1}{\sum_{i,j} M^{k}_{se}(i,j)} \sum_{i=1}^{H_k} \sum_{j=1}^{W_k} M^{k}_{se}(i,j)\, \mathcal{L}_{logit}\big(p^{tea}_{k,i,j}, p^{stu}_{k,i,j}, T\big)$  (12)
wherein $M^{k}_{se}(i,j)$ is the (i,j) element of the k-layer semantic-aware mask, $\sum_{i,j} M^{k}_{se}(i,j)$ is the sum of all its element values, $\mathcal{L}_{logit}$ is the KL divergence function, and T is set to 10.
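A sketch of the semantic-mask-weighted logit distillation of formula (12), with temperature T = 10 as stated; the per-position probability layout (Batch x H x W x K, reshaped from the head output) and the conventional T² gradient scaling are assumptions.

```python
import torch
import torch.nn.functional as F

def masked_logit_distill(stu_logits, tea_logits, mask_se, T: float = 10.0):
    """stu_logits/tea_logits: Batch x H x W x K per-position class logits;
    mask_se: Batch x H x W semantic-aware weights (formula (12))."""
    p_tea = F.softmax(tea_logits / T, dim=-1)
    log_p_stu = F.log_softmax(stu_logits / T, dim=-1)
    kl = F.kl_div(log_p_stu, p_tea, reduction="none").sum(dim=-1)  # KL per position
    kl = kl * (T * T)                       # conventional temperature scaling for KD
    return (mask_se * kl).sum() / mask_se.sum().clamp(min=1e-6)    # mask-weighted mean
```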
S6, under the guidance of the semantic-aware mask and the localization-aware mask from step S4, semantic features and localization features of different layers are cross-distilled and the cross-feature distillation loss is calculated; the feature distillation processes between teacher and student layers of different depths are adaptively fused, so that shallow teacher features can guide deep student features.
The basic idea of feature-layer knowledge distillation (feature distillation) is to have the student features mimic the teacher features, continuously narrowing the gap between them during distillation, as shown in formula (3'):
$L_{feat} = \frac{1}{H_k W_k} \sum_{i=1}^{H_k} \sum_{j=1}^{W_k} \mathcal{L}_{feat}\big(F^{tea}_k(i,j), F^{stu}_k(i,j)\big)$  (3')
wherein $F^{tea}_k(i,j)$ (k ∈ {1,2,3,4,5}) denotes the feature vector at position (i,j) of the k-layer teacher feature map, $F^{stu}_k(i,j)$ (k ∈ {1,2,3,4,5}) that of the k-layer student feature map, and $\mathcal{L}_{feat}$ the feature distillation loss, measuring the similarity between teacher and student features.
Traditional feature distillation requires the student to mimic teacher features of the same granularity, which isolates migration between features of different granularities. In fact, the shallow features of the teacher model also have guiding value for the deep features of the student model. Furthermore, feature distillation is essentially knowledge migration between corresponding elements of the student and teacher feature maps, and treating every element's role in this migration as equal confuses the contributions of different distillation regions. Moreover, since remote sensing target detection combines a classification task and a regression task, each distillation region contributes differently to semantic knowledge migration and to localization knowledge migration.
Therefore, the application designs a multi-layer feature interaction module that iteratively fuses student features of different granularities into fused features, enriching the information available in the feature distillation stage and enabling cross-feature distillation. The semantic mask and localization mask generated from the teacher's predictions are introduced to distinguish each distillation region's sensitivity to semantic and localization knowledge, making semantic feature distillation and localization feature distillation targeted.
Specifically, the cross-feature distillation process comprises the steps of:
s61, according to the size of the receptive field, student characteristics F with different granularities are obtained stu Performing descending order sorting to obtainSequentially and iteratively fusing and updating;
first, initialize the fusion feature, pairFeature transformation is achieved>Secondly, carrying out feature fusion; at t iterations, due to the precursor fusion feature +.>Size and current characteristics of->Inconsistent, so that the two features are aligned by an interpolation method, and then the fusion feature of t iterations is obtained by weighted summation>After 5-1 iterations, reversing the feature sequence to finally obtain the multi-layer fusion feature ++> The specific operation is shown in the formula (13) and the formula (14):
wherein phi is a characteristic transformation layer, and />Fusion weights for t iterations, +.>As a bilinear interpolation function,is of dimension Batch by 256 by H t-1 *W t-1 ,/>Is of dimension Batch by 256 by H t *W t ,/>Is Bach×256×H t *W t
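The iterative fusion of formulas (13) and (14) might be realized as follows; the 1×1-convolution transform φ, the learnable scalar fusion weights, and the bilinear alignment are assumptions consistent with the description, not the application's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiLayerFusion(nn.Module):
    """Iteratively fuse student FPN features, sorted from largest receptive
    field (smallest map) to smallest, per formulas (13)-(14)."""
    def __init__(self, channels: int = 256, n_levels: int = 5):
        super().__init__()
        self.phi = nn.ModuleList(nn.Conv2d(channels, channels, 1)
                                 for _ in range(n_levels))       # transform phi
        self.w = nn.Parameter(torch.ones(n_levels, 2))           # weights w_t^1, w_t^2

    def forward(self, feats_sorted):          # list of Batch x C x H_t x W_t tensors
        fused = [self.phi[0](feats_sorted[0])]                   # formula (13)
        for t in range(1, len(feats_sorted)):
            cur = self.phi[t](feats_sorted[t])
            prev = F.interpolate(fused[-1], size=cur.shape[-2:], mode="bilinear",
                                 align_corners=False)            # align sizes
            fused.append(self.w[t, 0] * cur + self.w[t, 1] * prev)  # formula (14)
        return fused[::-1]                    # reverse back to the original order
```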
S62, on top of the conventional feature distillation paradigm, the semantic-aware mask $M_{se}$ and the localization-aware mask $M_{lo}$ are introduced to perform feature distillation separately for semantic knowledge and localization knowledge on the feature map; in addition, the student features used in the distillation process are the fused features $H^{stu}$. The loss functions are given in formulas (15), (16) and (17):
$L_{se} = W_{se} \sum_{k} \frac{1}{\sum_{i,j} M^{k}_{se}(i,j)} \sum_{i,j} M^{k}_{se}(i,j)\, \mathcal{L}_{feat}\big(F^{tea}_k(i,j), H^{stu}_k(i,j)\big)$  (15)
$L_{lo} = W_{lo} \sum_{k} \frac{1}{\sum_{i,j} M^{k}_{lo}(i,j)} \sum_{i,j} M^{k}_{lo}(i,j)\, \mathcal{L}_{feat}\big(F^{tea}_k(i,j), H^{stu}_k(i,j)\big)$  (16)
$L_{feat} = L_{se} + L_{lo}$  (17)
wherein $M^{k}_{se}(i,j)$ and $M^{k}_{lo}(i,j)$ are the (i,j) elements of the k-layer semantic-aware and localization-aware masks, $\sum_{i,j} M^{k}_{se}(i,j)$ and $\sum_{i,j} M^{k}_{lo}(i,j)$ the sums of their element values, $\mathcal{L}_{feat}$ is the L1 Loss, and $W_{se}$ and $W_{lo}$, the coefficients of the semantic and localization feature distillation losses, are both set to 1.
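The mask-weighted feature distillation of formulas (15) to (17) can be sketched as below, using the stated L1 distance and $W_{se} = W_{lo} = 1$; the per-level looping and the channel-mean reduction are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def cross_feature_distill(h_stu, f_tea, masks_se, masks_lo, w_se=1.0, w_lo=1.0):
    """h_stu: fused student features per level; f_tea: teacher features per level;
    masks_*: Batch x H x W decoupled knowledge masks (formulas (15)-(17))."""
    l_se = l_lo = 0.0
    for hs, ft, mse, mlo in zip(h_stu, f_tea, masks_se, masks_lo):
        dist = F.l1_loss(hs, ft, reduction="none").mean(dim=1)   # Batch x H x W
        l_se = l_se + (mse * dist).sum() / mse.sum().clamp(min=1e-6)
        l_lo = l_lo + (mlo * dist).sum() / mlo.sum().clamp(min=1e-6)
    return w_se * l_se + w_lo * l_lo          # L_feat = L_se + L_lo, formula (17)
```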
And S7, calculating a total loss function value and optimizing the student model. The total loss function value is calculated as in equation (18):
$L_{total} = \alpha L_{cls} + \beta L_{reg} + \gamma L_{logit} + \lambda L_{feat}$  (18)
wherein, alpha, beta, gamma and lambda respectively represent the corresponding coefficients of the loss of each part. The corresponding values are 1, 0.01, respectively.
Specifically, student model optimization includes the steps of:
s71, calculating the gradient of the total loss function on the student model parameters by using a back propagation mechanism;
s72, updating student model parameters along the gradient direction.
The knowledge distillation method provided by the application is mainly used for solving the problem that a large-parameter and complex remote sensing detection model is difficult to deploy to satellite and other edge equipment. The method not only realizes the light weight of the remote sensing detector, but also improves the performance of the detector.
The above detailed description of the knowledge distillation method for mixed knowledge decoupling for remote sensing target detection provided by the application uses specific examples to illustrate the principle and implementation of the application; the above examples are only intended to help understand the method and its core idea. Those skilled in the art will also appreciate that the application can be practiced with variations in specific details and application scope in accordance with its ideas; in summary, the contents of this specification are not to be construed as limitations on the application.
The previous description of the disclosure is provided to enable any person skilled in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the spirit or scope of the disclosure. Thus, the disclosure is not intended to be limited to the examples and designs described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (10)

1. A knowledge distillation algorithm for mixed knowledge decoupling for remote sensing target detection, characterized by comprising the following steps:
S1, constructing a remote sensing target detection model as the teacher model for knowledge distillation;
S2, applying lightweight processing to the model from step S1 to form the student model for knowledge distillation;
S3, the student model from step S2 predicts bounding box information and the target detection loss is calculated;
S4, the teacher model guides the student model to decouple different types of knowledge; under the guidance of the ground-truth annotations, a semantic-aware mask and a localization-aware mask for knowledge migration are generated from the output of the teacher model;
S5, under the guidance of the semantic-aware mask from step S4, semantic knowledge is distilled at the output level and the output-layer distillation loss is calculated;
S6, under the guidance of the semantic-aware mask and the localization-aware mask from step S4, semantic features and localization features of different layers are cross-distilled and the cross-feature distillation loss is calculated; the feature distillation processes between teacher and student layers of different depths are adaptively fused, so that shallow teacher features can guide deep student features;
and S7, calculating a total loss function value and optimizing the student model.
2. The knowledge distillation algorithm for mixed knowledge decoupling for remote sensing target detection of claim 1, wherein: in step S2, a pruning technique is adopted to lightweight the remote sensing target detection model; channel importance is judged from the weights of the BN layers in the student model, and channel pruning is applied to all modules in the student backbone network according to channel importance.
3. The knowledge distillation algorithm for mixed knowledge decoupling for remote sensing target detection of claim 2, wherein the lightweight processing specifically comprises:
S21, constructing the backbone network corresponding to the student model and loading the model parameters;
S22, traversing all BN layers in the student backbone network and recording their weights and channel counts; according to the manually set pruning rate θ, the channel count after pruning becomes θ times the original and is recorded;
S23, traversing all BN layers and sorting their weights in descending order; the important weights to be retained are selected according to the post-pruning channel count, generating a pruning mask for each BN layer;
S24, traversing all modules in the student backbone network, selecting the weights of convolution layers, linear layers and BN layers along the relevant dimension according to the pruning masks, and discarding the unneeded weights;
S25, constructing a lightweight backbone network according to the post-pruning channel counts, saving the pruned weights into the network, and generating a student model file for initializing the student model.
4. The knowledge distillation algorithm for mixed knowledge decoupling for remote sensing target detection of claim 1, wherein step S3 specifically comprises the following steps:
S31, preprocessing the input image, including unifying the image size and normalizing, and finally converting it into tensor form;
S32, initializing the student model by loading the pruned lightweight network as the initial backbone network;
S33, inputting the image tensor into the student model; the backbone and neck networks of the student model extract image features from shallow to deep, yielding student features of different granularities $F^{stu} = \{F^{stu}_1, F^{stu}_2, \dots, F^{stu}_n\}$, as in formula (1):
$F^{stu} = (S_n \circ S_{n-1} \circ \cdots \circ S_1)(X)$  (1)
wherein $(S_1, S_2, \dots, S_n)$ denotes the n stages of the student feature-extraction network, n is the number of feature granularities, $\circ$ denotes function composition, and X is the image tensor;
S34, the multi-granularity student features $F^{stu}$ are passed through the detection head of the student model to predict bounding box information, yielding class prediction scores $Y^{stu}_{cls}$ and regression predictions $Y^{stu}_{reg}$, as in formulas (2) and (3):
$Y^{stu}_{cls} = S_{cls}(F^{stu})$  (2)
$Y^{stu}_{reg} = S_{reg}(F^{stu})$  (3)
wherein $S_{cls}$ and $S_{reg}$ respectively denote the class prediction layer and the regression prediction layer of the student model;
S35, the classification loss and the regression loss are calculated from the ground-truth labels Y and the regression targets Δ generated by the student detection head, as in formulas (4) and (5):
$L_{cls} = \mathcal{L}_{cls}(Y^{stu}_{cls}, Y)$  (4)
$L_{reg} = \mathcal{L}_{reg}(Y^{stu}_{reg}, \Delta)$  (5)
wherein $\mathcal{L}_{cls}$ denotes the classification loss function and $\mathcal{L}_{reg}$ the regression loss function.
5. The knowledge distillation algorithm for mixed knowledge decoupling for remote sensing target detection of claim 1, wherein generating the semantic-aware mask and the localization-aware mask in step S4 specifically comprises the following steps:
S41, loading the pre-trained teacher model and freezing all of its parameters so that no gradients are backpropagated;
S42, the backbone and neck networks of the teacher model extract image features from shallow to deep, yielding teacher features of different granularities $F^{tea} = \{F^{tea}_1, \dots, F^{tea}_n\}$, as in formula (6):
$F^{tea} = (T_n \circ T_{n-1} \circ \cdots \circ T_1)(X)$  (6)
wherein $(T_1, T_2, \dots, T_n)$ denotes the n stages of the teacher feature-extraction network;
S43, the multi-granularity teacher features $F^{tea}$ are passed through the detection head of the teacher model to predict bounding box information, yielding class prediction scores $Y^{tea}_{cls}$ and regression predictions $Y^{tea}_{reg}$, as in formulas (7) and (8):
$Y^{tea}_{cls} = T_{cls}(F^{tea})$  (7)
$Y^{tea}_{reg} = T_{reg}(F^{tea})$  (8)
wherein $T_{cls}$ and $T_{reg}$ respectively denote the class prediction layer and the regression prediction layer of the teacher model;
S44, the bounding box information predicted by the teacher model is used to mine the boundary between semantic knowledge and localization knowledge, generating a semantic-aware mask $M_{se}$ and a localization-aware mask $M_{lo}$ that capture the sensitivity of each distillation region to semantic and localization knowledge.
6. The knowledge distillation algorithm for mixed knowledge decoupling for remote sensing target detection of claim 5, wherein step S44 specifically comprises the following steps:
S441, for each element of the teacher class score map $Y^{tea}_{cls}$, the maximum over all class prediction scores is taken as the semantic-aware mask $M_{se}$, as in formula (9):
$M^{k}_{se}(i,j) = \max_{c \in \{c_1, \dots, c_K\}} Y^{tea}_{cls,k}(i,j,c)$  (9)
wherein K is the total number of classes and $(c_1, \dots, c_i, \dots, c_K)$ denotes all target categories;
S442, for the teacher feature map $F^{tea}$, the anchor boxes A are encoded into the corresponding predicted regression boxes $B^{tea}$ according to the regression predictions $Y^{tea}_{reg}$; the IoU between the predicted regression boxes and the ground-truth boxes GT is then calculated and used as the localization-aware mask $M_{lo}$, as in formulas (10) and (11):
$B^{tea} = \mathrm{decode}(A, Y^{tea}_{reg})$  (10)
$M^{k}_{lo}(i,j) = \max_{m=1,\dots,M} \mathrm{IoU}(B^{tea}_m, GT)\big|_{(i,j)}$  (11)
wherein M denotes the number of predicted boxes and decode is the bounding-box decoding function.
7. The knowledge distillation algorithm for mixed knowledge decoupling for remote sensing target detection of claim 1, wherein, under the guidance of the semantic-aware mask $M_{se}$, the output-layer distillation loss in step S5 is calculated as in formula (12):
$L_{logit} = \sum_{k} \frac{1}{\sum_{i,j} M^{k}_{se}(i,j)} \sum_{i=1}^{H_k} \sum_{j=1}^{W_k} M^{k}_{se}(i,j)\, \mathcal{L}_{logit}\big(p^{tea}_{k,i,j}, p^{stu}_{k,i,j}, T\big)$  (12)
wherein $H_k$ and $W_k$ denote the size of the k-layer feature map, $p^{tea}_{k,i,j}$ the probability vector at position (i,j) of the k-layer teacher feature map, $p^{stu}_{k,i,j}$ that of the k-layer student feature map, $\mathcal{L}_{logit}$ the logit distillation loss measuring the similarity between student predictions and soft labels, T the smoothing factor, $M^{k}_{se}(i,j)$ the (i,j) element of the k-layer semantic-aware mask, and $\sum_{i,j} M^{k}_{se}(i,j)$ the sum of all its element values.
8. The knowledge distillation algorithm for mixed knowledge decoupling for remote sensing target detection of claim 1, wherein the cross-feature distillation process in step S6 comprises the following steps:
S61, the student features $F^{stu}$ of different granularities are sorted in descending order of receptive field size to obtain $\hat{F}^{stu} = \{\hat{F}^{stu}_1, \dots, \hat{F}^{stu}_n\}$, which are then fused and updated iteratively;
specifically, the fused feature is first initialized by a feature transformation of $\hat{F}^{stu}_1$, giving $H_1$; feature fusion is then performed. At iteration t, the preceding fused feature $H_{t-1}$ and the current feature $\hat{F}^{stu}_t$ differ in size, so the two are aligned by interpolation, and the fused feature $H_t$ of iteration t is obtained by weighted summation. After n-1 iterations, the feature order is reversed, finally yielding the multi-layer fused features $H^{stu} = \{H^{stu}_1, \dots, H^{stu}_n\}$, as in formulas (13) and (14):
$H_1 = \phi(\hat{F}^{stu}_1)$  (13)
$H_t = w^{1}_{t}\, \phi(\hat{F}^{stu}_t) + w^{2}_{t}\, \mathrm{Interp}(H_{t-1}), \quad t = 2, \dots, n$  (14)
wherein φ is a feature-transformation layer, $w^{1}_{t}$ and $w^{2}_{t}$ are the fusion weights at iteration t, and Interp is an interpolation function;
S62, the semantic-aware mask $M_{se}$ and the localization-aware mask $M_{lo}$ are introduced to perform feature distillation separately for semantic knowledge and localization knowledge on the feature map; in addition, the student features used in the distillation process are the fused features $H^{stu}$. The loss functions are given in formulas (15), (16) and (17):
$L_{se} = W_{se} \sum_{k} \frac{1}{\sum_{i,j} M^{k}_{se}(i,j)} \sum_{i,j} M^{k}_{se}(i,j)\, \mathcal{L}_{feat}\big(F^{tea}_k(i,j), H^{stu}_k(i,j)\big)$  (15)
$L_{lo} = W_{lo} \sum_{k} \frac{1}{\sum_{i,j} M^{k}_{lo}(i,j)} \sum_{i,j} M^{k}_{lo}(i,j)\, \mathcal{L}_{feat}\big(F^{tea}_k(i,j), H^{stu}_k(i,j)\big)$  (16)
$L_{feat} = L_{se} + L_{lo}$  (17)
wherein $M^{k}_{se}(i,j)$ and $M^{k}_{lo}(i,j)$ are the (i,j) elements of the k-layer semantic-aware and localization-aware masks, $\sum_{i,j} M^{k}_{se}(i,j)$ and $\sum_{i,j} M^{k}_{lo}(i,j)$ the sums of their element values, $\mathcal{L}_{feat}$ a conventional feature distillation loss function, and $W_{se}$ and $W_{lo}$ the coefficients of the semantic and localization feature distillation losses.
9. The knowledge distillation algorithm for hybrid knowledge decoupling for remote sensing target detection of claim 1, wherein: in step S7, the total loss function value includes classification loss, regression loss, cross feature distillation loss, and output layer distillation loss of the remote sensing target detection task.
10. The knowledge distillation algorithm for mixed knowledge decoupling for remote sensing target detection as claimed in claim 1, wherein the student model optimization in step S7 comprises the steps of:
s71, calculating the gradient of the total loss function on the student model parameters by using a back propagation mechanism;
s72, updating student model parameters along the gradient direction;
the total loss function value is calculated as in equation (18):
$L_{total} = \alpha L_{cls} + \beta L_{reg} + \gamma L_{logit} + \lambda L_{feat}$  (18)
wherein, alpha, beta, gamma and lambda respectively represent the corresponding coefficients of the loss of each part.
CN202310521321.5A 2023-05-10 2023-05-10 Mixed knowledge decoupling knowledge distillation algorithm for remote sensing target detection Pending CN116665068A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310521321.5A CN116665068A (en) 2023-05-10 2023-05-10 Mixed knowledge decoupling knowledge distillation algorithm for remote sensing target detection


Publications (1)

Publication Number Publication Date
CN116665068A (en) 2023-08-29

Family

ID=87712726

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310521321.5A Pending CN116665068A (en) 2023-05-10 2023-05-10 Mixed knowledge decoupling knowledge distillation algorithm for remote sensing target detection

Country Status (1)

Country Link
CN (1) CN116665068A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117521848A (en) * 2023-11-10 2024-02-06 中国科学院空天信息创新研究院 Remote sensing basic model light-weight method and device for resource-constrained scene
CN117521848B (en) * 2023-11-10 2024-05-28 中国科学院空天信息创新研究院 Remote sensing basic model light-weight method and device for resource-constrained scene


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination