CN114565045A - Remote sensing target detection knowledge distillation method based on feature separation attention - Google Patents

Remote sensing target detection knowledge distillation method based on feature separation attention

Info

Publication number
CN114565045A
CN114565045A
Authority
CN
China
Prior art keywords
attention
mask
feature
foreground
background
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210194931.4A
Other languages
Chinese (zh)
Inventor
赵丹培
袁智超
苑博
史振威
张浩鹏
姜志国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN202210194931.4A priority Critical patent/CN114565045A/en
Publication of CN114565045A publication Critical patent/CN114565045A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/217Validation; Performance evaluation; Active pattern learning techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/02Knowledge representation; Symbolic representation
    • G06N5/022Knowledge engineering; Knowledge acquisition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/194Segmentation; Edge detection involving foreground-background segmentation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a remote sensing target detection knowledge distillation method based on feature separation attention, which comprises the following steps: extracting feature attention maps from the feature maps output by a teacher network and a student network, respectively; separating the foreground region and the background region of the feature map, calculating a foreground attention mask through the feature attention map of the teacher network, and calculating a background attention mask through the feature attention map of the student network; calculating the L2 distillation loss using the foreground attention mask and the background attention mask; and migrating the knowledge of the teacher network to the student network based on the L2 distillation loss. The method can effectively select the regions to be distilled and improve distillation efficiency, improving the detection precision of the final lightweight target detection network without changing the student network structure or increasing computation cost.

Description

Remote sensing target detection knowledge distillation method based on feature separation attention
Technical Field
The invention relates to the technical field of knowledge distillation, in particular to a remote sensing target detection knowledge distillation method based on feature separation attention.
Background
The emergence of large-scale high-resolution remote sensing image datasets has enabled deep learning to be widely applied to remote sensing image target detection. However, high-precision algorithms have high time complexity and depend on high-performance graphics processors. In practical engineering applications of remote sensing target detection, embedded systems are the more common platform. Some lightweight deep learning target detection algorithms already exist; although they run fast, their detection precision still cannot meet task requirements.
At present, some knowledge distillation methods are used to improve the performance of deep neural networks, for example by performing transfer learning on the outputs, feature maps, and information flows of a network. However, most of these studies focus on image classification. In target detection, the information to be migrated by knowledge distillation is not clearly defined, and since the proportion of background regions in target detection data is much higher than in classification data, directly applying image-classification-style distillation suffers severe interference from the background and cannot achieve the desired effect.
Therefore, how to provide a knowledge distillation method capable of effectively extracting a region to be distilled, without increasing the calculation consumption, and improving the detection accuracy of a lightweight target detection network is a problem to be solved by those skilled in the art.
Disclosure of Invention
In view of the above, the invention provides a remote sensing target detection knowledge distillation method based on feature separation attention, which can effectively select a region to be distilled, improve distillation efficiency, and improve detection accuracy of a final lightweight target detection network on the premise of not changing a student network structure and not increasing calculation consumption.
In order to achieve the purpose, the invention adopts the following technical scheme:
a remote sensing target detection knowledge distillation method based on feature separation attention comprises the following steps:
respectively extracting feature attention diagrams of feature diagrams output by a teacher network and a student network;
separating a foreground area and a background area of the feature map, calculating a foreground attention mask through a feature attention map of a teacher network, and calculating a background attention mask through a feature attention map of a student network;
calculating the L2 distillation loss using the foreground attention mask and the background attention mask;
the knowledge of the teacher network was migrated to the student network based on the L2 distillation loss.
Preferably, in the above remote sensing target detection knowledge distillation method based on feature separation attention, the feature attention map includes spatial attention and channel attention; wherein the spatial attention is calculated as:

$$G_s(A)_{i,j} = \frac{1}{C}\sum_{k=1}^{C}\left|A_{k,i,j}\right|$$

and the channel attention is calculated as:

$$G_c(A)_{k} = \frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W}\left|A_{k,i,j}\right|$$

wherein $G_s$ represents the spatial attention, characterizing the degree of importance of each pixel position across the channels; $G_c$ represents the channel attention, characterizing the degree of importance of each channel of the feature map; H, W and C represent the height, width and number of channels of the feature map, respectively; and $A_{k,i,j}$ represents the value of feature map A at the (i, j) coordinate position of its k-th channel.
Preferably, in the above remote sensing target detection knowledge distillation method based on feature separation attention, calculating the foreground attention mask through the feature attention map of the teacher network includes:
calculating the spatial attention and channel attention of the feature map $A^t$ output by the teacher network;
performing probability normalization on the spatial attention and the channel attention of $A^t$, respectively, using the Softmax function;
multiplying the probability-normalized spatial attention and channel attention of $A^t$ by the labeled foreground mask $S_f$ to obtain the weighted foreground attention mask $J_f$.
Preferably, in the above remote sensing target detection knowledge distillation method based on feature separation attention, the foreground attention mask is calculated as:

$$\hat{G}_s(A^t) = \mathrm{softmax}\!\left(G_s(A^t)\right)$$

$$\hat{G}_c(A^t) = \mathrm{softmax}\!\left(G_c(A^t)\right)$$

$$J_f = S_f \cdot \hat{G}_s(A^t) \cdot \hat{G}_c(A^t)$$

wherein the probability normalization function

$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}}$$

normalizes the variable z into probabilities in (0, 1) whose sum over all entries is 1; H, W and C represent the height, width and number of channels of the feature map $A^t$, respectively; $\hat{G}_s(A^t)$ and $\hat{G}_c(A^t)$ represent the probability-normalized spatial attention and channel attention of the teacher network, respectively; and $S_f$ is the foreground mask, which takes the value 1 in the foreground region and 0 in the background region.
Preferably, in the above remote sensing target detection knowledge distillation method based on feature separation attention, the method further includes: normalizing the foreground mask $S_f$ according to the size of the target to obtain a normalized foreground mask $\hat{S}_f$, and using $\hat{S}_f$ in place of $S_f$ when calculating the foreground attention mask; the normalized foreground mask is calculated as:

$$\hat{S}_{f,i,j} = \begin{cases} \dfrac{1}{s_t}, & (i,j) \in t,\ t \in T \\ 0, & \text{otherwise} \end{cases}$$

where T represents the set of all targets, t represents each target, and $s_t$ represents the area of the region of target t.
Preferably, in the above remote sensing target detection knowledge distillation method based on feature separation attention, calculating the background attention mask through the feature attention map of the student network includes:
calculating the spatial attention and channel attention of the feature map $A^s$ output by the student network;
performing probability normalization on the spatial attention and the channel attention of $A^s$, respectively, using the Softmax function;
multiplying the probability-normalized spatial attention and channel attention of $A^s$ by the labeled background mask $S_b$ to obtain the weighted background attention mask $J_b$.
Preferably, in the above remote sensing target detection knowledge distillation method based on feature separation attention, the background attention mask $J_b$ is calculated as:

$$\hat{G}_s(A^s) = \mathrm{softmax}\!\left(G_s(A^s)\right)$$

$$\hat{G}_c(A^s) = \mathrm{softmax}\!\left(G_c(A^s)\right)$$

$$J_b = S_b \cdot \hat{G}_s(A^s) \cdot \hat{G}_c(A^s)$$

wherein H, W and C represent the height, width and number of channels of the feature map $A^s$, respectively; $\hat{G}_s(A^s)$ and $\hat{G}_c(A^s)$ represent the probability-normalized spatial attention and channel attention of the student network, respectively; and $S_b$ is the background mask, which takes the value 0 in the foreground region and 1 in the background region.
Preferably, in the above remote sensing target detection knowledge distillation method based on feature separation attention, the method further includes: normalizing the background mask $S_b$ according to the area of the background region to obtain a normalized background mask $\hat{S}_b$, and using $\hat{S}_b$ in place of $S_b$ when calculating the background attention mask; the normalized background mask is calculated as:

$$\hat{S}_{b,i,j} = \frac{S_{b,i,j}}{HW - \sum_{t \in T} s_t}$$

where T represents the set of all targets, t represents each target, and $s_t$ represents the area of the region of target t, so that the denominator is the area of the background region.
Preferably, in the above remote sensing target detection knowledge distillation method based on feature separation attention, the L2 distillation loss is calculated as:

$$L_d = \delta\, L_2(A^s \cdot J_f,\ A^t \cdot J_f) + \varepsilon\, L_2(A^s \cdot J_b,\ A^t \cdot J_b)$$

where δ and ε are parameters controlling the relative weights of the foreground attention mask loss and the background attention mask loss, $A^s$ and $A^t$ represent the feature maps of the student network and the teacher network respectively, and $J_f$ and $J_b$ represent the foreground attention mask and the background attention mask respectively; the L2 loss function computes the Euclidean distance between two vectors X and Y:

$$L_2(X, Y) = \sqrt{\sum_{i=1}^{n}\left(x_i - y_i\right)^2}$$

where $x_i$ and $y_i$ represent the i-th components of the n-dimensional vectors X and Y, respectively.
According to the above technical solution, compared with the prior art, the invention provides a remote sensing target detection knowledge distillation method based on feature separation attention. First, the foreground and background regions of the feature map are separated and their attention maps are extracted separately. Feature separation uses masks: the coordinates of the region where each target is located are extracted from the annotation information and mapped onto the feature map to be distilled according to its resolution. For the foreground, a foreground mask is used, i.e. only the regions where targets are located are considered; for the background, a background mask is used, i.e. only the parts outside the target regions are considered. Spatial attention and channel attention are then extracted for the foreground and background regions respectively. Because the teacher network is stronger, its output responds more strongly than the student network's in foreground regions containing targets, and distilling the teacher's foreground response improves the student network's ability to judge foreground targets; the foreground attention mask is therefore calculated from the feature attention map of the teacher network. The background, by contrast, may contain erroneous responses: since the feature extraction capability of the student network is relatively weak, erroneous responses in the background region of its output feature map are more prominent than in the teacher's, and distilling the background region of the student network reduces its misjudgment of the background; the background attention mask is therefore calculated from the feature attention map of the student network.
Using the separated foreground and background attention masks to calculate the L2 loss separately, the knowledge of the teacher network can be migrated to the student network. In general, the invention fuses the information of the feature map into attention maps by way of spatial attention and channel attention, avoiding mutual interference among channels, and distills missed detections of targets in the foreground region and false detections in the background region separately through the foreground mask and the background mask, thereby greatly improving the detection effect of the lightweight model.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a flow chart of a distillation method for remote sensing target detection knowledge based on feature separation attention, provided by the invention;
FIG. 2 is a schematic diagram of the structure of a knowledge distillation model provided by the present invention;
FIG. 3 is a schematic diagram of a foreground attention mask acquisition process provided by the present invention;
FIG. 4 is a schematic diagram of the background attention mask acquisition process provided by the present invention;
FIG. 5 is a schematic diagram of distillation based on feature separation attention provided by the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in FIG. 1, the embodiment of the invention discloses a remote sensing target detection knowledge distillation method based on feature separation attention, which comprises the following steps:
S1, extracting feature attention maps from the feature maps output by the teacher network and the student network, respectively;
S2, separating the foreground region and the background region of the feature map, calculating a foreground attention mask through the feature attention map of the teacher network, and calculating a background attention mask through the feature attention map of the student network;
S3, calculating the L2 distillation loss using the foreground attention mask and the background attention mask, and migrating the knowledge of the teacher network to the student network based on the L2 distillation loss.
The overall structure of the remote sensing target detection knowledge distillation method based on feature separation attention is shown in figure 2, and the model is divided into a teacher network and a student network, wherein the teacher network is a high-performance complex neural network, and the student network is a light-weight simple neural network. Firstly, the teacher network is pre-trained to be converged and have high detection performance. And then, in the process of training the student network, the output of the teacher network is used as additional supervision information to train the student network, and the knowledge in the characteristic diagram is migrated, so that the training effect of the student network is improved.
The embodiment of the invention performs knowledge distillation on the feature maps of a convolutional neural network. Because the dimensionality of the feature map is high, directly computing the L2 loss on the raw feature map easily causes mutual interference among channels and harms the distillation effect. On the other hand, the background region ignored by most target detection knowledge distillation methods may contain useful information, and effectively learning the background region can improve the model's ability to distinguish positive and negative samples. Distilling the foreground reduces the student network's missed detection rate, i.e. strengthens the model's ability to recognize positive samples; distilling the background reduces the student network's false detection rate, i.e. strengthens the model's ability to reject negative samples. Therefore, the invention proposes a knowledge distillation approach based on feature separation attention, combining separate distillation of the foreground and background with feature map attention.
The above steps are further described below.
S1, extracting feature attention maps from the feature maps output by the teacher network and the student network, respectively.
The characteristic attention is divided into spatial attention and channel attention. Spatial attention refers to the dimensionality reduction of features along the channel dimension, represented by only one value at each pixel point. This attention is reflected in the degree of importance of the pixels in the feature map, with points with stronger responses indicating a greater probability of the presence of an object. Channel attention refers to the dimensionality reduction of the feature along the length-width dimension, with each value reflecting the response of one channel. Because the information contained in each channel is uneven when the network extracts the features, the attention reflects the importance degree of each channel in the feature map, and the network can be more focused on the channels rich in effective information through the attention of the channels.
Since no label information is available for training an attention module during knowledge distillation, the invention adopts a simple hand-designed form, namely taking the absolute-value average as the attention map. For the feature map $A^t$ of the teacher network and the feature map $A^s$ of the student network, the spatial attention map and the channel attention map are calculated by the following formulas.

The spatial attention is calculated as:

$$G_s(A)_{i,j} = \frac{1}{C}\sum_{k=1}^{C}\left|A_{k,i,j}\right|$$

The channel attention is calculated as:

$$G_c(A)_{k} = \frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W}\left|A_{k,i,j}\right|$$

wherein $G_s$ represents the spatial attention, characterizing the degree of importance of each pixel position across the channels; $G_c$ represents the channel attention, characterizing the degree of importance of each channel of the feature map; H, W and C represent the height, width and number of channels of the feature map, respectively; and $A_{k,i,j}$ represents the value of feature map A at the (i, j) coordinate position of its k-th channel.
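As a brief illustration, the absolute-value-average attention described above can be sketched in NumPy as follows; the array shapes and function names are illustrative and not part of the patent:

```python
import numpy as np

def spatial_attention(feature_map):
    # G_s: mean absolute value across the channel axis -> one value per pixel
    # feature_map has shape (C, H, W)
    return np.abs(feature_map).mean(axis=0)  # shape (H, W)

def channel_attention(feature_map):
    # G_c: mean absolute value over the spatial axes -> one value per channel
    return np.abs(feature_map).mean(axis=(1, 2))  # shape (C,)

A = np.random.randn(8, 4, 4)  # toy feature map with C=8, H=W=4
Gs = spatial_attention(A)
Gc = channel_attention(A)
```

A point (i, j) with a larger value of `Gs` indicates a higher probability of an object at that position; a larger `Gc[k]` indicates a more informative channel k.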
S2, separating the foreground region and the background region of the feature map, calculating a foreground attention mask through the feature attention map of the teacher network, and calculating a background attention mask through the feature attention map of the student network.
As shown in fig. 3-4, knowledge distillation based on feature separation attention requires that a spatial and channel attention map is obtained by absolute value averaging, feature separation is performed on the feature attention map through a foreground mask and a background mask, and distillation losses of the foreground attention mask and the background attention mask are calculated respectively. For the foreground attention mask, multiplying the space of the teacher model and the channel attention and taking a foreground part by using the foreground mask; for the background attention mask, the spatial and channel attention of the student model are multiplied and the background portion is taken using the background mask. Subsequently, the L2 distillation loss function was calculated on the foreground and background attention masks, respectively.
Because the teacher network has strong performance, its output obtains stronger responses than the student network's in foreground regions containing targets, and distilling the teacher network's foreground response can improve the student network's ability to judge foreground targets; the foreground attention mask is therefore calculated through the feature map of the teacher network. Specifically:
1. The foreground attention mask is obtained by the following steps:
1) calculating the spatial attention and channel attention of the feature map $A^t$ output by the teacher network;
2) performing probability normalization on the spatial attention and the channel attention of $A^t$, respectively, using the Softmax function;
3) multiplying the probability-normalized spatial attention and channel attention of $A^t$ by the labeled foreground mask $S_f$ to obtain the weighted foreground attention mask $J_f$.
The specific calculation formula is as follows:
$$\hat{G}_s(A^t) = \mathrm{softmax}\!\left(G_s(A^t)\right)$$

$$\hat{G}_c(A^t) = \mathrm{softmax}\!\left(G_c(A^t)\right)$$

$$J_f = S_f \cdot \hat{G}_s(A^t) \cdot \hat{G}_c(A^t)$$

wherein the probability normalization function

$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}}$$

normalizes the variable z into probabilities in (0, 1) whose sum over all entries is 1; H, W and C represent the height, width and number of channels of the feature map $A^t$, respectively; $\hat{G}_s(A^t)$ and $\hat{G}_c(A^t)$ represent the probability-normalized spatial attention and channel attention of the teacher network, respectively; and $S_f$ is the foreground mask, which takes the value 1 in the foreground region and 0 in the background region.
In order to balance the influence of targets of different sizes on the loss function, the foreground mask needs to be normalized according to the size of the target, and the normalized foreground mask $\hat{S}_f$ is used in place of $S_f$. The normalized foreground mask is calculated as:

$$\hat{S}_{f,i,j} = \begin{cases} \dfrac{1}{s_t}, & (i,j) \in t,\ t \in T \\ 0, & \text{otherwise} \end{cases}$$

where T represents the set of all targets, t represents each target, and $s_t$ represents the area of the region of target t; this ensures that targets of different sizes contribute equally to the loss function.
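The foreground steps above can be sketched as follows. This is a minimal NumPy illustration: the box format `(y0, y1, x0, x1)` is assumed to be given in feature-map coordinates, and the Softmax is applied over all spatial positions for the spatial attention (an assumption, since the patent text does not state the normalization axis):

```python
import numpy as np

def softmax(z):
    # Probability normalization: values in (0, 1) summing to 1
    e = np.exp(z - z.max())
    return e / e.sum()

def foreground_attention_mask(A_t, targets):
    # A_t     : teacher feature map, shape (C, H, W)
    # targets : list of (y0, y1, x0, x1) foreground boxes in feature-map coords
    C, H, W = A_t.shape
    # Probability-normalized spatial and channel attention
    G_s = softmax(np.abs(A_t).mean(axis=0).ravel()).reshape(H, W)
    G_c = softmax(np.abs(A_t).mean(axis=(1, 2)))
    # Normalized foreground mask: 1/s_t inside each target, 0 elsewhere
    S_f = np.zeros((H, W))
    for (y0, y1, x0, x1) in targets:
        S_f[y0:y1, x0:x1] = 1.0 / ((y1 - y0) * (x1 - x0))
    # Broadcast to (C, H, W): mask times spatial term times channel term
    return S_f[None] * G_s[None] * G_c[:, None, None]

A_t = np.random.randn(4, 8, 8)                       # toy teacher feature map
J_f = foreground_attention_mask(A_t, [(1, 3, 2, 5)])  # one target box
```

The resulting `J_f` is zero everywhere outside the target boxes, so only foreground positions contribute when it is multiplied into the distillation loss.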
2. Some wrong responses may exist in the background, the feature extraction capability of the student network is relatively weak, compared with the teacher network, the wrong responses in the background area of the student network output feature map are more prominent, and the misjudgment of the student network on the background area can be reduced by training the background area of the student network through distillation. Similar to the calculation method of the foreground attention mask, the background attention mask is obtained as follows:
1) calculating the spatial attention and channel attention of the feature map $A^s$ output by the student network;
2) performing probability normalization on the spatial attention and the channel attention of $A^s$, respectively, using the Softmax function;
3) multiplying the probability-normalized spatial attention and channel attention of $A^s$ by the labeled background mask $S_b$ to obtain the weighted background attention mask $J_b$.
The specific calculation formula is as follows:
$$\hat{G}_s(A^s) = \mathrm{softmax}\!\left(G_s(A^s)\right)$$

$$\hat{G}_c(A^s) = \mathrm{softmax}\!\left(G_c(A^s)\right)$$

$$J_b = S_b \cdot \hat{G}_s(A^s) \cdot \hat{G}_c(A^s)$$

wherein H, W and C represent the height, width and number of channels of the feature map $A^s$, respectively; $\hat{G}_s(A^s)$ and $\hat{G}_c(A^s)$ represent the probability-normalized spatial attention and channel attention of the student network, respectively; and $S_b$ is the background mask, which takes the value 0 in the foreground region and 1 in the background region.
Similarly, to balance its influence, the background mask is also weighted by the area of the background region, and the normalized background mask $\hat{S}_b$ is used in place of $S_b$. The normalized background mask is calculated as:

$$\hat{S}_{b,i,j} = \frac{S_{b,i,j}}{HW - \sum_{t \in T} s_t}$$

where T represents the set of all targets, t represents each target, and $s_t$ represents the area of the region of target t, so that the denominator is the area of the background region. With the weighted background mask, the contribution of the background region to the loss function is made consistent with that of the foreground region.
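A corresponding sketch for the background side, mirroring the foreground computation but built from the student feature map. Dividing by the number of background pixels is an assumed concrete form of the area weighting described above:

```python
import numpy as np

def background_attention_mask(A_s, targets):
    # A_s     : student feature map, shape (C, H, W)
    # targets : list of (y0, y1, x0, x1) foreground boxes in feature-map coords
    C, H, W = A_s.shape
    # Probability-normalized spatial attention (softmax over all positions)
    G_s = np.abs(A_s).mean(axis=0).ravel()
    G_s = np.exp(G_s - G_s.max()); G_s /= G_s.sum()
    G_s = G_s.reshape(H, W)
    # Probability-normalized channel attention (softmax over channels)
    G_c = np.abs(A_s).mean(axis=(1, 2))
    G_c = np.exp(G_c - G_c.max()); G_c /= G_c.sum()
    # Background mask: 1 in background, 0 in foreground, weighted by bg area
    S_b = np.ones((H, W))
    for (y0, y1, x0, x1) in targets:
        S_b[y0:y1, x0:x1] = 0.0
    S_b /= max(S_b.sum(), 1.0)
    return S_b[None] * G_s[None] * G_c[:, None, None]

A_s = np.random.randn(4, 8, 8)                        # toy student feature map
J_b = background_attention_mask(A_s, [(1, 3, 2, 5)])  # one target box
```

Here `J_b` is zero inside the target boxes, so only background positions contribute to the background term of the distillation loss.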
S3, design of the L2 distillation loss function.
As shown in fig. 5, after the foreground attention mask and the background attention mask are obtained, they are multiplied with the output feature maps of the teacher and student networks respectively to calculate the distillation loss; the loss function is the L2 function:

$$L_d = \delta\, L_2(A^s \cdot J_f,\ A^t \cdot J_f) + \varepsilon\, L_2(A^s \cdot J_b,\ A^t \cdot J_b)$$

where δ and ε are parameters controlling the relative weights of the foreground and background loss terms, $A^s$ and $A^t$ represent the feature maps of the student model and the teacher model respectively, and $J_f$ and $J_b$ represent the foreground attention mask and the background attention mask respectively.

The L2 loss function computes the Euclidean distance between two vectors X and Y:

$$L_2(X, Y) = \sqrt{\sum_{i=1}^{n}\left(x_i - y_i\right)^2}$$

where $x_i$ and $y_i$ represent the i-th components of the n-dimensional vectors X and Y, respectively.
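A minimal sketch of the loss computation above, assuming δ = ε = 1 and using random placeholder masks purely for shape illustration:

```python
import numpy as np

def l2_distance(x, y):
    # Euclidean distance between two tensors, treated as flattened vectors
    return np.sqrt(((x - y) ** 2).sum())

def distillation_loss(A_s, A_t, J_f, J_b, delta=1.0, eps=1.0):
    # L_d = delta * L2(A_s*J_f, A_t*J_f) + eps * L2(A_s*J_b, A_t*J_b)
    return (delta * l2_distance(A_s * J_f, A_t * J_f)
            + eps * l2_distance(A_s * J_b, A_t * J_b))

A_t = np.random.randn(4, 8, 8)               # teacher feature map
A_s = A_t + 0.1 * np.random.randn(4, 8, 8)   # student feature map, perturbed
J_f = np.random.rand(4, 8, 8)                # placeholder foreground mask
J_b = np.random.rand(4, 8, 8)                # placeholder background mask
loss = distillation_loss(A_s, A_t, J_f, J_b)
```

During training of the student network, this loss would be added to the ordinary detection loss, pulling the student's masked feature responses toward the teacher's.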
Through the distillation loss function, the knowledge of the complex model can be transferred to the lightweight model, so that the detection performance of the lightweight model is improved. In practical application, a complex model is firstly trained aiming at a target task, then the pre-trained complex model is used as additional supervision information, and an additional knowledge distillation loss function based on feature separation attention is added when a lightweight model is trained. After the training convergence, the obtained lightweight remote sensing target detection model has improved detection performance compared with a model without distillation loss.
For the lightweight target detection task, the invention fuses the information of the feature map into attention maps by way of spatial attention and channel attention, thereby avoiding mutual interference among channels. Missed detections of targets in the foreground region and false detections in the background region are distilled separately through the foreground mask and the background mask, greatly improving the detection effect of the lightweight model.
The embodiments in this description are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be cross-referenced. Since the device disclosed in an embodiment corresponds to the method disclosed therein, its description is kept brief, and the relevant points can be found in the description of the method.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (9)

1. A remote sensing target detection knowledge distillation method based on feature separation attention is characterized by comprising the following steps:
respectively extracting feature attention maps from the feature maps output by a teacher network and a student network;
separating a foreground region and a background region of the feature map, calculating a foreground attention mask from the feature attention map of the teacher network, and calculating a background attention mask from the feature attention map of the student network;
calculating an L2 distillation loss using the foreground attention mask and the background attention mask; and
migrating the knowledge of the teacher network to the student network based on the L2 distillation loss.
2. The feature-separation-attention-based remote sensing target detection knowledge distillation method according to claim 1, wherein the feature attention map comprises a spatial attention and a channel attention, and the spatial attention is calculated as:
G_s(A)_{i,j} = (1/C) · Σ_{k=1}^{C} |A_{k,i,j}|,
the channel attention is calculated as:
G_c(A)_k = (1/(H·W)) · Σ_{i=1}^{H} Σ_{j=1}^{W} |A_{k,i,j}|,
where G_s denotes the spatial attention, characterizing the importance of each pixel position within each channel, and G_c denotes the channel attention, characterizing the importance of each channel of the feature map; H, W and C denote the height, width and number of channels of the feature map, respectively; and A_{k,i,j} denotes the pixel value of feature map A at coordinate position (i, j) in its k-th channel.
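The original equation images are not reproduced in this text; consistent with the symbol definitions above, the standard mean-absolute-value form of the two attentions can be sketched as follows (the exact form is an assumption):

```python
import numpy as np

def spatial_attention(A):
    # G_s(A)_{i,j} = (1/C) * sum_k |A_{k,i,j}| : an H x W importance map,
    # averaging absolute activations over the C channels.
    return np.abs(A).mean(axis=0)

def channel_attention(A):
    # G_c(A)_k = (1/(H*W)) * sum_{i,j} |A_{k,i,j}| : one importance score
    # per channel, averaging absolute activations over spatial positions.
    return np.abs(A).mean(axis=(1, 2))
```

Both reductions start from the same C x H x W feature map; only the axis over which the absolute values are averaged differs.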
3. The feature-separation-attention-based remote sensing target detection knowledge distillation method according to claim 1, wherein calculating the foreground attention mask from the feature attention map of the teacher network comprises:
calculating the spatial attention and the channel attention of the feature map A_t output by the teacher network;
performing probability normalization on the spatial attention and the channel attention of the feature map A_t, respectively, using a Softmax function;
multiplying the probability-normalized spatial attention and channel attention of the feature map A_t by the labeled foreground mask S_f to obtain the weighted foreground attention mask J_f.
4. The feature-separation-attention-based remote sensing target detection knowledge distillation method according to claim 3, wherein the foreground attention mask is calculated as:
Ĝ_s^t = φ(G_s(A_t)),
Ĝ_c^t = φ(G_c(A_t)),
J_f = Ĝ_s^t · Ĝ_c^t · S_f,
where the probability normalization function φ(z)_i = e^{z_i} / Σ_j e^{z_j} maps each component z_i of the variable z to a probability in (0, 1) such that all probabilities sum to 1; H, W and C denote the height, width and number of channels of the feature map A_t, respectively; G_s(A_t) and G_c(A_t) denote the spatial attention and the channel attention of the teacher network, respectively; and S_f is the foreground mask, which takes the value 1 in foreground regions and 0 in background regions.
5. The feature-separation-attention-based remote sensing target detection knowledge distillation method according to claim 4, further comprising: normalizing the foreground mask S_f according to target size to obtain a normalized foreground mask Ŝ_f, and replacing the foreground mask S_f with the normalized foreground mask Ŝ_f when calculating the foreground attention mask; the normalized foreground mask Ŝ_f is calculated as:
Ŝ_f(i, j) = 1/s_t if pixel (i, j) belongs to target t ∈ T, and 0 otherwise,
where T denotes the set of all targets, t denotes an individual target, and s_t denotes the area of the region of target t.
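The formula image is not reproduced here; a per-target area normalization consistent with the stated symbols (t, s_t) can be sketched as follows, with hypothetical axis-aligned half-open boxes standing in for the labeled target regions:

```python
import numpy as np

def normalized_foreground_mask(shape, boxes):
    # S_f_hat(i, j) = 1 / s_t inside the region of target t (area s_t),
    # 0 elsewhere, so that small and large targets contribute comparably
    # to the distillation loss.
    H, W = shape
    S = np.zeros((H, W))
    for r0, c0, r1, c1 in boxes:             # half-open pixel boxes
        area = (r1 - r0) * (c1 - c0)
        S[r0:r1, c0:c1] = 1.0 / area
    return S
```

With this scaling, each target region sums to 1 in the mask regardless of its size, so small remote sensing targets are not drowned out by large ones.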
6. The feature-separation-attention-based remote sensing target detection knowledge distillation method according to claim 1, wherein calculating the background attention mask from the feature attention map of the student network comprises:
calculating the spatial attention and the channel attention of the feature map A_s output by the student network;
performing probability normalization on the spatial attention and the channel attention of the feature map A_s, respectively, using a Softmax function;
multiplying the probability-normalized spatial attention and channel attention of the feature map A_s by the labeled background mask S_b to obtain the weighted background attention mask J_b.
7. The feature-separation-attention-based remote sensing target detection knowledge distillation method according to claim 6, wherein the background attention mask J_b is calculated as:
Ĝ_s^s = φ(G_s(A_s)),
Ĝ_c^s = φ(G_c(A_s)),
J_b = Ĝ_s^s · Ĝ_c^s · S_b,
where H, W and C denote the height, width and number of channels of the feature map A_s, respectively; G_s(A_s) and G_c(A_s) denote the spatial attention and the channel attention of the student network, respectively; and S_b is the background mask, which takes the value 0 in foreground regions and 1 in background regions.
8. The feature-separation-attention-based remote sensing target detection knowledge distillation method according to claim 7, further comprising: normalizing the background mask S_b according to target size to obtain a normalized background mask Ŝ_b, and replacing the background mask S_b with the normalized background mask Ŝ_b when calculating the background attention mask; the normalized background mask Ŝ_b is calculated as:
Ŝ_b(i, j) = 1/(H·W − Σ_{t∈T} s_t) if pixel (i, j) belongs to the background, and 0 otherwise,
where T denotes the set of all targets, t denotes an individual target, and s_t denotes the area of the region of target t.
9. The feature-separation-attention-based remote sensing target detection knowledge distillation method according to claim 1, wherein the L2 distillation loss is calculated as:
L_d = δ·L2(A_s·J_f, A_t·J_f) + ε·L2(A_s·J_b, A_t·J_b),
where δ and ε are parameters controlling the relative weights of the foreground and background attention mask loss terms, A_s and A_t denote the feature maps of the student and teacher networks, respectively, and J_f and J_b denote the foreground attention mask and the background attention mask, respectively; the L2 loss function computes the Euclidean distance between two vectors X and Y, as follows:
L2(X, Y) = sqrt(Σ_{i=1}^{n} (x_i − y_i)²),
where x_i and y_i denote the i-th components of the n-dimensional vectors X and Y, respectively.
CN202210194931.4A 2022-03-01 2022-03-01 Remote sensing target detection knowledge distillation method based on feature separation attention Pending CN114565045A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210194931.4A CN114565045A (en) 2022-03-01 2022-03-01 Remote sensing target detection knowledge distillation method based on feature separation attention


Publications (1)

Publication Number Publication Date
CN114565045A (en) 2022-05-31

Family

ID=81714945



Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114882324A (en) * 2022-07-11 2022-08-09 浙江大华技术股份有限公司 Target detection model training method, device and computer readable storage medium
CN115019060A (en) * 2022-07-12 2022-09-06 北京百度网讯科技有限公司 Target recognition method, and training method and device of target recognition model
WO2024051686A1 (en) * 2022-09-05 2024-03-14 东声(苏州)智能科技有限公司 Compression and training method and apparatus for defect detection model
CN116664840A (en) * 2023-05-31 2023-08-29 博衍科技(珠海)有限公司 Semantic segmentation method, device and equipment based on mutual relationship knowledge distillation
CN116664840B (en) * 2023-05-31 2024-02-13 博衍科技(珠海)有限公司 Semantic segmentation method, device and equipment based on mutual relationship knowledge distillation
CN116778300A (en) * 2023-06-25 2023-09-19 北京数美时代科技有限公司 Knowledge distillation-based small target detection method, system and storage medium
CN116778300B (en) * 2023-06-25 2023-12-05 北京数美时代科技有限公司 Knowledge distillation-based small target detection method, system and storage medium
CN116994068A (en) * 2023-09-19 2023-11-03 湖北省长投智慧停车有限公司 Target detection method and device based on knowledge distillation
CN117974988A (en) * 2024-03-28 2024-05-03 南京邮电大学 Lightweight target detection method, lightweight target detection device and computer program product
CN117974988B (en) * 2024-03-28 2024-05-31 南京邮电大学 Lightweight target detection method, lightweight target detection device and computer program product


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination