CN112766411A - Target detection knowledge distillation method for adaptive regional refinement - Google Patents


Info

Publication number
CN112766411A
CN112766411A
Authority
CN
China
Legal status: Granted
Application number
CN202110145846.4A
Other languages
Chinese (zh)
Other versions
CN112766411B (en)
Inventor
褚晶辉
史李栋
吕卫
Current Assignee
Tianjin University
Original Assignee
Tianjin University
Priority date
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202110145846.4A priority Critical patent/CN112766411B/en
Publication of CN112766411A publication Critical patent/CN112766411A/en
Application granted granted Critical
Publication of CN112766411B publication Critical patent/CN112766411B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F18/214 — Pattern recognition: generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06N3/045 — Neural networks: combinations of networks
    • G06N3/048 — Neural networks: activation functions
    • G06N3/08 — Neural networks: learning methods
    • G06V2201/07 — Image or video recognition: target detection


Abstract

The invention provides a target detection knowledge distillation method with adaptive region refinement, comprising the following steps: preparing data; building the network structure, which comprises: first, feeding the input training-set images into a teacher network and a student network, and extracting all prediction anchor boxes produced by the student network, the feature map F_T output after the image passes through the backbone of the teacher network, the feature map F_S′ output after the image passes through the backbone of the student network, and the feature map F_S″ output by the adjacent convolutional layer; second, sending the prediction anchor boxes and the feature maps F_T and F_S′ to an adaptive refinement module, which, under the joint supervision of the ground-truth labels and the teacher network, obtains a fine-grained region M in the feature map that is strongly associated with the detected targets; third, obtaining the weights of the different channels in the feature map; fourth, obtaining an optimal lightweight target detection model through continuous feedback optimization; and training the model.

Description

Target detection knowledge distillation method for adaptive regional refinement
Technical Field
The invention relates to the field of image processing, in particular to object detection for mobile devices with low computational power.
Background
Owing to its wide practical and theoretical value, neural-network-based target detection has become a research hotspot in computer vision and image processing. Neural-network-based target detection refers to detecting targets of various classes in general everyday images by applying related algorithms with a neural network model. Thanks to its high detection accuracy and computation speed, it is widely applied in fields such as autonomous driving, video surveillance, and aerospace.
Although newer, more complex, and deeper networks achieve better detection results, the gains come at a cost: the model parameters and computation of a neural network such as Faster R-CNN are huge, making it difficult to deploy on in-vehicle devices with very limited storage and compute [1]. How to deploy existing neural networks on mobile devices with low computing power, such as in-vehicle devices, has therefore become an urgent problem. If the neural network can be simplified, reducing its parameters and computation while preserving its detection performance as much as possible, a target detection model suitable for low-compute devices is obtained. Detection accuracy and model size must be balanced against each other: deeper neural networks generally yield more accurate models but carry large numbers of parameters and complex operations, while shallow networks have few parameters and run fast but detect less accurately.
A simple approach is channel pruning, which prunes unimportant channels of a complex network model to reduce its parameters as much as possible and accelerate inference [2]. Channel pruning is fast and easy to implement, and the number of model parameters drops markedly; however, redundant channels cannot be identified very accurately, so the impact on accuracy is pronounced and the precision of the pruned model often degrades severely. Another approach is knowledge distillation, which lets a lightweight student network mimic the behavior of a complex, high-performance teacher network to gain performance [3]. Knowledge distillation transfers the accuracy of a complex network to a simple one well, so that even a simple model can infer with higher precision. In research on lightweight neural networks, hint learning for deep networks with knowledge distillation (FitNets: Hints for Thin Deep Nets) has shown excellent results in recent years; the improvement in the detection precision of simple neural network models is marked, and it has attracted great interest [4].
These earlier lightweighting methods still have limitations. Channel pruning reduces the number of model parameters but detection precision drops sharply; knowledge distillation is easy to implement and retains precision well, yet most distillation work is devoted to simple tasks such as classification, and the more complex detection task has received little attention. Compared with classification, detection adds localization, which is prior work to classification: unreliable localization renders subsequent classification meaningless. Transferring the teacher network's knowledge in the accurately localized spatial and frequency domains to the student network addresses these problems well.
Reference documents:
[1] Ren S., He K., Girshick R., et al. "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks." IEEE Transactions on Pattern Analysis & Machine Intelligence, 2017, 39(6): 1137–1149.
[2] Zhou Z., Zhou W., Hong R., et al. "Online Filter Weakening and Pruning for Efficient Convnets." 2018 IEEE International Conference on Multimedia and Expo (ICME), IEEE Computer Society, 2018.
[3] Hinton G., Vinyals O., Dean J. "Distilling the Knowledge in a Neural Network." arXiv preprint arXiv:1503.02531, 2015.
[4] Romero A., Ballas N., Kahou S. E., et al. "FitNets: Hints for Thin Deep Nets." arXiv preprint arXiv:1412.6550, 2014.
[5] Mao J., Xiao T., Jiang Y., et al. "What Can Help Pedestrian Detection?" IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
Disclosure of the Invention
The invention provides a knowledge distillation algorithm suited to target detection. Using an adaptive region refinement module and a self-attention module, it transfers the knowledge of a complex teacher network to a simple student network in a fine-grained spatial domain and an accurate frequency domain, and further improves the original knowledge distillation method for the characteristics of the target detection task. The technical scheme is as follows:
an adaptive zone refining target detection knowledge distillation method comprises the following steps:
(1) Data preparation: the data set is divided into a training set and a test set.
(2) Building the network structure, comprising the following steps:
First: feed the input training-set images into the teacher network and the student network respectively; extract all N prediction anchor boxes (Anchors) produced by the student network, the feature map F_T output after the image passes through the backbone of the teacher network, the feature map F_S′ output after the image passes through the backbone of the student network, and the feature map F_S″ output by the convolutional layer adjacent to it;
Second: send the obtained prediction anchor boxes and the feature maps F_T and F_S′ from the teacher and student networks to the adaptive refinement module; under the joint supervision of the ground-truth labels and the teacher network, a fine-grained region M strongly associated with the detected targets is obtained from the feature map. The adaptive refinement module works as follows: first, compute the IoU between each prediction anchor box and the ground-truth labels, and obtain by screening the anchor boxes A′ strongly associated with the labels; then copy the position information of A′ onto the teacher feature map F_T and set the non-anchor positions of F_T to 0; exploiting the property that high-level feature maps approach gray-level maps, subtract the feature map F_T itself from the maximum value at the anchor positions of F_T to obtain a heat map; finally, after filtering, obtain the fine-grained region M strongly associated with the detected targets;
Third: send the feature maps F_S′ and F_S″, obtained from different convolutional layers as the image passes through the student backbone, to the self-attention module to obtain the weights of the different channels in the feature map, representing their importance in the overall feature map. The self-attention module works as follows: compute the squared-error difference m_dif between F_S′ and F_S″; here both feature maps are of size C × W × H, where C is the number of channels and W and H are the spatial dimensions of the feature map. After obtaining m_dif, apply global average pooling to turn it into a tensor of size C × 1 × 1, then activate and normalize it with a nonlinear operation to obtain a channel attention weight vector B of size C × 1, namely:

B = {β_1, β_2, …, β_c, …, β_C} = R(Avg((F_S′ − F_S″)²)), c ∈ [0, C)
where R is the ReLU activation and Avg is the global average pooling operation; after these operations the channel attention weight vector B is obtained, in which β_c, the value of the c-th dimension of B, is the weight corresponding to the c-th channel;
Fourth: send the feature maps F_T and F_S′ obtained above, the fine-grained region M, and the attention weight vector B of the different channels into the fusion module, so that the feature map generated by the student network, combined with the importance information of the different channels in the fine-grained region, imitates the feature map generated by the teacher network, realizing knowledge distillation suited to target detection. The loss function designed in the fusion module is expressed as:
[Loss-function formula, rendered only as an image (BDA0002930222170000031) in the original document]
where N_CA denotes the number of pixels that the fine-grained region M occupies in the feature map F_T;
Through continuous feedback optimization, an optimal lightweight target detection model is obtained;
(3) Model training.
The invention has the following beneficial effects:
1. Knowledge distillation is introduced into target detection on general data sets, making it easier to lighten complex, high-precision target detection neural network models;
2. In knowledge distillation suited to target detection, the student and teacher networks jointly and adaptively select the spatial region for distillation from the feature map, and the selection is confined to an approximate range by the prior knowledge of the ground-truth labels, so the adaptively selected region cannot expand to places far from and irrelevant to the detected targets. Notably, the knowledge distillation spatial region selected in the invention is free from the limits of the RPN output while retaining the prior knowledge carried by the ground-truth labels, which strengthens the supervision of fine-grained region selection;
3. The proposed self-attention mechanism generates the importance information of each feature channel directly from existing network resources; it is a lighter gating mechanism realized on target detection and provides significant assistance to knowledge distillation suited to target detection;
4. A new loss function is added; experiments show that it combines the performance of the adaptive region refinement module and the self-attention module well, and brings out the superior performance of knowledge distillation for network lightweighting on target detection.
Drawings
Fig. 1 is the overall network structure diagram of the proposed knowledge distillation algorithm for target detection.
Detailed Description
The invention provides a knowledge distillation method suited to target detection. Using an adaptive region refinement module and a self-attention module, it transfers the knowledge of a complex teacher network to a simple student network in a fine-grained spatial domain and an accurate frequency domain, and the improvement in the detection performance of the simple network model is clearly superior to existing lightweight methods. Notably, in the experiments the student network reaches almost 98.3% of the teacher network's detection precision on a general traffic detection data set, and also performs excellently on other general data sets. Meanwhile, the invention applies broadly to current mainstream target detection networks, which is highly advantageous for quickly lightweighting existing high-precision models. The embodiments are described in further detail below with reference to the accompanying drawings:
(1) Data preparation:
The data source is the KITTI data set, created jointly by the Karlsruhe Institute of Technology in Germany and the Toyota Technological Institute at Chicago; it is currently the largest international evaluation data set for computer vision algorithms in autonomous driving scenarios. It contains real image data collected in urban, rural, highway, and other scenes, with up to 15 vehicles and 30 pedestrians per image and various degrees of occlusion and truncation. Following a common data set segmentation method, the data set is split into a training set and a test set [5].
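As an illustration, the train/test split described above can be sketched as follows; the 50/50 ratio, the seed, and the function name are assumptions made for this example, not the exact protocol of reference [5]:

```python
import random

def split_dataset(image_ids, train_ratio=0.5, seed=0):
    """Split image ids into train/test subsets.
    train_ratio and seed are illustrative assumptions."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)       # deterministic shuffle
    cut = int(len(ids) * train_ratio)
    return ids[:cut], ids[cut:]
```

Any per-image identifier list (e.g. KITTI frame indices) can be passed in; the two returned lists partition the input.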
(2) Network construction: the network structure of the invention mainly comprises the adaptive region refinement module, the self-attention module, and the fusion module; it is described in detail below with reference to Fig. 1.
(a) At each iteration, the input training-set images are fed into the teacher (complex) network and the student (lightweight) network respectively; all N prediction anchor boxes (Anchors) produced by the RPN (Region Proposal Network) of the student network are extracted, together with the feature map F_T obtained from the teacher network, the feature map F_S′ output after the image passes through the student backbone, and the feature map F_S″ output by the convolutional layer adjacent to it.
(b) The obtained prediction anchor boxes and the feature maps F_T and F_S′ from the teacher and student networks are sent to the adaptive refinement module; under the joint supervision of the ground-truth labels and the high-performance teacher network, a fine-grained region M strongly associated with the detected targets is obtained from the feature map. The adaptive refinement module operates as follows: first, the IoU between each prediction anchor box and the ground-truth labels is computed, and screening yields the anchor boxes A′ strongly associated with the labels; then the position information of A′ is copied onto the teacher feature map F_T and the non-anchor positions of F_T are set to 0; exploiting the property that high-level feature maps approach gray-level maps, the feature map is subtracted from the maximum value at the anchor positions of F_T to obtain a heat map; finally, after filtering, the fine-grained region M strongly associated with the detected targets is obtained.
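A minimal NumPy sketch of the adaptive refinement module in step (b); the IoU threshold, the filtering ratio, and the use of the channel mean as the "gray-level" view of the high-level feature map are illustrative assumptions, not the patent's exact procedure:

```python
import numpy as np

def iou(box, gts):
    """IoU of one box [x1, y1, x2, y2] against an array of ground-truth boxes."""
    x1 = np.maximum(box[0], gts[:, 0]); y1 = np.maximum(box[1], gts[:, 1])
    x2 = np.minimum(box[2], gts[:, 2]); y2 = np.minimum(box[3], gts[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_b = (box[2] - box[0]) * (box[3] - box[1])
    area_g = (gts[:, 2] - gts[:, 0]) * (gts[:, 3] - gts[:, 1])
    return inter / (area_b + area_g - inter)

def adaptive_region(anchors, gt_boxes, feat_t, iou_thr=0.5, keep_ratio=0.5):
    """Screen anchors by IoU with the ground truth, build a heat map on the
    teacher feature map (anchor-region maximum minus the map itself, with
    non-anchor positions at 0), then threshold it into the mask M."""
    gray = feat_t.mean(axis=0)                 # C x H x W -> H x W "gray-level" view
    heat = np.zeros_like(gray)
    for a in anchors:
        if iou(a, gt_boxes).max() >= iou_thr:  # keep anchors A' tied to the labels
            x1, y1, x2, y2 = a.astype(int)
            region = gray[y1:y2, x1:x2]
            heat[y1:y2, x1:x2] = region.max() - region
    if heat.max() <= 0:
        return heat > 0                        # empty mask
    return heat > keep_ratio * heat.max()      # filtering -> fine-grained region M
```

Anchors far from any ground-truth box contribute nothing to M, so distillation is confined to regions tied to the detected targets.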
(c) The feature maps F_S′ and F_S″, obtained from different convolutional layers of the student network, are sent to the self-attention module to obtain the weights of the different channels in the feature map, representing their importance in the overall feature map. The self-attention module operates as follows: the squared-error difference m_dif between F_S′ and F_S″ is computed; here both feature maps are of size C × W × H, where C is the number of channels and W and H are the spatial dimensions. After m_dif is obtained, global average pooling turns it into a tensor of size C × 1 × 1, and a nonlinear operation activates and normalizes it, yielding a channel attention weight vector B of size C × 1, namely:

B = {β_1, β_2, …, β_c, …, β_C} = R(Avg((F_S′ − F_S″)²)), c ∈ [0, C)    (1)
where R is the ReLU activation and Avg is the global average pooling operation; after these operations the channel attention weight vector B is obtained, in which β_c, the value of the c-th dimension of B, is the weight of the c-th channel. In this way the channel importance information B is obtained with almost no sacrifice in inference complexity.
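The self-attention module of step (c), i.e. Eq. (1), can be sketched as follows; normalizing by the sum of the pooled values is an assumption, since the text only says the vector is activated and normalized:

```python
import numpy as np

def channel_attention(fs1, fs2):
    """Eq. (1) as a sketch: squared difference m_dif of two adjacent student
    feature maps (C x W x H), global average pooling to C values, ReLU
    activation, then normalization into the channel weight vector B."""
    m_dif = (fs1 - fs2) ** 2            # squared-error difference
    pooled = m_dif.mean(axis=(1, 2))    # global average pooling -> C values
    b = np.maximum(pooled, 0.0)         # ReLU (pooled is already >= 0 here)
    s = b.sum()
    return b / s if s > 0 else b        # channel attention weight vector B
```

A channel whose two adjacent feature maps differ more receives a larger weight, reflecting its importance in the overall feature map.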
(d) Finally, the feature maps F_T and F_S′ obtained above, the fine-grained region M, and the channel weights B are sent to the fusion module, so that the feature map generated by the student network, combined with the importance information of the different channels in the fine-grained region, imitates the feature map generated by the teacher network, realizing knowledge distillation suited to target detection. The loss function designed in the fusion module can be expressed as:
[Loss-function formula (2), rendered only as an image (BDA0002930222170000041) in the original document]
where N_CA denotes the number of pixels occupied by the fine-grained region M in the feature map; the designed loss function thus combines the performance of the adaptive region refinement module and the self-attention module well.
(3) Model training: the learning rate is set to 3 × 10⁻³; the L2 norm is used as the loss function; the SGD optimization method is adopted, with a weight decay of 0.1 and a momentum of 0.9.
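The training hyper-parameters above can be illustrated with a single plain SGD update; folding the weight decay into the gradient is the classic L2 formulation and an illustrative choice here (frameworks may place it differently):

```python
import numpy as np

def sgd_step(w, grad, v, lr=3e-3, momentum=0.9, weight_decay=0.1):
    """One SGD update with the embodiment's hyper-parameters:
    learning rate 3e-3, momentum 0.9, weight decay 0.1."""
    g = grad + weight_decay * w     # L2 weight decay folded into the gradient
    v = momentum * v + g            # momentum buffer
    return w - lr * v, v
```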
(4) Evaluation metric: the invention adopts mean Average Precision (mAP) to measure the effect of the algorithm.
(5) In the experiments, with the detection results obtained by the proposed knowledge distillation algorithm for target detection, the student network's mAP almost matches the teacher network's mAP of 62.3%; the teacher's detection capability is nearly reproduced while the computation and parameter counts are reduced.

Claims (2)

1. A target detection knowledge distillation method with adaptive region refinement, comprising the following steps:
(1) Data preparation: the data set is divided into a training set and a test set.
(2) Building the network structure, comprising the following steps:
First: feed the input training-set images into the teacher network and the student network respectively; extract all N prediction anchor boxes (Anchors) produced by the student network, the feature map F_T output after the image passes through the backbone of the teacher network, the feature map F_S′ output after the image passes through the backbone of the student network, and the feature map F_S″ output by the convolutional layer adjacent to it;
Second: send the obtained prediction anchor boxes and the feature maps F_T and F_S′ from the teacher and student networks to the adaptive refinement module; under the joint supervision of the ground-truth labels and the teacher network, obtain a fine-grained region M strongly associated with the detected targets in the feature map. The adaptive refinement module works as follows: first, compute the IoU between each prediction anchor box and the ground-truth labels, and obtain by screening the anchor boxes A′ strongly associated with the labels; then copy the position information of A′ onto the teacher feature map F_T and set the non-anchor positions of F_T to 0; exploiting the property that high-level feature maps approach gray-level maps, subtract the feature map F_T itself from the maximum value at the anchor positions of F_T to obtain a heat map; finally, after filtering, obtain the fine-grained region M strongly associated with the detected targets;
Third: send the feature maps F_S′ and F_S″, obtained from different convolutional layers as the image passes through the student backbone, to the self-attention module to obtain the weights of the different channels in the feature map, representing their importance in the overall feature map. The self-attention module works as follows: compute the squared-error difference m_dif between F_S′ and F_S″; here both feature maps are of size C × W × H, where C is the number of channels and W and H are the spatial dimensions of the feature map. After obtaining m_dif, apply global average pooling to turn it into a tensor of size C × 1 × 1, then activate and normalize it with a nonlinear operation to obtain a channel attention weight vector B of size C × 1, namely:

B = {β_1, β_2, …, β_c, …, β_C} = R(Avg((F_S′ − F_S″)²)), c ∈ [0, C)
where R is the ReLU activation and Avg is the global average pooling operation; after these operations the channel attention weight vector B is obtained, in which β_c, the value of the c-th dimension of B, is the weight corresponding to the c-th channel;
Fourth: send the feature maps F_T and F_S′ obtained above, the fine-grained region M, and the attention weight vector B of the different channels into the fusion module, so that the feature map generated by the student network, combined with the importance information of the different channels in the fine-grained region, imitates the feature map generated by the teacher network, realizing knowledge distillation suited to target detection. The loss function designed in the fusion module is expressed as:
[Loss-function formula, rendered only as an image (FDA0002930222160000011) in the original document]
where N_CA denotes the number of pixels that the fine-grained region M occupies in the feature map F_T;
Through continuous feedback optimization, an optimal lightweight target detection model is obtained;
(3) Model training.
2. The method of claim 1, wherein the model training is performed as follows:
the first step is as follows: the learning rate is set to 3 × 10-3
The second step is that: using the L2 norm as a loss function;
the third step: by adopting the SGD optimization method, the weight attenuation rate is 0.1, and the momentum value is 0.9.
CN202110145846.4A 2021-02-02 2021-02-02 Target detection knowledge distillation method for adaptive regional refinement Active CN112766411B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110145846.4A CN112766411B (en) 2021-02-02 2021-02-02 Target detection knowledge distillation method for adaptive regional refinement


Publications (2)

Publication Number Publication Date
CN112766411A true CN112766411A (en) 2021-05-07
CN112766411B CN112766411B (en) 2022-09-09

Family

ID=75704716

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110145846.4A Active CN112766411B (en) 2021-02-02 2021-02-02 Target detection knowledge distillation method for adaptive regional refinement

Country Status (1)

Country Link
CN (1) CN112766411B (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111144490A (en) * 2019-12-26 2020-05-12 南京邮电大学 Fine granularity identification method based on alternative knowledge distillation strategy
CN111626330A (en) * 2020-04-23 2020-09-04 南京邮电大学 Target detection method and system based on multi-scale characteristic diagram reconstruction and knowledge distillation
US20200302295A1 (en) * 2019-03-22 2020-09-24 Royal Bank Of Canada System and method for knowledge distillation between neural networks
CN112164054A (en) * 2020-09-30 2021-01-01 交叉信息核心技术研究院(西安)有限公司 Knowledge distillation-based image target detection method and detector and training method thereof


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Tao Wang et al.: "Distilling Object Detectors with Fine-grained Feature Imitation", 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) *
Zaida Zhou et al.: "Channel Distillation: Channel-Wise Attention for Knowledge Distillation", arXiv:2006.01683v1 *
Chu Jinghui et al.: "Driving Behavior Recognition Method Based on a Teacher-Student Network", Laser & Optoelectronics Progress *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113592742A (en) * 2021-08-09 2021-11-02 天津大学 Method for removing image moire
CN113743514A (en) * 2021-09-08 2021-12-03 庆阳瑞华能源有限公司 Knowledge distillation-based target detection method and target detection terminal
CN116306868A (en) * 2023-03-01 2023-06-23 支付宝(杭州)信息技术有限公司 Model processing method, device and equipment
CN116306868B (en) * 2023-03-01 2024-01-05 支付宝(杭州)信息技术有限公司 Model processing method, device and equipment

Also Published As

Publication number Publication date
CN112766411B (en) 2022-09-09

Similar Documents

Publication Publication Date Title
CN110210551B (en) Visual target tracking method based on adaptive subject sensitivity
CN112766411B (en) Target detection knowledge distillation method for adaptive regional refinement
CN111062951B (en) Knowledge distillation method based on semantic segmentation intra-class feature difference
CN110111335B (en) Urban traffic scene semantic segmentation method and system for adaptive countermeasure learning
CN111723693B (en) Crowd counting method based on small sample learning
CN112184752A (en) Video target tracking method based on pyramid convolution
CN110675423A (en) Unmanned aerial vehicle tracking method based on twin neural network and attention model
CN108764298B (en) Electric power image environment influence identification method based on single classifier
CN109492596B (en) Pedestrian detection method and system based on K-means clustering and regional recommendation network
CN111402126B (en) Video super-resolution method and system based on blocking
WO2023207742A1 (en) Method and system for detecting anomalous traffic behavior
CN108596044B (en) Pedestrian detection method based on deep convolutional neural network
CN112084897A (en) Rapid traffic large-scene vehicle target detection method of GS-SSD
CN115527134A (en) Urban garden landscape lighting monitoring system and method based on big data
CN116740516A (en) Target detection method and system based on multi-scale fusion feature extraction
Zheng et al. CLMIP: cross-layer manifold invariance based pruning method of deep convolutional neural network for real-time road type recognition
CN114647752A (en) Lightweight visual question-answering method based on bidirectional separable deep self-attention network
CN114743126A (en) Lane line sign segmentation method based on graph attention machine mechanism network
CN116432736A (en) Neural network model optimization method and device and computing equipment
CN113763417B (en) Target tracking method based on twin network and residual error structure
CN113297982A (en) Target detection method for improving combination of KCF and DSST in aerial photography
CN112686242A (en) Fine-grained image classification method based on multilayer focusing attention network
CN116820131A (en) Unmanned aerial vehicle tracking method based on target perception ViT
CN115423090A (en) Class increment learning method for fine-grained identification
Zhang et al. Point clouds classification of large scenes based on blueprint separation convolutional neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant