CN110245754B - Knowledge distillation guiding method based on position sensitive graph - Google Patents

Knowledge distillation guiding method based on position sensitive graph

Info

Publication number
CN110245754B
CN110245754B (application number CN201910517256.2A)
Authority
CN
China
Prior art keywords
convolution
network
layer
convolution module
type
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910517256.2A
Other languages
Chinese (zh)
Other versions
CN110245754A (en)
Inventor
常立博
卢通
杜慧敏
张霞
张丽果
王一鸣
徐一丁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Posts and Telecommunications
Original Assignee
Xian University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Posts and Telecommunications
Priority to CN201910517256.2A
Publication of CN110245754A
Application granted
Publication of CN110245754B
Legal status: Active
Anticipated expiration



Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G06N5/022 Knowledge engineering; Knowledge acquisition
    • G06N5/025 Extracting rules from data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G06N5/027 Frames

Abstract

The invention belongs to the field of deep learning and target detection, and relates to a knowledge distillation guiding method based on a position sensitive graph.

Description

Knowledge distillation guiding method based on position sensitive graph
Technical Field
The invention belongs to the field of deep learning and target detection, and particularly relates to a knowledge distillation guiding method based on a position sensitive graph.
Background
The task of object detection is to find out some specific classes of objects from an image, including two processes of detection and identification, and is a key technology in complex visual tasks such as video analysis and image semantic understanding. The accuracy of the target detection result directly affects the effect of the high-level computer vision task. The target detection technology has wide application in many fields, such as robot vision, automatic driving, intelligent monitoring, medical image analysis, human-computer interaction and the like.
Knowledge distillation is a network model compression method. It builds a teacher-student framework in which a teacher network model guides the training of a student network model: the knowledge about feature representation learned by the teacher network, which has a complex structure, a large number of parameters and strong learning ability, is 'distilled' and transferred to the student network, which has a simple structure, few parameters and weak learning ability. Knowledge distillation can provide the student network with soft-label information that cannot be learned from hard labels alone, including inter-class information and the feature-representation knowledge learned by the teacher network. The knowledge distillation method can thus improve the performance of the network without increasing its complexity.
The classical knowledge distillation algorithm is mostly based on a convolutional neural network for an image classification task, and compared with a target detection task, the image classification task is simple in output and only outputs image classification probability vectors. The output of the target detection task is complex, and besides the probability of outputting the target classification, the coordinate information of target positioning is also output, so that the characteristic image of the target detection network contains the characteristic information of the image and the position information of the target, and the classical knowledge distillation method cannot be simply applied to the target detection network.
Disclosure of Invention
The invention aims to provide a knowledge distillation guiding method based on a position sensitive graph for deep-learning lightweight target detection networks, improving the performance of the lightweight target detection network without increasing its detection time or parameter scale.
The technical scheme of the invention is to provide a knowledge distillation guidance method based on a position sensitive graph, which comprises guidance based on the position sensitive graph and guidance based on network output and labeling information;
obtaining a guidance loss function based on the position sensitive graph through guidance based on the position sensitive graph, and obtaining a guidance loss function based on network output and a guidance loss function based on labeling information through guidance based on network output and labeling information;
weighting and adding a guidance loss function based on a position sensitive graph, a guidance loss function based on network output and a guidance loss function based on labeling information to obtain a total guidance loss function, and guiding the parameter update of the student network by using the total guidance loss function;
wherein the location-sensitive graph-based guidance comprises the steps of:
s1, selecting at least one network layer in the teacher network as a guiding layer; selecting at least one network layer in the student network as a guided layer; the guiding layers correspond one to one with the guided layers;
s2, obtaining a multichannel position sensitive graph of the teacher network and the student network;
s21, respectively dividing the output characteristic graphs of the guiding layer and the guided layer into C groups according to the number of channels of the position sensitive graph which is expected to be obtained, wherein the number of groups is the same as the number of channels of the position sensitive graph and is a power of 2;
s22, performing 3D maximum pooling on each group of output characteristic graphs in the guiding layer and the guided layer respectively to obtain at least one group of teacher network multi-channel position sensitive graphs and at least one group of student network multi-channel position sensitive graphs, wherein C is a positive integer greater than or equal to 1;
when the channel numbers of the output characteristic graphs of the guiding layer and the guided layer are not consistent, the guiding layer is divided into C groups of M characteristic graphs each according to the channel number of the position sensitive graph expected to be obtained, and 3D maximum pooling is performed on each group of output characteristic graphs; the guided layer is divided into C groups of N characteristic graphs each, and 3D maximum pooling is performed on each group of output characteristic graphs.
And S3, computing a loss between each obtained teacher network multi-channel position sensitivity graph and the corresponding student network multi-channel position sensitivity graph to obtain a guidance loss function based on the position sensitive graph.
Further, the location sensitivity map-based guidance loss function is:
position_sensitive_hint_loss = (1/2) * ||gps[u(x; w_{l'})] - gps[v(x; w_j)]||_2^2
wherein gps is an abbreviation of generated position sensitive map, indicating that a position sensitive map is generated; u(x; w_{l'}) represents the output characteristic diagram of the teacher network guiding layer, and v(x; w_j) represents the output characteristic diagram of the student network guided layer; l' is the l'-th layer in the teacher network and j is the j-th layer in the student network. Other loss functions are also possible, such as the L1 loss function or the SmoothL1 loss function.
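For illustration, a minimal sketch of the multi-channel position sensitive graph and the above L2 guidance loss could look as follows. It assumes TensorFlow 2.x (the experiments described later use Keras and TensorFlow), that the teacher and student feature maps share the same spatial size, and that the 3D maximum pooling takes the maximum over each whole channel group (a group_size × 1 × 1 window), which is one reading of step S22; function and variable names are illustrative only.

import tensorflow as tf

def generate_position_sensitive_map(feature_map, c):
    """Split the channels into c groups and 3D-max-pool each group down to one channel."""
    _, h, w, ch = feature_map.shape
    group_size = ch // c                                   # M for the teacher, N for the student
    grouped = tf.reshape(feature_map, (-1, h, w, c, group_size))
    return tf.reduce_max(grouped, axis=-1)                 # shape (B, H, W, c)

def position_sensitive_hint_loss(teacher_fmap, student_fmap, c=4):
    """L2 loss between the teacher and student multi-channel position sensitive maps."""
    gps_t = generate_position_sensitive_map(teacher_fmap, c)
    gps_s = generate_position_sensitive_map(student_fmap, c)
    return 0.5 * tf.reduce_sum(tf.square(gps_t - gps_s))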
Further, in order to better improve the performance of the student network, a YOLOv3 target detection network is selected as the teacher network, and a lightweight target detection network is selected as the student network.
Further, the lightweight target detection network comprises a lightweight base convolutional neural network and a lightweight branch prediction network;
the lightweight basic convolutional neural network comprises a convolutional layer, a first network layer, a second network layer and a third network layer;
the first network layer comprises at least one convolution module of a first type and at least one convolution module of a second type;
the second network layer comprises at least one convolution module of a second type, or at least one convolution module of a second type and at least one convolution module of a first type;
the third network layer comprises at least one convolution module of the second type, or at least one convolution module of the second type and at least one convolution module of the first type;
the number of output channels of the first type of convolution module is the same as the number of input channels, and the first type of convolution module consists of a depth convolution layer with a convolution kernel size of t×t and a step size of 1, a convolution layer with a convolution kernel size of 1×1 and a step size of 1, and a bypass; wherein t is a positive integer;
the number of output channels of the second type of convolution module is 2 times the number of input channels, and the second type of convolution module consists of a depth convolution layer with a convolution kernel size of l×l and a step size of 2 and a convolution layer with a convolution kernel size of 1×1 and a step size of 1; wherein l is a positive integer;
the input of the convolutional layer is an image to be detected and is used for acquiring the characteristics of the image to be detected;
the image output by the convolution layer outputs an a characteristic image after being subjected to convolution operation of a first network layer;
inputting the characteristic image a into a second network layer convolution operation and outputting a characteristic image b;
b, inputting the characteristic image to a third network layer convolution operation and outputting a c characteristic image;
the lightweight branch prediction network comprises at least one convolution module of a fourth type and a convolution layer c';
the number of output channels of the fourth type of convolution module is the same as the number of input channels, and the fourth type of convolution module consists of a depth convolution layer with a convolution kernel size of t×t and a step size of 1 and a convolution layer with a convolution kernel size of 1×1 and a step size of 1;
the fourth convolution module I receives a c characteristic image output by the lightweight basic convolution neural network;
and c, the characteristic image is subjected to convolution operation by a fourth convolution module I and then output to a convolution layer c', and a first group of prediction tensors are output after the convolution operation.
Furthermore, the lightweight branch prediction network further comprises three more fourth-type convolution modules, namely a fourth-type convolution module II, a fourth-type convolution module III and a fourth-type convolution module IV, as well as a deconvolution layer b' and a convolution layer b';
a fourth convolution module III receives the b characteristic image output by the lightweight basic convolution neural network and performs convolution operation on the b characteristic image;
after the feature image c is subjected to convolution operation by the fourth type convolution module I, the feature image c is sequentially input to a deconvolution layer b' and a fourth type convolution module II for convolution operation;
the output characteristic diagrams of the fourth type convolution module II and the fourth type convolution module III are added and then output to the fourth type convolution module IV for convolution operation;
and the output characteristic diagram of the fourth convolution module IV is input to the convolution layer b' for operation, and then a second group of prediction tensors are output.
Furthermore, the lightweight branch prediction network further comprises three more fourth-type convolution modules, namely a fourth-type convolution module V, a fourth-type convolution module VI and a fourth-type convolution module VII, as well as a deconvolution layer a' and a convolution layer a';
a fourth convolution module VI receives the a characteristic image output by the lightweight basic convolution neural network; performing convolution operation on the a characteristic image;
the output characteristic diagram of the fourth type convolution module IV is also output to the deconvolution layer a' and then to the fourth type convolution module V for convolution operation;
and the output characteristic graphs of the fourth type convolution module V and the fourth type convolution module VI are added and then sequentially input to the fourth type convolution module VII and the convolution layer a' for operation, and then the third group of prediction tensors are output.
Further, the first network layer further comprises a third type convolution module;
the number of output channels of the third type of convolution module is 2 times the number of input channels, and the third type of convolution module consists of a depth convolution layer with a convolution kernel size of m×m and a step size of 1 and a convolution layer with a convolution kernel size of 1×1 and a step size of 1; wherein m is a positive integer.
Furthermore, the number of the first type of convolution module and the number of the second type of convolution module in the first network layer are both two, and the number of the third type of convolution module is one; the first type of convolution module I, the first type of convolution module II, the second type of convolution module I, the second type of convolution module II and the third type of convolution module I are respectively arranged;
the third convolution module I receives the detection image input by the convolution layer, performs convolution operation on the detection image, and then sequentially inputs the detection image to the second convolution module I, the first convolution module I, the second convolution module II and the first convolution module II to perform convolution operation and output an a characteristic image.
Furthermore, the second network layer comprises one second-type convolution module and five first-type convolution modules, namely a second-type convolution module III and first-type convolution modules III to VII; the second type convolution module III receives the characteristic image a, performs convolution operation on the characteristic image a, sequentially inputs the result to the five first type convolution modules for convolution operation, and outputs a characteristic image b;
or the like, or, alternatively,
the second network layer only comprises a second type convolution module, and the second type convolution module III receives the a characteristic image and outputs a b characteristic image after convolution operation.
Furthermore, one second type of convolution module and one first type of convolution module are arranged in the third network layer; respectively a second convolution module IV and a first convolution module VIII; the second type of convolution module IV receives the b characteristic image, performs convolution operation on the b characteristic image, sequentially inputs the b characteristic image to the first type of convolution module VIII, and outputs a c characteristic image;
or the like, or, alternatively,
and the third network layer only comprises a second type convolution module, and the second type convolution module IV receives the b characteristic image and outputs a c characteristic image after convolution operation.
Further, t is equal to 3, l is equal to 3, and m is equal to 3.
Further, three network layers in the teacher network are selected as guiding layers; correspondingly, three network layers in the student network are selected as guided layers; the guiding layers correspond one to one with the guided layers.
The invention has the beneficial effects that:
1. According to the method, the multichannel position sensitive graph is generated through 3D maximum pooling, so that the statistical characteristics of the teacher network guiding layer characteristic graphs can be extracted into the position sensitive graph and, through the loss between the two groups of position sensitive graphs, migrated into the guided layer of the student network; compared with knowledge distillation methods in the prior art, the method improves the target detection performance of the student network more effectively.
2. With the method for acquiring the position sensitive graph, position sensitive graphs with the same number of channels can be obtained regardless of whether the channel numbers of the guiding layer and the guided layer are consistent, which makes it convenient to compute the loss function based on the position sensitive graph.
Drawings
FIG. 1 is a schematic diagram of a knowledge distillation guidance process based on a location sensitive plot;
FIG. 2 is a diagram of a YOLOv3 network architecture;
FIG. 3 is a block diagram of a lightweight object detection network;
FIG. 4 is a first type of convolution module;
FIG. 5 is a second type of convolution module;
FIG. 6 is a third type of convolution module;
FIG. 7 is a diagram of a lightweight branch prediction network architecture;
FIG. 8 is a fourth type of convolution module;
FIG. 9 is a schematic diagram of grouping feature graphs to generate a position sensitive graph;
FIG. 10 is a schematic illustration of a position sensitive map generation;
FIG. 11 is a diagram of the experimental scheme for guidance based on a position sensitive graph;
the reference numbers in the figures are: 1-third-type convolution module I, 2-second-type convolution module I, 3-first-type convolution module I, 4-second-type convolution module II, 5-first-type convolution module II, 6-second-type convolution module III, 7-first-type convolution module III, 8-first-type convolution module IV, 9-first-type convolution module V, 10-first-type convolution module VI, 11-first-type convolution module VII, 12-second-type convolution module IV, 13-first-type convolution module VIII, 14-fourth-type convolution module I, 15-fourth-type convolution module II, 16-fourth-type convolution module III, 17-fourth-type convolution module IV, 18-fourth-type convolution module V, 19-fourth-type convolution module VI, 20-fourth-type convolution module VII.
Detailed Description
The invention is further described with reference to the following figures and specific embodiments.
The invention provides a knowledge distillation guiding method based on a position sensitive graph for a target detection network.
As shown in FIG. 1, the knowledge distillation guidance method based on the position sensitive graph consists of two parts of guidance, one part is guidance based on the position sensitive graph, and the other part is guidance based on network output and labeled information. In this embodiment, the YOLOv3 target detection network is used as a teacher network, and the lightweight target detection network is used as a student network.
As shown in fig. 2, for the YOLOv3 target detection network structure, this embodiment selects three network layers in Darknet-53 as the guiding layers, and guides the corresponding three guided layers in the lightweight target detection network by generating a location-sensitive map.
As shown in fig. 3, the lightweight target detection network of the present embodiment is composed of two parts, namely a lightweight basic convolutional neural network and a lightweight branch prediction network. The input to the lightweight basic convolutional neural network is a 416 x 416 color image. In fig. 3, S in each convolution block indicates the step size of the convolution operation, and ×N indicates that the number of channels of the output feature map is N. It can be seen that the convolution modules in fig. 3 are divided into three types, namely the first type, the second type and the third type of convolution module. The first-type convolution modules comprise first-type convolution module I 3, first-type convolution module II 5, first-type convolution modules III 7 to VII 11 and first-type convolution module VIII 13; the second-type convolution modules comprise second-type convolution module I 2, second-type convolution module II 4, second-type convolution module III 6 and second-type convolution module IV 12; the third-type convolution module is third-type convolution module I 1.
In this embodiment, the third type convolution module i 1, the second type convolution module i 2, the first type convolution module i 3, the second type convolution module ii 4, and the first type convolution module ii 5 are sequentially ordered to form a first network layer; the second type convolution module III 6, the first type convolution module III 7 to the first type convolution module VII 11 form a second network layer; the second type convolution module IV 12 and the first type convolution module VIII 13 form a third network layer. The image output by the convolution layer outputs an a characteristic image after convolution operation of a first network layer; inputting the characteristic image a into a second network layer convolution operation and outputting a characteristic image b; and b, inputting the characteristic image to a third network layer convolution operation, and outputting a c characteristic image.
In other embodiments, to achieve different requirements, the first network layer may include only the convolution modules of the second type and the convolution modules of the first type, and the number is not limited.
As shown in fig. 4, the step size of each first-type convolution module is 1, and the number of output channels is the same as the number of input channels. The first-type convolution module is composed of a depth convolution layer (step size 1) with a convolution kernel size of 3 × 3, a 1 × 1 convolution layer (step size 1), and a bypass; the bypass realizes the feature map addition operation. A BN layer and an activation layer are added after the depth convolution layer and the convolution layer. The activation function used by the activation layer in the first-type convolution module is the ReLU function, which is simple and fast to compute since it only sets the negative part of the input data to 0, i.e. y = max(0, x), making the module well suited to an embedded environment.
In the first-type convolution module, the step size of both the depth convolution layer and the convolution layer is 1, and the number of convolution kernels of the convolution layer is the same as the number of channels of the input feature map, so neither the channel number nor the spatial size of the feature map is changed. Connecting the input feature map to the output through the bypass improves the stability of the network and the training effect. The parameter count of the first-type convolution module is 3 × 3 × n + n × 1 × 1 × n = 9n + n², where n is the number of channels of the input feature map. When n is 128, the parameter count of this module is 17536.
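For illustration, a first-type convolution module of this kind could be sketched in Keras as follows; the ordering of the BN and activation layers and the omission of biases are assumptions where the text is silent, and BN parameters are ignored in the parameter check.

import tensorflow as tf
from tensorflow.keras import layers

def conv_module_type1(x):
    """First-type module: 3x3 depth convolution, 1x1 convolution, bypass; step size 1, channels unchanged."""
    n = x.shape[-1]
    y = layers.DepthwiseConv2D(3, strides=1, padding='same', use_bias=False)(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(n, 1, strides=1, padding='same', use_bias=False)(y)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    return layers.Add()([x, y])    # bypass: add the input feature map to the output

# Convolution parameter check for n = 128: 3*3*128 + 128*1*1*128 = 17536.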
As shown in fig. 5, the second-type convolution module has a step size of 2 and its number of output channels is 2 times the number of input channels; it is composed of a depth convolution layer (step size 2) with a convolution kernel size of 3 × 3 and a convolution layer (step size 1) with a convolution kernel size of 1 × 1. A BN layer and an activation layer are added after the depth convolution layer and the convolution layer. Downsampling of the feature map is realized by the depth convolution with step size 2, and the number of output channels is increased by the convolution layer, whose number of convolution kernels is 2 times the number of input channels. The parameter count of this module is 9n + 2n², where n is the number of channels of the input feature map; when n is 128, it is 33920.
As shown in fig. 6, the step size of the third-type convolution module is 1 and its number of output channels is 2 times the number of input channels. The third-type convolution module is the second-type convolution module (a lightweight convolution module with a step size of 2 and a doubled channel number) with the step size of its depth convolution layer set to 1. The parameter count of this module is 9n + 2n², where n is the number of channels of the input feature map; when n is 128, it is 33920.
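Under the same assumptions as the first-type sketch above, the second- and third-type convolution modules could look as follows; the two differ only in the step size of the depth convolution (2 for downsampling versus 1).

import tensorflow as tf
from tensorflow.keras import layers

def conv_module_type2(x):
    """Second-type module: stride-2 depth convolution, then a 1x1 convolution doubling the channels."""
    n = x.shape[-1]
    y = layers.DepthwiseConv2D(3, strides=2, padding='same', use_bias=False)(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(2 * n, 1, strides=1, padding='same', use_bias=False)(y)
    y = layers.BatchNormalization()(y)
    return layers.ReLU()(y)

def conv_module_type3(x):
    """Third-type module: identical to the second type except the depth convolution has step size 1."""
    n = x.shape[-1]
    y = layers.DepthwiseConv2D(3, strides=1, padding='same', use_bias=False)(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(2 * n, 1, strides=1, padding='same', use_bias=False)(y)
    y = layers.BatchNormalization()(y)
    return layers.ReLU()(y)

# Convolution parameter check for n = 128: 3*3*128 + 128*1*1*256 = 33920 for either module.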
The lightweight branch prediction network structure of this embodiment is composed of 7 fourth-type convolution modules with a step size of 1 and the number of input channels equal to the number of output channels, 2 deconvolution modules, 3 convolution layers with a step size of 1 × 1, and two feature map addition operations, as shown in fig. 7. The inputs of the lightweight branch prediction network are the three feature maps output by the lightweight basic convolutional network.
The 7 fourth-type convolution modules are respectively a fourth-type convolution module I 14, a fourth-type convolution module II 15, a fourth-type convolution module III 16, a fourth-type convolution module IV 17, a fourth-type convolution module V 18, a fourth-type convolution module VI 19 and a fourth-type convolution module VII 20; the 2 deconvolution modules are a deconvolution layer b' and a deconvolution layer a' respectively; the 3 convolutional layers are convolutional layer a', convolutional layer b' and convolutional layer c', respectively.
The fourth type convolution module I14, the fourth type convolution module III 16 and the fourth type convolution module VI 19 respectively receive a c characteristic image, a b characteristic image and an a characteristic image output by the lightweight basic convolution neural network; the characteristic image c is subjected to convolution operation by a fourth convolution module I14 and then output to a convolution layer c' for operation, and then a first group of prediction tensors are output; after the feature image c is subjected to convolution operation by a fourth type convolution module I14, the feature image c is sequentially input to a deconvolution layer b' and a fourth type convolution module II 15 for convolution operation; the characteristic image b is output after convolution operation of a fourth convolution module III 16; the output characteristic diagrams of the second convolution module II 15 and the second convolution module III 16 are added and then output to a second convolution module IV 17 for convolution operation; the output characteristic diagram of the fourth convolution module IV 17 is input to a convolution layer b' for operation, and then a second group of prediction tensors are output; the output characteristic diagram of the fourth type convolution module IV 17 is also sequentially output to the deconvolution layer to be subjected to convolution operation with a fourth type convolution module V18; the feature image a is output after being subjected to convolution operation by a fourth convolution module VI 19; and the output characteristic graphs of the fourth type convolution module V18 and the fourth type convolution module VI 19 are added and then sequentially input into a fourth type convolution module VII 20 to be operated with the convolution layer a', and then a third group of prediction tensors are output.
As shown in fig. 8, the fourth-type convolution module consists of one depth convolution layer (step size 1) with a convolution kernel size of 3 × 3 and one 1 × 1 convolution layer (step size 1). A BN layer and an activation layer are added after the depth convolution layer and the convolution layer. In this convolution module, the step size of both the depth convolution layer and the convolution layer is 1, and the number of convolution kernels of the convolution layer is the same as the number of channels of the input feature map, so neither the channel number nor the spatial size of the feature map is changed. The reason the module does not use a bypass is that the main function of the lightweight branch prediction network differs from that of the basic convolutional neural network: it serves mainly as a prediction network, and directly adding the input feature map to the output feature map at this point would interfere with the generation of the prediction tensors.
Deconvolution, also called transposed convolution, is currently widely applied to tasks such as scene segmentation and model generation. The deconvolution operation can recover the size of a feature map, but cannot recover the values it had before the convolution operation. Compared with directly upsampling the image, deconvolution performs the upsampling with learnable parameters, so training allows the information of the feature map to be used better. The lightweight convolution module performs the depth convolution first and then the 1 × 1 convolution, and the depth convolution cannot mix information between channels. With the addition operation, the information of a high-level feature map can be fused onto a low-level feature map, so that the fused feature map carries both detail features and high-level semantic features, further improving the detection of small targets.
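For illustration, the fourth-type convolution module and the deconvolution-plus-addition fusion described above could be sketched as follows under the same Keras assumptions; the transposed-convolution kernel size is an assumption, as the text does not state it.

import tensorflow as tf
from tensorflow.keras import layers

def conv_module_type4(x):
    """Fourth-type module: 3x3 depth convolution then 1x1 convolution; step size 1, channels unchanged, no bypass."""
    n = x.shape[-1]
    y = layers.DepthwiseConv2D(3, strides=1, padding='same', use_bias=False)(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(n, 1, strides=1, padding='same', use_bias=False)(y)
    y = layers.BatchNormalization()(y)
    return layers.ReLU()(y)

def fuse_with_deconv(high_level, low_level):
    """Upsample the high-level map with a learnable transposed convolution and add it to the low-level map."""
    up = layers.Conv2DTranspose(low_level.shape[-1], 3, strides=2, padding='same')(high_level)
    return layers.Add()([up, low_level])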
As shown in fig. 11, the present embodiment uses three network layers in the basic convolutional neural network Darknet-53 of YOLOv3 as guiding layers, and guides the three corresponding guided layers in the lightweight target detection network by generating position sensitivity maps; the three guided layers are the three network layers formed by the interconnected lightweight basic convolutional neural network and lightweight branch prediction network of the lightweight target detection network in fig. 3. The position sensitivity maps are generated from the feature maps output by the corresponding network layers of the teacher network guiding layers and the student network guided layers, and an L2 loss function between the two corresponding position sensitivity maps is used to guide the training of the student network.
The activation degree of the region containing the target to be detected in the high-level feature map is higher than that of the background region. By integrating the information in the characteristic diagram, the part with high activation degree is extracted into a diagram, namely a position sensitive diagram. The position sensitivity graph can reflect the position sensitivity of the characteristic graph to the target to be detected. The position sensitivity characteristic in the feature map of the teacher network can be transmitted to the student network by generating the position sensitivity map to guide the student network, so that the feature map of the student network contains a target area to be detected, and the total activation degree of the target area is higher than that of a background area.
The guidance based on the location-sensitive graph in the embodiment mainly includes the following processes:
(1) Three network layers in the basic convolutional neural network Darknet-53 of YOLOv3 are used as guiding layers; for convenience of later description, these three network layers are defined as the first guiding network layer, the second guiding network layer and the third guiding network layer. Correspondingly, the first network layer, the second network layer and the third network layer in the student network are selected as the three guided layers. In other embodiments, one or two network layers may be selected as guiding layers and guided layers, as required.
(2) A corresponding position sensitivity map is generated from the output feature map of each network layer:
(2.1) The output feature maps of the three guiding layers are each evenly divided into C groups according to the number of channels of the position sensitive map which is expected to be obtained, the number of groups being the same as the number of channels of the position sensitive map, where C is a power of 2; the output feature maps of the three guided layers are divided into C groups in the same way. When the channel numbers of the output feature maps of a guiding layer and its guided layer are inconsistent, the guiding layer is divided into C groups of M feature maps each according to the number of channels of the position sensitive map which is expected to be obtained, and the guided layer is divided into C groups of N feature maps each, as shown in fig. 9.
(2.2) 3D maximum pooling is performed on each group of feature maps of the guiding layers and the guided layers, finally giving the position sensitive maps of the teacher network and the student network. As shown in fig. 10, performing 3D maximum pooling that includes the depth direction on each group of feature maps extracts the maximum value of that group within the current pooling range, which reflects the degree of activation of the group of feature maps at the current location. After 3D maximum pooling over all regions of the feature maps, a position sensitivity map reflecting the overall degree of activation of the group of feature maps is obtained.
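An illustrative shape check (the channel counts and the group-wide pooling window are assumptions, not values from the text) shows why this works even when the guiding layer and the guided layer have different channel numbers: both sides are reduced to C channels, so the two position sensitive maps can be compared directly.

import tensorflow as tf

C = 4                                                 # desired number of channels of the position sensitive map
teacher_fmap = tf.random.normal((1, 52, 52, 256))     # guiding layer: M = 256 / 4 = 64 maps per group
student_fmap = tf.random.normal((1, 52, 52, 64))      # guided layer:  N = 64 / 4 = 16 maps per group

def gps(fmap, c):
    """Group the channels into c groups and take the maximum over each group (3D max pooling over depth)."""
    _, h, w, ch = fmap.shape
    return tf.reduce_max(tf.reshape(fmap, (-1, h, w, c, ch // c)), axis=-1)

print(gps(teacher_fmap, C).shape)   # (1, 52, 52, 4)
print(gps(student_fmap, C).shape)   # (1, 52, 52, 4); same shape, so the hint loss is well defined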
(3) For each pair of corresponding guiding and guided layers (the first guiding network layer corresponds to the first guided network layer, the second guiding network layer to the second guided network layer, and the third guiding network layer to the third guided network layer), an L2 loss is computed between the two position sensitive maps, giving the guidance loss function based on the position sensitive map (position_sensitive_hint_loss) of that network layer, as shown in the following formula:
position_sensitive_hint_loss = (1/2) * ||gps[u(x; w_{l'})] - gps[v(x; w_j)]||_2^2
where gps is an abbreviation of generated position sensitive map, indicating that a position sensitive map is generated; u(x; w_{l'}) represents the output characteristic diagram of the teacher network guiding layer, and v(x; w_j) represents the output characteristic diagram of the student network guided layer. Other loss functions can also be used, such as the L1 loss function:
position_sensitive_hint_loss = |gps[u(x; w_{l'})] - gps[v(x; w_j)]|;
or SmoothL1 loss function:
position_sensitive_hint_loss = smooth_L1(gps[u(x; w_{l'})] - gps[v(x; w_j)]), where smooth_L1(x) = 0.5 * x^2 if |x| < 1 and |x| - 0.5 otherwise.
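For reference, the two alternative hint losses just mentioned could be sketched as follows, again assuming TensorFlow; gps_teacher and gps_student denote the two position sensitive maps, and the Smooth L1 threshold of 1 is the conventional choice rather than a value stated in the text.

import tensorflow as tf

def l1_hint_loss(gps_teacher, gps_student):
    """L1 version of the position-sensitive hint loss."""
    return tf.reduce_sum(tf.abs(gps_teacher - gps_student))

def smooth_l1_hint_loss(gps_teacher, gps_student):
    """Smooth L1 version: quadratic near zero, linear for large differences."""
    diff = tf.abs(gps_teacher - gps_student)
    elementwise = tf.where(diff < 1.0, 0.5 * tf.square(diff), diff - 0.5)
    return tf.reduce_sum(elementwise)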
and 4, finally, the obtained guidance loss functions of the network layers based on the position sensitive graph, the guidance loss functions based on the network output and the labeling information are weighted and added to together guide the parameters of the student network to be updated.
In this embodiment, the guidance based on the network output and the label information is implemented by using an existing method:
the guidance based on the teacher network output tensor needs to form different loss functions according to the physical meanings of the corresponding positions, which are respectively as follows: the frame center point loss function, the frame width and height loss function, the frame confidence coefficient loss function and the target classification loss function form an overall loss function, and the loss is the same as that of YOLOv 3. The guidance based on the labeling information needs to reversely convert the real coordinates and classification categories of the target in the image into a prediction tensor of the network, and as with the guidance based on the network output, four loss functions need to be formed.
The performance of the inventive method was tested by specific experiments as follows.
The processor of the hardware platform used for the experiments was an Intel(R) Core(TM) i7-8700K @ 3.70 GHz with 16.0 GB of memory, and the graphics card was an NVIDIA GeForce GTX 1080. The experiments were based on the Ubuntu 16.04 operating system, using the CUDA 9.0 and cuDNN 7.0 acceleration libraries, the Keras and TensorFlow deep learning frameworks and the OpenCV 3.2 computer vision library, with Python as the programming language. The data sets used for the knowledge distillation training were the PASCAL VOC 2007, PASCAL VOC 2012 and Microsoft COCO 2017 data sets.
In the experiments, knowledge distillation training was carried out on the PASCAL VOC and COCO data sets respectively, and the effectiveness of the knowledge distillation algorithm based on the position sensitive graph was verified by comparing the performance of the lightweight target detection network trained with the knowledge distillation algorithm against conventional training. The results on the VOC data set are shown in table 1.
Table 1 Results of knowledge distillation experiments based on the PASCAL VOC data set
It can be seen that training the lightweight target detection network with the knowledge distillation algorithm based on the position sensitive graph improves its performance; the improved mAP tested on the VOC data set is 76.5%.
The experimental results of the COCO data set are shown in table 2.
Table 2 Experimental results based on the COCO data set
The experimental results on both the VOC and COCO data sets show that the lightweight target detection network trained with the knowledge distillation algorithm based on the position sensitive graph performs better than the lightweight target detection network obtained by conventional training, demonstrating the effectiveness of the knowledge distillation algorithm based on the position sensitive graph.

Claims (11)

1. A knowledge distillation guidance method based on a position sensitive graph is characterized by comprising guidance based on the position sensitive graph and guidance based on network output and labeling information;
obtaining a guidance loss function based on the position sensitive graph through guidance based on the position sensitive graph, and obtaining a guidance loss function based on network output and a guidance loss function based on labeling information through guidance based on network output and labeling information;
weighting and adding a guidance loss function based on a position sensitive graph, a guidance loss function based on network output and a guidance loss function based on labeling information to obtain a total guidance loss function, and guiding the parameter update of the student network by using the total guidance loss function;
wherein the location-sensitive graph-based guidance comprises the steps of:
s1, selecting at least one network layer in the teacher network as a guiding layer; selecting at least one network layer in the student network as a guided layer; the guiding layers correspond one to one with the guided layers;
s2, obtaining a multichannel position sensitive graph of the teacher network and the student network;
s21, dividing the output characteristic graphs of the guiding layer and the guided layer into C groups according to the channel number of the position sensitive graph which is expected to be obtained, wherein the group number is the same as the channel number of the position sensitive graph;
s22, performing 3D maximum pooling on each group of output characteristic graphs in the guiding layer and the guided layer respectively to obtain at least one group of teacher network multi-channel position sensitive graphs and at least one group of student network multi-channel position sensitive graphs, wherein C is a positive integer greater than or equal to 1;
s3, computing a loss between each obtained teacher network multi-channel position sensitive graph and the corresponding student network multi-channel position sensitive graph to obtain a guidance loss function based on the position sensitive graph;
the student network is a lightweight target detection network; the lightweight target detection network comprises a lightweight basic convolutional neural network and a lightweight branch prediction network;
the lightweight basic convolutional neural network comprises a convolutional layer, a first network layer, a second network layer and a third network layer;
the first network layer comprises at least one convolution module of a first type and at least one convolution module of a second type;
the second network layer comprises at least one convolution module of a second type, or at least one convolution module of a second type and at least one convolution module of a first type;
the third network layer comprises at least one convolution module of the second type, or at least one convolution module of the second type and at least one convolution module of the first type;
the number of output channels of the first type of convolution module is the same as the number of input channels, and the first type of convolution module consists of a depth convolution layer with a convolution kernel size of t×t and a step size of 1, a convolution layer with a convolution kernel size of 1×1 and a step size of 1, and a bypass; wherein t is a positive integer;
the number of output channels of the second type of convolution module is 2 times the number of input channels, and the second type of convolution module consists of a depth convolution layer with a convolution kernel size of l×l and a step size of 2 and a convolution layer with a convolution kernel size of 1×1 and a step size of 1; wherein l is a positive integer;
the input of the convolutional layer is an image to be detected and is used for acquiring the characteristics of the image to be detected;
the image output by the convolution layer outputs an a characteristic image after convolution operation of a first network layer;
inputting the characteristic image a into a second network layer convolution operation and outputting a characteristic image b;
b, inputting the characteristic image to a third network layer convolution operation and outputting a c characteristic image;
the lightweight branch prediction network comprises at least one convolution module of a fourth type and a convolution layer c';
the number of output channels of the fourth type of convolution module is the same as the number of input channels, and the fourth type of convolution module consists of a depth convolution layer with a convolution kernel size of t×t and a step size of 1 and a convolution layer with a convolution kernel size of 1×1 and a step size of 1;
the fourth convolution module I (14) receives a c characteristic image output by the lightweight basic convolution neural network;
the c characteristic image is subjected to convolution operation by a fourth convolution module I (14) and then output to a convolution layer c' for operation, and then a first group of prediction tensors are output.
2. The knowledge distillation guidance method based on position sensitive map of claim 1, characterized in that: the guidance loss function based on the location sensitivity map is:
position_sensitive_hint_loss = (1/2) * ||gps[u(x; w_{l'})] - gps[v(x; w_j)]||_2^2
wherein gps is an abbreviation of generated position sensitive map, indicating that a position sensitive map is generated; u(x; w_{l'}) represents the output characteristic diagram of the teacher network guiding layer, and v(x; w_j) represents the output characteristic diagram of the student network guided layer; l' is the l'-th layer in the teacher network and j is the j-th layer in the student network.
3. The knowledge distillation guidance method based on position sensitive map of claim 1, characterized in that: the teacher network is a YOLOv3 target detection network.
4. The knowledge distillation guidance method based on position sensitive map of claim 3, characterized in that: the lightweight branch prediction network further comprises three more fourth convolution modules, namely a fourth convolution module II (15), a fourth convolution module III (16) and a fourth convolution module IV (17), as well as a deconvolution layer b' and a convolution layer b';
a fourth convolution module III (16) receives the b characteristic image output by the lightweight basic convolution neural network and performs convolution operation on the b characteristic image;
after being subjected to convolution operation by a fourth convolution module I (14), the characteristic image c is sequentially input to a deconvolution layer b' and a fourth convolution module II (15) for convolution operation;
the output characteristic diagrams of the fourth convolution module II (15) and the fourth convolution module III (16) are added and then output to the fourth convolution module IV (17) for convolution operation;
and the output characteristic diagram of the fourth convolution module IV (17) is input to the convolution layer b' for operation, and then a second group of prediction tensors are output.
5. The knowledge distillation guidance method based on position sensitive map of claim 4, characterized in that:
the lightweight branch prediction network further comprises three more fourth convolution modules, namely a fourth convolution module V (18), a fourth convolution module VI (19) and a fourth convolution module VII (20), as well as a deconvolution layer a' and a convolution layer a';
a fourth convolution module VI (19) receives the a characteristic image output by the lightweight basic convolution neural network; performing convolution operation on the a characteristic image;
the output characteristic diagram of the fourth type convolution module IV (17) is also output to the deconvolution layer a' and then to the fourth type convolution module V (18) for convolution operation;
and the output characteristic graph of the fourth type convolution module V (18) and the fourth type convolution module VI (19) are added and then sequentially input to the fourth type convolution module VII (20) and the convolution layer a' for operation, and then the third group of prediction tensors are output.
6. The knowledge distillation guidance method based on position sensitive map of claim 5, characterized in that:
the first network layer further comprises a third type convolution module;
the number of output channels of the third type of convolution module is 2 times the number of input channels, and the third type of convolution module consists of a depth convolution layer with a convolution kernel size of m×m and a step size of 1 and a convolution layer with a convolution kernel size of 1×1 and a step size of 1; wherein m is a positive integer.
7. The knowledge distillation guidance method based on position sensitive map of claim 6, characterized in that:
the number of the first type of convolution modules and the number of the second type of convolution modules in the first network layer are two, and the number of the third type of convolution modules is one; respectively a first convolution module I (3), a first convolution module II (5), a second convolution module I (2), a second convolution module II (4) and a third convolution module I (1);
the third convolution module I (1) receives the detection image input by the convolution layer, performs convolution operation, and then sequentially inputs the detection image to the second convolution module I (2), the first convolution module I (3), the second convolution module II (4) and the first convolution module II (5) to perform convolution operation, and then outputs an a characteristic image.
8. The knowledge distillation guidance method based on position sensitive map of claim 7, characterized in that:
one second-type convolution module and five first-type convolution modules are respectively a second-type convolution module III (6), a first-type convolution module III (7) to a first-type convolution module VII (11) in the second network layer; the second type convolution module III (6) receives the a characteristic image, performs convolution operation on the a characteristic image, sequentially inputs the a characteristic image to the five first type convolution modules for convolution operation, and outputs a b characteristic image;
or the like, or, alternatively,
the second network layer only comprises a second type convolution module, and the second type convolution module III (6) receives the a characteristic image and outputs a b characteristic image after convolution operation.
9. The knowledge distillation guidance method based on position sensitive map of claim 8, characterized in that:
one second type of convolution module and one first type of convolution module are arranged in the third network layer; a second convolution module IV (12) and a first convolution module VIII (13) respectively; the second type convolution module IV (12) receives the b characteristic image, performs convolution operation on the b characteristic image, sequentially inputs the b characteristic image to the first type convolution module VIII (13) for convolution operation, and outputs a c characteristic image;
or the like, or, alternatively,
and the third network layer only comprises a second type convolution module, and the second type convolution module IV (12) receives the b characteristic image and outputs a c characteristic image after convolution operation.
10. The knowledge distillation guidance method based on position sensitive map of claim 8, characterized in that: t is equal to 3, l is equal to 3, and m is equal to 3.
11. The knowledge distillation guidance method based on position sensitive maps according to claim 9, characterized in that:
selecting three network layers in a teacher network as guiding layers; correspondingly selecting three network layers in the student network as guided layers; the finger guiding layers correspond to the guided layers one by one.
CN201910517256.2A 2019-06-14 2019-06-14 Knowledge distillation guiding method based on position sensitive graph Active CN110245754B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910517256.2A CN110245754B (en) 2019-06-14 2019-06-14 Knowledge distillation guiding method based on position sensitive graph

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910517256.2A CN110245754B (en) 2019-06-14 2019-06-14 Knowledge distillation guiding method based on position sensitive graph

Publications (2)

Publication Number Publication Date
CN110245754A CN110245754A (en) 2019-09-17
CN110245754B true CN110245754B (en) 2021-04-06

Family

ID=67887395

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910517256.2A Active CN110245754B (en) 2019-06-14 2019-06-14 Knowledge distillation guiding method based on position sensitive graph

Country Status (1)

Country Link
CN (1) CN110245754B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111428191B (en) * 2020-03-12 2023-06-16 五邑大学 Antenna downtilt angle calculation method and device based on knowledge distillation and storage medium
CN111582101B (en) * 2020-04-28 2021-10-01 中国科学院空天信息创新研究院 Remote sensing image target detection method and system based on lightweight distillation network
CN112991476B (en) * 2021-02-18 2021-09-28 中国科学院自动化研究所 Scene classification method, system and equipment based on depth compression domain features
CN113240580B (en) * 2021-04-09 2022-12-27 暨南大学 Lightweight image super-resolution reconstruction method based on multi-dimensional knowledge distillation
CN113065511B (en) * 2021-04-21 2024-02-02 河南大学 Remote sensing image airplane detection model and method based on deep learning
CN115019060A (en) * 2022-07-12 2022-09-06 北京百度网讯科技有限公司 Target recognition method, and training method and device of target recognition model
CN116071608B (en) * 2023-03-16 2023-06-06 浙江啄云智能科技有限公司 Target detection method, device, equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107247989A (en) * 2017-06-15 2017-10-13 北京图森未来科技有限公司 A kind of neural network training method and device
CN108288075A (en) * 2018-02-02 2018-07-17 沈阳工业大学 A kind of lightweight small target detecting method improving SSD
CN108510062A (en) * 2018-03-29 2018-09-07 东南大学 A kind of robot irregular object crawl pose rapid detection method based on concatenated convolutional neural network
CN108764462A (en) * 2018-05-29 2018-11-06 成都视观天下科技有限公司 A kind of convolutional neural networks optimization method of knowledge based distillation
CN108960230A (en) * 2018-05-31 2018-12-07 中国科学院自动化研究所 Lightweight target identification method and device based on rotation rectangle frame
CN109035233A (en) * 2018-07-24 2018-12-18 西安邮电大学 Visual attention network and Surface Flaw Detection method

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180268292A1 (en) * 2017-03-17 2018-09-20 Nec Laboratories America, Inc. Learning efficient object detection models with knowledge distillation
CN108830813B (en) * 2018-06-12 2021-11-09 福建帝视信息科技有限公司 Knowledge distillation-based image super-resolution enhancement method

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107247989A (en) * 2017-06-15 2017-10-13 北京图森未来科技有限公司 A kind of neural network training method and device
CN108288075A (en) * 2018-02-02 2018-07-17 沈阳工业大学 A kind of lightweight small target detecting method improving SSD
CN108510062A (en) * 2018-03-29 2018-09-07 东南大学 A kind of robot irregular object crawl pose rapid detection method based on concatenated convolutional neural network
CN108764462A (en) * 2018-05-29 2018-11-06 成都视观天下科技有限公司 A kind of convolutional neural networks optimization method of knowledge based distillation
CN108960230A (en) * 2018-05-31 2018-12-07 中国科学院自动化研究所 Lightweight target identification method and device based on rotation rectangle frame
CN109035233A (en) * 2018-07-24 2018-12-18 西安邮电大学 Visual attention network and Surface Flaw Detection method

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Distilling the Knowledge in a Neural Network; Hinton, G. et al.; arXiv; 2015-03-09; pp. 1-9 *
Learning Efficient Object Detection Models with Knowledge Distillation; Guobin Chen et al.; Conference on Neural Information Processing Systems; 2017-12-31; pp. 1-10 *
R-FCN: Object Detection via Region-based Fully Convolutional Networks; Dai, Jifeng et al.; Advances in Neural Information Processing Systems; 2016-05-20; pp. 1-9 *
A Survey of Deep Neural Network Compression and Acceleration; Ji Rongrong et al.; Journal of Computer Research and Development; 2018-09-15; pp. 1871-1888 *

Also Published As

Publication number Publication date
CN110245754A (en) 2019-09-17

Similar Documents

Publication Publication Date Title
CN110245754B (en) Knowledge distillation guiding method based on position sensitive graph
CN113610126B (en) Label-free knowledge distillation method based on multi-target detection model and storage medium
CN105740894B (en) Semantic annotation method for hyperspectral remote sensing image
EP3385919B1 (en) Method of processing passage record and device
CN110503192A (en) The effective neural framework of resource
CN112801146B (en) Target detection method and system
CN112633459A (en) Method for training neural network, data processing method and related device
CN109934247A (en) Electronic device and its control method
Suresh et al. Sign language recognition system using deep neural network
Choi et al. AMC-loss: Angular margin contrastive loss for improved explainability in image classification
CN109902763A (en) Method and apparatus for generating characteristic pattern
CN109948699A (en) Method and apparatus for generating characteristic pattern
Liu et al. Hand gesture recognition based on single-shot multibox detector deep learning
CN113449700B (en) Training of video classification model, video classification method, device, equipment and medium
CN110765882A (en) Video tag determination method, device, server and storage medium
Kim et al. Improving discrimination ability of convolutional neural networks by hybrid learning
CN115222946B (en) Single-stage instance image segmentation method and device and computer equipment
Wang et al. Light attention embedding for facial expression recognition
Agarwal et al. Efficient NetB3 for Automated Pest Detection in Agriculture
EP1801731B1 (en) Adaptive scene dependent filters in online learning environments
CN110096991A (en) A kind of sign Language Recognition Method based on convolutional neural networks
CN114219963A (en) Multi-scale capsule network remote sensing ground feature classification method and system guided by geoscience knowledge
Wang et al. Driver action recognition based on attention mechanism
CN112668421B (en) Attention mechanism-based rapid classification method for hyperspectral crops of unmanned aerial vehicle
CN111914809A (en) Target object positioning method, image processing method, device and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant