CN108764462A - A convolutional neural network optimization method based on knowledge distillation - Google Patents
A convolutional neural network optimization method based on knowledge distillation
- Publication number
- CN108764462A (application CN201810530304.7A / CN201810530304A)
- Authority
- CN
- China
- Prior art keywords
- fpn
- networks
- network
- convolutional neural
- loss
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention discloses a convolutional neural network optimization method based on knowledge distillation: positions are chosen within the additional structure of the feature-pyramid part of an FPN to establish bridges; multiple feature adaptation layers are built at the bridging positions between a teacher FPN network T and a student FPN network S; and a hierarchically weighted multi-scale loss function is used as the loss function for network training. The positive effects of the invention are twofold. On the one hand, based on the knowledge distillation design of the invention, a complex teacher FPN network can be compressed to obtain a smaller, faster student FPN network; compared with existing CNN-based object detection techniques that use FPN directly, this makes edge-side deployment more convenient. On the other hand, in terms of how the distillation is performed, the invention adapts better to the multi-scale object detection network FPN than existing knowledge distillation techniques and can therefore train a high-quality student FPN network.
Description
Technical field
The present invention relates to a convolutional neural network optimization method based on knowledge distillation.
Background technology
At present, deep learning, with convolutional neural network (CNN) based techniques as its representative, is widely used in many traditional computer vision tasks such as image classification, object detection, and image segmentation, owing to its powerful representation capability: the features it extracts are more robust than the features constructed by hand in conventional methods. In image classification, the typical practice is to train the CNN model with cross-entropy as the loss function.
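As a hedged illustration of that typical practice, the softmax cross-entropy loss can be sketched in NumPy; the function names and sample values below are ours, not the patent's:

```python
import numpy as np

def softmax(logits):
    # Subtract the row-wise max for numerical stability before exponentiating.
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(logits, labels):
    # Mean negative log-likelihood of the true class over the batch.
    probs = softmax(logits)
    return -np.log(probs[np.arange(len(labels)), labels]).mean()

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 3.0, 0.3]])
labels = np.array([0, 1])
loss = cross_entropy(logits, labels)  # small, since both samples are classified correctly
```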
In recent years, the development of deep learning has shown three main trends: model structures are increasingly complex, model depth keeps growing, and large-scale datasets continue to expand. However, as the demand for edge computing with CNNs on mobile and embedded platforms keeps rising, the resource constraints of edge-side computing platforms require CNN models to be as small and computationally efficient as possible. To this end, academia and industry have proposed various model compression methods in recent years, such as model pruning, low-rank decomposition, and low-precision quantization of model parameters. In the 2014 paper Distilling the Knowledge in a Neural Network, Hinton et al. proposed a knowledge distillation method: a large CNN trained on a large-scale dataset serves as the teacher network, and a small CNN serves as the student network; the student is trained jointly on the probability distribution vectors output by the teacher network and the human annotations of the training set. They demonstrated that this method overcomes the difficulty of training a small CNN on a large dataset; after training, the student can obtain classification results close to, or even above, those of the teacher network. The method can be viewed as a means of knowledge transfer: knowledge is transferred from the teacher network to the student network during training. The design goal is that, after the transfer, the large and cumbersome teacher network is replaced for the task by a small, fast, agile student network, which greatly facilitates the deployment of deep learning on edge-side platforms.
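The Hinton-style distillation described above can be sketched as follows; this is a hedged NumPy illustration under our own assumptions, and the temperature T, mixing weight alpha, and all names are illustrative rather than taken from the patent or the cited paper:

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-softened softmax; higher T flattens the distribution.
    z = (logits / T) - (logits / T).max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Soft term: cross-entropy between the teacher's and student's softened distributions.
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    soft = -(p_teacher * np.log(p_student + 1e-12)).sum(axis=1).mean()
    # Hard term: ordinary cross-entropy on the ground-truth labels.
    p = softmax(student_logits)
    hard = -np.log(p[np.arange(len(labels)), labels] + 1e-12).mean()
    return alpha * soft + (1 - alpha) * hard

rng = np.random.default_rng(0)
teacher = rng.normal(size=(8, 10)) * 3.0             # stand-in teacher logits
student = teacher + rng.normal(size=(8, 10)) * 0.1   # student close to the teacher
labels = teacher.argmax(axis=1)
loss = distillation_loss(student, teacher, labels)
```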
After Hinton et al. proposed the knowledge distillation theory, Romero et al. proposed a new method: matching features at the intermediate layers of the teacher and student networks to provide a reliable supervisory signal (hint). This effectively guides the training of student networks with deep structures, overcoming a weakness of Hinton's method and making it possible to distill the knowledge of deep networks.
Existing knowledge distillation techniques currently target mainly CNNs for classification. Meanwhile, applications of deep learning technology, with face recognition as a representative example, are flourishing, and new scenarios keep emerging; this challenges the robustness of face detection algorithms. For example, because of differences in the distance between a face and the camera, different people, or the same person at different moments, appear in the image at widely different scales. Similarly, other kinds of object detection also face this problem in particular scenarios. Even though many object detection algorithms are based on CNNs, their performance on targets of small scale remains unsatisfactory. Therefore, Lin et al. proposed a multi-scale object detection network called the Feature Pyramid Network (FPN). Its backbone (backbone) is typically a large deep model such as ResNet, and it fuses multi-scale features to detect targets; it achieves very good detection performance and shows advantages over traditional algorithms in areas such as small-target detection. However, because it uses a large backbone network like ResNet, deploying FPN on an edge-side platform is subject to many limitations in computation, storage, and so on. The present invention performs knowledge distillation on FPN, reducing its network size and computational load while maintaining its object detection performance, thereby facilitating its deployment on edge-side platforms.
The background of FPN:
Fig. 1 shows the situation of a typical object detection network such as Faster R-CNN: the input image enters the backbone network (e.g., the first layer of a ResNet), and the data is propagated forward all the way to the last layer, where the detection-related predictions are made. This is therefore a single-scale structure, containing no multi-scale pyramid design.
Fig. 2 is the skeleton diagram of FPN. Besides the usual bottom-up propagation through the backbone, from the first layer to the last, there are also top-down connections and lateral connections, which generate an additional feature pyramid (right half of the sketch).
Fig. 3 details the lateral connections and top-down connections between the feature pyramid levels: outside the backbone network, FPN adds extra 1x1 convolutional layers, upsampling, and element-wise (Eltwise) addition layers.
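Under simplifying assumptions (nearest-neighbor upsampling, random weights, channel-first arrays), the lateral-connection step of Fig. 3 can be sketched as follows; this illustrates the structure only and is not the patent's implementation:

```python
import numpy as np

def conv1x1(x, w):
    # A 1x1 convolution is a per-pixel linear map over channels.
    # x: (C_in, H, W), w: (C_out, C_in)  ->  (C_out, H, W)
    return np.einsum('oc,chw->ohw', w, x)

def upsample2x(x):
    # Nearest-neighbor 2x upsampling of a (C, H, W) feature map.
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fpn_merge(c_lateral, p_above, w_lateral):
    # Lateral 1x1 conv on the backbone feature, then element-wise (Eltwise)
    # addition with the upsampled pyramid feature from the level above.
    return conv1x1(c_lateral, w_lateral) + upsample2x(p_above)

rng = np.random.default_rng(1)
C4 = rng.normal(size=(512, 14, 14))   # backbone feature at one level
P5 = rng.normal(size=(256, 7, 7))     # pyramid feature one level up
w = rng.normal(size=(256, 512)) * 0.01
P4 = fpn_merge(C4, P5, w)             # merged pyramid feature, shape (256, 14, 14)
```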
Since FPN makes the detection-related predictions at every level of the feature pyramid, it enjoys an advantage in multi-scale object detection that ordinary object detection networks lack.
For ordinary object detection networks, Chen et al. proposed a knowledge distillation method in the paper Learning Efficient Object Detection Models with Knowledge Distillation. Taking into account the differences between the object detection task and the basic classification task, they purposely designed the loss function required for training the network, as well as a feature adaptation layer attached to the backbone between the teacher network and the student network during training.
However, FPN, as a special multi-scale object detection network, is not suited to applying that paper's method directly for knowledge distillation. For example, the feature adaptation layer designed in Chen's paper bridges the teacher and student networks directly on the backbone and adapts their intermediate features there. Given that design, it is appropriate for networks whose detection features are indeed produced on the backbone. FPN, however, does not directly use the features output by the backbone network; instead, it adds additional structure outside the backbone, and it is exactly the features output by this additional structure that are applied to the detection task. Using the above method unchanged for FPN is therefore unsuitable.
As another example, that paper's method has only one feature adaptation layer, which suffices because it is based on a single-scale task. FPN is different: it uses multiple feature outputs for a multi-scale task, so with only one feature adaptation layer, a good match of the multi-scale features between the teacher and student networks cannot be guaranteed.
Summary of the invention
To overcome the shortcomings of the prior art, the present invention provides a convolutional neural network optimization method based on knowledge distillation: a multi-scale knowledge distillation method designed to be better suited to the multi-scale object detection network FPN.
The technical solution adopted by the present invention is: a convolutional neural network optimization method based on knowledge distillation, in which positions are chosen within the additional structure of the feature-pyramid part of FPN to establish bridges; multiple feature adaptation layers are built at the bridging positions between the teacher FPN network T and the student FPN network S; and a hierarchically weighted multi-scale loss function is used as the loss function for network training.
Compared with the prior art, the positive effects of the present invention are:
On the one hand, based on the knowledge distillation design of the invention, in which the student FPN network is trained under the guidance of the teacher FPN network, a complex teacher FPN network can be compressed to obtain a smaller, faster student FPN network. Compared with existing CNN-based object detection techniques that use FPN directly, this makes edge-side deployment more convenient.
On the other hand, in terms of how the distillation is performed, the invention adapts better to the multi-scale object detection network FPN than existing knowledge distillation techniques and can train a higher-quality student FPN network.
Description of the drawings
Embodiments of the present invention are described below with reference to the accompanying drawings, wherein:
Fig. 1 shows the situation of a typical object detection network;
Fig. 2 is the skeleton diagram of FPN;
Fig. 3 shows the lateral connections and top-down connections between the feature pyramid levels.
Detailed description of the embodiments
The method of the present invention comprises the following.
First, the present invention places the feature adaptation layers on the additional structure outside the backbone network, unlike Chen's paper, which performs the teacher-student feature adaptation on the backbone itself.
Second, what the present invention designs on the additional structure are multiple (selectable) feature adaptation layers, so that pairwise matching of multi-layer features, adapted to the multi-scale feature pyramid, can be performed between the teacher and student networks.
Third, in the design of the loss function, the present invention takes account of FPN's multi-scale detection function and designs a multi-scale loss function with multi-level loss weighting, again differing from Chen's paper.
Specifically, the discussion is divided into two main parts: structure and loss function.
First, regarding structure: the present invention adopts a feature adaptation approach similar to that of Chen's paper, i.e., feature adaptation is performed with 1x1 convolutional layers at the bridging positions between the teacher and student networks. The main role of the 1x1 convolution is to adapt the channel count of the teacher network's intermediate feature map (feature map), serving as the input, to the channel count of the student network's intermediate feature map, serving as the output.
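A minimal sketch of such an adaptation layer, assuming a 1x1 convolution that maps the teacher's channel count (here 256, illustrative) to the student's (here 128, illustrative); names and sizes are ours:

```python
import numpy as np

def adaptation_layer(teacher_feat, w):
    # teacher_feat: (Ct, H, W); w: (Cs, Ct) -> adapted feature of shape (Cs, H, W),
    # channel-matched to the student's feature map at the same pyramid level.
    return np.einsum('sc,chw->shw', w, teacher_feat)

rng = np.random.default_rng(2)
P3_t = rng.normal(size=(256, 28, 28))    # teacher pyramid feature
w_adapt = rng.normal(size=(128, 256)) * 0.01
Z3 = adaptation_layer(P3_t, w_adapt)     # now comparable with a (128, 28, 28) student P3_s
```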
Unlike the essence of Chen's paper, however, the present invention does not choose bridging positions on the backbone network, but chooses them within the additional structure of the feature-pyramid part of FPN.
Suppose the additional structure of the teacher FPN network T outputs Nt feature maps and the additional structure of the student FPN network S outputs Ns feature maps. The present invention then establishes n feature adaptation layers between T and S, where 1 ≤ n ≤ N and N = min(Nt, Ns). In other words, if the student network's additional structure outputs fewer feature maps, the number of adaptation layers established by the invention ranges over [1, Ns]; conversely, if the teacher network's additional structure outputs fewer feature maps, the range is [1, Nt].
For example, if Nt and Ns are both 4, the number of adaptation layers ranges from 1 to 4.
Take the original FPN with ResNet as the backbone network as an example, and assume T and S have identical structures. Let {C2, C3, C4, C5} denote the output feature maps of the conv2, conv3, conv4, and conv5 layers of ResNet, and let {P2, P3, P4, P5} in the additional structure of FPN denote the output feature maps of the same sizes as {C2, C3, C4, C5} on the backbone, respectively. Distinguishing {P2, P3, P4, P5} on T and on S with the suffixes _t and _s respectively, this embodiment selects the positions of the adaptation layers based on the feature maps {P2_t, P3_t, P4_t, P5_t} and {P2_s, P3_s, P4_s, P5_s}.
For example, if n = 1, the adaptation layer position selected in this embodiment is any one element of the set {(P2_t, P2_s), (P3_t, P3_s), (P4_t, P4_s), (P5_t, P5_s)}. Taking (P2_t, P2_s) as an example, this means that P2_t of T is used as the input of the adaptation layer, its output after the 1x1 convolution is fed into S, and feature matching is performed against P2_s of S; the other positions work analogously. Preferably, the adaptation layer position is (P3_t, P3_s) or (P4_t, P4_s); that is, among the available positions, a relatively central one is selected for bridging.
If n = 4, the positions of the four feature adaptation layers are {(P2_t, P2_s), (P3_t, P3_s), (P4_t, P4_s), (P5_t, P5_s)}.
If n takes an intermediate value, e.g. n = 2, it is preferable to select one bridging position from {(P2_t, P2_s), (P3_t, P3_s)} and another from {(P4_t, P4_s), (P5_t, P5_s)}.
If multiple bridges are established between T and S, then the positions of any two bridges (Pi_t, Pk_s) and (Pj_t, Pl_s) must satisfy the constraints: (1) i ≠ j and k ≠ l; (2) if i > j, then k > l. This ensures that no two bridges coincide in position and that no two bridges between T and S cross in the feature hierarchy.
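These two constraints can be checked mechanically. A sketch (our own helper, with bridges given as (i, k) index pairs meaning Pi_t bridged to Pk_s):

```python
from itertools import combinations

def bridges_valid(bridges):
    # bridges: list of (i, k) pairs, teacher level Pi_t bridged to student level Pk_s.
    for (i, k), (j, l) in combinations(bridges, 2):
        # (1) no two bridges may share a teacher level or a student level
        if i == j or k == l:
            return False
        # (2) bridges must not cross: teacher-side order must match student-side order
        if (i > j) != (k > l):
            return False
    return True

print(bridges_valid([(2, 2), (4, 4)]))   # non-overlapping and non-crossing: valid
print(bridges_valid([(2, 3), (3, 2)]))   # crossing: invalid
```

The second example fails constraint (2): going from teacher level 2 to 3 while going from student level 3 down to 2 would make the bridges intersect in the feature hierarchy.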
Second, the design of the loss function for network training. In Chen's method, the loss terms L_RPN of the RPN (region proposal network) part and L_RCN of the RCN (region classification and box regression) part of the object detection network are defined such that λ is a hyperparameter, N is the batch size of the RCN, M is the batch size of the RPN, the classification loss L_cls is a combination of the hard softmax loss based on ground-truth labels and the soft loss based on knowledge distillation, and the box regression loss L_reg is a combination of the smooth L1 loss and the teacher-bounded L2 loss.
In the definitions of the loss terms L_RPN and L_RCN, the present invention follows Chen's paper. The total loss function L of that paper is:
L = L_RPN + L_RCN + γ·L_hint
where γ is a hyperparameter and L_hint is the feature comparison loss between the teacher and student networks after feature adaptation. L is thus the sum of the RPN loss, the RCN loss, and the feature comparison loss.
Based on a similar idea, and further optimized for the multi-scale character of the features, the present invention defines the loss function:
L = L_RPN + L_RCN + Σ_{i=1..n} γ_i·L_hint_i
where Σ_{i=1..n} γ_i·L_hint_i is the weighted sum of the feature comparison losses after the n feature adaptation layers, and γ_i is a hyperparameter: the weight of the feature comparison loss L_hint_i after the i-th feature adaptation layer.
Here Z_i denotes the teacher network's intermediate feature, taken as the input of the current feature adaptation layer and passed through the feature adaptation, and V_i denotes the feature at the output end of the current feature adaptation layer (i.e., the corresponding intermediate feature of the student network).
In the present invention, the γ_i control two kinds of balance: first, the trade-off in importance between the post-adaptation feature comparison losses and the other loss types; second, the trade-off inside the feature comparison loss, i.e., the relative importance of the comparison losses at the different scales after the feature adaptation layers.
During network training, the γ_i can be adjusted flexibly so that the loss function better suits the specific detection task. Preferably, let γ_2 and γ_3 be the post-adaptation feature comparison loss weights at positions (P2_t, P2_s) and (P3_t, P3_s) respectively. For instance, when the specific task contains relatively many small-scale targets to be detected, one can set γ_2 > γ_3 to strengthen the optimization of small-target detection; conversely, when large-scale targets predominate, one can set γ_2 < γ_3, and so on.
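Assuming the per-layer feature comparison loss is a mean-squared (L2) distance between the adapted teacher features Z_i and the student features V_i, a common choice in hint-based distillation, the hierarchically weighted total loss can be sketched as follows (all names and values are illustrative):

```python
import numpy as np

def hint_loss(Z, V):
    # Mean squared (L2) distance between adapted teacher and student features.
    return np.mean((Z - V) ** 2)

def total_loss(l_rpn, l_rcn, Z_list, V_list, gammas):
    # L = L_RPN + L_RCN + sum_i gamma_i * L_hint_i
    hint = sum(g * hint_loss(Z, V) for g, Z, V in zip(gammas, Z_list, V_list))
    return l_rpn + l_rcn + hint

rng = np.random.default_rng(3)
Z_list = [rng.normal(size=(128, s, s)) for s in (56, 28)]       # adapted teacher P2, P3
V_list = [Z + rng.normal(size=Z.shape) * 0.1 for Z in Z_list]   # student features, close to teacher
# Weight P2 (the finer scale) more when small targets dominate: gamma_2 > gamma_3.
L = total_loss(l_rpn=0.8, l_rcn=1.2, Z_list=Z_list, V_list=V_list, gammas=[0.6, 0.4])
```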
As described in the original FPN paper, FPN as a whole is a generic architecture; accordingly, the backbone part of the FPN structures involved in the present invention is likewise not limited to the ResNet of the embodiment and can be any other deep convolutional neural network.
The main differences between the present invention and the prior art are summarized as follows:
1. The position of the feature adaptation layers (additional structure vs. backbone network): being closer than the backbone to the features actually used by the task, the teacher network can provide more effective supervisory signals (hints) to the student network;
2. The number of feature adaptation layers (single-scale vs. multi-scale adaptation, i.e., n layers within the given value range): this better fits the feature pyramid that is the essence of FPN;
3. The design of the loss function (single-scale loss vs. hierarchically weighted multi-scale loss): this gives better control both over the balance among the per-scale loss terms inside the multi-scale feature comparison loss and over the balance between them and the other loss types.
Claims (8)
1. a kind of convolutional neural networks optimization method of knowledge based distillation, it is characterised in that:From the feature pyramid part of FPN
Additional structure in chosen position establish bridge joint;Bridge joint position between teacher FPN networks T and student's FPN networks S is established more
A feature adaptation layer;It is used as the loss function of network training using the multiple dimensioned loss function of level weighting.
2. a kind of convolutional neural networks optimization method of knowledge based distillation according to claim 1, it is characterised in that:?
On bridge joint position between teacher FPN networks T and student's FPN networks S feature adaptation is carried out with 1x1 convolutional layers.
3. a kind of convolutional neural networks optimization method of knowledge based distillation according to claim 1, it is characterised in that:If
Establish multiple bridge joints between teacher FPN networks T and student's FPN networks S, then any two of which bridge joint (Pi_t, Pk_s) and
The position relationship of (Pj_t, Pl_s) must meet following constraint:(1) i is not equal to j, and k is not equal to 1;(2) if i>J, then k>
1。
4. a kind of convolutional neural networks optimization method of knowledge based distillation according to claim 1, it is characterised in that:It is special
The number n of sign adaptation layer is determined in the following way:1≤n≤N and N=min (Nt, Ns), wherein:Nt indicates teacher's FPN networks
T additional structures export characteristic pattern quantity, and Ns indicates that student's FPN network S additional structures export characteristic pattern quantity.
5. a kind of convolutional neural networks optimization method of knowledge based distillation according to claim 1, it is characterised in that:Institute
Stating the multiple dimensioned loss function that level weights is:
Wherein:LRPNAnd LRCNIt is illustrated respectively in the loss item of the RPN part and the parts RCN in target detection network;For
Aspect ratio is to the weighted sum of loss, γ after n feature adaptation layeriFor hyper parameter, correspond to aspect ratio after each feature adaptation layer
To lossWeight.
6. a kind of convolutional neural networks optimization method of knowledge based distillation according to claim 5, it is characterised in that:Institute
Aspect ratio is to loss after stating each feature adaptation layerIt is calculated as follows:
Wherein, ZiFor the input terminal of current signature adaptation layer, ViFor the output end of current signature adaptation layer.
7. a kind of convolutional neural networks optimization method of knowledge based distillation according to claim 5, it is characterised in that:
LRPNAnd LRCNFollowing formula is respectively adopted to calculate:
Wherein:λ is hyper parameter, and N is the batch sizes of RCN, and M is the batch sizes of RPN, Classification Loss LclsIt is to be based on
The combination of the soft loss of the softmax of ground truth labels losses firmly and knowledge based distillation, frame return loss Lreg
It is the combination of smooth L1 losses and the L2 losses of teacher's network limit.
8. a kind of convolutional neural networks optimization method of knowledge based distillation according to claim 5, it is characterised in that:When
When needing the small scaled target detected relatively more in specific tasks, then γ is seti>γi+1If otherwise large scale target is more
Then set γi<γi+1。
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810530304.7A CN108764462A (en) | 2018-05-29 | 2018-05-29 | A kind of convolutional neural networks optimization method of knowledge based distillation |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108764462A true CN108764462A (en) | 2018-11-06 |
Family
ID=64003296
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810530304.7A Pending CN108764462A (en) | 2018-05-29 | 2018-05-29 | A kind of convolutional neural networks optimization method of knowledge based distillation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108764462A (en) |
Cited By (42)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109635111A (en) * | 2018-12-04 | 2019-04-16 | 国网江西省电力有限公司信息通信分公司 | A kind of news click bait detection method based on network migration |
CN109816636A (en) * | 2018-12-28 | 2019-05-28 | 汕头大学 | A kind of crack detection method based on intelligent terminal |
CN109886343A (en) * | 2019-02-26 | 2019-06-14 | 深圳市商汤科技有限公司 | Image classification method and device, equipment, storage medium |
CN109919110A (en) * | 2019-03-13 | 2019-06-21 | 北京航空航天大学 | Video area-of-interest-detection method, device and equipment |
CN110059717A (en) * | 2019-03-13 | 2019-07-26 | 山东大学 | Convolutional neural networks automatic division method and system for breast molybdenum target data set |
CN110120036A (en) * | 2019-04-17 | 2019-08-13 | 杭州数据点金科技有限公司 | A kind of multiple dimensioned tire X-ray defect detection method |
CN110245754A (en) * | 2019-06-14 | 2019-09-17 | 西安邮电大学 | A kind of knowledge distillating method based on position sensing figure |
CN110263842A (en) * | 2019-06-17 | 2019-09-20 | 北京影谱科技股份有限公司 | For the neural network training method of target detection, device, equipment, medium |
CN110298227A (en) * | 2019-04-17 | 2019-10-01 | 南京航空航天大学 | A kind of vehicle checking method in unmanned plane image based on deep learning |
CN111062951A (en) * | 2019-12-11 | 2020-04-24 | 华中科技大学 | Knowledge distillation method based on semantic segmentation intra-class feature difference |
CN111179212A (en) * | 2018-11-10 | 2020-05-19 | 杭州凝眸智能科技有限公司 | Method for realizing micro target detection chip integrating distillation strategy and deconvolution |
CN111178115A (en) * | 2018-11-12 | 2020-05-19 | 北京深醒科技有限公司 | Training method and system of object recognition network |
CN111260449A (en) * | 2020-02-17 | 2020-06-09 | 腾讯科技(深圳)有限公司 | Model training method, commodity recommendation device and storage medium |
CN111275646A (en) * | 2020-01-20 | 2020-06-12 | 南开大学 | Edge-preserving image smoothing method based on deep learning knowledge distillation technology |
CN111312271A (en) * | 2020-02-28 | 2020-06-19 | 云知声智能科技股份有限公司 | Model compression method and system for improving convergence rate and processing performance |
WO2020143225A1 (en) * | 2019-01-08 | 2020-07-16 | 南京人工智能高等研究院有限公司 | Neural network training method and apparatus, and electronic device |
CN111428191A (en) * | 2020-03-12 | 2020-07-17 | 五邑大学 | Antenna downward inclination angle calculation method and device based on knowledge distillation and storage medium |
CN111461212A (en) * | 2020-03-31 | 2020-07-28 | 中国科学院计算技术研究所 | Compression method for point cloud target detection model |
CN111476167A (en) * | 2020-04-09 | 2020-07-31 | 北京中科千寻科技有限公司 | student-T distribution assistance-based one-stage direction remote sensing image target detection method |
CN111554268A (en) * | 2020-07-13 | 2020-08-18 | 腾讯科技(深圳)有限公司 | Language identification method based on language model, text classification method and device |
CN111626330A (en) * | 2020-04-23 | 2020-09-04 | 南京邮电大学 | Target detection method and system based on multi-scale characteristic diagram reconstruction and knowledge distillation |
CN111967617A (en) * | 2020-08-14 | 2020-11-20 | 北京深境智能科技有限公司 | Machine learning method based on difficult sample learning and neural network fusion |
CN112020724A (en) * | 2019-04-01 | 2020-12-01 | 谷歌有限责任公司 | Learning compressible features |
CN112052945A (en) * | 2019-06-06 | 2020-12-08 | 北京地平线机器人技术研发有限公司 | Neural network training method, neural network training device and electronic equipment |
CN112150478A (en) * | 2020-08-31 | 2020-12-29 | 温州医科大学 | Method and system for constructing semi-supervised image segmentation framework |
CN112200062A (en) * | 2020-09-30 | 2021-01-08 | 广州云从人工智能技术有限公司 | Target detection method and device based on neural network, machine readable medium and equipment |
CN112487182A (en) * | 2019-09-12 | 2021-03-12 | 华为技术有限公司 | Training method of text processing model, and text processing method and device |
CN112508169A (en) * | 2020-11-13 | 2021-03-16 | 华为技术有限公司 | Knowledge distillation method and system |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090030897A1 (en) * | 2007-07-26 | 2009-01-29 | Hamid Hatami-Hanza | Assissted Knowledge Discovery and Publication System and Method |
US20140289323A1 (en) * | 2011-10-14 | 2014-09-25 | Cyber Ai Entertainment Inc. | Knowledge-information-processing server system having image recognition system |
CN106650756A (en) * | 2016-12-28 | 2017-05-10 | 广东顺德中山大学卡内基梅隆大学国际联合研究院 | Image text description method based on knowledge transfer multi-modal recurrent neural network |
CN107247989A (en) * | 2017-06-15 | 2017-10-13 | 北京图森未来科技有限公司 | A kind of neural network training method and device |
CN107358293A (en) * | 2017-06-15 | 2017-11-17 | 北京图森未来科技有限公司 | A kind of neural network training method and device |
CN108010030A (en) * | 2018-01-24 | 2018-05-08 | 福州大学 | A kind of Aerial Images insulator real-time detection method based on deep learning |
- 2018-05-29: Application CN201810530304.7A filed in China; published as CN108764462A (status: Pending)
Non-Patent Citations (2)
Title |
---|
GUOBIN CHEN: "Learning Efficient Object Detection Models with Knowledge Distillation", NIPS'17: Proceedings of the 31st International Conference on Neural Information Processing Systems * |
SHI ZEHAO: "Object Detection Algorithm Based on Feature Pyramid Networks" (in Chinese), Modern Computer * |
Cited By (68)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111179212A (en) * | 2018-11-10 | 2020-05-19 | 杭州凝眸智能科技有限公司 | Method for realizing micro target detection chip integrating distillation strategy and deconvolution |
CN111179212B (en) * | 2018-11-10 | 2023-05-23 | 杭州凝眸智能科技有限公司 | Method for realizing tiny target detection on-chip by integrating distillation strategy and deconvolution |
CN111178115B (en) * | 2018-11-12 | 2024-01-12 | 北京深醒科技有限公司 | Training method and system for object recognition network |
CN111178115A (en) * | 2018-11-12 | 2020-05-19 | 北京深醒科技有限公司 | Training method and system of object recognition network |
CN109635111A (en) * | 2018-12-04 | 2019-04-16 | 国网江西省电力有限公司信息通信分公司 | A kind of news click bait detection method based on network migration |
CN109816636A (en) * | 2018-12-28 | 2019-05-28 | 汕头大学 | A kind of crack detection method based on intelligent terminal |
CN109816636B (en) * | 2018-12-28 | 2020-11-27 | 汕头大学 | Crack detection method based on intelligent terminal |
WO2020143225A1 (en) * | 2019-01-08 | 2020-07-16 | 南京人工智能高等研究院有限公司 | Neural network training method and apparatus, and electronic device |
CN109886343B (en) * | 2019-02-26 | 2024-01-05 | 深圳市商汤科技有限公司 | Image classification method and device, equipment and storage medium |
CN109886343A (en) * | 2019-02-26 | 2019-06-14 | 深圳市商汤科技有限公司 | Image classification method and device, equipment, storage medium |
CN109919110B (en) * | 2019-03-13 | 2021-06-04 | 北京航空航天大学 | Video attention area detection method, device and equipment |
CN109919110A (en) * | 2019-03-13 | 2019-06-21 | 北京航空航天大学 | Video area-of-interest-detection method, device and equipment |
CN110059717A (en) * | 2019-03-13 | 2019-07-26 | 山东大学 | Convolutional neural networks automatic division method and system for breast molybdenum target data set |
CN112020724A (en) * | 2019-04-01 | 2020-12-01 | 谷歌有限责任公司 | Learning compressible features |
US12033077B2 (en) | 2019-04-01 | 2024-07-09 | Google Llc | Learning compressible features |
CN110298227B (en) * | 2019-04-17 | 2021-03-30 | 南京航空航天大学 | Vehicle detection method in unmanned aerial vehicle aerial image based on deep learning |
CN110298227A (en) * | 2019-04-17 | 2019-10-01 | 南京航空航天大学 | A kind of vehicle checking method in unmanned plane image based on deep learning |
CN110120036A (en) * | 2019-04-17 | 2019-08-13 | 杭州数据点金科技有限公司 | A kind of multiple dimensioned tire X-ray defect detection method |
CN112052945B (en) * | 2019-06-06 | 2024-04-16 | 北京地平线机器人技术研发有限公司 | Neural network training method, neural network training device and electronic equipment |
CN112052945A (en) * | 2019-06-06 | 2020-12-08 | 北京地平线机器人技术研发有限公司 | Neural network training method, neural network training device and electronic equipment |
CN110245754A (en) * | 2019-06-14 | 2019-09-17 | 西安邮电大学 | Knowledge distillation method based on position-sensitive maps |
CN110245754B (en) * | 2019-06-14 | 2021-04-06 | 西安邮电大学 | Knowledge distillation guiding method based on position-sensitive maps |
CN110263842A (en) * | 2019-06-17 | 2019-09-20 | 北京影谱科技股份有限公司 | For the neural network training method of target detection, device, equipment, medium |
CN110263842B (en) * | 2019-06-17 | 2022-04-05 | 北京影谱科技股份有限公司 | Neural network training method, apparatus, device, and medium for target detection |
CN112487182B (en) * | 2019-09-12 | 2024-04-12 | 华为技术有限公司 | Training method of text processing model, text processing method and device |
CN112487182A (en) * | 2019-09-12 | 2021-03-12 | 华为技术有限公司 | Training method of text processing model, and text processing method and device |
CN111062951B (en) * | 2019-12-11 | 2022-03-25 | 华中科技大学 | Knowledge distillation method based on semantic segmentation intra-class feature difference |
CN111062951A (en) * | 2019-12-11 | 2020-04-24 | 华中科技大学 | Knowledge distillation method based on semantic segmentation intra-class feature difference |
CN111275646A (en) * | 2020-01-20 | 2020-06-12 | 南开大学 | Edge-preserving image smoothing method based on deep learning knowledge distillation technology |
CN111260449B (en) * | 2020-02-17 | 2023-04-07 | 腾讯科技(深圳)有限公司 | Model training method, commodity recommendation device and storage medium |
CN111260449A (en) * | 2020-02-17 | 2020-06-09 | 腾讯科技(深圳)有限公司 | Model training method, commodity recommendation device and storage medium |
CN111312271A (en) * | 2020-02-28 | 2020-06-19 | 云知声智能科技股份有限公司 | Model compression method and system for improving convergence rate and processing performance |
CN111428191B (en) * | 2020-03-12 | 2023-06-16 | 五邑大学 | Antenna downtilt angle calculation method and device based on knowledge distillation and storage medium |
CN111428191A (en) * | 2020-03-12 | 2020-07-17 | 五邑大学 | Antenna downward inclination angle calculation method and device based on knowledge distillation and storage medium |
CN111461212A (en) * | 2020-03-31 | 2020-07-28 | 中国科学院计算技术研究所 | Compression method for point cloud target detection model |
CN111461212B (en) * | 2020-03-31 | 2023-04-07 | 中国科学院计算技术研究所 | Compression method for point cloud target detection model |
CN111476167B (en) * | 2020-04-09 | 2024-03-22 | 北京中科千寻科技有限公司 | One-stage direction remote sensing image target detection method based on student-T distribution assistance |
CN111476167A (en) * | 2020-04-09 | 2020-07-31 | 北京中科千寻科技有限公司 | student-T distribution assistance-based one-stage direction remote sensing image target detection method |
CN111626330A (en) * | 2020-04-23 | 2020-09-04 | 南京邮电大学 | Target detection method and system based on multi-scale characteristic diagram reconstruction and knowledge distillation |
CN111554268B (en) * | 2020-07-13 | 2020-11-03 | 腾讯科技(深圳)有限公司 | Language identification method based on language model, text classification method and device |
CN111554268A (en) * | 2020-07-13 | 2020-08-18 | 腾讯科技(深圳)有限公司 | Language identification method based on language model, text classification method and device |
CN111967617A (en) * | 2020-08-14 | 2020-11-20 | 北京深境智能科技有限公司 | Machine learning method based on difficult sample learning and neural network fusion |
CN111967617B (en) * | 2020-08-14 | 2023-11-21 | 北京深境智能科技有限公司 | Machine learning method based on difficult sample learning and neural network fusion |
CN112150478A (en) * | 2020-08-31 | 2020-12-29 | 温州医科大学 | Method and system for constructing semi-supervised image segmentation framework |
CN112200062B (en) * | 2020-09-30 | 2021-09-28 | 广州云从人工智能技术有限公司 | Target detection method and device based on neural network, machine readable medium and equipment |
CN112200062A (en) * | 2020-09-30 | 2021-01-08 | 广州云从人工智能技术有限公司 | Target detection method and device based on neural network, machine readable medium and equipment |
CN112508169A (en) * | 2020-11-13 | 2021-03-16 | 华为技术有限公司 | Knowledge distillation method and system |
CN112529178B (en) * | 2020-12-09 | 2024-04-09 | 中国科学院国家空间科学中心 | Knowledge distillation method and system suitable for detection model without preselection frame |
CN112529178A (en) * | 2020-12-09 | 2021-03-19 | 中国科学院国家空间科学中心 | Knowledge distillation method and system suitable for detection model without preselection frame |
CN112560631A (en) * | 2020-12-09 | 2021-03-26 | 昆明理工大学 | Knowledge distillation-based pedestrian re-identification method |
CN112560693B (en) * | 2020-12-17 | 2022-06-17 | 华中科技大学 | Highway foreign matter identification method and system based on deep learning target detection |
CN112560693A (en) * | 2020-12-17 | 2021-03-26 | 华中科技大学 | Highway foreign matter identification method and system based on deep learning target detection |
CN112731852B (en) * | 2021-01-26 | 2022-03-22 | 南通大学 | Building energy consumption monitoring system based on edge calculation and monitoring method thereof |
CN112731852A (en) * | 2021-01-26 | 2021-04-30 | 南通大学 | Building energy consumption monitoring system based on edge calculation and monitoring method thereof |
CN113505719B (en) * | 2021-07-21 | 2023-11-24 | 山东科技大学 | Gait recognition model compression system and method based on local-integral combined knowledge distillation algorithm |
CN113505719A (en) * | 2021-07-21 | 2021-10-15 | 山东科技大学 | Gait recognition model compression system and method based on local-integral joint knowledge distillation algorithm |
CN113610146B (en) * | 2021-08-03 | 2023-08-04 | 江西鑫铂瑞科技有限公司 | Method for realizing image classification based on knowledge distillation with enhanced intermediate layer feature extraction |
CN113610146A (en) * | 2021-08-03 | 2021-11-05 | 江西鑫铂瑞科技有限公司 | Method for realizing image classification based on knowledge distillation enhanced by interlayer feature extraction |
CN113378866A (en) * | 2021-08-16 | 2021-09-10 | 深圳市爱深盈通信息技术有限公司 | Image classification method, system, storage medium and electronic device |
CN113470036A (en) * | 2021-09-02 | 2021-10-01 | 湖南大学 | Hyperspectral image unsupervised waveband selection method and system based on knowledge distillation |
CN113470036B (en) * | 2021-09-02 | 2021-11-23 | 湖南大学 | Hyperspectral image unsupervised waveband selection method and system based on knowledge distillation |
CN113486185A (en) * | 2021-09-07 | 2021-10-08 | 中建电子商务有限责任公司 | Knowledge distillation method based on joint training, processor and storage medium |
CN113947590A (en) * | 2021-10-26 | 2022-01-18 | 四川大学 | Surface defect detection method based on multi-scale attention guidance and knowledge distillation |
US12020425B2 (en) | 2021-12-03 | 2024-06-25 | Contemporary Amperex Technology Co., Limited | Fast anomaly detection method and system based on contrastive representation distillation |
CN114998570A (en) * | 2022-07-19 | 2022-09-02 | 上海闪马智能科技有限公司 | Method and device for determining object detection frame, storage medium and electronic device |
CN116612378A (en) * | 2023-05-22 | 2023-08-18 | 河南大学 | Improved-SSD-based detection method for small underwater targets in complex backgrounds with unbalanced data |
CN116612378B (en) * | 2023-05-22 | 2024-07-05 | 河南大学 | Improved-SSD-based detection method for small underwater targets in complex backgrounds with unbalanced data |
CN118552739A (en) * | 2024-07-30 | 2024-08-27 | 山东航天电子技术研究所 | Image segmentation model compression method based on hardware perception |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108764462A (en) | A kind of convolutional neural networks optimization method of knowledge based distillation | |
CN111476294B (en) | Zero sample image identification method and system based on generation countermeasure network | |
CN108564029B (en) | Face attribute recognition method based on cascade multitask learning deep neural network | |
CN108805200B (en) | Optical remote sensing scene classification method and device based on depth twin residual error network | |
CN111488474B (en) | Fine-grained freehand sketch image retrieval method based on attention enhancement | |
Fang et al. | DART: Domain-adversarial residual-transfer networks for unsupervised cross-domain image classification | |
CN112651406B (en) | Depth perception and multi-mode automatic fusion RGB-D significance target detection method | |
CN110633708A (en) | Deep network significance detection method based on global model and local optimization | |
CN109241982A (en) | Object detection method based on depth layer convolutional neural networks | |
CN106570522B (en) | Object recognition model establishing method and object recognition method | |
CN110321859A (en) | A kind of optical remote sensing scene classification method based on the twin capsule network of depth | |
CN107292352A (en) | Image classification method and device based on convolutional neural networks | |
CN112365514A (en) | Semantic segmentation method based on improved PSPNet | |
Li et al. | A review of deep learning methods for pixel-level crack detection | |
CN112801138B (en) | Multi-person gesture estimation method based on human body topological structure alignment | |
Wang et al. | Hyperspectral image classification via deep network with attention mechanism and multigroup strategy | |
CN116012722A (en) | Remote sensing image scene classification method | |
CN115223017B (en) | Multi-scale feature fusion bridge detection method based on depth separable convolution | |
Hu et al. | Supervised multi-scale attention-guided ship detection in optical remote sensing images | |
CN111815680A (en) | Deep convolutional neural network automatic horizon tracking method based on constant fast mapping | |
CN115527098A (en) | Infrared small target detection method based on global mean contrast space attention | |
CN118230175A (en) | Real estate mapping data processing method and system based on artificial intelligence | |
Wang et al. | Underground defects detection based on GPR by fusing simple linear iterative clustering phash (SLIC-phash) and convolutional block attention module (CBAM)-YOLOv8 | |
CN117671364A (en) | Model processing method and device for image recognition, electronic equipment and storage medium | |
Zhang et al. | A small target detection algorithm based on improved YOLOv5 in aerial image |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
AD01 | Patent right deemed abandoned | Effective date of abandoning: 20220701 ||