CN114049527B - Self-knowledge distillation method and system based on online cooperation and fusion - Google Patents

Self-knowledge distillation method and system based on online cooperation and fusion

Info

Publication number
CN114049527B
CN114049527B (application CN202210019067.4A)
Authority
CN
China
Prior art keywords
network
fusion
feature
branch
feature map
Prior art date
Legal status
Active
Application number
CN202210019067.4A
Other languages
Chinese (zh)
Other versions
CN114049527A
Inventor
李树涛 (Li Shutao)
龙祖祥 (Long Zuxiang)
孙斌 (Sun Bin)
Current Assignee
Hunan University
Original Assignee
Hunan University
Priority date
Filing date
Publication date
Application filed by Hunan University
Priority to CN202210019067.4A
Publication of CN114049527A
Application granted
Publication of CN114049527B
Legal status: Active

Classifications

    • G06F18/241 — Pattern recognition; classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 — Classification based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus false rejection rate
    • G06N20/00 — Machine learning
    • G06N3/045 — Neural networks; architecture, e.g. interconnection topology; combinations of networks
    • G06N3/047 — Probabilistic or stochastic networks
    • G06N3/048 — Activation functions
    • G06N3/08 — Neural networks; learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a self-knowledge distillation method and system based on online cooperation and fusion. The feature extraction network comprises a backbone network and n auxiliary branch networks, the backbone network comprising n+1 convolution chunks. The output of each of the first n convolution chunks serves as the input of an auxiliary branch network; the outputs of the n auxiliary branch networks and of the (n+1)-th convolution chunk serve as the inputs of the feature fusion network; and the output of each network is fed into a classifier network to obtain class prediction probabilities for adaptive fusion. A diversity regularization term L_div is set between adjacent auxiliary branch networks and between the n-th auxiliary branch network and the (n+1)-th convolution chunk to mitigate network homogenization and improve fusion quality. By incorporating L_div, the invention fully exploits the knowledge of each branch and mitigates network homogenization, thereby improving the performance of every network.

Description

Self-knowledge distillation method and system based on online cooperation and fusion
Technical Field
The invention relates to the field of deep learning model compression and acceleration, and in particular to a self-knowledge distillation method and system based on online cooperation and fusion.
Background
Convolutional Neural Networks (CNNs), the most important technique in deep learning, exhibit excellent performance in many tasks. However, to achieve higher accuracy, CNNs keep expanding their channel counts and depths, with a rapid increase in the number of parameters and computations. This is a huge challenge for deploying models on edge devices. In view of the above problems, the prior art proposes a number of model compression and acceleration methods, mainly including network pruning, weight quantization, lightweight network design, and knowledge distillation. (1) As a three-stage method, network pruning needs to pre-train a model, prune unimportant channels according to an importance evaluation, and finally fine-tune to restore performance. This method is very time-consuming; furthermore, even with fine-tuning, pruned networks usually still suffer some performance degradation. (2) Weight quantization reduces the amount of computation and parameters by compressing the bit width of the model weights, so that the model can be deployed on specific hardware. (3) Lightweight network design relies on the experience of the designer and extensive experimentation. (4) Unlike the above methods, knowledge distillation achieves model compression and acceleration through knowledge transfer from a teacher network to a student network. A compact student network learns knowledge from a cumbersome teacher network, e.g., class predictions as soft targets, activation boundaries of feature maps, and intermediate-layer feature maps. The teacher and student networks are trained on the same task, and the knowledge of the teacher network serves as a supervision signal for training the compact student, so that the student network can achieve excellent performance with less resource consumption. However, a cumbersome teacher network must be trained in advance and its inference results must be produced alongside the student network during training. The resource cost of these processes is the final barrier to practical application.
To avoid training an additional teacher network, the prior art proposes self-knowledge distillation methods built on knowledge distillation. Such a method adds auxiliary branch networks to different layers of the backbone network and treats the backbone network as the deepest branch. The knowledge of the deep branches is distilled into the shallow branches, i.e., the deep branches are treated as teacher networks and the shallow branches as student networks. Self-knowledge distillation uses the backbone network as a shared layer for the remaining branches, which is the key to reducing training overhead. Moreover, an appropriate branch network can be selected according to different resource constraints. The multi-branch self-knowledge distillation method not only effectively improves the accuracy of the network but also reduces the training cost to the greatest extent. Nevertheless, it faces the following challenges: (1) knowledge flows only from the deepest branch to the shallow branches. This results in a lack of cooperation between the branches and ignores the positive impact that the knowledge of the shallow branches could have on knowledge distillation. (2) All shallow branches learn from the feature maps and predictions of the deepest branch during training, which may cause homogenization of the networks and limit the improvement of network performance: because all branches learn from the same knowledge source, they may generate similar semantic features and similar prediction-error distributions, so the different branches are not complementary.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: aiming at the technical problems faced by existing multi-branch self-knowledge distillation methods, the invention provides a self-knowledge distillation method and system based on online cooperation and fusion.
To solve the above technical problems, the invention adopts the following technical solution:
A self-knowledge distillation method based on online cooperation and fusion comprises performing image classification training or application with a self-knowledge distillation network based on online cooperation and fusion. The network comprises a feature extraction network, a feature fusion network, and a classifier network that are connected with one another. The feature extraction network comprises a backbone network and n auxiliary branch networks designed on an attention mechanism; the backbone network comprises n+1 convolution chunks with learnable parameters for extracting feature maps. The feature map output by each of the first n convolution chunks serves as the input of the corresponding auxiliary branch network; the outputs of the n auxiliary branch networks and of the (n+1)-th convolution chunk are fed into the feature fusion network, which generates a fused feature map. The feature maps output by the backbone network, the n auxiliary branch networks, and the feature fusion network are each fed into the classifier network to obtain the corresponding sample class prediction probabilities, and the classifier network fuses the classification prediction probabilities obtained from the features output by the feature extraction network and the feature fusion network into a fused prediction probability. A diversity regularization term L_div is set between adjacent auxiliary branch networks and between the n-th auxiliary branch network and the (n+1)-th convolution chunk to mitigate homogenization and to improve the feature fusion quality of the feature fusion network and the prediction fusion quality of the classifier network.
Optionally, the diversity regularization term L_div has the functional expression:

$$L_{div} = -\sum_{j=1}^{n} \left\| F_j - F_{j+1} \right\|_2^2 \tag{1}$$

where n is the number of auxiliary branch networks, F_j is the feature map output by the j-th auxiliary branch network, and F_{j+1} is the feature map output by the (j+1)-th auxiliary branch network (with F_{n+1} taken as the feature map output at the end of the backbone network).
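As an illustration, Eq. (1) reduces to a few lines of code. The following PyTorch sketch assumes the branch outputs, with the backbone's end feature map appended as F_{n+1}, have already been brought to a common shape (which the down-sampling modules described below ensure):

```python
import torch

def diversity_loss(branch_feats):
    """L_div = -sum_{j=1}^{n} ||F_j - F_{j+1}||_2^2 (Eq. 1).

    branch_feats: list [F_1, ..., F_{n+1}] of tensors with identical
    shape, e.g. (B, C, H, W). Minimizing the negated sum maximizes the
    Euclidean distance between adjacent branch outputs, which is what
    discourages homogenization.
    """
    loss = torch.zeros((), dtype=branch_feats[0].dtype,
                       device=branch_feats[0].device)
    for f_j, f_next in zip(branch_feats[:-1], branch_feats[1:]):
        loss = loss - torch.sum((f_j - f_next) ** 2)
    return loss
```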
Optionally, each auxiliary branch network comprises an attention module and a down-sampling layer module connected with each other. The attention module performs feature-weight modeling on the input feature map to obtain an updated feature map F_{i_At}; the down-sampling layer module performs size reduction and channel conversion on F_{i_At} to obtain the i-th branch feature map F_{i-down}, whose spatial dimensions are consistent with the feature map F_{n+1} output by the (n+1)-th convolution chunk.
Optionally, the attention module comprises two parallel paths, one computing channel attention and the other computing spatial attention, together with a feature-map update module. The spatial-attention path works as follows: (1) reduce the input feature map F_i from ℝ^{W×H×C} to ℝ^{W×H×1} through a Conv1×1 convolution, obtaining a feature map F_{i_s}, where W is the width, H the height, C the channel count, and ℝ the dimension space; (2) apply average pooling to F_{i_s} along the H and W dimensions respectively to obtain two one-dimensional global features F_{i-sW} ∈ ℝ^{W×1×1} and F_{i-sH} ∈ ℝ^{1×H×1}; (3) normalize the two global features with the Sigmoid activation function and compute the outer product of the two normalized feature vectors, obtaining the spatial attention matrix A_s ∈ ℝ^{W×H×1}. The channel-attention path works as follows: (1) reduce F_i from ℝ^{W×H×C} to ℝ^{1×1×C} through an average pooling operation, obtaining a feature map F_{i_c}; (2) apply a dimension reduction followed by a dimension expansion to F_{i_c} through Conv1×1 convolutions, obtaining a pre-weight vector F_{i-pre} ∈ ℝ^{1×1×C}; (3) normalize F_{i-pre} with the Sigmoid activation function, obtaining the final channel attention vector A_c ∈ ℝ^{1×1×C}. The feature-map update module merges the outputs of the two parallel paths into the updated feature map according to:

$$F_{i\_At} = F_i \odot A_s \odot A_c, \qquad F_{i\_At} \in \mathbb{R}^{W \times H \times C}$$

where ⊙ denotes element-wise multiplication with broadcasting, and F_{i_At} is the updated feature map output by the i-th auxiliary branch network.
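A minimal PyTorch sketch of this attention module follows, using the (B, C, H, W) tensor layout. The reduction ratio r of the channel path is an illustrative assumption, since the text only specifies a dimension reduction followed by an expansion:

```python
import torch
import torch.nn as nn

class BranchAttention(nn.Module):
    """Parallel spatial- and channel-attention paths, merged as
    F_i_At = F_i * A_s * A_c (element-wise, with broadcasting)."""

    def __init__(self, channels, r=16):
        super().__init__()
        # spatial path: Conv1x1 reduces C channels to 1
        self.spatial_proj = nn.Conv2d(channels, 1, kernel_size=1)
        # channel path: reduce then expand (reduction ratio r is assumed)
        self.channel_mlp = nn.Sequential(
            nn.Conv2d(channels, max(channels // r, 1), kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(max(channels // r, 1), channels, kernel_size=1),
        )

    def forward(self, x):                                 # x: (B, C, H, W)
        s = self.spatial_proj(x)                          # (B, 1, H, W)
        s_h = torch.sigmoid(s.mean(dim=3, keepdim=True))  # pool over W: (B, 1, H, 1)
        s_w = torch.sigmoid(s.mean(dim=2, keepdim=True))  # pool over H: (B, 1, 1, W)
        a_s = s_h * s_w                                   # outer product: (B, 1, H, W)
        c = x.mean(dim=(2, 3), keepdim=True)              # global avg pool: (B, C, 1, 1)
        a_c = torch.sigmoid(self.channel_mlp(c))          # channel attention A_c
        return x * a_s * a_c                              # updated map F_i_At
```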
Optionally, the down-sampling layer module comprises a number of down-sampling layers chosen so that, after size reduction and channel conversion of the updated feature map F_{i_At} output by any i-th auxiliary branch network, the resulting i-th branch feature map F_{i-down} has spatial dimensions consistent with the feature map F_{n+1} output by the (n+1)-th convolution chunk. Each down-sampling layer processes its input as follows: (1) the input feature map F_{i_At}, or the feature map output by the previous down-sampling layer, is fed into a depthwise separable convolution, yielding a feature map F_{i-temp} ∈ ℝ^{W/2×H/2×C} whose size is halved in both the H and W dimensions; (2) F_{i-temp} undergoes channel conversion through one layer of pointwise convolution and one layer of depthwise convolution, yielding the converted feature map F_{i-trans} ∈ ℝ^{W/2×H/2×C_mid}; (3) finally, one layer of pointwise convolution converts the channel dimension of F_{i-trans} to the specified value C_out, yielding F_{i-out} ∈ ℝ^{W/2×H/2×C_out}. If the down-sampling layer is the last one, F_{i-out} serves as the i-th branch feature map F_{i-down} obtained after size reduction and channel conversion.
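Under the same conventions, one down-sampling layer might look as follows; the intermediate width C_mid is left as a constructor argument because the text does not fix its value, and the halving step is modeled as a single stride-2 depthwise convolution:

```python
import torch.nn as nn

class DownsampleLayer(nn.Module):
    """One down-sampling layer: halve H and W, then convert channels."""

    def __init__(self, c_in, c_mid, c_out):
        super().__init__()
        # (1) halve H and W (modeled as a stride-2 depthwise conv)
        self.reduce = nn.Conv2d(c_in, c_in, kernel_size=3, stride=2,
                                padding=1, groups=c_in)
        # (2) pointwise + depthwise convs convert the channels to C_mid
        self.convert = nn.Sequential(
            nn.Conv2d(c_in, c_mid, kernel_size=1),
            nn.Conv2d(c_mid, c_mid, kernel_size=3, padding=1, groups=c_mid),
        )
        # (3) pointwise conv converts the channels to the target C_out
        self.project = nn.Conv2d(c_mid, c_out, kernel_size=1)

    def forward(self, x):
        return self.project(self.convert(self.reduce(x)))
```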
Optionally, the feature fusion network fuses the outputs of the feature extraction network into a fused feature map F_Re with richer semantic information through the following steps: (1) concatenate each i-th branch feature map F_{i-down} and the feature map F_{n+1} output by the (n+1)-th convolution chunk along the channel dimension, obtaining the concatenated feature map F_cat ∈ ℝ^{W×H×4C}; (2) apply a max pooling operation and an average pooling operation to F_cat, obtaining two one-dimensional global feature vectors V_max and V_avg; (3) model the global features of V_max and V_avg through two layers of Conv1×1 convolution each, then add the results element-wise to obtain a one-dimensional pre-weight vector; (4) normalize the pre-weight vector with the Sigmoid activation function to obtain the weight vector U ∈ ℝ^{1×1×4C}, and multiply U and F_cat element-wise to obtain the updated concatenated feature map M ∈ ℝ^{W×H×4C}; (5) finally, compress the channels of M to one quarter of the original through a Conv1×1 depthwise separable convolution and a pointwise convolution, obtaining the fused feature map F_Re ∈ ℝ^{W×H×C} with richer semantic information.
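A sketch of these five steps for n = 3 (four input maps of C channels each) is given below. Whether the two global vectors share the two Conv1×1 layers or use separate ones is not fixed by the text; a shared bottleneck with an assumed reduction factor is used here:

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Channel-attention fusion of the branch maps into F_Re (CSFM sketch)."""

    def __init__(self, channels, n_maps=4):
        super().__init__()
        c_cat = channels * n_maps                 # 4C for four input maps
        self.mlp = nn.Sequential(                 # two Conv1x1 layers, shared
            nn.Conv2d(c_cat, c_cat // 4, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(c_cat // 4, c_cat, kernel_size=1),
        )
        self.compress = nn.Sequential(            # depthwise + pointwise: 4C -> C
            nn.Conv2d(c_cat, c_cat, kernel_size=3, padding=1, groups=c_cat),
            nn.Conv2d(c_cat, channels, kernel_size=1),
        )

    def forward(self, feats):                     # feats: list of (B, C, H, W)
        f_cat = torch.cat(feats, dim=1)           # (B, 4C, H, W)
        v_max = torch.amax(f_cat, dim=(2, 3), keepdim=True)   # V_max
        v_avg = f_cat.mean(dim=(2, 3), keepdim=True)          # V_avg
        u = torch.sigmoid(self.mlp(v_max) + self.mlp(v_avg))  # weight vector U
        return self.compress(f_cat * u)           # fused map F_Re: (B, C, H, W)
```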
Optionally, the classifier network comprises n auxiliary classifiers and a backbone network classifier. The n auxiliary classifiers correspond one-to-one with the n auxiliary branch networks and classify the i-th branch feature map F_{i-down} output by each auxiliary branch network to obtain the i-th branch classification prediction probability; the backbone network classifier classifies the feature map F_{n+1} output by the (n+1)-th convolution chunk to obtain the backbone network classification prediction probability.
Optionally, when image classification training is performed with the self-knowledge distillation network based on online cooperation and fusion, the method further comprises using a perceptron to combine the input i-th branch classification prediction probabilities, the backbone network classification prediction probability, and the feature fusion branch prediction probability into a fused prediction probability. When the self-knowledge distillation network based on online cooperation and fusion is used for image classification training, each training round comprises the following steps:
S1) input the training data into the feature extraction network; from the output i-th branch feature maps F_{i-down} and the feature map F_{n+1} output by the (n+1)-th convolution chunk, extract the fused feature map F_Re through the feature fusion network, and feed F_Re through the fusion classifier to obtain the feature fusion branch prediction probability;
S2) take the i-th branch classification prediction probabilities, the backbone network classification prediction probability, and the feature fusion branch prediction probability as the inputs of a perceptron with learnable parameters, and output a one-dimensional fused prediction probability through the perceptron; if the difference loss between the fused prediction probability and the true label of the sample falls within a preset error range, judge that the optimal fused prediction probability has been obtained and jump to the next step; otherwise, update the parameters of the perceptron by gradient descent and jump back to step S1) to continue training the perceptron;
S3) take the optimal fused prediction probability as the soft target in knowledge distillation, and train the backbone network, the auxiliary branch networks, and the feature fusion network on the training data, thereby realizing knowledge transfer and completing knowledge distillation.
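A heavily simplified sketch of one training round is shown below; `model`, `fusion_head`, and `perceptron` are placeholder interfaces, the classifier outputs are treated as logits, and the alternation between S2 (training the perceptron to convergence) and S3 is collapsed into one joint step for brevity:

```python
import torch
import torch.nn.functional as F

def training_round(model, fusion_head, perceptron, images, labels, T=3.0):
    """One training round following S1)-S3), heavily simplified.

    model(images) is assumed to return the branch/backbone logits
    [P_1..P_4] plus the branch feature maps; fusion_head produces the
    fusion-branch logits P_5 from those maps; perceptron holds the
    learnable fusion weights.
    """
    logits, branch_feats = model(images)                  # S1: P_1..P_4 and maps
    logits = list(logits) + [fusion_head(branch_feats)]   # append P_5
    fused = perceptron(logits)                            # S2: fused prediction E
    # hard-label (cross-entropy) losses for every classifier and for E
    loss = sum(F.cross_entropy(p, labels) for p in logits)
    loss = loss + F.cross_entropy(fused, labels)
    # S3: distill the fused prediction, as a fixed soft target, into each branch
    soft_target = F.softmax(fused.detach() / T, dim=1)
    for p in logits:
        loss = loss + F.kl_div(F.log_softmax(p / T, dim=1),
                               soft_target, reduction="batchmean")
    return loss
```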
In addition, the invention also provides a self-knowledge distillation system based on online collaboration and fusion, comprising a microprocessor and a memory connected with each other, the microprocessor being programmed or configured to execute the steps of the online collaboration and fusion based self-knowledge distillation method.
In addition, the invention also provides a computer-readable storage medium storing a computer program to be executed by a computer device to implement the steps of the online collaboration and fusion based self-knowledge distillation method.
Compared with the prior art, the invention has the following advantages:
1. The feature extraction network of the invention comprises a backbone network and n auxiliary branch networks designed on an attention mechanism; the backbone network comprises n+1 convolution chunks with learnable parameters for extracting feature maps, and the feature map output by each of the first n convolution chunks serves as the input of an auxiliary branch network. By adding auxiliary branch networks to different layers of the backbone network and regarding the backbone network as the deepest branch, the knowledge of the deep branches is distilled into the shallow branches. Because the different branches share the layers of the backbone network and the shallow branches learn from the outputs of the deep branches, model compression and acceleration can be realized.
2. The diversity regularization term of the invention can effectively avoid homogenization between networks and enriches the knowledge sources for feature fusion and prediction fusion; the feature fusion network and the perceptron can then fully exploit the knowledge of the different branches to construct a stronger knowledge source for guiding the training of some or all of the networks in the online cooperation and fusion based self-knowledge distillation network, such as the auxiliary branch networks, the backbone network, and the feature fusion network.
3. The method of the invention provides multiple networks with different computation and storage costs for practical application, and a user can deploy a network with suitable accuracy, computation, and storage consumption according to the actual environmental constraints. The method realizes online collaborative learning among the networks through feature fusion and prediction fusion, and can reduce network parameters and computation without affecting performance, which is important for practical, resource-constrained scenarios.
Drawings
FIG. 1 is a schematic structural diagram of a self-knowledge distillation network based on online collaboration and fusion in an embodiment of the present invention.
Fig. 2 is a schematic structural diagram of an attention module according to an embodiment of the present invention.
Fig. 3 is a schematic structural diagram of a downsampling layer module according to an embodiment of the present invention.
Fig. 4 is a diagram of a feature fusion network structure provided in the present invention.
FIG. 5 is a schematic diagram illustrating the training of the self-knowledge distillation network based on online collaboration and fusion according to an embodiment of the present invention.
FIG. 6 is a comparative illustration of experimental results in the examples of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the present invention is further described in detail below with reference to the accompanying drawings.
The first embodiment is as follows:
The self-knowledge distillation method based on online cooperation and fusion comprises performing image classification training or application with a self-knowledge distillation network based on online cooperation and fusion. As shown in FIG. 1, the self-knowledge distillation network based on online cooperation and fusion comprises a feature extraction network, a feature fusion network (CSFM), and a classifier network that are connected with one another. The feature extraction network comprises a backbone network and n auxiliary branch networks; the backbone network comprises n+1 convolution chunks with learnable parameters. The feature map output by each of the first n convolution chunks serves as the input of the corresponding auxiliary branch network; the outputs of the n auxiliary branch networks and of the (n+1)-th convolution chunk are fed into the feature fusion network, which generates a fused feature map. The feature maps output by the backbone network, the n auxiliary branch networks, and the feature fusion network are each fed into the classifier network to obtain the corresponding sample class prediction probabilities, and the classifier network fuses the classification prediction probabilities obtained from the features output by the feature extraction network and the feature fusion network into a fused prediction probability. A diversity regularization term L_div is set between adjacent auxiliary branch networks and between the n-th auxiliary branch network and the (n+1)-th convolution chunk to mitigate homogenization and to improve the feature fusion quality of the feature fusion network and the prediction fusion quality of the classifier network.
As an optional implementation, in this embodiment n = 3, i.e., the backbone network comprises 4 convolution chunks with learnable parameters, so the feature extraction network comprises 3 auxiliary branch networks designed on an attention mechanism, denoted the first auxiliary branch, the second auxiliary branch, and the third auxiliary branch. Referring to FIG. 1, diversity regularization terms L_div are set between the first and second auxiliary branches, between the second and third auxiliary branches, and between the third auxiliary branch and the backbone network, to mitigate homogenization among the networks and to induce the different branch networks to generate diversified features for the subsequent feature-map fusion and prediction fusion. In this embodiment, the diversity regularization term L_div has the functional expression:
$$L_{div} = -\sum_{j=1}^{n} \left\| F_j - F_{j+1} \right\|_2^2 \tag{1}$$
where n is the number of auxiliary branch networks, F_j is the feature map output by the j-th auxiliary branch network, and F_{j+1} is the feature map output by the (j+1)-th auxiliary branch network. In this embodiment the feature maps output by the auxiliary branch networks comprise the first branch feature map F_{1-down}, the second branch feature map F_{2-down}, and the third branch feature map F_{3-down}; together with the feature map F_4 output at the end of the backbone network, they form the input feature set {F_1, F_2, F_3, F_4}. The similarity between feature maps is described by the Euclidean distance between the output feature maps of adjacent networks, and the diversity constraint is realized by maximizing this Euclidean distance during network training; the negated distance therefore serves as the diversity regularization term L_div in this embodiment.
The core modules of the backbone network are the n+1 convolution chunks with learnable parameters used for extracting feature maps. As an optional implementation, in this embodiment the backbone network additionally comprises two layers before the n+1 convolution chunks: the first layer is a convolution layer for channel expansion, and the second layer is a pooling layer for reducing resolution.
In this embodiment, the backbone network specifically adopts a ResNet network, in which the first layer (the convolution layer for channel expansion) is a Conv7×7 convolution layer, and the remaining pooling layer and convolution chunks all follow the original structure of the ResNet network. Among the n+1 convolution chunks with learnable parameters of the backbone network, the feature map extracted by the i-th chunk is F_i; in this embodiment the feature maps extracted by the four convolution chunks are F_1, F_2, F_3, and F_4.
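As a concrete illustration, the four feature maps F_1-F_4 can be exposed from a standard torchvision ResNet as below; ResNet-18 and the `weights=None` argument are illustrative choices, not prescribed by the embodiment:

```python
import torch
import torchvision

class ResNetBackbone(torch.nn.Module):
    """Expose the four convolution chunks of a standard ResNet.

    Uses torchvision's resnet layout (conv1/maxpool stem, layer1..layer4);
    any other four-stage CNN with the same interface would do.
    """
    def __init__(self):
        super().__init__()
        net = torchvision.models.resnet18(weights=None)
        self.stem = torch.nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.chunks = torch.nn.ModuleList(
            [net.layer1, net.layer2, net.layer3, net.layer4])

    def forward(self, x):
        x = self.stem(x)
        feats = []                      # F_1, F_2, F_3, F_4
        for chunk in self.chunks:
            x = chunk(x)
            feats.append(x)
        return feats
```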
In this embodiment, each auxiliary branch network comprises an attention module and a down-sampling layer module connected with each other. The attention module performs feature-weight modeling on the input feature map to obtain an updated feature map F_{i_At}, and the down-sampling layer module performs size reduction and channel conversion on F_{i_At} to obtain the i-th branch feature map F_{i-down}, whose spatial dimensions are consistent with the feature map F_{n+1} output by the (n+1)-th convolution chunk.
As shown in FIG. 2, the attention module in this embodiment comprises two parallel paths, one computing channel attention and the other computing spatial attention, together with a feature-map update module. The spatial-attention path works as follows: (1) reduce the input feature map F_i from ℝ^{W×H×C} to ℝ^{W×H×1} through a Conv1×1 convolution, obtaining a feature map F_{i_s}, where W is the width, H the height, C the channel count, and ℝ the dimension space; (2) apply average pooling to F_{i_s} along the H and W dimensions respectively to obtain two one-dimensional global features F_{i-sW} ∈ ℝ^{W×1×1} and F_{i-sH} ∈ ℝ^{1×H×1}; (3) normalize the two global features with the Sigmoid activation function and compute the outer product of the two normalized feature vectors, obtaining the spatial attention matrix A_s ∈ ℝ^{W×H×1}. The channel-attention path works as follows: (1) reduce F_i from ℝ^{W×H×C} to ℝ^{1×1×C} through an average pooling operation, obtaining a feature map F_{i_c}; (2) apply a dimension reduction followed by a dimension expansion to F_{i_c} through Conv1×1 convolutions, obtaining a pre-weight vector F_{i-pre} ∈ ℝ^{1×1×C}; (3) normalize F_{i-pre} with the Sigmoid activation function, obtaining the final channel attention vector A_c ∈ ℝ^{1×1×C}. The feature-map update module merges the outputs of the two parallel paths into the updated feature map according to:

$$F_{i\_At} = F_i \odot A_s \odot A_c, \qquad F_{i\_At} \in \mathbb{R}^{W \times H \times C}$$

where F_{i_At} is the updated feature map output by the i-th auxiliary branch network.
As shown in FIG. 3, the down-sampling layer module in this embodiment comprises a number of down-sampling layers chosen so that, after size reduction and channel conversion of the updated feature map F_{i_At} output by any i-th auxiliary branch network, the resulting i-th branch feature map F_{i-down} has spatial dimensions consistent with the feature map F_{n+1} output by the (n+1)-th convolution chunk. Each down-sampling layer processes its input as follows: (1) the input feature map F_{i_At}, or the feature map output by the previous down-sampling layer, is fed into a depthwise separable convolution, yielding a feature map F_{i-temp} ∈ ℝ^{W/2×H/2×C} whose size is halved in both the H and W dimensions; (2) F_{i-temp} undergoes channel conversion through one layer of pointwise convolution and one layer of depthwise convolution, yielding the converted feature map F_{i-trans} ∈ ℝ^{W/2×H/2×C_mid}; (3) finally, one layer of pointwise convolution converts the channel dimension of F_{i-trans} to the specified value C_out, yielding F_{i-out} ∈ ℝ^{W/2×H/2×C_out}. If the down-sampling layer is the last one, F_{i-out} serves as the i-th branch feature map F_{i-down} obtained after size reduction and channel conversion.
The attention module designed in this embodiment comprises a channel-attention modeling path and a spatial-attention modeling path, and the down-sampling layer module comprises a preset number of depthwise separable convolutions and pointwise convolutions. Auxiliary branch networks, namely the first, second, and third auxiliary branch networks, are added at the ends of the first three convolution chunks, and the feature maps output by the convolution layers at the ends of those chunks are fed into the corresponding auxiliary branch networks. The attention module models channel and spatial importance for the feature map fed into each auxiliary branch network and updates the feature map adaptively, giving greater weight to important features that benefit subsequent tasks and yielding the updated feature map. The updated features are fed into the down-sampling layers for resolution down-sampling and channel-count conversion, yielding the first, second, and third branch feature maps, which are consistent in spatial dimension with the feature map output by the convolution layer at the end of the backbone network. The process by which the auxiliary branch networks generate feature maps is as follows: the feature maps F_1, F_2, F_3 output by the 1st, 2nd, and 3rd convolution chunks of the backbone network are fed into the first, second, and third auxiliary branch networks respectively; the attention module of each auxiliary branch network first performs feature-weight modeling on the input feature maps F_1, F_2, F_3 to obtain the updated feature maps F_{1_At}, F_{2_At}, F_{3_At}; then F_{1_At}, F_{2_At}, F_{3_At} are fed into the down-sampling layers of each auxiliary branch network for size reduction and channel conversion, yielding the first branch feature map F_{1-down}, the second branch feature map F_{2-down}, and the third branch feature map F_{3-down}, whose spatial dimensions are consistent with the feature map F_4 output by convolution chunk 4. When the attention module performs feature-weight modeling on the input feature maps F_1, F_2, F_3, it contains two parallel paths, the upper path computing the channel attention and the lower path computing the spatial attention. One down-sampling layer can only halve the input feature map in the H and W sizes and convert the input channels to C_out; therefore, to make the dimensions of the feature maps finally output by all auxiliary branches consistent with the feature map output by the last layer of the backbone network, the down-sampling layer module uses a setting N to adjust the number of down-sampling layers. Specifically, the number of down-sampling layers N of the first auxiliary branch network is 3, that of the second auxiliary branch network is 2, and that of the third auxiliary branch network is 1, as assembled in the sketch below; in this way the branch feature maps F_{1-down}, F_{2-down}, F_{3-down} match the spatial dimensions of the feature map F_4 output by convolution chunk 4.
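Reusing the BranchAttention and DownsampleLayer sketches from above, the three auxiliary branches with N = 3, 2, 1 down-sampling layers could be assembled as follows; the channel-doubling schedule and the choice C_mid = C are assumptions, chosen so the branch outputs match a ResNet18-style backbone (64/128/256/512 channels):

```python
import torch.nn as nn

def make_branch(c_in, c_out, num_down):
    """Auxiliary branch i: attention followed by N stacked down-sampling
    layers, so the branch output matches the spatial size and channel
    count of the backbone's final map F_4."""
    layers = [BranchAttention(c_in)]                    # from the earlier sketch
    c = c_in
    for k in range(num_down):
        c_next = c_out if k == num_down - 1 else c * 2  # assumed doubling schedule
        layers.append(DownsampleLayer(c, c, c_next))    # C_mid = C here (assumption)
        c = c_next
    return nn.Sequential(*layers)

# N = 3, 2, 1 for the first, second, and third auxiliary branches:
branches = nn.ModuleList([
    make_branch(64, 512, num_down=3),
    make_branch(128, 512, num_down=2),
    make_branch(256, 512, num_down=1),
])
```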
In this embodiment, the classifier network comprises n auxiliary classifiers and a backbone network classifier. The n auxiliary classifiers correspond one-to-one with the n auxiliary branch networks and classify the i-th branch feature map F_{i-down} output by each auxiliary branch network to obtain the i-th branch classification prediction probability; the backbone network classifier classifies the feature map F_{n+1} output by the (n+1)-th convolution chunk to obtain the backbone network classification prediction probability.
Referring to FIG. 4, in this embodiment, image classification training with the self-knowledge distillation network based on online cooperation and fusion further uses a feature fusion network (CSFM), a fusion classifier, and a perceptron. The feature fusion network fuses the outputs of the feature extraction network into a fused feature map F_Re with richer semantic information; the fusion classifier classifies the fused feature map F_Re output by the feature fusion network to obtain the feature fusion branch prediction probability; and the perceptron combines the input i-th branch classification prediction probabilities, the backbone network classification prediction probability, and the feature fusion branch prediction probability into a fused prediction probability. When the self-knowledge distillation network based on online cooperation and fusion is used for image classification training, each training round comprises the following steps:
S1) input the training data into the feature extraction network; from the output i-th branch feature maps F_{i-down} and the feature map F_{n+1} output by the (n+1)-th convolution chunk, extract the fused feature map F_Re through the feature fusion network, and feed F_Re through the fusion classifier to obtain the feature fusion branch prediction probability. In this embodiment, the first branch feature map F_{1-down}, the second branch feature map F_{2-down}, the third branch feature map F_{3-down}, the feature map F_4 output by the convolution layer at the end of the backbone network, and the fused feature map F_Re are fed into global pooling layers to obtain 5 one-dimensional global feature vectors V_Global_i, i = 1, 2, 3, 4, 5. V_Global_1 is fed into auxiliary classifier 1, V_Global_2 into auxiliary classifier 2, V_Global_3 into auxiliary classifier 3, V_Global_4 into the backbone network classifier, and V_Global_5 into the fusion classifier, yielding the first branch prediction probability P_1, the second branch prediction probability P_2, the third branch prediction probability P_3, the backbone network prediction probability P_4, and the feature fusion branch prediction probability P_5.
The training data are collected in advance, and their processing comprises: collecting samples for network training and preprocessing them, dividing the preprocessed samples into a training set and a validation set, and placing the samples in the same folder by category for convenient subsequent loading. The training set is used for training, and the validation set is used for verifying the trained network.
S2) take one or more of the i-th branch classification prediction probabilities, the backbone network classification prediction probability, and the feature fusion branch prediction probability as the inputs of a perceptron with learnable parameters, and output a one-dimensional fused prediction probability E through the perceptron; if the difference loss between the fused prediction probability and the true label of the sample falls within a preset error range, judge that the optimal fused prediction probability E has been obtained and jump to the next step; otherwise, update the parameters of the perceptron by gradient descent and jump back to step S1) to continue training the perceptron.
In this embodiment, the difference loss between the prediction probabilities (one or more of the i-th branch classification prediction probabilities, the backbone network classification prediction probability, and the feature fusion branch prediction probability) and the true label of the sample is a cross-entropy function, computed as:

$$L_{CE} = -\sum_{i=1}^{5} y \log\big(\mathrm{softmax}(P_i)\big)$$

where L_CE denotes the difference loss between the prediction probabilities and the true label of the sample (the larger the value of L_CE, the greater the difference between a prediction probability and the true label); y denotes the true label of the sample; P_i is one of the first branch prediction probability P_1, the second branch prediction probability P_2, the third branch prediction probability P_3, the backbone network prediction probability P_4, and the feature fusion branch prediction probability P_5; softmax is the activation function used to normalize the prediction probabilities; and log is the natural logarithm. Specifically, in this embodiment the first, second, and third branch feature maps, the feature map output by the convolution layer at the end of the backbone network, and the fused feature map are pooled by global average pooling into 5 one-dimensional global feature vectors; the 5 vectors are fed into the classifiers of the corresponding networks to obtain the prediction probabilities of the different branches for the sample, namely the first, second, and third branch prediction probabilities, the backbone network prediction probability, and the feature fusion branch prediction probability; and the difference loss between each of the 5 prediction probabilities and the true label is computed. As an optional implementation, in this embodiment the third branch prediction probability P_3, the backbone network prediction probability P_4, and the feature fusion branch prediction probability P_5 are taken as the inputs of the perceptron with learnable parameters: a perceptron with 3 learnable parameters is initialized, with fusion weight parameters β_1, β_2, β_3; feeding P_3, P_4, P_5 into the perceptron yields the fused prediction probability E = β^T [P_3, P_4, P_5]; and the difference loss between the fused prediction probability and the true label of the input sample is computed and driven to its minimum by gradient descent. That is, the functional expression of the difference loss between the fused prediction probability and the true label of the sample is:
$$\min_{\beta}\; L_{CE} = -\, y \log\big(\mathrm{softmax}(\beta^{T}[P_3, P_4, P_5])\big)$$

$$\text{subject to}\;\; \sum_{k=1}^{3} \beta_k = 1, \qquad \beta_k \ge 0$$
where L_CE denotes the difference loss between the fused prediction probability (built here from the third branch, backbone network, and feature fusion branch classification prediction probabilities) and the true label of the sample; y denotes the true label of the sample; and β_k is the k-th fusion weight parameter of the perceptron, constrained so that all β_k sum to 1 and each β_k is greater than or equal to zero. The β^T that minimizes the loss is the optimal fusion weight parameter.
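One way to realize the constrained fusion weights is to parameterize β through a softmax, which enforces Σβ_k = 1 and β_k ≥ 0 by construction. This parameterization is an assumption; the patent states only the constraint, not how it is enforced:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionPerceptron(nn.Module):
    """Learnable prediction fusion E = beta^T [P_3, P_4, P_5]."""

    def __init__(self, num_inputs=3):
        super().__init__()
        self.raw = nn.Parameter(torch.zeros(num_inputs))  # unconstrained

    def beta(self):
        # softmax keeps every beta_k >= 0 and makes their sum equal to 1
        return F.softmax(self.raw, dim=0)

    def forward(self, preds):            # preds: list of (B, K) predictions
        stacked = torch.stack(preds)     # (3, B, K)
        return torch.einsum('i,ibk->bk', self.beta(), stacked)
```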
S3) take the optimal fused prediction probability as the soft target in knowledge distillation, and train the backbone network, the auxiliary branch networks, and the feature fusion network on the training data, thereby realizing knowledge transfer and completing knowledge distillation.
In this embodiment, the self-knowledge distillation network based on online cooperation and fusion is used for image classification training, and prediction fusion adopts the adaptive integration method, so that a more robust fused prediction probability is constructed for guiding the learning of each network.
In this embodiment, when the backbone network, the auxiliary branch networks, and the feature fusion network are trained on the training data to realize knowledge transfer and complete knowledge distillation, the loss function is expressed as:
$$L_{KL} = \sum_{i=1}^{5} \mathrm{KL}\big(\sigma(E) \,\big\|\, \sigma(P_i)\big)$$
where E is the optimal fused prediction probability, and P_i is one of the first branch prediction probability P_1, the second branch prediction probability P_2, the third branch prediction probability P_3, the backbone network prediction probability P_4, and the feature fusion branch prediction probability P_5; σ is a normalization function, which can be expressed as:
$$\sigma(x)_k = \frac{\exp(x_k / T)}{\sum_{m'=1}^{m} \exp(x_{m'} / T)}, \qquad x \in \mathbb{R}^{m}$$
where x ∈ ℝ^m and T is a hyper-parameter controlling the degree of numerical smoothing, set to 3 in this embodiment. L_KL describes the degree of similarity between a prediction probability P_i and the fused prediction probability E; the larger the value of L_KL, the greater the difference between the two probability distributions. In network training, minimizing L_KL makes the per-class probability distribution of each i-th network's prediction probability P_i as consistent as possible with the fused prediction probability distribution.
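The normalization σ and the resulting distillation objective can be written compactly as below; the direction of the KL divergence, with the fused prediction E as the reference distribution, follows common knowledge-distillation practice and is an assumption where the text leaves it implicit:

```python
import torch
import torch.nn.functional as F

def sigma(x, T=3.0):
    """Tempered softmax: sigma(x)_k = exp(x_k/T) / sum_m exp(x_m/T); T = 3 here."""
    return F.softmax(x / T, dim=-1)

def l_kl(all_preds, fused, T=3.0):
    """L_KL = sum_i KL(sigma(E) || sigma(P_i)), minimized so every network's
    class distribution approaches the fused prediction distribution."""
    target = sigma(fused, T)
    loss = torch.zeros((), device=fused.device)
    for p in all_preds:
        loss = loss + F.kl_div(torch.log(sigma(p, T)), target,
                               reduction="batchmean")
    return loss
```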
Referring to FIG. 5, in this embodiment the feature fusion network fuses the outputs of the feature extraction network into a fused feature map F_Re with richer semantic information through the following steps: (1) concatenate each i-th branch feature map F_{i-down} and the feature map F_{n+1} output by the (n+1)-th convolution chunk along the channel dimension, obtaining the concatenated feature map F_cat ∈ ℝ^{W×H×4C}; (2) apply a max pooling operation and an average pooling operation to F_cat, obtaining two one-dimensional global feature vectors V_max and V_avg; (3) model the global features of V_max and V_avg through two layers of Conv1×1 convolution each, then add the results element-wise to obtain a one-dimensional pre-weight vector; (4) normalize the pre-weight vector with the Sigmoid activation function to obtain the weight vector U ∈ ℝ^{1×1×4C}, and multiply U and F_cat element-wise to obtain the updated concatenated feature map M ∈ ℝ^{W×H×4C}; (5) finally, compress the channels of M to one quarter of the original through a Conv1×1 pointwise convolution and a depthwise separable convolution, obtaining the fused feature map F_Re ∈ ℝ^{W×H×C} with richer semantic information. Concretely, in this embodiment the first, second, and third branch feature maps and the feature map output by the convolution layer at the end of the backbone network are concatenated along the channel dimension; the concatenated feature map is max-pooled and average-pooled to two one-dimensional global features; the two global features undergo dimension conversion through two Conv1×1 convolution layers each and are added element-wise to form a one-dimensional pre-weight vector; the pre-weight vector is normalized with the Sigmoid activation function and multiplied element-wise with the concatenated feature map on the corresponding channels to update its channel weights; and the channels of the updated concatenated feature map are compressed to one quarter of the original, yielding the fused feature map with richer semantic information.
In this embodiment, the data set CIFAR100 is used to evaluate the knowledge distillation effect; the evaluation index is the classification accuracy, and higher accuracy indicates a better model. CIFAR100 is the most common benchmark data set in the image classification task; it comprises 50000 training samples and 10000 test samples covering 100 different classes, and all samples are RGB images with a resolution of 32×32. The results obtained are shown in Table 1 and FIG. 6.
Table 1: Comparison of test results.
[Table 1 is provided as an image in the original publication.]
In Table 1, "network model" denotes the backbone structure used in the experiment; "baseline network" denotes the original network under standard training; Acc (%) denotes the classification accuracy of the network; F (G) denotes the network FLOPs in units of G; P (M) denotes the number of network parameters in units of M; AC1 denotes auxiliary classifier 1; AC2 denotes auxiliary classifier 2; AC3 denotes auxiliary classifier 3; BC denotes the backbone network classifier; and FFC denotes the fusion classifier. The parameters and FLOPs of the baseline network are consistent with those of the backbone network. From Table 1 it can be found that: (1) compared with the baseline network under standard training (Baseline), the method effectively improves network performance; with identical parameters and floating-point operations, the average accuracy of the backbone network classifier BC is 2.82% higher than that of the baseline classifier. (2) On WRN50-2, ResNext50-4, and ShuffleNetV2, the method outperforms the baseline even at the shallowest auxiliary classifier AC1; moreover, the parameters and FLOPs of AC1 are significantly lower than those of the baseline, as can be seen intuitively from the table. (3) The classification accuracy of the fused-feature classifier FFC is consistently higher than that of the other classifiers, showing that the feature map generated by the proposed feature fusion network has richer semantics; the same conclusion can be drawn from FIG. 6. In FIG. 6, A, B, and C show three different groups of input sample images and their corresponding feature maps: the first column shows the input sample images; auxiliary branches 1 to 3 show the feature maps obtained by auxiliary branches 1 to 3 in FIG. 4 (the first, second, and third branch feature maps); the backbone network column shows the feature map output at the end of the backbone network in FIG. 4; CSFM shows the feature map output by the feature fusion network (CSFM); and the base network column shows the feature map obtained by the original network under standard training, without the modifications of this embodiment.
In summary, the self-knowledge distillation network based on online cooperation and fusion in this embodiment comprises a backbone network, auxiliary branch networks, a feature fusion network, a classifier module, and a perceptron module. The method of this embodiment comprises: acquiring preprocessed images; adding auxiliary branch networks at the ends of layers of different depths of the backbone network to obtain multiple networks, including the backbone network, that can cooperate online, and extracting features of the preprocessed images with the obtained networks; setting a diversity regularization term between adjacent networks to mitigate network homogenization and to induce the different networks to generate diversified outputs for the subsequent feature-map fusion and prediction fusion; concatenating the feature maps output by the convolution layers at the ends of the backbone network and the auxiliary branch networks along the channel dimension to obtain a concatenated feature map; feeding the concatenated feature map into the feature fusion network designed on an attention mechanism, and performing adaptive feature extraction and conversion in the channel dimension to obtain a fused feature map with richer semantic information; feeding the feature maps output by the convolution layers at the ends of the backbone network and the auxiliary branch networks, together with the fused feature map from the feature fusion network, into the corresponding classifiers to obtain the class prediction probabilities of the sample, and computing the difference loss against the true label of the sample; performing prediction-level fusion by dynamic integration according to the obtained prediction probabilities of each network for the sample classes, obtaining a more robust fused prediction probability; and taking the obtained fused prediction probability as the soft target in knowledge distillation and performing knowledge transfer to the backbone network, the auxiliary branch networks, and the feature fusion network to complete knowledge distillation. With the self-knowledge distillation method based on online cooperation and fusion, better classification accuracy can be obtained. The introduced diversity regularization term enriches the knowledge for feature fusion and prediction fusion and avoids homogenization of the different networks. The method of this embodiment shows the advantage of reducing network parameters and computation without affecting performance, which is important for practical, resource-constrained scenarios; in addition, it provides multiple networks with different computation and storage consumption for practical application, and a user can deploy a network with suitable accuracy, computation, and storage consumption according to the actual environmental constraints.
In one embodiment, a self-knowledge distillation method based on online collaboration and fusion comprises: a preprocessed data set; a backbone network; auxiliary branch networks added at the ends of layers of different depths of the backbone network, the backbone network and the auxiliary branch networks extracting features from the samples in the data set; a diversity regularization term for avoiding network homogenization and inducing the different networks to generate diversified features for the subsequent feature-map fusion and prediction fusion; a feature fusion network for performing adaptive feature extraction and conversion in the channel dimension on the obtained first, second, and third branch feature maps and the feature map output at the end of the backbone network, obtaining a fused feature map; a classifier module for converting the feature map output by the convolution layer at the end of each network into a one-dimensional global feature vector and obtaining from it the class prediction probability of the sample; and a perceptron module for dynamically integrating the prediction probabilities from the different classifiers to obtain a more robust fused prediction probability. Knowledge transfer is performed on all the networks based on the obtained fused prediction probability to complete knowledge distillation.
In addition, this embodiment also provides a self-knowledge distillation system based on online collaboration and fusion, comprising a microprocessor and a memory connected with each other, the microprocessor being programmed or configured to execute the steps of the aforementioned self-knowledge distillation method based on online collaboration and fusion. This embodiment also provides a computer-readable storage medium storing a computer program to be executed by a computer device to implement the steps of the aforementioned self-knowledge distillation method based on online collaboration and fusion.
The second embodiment:
This embodiment is basically the same as the first embodiment; the main difference lies in the implementation of the backbone network. In this embodiment the backbone network specifically adopts a MobileNetV2 network, which is modified as follows: the first Conv3×3 convolutional layer remains unchanged, and of the four convolution chunks, the 1st and 2nd convolution chunks of the original MobileNetV2 network are merged as the 1st convolution chunk of the backbone network, the 2nd convolution chunk of the original MobileNetV2 network is kept as the 2nd convolution chunk of the backbone network, the 3rd and 4th convolution chunks of the original MobileNetV2 network are merged as the 3rd convolution chunk of the backbone network, and the 5th and 6th convolution chunks of the original MobileNetV2 network are merged as the 4th convolution chunk of the backbone network, thereby realizing four-stage chunked convolution. In addition, the original MobileNetV2 network can be re-partitioned as required into the convolution chunks of the backbone network's different stages, i.e., different merging and matching schemes are possible.
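As a hedged illustration of such regrouping, the torchvision MobileNetV2 feature extractor can be sliced into four chunks near its stride-2 stage boundaries; the split indices below are an assumption for illustration, since the patent's exact merging scheme may differ:

```python
import torch.nn as nn
from torchvision.models import mobilenet_v2

# mobilenet_v2().features is a Sequential of an input conv, 17 inverted-residual
# blocks and a final 1x1 conv; slicing it yields the four backbone chunks.
feats = mobilenet_v2(num_classes=100).features
chunks = nn.ModuleList([
    feats[:4],     # chunk 1: input conv + first inverted-residual stages
    feats[4:7],    # chunk 2
    feats[7:14],   # chunk 3
    feats[14:],    # chunk 4: final stages + last 1x1 conv
])
```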
In addition, this embodiment likewise provides a self-knowledge distillation system based on online collaboration and fusion, comprising a microprocessor and a memory connected to each other, the microprocessor being programmed or configured to execute the steps of the above self-knowledge distillation method based on online collaboration and fusion, and a computer-readable storage medium storing computer program instructions that, when executed by a computer device, implement the aforementioned method.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-readable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer-implemented process, such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
The above description covers only preferred embodiments of the present invention, and the scope of protection of the present invention is not limited to the above embodiments; all technical solutions falling under the inventive concept belong to the scope of protection of the present invention. It should be noted that modifications and adaptations made by those skilled in the art without departing from the principles of the present invention should also be regarded as within the scope of protection of the present invention.

Claims (9)

1. A self-knowledge distillation method based on online cooperation and fusion, characterized by comprising the step of carrying out image classification training or application by adopting a self-knowledge distillation network based on online cooperation and fusion, wherein the self-knowledge distillation network based on online cooperation and fusion comprises a feature extraction network, a feature fusion network and a classifier network which are connected with each other; the feature extraction network comprises a backbone network and n auxiliary branch networks designed based on an attention mechanism; the backbone network comprises n+1 convolution chunks with learnable parameters for respectively extracting feature maps, the feature map output by any i-th stage convolution chunk serving as the input of the i-th auxiliary branch network; the outputs of the n auxiliary branch networks and of the (n+1)-th convolution chunk are fed into the feature fusion network for generating a fusion feature map; the feature maps output by the backbone network, the n auxiliary branch networks and the feature fusion network are sent to the classifier network to obtain the corresponding sample class prediction probabilities, and the classifier network is used for fusing the classification prediction probabilities obtained by separately classifying the features output by the feature extraction network and the feature fusion network, thereby obtaining a fusion prediction probability; a diversity regularization term $L_{div}$ for slowing down the homogenization phenomenon and improving the feature fusion quality of the feature fusion network and the prediction fusion quality of the classifier network is arranged between adjacent auxiliary branch networks, and the functional expression of the diversity regularization term $L_{div}$ is:

$$L_{div} = -\sum_{j=1}^{n-1} \left\| F_j - F_{j+1} \right\|_2^2 \qquad (1)$$

where n is the number of auxiliary branch networks, $F_j$ is the feature map output by the j-th auxiliary branch network, and $F_{j+1}$ is the feature map output by the (j+1)-th auxiliary branch network.
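A minimal PyTorch sketch of equation (1) follows; the function name and the assumption that the branch feature maps already share a common shape are ours:

```python
import torch

def diversity_loss(branch_feats):
    """Equation (1): L_div = -sum_j ||F_j - F_{j+1}||_2^2.

    branch_feats: list of n feature maps, each of shape (B, C, H, W); the
    down-sampling modules are assumed to have already matched their shapes.
    """
    loss = branch_feats[0].new_zeros(())
    for f_j, f_next in zip(branch_feats[:-1], branch_feats[1:]):
        # The minus sign means minimizing this term pushes adjacent
        # branch features apart, counteracting homogenization.
        loss = loss - (f_j - f_next).pow(2).sum()
    return loss
```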
2. The self-knowledge distillation method based on online cooperation and fusion according to claim 1, characterized in that the auxiliary branch network comprises an attention module and a down-sampling layer module connected with each other; the attention module is used for performing feature weight modeling on the input feature map to obtain an updated feature map $F_{i\_At}$, and the down-sampling layer module is used for performing size reduction and channel conversion on the updated feature map $F_{i\_At}$ to obtain the i-th branch feature map $F_{i\text{-}down}$, whose spatial dimensions are consistent with the feature map $F_{n+1}$ output by the (n+1)-th convolution chunk.
3. The self-knowledge distillation method based on online cooperation and fusion according to claim 2, characterized in that the attention module comprises two parallel paths, one computing channel attention and one computing spatial attention, and a feature map updating module. The step of computing spatial attention comprises: (1) reducing the input feature map $F_i$ from the spatial dimension $\mathbb{R}^{W \times H \times C}$ to $\mathbb{R}^{W \times H \times 1}$ by a Conv1×1 convolution to obtain a feature map $F_{i\_s}$, where W is the width, H the height, C the channel count and $\mathbb{R}$ the dimension space; (2) performing average pooling on $F_{i\_s}$ over the H and W dimensions respectively to obtain two one-dimensional global features $F_{i\text{-}sW} \in \mathbb{R}^{W \times 1 \times 1}$ and $F_{i\text{-}sH} \in \mathbb{R}^{1 \times H \times 1}$; normalizing the obtained global features over the H and W dimensions with the Sigmoid activation function, and computing the outer product of the two normalized feature vectors to obtain the spatial attention matrix $A_s \in \mathbb{R}^{W \times H \times 1}$. The step of computing channel attention comprises: (1) reducing the input feature map $F_i$ from the spatial dimension $\mathbb{R}^{W \times H \times C}$ to $\mathbb{R}^{1 \times 1 \times C}$ by an average pooling operation to obtain a feature map $F_{i\_c}$; (2) performing dimension reduction and dimension raising on $F_{i\_c}$ by Conv1×1 convolution to obtain a pre-weight vector $F_{i\text{-}pre} \in \mathbb{R}^{1 \times 1 \times C}$; (3) normalizing the pre-weight vector $F_{i\text{-}pre}$ with the Sigmoid activation function to obtain the final channel attention vector $A_c \in \mathbb{R}^{1 \times 1 \times C}$. The feature map updating module is used for merging the outputs of the two parallel paths into an updated feature map, and the functional expression of the merge is:

$$F_{i\_At} = F_i \cdot A_s \cdot A_c, \qquad F_{i\_At} \in \mathbb{R}^{W \times H \times C}$$

where $F_{i\_At}$ is the updated feature map output by the i-th auxiliary branch network.
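A sketch of this attention module in PyTorch might look as follows, assuming a ReLU between the channel-attention Conv1×1 layers and a reduction ratio of 16 (neither is specified by the claim):

```python
import torch
import torch.nn as nn

class BranchAttention(nn.Module):
    # Sketch of the claim-3 attention module; layer sizes are assumptions.
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.spatial_proj = nn.Conv2d(channels, 1, kernel_size=1)  # C -> 1
        self.channel_mlp = nn.Sequential(                          # dim-reduce, dim-raise
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )

    def forward(self, x):                                  # x: (B, C, H, W)
        # --- spatial attention ---
        s = self.spatial_proj(x)                           # (B, 1, H, W)
        s_h = torch.sigmoid(s.mean(dim=3, keepdim=True))   # pool over W -> (B,1,H,1)
        s_w = torch.sigmoid(s.mean(dim=2, keepdim=True))   # pool over H -> (B,1,1,W)
        a_s = s_h * s_w                                    # outer product -> (B,1,H,W)
        # --- channel attention ---
        c = x.mean(dim=(2, 3), keepdim=True)               # global avg pool -> (B,C,1,1)
        a_c = torch.sigmoid(self.channel_mlp(c))
        # --- feature map update: F_i * A_s * A_c ---
        return x * a_s * a_c
```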
4. The self-knowledge distillation method based on online cooperation and fusion according to claim 2, characterized in that the down-sampling layer module comprises different numbers of down-sampling layers, such that the updated feature map $F_{i\_At}$ output by any i-th auxiliary branch network, after size reduction and channel conversion, yields the i-th branch feature map $F_{i\text{-}down}$ whose spatial dimensions are consistent with the feature map $F_{n+1}$ output by the (n+1)-th convolution chunk. The step of a down-sampling layer performing down-sampling on the input feature map comprises: (1) first feeding the input feature map $F_{i\_At}$, or the feature map output by the previous down-sampling layer, into a first depthwise-separable convolution layer to obtain a feature map $F_{i\text{-}temp} \in \mathbb{R}^{W/2 \times H/2 \times C}$ whose size is halved in the H and W dimensions; (2) performing channel conversion on $F_{i\text{-}temp}$ through one pointwise convolution layer and one depthwise convolution layer to obtain a converted feature map $F_{i\text{-}trans} \in \mathbb{R}^{W/2 \times H/2 \times C_{mid}}$; (3) finally converting the dimensionality of $F_{i\text{-}trans}$ to a specified value $C_{out}$ through one pointwise convolution layer to obtain a feature map $F_{i\text{-}out} \in \mathbb{R}^{W/2 \times H/2 \times C_{out}}$; if the down-sampling layer is the last down-sampling layer, the obtained feature map $F_{i\text{-}out}$ serves as the i-th branch feature map $F_{i\text{-}down}$ obtained after size reduction and channel conversion; where $C_{mid}$ is the number of channels of the obtained feature map $F_{i\text{-}trans}$.
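One plausible PyTorch rendering of a single down-sampling layer is given below; the kernel sizes and the omission of normalization and activation layers are assumptions:

```python
import torch.nn as nn

class DownSampleLayer(nn.Module):
    # Sketch of one claim-4 down-sampling layer; c_mid/c_out are the
    # C_mid/C_out of the claim and must be chosen by the caller.
    def __init__(self, c_in, c_mid, c_out):
        super().__init__()
        # (1) depthwise-separable conv, stride 2: halves H and W
        self.reduce = nn.Sequential(
            nn.Conv2d(c_in, c_in, 3, stride=2, padding=1, groups=c_in),
            nn.Conv2d(c_in, c_in, 1),
        )
        # (2) pointwise + depthwise convolution: channel conversion to C_mid
        self.convert = nn.Sequential(
            nn.Conv2d(c_in, c_mid, 1),
            nn.Conv2d(c_mid, c_mid, 3, padding=1, groups=c_mid),
        )
        # (3) pointwise conv to the target channel count C_out
        self.project = nn.Conv2d(c_mid, c_out, 1)

    def forward(self, x):
        return self.project(self.convert(self.reduce(x)))
```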
5. The self-knowledge distillation method based on online cooperation and fusion according to claim 1, characterized in that the step of the feature fusion network fusing the outputs of the feature extraction network to obtain a fusion feature map $F_{Re}$ with richer semantic information comprises: (1) splicing each i-th branch feature map $F_{i\text{-}down}$ and the feature map $F_{n+1}$ output by the (n+1)-th convolution chunk in the channel dimension to obtain a spliced feature map $F_{cat} \in \mathbb{R}^{W \times H \times 4C}$; (2) passing the spliced feature map $F_{cat}$ through a maximum pooling operation and an average pooling operation respectively to obtain two one-dimensional global feature vectors $V_{max}$ and $V_{avg}$; (3) performing global feature modeling on each of $V_{max}$ and $V_{avg}$ through two Conv1×1 convolution layers, then adding them element by element to obtain a one-dimensional pre-weight vector; (4) normalizing the obtained pre-weight vector with the Sigmoid activation function to obtain the weight vector $U \in \mathbb{R}^{1 \times 1 \times 4C}$, and multiplying the weight vector $U$ and the spliced feature map $F_{cat}$ element by element to obtain the updated spliced feature map $M \in \mathbb{R}^{W \times H \times 4C}$; (5) finally compressing the channels of the updated spliced feature map $M$ to one quarter of the original through a pointwise Conv1×1 convolution and a depthwise-separable convolution to obtain the fusion feature map $F_{Re} \in \mathbb{R}^{W \times H \times C}$ with richer semantic information.
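Fusion steps (1)–(5) could be sketched as follows for n = 3 branches (so the spliced map has 4C channels); whether the two pooling paths share one Conv1×1 stack, the internal reduction ratio, and the order of the compression convolutions are assumptions:

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    # Sketch of the claim-5 fusion network; channels == C of one branch map.
    def __init__(self, channels):
        super().__init__()
        c4 = 4 * channels
        self.mlp_max = nn.Sequential(nn.Conv2d(c4, c4 // 4, 1), nn.ReLU(True),
                                     nn.Conv2d(c4 // 4, c4, 1))
        self.mlp_avg = nn.Sequential(nn.Conv2d(c4, c4 // 4, 1), nn.ReLU(True),
                                     nn.Conv2d(c4 // 4, c4, 1))
        self.compress = nn.Sequential(          # compress 4C -> C
            nn.Conv2d(c4, c4, 3, padding=1, groups=c4),  # depthwise
            nn.Conv2d(c4, channels, 1),                  # pointwise
        )

    def forward(self, feats):                   # list of 4 maps, each (B, C, H, W)
        f_cat = torch.cat(feats, dim=1)         # step 1: (B, 4C, H, W)
        v_max = f_cat.amax(dim=(2, 3), keepdim=True)   # step 2: global max pool
        v_avg = f_cat.mean(dim=(2, 3), keepdim=True)   # step 2: global avg pool
        u = torch.sigmoid(self.mlp_max(v_max) + self.mlp_avg(v_avg))  # steps 3-4: U
        m = f_cat * u                           # step 4: updated spliced map M
        return self.compress(m)                 # step 5: F_Re, (B, C, H, W)
```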
6. The self-knowledge distillation method based on online cooperation and fusion according to claim 1, characterized in that the classifier network comprises n auxiliary classifiers, a backbone network classifier and a fusion classifier; the n auxiliary classifiers correspond one-to-one to the n auxiliary branch networks and are used for classifying according to the i-th branch feature map $F_{i\text{-}down}$ output by the corresponding auxiliary branch network to obtain the i-th branch classification prediction probability; the backbone network classifier is used for classifying according to the feature map $F_{n+1}$ output by the (n+1)-th convolution chunk to obtain the backbone network classification prediction probability; and the fusion classifier is used for classifying according to the feature map $F_{Re}$ output by the feature fusion network to obtain the fusion branch classification prediction probability.
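Each classifier head reduces a feature map to a one-dimensional global vector and maps it to class scores; a minimal sketch, with global average pooling as an assumed reduction:

```python
import torch.nn as nn

class BranchClassifier(nn.Module):
    # Sketch of one classifier head: feature map -> global vector -> class scores.
    def __init__(self, channels, num_classes):
        super().__init__()
        self.fc = nn.Linear(channels, num_classes)

    def forward(self, x):                 # x: (B, C, H, W)
        v = x.mean(dim=(2, 3))            # one-dimensional global feature vector
        return self.fc(v)                 # logits; softmax gives the prediction probability
```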
7. The self-knowledge distillation method based on online cooperation and fusion according to claim 6, characterized in that the image classification training carried out by adopting the self-knowledge distillation network based on online cooperation and fusion further uses a perceptron, the perceptron being configured to fuse the input i-th branch classification prediction probabilities, the backbone network classification prediction probability and the feature fusion branch prediction probability to obtain a fusion prediction probability; when image classification training is carried out by adopting the self-knowledge distillation network based on online cooperation and fusion, each training round comprises the following steps:
S1) inputting the training data into the feature extraction network to output each i-th branch feature map $F_{i\text{-}down}$ and the feature map $F_{n+1}$ of the (n+1)-th convolution chunk, extracting the fusion feature map $F_{Re}$ through the feature fusion network, and passing the fusion feature map $F_{Re}$ through the fusion classifier to obtain the feature fusion branch prediction probability;
S2) taking the i-th branch classification prediction probabilities, the backbone network classification prediction probability and the feature fusion branch prediction probability as the input of a perceptron with learnable parameters, and outputting a one-dimensional fusion prediction probability through the perceptron; if the difference loss between the fusion prediction probability and the real label of the sample reaches the preset error range, judging that the optimal fusion prediction probability is obtained and skipping to execute the next step; otherwise, updating the parameters of the perceptron by gradient descent and skipping to execute step S1) to continue training the perceptron;
S3) taking the optimal fusion prediction probability as the soft target in knowledge distillation, and training the backbone network, the auxiliary branch networks and the feature fusion network based on the training data to realize knowledge transfer and complete the knowledge distillation.
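Steps S2) and S3) alternate perceptron training with distillation of the classifier heads; the sketch below assumes a single linear layer for the perceptron, a temperature T, and unit loss weights, none of which are fixed by the claim:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionPerceptron(nn.Module):
    # Learnable dynamic-ensemble module; a single linear layer over the
    # concatenated head probabilities is an assumption about its structure.
    def __init__(self, num_heads, num_classes):
        super().__init__()
        self.mix = nn.Linear(num_heads * num_classes, num_classes)

    def forward(self, probs):                  # list of (B, num_classes) tensors
        fused = self.mix(torch.cat(probs, dim=1))
        return fused.softmax(dim=1)            # one-dimensional fusion prediction

def distillation_loss(head_logits, fused_prob, labels, T=3.0):
    # S3: every head learns from the frozen fused soft target plus the hard label.
    soft = fused_prob.detach()                 # stop gradients into the perceptron
    kd = sum(F.kl_div(F.log_softmax(z / T, dim=1), soft,
                      reduction='batchmean') * T * T for z in head_logits)
    ce = sum(F.cross_entropy(z, labels) for z in head_logits)
    return ce + kd
```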
8. An online collaboration and fusion based self-knowledge distillation system comprising a microprocessor and a memory connected to each other, wherein the microprocessor is programmed or configured to perform the steps of the online collaboration and fusion based self-knowledge distillation method according to any one of claims 1 to 7.
9. A computer readable storage medium storing computer program instructions for execution by a computer device to perform the method of online collaboration and fusion based self-knowledge distillation according to any one of claims 1 to 7.
CN202210019067.4A 2022-01-10 2022-01-10 Self-knowledge distillation method and system based on online cooperation and fusion Active CN114049527B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210019067.4A CN114049527B (en) 2022-01-10 2022-01-10 Self-knowledge distillation method and system based on online cooperation and fusion


Publications (2)

Publication Number Publication Date
CN114049527A CN114049527A (en) 2022-02-15
CN114049527B true CN114049527B (en) 2022-06-14

Family

ID=80213479





Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant