CN114049527B - Self-knowledge distillation method and system based on online cooperation and fusion - Google Patents

Self-knowledge distillation method and system based on online cooperation and fusion

Info

Publication number
CN114049527B
CN114049527B (application CN202210019067.4A)
Authority
CN
China
Prior art keywords
network
fusion
feature
branch
feature map
Prior art date
Legal status
Active
Application number
CN202210019067.4A
Other languages
Chinese (zh)
Other versions
CN114049527A
Inventor
李树涛 (Li Shutao)
龙祖祥 (Long Zuxiang)
孙斌 (Sun Bin)
Current Assignee
Hunan University
Original Assignee
Hunan University
Priority date
Filing date
Publication date
Application filed by Hunan University
Priority to CN202210019067.4A
Publication of CN114049527A
Application granted
Publication of CN114049527B
Legal status: Active

Classifications

    • G06F18/241 — Pattern recognition; classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 — Classification based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus false rejection rate
    • G06N20/00 — Machine learning
    • G06N3/045 — Neural networks; architecture, e.g. interconnection topology; combinations of networks
    • G06N3/047 — Probabilistic or stochastic networks
    • G06N3/048 — Activation functions
    • G06N3/08 — Neural networks; learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a self-knowledge distillation method and system based on online cooperation and fusion. The feature extraction network comprises a backbone network and n auxiliary branch networks, the backbone network comprising n+1 convolution chunks. The output of each of the first n convolution chunks serves as the input of an auxiliary branch network; the outputs of the n auxiliary branch networks and of the (n+1)-th convolution chunk serve as the inputs of the feature fusion network; and the output of each network is fed into a classifier network to obtain class prediction probabilities for adaptive fusion. A diversity regularization term L_div is set between adjacent auxiliary branch networks and between the n-th auxiliary branch network and the (n+1)-th convolution chunk to mitigate network homogenization and improve fusion quality. By incorporating L_div, the invention fully exploits the knowledge of each branch and mitigates network homogenization, thereby improving the performance of every network.

Description

Self-knowledge distillation method and system based on online cooperation and fusion
Technical Field
The invention relates to the field of deep learning model compression and acceleration, and in particular to a self-knowledge distillation method and system based on online cooperation and fusion.
Background
Convolutional Neural Networks (CNNs), the most important technique in deep learning, exhibit excellent performance in many tasks. However, to achieve higher accuracy, CNNs keep expanding their channel counts and depths, with a rapid increase in the number of parameters and computations. This is a huge challenge for deploying models on edge devices. In view of the above problems, the prior art proposes a number of model compression and acceleration methods, mainly including network pruning, weight quantization, lightweight network design, and knowledge distillation. (1) As a three-stage method, network pruning needs to pre-train a model, prune unimportant channels according to an importance evaluation, and finally fine-tune to restore performance. This method is very time-consuming; furthermore, even with fine-tuning, pruned networks usually still suffer some performance degradation. (2) Weight quantization reduces the amount of computation and parameters by compressing the bit width of the model weights, so that the model can be deployed on specific hardware. (3) Lightweight network design relies on the experience of the designer and extensive experimentation. (4) Unlike the above methods, knowledge distillation achieves model compression and acceleration through knowledge transfer from a teacher network to a student network. A compact student network learns knowledge from a cumbersome teacher network, e.g., class predictions as soft targets, activation boundaries of feature maps, and intermediate-layer feature maps. The teacher and student networks are trained on the same task, and the knowledge of the teacher network serves as a supervision signal for training the compact student, so that the student network can achieve excellent performance with less resource consumption. However, a cumbersome teacher network must be trained in advance and its inference results must be produced alongside the student network during training. The resource cost of these processes is the final barrier to practical application.
To avoid training an additional teacher network, the prior art proposes self-knowledge distillation methods built on knowledge distillation. Such a method adds auxiliary branch networks to different layers of the backbone network and treats the backbone network as the deepest branch. The knowledge of the deep branches is distilled into the shallow branches, i.e., the deep branches are treated as teacher networks and the shallow branches as student networks. Self-knowledge distillation uses the backbone network as a shared layer for the remaining branches, which is the key to reducing training overhead. Moreover, an appropriate branch network can be selected according to different resource constraints. The multi-branch self-knowledge distillation method not only effectively improves the accuracy of the network but also reduces the training cost to the greatest extent. Nevertheless, it faces the following challenges: (1) knowledge flows only from the deepest branch to the shallow branches. This results in a lack of cooperation between the branches and ignores the positive impact that the knowledge of the shallow branches could have on knowledge distillation. (2) All shallow branches learn from the feature maps and predictions of the deepest branch during training, which may cause homogenization of the networks and limit the improvement of network performance: because all branches learn from the same knowledge source, they may generate similar semantic features and similar prediction-error distributions, so the different branches are not complementary.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: aiming at the technical problems faced by existing multi-branch self-knowledge distillation methods, the invention provides a self-knowledge distillation method and system based on online cooperation and fusion.
To solve the above technical problems, the invention adopts the following technical solution:
A self-knowledge distillation method based on online cooperation and fusion comprises performing image classification training or application with a self-knowledge distillation network based on online cooperation and fusion. The network comprises a feature extraction network, a feature fusion network, and a classifier network that are connected with one another. The feature extraction network comprises a backbone network and n auxiliary branch networks designed on an attention mechanism; the backbone network comprises n+1 convolution chunks with learnable parameters for extracting feature maps. The feature map output by each of the first n convolution chunks serves as the input of the corresponding auxiliary branch network; the outputs of the n auxiliary branch networks and of the (n+1)-th convolution chunk are fed into the feature fusion network, which generates a fused feature map. The feature maps output by the backbone network, the n auxiliary branch networks, and the feature fusion network are each fed into the classifier network to obtain the corresponding sample class prediction probabilities, and the classifier network fuses the classification prediction probabilities obtained from the features output by the feature extraction network and the feature fusion network into a fused prediction probability. A diversity regularization term L_div is set between adjacent auxiliary branch networks and between the n-th auxiliary branch network and the (n+1)-th convolution chunk to mitigate homogenization and to improve the feature fusion quality of the feature fusion network and the prediction fusion quality of the classifier network.
Optionally, the diversity regularization term L_div has the functional expression:

$$L_{div} = -\sum_{j=1}^{n} \left\| F_j - F_{j+1} \right\|_2^2 \tag{1}$$

where n is the number of auxiliary branch networks, F_j is the feature map output by the j-th auxiliary branch network, and F_{j+1} is the feature map output by the (j+1)-th auxiliary branch network (with F_{n+1} taken as the feature map output at the end of the backbone network).
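As an illustration, Eq. (1) reduces to a few lines of code. The following PyTorch sketch assumes the branch outputs, with the backbone's end feature map appended as F_{n+1}, have already been brought to a common shape (which the down-sampling modules described below ensure):

```python
import torch

def diversity_loss(branch_feats):
    """L_div = -sum_{j=1}^{n} ||F_j - F_{j+1}||_2^2 (Eq. 1).

    branch_feats: list [F_1, ..., F_{n+1}] of tensors with identical
    shape, e.g. (B, C, H, W). Minimizing the negated sum maximizes the
    Euclidean distance between adjacent branch outputs, which is what
    discourages homogenization.
    """
    loss = torch.zeros((), dtype=branch_feats[0].dtype,
                       device=branch_feats[0].device)
    for f_j, f_next in zip(branch_feats[:-1], branch_feats[1:]):
        loss = loss - torch.sum((f_j - f_next) ** 2)
    return loss
```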
Optionally, each auxiliary branch network comprises an attention module and a down-sampling layer module connected with each other. The attention module performs feature-weight modeling on the input feature map to obtain an updated feature map F_{i_At}; the down-sampling layer module performs size reduction and channel conversion on F_{i_At} to obtain the i-th branch feature map F_{i-down}, whose spatial dimensions are consistent with the feature map F_{n+1} output by the (n+1)-th convolution chunk.
Optionally, the attention module comprises two parallel paths, one computing channel attention and the other computing spatial attention, together with a feature-map update module. The spatial-attention path works as follows: (1) reduce the input feature map F_i from ℝ^{W×H×C} to ℝ^{W×H×1} through a Conv1×1 convolution, obtaining a feature map F_{i_s}, where W is the width, H the height, C the channel count, and ℝ the dimension space; (2) apply average pooling to F_{i_s} along the H and W dimensions respectively to obtain two one-dimensional global features F_{i-sW} ∈ ℝ^{W×1×1} and F_{i-sH} ∈ ℝ^{1×H×1}; (3) normalize the two global features with the Sigmoid activation function and compute the outer product of the two normalized feature vectors, obtaining the spatial attention matrix A_s ∈ ℝ^{W×H×1}. The channel-attention path works as follows: (1) reduce F_i from ℝ^{W×H×C} to ℝ^{1×1×C} through an average pooling operation, obtaining a feature map F_{i_c}; (2) apply a dimension reduction followed by a dimension expansion to F_{i_c} through Conv1×1 convolutions, obtaining a pre-weight vector F_{i-pre} ∈ ℝ^{1×1×C}; (3) normalize F_{i-pre} with the Sigmoid activation function, obtaining the final channel attention vector A_c ∈ ℝ^{1×1×C}. The feature-map update module merges the outputs of the two parallel paths into the updated feature map according to:

$$F_{i\_At} = F_i \odot A_s \odot A_c, \qquad F_{i\_At} \in \mathbb{R}^{W \times H \times C}$$

where ⊙ denotes element-wise multiplication with broadcasting, and F_{i_At} is the updated feature map output by the i-th auxiliary branch network.
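A minimal PyTorch sketch of this attention module follows, using the (B, C, H, W) tensor layout. The reduction ratio r of the channel path is an illustrative assumption, since the text only specifies a dimension reduction followed by an expansion:

```python
import torch
import torch.nn as nn

class BranchAttention(nn.Module):
    """Parallel spatial- and channel-attention paths, merged as
    F_i_At = F_i * A_s * A_c (element-wise, with broadcasting)."""

    def __init__(self, channels, r=16):
        super().__init__()
        # spatial path: Conv1x1 reduces C channels to 1
        self.spatial_proj = nn.Conv2d(channels, 1, kernel_size=1)
        # channel path: reduce then expand (reduction ratio r is assumed)
        self.channel_mlp = nn.Sequential(
            nn.Conv2d(channels, max(channels // r, 1), kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(max(channels // r, 1), channels, kernel_size=1),
        )

    def forward(self, x):                                 # x: (B, C, H, W)
        s = self.spatial_proj(x)                          # (B, 1, H, W)
        s_h = torch.sigmoid(s.mean(dim=3, keepdim=True))  # pool over W: (B, 1, H, 1)
        s_w = torch.sigmoid(s.mean(dim=2, keepdim=True))  # pool over H: (B, 1, 1, W)
        a_s = s_h * s_w                                   # outer product: (B, 1, H, W)
        c = x.mean(dim=(2, 3), keepdim=True)              # global avg pool: (B, C, 1, 1)
        a_c = torch.sigmoid(self.channel_mlp(c))          # channel attention A_c
        return x * a_s * a_c                              # updated map F_i_At
```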
Optionally, the down-sampling layer module comprises a number of down-sampling layers chosen so that, after size reduction and channel conversion of the updated feature map F_{i_At} output by any i-th auxiliary branch network, the resulting i-th branch feature map F_{i-down} has spatial dimensions consistent with the feature map F_{n+1} output by the (n+1)-th convolution chunk. Each down-sampling layer processes its input as follows: (1) the input feature map F_{i_At}, or the feature map output by the previous down-sampling layer, is fed into a depthwise separable convolution, yielding a feature map F_{i-temp} ∈ ℝ^{W/2×H/2×C} whose size is halved in both the H and W dimensions; (2) F_{i-temp} undergoes channel conversion through one layer of pointwise convolution and one layer of depthwise convolution, yielding the converted feature map F_{i-trans} ∈ ℝ^{W/2×H/2×C_mid}; (3) finally, one layer of pointwise convolution converts the channel dimension of F_{i-trans} to the specified value C_out, yielding F_{i-out} ∈ ℝ^{W/2×H/2×C_out}. If the down-sampling layer is the last one, F_{i-out} serves as the i-th branch feature map F_{i-down} obtained after size reduction and channel conversion.
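Under the same conventions, one down-sampling layer might look as follows; the intermediate width C_mid is left as a constructor argument because the text does not fix its value, and the halving step is modeled as a single stride-2 depthwise convolution:

```python
import torch.nn as nn

class DownsampleLayer(nn.Module):
    """One down-sampling layer: halve H and W, then convert channels."""

    def __init__(self, c_in, c_mid, c_out):
        super().__init__()
        # (1) halve H and W (modeled as a stride-2 depthwise conv)
        self.reduce = nn.Conv2d(c_in, c_in, kernel_size=3, stride=2,
                                padding=1, groups=c_in)
        # (2) pointwise + depthwise convs convert the channels to C_mid
        self.convert = nn.Sequential(
            nn.Conv2d(c_in, c_mid, kernel_size=1),
            nn.Conv2d(c_mid, c_mid, kernel_size=3, padding=1, groups=c_mid),
        )
        # (3) pointwise conv converts the channels to the target C_out
        self.project = nn.Conv2d(c_mid, c_out, kernel_size=1)

    def forward(self, x):
        return self.project(self.convert(self.reduce(x)))
```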
Optionally, the feature fusion network fuses the outputs of the feature extraction network into a fused feature map F_Re with richer semantic information through the following steps: (1) concatenate each i-th branch feature map F_{i-down} and the feature map F_{n+1} output by the (n+1)-th convolution chunk along the channel dimension, obtaining the concatenated feature map F_cat ∈ ℝ^{W×H×4C}; (2) apply a max pooling operation and an average pooling operation to F_cat, obtaining two one-dimensional global feature vectors V_max and V_avg; (3) model the global features of V_max and V_avg through two layers of Conv1×1 convolution each, then add the results element-wise to obtain a one-dimensional pre-weight vector; (4) normalize the pre-weight vector with the Sigmoid activation function to obtain the weight vector U ∈ ℝ^{1×1×4C}, and multiply U and F_cat element-wise to obtain the updated concatenated feature map M ∈ ℝ^{W×H×4C}; (5) finally, compress the channels of M to one quarter of the original through a Conv1×1 depthwise separable convolution and a pointwise convolution, obtaining the fused feature map F_Re ∈ ℝ^{W×H×C} with richer semantic information.
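A sketch of these five steps for n = 3 (four input maps of C channels each) is given below. Whether the two global vectors share the two Conv1×1 layers or use separate ones is not fixed by the text; a shared bottleneck with an assumed reduction factor is used here:

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Channel-attention fusion of the branch maps into F_Re (CSFM sketch)."""

    def __init__(self, channels, n_maps=4):
        super().__init__()
        c_cat = channels * n_maps                 # 4C for four input maps
        self.mlp = nn.Sequential(                 # two Conv1x1 layers, shared
            nn.Conv2d(c_cat, c_cat // 4, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(c_cat // 4, c_cat, kernel_size=1),
        )
        self.compress = nn.Sequential(            # depthwise + pointwise: 4C -> C
            nn.Conv2d(c_cat, c_cat, kernel_size=3, padding=1, groups=c_cat),
            nn.Conv2d(c_cat, channels, kernel_size=1),
        )

    def forward(self, feats):                     # feats: list of (B, C, H, W)
        f_cat = torch.cat(feats, dim=1)           # (B, 4C, H, W)
        v_max = torch.amax(f_cat, dim=(2, 3), keepdim=True)   # V_max
        v_avg = f_cat.mean(dim=(2, 3), keepdim=True)          # V_avg
        u = torch.sigmoid(self.mlp(v_max) + self.mlp(v_avg))  # weight vector U
        return self.compress(f_cat * u)           # fused map F_Re: (B, C, H, W)
```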
Optionally, the classifier network comprises n auxiliary classifiers and a backbone network classifier. The n auxiliary classifiers correspond one-to-one with the n auxiliary branch networks and classify the i-th branch feature map F_{i-down} output by each auxiliary branch network to obtain the i-th branch classification prediction probability; the backbone network classifier classifies the feature map F_{n+1} output by the (n+1)-th convolution chunk to obtain the backbone network classification prediction probability.
Optionally, when image classification training is performed with the self-knowledge distillation network based on online cooperation and fusion, the method further comprises using a perceptron to combine the input i-th branch classification prediction probabilities, the backbone network classification prediction probability, and the feature fusion branch prediction probability into a fused prediction probability. When the self-knowledge distillation network based on online cooperation and fusion is used for image classification training, each training round comprises the following steps:
S1) input the training data into the feature extraction network; from the output i-th branch feature maps F_{i-down} and the feature map F_{n+1} output by the (n+1)-th convolution chunk, extract the fused feature map F_Re through the feature fusion network, and feed F_Re through the fusion classifier to obtain the feature fusion branch prediction probability;
S2) take the i-th branch classification prediction probabilities, the backbone network classification prediction probability, and the feature fusion branch prediction probability as the inputs of a perceptron with learnable parameters, and output a one-dimensional fused prediction probability through the perceptron; if the difference loss between the fused prediction probability and the true label of the sample falls within a preset error range, judge that the optimal fused prediction probability has been obtained and jump to the next step; otherwise, update the parameters of the perceptron by gradient descent and jump back to step S1) to continue training the perceptron;
S3) take the optimal fused prediction probability as the soft target in knowledge distillation, and train the backbone network, the auxiliary branch networks, and the feature fusion network on the training data, thereby realizing knowledge transfer and completing knowledge distillation.
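A heavily simplified sketch of one training round is shown below; `model`, `fusion_head`, and `perceptron` are placeholder interfaces, the classifier outputs are treated as logits, and the alternation between S2 (training the perceptron to convergence) and S3 is collapsed into one joint step for brevity:

```python
import torch
import torch.nn.functional as F

def training_round(model, fusion_head, perceptron, images, labels, T=3.0):
    """One training round following S1)-S3), heavily simplified.

    model(images) is assumed to return the branch/backbone logits
    [P_1..P_4] plus the branch feature maps; fusion_head produces the
    fusion-branch logits P_5 from those maps; perceptron holds the
    learnable fusion weights.
    """
    logits, branch_feats = model(images)                  # S1: P_1..P_4 and maps
    logits = list(logits) + [fusion_head(branch_feats)]   # append P_5
    fused = perceptron(logits)                            # S2: fused prediction E
    # hard-label (cross-entropy) losses for every classifier and for E
    loss = sum(F.cross_entropy(p, labels) for p in logits)
    loss = loss + F.cross_entropy(fused, labels)
    # S3: distill the fused prediction, as a fixed soft target, into each branch
    soft_target = F.softmax(fused.detach() / T, dim=1)
    for p in logits:
        loss = loss + F.kl_div(F.log_softmax(p / T, dim=1),
                               soft_target, reduction="batchmean")
    return loss
```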
In addition, the invention also provides a self-knowledge distillation system based on online collaboration and fusion, comprising a microprocessor and a memory connected with each other, the microprocessor being programmed or configured to execute the steps of the online collaboration and fusion based self-knowledge distillation method.
In addition, the invention also provides a computer-readable storage medium storing a computer program to be executed by a computer device to implement the steps of the online collaboration and fusion based self-knowledge distillation method.
Compared with the prior art, the invention has the following advantages:
1. The feature extraction network of the invention comprises a backbone network and n auxiliary branch networks designed on an attention mechanism; the backbone network comprises n+1 convolution chunks with learnable parameters for extracting feature maps, and the feature map output by each of the first n convolution chunks serves as the input of an auxiliary branch network. By adding auxiliary branch networks to different layers of the backbone network and regarding the backbone network as the deepest branch, the knowledge of the deep branches is distilled into the shallow branches. Because the different branches share the layers of the backbone network and the shallow branches learn from the outputs of the deep branches, model compression and acceleration can be realized.
2. The diversity regularization term of the invention can effectively avoid homogenization between networks and enriches the knowledge sources for feature fusion and prediction fusion; the feature fusion network and the perceptron can then fully exploit the knowledge of the different branches to construct a stronger knowledge source for guiding the training of some or all of the networks in the online cooperation and fusion based self-knowledge distillation network, such as the auxiliary branch networks, the backbone network, and the feature fusion network.
3. The method of the invention provides multiple networks with different computation and storage costs for practical application, and a user can deploy a network with suitable accuracy, computation, and storage consumption according to the actual environmental constraints. The method realizes online collaborative learning among the networks through feature fusion and prediction fusion, and can reduce network parameters and computation without affecting performance, which is important for practical, resource-constrained scenarios.
Drawings
FIG. 1 is a schematic structural diagram of a self-knowledge distillation network based on online collaboration and fusion in an embodiment of the present invention.
Fig. 2 is a schematic structural diagram of an attention module according to an embodiment of the present invention.
Fig. 3 is a schematic structural diagram of a downsampling layer module according to an embodiment of the present invention.
Fig. 4 is a diagram of a feature fusion network structure provided in the present invention.
FIG. 5 is a schematic diagram illustrating the training of the self-knowledge distillation network based on online collaboration and fusion according to an embodiment of the present invention.
FIG. 6 is a comparative illustration of experimental results in the examples of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the present invention is further described in detail below with reference to the accompanying drawings.
The first embodiment is as follows:
The self-knowledge distillation method based on online cooperation and fusion comprises performing image classification training or application with a self-knowledge distillation network based on online cooperation and fusion. As shown in FIG. 1, the self-knowledge distillation network based on online cooperation and fusion comprises a feature extraction network, a feature fusion network (CSFM), and a classifier network that are connected with one another. The feature extraction network comprises a backbone network and n auxiliary branch networks; the backbone network comprises n+1 convolution chunks with learnable parameters. The feature map output by each of the first n convolution chunks serves as the input of the corresponding auxiliary branch network; the outputs of the n auxiliary branch networks and of the (n+1)-th convolution chunk are fed into the feature fusion network, which generates a fused feature map. The feature maps output by the backbone network, the n auxiliary branch networks, and the feature fusion network are each fed into the classifier network to obtain the corresponding sample class prediction probabilities, and the classifier network fuses the classification prediction probabilities obtained from the features output by the feature extraction network and the feature fusion network into a fused prediction probability. A diversity regularization term L_div is set between adjacent auxiliary branch networks and between the n-th auxiliary branch network and the (n+1)-th convolution chunk to mitigate homogenization and to improve the feature fusion quality of the feature fusion network and the prediction fusion quality of the classifier network.
As an optional implementation, in this embodiment n = 3, i.e., the backbone network comprises 4 convolution chunks with learnable parameters, so the feature extraction network comprises 3 auxiliary branch networks designed on an attention mechanism, denoted the first auxiliary branch, the second auxiliary branch, and the third auxiliary branch. Referring to FIG. 1, diversity regularization terms L_div are set between the first and second auxiliary branches, between the second and third auxiliary branches, and between the third auxiliary branch and the backbone network, to mitigate homogenization among the networks and to induce the different branch networks to generate diversified features for the subsequent feature-map fusion and prediction fusion. In this embodiment, the diversity regularization term L_div has the functional expression:
$$L_{div} = -\sum_{j=1}^{n} \left\| F_j - F_{j+1} \right\|_2^2 \tag{1}$$
where n is the number of auxiliary branch networks, F_j is the feature map output by the j-th auxiliary branch network, and F_{j+1} is the feature map output by the (j+1)-th auxiliary branch network. In this embodiment the feature maps output by the auxiliary branch networks comprise the first branch feature map F_{1-down}, the second branch feature map F_{2-down}, and the third branch feature map F_{3-down}; together with the feature map F_4 output at the end of the backbone network, they form the input feature set {F_1, F_2, F_3, F_4}. The similarity between feature maps is described by the Euclidean distance between the output feature maps of adjacent networks, and the diversity constraint is realized by maximizing this Euclidean distance during network training; the negated distance therefore serves as the diversity regularization term L_div in this embodiment.
The core modules of the backbone network are the n+1 convolution chunks with learnable parameters used for extracting feature maps. As an optional implementation, in this embodiment the backbone network additionally comprises two layers before the n+1 convolution chunks: the first layer is a convolution layer for channel expansion, and the second layer is a pooling layer for reducing resolution.
In this embodiment, the backbone network specifically adopts a ResNet network, in which the first layer (the convolution layer for channel expansion) is a Conv7×7 convolution layer, and the remaining pooling layer and convolution chunks all follow the original structure of the ResNet network. Among the n+1 convolution chunks with learnable parameters of the backbone network, the feature map extracted by the i-th chunk is F_i; in this embodiment the feature maps extracted by the four convolution chunks are F_1, F_2, F_3, and F_4.
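As a concrete illustration, the four feature maps F_1-F_4 can be exposed from a standard torchvision ResNet as below; ResNet-18 and the `weights=None` argument are illustrative choices, not prescribed by the embodiment:

```python
import torch
import torchvision

class ResNetBackbone(torch.nn.Module):
    """Expose the four convolution chunks of a standard ResNet.

    Uses torchvision's resnet layout (conv1/maxpool stem, layer1..layer4);
    any other four-stage CNN with the same interface would do.
    """
    def __init__(self):
        super().__init__()
        net = torchvision.models.resnet18(weights=None)
        self.stem = torch.nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.chunks = torch.nn.ModuleList(
            [net.layer1, net.layer2, net.layer3, net.layer4])

    def forward(self, x):
        x = self.stem(x)
        feats = []                      # F_1, F_2, F_3, F_4
        for chunk in self.chunks:
            x = chunk(x)
            feats.append(x)
        return feats
```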
In this embodiment, each auxiliary branch network comprises an attention module and a down-sampling layer module connected with each other. The attention module performs feature-weight modeling on the input feature map to obtain an updated feature map F_{i_At}, and the down-sampling layer module performs size reduction and channel conversion on F_{i_At} to obtain the i-th branch feature map F_{i-down}, whose spatial dimensions are consistent with the feature map F_{n+1} output by the (n+1)-th convolution chunk.
As shown in FIG. 2, the attention module in this embodiment comprises two parallel paths, one computing channel attention and the other computing spatial attention, together with a feature-map update module. The spatial-attention path works as follows: (1) reduce the input feature map F_i from ℝ^{W×H×C} to ℝ^{W×H×1} through a Conv1×1 convolution, obtaining a feature map F_{i_s}, where W is the width, H the height, C the channel count, and ℝ the dimension space; (2) apply average pooling to F_{i_s} along the H and W dimensions respectively to obtain two one-dimensional global features F_{i-sW} ∈ ℝ^{W×1×1} and F_{i-sH} ∈ ℝ^{1×H×1}; (3) normalize the two global features with the Sigmoid activation function and compute the outer product of the two normalized feature vectors, obtaining the spatial attention matrix A_s ∈ ℝ^{W×H×1}. The channel-attention path works as follows: (1) reduce F_i from ℝ^{W×H×C} to ℝ^{1×1×C} through an average pooling operation, obtaining a feature map F_{i_c}; (2) apply a dimension reduction followed by a dimension expansion to F_{i_c} through Conv1×1 convolutions, obtaining a pre-weight vector F_{i-pre} ∈ ℝ^{1×1×C}; (3) normalize F_{i-pre} with the Sigmoid activation function, obtaining the final channel attention vector A_c ∈ ℝ^{1×1×C}. The feature-map update module merges the outputs of the two parallel paths into the updated feature map according to:

$$F_{i\_At} = F_i \odot A_s \odot A_c, \qquad F_{i\_At} \in \mathbb{R}^{W \times H \times C}$$

where F_{i_At} is the updated feature map output by the i-th auxiliary branch network.
As shown in FIG. 3, the down-sampling layer module in this embodiment comprises a number of down-sampling layers chosen so that, after size reduction and channel conversion of the updated feature map F_{i_At} output by any i-th auxiliary branch network, the resulting i-th branch feature map F_{i-down} has spatial dimensions consistent with the feature map F_{n+1} output by the (n+1)-th convolution chunk. Each down-sampling layer processes its input as follows: (1) the input feature map F_{i_At}, or the feature map output by the previous down-sampling layer, is fed into a depthwise separable convolution, yielding a feature map F_{i-temp} ∈ ℝ^{W/2×H/2×C} whose size is halved in both the H and W dimensions; (2) F_{i-temp} undergoes channel conversion through one layer of pointwise convolution and one layer of depthwise convolution, yielding the converted feature map F_{i-trans} ∈ ℝ^{W/2×H/2×C_mid}; (3) finally, one layer of pointwise convolution converts the channel dimension of F_{i-trans} to the specified value C_out, yielding F_{i-out} ∈ ℝ^{W/2×H/2×C_out}. If the down-sampling layer is the last one, F_{i-out} serves as the i-th branch feature map F_{i-down} obtained after size reduction and channel conversion.
The attention module designed in this embodiment comprises a channel-attention modeling path and a spatial-attention modeling path, and the down-sampling layer module comprises a preset number of depthwise separable convolutions and pointwise convolutions. Auxiliary branch networks, namely the first, second, and third auxiliary branch networks, are added at the ends of the first three convolution chunks, and the feature maps output by the convolution layers at the ends of those chunks are fed into the corresponding auxiliary branch networks. The attention module models channel and spatial importance for the feature map fed into each auxiliary branch network and updates the feature map adaptively, giving greater weight to important features that benefit subsequent tasks and yielding the updated feature map. The updated features are fed into the down-sampling layers for resolution down-sampling and channel-count conversion, yielding the first, second, and third branch feature maps, which are consistent in spatial dimension with the feature map output by the convolution layer at the end of the backbone network. The process by which the auxiliary branch networks generate feature maps is as follows: the feature maps F_1, F_2, F_3 output by the 1st, 2nd, and 3rd convolution chunks of the backbone network are fed into the first, second, and third auxiliary branch networks respectively; the attention module of each auxiliary branch network first performs feature-weight modeling on the input feature maps F_1, F_2, F_3 to obtain the updated feature maps F_{1_At}, F_{2_At}, F_{3_At}; then F_{1_At}, F_{2_At}, F_{3_At} are fed into the down-sampling layers of each auxiliary branch network for size reduction and channel conversion, yielding the first branch feature map F_{1-down}, the second branch feature map F_{2-down}, and the third branch feature map F_{3-down}, whose spatial dimensions are consistent with the feature map F_4 output by convolution chunk 4. When the attention module performs feature-weight modeling on the input feature maps F_1, F_2, F_3, it contains two parallel paths, the upper path computing the channel attention and the lower path computing the spatial attention. One down-sampling layer can only halve the input feature map in the H and W sizes and convert the input channels to C_out; therefore, to make the dimensions of the feature maps finally output by all auxiliary branches consistent with the feature map output by the last layer of the backbone network, the down-sampling layer module uses a setting N to adjust the number of down-sampling layers. Specifically, the number of down-sampling layers N of the first auxiliary branch network is 3, that of the second auxiliary branch network is 2, and that of the third auxiliary branch network is 1, as assembled in the sketch below; in this way the branch feature maps F_{1-down}, F_{2-down}, F_{3-down} match the spatial dimensions of the feature map F_4 output by convolution chunk 4.
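Reusing the BranchAttention and DownsampleLayer sketches from above, the three auxiliary branches with N = 3, 2, 1 down-sampling layers could be assembled as follows; the channel-doubling schedule and the choice C_mid = C are assumptions, chosen so the branch outputs match a ResNet18-style backbone (64/128/256/512 channels):

```python
import torch.nn as nn

def make_branch(c_in, c_out, num_down):
    """Auxiliary branch i: attention followed by N stacked down-sampling
    layers, so the branch output matches the spatial size and channel
    count of the backbone's final map F_4."""
    layers = [BranchAttention(c_in)]                    # from the earlier sketch
    c = c_in
    for k in range(num_down):
        c_next = c_out if k == num_down - 1 else c * 2  # assumed doubling schedule
        layers.append(DownsampleLayer(c, c, c_next))    # C_mid = C here (assumption)
        c = c_next
    return nn.Sequential(*layers)

# N = 3, 2, 1 for the first, second, and third auxiliary branches:
branches = nn.ModuleList([
    make_branch(64, 512, num_down=3),
    make_branch(128, 512, num_down=2),
    make_branch(256, 512, num_down=1),
])
```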
In this embodiment, the classifier network comprises n auxiliary classifiers and a backbone network classifier. The n auxiliary classifiers correspond one-to-one with the n auxiliary branch networks and classify the i-th branch feature map F_{i-down} output by each auxiliary branch network to obtain the i-th branch classification prediction probability; the backbone network classifier classifies the feature map F_{n+1} output by the (n+1)-th convolution chunk to obtain the backbone network classification prediction probability.
Referring to FIG. 4, in this embodiment, image classification training with the self-knowledge distillation network based on online cooperation and fusion further uses a feature fusion network (CSFM), a fusion classifier, and a perceptron. The feature fusion network fuses the outputs of the feature extraction network into a fused feature map F_Re with richer semantic information; the fusion classifier classifies the fused feature map F_Re output by the feature fusion network to obtain the feature fusion branch prediction probability; and the perceptron combines the input i-th branch classification prediction probabilities, the backbone network classification prediction probability, and the feature fusion branch prediction probability into a fused prediction probability. When the self-knowledge distillation network based on online cooperation and fusion is used for image classification training, each training round comprises the following steps:
S1) input the training data into the feature extraction network; from the output i-th branch feature maps F_{i-down} and the feature map F_{n+1} output by the (n+1)-th convolution chunk, extract the fused feature map F_Re through the feature fusion network, and feed F_Re through the fusion classifier to obtain the feature fusion branch prediction probability. In this embodiment, the first branch feature map F_{1-down}, the second branch feature map F_{2-down}, the third branch feature map F_{3-down}, the feature map F_4 output by the convolution layer at the end of the backbone network, and the fused feature map F_Re are fed into global pooling layers to obtain 5 one-dimensional global feature vectors V_Global_i, i = 1, 2, 3, 4, 5. V_Global_1 is fed into auxiliary classifier 1, V_Global_2 into auxiliary classifier 2, V_Global_3 into auxiliary classifier 3, V_Global_4 into the backbone network classifier, and V_Global_5 into the fusion classifier, yielding the first branch prediction probability P_1, the second branch prediction probability P_2, the third branch prediction probability P_3, the backbone network prediction probability P_4, and the feature fusion branch prediction probability P_5.
The training data are collected in advance, and their processing comprises: collecting samples for network training and preprocessing them, dividing the preprocessed samples into a training set and a validation set, and placing the samples in the same folder by category for convenient subsequent loading. The training set is used for training, and the validation set is used for verifying the trained network.
S2) take one or more of the i-th branch classification prediction probabilities, the backbone network classification prediction probability, and the feature fusion branch prediction probability as the inputs of a perceptron with learnable parameters, and output a one-dimensional fused prediction probability E through the perceptron; if the difference loss between the fused prediction probability and the true label of the sample falls within a preset error range, judge that the optimal fused prediction probability E has been obtained and jump to the next step; otherwise, update the parameters of the perceptron by gradient descent and jump back to step S1) to continue training the perceptron.
In this embodiment, the difference loss between the prediction probabilities (one or more of the i-th branch classification prediction probabilities, the backbone network classification prediction probability, and the feature fusion branch prediction probability) and the true label of the sample is a cross-entropy function, computed as:

$$L_{CE} = -\sum_{i=1}^{5} y \log\big(\mathrm{softmax}(P_i)\big)$$

where L_CE denotes the difference loss between the prediction probabilities and the true label of the sample (the larger the value of L_CE, the greater the difference between a prediction probability and the true label); y denotes the true label of the sample; P_i is one of the first branch prediction probability P_1, the second branch prediction probability P_2, the third branch prediction probability P_3, the backbone network prediction probability P_4, and the feature fusion branch prediction probability P_5; softmax is the activation function used to normalize the prediction probabilities; and log is the natural logarithm. Specifically, in this embodiment the first, second, and third branch feature maps, the feature map output by the convolution layer at the end of the backbone network, and the fused feature map are pooled by global average pooling into 5 one-dimensional global feature vectors; the 5 vectors are fed into the classifiers of the corresponding networks to obtain the prediction probabilities of the different branches for the sample, namely the first, second, and third branch prediction probabilities, the backbone network prediction probability, and the feature fusion branch prediction probability; and the difference loss between each of the 5 prediction probabilities and the true label is computed. As an optional implementation, in this embodiment the third branch prediction probability P_3, the backbone network prediction probability P_4, and the feature fusion branch prediction probability P_5 are taken as the inputs of the perceptron with learnable parameters: a perceptron with 3 learnable parameters is initialized, with fusion weight parameters β_1, β_2, β_3; feeding P_3, P_4, P_5 into the perceptron yields the fused prediction probability E = β^T [P_3, P_4, P_5]; and the difference loss between the fused prediction probability and the true label of the input sample is computed and driven to its minimum by gradient descent. That is, the functional expression of the difference loss between the fused prediction probability and the true label of the sample is:
$$\min_{\beta}\; L_{CE} = -\, y \log\big(\mathrm{softmax}(\beta^{T}[P_3, P_4, P_5])\big)$$

$$\text{subject to}\;\; \sum_{k=1}^{3} \beta_k = 1, \qquad \beta_k \ge 0$$
where L_CE denotes the difference loss between the fused prediction probability (built here from the third branch, backbone network, and feature fusion branch classification prediction probabilities) and the true label of the sample; y denotes the true label of the sample; and β_k is the k-th fusion weight parameter of the perceptron, constrained so that all β_k sum to 1 and each β_k is greater than or equal to zero. The β^T that minimizes the loss is the optimal fusion weight parameter.
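One way to realize the constrained fusion weights is to parameterize β through a softmax, which enforces Σβ_k = 1 and β_k ≥ 0 by construction. This parameterization is an assumption; the patent states only the constraint, not how it is enforced:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionPerceptron(nn.Module):
    """Learnable prediction fusion E = beta^T [P_3, P_4, P_5]."""

    def __init__(self, num_inputs=3):
        super().__init__()
        self.raw = nn.Parameter(torch.zeros(num_inputs))  # unconstrained

    def beta(self):
        # softmax keeps every beta_k >= 0 and makes their sum equal to 1
        return F.softmax(self.raw, dim=0)

    def forward(self, preds):            # preds: list of (B, K) predictions
        stacked = torch.stack(preds)     # (3, B, K)
        return torch.einsum('i,ibk->bk', self.beta(), stacked)
```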
S3) take the optimal fused prediction probability as the soft target in knowledge distillation, and train the backbone network, the auxiliary branch networks, and the feature fusion network on the training data, thereby realizing knowledge transfer and completing knowledge distillation.
In this embodiment, the self-knowledge distillation network based on online cooperation and fusion is used for image classification training, and prediction fusion adopts the adaptive integration method, so that a more robust fused prediction probability is constructed for guiding the learning of each network.
In this embodiment, when the backbone network, the auxiliary branch networks, and the feature fusion network are trained on the training data to realize knowledge transfer and complete knowledge distillation, the loss function is expressed as:
$$L_{KL} = \sum_{i=1}^{5} \mathrm{KL}\big(\sigma(E) \,\big\|\, \sigma(P_i)\big)$$
where E is the optimal fused prediction probability, and P_i is one of the first branch prediction probability P_1, the second branch prediction probability P_2, the third branch prediction probability P_3, the backbone network prediction probability P_4, and the feature fusion branch prediction probability P_5; σ is a normalization function, which can be expressed as:
$$\sigma(x)_k = \frac{\exp(x_k / T)}{\sum_{m'=1}^{m} \exp(x_{m'} / T)}, \qquad x \in \mathbb{R}^{m}$$
where x ∈ ℝ^m and T is a hyper-parameter controlling the degree of numerical smoothing, set to 3 in this embodiment. L_KL describes the degree of similarity between a prediction probability P_i and the fused prediction probability E; the larger the value of L_KL, the greater the difference between the two probability distributions. In network training, minimizing L_KL makes the per-class probability distribution of each i-th network's prediction probability P_i as consistent as possible with the fused prediction probability distribution.
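The normalization σ and the resulting distillation objective can be written compactly as below; the direction of the KL divergence, with the fused prediction E as the reference distribution, follows common knowledge-distillation practice and is an assumption where the text leaves it implicit:

```python
import torch
import torch.nn.functional as F

def sigma(x, T=3.0):
    """Tempered softmax: sigma(x)_k = exp(x_k/T) / sum_m exp(x_m/T); T = 3 here."""
    return F.softmax(x / T, dim=-1)

def l_kl(all_preds, fused, T=3.0):
    """L_KL = sum_i KL(sigma(E) || sigma(P_i)), minimized so every network's
    class distribution approaches the fused prediction distribution."""
    target = sigma(fused, T)
    loss = torch.zeros((), device=fused.device)
    for p in all_preds:
        loss = loss + F.kl_div(torch.log(sigma(p, T)), target,
                               reduction="batchmean")
    return loss
```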
Referring to FIG. 5, in this embodiment the feature fusion network fuses the outputs of the feature extraction network into a fused feature map F_Re with richer semantic information through the following steps: (1) concatenate each i-th branch feature map F_{i-down} and the feature map F_{n+1} output by the (n+1)-th convolution chunk along the channel dimension, obtaining the concatenated feature map F_cat ∈ ℝ^{W×H×4C}; (2) apply a max pooling operation and an average pooling operation to F_cat, obtaining two one-dimensional global feature vectors V_max and V_avg; (3) model the global features of V_max and V_avg through two layers of Conv1×1 convolution each, then add the results element-wise to obtain a one-dimensional pre-weight vector; (4) normalize the pre-weight vector with the Sigmoid activation function to obtain the weight vector U ∈ ℝ^{1×1×4C}, and multiply U and F_cat element-wise to obtain the updated concatenated feature map M ∈ ℝ^{W×H×4C}; (5) finally, compress the channels of M to one quarter of the original through a Conv1×1 pointwise convolution and a depthwise separable convolution, obtaining the fused feature map F_Re ∈ ℝ^{W×H×C} with richer semantic information. Concretely, in this embodiment the first, second, and third branch feature maps and the feature map output by the convolution layer at the end of the backbone network are concatenated along the channel dimension; the concatenated feature map is max-pooled and average-pooled to two one-dimensional global features; the two global features undergo dimension conversion through two Conv1×1 convolution layers each and are added element-wise to form a one-dimensional pre-weight vector; the pre-weight vector is normalized with the Sigmoid activation function and multiplied element-wise with the concatenated feature map on the corresponding channels to update its channel weights; and the channels of the updated concatenated feature map are compressed to one quarter of the original, yielding the fused feature map with richer semantic information.
In this embodiment, the data set CIFAR100 is used to evaluate the knowledge distillation effect; the evaluation index is the classification accuracy, and higher accuracy indicates a better model. CIFAR100 is the most common benchmark data set in the image classification task; it comprises 50000 training samples and 10000 test samples covering 100 different classes, and all samples are RGB images with a resolution of 32×32. The results obtained are shown in Table 1 and FIG. 6.
Table 1: Comparison of test results.
[Table 1 is provided as an image in the original publication.]
In Table 1, "network model" denotes the backbone structure used in the experiment; "baseline network" denotes the original network under standard training; Acc (%) denotes the classification accuracy of the network; F (G) denotes the network FLOPs in units of G; P (M) denotes the number of network parameters in units of M; AC1 denotes auxiliary classifier 1; AC2 denotes auxiliary classifier 2; AC3 denotes auxiliary classifier 3; BC denotes the backbone network classifier; and FFC denotes the fusion classifier. The parameters and FLOPs of the baseline network are consistent with those of the backbone network. From Table 1 it can be found that: (1) compared with the baseline network under standard training (Baseline), the method effectively improves network performance; with identical parameters and floating-point operations, the average accuracy of the backbone network classifier BC is 2.82% higher than that of the baseline classifier. (2) On WRN50-2, ResNext50-4, and ShuffleNetV2, the method outperforms the baseline even at the shallowest auxiliary classifier AC1; moreover, the parameters and FLOPs of AC1 are significantly lower than those of the baseline, as can be seen intuitively from the table. (3) The classification accuracy of the fused-feature classifier FFC is consistently higher than that of the other classifiers, showing that the feature map generated by the proposed feature fusion network has richer semantics; the same conclusion can be drawn from FIG. 6. In FIG. 6, A, B, and C show three different groups of input sample images and their corresponding feature maps: the first column shows the input sample images; auxiliary branches 1 to 3 show the feature maps obtained by auxiliary branches 1 to 3 in FIG. 4 (the first, second, and third branch feature maps); the backbone network column shows the feature map output at the end of the backbone network in FIG. 4; CSFM shows the feature map output by the feature fusion network (CSFM); and the base network column shows the feature map obtained by the original network under standard training, without the modifications of this embodiment.
In summary, the self-knowledge distillation network based on online cooperation and fusion in this embodiment comprises a backbone network, auxiliary branch networks, a feature fusion network, a classifier module, and a perceptron module. The method of this embodiment comprises: acquiring preprocessed images; adding auxiliary branch networks at the ends of layers of different depths of the backbone network to obtain multiple networks, including the backbone network, that can cooperate online, and extracting features of the preprocessed images with the obtained networks; setting a diversity regularization term between adjacent networks to mitigate network homogenization and to induce the different networks to generate diversified outputs for the subsequent feature-map fusion and prediction fusion; concatenating the feature maps output by the convolution layers at the ends of the backbone network and the auxiliary branch networks along the channel dimension to obtain a concatenated feature map; feeding the concatenated feature map into the feature fusion network designed on an attention mechanism, and performing adaptive feature extraction and conversion in the channel dimension to obtain a fused feature map with richer semantic information; feeding the feature maps output by the convolution layers at the ends of the backbone network and the auxiliary branch networks, together with the fused feature map from the feature fusion network, into the corresponding classifiers to obtain the class prediction probabilities of the sample, and computing the difference loss against the true label of the sample; performing prediction-level fusion by dynamic integration according to the obtained prediction probabilities of each network for the sample classes, obtaining a more robust fused prediction probability; and taking the obtained fused prediction probability as the soft target in knowledge distillation and performing knowledge transfer to the backbone network, the auxiliary branch networks, and the feature fusion network to complete knowledge distillation. With the self-knowledge distillation method based on online cooperation and fusion, better classification accuracy can be obtained. The introduced diversity regularization term enriches the knowledge for feature fusion and prediction fusion and avoids homogenization of the different networks. The method of this embodiment shows the advantage of reducing network parameters and computation without affecting performance, which is important for practical, resource-constrained scenarios; in addition, it provides multiple networks with different computation and storage consumption for practical application, and a user can deploy a network with suitable accuracy, computation, and storage consumption according to the actual environmental constraints.
In one embodiment, a self-knowledge distillation method based on online collaboration and fusion comprises: a preprocessed data set; a backbone network; auxiliary branch networks added at the ends of layers of different depths of the backbone network, the backbone network and the auxiliary branch networks extracting features from the samples in the data set; a diversity regularization term for avoiding network homogenization and inducing the different networks to generate diversified features for the subsequent feature-map fusion and prediction fusion; a feature fusion network for performing adaptive feature extraction and conversion in the channel dimension on the obtained first, second, and third branch feature maps and the feature map output at the end of the backbone network, obtaining a fused feature map; a classifier module for converting the feature map output by the convolution layer at the end of each network into a one-dimensional global feature vector and obtaining from it the class prediction probability of the sample; and a perceptron module for dynamically integrating the prediction probabilities from the different classifiers to obtain a more robust fused prediction probability. Knowledge transfer is performed on all the networks based on the obtained fused prediction probability to complete knowledge distillation.
In addition, this embodiment also provides a self-knowledge distillation system based on online collaboration and fusion, comprising a microprocessor and a memory connected with each other, the microprocessor being programmed or configured to execute the steps of the aforementioned self-knowledge distillation method based on online collaboration and fusion. This embodiment also provides a computer-readable storage medium storing a computer program to be executed by a computer device to implement the steps of the aforementioned self-knowledge distillation method based on online collaboration and fusion.
The second embodiment:
This embodiment is basically the same as the first embodiment; the main difference lies in the implementation of the backbone network. In this embodiment the backbone network specifically adopts a MobileNetV2 network, which is modified as follows: the first Conv3×3 convolutional layer remains unchanged, and of the four convolution chunks, the 1st and 2nd convolution chunks of the original MobileNetV2 network are merged as the 1st convolution chunk of the backbone network, the 2nd convolution chunk of the original MobileNetV2 network is kept as the 2nd convolution chunk of the backbone network, the 3rd and 4th convolution chunks of the original MobileNetV2 network are merged as the 3rd convolution chunk of the backbone network, and the 5th and 6th convolution chunks of the original MobileNetV2 network are merged as the 4th convolution chunk of the backbone network, thereby realizing four-stage chunked convolution. In addition, the original MobileNetV2 network can be re-partitioned as required into the convolution chunks of the backbone network's different stages, i.e., different merging and matching schemes are possible.
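As a hedged illustration of such regrouping, the torchvision MobileNetV2 feature extractor can be sliced into four chunks near its stride-2 stage boundaries; the split indices below are an assumption for illustration, since the patent's exact merging scheme may differ:

```python
import torch.nn as nn
from torchvision.models import mobilenet_v2

# mobilenet_v2().features is a Sequential of an input conv, 17 inverted-residual
# blocks and a final 1x1 conv; slicing it yields the four backbone chunks.
feats = mobilenet_v2(num_classes=100).features
chunks = nn.ModuleList([
    feats[:4],     # chunk 1: input conv + first inverted-residual stages
    feats[4:7],    # chunk 2
    feats[7:14],   # chunk 3
    feats[14:],    # chunk 4: final stages + last 1x1 conv
])
```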
In addition, this embodiment likewise provides a self-knowledge distillation system based on online collaboration and fusion, comprising a microprocessor and a memory connected to each other, the microprocessor being programmed or configured to execute the steps of the above self-knowledge distillation method based on online collaboration and fusion, and a computer-readable storage medium storing computer program instructions that, when executed by a computer device, implement the aforementioned method.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-readable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer-implemented process, such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
The above description covers only preferred embodiments of the present invention, and the scope of protection of the present invention is not limited to the above embodiments; all technical solutions falling under the inventive concept belong to the scope of protection of the present invention. It should be noted that modifications and adaptations made by those skilled in the art without departing from the principles of the present invention should also be regarded as within the scope of protection of the present invention.

Claims (9)

1. A self-knowledge distillation method based on online cooperation and fusion, characterized by comprising the step of carrying out image classification training or application by adopting a self-knowledge distillation network based on online cooperation and fusion, wherein the self-knowledge distillation network based on online cooperation and fusion comprises a feature extraction network, a feature fusion network and a classifier network which are connected with each other; the feature extraction network comprises a backbone network and n auxiliary branch networks designed based on an attention mechanism; the backbone network comprises n+1 convolution chunks with learnable parameters for respectively extracting feature maps, the feature map output by any i-th stage convolution chunk serving as the input of the i-th auxiliary branch network; the outputs of the n auxiliary branch networks and of the (n+1)-th convolution chunk are fed into the feature fusion network for generating a fusion feature map; the feature maps output by the backbone network, the n auxiliary branch networks and the feature fusion network are sent to the classifier network to obtain the corresponding sample class prediction probabilities, and the classifier network is used for fusing the classification prediction probabilities obtained by separately classifying the features output by the feature extraction network and the feature fusion network, thereby obtaining a fusion prediction probability; a diversity regularization term $L_{div}$ for slowing down the homogenization phenomenon and improving the feature fusion quality of the feature fusion network and the prediction fusion quality of the classifier network is arranged between adjacent auxiliary branch networks, and the functional expression of the diversity regularization term $L_{div}$ is:

$$L_{div} = -\sum_{j=1}^{n-1} \left\| F_j - F_{j+1} \right\|_2^2 \qquad (1)$$

where n is the number of auxiliary branch networks, $F_j$ is the feature map output by the j-th auxiliary branch network, and $F_{j+1}$ is the feature map output by the (j+1)-th auxiliary branch network.
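A minimal PyTorch sketch of equation (1) follows; the function name and the assumption that the branch feature maps already share a common shape are ours:

```python
import torch

def diversity_loss(branch_feats):
    """Equation (1): L_div = -sum_j ||F_j - F_{j+1}||_2^2.

    branch_feats: list of n feature maps, each of shape (B, C, H, W); the
    down-sampling modules are assumed to have already matched their shapes.
    """
    loss = branch_feats[0].new_zeros(())
    for f_j, f_next in zip(branch_feats[:-1], branch_feats[1:]):
        # The minus sign means minimizing this term pushes adjacent
        # branch features apart, counteracting homogenization.
        loss = loss - (f_j - f_next).pow(2).sum()
    return loss
```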
2. The self-knowledge distillation method based on online cooperation and fusion according to claim 1, characterized in that the auxiliary branch network comprises an attention module and a down-sampling layer module connected with each other; the attention module is used for performing feature weight modeling on the input feature map to obtain an updated feature map $F_{i\_At}$, and the down-sampling layer module is used for performing size reduction and channel conversion on the updated feature map $F_{i\_At}$ to obtain the i-th branch feature map $F_{i\text{-}down}$, whose spatial dimensions are consistent with the feature map $F_{n+1}$ output by the (n+1)-th convolution chunk.
3. The self-knowledge distillation method based on online cooperation and fusion according to claim 2, characterized in that the attention module comprises two parallel paths, one computing channel attention and one computing spatial attention, and a feature map updating module. The step of computing spatial attention comprises: (1) reducing the input feature map $F_i$ from the spatial dimension $\mathbb{R}^{W \times H \times C}$ to $\mathbb{R}^{W \times H \times 1}$ by a Conv1×1 convolution to obtain a feature map $F_{i\_s}$, where W is the width, H the height, C the channel count and $\mathbb{R}$ the dimension space; (2) performing average pooling on $F_{i\_s}$ over the H and W dimensions respectively to obtain two one-dimensional global features $F_{i\text{-}sW} \in \mathbb{R}^{W \times 1 \times 1}$ and $F_{i\text{-}sH} \in \mathbb{R}^{1 \times H \times 1}$; normalizing the obtained global features over the H and W dimensions with the Sigmoid activation function, and computing the outer product of the two normalized feature vectors to obtain the spatial attention matrix $A_s \in \mathbb{R}^{W \times H \times 1}$. The step of computing channel attention comprises: (1) reducing the input feature map $F_i$ from the spatial dimension $\mathbb{R}^{W \times H \times C}$ to $\mathbb{R}^{1 \times 1 \times C}$ by an average pooling operation to obtain a feature map $F_{i\_c}$; (2) performing dimension reduction and dimension raising on $F_{i\_c}$ by Conv1×1 convolution to obtain a pre-weight vector $F_{i\text{-}pre} \in \mathbb{R}^{1 \times 1 \times C}$; (3) normalizing the pre-weight vector $F_{i\text{-}pre}$ with the Sigmoid activation function to obtain the final channel attention vector $A_c \in \mathbb{R}^{1 \times 1 \times C}$. The feature map updating module is used for merging the outputs of the two parallel paths into an updated feature map, and the functional expression of the merge is:

$$F_{i\_At} = F_i \cdot A_s \cdot A_c, \qquad F_{i\_At} \in \mathbb{R}^{W \times H \times C}$$

where $F_{i\_At}$ is the updated feature map output by the i-th auxiliary branch network.
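A sketch of this attention module in PyTorch might look as follows, assuming a ReLU between the channel-attention Conv1×1 layers and a reduction ratio of 16 (neither is specified by the claim):

```python
import torch
import torch.nn as nn

class BranchAttention(nn.Module):
    # Sketch of the claim-3 attention module; layer sizes are assumptions.
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.spatial_proj = nn.Conv2d(channels, 1, kernel_size=1)  # C -> 1
        self.channel_mlp = nn.Sequential(                          # dim-reduce, dim-raise
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )

    def forward(self, x):                                  # x: (B, C, H, W)
        # --- spatial attention ---
        s = self.spatial_proj(x)                           # (B, 1, H, W)
        s_h = torch.sigmoid(s.mean(dim=3, keepdim=True))   # pool over W -> (B,1,H,1)
        s_w = torch.sigmoid(s.mean(dim=2, keepdim=True))   # pool over H -> (B,1,1,W)
        a_s = s_h * s_w                                    # outer product -> (B,1,H,W)
        # --- channel attention ---
        c = x.mean(dim=(2, 3), keepdim=True)               # global avg pool -> (B,C,1,1)
        a_c = torch.sigmoid(self.channel_mlp(c))
        # --- feature map update: F_i * A_s * A_c ---
        return x * a_s * a_c
```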
4. The self-knowledge distillation method based on online cooperation and fusion according to claim 2, characterized in that the down-sampling layer module comprises different numbers of down-sampling layers, such that the updated feature map $F_{i\_At}$ output by any i-th auxiliary branch network, after size reduction and channel conversion, yields the i-th branch feature map $F_{i\text{-}down}$ whose spatial dimensions are consistent with the feature map $F_{n+1}$ output by the (n+1)-th convolution chunk. The step of a down-sampling layer performing down-sampling on the input feature map comprises: (1) first feeding the input feature map $F_{i\_At}$, or the feature map output by the previous down-sampling layer, into a first depthwise-separable convolution layer to obtain a feature map $F_{i\text{-}temp} \in \mathbb{R}^{W/2 \times H/2 \times C}$ whose size is halved in the H and W dimensions; (2) performing channel conversion on $F_{i\text{-}temp}$ through one pointwise convolution layer and one depthwise convolution layer to obtain a converted feature map $F_{i\text{-}trans} \in \mathbb{R}^{W/2 \times H/2 \times C_{mid}}$; (3) finally converting the dimensionality of $F_{i\text{-}trans}$ to a specified value $C_{out}$ through one pointwise convolution layer to obtain a feature map $F_{i\text{-}out} \in \mathbb{R}^{W/2 \times H/2 \times C_{out}}$; if the down-sampling layer is the last down-sampling layer, the obtained feature map $F_{i\text{-}out}$ serves as the i-th branch feature map $F_{i\text{-}down}$ obtained after size reduction and channel conversion; where $C_{mid}$ is the number of channels of the obtained feature map $F_{i\text{-}trans}$.
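One plausible PyTorch rendering of a single down-sampling layer is given below; the kernel sizes and the omission of normalization and activation layers are assumptions:

```python
import torch.nn as nn

class DownSampleLayer(nn.Module):
    # Sketch of one claim-4 down-sampling layer; c_mid/c_out are the
    # C_mid/C_out of the claim and must be chosen by the caller.
    def __init__(self, c_in, c_mid, c_out):
        super().__init__()
        # (1) depthwise-separable conv, stride 2: halves H and W
        self.reduce = nn.Sequential(
            nn.Conv2d(c_in, c_in, 3, stride=2, padding=1, groups=c_in),
            nn.Conv2d(c_in, c_in, 1),
        )
        # (2) pointwise + depthwise convolution: channel conversion to C_mid
        self.convert = nn.Sequential(
            nn.Conv2d(c_in, c_mid, 1),
            nn.Conv2d(c_mid, c_mid, 3, padding=1, groups=c_mid),
        )
        # (3) pointwise conv to the target channel count C_out
        self.project = nn.Conv2d(c_mid, c_out, 1)

    def forward(self, x):
        return self.project(self.convert(self.reduce(x)))
```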
5. The self-knowledge distillation method based on online cooperation and fusion according to claim 1, characterized in that the step of the feature fusion network fusing the outputs of the feature extraction network to obtain a fusion feature map $F_{Re}$ with richer semantic information comprises: (1) splicing each i-th branch feature map $F_{i\text{-}down}$ and the feature map $F_{n+1}$ output by the (n+1)-th convolution chunk in the channel dimension to obtain a spliced feature map $F_{cat} \in \mathbb{R}^{W \times H \times 4C}$; (2) passing the spliced feature map $F_{cat}$ through a maximum pooling operation and an average pooling operation respectively to obtain two one-dimensional global feature vectors $V_{max}$ and $V_{avg}$; (3) performing global feature modeling on each of $V_{max}$ and $V_{avg}$ through two Conv1×1 convolution layers, then adding them element by element to obtain a one-dimensional pre-weight vector; (4) normalizing the obtained pre-weight vector with the Sigmoid activation function to obtain the weight vector $U \in \mathbb{R}^{1 \times 1 \times 4C}$, and multiplying the weight vector $U$ and the spliced feature map $F_{cat}$ element by element to obtain the updated spliced feature map $M \in \mathbb{R}^{W \times H \times 4C}$; (5) finally compressing the channels of the updated spliced feature map $M$ to one quarter of the original through a pointwise Conv1×1 convolution and a depthwise-separable convolution to obtain the fusion feature map $F_{Re} \in \mathbb{R}^{W \times H \times C}$ with richer semantic information.
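Fusion steps (1)–(5) could be sketched as follows for n = 3 branches (so the spliced map has 4C channels); whether the two pooling paths share one Conv1×1 stack, the internal reduction ratio, and the order of the compression convolutions are assumptions:

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    # Sketch of the claim-5 fusion network; channels == C of one branch map.
    def __init__(self, channels):
        super().__init__()
        c4 = 4 * channels
        self.mlp_max = nn.Sequential(nn.Conv2d(c4, c4 // 4, 1), nn.ReLU(True),
                                     nn.Conv2d(c4 // 4, c4, 1))
        self.mlp_avg = nn.Sequential(nn.Conv2d(c4, c4 // 4, 1), nn.ReLU(True),
                                     nn.Conv2d(c4 // 4, c4, 1))
        self.compress = nn.Sequential(          # compress 4C -> C
            nn.Conv2d(c4, c4, 3, padding=1, groups=c4),  # depthwise
            nn.Conv2d(c4, channels, 1),                  # pointwise
        )

    def forward(self, feats):                   # list of 4 maps, each (B, C, H, W)
        f_cat = torch.cat(feats, dim=1)         # step 1: (B, 4C, H, W)
        v_max = f_cat.amax(dim=(2, 3), keepdim=True)   # step 2: global max pool
        v_avg = f_cat.mean(dim=(2, 3), keepdim=True)   # step 2: global avg pool
        u = torch.sigmoid(self.mlp_max(v_max) + self.mlp_avg(v_avg))  # steps 3-4: U
        m = f_cat * u                           # step 4: updated spliced map M
        return self.compress(m)                 # step 5: F_Re, (B, C, H, W)
```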
6. The self-knowledge distillation method based on online cooperation and fusion according to claim 1, characterized in that the classifier network comprises n auxiliary classifiers, a backbone network classifier and a fusion classifier; the n auxiliary classifiers correspond one-to-one to the n auxiliary branch networks and are used for classifying according to the i-th branch feature map $F_{i\text{-}down}$ output by the corresponding auxiliary branch network to obtain the i-th branch classification prediction probability; the backbone network classifier is used for classifying according to the feature map $F_{n+1}$ output by the (n+1)-th convolution chunk to obtain the backbone network classification prediction probability; and the fusion classifier is used for classifying according to the feature map $F_{Re}$ output by the feature fusion network to obtain the fusion branch classification prediction probability.
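Each classifier head reduces a feature map to a one-dimensional global vector and maps it to class scores; a minimal sketch, with global average pooling as an assumed reduction:

```python
import torch.nn as nn

class BranchClassifier(nn.Module):
    # Sketch of one classifier head: feature map -> global vector -> class scores.
    def __init__(self, channels, num_classes):
        super().__init__()
        self.fc = nn.Linear(channels, num_classes)

    def forward(self, x):                 # x: (B, C, H, W)
        v = x.mean(dim=(2, 3))            # one-dimensional global feature vector
        return self.fc(v)                 # logits; softmax gives the prediction probability
```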
7. The self-knowledge distillation method based on online cooperation and fusion according to claim 6, characterized in that the image classification training carried out by adopting the self-knowledge distillation network based on online cooperation and fusion further uses a perceptron, the perceptron being configured to fuse the input i-th branch classification prediction probabilities, the backbone network classification prediction probability and the feature fusion branch prediction probability to obtain a fusion prediction probability; when image classification training is carried out by adopting the self-knowledge distillation network based on online cooperation and fusion, each training round comprises the following steps:
S1) inputting the training data into the feature extraction network to output each i-th branch feature map $F_{i\text{-}down}$ and the feature map $F_{n+1}$ of the (n+1)-th convolution chunk, extracting the fusion feature map $F_{Re}$ through the feature fusion network, and passing the fusion feature map $F_{Re}$ through the fusion classifier to obtain the feature fusion branch prediction probability;
S2) taking the i-th branch classification prediction probabilities, the backbone network classification prediction probability and the feature fusion branch prediction probability as the input of a perceptron with learnable parameters, and outputting a one-dimensional fusion prediction probability through the perceptron; if the difference loss between the fusion prediction probability and the real label of the sample reaches the preset error range, judging that the optimal fusion prediction probability is obtained and skipping to execute the next step; otherwise, updating the parameters of the perceptron by gradient descent and skipping to execute step S1) to continue training the perceptron;
S3) taking the optimal fusion prediction probability as the soft target in knowledge distillation, and training the backbone network, the auxiliary branch networks and the feature fusion network based on the training data to realize knowledge transfer and complete the knowledge distillation.
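Steps S2) and S3) alternate perceptron training with distillation of the classifier heads; the sketch below assumes a single linear layer for the perceptron, a temperature T, and unit loss weights, none of which are fixed by the claim:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionPerceptron(nn.Module):
    # Learnable dynamic-ensemble module; a single linear layer over the
    # concatenated head probabilities is an assumption about its structure.
    def __init__(self, num_heads, num_classes):
        super().__init__()
        self.mix = nn.Linear(num_heads * num_classes, num_classes)

    def forward(self, probs):                  # list of (B, num_classes) tensors
        fused = self.mix(torch.cat(probs, dim=1))
        return fused.softmax(dim=1)            # one-dimensional fusion prediction

def distillation_loss(head_logits, fused_prob, labels, T=3.0):
    # S3: every head learns from the frozen fused soft target plus the hard label.
    soft = fused_prob.detach()                 # stop gradients into the perceptron
    kd = sum(F.kl_div(F.log_softmax(z / T, dim=1), soft,
                      reduction='batchmean') * T * T for z in head_logits)
    ce = sum(F.cross_entropy(z, labels) for z in head_logits)
    return ce + kd
```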
8. An online collaboration and fusion based self-knowledge distillation system comprising a microprocessor and a memory connected to each other, wherein the microprocessor is programmed or configured to perform the steps of the online collaboration and fusion based self-knowledge distillation method according to any one of claims 1 to 7.
9. A computer readable storage medium storing computer program instructions for execution by a computer device to perform the method of online collaboration and fusion based self-knowledge distillation according to any one of claims 1 to 7.
CN202210019067.4A 2022-01-10 2022-01-10 Self-knowledge distillation method and system based on online cooperation and fusion Active CN114049527B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210019067.4A CN114049527B (en) 2022-01-10 2022-01-10 Self-knowledge distillation method and system based on online cooperation and fusion


Publications (2)

Publication Number Publication Date
CN114049527A CN114049527A (en) 2022-02-15
CN114049527B true CN114049527B (en) 2022-06-14

Family

ID=80213479





Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant