CN115577793A - Network structure-oriented mapping type distillation method and training method thereof - Google Patents

Network structure-oriented mapping type distillation method and training method thereof

Info

Publication number
CN115577793A
Authority
CN
China
Prior art keywords
model
mapping
student
representation
characteristic information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210507030.6A
Other languages
Chinese (zh)
Inventor
胡潘天
段章领
贾兆红
唐俊
刘永峰
宋俊才
周行云
王坤
刘弨
路然
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui University
Original Assignee
Anhui University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui University filed Critical Anhui University
Priority to CN202210507030.6A priority Critical patent/CN115577793A/en
Publication of CN115577793A publication Critical patent/CN115577793A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00: Computing arrangements using knowledge-based models
    • G06N5/02: Knowledge representation; Symbolic representation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V10/765: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects using rules for classification or partitioning the feature space
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172: Classification, e.g. identification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of model compression and acceleration in computer vision, and addresses the technical problems that relationship-based distillation cannot improve relationship characterization, that input image batch sizes are small, and that existing methods neither combine the mutual advantages of feature and relationship information well nor compensate for and enhance the expressive power of feature characterization and relationship characterization. It provides, in particular, a network-structure-oriented mapped distillation method comprising the following processes: forming a teacher model and a student model from a pre-trained object detection model, and extracting the feature maps of the modules of the teacher model and the student model respectively. The method performs excellently in classification and detection tasks, combines the advantages of relationship information and attention information well, compensates for and enhances the expressive power of feature characterization and relationship characterization, provides two optional module mapping methods, and can accept larger image inputs and larger batch sizes, improving the applicability of the method.

Description

Network structure-oriented mapping type distillation method and training method thereof
Technical Field
The invention relates to the technical field of model compression and acceleration in computer vision, in particular to a network structure-oriented mapping type distillation method and a training method thereof.
Background
In the past decades, with the continuous iteration of deep learning technology, new neural network models have made continual leaps forward in the field of computer vision. Although computing resources and memory have grown substantially and can meet everyday demands, they remain a bottleneck for deep neural network performance. Model compression has therefore attracted extensive research interest in computer vision and machine learning. Quantization and pruning are the most common methods among previous researchers, focusing on network architecture design, reduction of computation and model size, and removal of redundant connections from large models.
The limitation of the above approaches is that they require custom-designed hardware and software for model acceleration, which restricts their practical application. To address this challenge, knowledge distillation has been proposed as an end-to-end solution that bridges model differences and minimizes the computational burden of deep neural networks. Knowledge distillation is the process of extracting knowledge from a teacher network and transferring it to a student network; it aims to encourage the student to learn from the teacher and to improve the generalization ability of the student network.
In the field of knowledge distillation, researchers seek ever more effective ways to transfer knowledge from a teacher network to a student network. In general, there are three types of knowledge: response-based, feature-based, and relationship-based; other distillation strategies also exist. Recent research on knowledge distillation has focused mainly on two aspects: first, mining more important knowledge for distillation; and second, exploring more effective methods of transferring knowledge to the student network.
Most distillation methods extract one of two forms of knowledge from the teacher network: feature-based or relationship-based. These efforts introduce particular kinds of knowledge information, such as attention information and the relationship information between instances. However, the two are not naturally complementary, which leads to the following problems in existing distillation methods:
(1) In the absence of an attention mechanism, relationship-based distillation fails to improve relationship characterization, while feature-based distillation ignores the relationships between higher-semantic instances and thus yields poor feature characterization. As a result, most existing distillation methods do not strike a good balance between detection and classification tasks.
(2) Existing distillation methods that consider the relationship information among network modules suffer from a series of defects, including high memory consumption, small input batch sizes, and the inability to exploit GPU acceleration.
(3) Existing methods that consider combining attention and relationship information simply add the loss functions of feature-based and relationship-based distillation; they neither combine the mutual advantages well nor compensate for and enhance the expressive power of feature characterization and relationship characterization.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a network-structure-oriented mapped distillation method and a training method thereof, solving the technical problems that relationship-based distillation cannot improve relationship characterization, that input image batch sizes are small, and that existing methods neither combine the mutual advantages well nor compensate for and enhance the expressive power of feature characterization and relationship characterization.
The method performs excellently in classification and detection tasks, combines the advantages of relationship information and attention information well, compensates for and enhances the expressive power of feature characterization and relationship characterization, and provides two optional module mapping methods that can accept larger image inputs and larger batch sizes, improving the applicability of the method.
In order to solve the above technical problems, the invention provides the following technical scheme: a network-structure-oriented mapped distillation method that takes a pre-trained object detection model as its basic data carrier and comprises the following processes:
forming a teacher model and a student model according to the pre-trained target detection model;
respectively extracting characteristic graphs of modules of each layer of the teacher model and the student model;
carrying out attention mapping on the extracted feature map to obtain a feature information representation map with attention information;
carrying out module mapping on the characteristic information representation graph to obtain a mapping representation graph with pixel level relation information;
a mapping profile containing attention information and pixel level relationship information is distilled from the teacher model to the student model.
Further, the attention map includes a global attention map and a local attention map, and the module map includes a global module map and a local module map.
Further, the feature map attention mapping includes the following processes:
carrying out global attention mapping on the characteristic information representation diagrams of the student model and the teacher model to obtain global characteristic information, and respectively obtaining the global characteristic information representation diagrams of the student model and the teacher model;
and carrying out local attention mapping on the characteristic information representation diagrams of the student model and the teacher model to obtain local characteristic information, and respectively obtaining the local characteristic information representation diagrams of the student model and the teacher model.
Further, the module mapping of the characteristic information representation comprises the following processes:
carrying out global module mapping on the global characteristic information representation diagrams of the student model and the teacher model to obtain global characteristic information and pixel level relation information, and respectively obtaining global mapping representation diagrams of the student model and the teacher model;
and carrying out local module mapping on the local characteristic information representation diagrams of the student model and the teacher model to obtain local characteristic information and pixel level relation information, and respectively obtaining the local mapping representation diagrams of the student model and the teacher model.
Further, the module mapping of the feature information characterization graph further includes the following processes:
carrying out global module mapping on the global characteristic information representation diagrams of the student model and the teacher model to obtain global characteristic information and pixel level relation information, and respectively obtaining global mapping representation diagrams of the student model and the teacher model;
local module mapping is carried out on the local characteristic information representation diagrams of the student model and the teacher model to obtain local characteristic information and pixel level relation information, and local mapping representation diagrams of the student model and the teacher model are obtained respectively;
and carrying out dimension compression on the global and local mapping representation graphs of the student model and the teacher model.
Further, the process of module mapping of the characteristic information representation diagram further comprises:
the module maps to obtain global and local mapping representation maps of the student model and the teacher model, combines the loss function, restrains and punishes the student model for training, and simulates learning from the student model to the teacher model to complete mapping type distillation.
Further, the process of module mapping of the characteristic information representation diagram further comprises:
and training by taking the mapping representation of the teacher model as a label to supervise and constrain the student model, and adding the loss function of the mapping distillation into the original loss function of the student model to balance the training error together, so that the robustness of the model is enhanced.
Further, the object detection model may also be an image classification model.
The invention also provides a training method applied to the mapping type distillation method, and the specific technical scheme is as follows:
a training method of a mapping distillation method facing a network structure comprises the following processes:
acquiring a pre-trained target detection model or an image classification model to form a teacher model and a student model;
and extracting the feature graphs of the feature layer modules from the teacher model and the student models, and introducing the feature graphs into the mapping type distillation method to train with the student models.
By means of the technical scheme, the invention provides a mapping type distillation method facing to a network structure and a training method thereof, and the method at least has the following beneficial effects:
1. the invention provides a mapping type distillation method and a training method thereof, which are used for embedding pixel-level relation information between adjacent modules based on an attention mechanism into a mapping representation diagram, transmitting the information from a teacher model to a student model and restricting and supervising training and learning of the student model. The method has excellent performance in classification and detection tasks, and well combines the advantages of relationship information and attention information, so as to make up and enhance the expression capacity of feature characterization and relationship characterization.
2. The mapped distillation method provided by the invention handles both image classification and object detection tasks with excellent performance on each; compared with existing distillation methods it achieves better performance while markedly reducing the computational workload, and it also provides two optional module mapping methods that can accept larger image inputs and larger batch sizes, improving the applicability of the method.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a flow chart of a process for mapped distillation according to one embodiment of the present invention;
FIG. 2 is a flowchart illustrating attention mapping of a feature map according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating module mapping for a characteristic information representation according to an embodiment of the present invention;
FIG. 4 is a flow chart of a training method for the mapped distillation method according to an embodiment of the present invention;
FIG. 5 is a bottom schematic block diagram of a mapped distillation process according to an embodiment of the present invention;
FIG. 6 is a flow chart of a mapped distillation process according to example two of the present invention;
fig. 7 is a flowchart of module mapping for a characteristic information characterization graph in the third embodiment of the present invention;
FIG. 8 is a schematic diagram of an exemplary distillation of a mapped distillation method and a training method thereof according to a third embodiment of the present invention;
FIG. 9 is a graph showing the results of comparative experiments on the mapped distillation method on the data sets Cars196 and CUB-200-2011 in the experimental examples of the present invention;
FIG. 10 is a graph showing the accuracy of a comparative experiment on a data set Cars196 in an experimental example of the present invention;
FIG. 11 is a graph illustrating the accuracy of comparative experiments on the data set CUB-200-2011 in the experimental example of the present invention;
FIG. 12 is a graph showing comparative experimental results of the mapped distillation method on the data set WIDERFACE in the experimental example of the present invention;
FIG. 13 is a matrix probability distribution diagram of a mapping representation of teacher and student models in an experimental example of the present invention;
FIG. 14 is a graph showing the results of a mapping distillation method ablation experiment performed on the data sets Cars196 and CUB-200-2011 in an experimental example of the present invention;
fig. 15 is a graph of the ablation experimental results of the mapped distillation method on the data set WIDERFACE in the experimental example of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention more comprehensible, the present invention is described in detail below with reference to the accompanying drawings and specific embodiments, so that how the technical means are applied to solve the technical problems and achieve the technical effects can be fully understood and put into practice.
Those skilled in the art will appreciate that all or part of the steps in the method for implementing the above embodiments may be implemented by a program instructing relevant hardware, and thus, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Example one
Referring to fig. 1 to 5, an embodiment of the present invention is shown, which is implemented as follows:
a mapping type distillation method for a network structure is based on a pre-trained target detection model as a basic data scalar, and comprises the following processes:
and S11, forming a teacher model and a student model according to the pre-trained target detection model.
And S12, respectively extracting characteristic diagrams of modules of the teacher model and the student model.
And S13, performing attention mapping on the extracted feature map to obtain a feature information representation map with attention information.
The attention map includes a global attention map and a local attention map.
The attention mapping of the feature map comprises the following processes:
s131, carrying out global attention mapping on the characteristic information representation diagrams of the student model and the teacher model to obtain global characteristic information, and respectively obtaining the global characteristic information representation diagrams of the student model and the teacher model.
And S132, carrying out local attention mapping on the characteristic information representation diagrams of the student model and the teacher model to obtain local characteristic information, and respectively obtaining the local characteristic information representation diagrams of the student model and the teacher model.
The global attention mapping function is Ψ_global: R^(L,N,C,H,W) → R^(L,N,C), with the concrete expression:

[equation image]

The local attention mapping function is Ψ_local: R^(L,N,C,H,W) → R^(L,N,HW), with the concrete expression:

[equation image]

where the input feature tensor A ∈ R^(L,N,C,H,W); L denotes the number of feature layers in the teacher and student models; N, C, H, and W denote the batch size, number of channels, height, and width of the feature maps, respectively; and i, j, and k index the i-th, j-th, and k-th vector segments along the spatial and channel dimensions.
Attention mapping thus yields the global and local characteristic information characterization maps of the student and teacher models for subsequent module mapping, as illustrated by the sketch below.
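For concreteness, the following is a minimal sketch of one plausible realization of the two attention mappings in PyTorch for a single module. Since the concrete expressions survive only as equation images, the squared-magnitude pooling used here (in the style of activation-based attention transfer) and the function names are assumptions, not the filing's definitive formulas.

```python
import torch

def global_attention_map(feat: torch.Tensor) -> torch.Tensor:
    # Assumed Ψ_global for one module: pool over the spatial dimensions,
    # keeping a per-channel attention vector. (N, C, H, W) -> (N, C)
    return feat.pow(2).mean(dim=(2, 3))

def local_attention_map(feat: torch.Tensor) -> torch.Tensor:
    # Assumed Ψ_local for one module: pool over the channel dimension,
    # keeping a per-pixel attention map. (N, C, H, W) -> (N, HW)
    n, _, h, w = feat.shape
    return feat.pow(2).mean(dim=1).reshape(n, h * w)
```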
And S14, carrying out module mapping on the characteristic information representation graph to obtain a mapping representation graph with pixel level relation information.
The module map includes a global module map and a local module map.
The module mapping of the characteristic information representation graph comprises the following processes:
s141, global module mapping is carried out on the global characteristic information representation diagrams of the student model and the teacher model, global characteristic information and pixel level relation information are obtained, and global mapping representation diagrams of the student model and the teacher model are obtained respectively;
and S142, carrying out local module mapping on the local characteristic information representation diagrams of the student model and the teacher model to obtain local characteristic information and pixel level relation information, and respectively obtaining the local mapping representation diagrams of the student model and the teacher model.
The global module mapping function is Φ_global, with the concrete expression:

[equation image]

The local module mapping function is Φ_local, with the concrete expression:

[equation image]

where F^(n-1) and F^(n) denote the feature characterization maps of the (n-1)-th and n-th layers, and M^(n-1) and M^(n) denote the instance pixel-level relational mapping representations of the (n-1)-th and n-th modules, respectively.
The above is the module mapping procedure of module mapping mode one.

Based on the obtained mapping representation maps of the teacher and student models, the teacher's mapping representation maps are used as labels to supervise and constrain the training of the student model; the loss function of mapped distillation is added to the student model's original loss function so that they jointly balance the training error and enhance the robustness of the model.

Module mapping thus yields the global and local mapping representation maps of teacher and student and, combined with the loss function, constrains and penalizes the training of the student model, so that the student imitates the teacher and mapped distillation is completed; a code sketch follows.
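As an illustration of what a module mapping could compute, the sketch below builds a pairwise relation matrix from the attention characterizations of two adjacent modules. The outer-product form and its normalization are assumptions, since Φ_global and Φ_local are given only as equation images.

```python
import torch

def module_map(a_prev: torch.Tensor, a_next: torch.Tensor) -> torch.Tensor:
    # Assumed Φ: relate the attention characterizations of two adjacent
    # modules. a_prev: (N, D1), a_next: (N, D2) -> (N, D1, D2); entry
    # (i, j) encodes the pixel/channel-level relation between segment i
    # of module n-1 and segment j of module n.
    return torch.einsum('ni,nj->nij', a_prev, a_next) / a_prev.shape[1]
```

For global characterizations D1 and D2 are channel counts, and for local characterizations they are flattened spatial positions, matching the global/local variants described above.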
And S15, distilling the mapping representation containing the attention information and the pixel level relation information from the teacher model to the student model.
The embodiment also provides a training method applied to the above mapping type distillation method, and the specific implementation manner is as follows:
a training method of a mapping distillation method facing a network structure comprises the following processes:
s10, obtaining a pre-trained target detection model or an image classification model to form a teacher model and a student model;
and S20, extracting feature graphs of the feature layer modules of the teacher model and the student models, and transmitting the feature graphs into the mapping type distillation method to train with the student models.
In the mapped distillation method and its training method of this embodiment, pixel-level relation information based on the attention mechanism between adjacent modules is embedded into the mapping representation maps, transmitted from the teacher model to the student model, and used to constrain and supervise the training of the student model.
The mapped distillation method performs excellently in classification and detection tasks and combines the advantages of relationship information and attention information well, so that the expressive power of feature characterization and relationship characterization is compensated and enhanced. Compared with previous distillation methods, it achieves better performance while significantly reducing the computational workload.
Example two
Referring to fig. 6, an embodiment of the present invention is shown, which is implemented as follows:
a mapping distillation method for a network structure is based on a pre-trained image classification model as a basic data scalar, and comprises the following processes:
s21, forming a teacher model and a student model according to the pre-trained image classification model;
s22, respectively extracting feature maps of modules of the teacher model and the student model;
s23, performing attention mapping on the extracted feature map to obtain a feature information representation map with attention information;
s24, carrying out module mapping on the characteristic information representation graph to obtain a mapping representation graph with pixel level relation information;
and S25, distilling the mapping representation containing the attention information and the pixel level relation information from the teacher model to the student model.
The mapping type distillation method for network structure proposed in this embodiment is made based on the first embodiment, and has the advantages of the corresponding method embodiments, and the technical contents identical or similar to those in the first embodiment are not repeated herein.
By the embodiment, the method integrates the image classification and the target detection tasks, both of which show excellent performance, and the calculation workload is remarkably reduced.
EXAMPLE III
Referring to fig. 7 to 8, an embodiment of the present invention is shown, which is implemented as follows:
as shown in fig. 8, the teacher model and the student model extract feature profiles of feature layers, where circles, triangles, squares, and diamonds represent output examples of the feature profiles.
And forming a mapping token map matrix by the adjacent feature layers, and particularly obtaining the mapping token map matrix through further module mapping of the feature token maps of global and local attention mapping.
Wherein the square represents the mapping relation matrix of each instance in each layer. Specifically, one can express (L-1) × M × N, where L denotes the total number of layers in the feature layer module, and M, N denote the rows and columns of the mapping feature map matrix, where x denotes optional, representing the module mapping the second mapping pattern.
The module mapping of the characteristic information representation graph comprises the following processes:
s1411, carrying out global module mapping on the global characteristic information representation diagrams of the student model and the teacher model to obtain global characteristic information and pixel level relation information, and respectively obtaining global mapping representation diagrams of the student model and the teacher model.
And S1412, performing local module mapping on the local characteristic information representation diagrams of the student model and the teacher model to obtain local characteristic information and pixel level relation information, and obtaining the local mapping representation diagrams of the student model and the teacher model respectively.
And S1413, performing dimension compression on the global and local mapping representation graphs of the student model and the teacher model.
The above is module mapping mode two, which performs dimension compression on the basis of mode one to obtain the final mapping representation:

[equation image]
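One way such a compression could look is to average the relation matrix along one axis, so that the representation grows linearly rather than quadratically with the number of segments; this is purely an assumption, since the compressed representation survives only as an equation image.

```python
import torch

def compress_mapping(mapping: torch.Tensor) -> torch.Tensor:
    # Assumed mode-two compression: average the (N, D1, D2) relation
    # matrices over the second relation axis -> (N, D1), cutting memory
    # so that larger images and batch sizes fit on the GPU.
    return mapping.mean(dim=2)
```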
based on the obtained mapping representation of the teacher model and the obtained mapping representation of the student model, the mapping representation of the teacher model is used as a label to supervise and constrain the student model for training, a mapping type loss function is designed and added to the original loss function of the student model to balance training errors together, and the robustness of the model is enhanced.
The loss function of the mapped distillation in modes one and two is as follows:

L_map = Σ_{n=1}^{L-1} ‖M_n^T − M_n^S‖₂²

where T and S denote the teacher model and the student model, respectively, and M_n denotes the mapping representation of the n-th module pair.
In addition, mode two provides an extra loss term, computed from the difference between the conditional probability distributions of the teacher's and student's mapping representation maps.
the conditional probability functions for samples i and j are embodied as follows:
Figure BDA0003636438890000112
wherein t represents a teacher model, N represents the number of samples, K is a kernel function with a bandwidth of σ, and is used for measuring the similarity of the samples, and the specific kernel function is a cosine kernel function (avoiding calculating the kernel bandwidth), which is specifically represented as follows:
Figure BDA0003636438890000113
wherein a, b denote the vectors of the two groups of samples, respectively | | n 2 Representing a vector of 2 A norm;
finally, the loss function of the mapping representation chart of the teacher model and the student model is obtained as follows:
Figure BDA0003636438890000114
wherein t and s represent a teacher model and a student model, respectively;
Finally, the total distillation loss function is expressed as:

L = L_task + α·L_map + β·L_prob

where the two hyper-parameters α and β balance the different distillation losses.
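Putting the loss pieces together, a minimal PyTorch sketch might look as follows. The kernel and probability formulas mirror the expressions above, which are themselves inferred from the surrounding definitions, so every function here should be read as an assumption rather than the patent's definitive implementation.

```python
import torch
import torch.nn.functional as F

def cosine_kernel(x: torch.Tensor) -> torch.Tensor:
    # Pairwise cosine similarities shifted into [0, 1], so they can serve
    # as kernel values without choosing a bandwidth sigma. x: (N, D)
    x = F.normalize(x, dim=1)
    return 0.5 * (x @ x.t() + 1.0)

def conditional_probs(m: torch.Tensor) -> torch.Tensor:
    # Row i holds p(j|i) over the other samples' mapping representations.
    k = cosine_kernel(m)
    k = k - torch.diag(torch.diag(k))        # exclude self-similarity
    return k / k.sum(dim=1, keepdim=True)

def total_distillation_loss(task_loss, m_t, m_s, alpha=1.0, beta=1.0, eps=1e-8):
    # L = L_task + alpha * L_map + beta * L_prob, with m_t and m_s the
    # flattened teacher/student mapping representations of shape (N, D).
    l_map = F.mse_loss(m_s, m_t)
    p_t = conditional_probs(m_t.detach())
    p_s = conditional_probs(m_s)
    l_prob = (p_t * ((p_t + eps) / (p_s + eps)).log()).sum(dim=1).mean()
    return task_loss + alpha * l_map + beta * l_prob
```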
Based on the obtained mapping representation maps of the teacher and student models, the teacher's mapping representation maps are used as labels to supervise and constrain the training of the student model; the loss function of mapped distillation is added to the student model's original loss function so that they jointly balance the training error and enhance the robustness of the model.
The module mapping of the characteristic information representation provided in this embodiment is performed on the basis of the first or second embodiment, and has the advantages of the corresponding method embodiment, and detailed descriptions of the same or similar technical contents covered in the first or second embodiment are omitted here.
By the above, the problems of high memory consumption, small input batch sizes, and the inability to exploit GPU acceleration can be alleviated; since module mapping mode two performs dimension compression on the basis of mode one before producing the final mapping representation, mode two can accept larger images and larger batch sizes, which improves the universality of the method.
Examples of the experiments
Referring to figs. 9 to 15, this experimental example verifies the network-structure-oriented mapped distillation method and its training method described in embodiments one, two, and three. The verification covers an image classification experiment, an object detection experiment, a classification-task ablation experiment, and a detection-task ablation experiment, as follows:
1. image classification experiment
1.1, selecting fine-grained image classification data sets Cars196 and CUB-200-2011 from the experimental data sets.
1.2, the experimental setup was as follows:
the experiment used Resnet50 as the teacher network and Resnet18 as the student network. Experiments the same experimental setup was used for both data sets. According to the experimental setting, the initial learning rate is set to be 0.1, the batch size is set to be 128, the total time is set to be 80, the rate reduction coefficient of the learning rate is 0.1, L2 standardization is used, the loss function calculation mode adopts L2 norm and conditional probability distribution to calculate loss, and the best result is obtained. Experimental assessment indices, all using Recall @ K, once all test images are embedded by the model, each test image will be used as a query, and the top K nearest neighbor images will be retrieved from the test set that does not contain the query. If the retrieved image contains the same category as the query, the recall of the query is considered to be 1.Recall @ K was calculated by calculating the average recall for the entire test set.
1.3, the experimental results are shown in fig. 9, and the experimental summary is as follows:
the accuracy of Resnet18 extracted by AOMD in Cars196 and CUB-200-2011 data sets is 76.03% and 55.26% respectively, and the accuracy is improved by 8.2% and 3.6% respectively compared with the benchmark. Compared to Fitnet (based on response), an improvement of 4.2% and 3.4%, respectively; compared with the attention based on characteristics, the improvement is 21.4 percent and 11.2 percent respectively; compared with SP (based on the relation), the improvement is respectively 5.6 percent and 3.0 percent, and compared with ICKD, the improvement is respectively 2.8 percent and 5.0 percent.
1.4, from the above results the following can be analyzed:
(1) Attention distillation does not adapt to fine-grained classification tasks and distills poorly; relational distillation is no better than traditional response-based distillation and also performs poorly.
(2) Mapped distillation lets feature and relationship characterizations compensate for and enhance each other, and is superior to feature-based or response-based distillation alone.
(3) Mapped distillation is well suited to image classification tasks.
Further, comparing the accuracy curves of the experiments on the Cars196 and CUB-200-2011 data sets shown in figs. 10 and 11, the following can be observed: the AOMD of the invention reaches its best precision within 10 to 20 epochs. Combined with fig. 9, AOMD shows fast convergence and a significant positive gain on fine-grained classification tasks, which verifies the superiority and high performance of the AOMD method of the present invention.
2. Target detection experiment
2.1, selecting a reference data set WIDERFACE for face detection from the experimental data set.
2.2, experimental settings were as follows:
experiments multi-branch task experiments were performed on WIDERFACE, including face classification, face box regression, and face five-point regression. In the experiment, the difference of network structures is considered, the Resnet50 is set as a teacher network, and the Mobilenet V1x0.25 is set as a student network. The initial learning rate is set to be 1e-3 in the experiment, the total duration is 250epochs, the learning rate is decreased progressively when 190 and 220epochs, and the learning rate decreasing coefficient is 0.1. For the evaluation index, the detection index of Deng was used. The evaluation indexes of three difficulty levels of easy, medium and difficult are defined by gradually increasing complex samples on the basis of the detection rate of the EdgeBox.
2.3, the experimental results are shown in fig. 12, and the experimental results are summarized as follows:
experimental selection experiments were performed with distillates based on response (FT), on features (Fine-grained, zhang) and on relationships (CC, FSP, RKD, PKT). Compared with the response-based FT model, the accuracy is improved by 2% over the original student network, by about 0.4% and 3.7% over the fine-grained and Zhang attention distillation, and by about 1.5% to 4.2% over the relationship-based distillation CC, FSP, RKD and PKT.
2.4, the results were analyzed as follows:
(1) By incorporating the attention mechanism into the relationship matrix, the mapping representation is enhanced and the deficiency of relationship-based distillation in object detection is made up.
(2) AOMD compensates well between feature characterization and relationship characterization and is suitable for object detection tasks.
Further, fig. 13 shows the probability distributions of the mapping representation matrices of the teacher and student models; the trained student model exhibits a probability distribution similar to the teacher model's.
3. Ablation experiment
The ablation experiments examined the impact of the two modes (module mapping modes one and two) and the hyper-parameters (α and β) on the image classification task (Cars196, CUB-200-2011) and the object detection task (WIDERFACE).
3.1, the classification task ablation experiment is as follows:
the experimental data set selects fine-grained image classification data sets Cars196 and CUB-200-2011.
3.2, the experimental settings are as follows:
the experiment used Resnet50 as the teacher network and Resnet18 as the student network. Experiments the same experimental setup was used for both data sets. The experimental setup set the initial learning rate to 0.1, the batch size to 128, the total time to 80, the learning rate reduction factor to 0.1, all normalized using L2, where the loss calculation function all calculated the loss using L2 norm to ensure fairness. Experimental assessment indices, each using Recall @ K, once all test images are embedded by the model, each test image will be used as a query, and the top K nearest neighbor images will be retrieved from the test set that does not contain the query. If the retrieved image contains the same category as the query, the recall of the query is considered to be 1.Recall @ K was calculated by calculating the average recall for the entire test set.
3.3, the results are shown in FIG. 14, and the results are as follows:
as to modes, mode one on Cars196 is slightly higher than mode two by about 0.1%, while mode one on CUB-200-2011 is slightly lower than mode two by about 0.8%. On Cars196 and CUB-200-2011, for values of the hyper-parameters α, β, AOMD for mode one (mode two) was improved by about 7.4% (7.4%) and 0% (2.4%), respectively, for α =1 and β =0, showing a positive gain to distillation effect; the AOMD for mode one (mode two) decreased by about 14.2% (14.5%) and 8.1% (7.5%), respectively, when α =0 and β =1, indicating a negative gain in distillation effect; when α =1, β =1, AOMD of mode one (mode two) increased by about 8.2% (8.1%), 1.9% (2.7%), most suitable for distillation.
3.4, from the above results it is possible to analyse:
(1) Both AOMD modes are effective on the fine-grained classification data sets, and mode two performs best.
(2) AOMD performs best when both hyper-parameters α and β are assigned (e.g., α = 1, β = 1).
4. Detection task ablation experiment
The experimental data set selects a reference data set WIDERFACE for face detection.
4.1, experimental setup as follows:
experiments multi-branch task experiments were performed on WIDERFACE, including face classification, face frame regression, and face five point regression. In the experiment, the difference of network structures is considered, the Resnet50 is set as a teacher network, and the Mobilenet V1x0.25 is set as a student network. The initial learning rate is set to be 1e-3 in the experiment, the total duration is 250epochs, the learning rate is decreased progressively when 190 and 220epochs, and the learning rate decreasing coefficient is 0.1. For the evaluation index, the detection index of Deng was used. The evaluation indexes of three difficulty levels of easy, medium and difficult are defined by gradually increasing complex samples on the basis of the detection rate of the EdgeBox.
4.2, the results of the experiment are shown in FIG. 15, and the results are as follows:
for mode two, a little over 0.2% higher on WIDERFACE than mode one (only considering the difficult level of accuracy). For the hyperparameters α and β, the AOMD for mode one (mode two) was improved by about 1.4% (1.4%) when α =1 and β =0, exhibiting a positive distillation effect; when α =0 and β =1, the AOMD of mode one (mode two) decreases by about 1.6% (1.4%) in the precision of the difficulty level, indicating that the distillation effect is a negative gain; when α =1 and β =1, the AOMD of mode one (mode two) increases by about 2.0% (1.8%) in the precision of the difficulty level, which is the best for distillation. For memory consumption, mode two allows the addition of 20 images of 640 x 640 size to the same memory environment, 12 more batch sizes than mode one.
4.3, from the above results it is possible to analyse:
(1) Both AOMD modes are effective on the object detection data set, and mode two performs best.
(2) AOMD performs best when both hyper-parameters α and β are assigned (e.g., α = 1, β = 1).
(3) On the object detection data set, with the same computing resources, AOMD mode two can train larger batches than AOMD mode one and achieves a better distillation effect.
The invention provides a mapping type distillation method and a training method thereof, which are used for embedding pixel-level relation information between adjacent modules based on an attention mechanism into a mapping representation diagram, transmitting the information from a teacher model to a student model and restricting and supervising training and learning of the student model. The method has excellent performance in classification and detection tasks, and well combines the advantages of relationship information and attention information, so as to make up and enhance the expression capacity of feature characterization and relationship characterization.
The mapped distillation method provided by the invention handles both image classification and object detection tasks with excellent performance on each; compared with existing distillation methods it achieves better performance while markedly reducing the computational workload, and it also provides two optional module mapping methods that can accept larger image inputs and larger batch sizes, improving the applicability of the method.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts in the embodiments are referred to each other. For each of the above embodiments, since they are substantially similar to the method embodiments, the description is simple, and reference may be made to the partial description of the method embodiments for relevant points.
The present invention has been described in detail with reference to the foregoing embodiments, and the principles and embodiments of the present invention have been described herein with reference to specific examples, which are provided only to assist understanding of the methods and core concepts of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (9)

1. A network-structure-oriented mapped distillation method, characterized in that a pre-trained object detection model serves as the basic data carrier, the method comprising the following processes:
forming a teacher model and a student model according to the pre-trained target detection model;
respectively extracting feature maps of modules of each layer of the teacher model and the student model;
carrying out attention mapping on the extracted feature map to obtain a feature information representation map with attention information;
carrying out module mapping on the characteristic information representation graph to obtain a mapping representation graph with pixel level relation information;
a mapping profile containing attention information and pixel level relationship information is distilled from the teacher model to the student model.
2. The mapped distillation method of claim 1, wherein: the attention map includes a global attention map and a local attention map, and the module map includes a global module map and a local module map.
3. The mapped distillation process of claim 1 or 2, wherein: the attention mapping of the feature map comprises the following processes:
carrying out global attention mapping on the characteristic information representation diagrams of the student model and the teacher model to obtain global characteristic information, and respectively obtaining the global characteristic information representation diagrams of the student model and the teacher model;
and carrying out local attention mapping on the characteristic information representation diagrams of the student model and the teacher model to obtain local characteristic information, and respectively obtaining the local characteristic information representation diagrams of the student model and the teacher model.
4. The mapped distillation process of claim 1 or 2, wherein: the module mapping of the characteristic information representation graph comprises the following processes:
carrying out global module mapping on the global characteristic information representation diagrams of the student model and the teacher model to obtain global characteristic information and pixel level relation information, and respectively obtaining global mapping representation diagrams of the student model and the teacher model;
and carrying out local module mapping on the local characteristic information representation diagrams of the student model and the teacher model to obtain local characteristic information and pixel level relation information, and respectively obtaining the local mapping representation diagrams of the student model and the teacher model.
5. The mapped distillation process of claim 1 or 2, wherein: the module mapping of the characteristic information representation graph further comprises the following processes:
carrying out global module mapping on the global characteristic information representation diagrams of the student model and the teacher model to obtain global characteristic information and pixel level relation information, and respectively obtaining global mapping representation diagrams of the student model and the teacher model;
local module mapping is carried out on the local characteristic information representation diagrams of the student model and the teacher model to obtain local characteristic information and pixel level relation information, and local mapping representation diagrams of the student model and the teacher model are obtained respectively;
and carrying out dimension compression on the global and local mapping representation graphs of the student model and the teacher model.
6. The mapped distillation process of claim 4 or 5, wherein: the module mapping process of the characteristic information representation diagram further comprises the following steps:
the module maps to obtain global and local mapping representation maps of the student model and the teacher model, combines the loss function, restrains and punishes the student model for training, and simulates learning from the student model to the teacher model to complete mapping type distillation.
7. The mapped distillation process of claim 4, 5 or 6, wherein: the module mapping process of the characteristic information representation diagram further comprises the following steps:
and training by taking the mapping representation of the teacher model as a label to supervise and constrain the student model, and adding the loss function of the mapping distillation into the original loss function of the student model to balance the training error together, so that the robustness of the model is enhanced.
8. The mapped distillation process of claim 1, wherein: the object detection model may also be an image classification model.
9. A training method for a mapping distillation method facing a network structure is characterized by comprising the following processes:
acquiring a pre-trained target detection model or an image classification model to form a teacher model and a student model;
extracting the feature maps of the feature layer modules from the teacher model and the student model, and feeding the feature maps into the mapped distillation method of any one of claims 1 to 8 to train the student model.
CN202210507030.6A 2022-05-10 2022-05-10 Network structure-oriented mapping type distillation method and training method thereof Pending CN115577793A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210507030.6A CN115577793A (en) 2022-05-10 2022-05-10 Network structure-oriented mapping type distillation method and training method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210507030.6A CN115577793A (en) 2022-05-10 2022-05-10 Network structure-oriented mapping type distillation method and training method thereof

Publications (1)

Publication Number Publication Date
CN115577793A true CN115577793A (en) 2023-01-06

Family

ID=84579454

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210507030.6A Pending CN115577793A (en) 2022-05-10 2022-05-10 Network structure-oriented mapping type distillation method and training method thereof

Country Status (1)

Country Link
CN (1) CN115577793A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116051935A (en) * 2023-03-03 2023-05-02 北京百度网讯科技有限公司 Image detection method, training method and device of deep learning model
CN116051935B (en) * 2023-03-03 2024-03-22 北京百度网讯科技有限公司 Image detection method, training method and device of deep learning model

Similar Documents

Publication Publication Date Title
CN113221911B (en) Vehicle weight identification method and system based on dual attention mechanism
CN104866810A (en) Face recognition method of deep convolutional neural network
CN111639524B (en) Automatic driving image semantic segmentation optimization method
CN113177559B (en) Image recognition method, system, equipment and medium combining breadth and dense convolutional neural network
CN112115967B (en) Image increment learning method based on data protection
CN113420775A (en) Image classification method under extremely small quantity of training samples based on adaptive subdomain field adaptation of non-linearity
CN116110022B (en) Lightweight traffic sign detection method and system based on response knowledge distillation
CN114067385A (en) Cross-modal face retrieval Hash method based on metric learning
CN111274424A (en) Semantic enhanced hash method for zero sample image retrieval
CN115546196A (en) Knowledge distillation-based lightweight remote sensing image change detection method
CN116912585A (en) SAR target recognition method based on self-supervision learning and knowledge distillation
CN116796810A (en) Deep neural network model compression method and device based on knowledge distillation
CN115577793A (en) Network structure-oriented mapping type distillation method and training method thereof
Pang et al. Multihead attention mechanism guided ConvLSTM for pixel-level segmentation of ocean remote sensing images
CN117079276B (en) Semantic segmentation method, system, equipment and medium based on knowledge distillation
CN113706551A (en) Image segmentation method, device, equipment and storage medium
CN116797821A (en) Generalized zero sample image classification method based on fusion visual information
CN115830401A (en) Small sample image classification method
CN116306969A (en) Federal learning method and system based on self-supervision learning
CN114972282A (en) Incremental learning non-reference image quality evaluation method based on image semantic information
CN111563413B (en) Age prediction method based on mixed double models
Jain et al. Flynet–neural network model for automatic building detection from satellite images
CN113420821A (en) Multi-label learning method based on local correlation of labels and features
CN117036698B (en) Semantic segmentation method based on dual feature knowledge distillation
CN116501908B (en) Image retrieval method based on feature fusion learning graph attention network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination