CN115577793A - Network structure-oriented mapping type distillation method and training method thereof - Google Patents

Network structure-oriented mapping type distillation method and training method thereof

Info

Publication number
CN115577793A
Authority
CN
China
Prior art keywords
model
mapping
student
representation
characteristic information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210507030.6A
Other languages
Chinese (zh)
Inventor
胡潘天
段章领
贾兆红
唐俊
刘永峰
宋俊才
周行云
王坤
刘弨
路然
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui University
Original Assignee
Anhui University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui University filed Critical Anhui University
Priority to CN202210507030.6A priority Critical patent/CN115577793A/en
Publication of CN115577793A publication Critical patent/CN115577793A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00: Computing arrangements using knowledge-based models
    • G06N5/02: Knowledge representation; Symbolic representation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V10/765: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects using rules for classification or partitioning the feature space
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172: Classification, e.g. identification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of model compression and acceleration in computer vision, and addresses the technical problems that relationship-based distillation cannot improve relationship characterization, that input image batch sizes are small, and that existing methods neither combine the mutual advantages of feature and relationship information well nor compensate for and enhance the expressive power of feature characterization and relationship characterization. It provides, in particular, a network-structure-oriented mapped distillation method comprising the following processes: forming a teacher model and a student model from a pre-trained object detection model, and extracting the feature maps of the modules of the teacher model and the student model respectively. The method performs excellently in classification and detection tasks, combines the advantages of relationship information and attention information well, compensates for and enhances the expressive power of feature characterization and relationship characterization, provides two optional module mapping methods, and can accept larger image inputs and larger batch sizes, improving the applicability of the method.

Description

Network structure-oriented mapping type distillation method and training method thereof
Technical Field
The invention relates to the technical field of model compression and acceleration in computer vision, in particular to a network structure-oriented mapping type distillation method and a training method thereof.
Background
In the past decades, with the continuous iteration of deep learning technology, new neural network models have made continual leaps forward in the field of computer vision. Although computing resources and memory have grown substantially and can meet everyday demands, they remain a bottleneck for deep neural network performance. Model compression has therefore attracted extensive research interest in computer vision and machine learning. Quantization and pruning are the most common methods among previous researchers, focusing on network architecture design, reduction of computation and model size, and removal of redundant connections from large models.
The limitation of the above approaches is that they require custom-designed hardware and software for model acceleration, which restricts their practical application. To address this challenge, knowledge distillation has been proposed as an end-to-end solution that bridges model differences and minimizes the computational burden of deep neural networks. Knowledge distillation is the process of extracting knowledge from a teacher network and transferring it to a student network; it aims to encourage the student to learn from the teacher and to improve the generalization ability of the student network.
In the field of knowledge distillation, researchers seek ever more effective ways to transfer knowledge from a teacher network to a student network. In general, there are three types of knowledge: response-based, feature-based, and relationship-based; other distillation strategies also exist. Recent research on knowledge distillation has focused mainly on two aspects: first, mining more important knowledge for distillation; and second, exploring more effective methods of transferring knowledge to the student network.
Most distillation methods extract one of two forms of knowledge from the teacher network: feature-based or relationship-based. These efforts introduce particular kinds of knowledge information, such as attention information and the relationship information between instances. However, the two are not naturally complementary, which leads to the following problems in existing distillation methods:
(1) In the absence of an attention mechanism, relationship-based distillation fails to improve relationship characterization, while feature-based distillation ignores the relationships between higher-semantic instances and thus yields poor feature characterization. As a result, most existing distillation methods do not strike a good balance between detection and classification tasks.
(2) Existing distillation methods that consider the relationship information among network modules suffer from a series of defects, including high memory consumption, small input batch sizes, and the inability to exploit GPU acceleration.
(3) Existing methods that consider combining attention and relationship information simply add the loss functions of feature-based and relationship-based distillation; they neither combine the mutual advantages well nor compensate for and enhance the expressive power of feature characterization and relationship characterization.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a network-structure-oriented mapped distillation method and a training method thereof, solving the technical problems that relationship-based distillation cannot improve relationship characterization, that input image batch sizes are small, and that existing methods neither combine the mutual advantages well nor compensate for and enhance the expressive power of feature characterization and relationship characterization.
The method performs excellently in classification and detection tasks, combines the advantages of relationship information and attention information well, compensates for and enhances the expressive power of feature characterization and relationship characterization, and provides two optional module mapping methods that can accept larger image inputs and larger batch sizes, improving the applicability of the method.
In order to solve the above technical problems, the invention provides the following technical scheme: a network-structure-oriented mapped distillation method that takes a pre-trained object detection model as its basic data carrier and comprises the following processes:
forming a teacher model and a student model according to the pre-trained target detection model;
respectively extracting characteristic graphs of modules of each layer of the teacher model and the student model;
carrying out attention mapping on the extracted feature map to obtain a feature information representation map with attention information;
carrying out module mapping on the characteristic information representation graph to obtain a mapping representation graph with pixel level relation information;
a mapping profile containing attention information and pixel level relationship information is distilled from the teacher model to the student model.
Further, the attention map includes a global attention map and a local attention map, and the module map includes a global module map and a local module map.
Further, the feature map attention mapping includes the following processes:
carrying out global attention mapping on the characteristic information representation diagrams of the student model and the teacher model to obtain global characteristic information, and respectively obtaining the global characteristic information representation diagrams of the student model and the teacher model;
and carrying out local attention mapping on the characteristic information representation diagrams of the student model and the teacher model to obtain local characteristic information, and respectively obtaining the local characteristic information representation diagrams of the student model and the teacher model.
Further, the module mapping of the characteristic information representation comprises the following processes:
carrying out global module mapping on the global characteristic information representation diagrams of the student model and the teacher model to obtain global characteristic information and pixel level relation information, and respectively obtaining global mapping representation diagrams of the student model and the teacher model;
and carrying out local module mapping on the local characteristic information representation diagrams of the student model and the teacher model to obtain local characteristic information and pixel level relation information, and respectively obtaining the local mapping representation diagrams of the student model and the teacher model.
Further, the module mapping of the feature information characterization graph further includes the following processes:
carrying out global module mapping on the global characteristic information representation diagrams of the student model and the teacher model to obtain global characteristic information and pixel level relation information, and respectively obtaining global mapping representation diagrams of the student model and the teacher model;
local module mapping is carried out on the local characteristic information representation diagrams of the student model and the teacher model to obtain local characteristic information and pixel level relation information, and local mapping representation diagrams of the student model and the teacher model are obtained respectively;
and carrying out dimension compression on the global and local mapping representation graphs of the student model and the teacher model.
Further, the process of module mapping of the characteristic information representation diagram further comprises:
the module maps to obtain global and local mapping representation maps of the student model and the teacher model, combines the loss function, restrains and punishes the student model for training, and simulates learning from the student model to the teacher model to complete mapping type distillation.
Further, the process of module mapping of the characteristic information representation diagram further comprises:
and training by taking the mapping representation of the teacher model as a label to supervise and constrain the student model, and adding the loss function of the mapping distillation into the original loss function of the student model to balance the training error together, so that the robustness of the model is enhanced.
Further, the object detection model may also be an image classification model.
The invention also provides a training method applied to the mapping type distillation method, and the specific technical scheme is as follows:
a training method of a mapping distillation method facing a network structure comprises the following processes:
acquiring a pre-trained target detection model or an image classification model to form a teacher model and a student model;
and extracting the feature graphs of the feature layer modules from the teacher model and the student models, and introducing the feature graphs into the mapping type distillation method to train with the student models.
By means of the technical scheme, the invention provides a mapping type distillation method facing to a network structure and a training method thereof, and the method at least has the following beneficial effects:
1. the invention provides a mapping type distillation method and a training method thereof, which are used for embedding pixel-level relation information between adjacent modules based on an attention mechanism into a mapping representation diagram, transmitting the information from a teacher model to a student model and restricting and supervising training and learning of the student model. The method has excellent performance in classification and detection tasks, and well combines the advantages of relationship information and attention information, so as to make up and enhance the expression capacity of feature characterization and relationship characterization.
2. The mapped distillation method provided by the invention handles both image classification and object detection tasks with excellent performance on each; compared with existing distillation methods it achieves better performance while markedly reducing the computational workload, and it also provides two optional module mapping methods that can accept larger image inputs and larger batch sizes, improving the applicability of the method.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a flow chart of a process for mapped distillation according to one embodiment of the present invention;
FIG. 2 is a flowchart illustrating attention mapping of a feature map according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating module mapping for a characteristic information representation according to an embodiment of the present invention;
FIG. 4 is a flow chart of a training method for the mapped distillation method according to an embodiment of the present invention;
FIG. 5 is a bottom schematic block diagram of a mapped distillation process according to an embodiment of the present invention;
FIG. 6 is a flow chart of a mapped distillation process according to example two of the present invention;
fig. 7 is a flowchart of module mapping for a characteristic information characterization graph in the third embodiment of the present invention;
FIG. 8 is a schematic diagram of an exemplary distillation of a mapped distillation method and a training method thereof according to a third embodiment of the present invention;
FIG. 9 is a graph showing the results of comparative experiments on the mapped distillation method on the data sets Cars196 and CUB-200-2011 in the experimental examples of the present invention;
FIG. 10 is a graph showing the accuracy of a comparative experiment on a data set Cars196 in an experimental example of the present invention;
FIG. 11 is a graph illustrating the accuracy of comparative experiments on the data set CUB-200-2011 in the experimental example of the present invention;
FIG. 12 is a graph showing comparative experimental results of the mapped distillation method on the data set WIDERFACE in the experimental example of the present invention;
FIG. 13 is a matrix probability distribution diagram of a mapping representation of teacher and student models in an experimental example of the present invention;
FIG. 14 is a graph showing the results of a mapping distillation method ablation experiment performed on the data sets Cars196 and CUB-200-2011 in an experimental example of the present invention;
fig. 15 is a graph of the ablation experimental results of the mapped distillation method on the data set WIDERFACE in the experimental example of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention more comprehensible, the present invention is described in detail below with reference to the accompanying drawings and specific embodiments, so that how the technical means are applied to solve the technical problems and achieve the technical effects can be fully understood and put into practice.
Those skilled in the art will appreciate that all or part of the steps in the method for implementing the above embodiments may be implemented by a program instructing relevant hardware, and thus, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Example one
Referring to fig. 1 to 5, an embodiment of the present invention is shown, which is implemented as follows:
a mapping type distillation method for a network structure is based on a pre-trained target detection model as a basic data scalar, and comprises the following processes:
and S11, forming a teacher model and a student model according to the pre-trained target detection model.
And S12, respectively extracting characteristic diagrams of modules of the teacher model and the student model.
And S13, performing attention mapping on the extracted feature map to obtain a feature information representation map with attention information.
The attention map includes a global attention map and a local attention map.
The attention mapping of the feature map comprises the following processes:
s131, carrying out global attention mapping on the characteristic information representation diagrams of the student model and the teacher model to obtain global characteristic information, and respectively obtaining the global characteristic information representation diagrams of the student model and the teacher model.
And S132, carrying out local attention mapping on the characteristic information representation diagrams of the student model and the teacher model to obtain local characteristic information, and respectively obtaining the local characteristic information representation diagrams of the student model and the teacher model.
The global attention mapping function is Ψ_global: R^(L,N,C,H,W) → R^(L,N,C), with the concrete expression:

[equation image]

The local attention mapping function is Ψ_local: R^(L,N,C,H,W) → R^(L,N,HW), with the concrete expression:

[equation image]

where the input feature tensor A ∈ R^(L,N,C,H,W); L denotes the number of feature layers in the teacher and student models; N, C, H, and W denote the batch size, number of channels, height, and width of the feature maps, respectively; and i, j, and k index the i-th, j-th, and k-th vector segments along the spatial and channel dimensions.
Attention mapping thus yields the global and local characteristic information characterization maps of the student and teacher models for subsequent module mapping, as illustrated by the sketch below.
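For concreteness, the following is a minimal sketch of one plausible realization of the two attention mappings in PyTorch for a single module. Since the concrete expressions survive only as equation images, the squared-magnitude pooling used here (in the style of activation-based attention transfer) and the function names are assumptions, not the filing's definitive formulas.

```python
import torch

def global_attention_map(feat: torch.Tensor) -> torch.Tensor:
    # Assumed Ψ_global for one module: pool over the spatial dimensions,
    # keeping a per-channel attention vector. (N, C, H, W) -> (N, C)
    return feat.pow(2).mean(dim=(2, 3))

def local_attention_map(feat: torch.Tensor) -> torch.Tensor:
    # Assumed Ψ_local for one module: pool over the channel dimension,
    # keeping a per-pixel attention map. (N, C, H, W) -> (N, HW)
    n, _, h, w = feat.shape
    return feat.pow(2).mean(dim=1).reshape(n, h * w)
```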
And S14, carrying out module mapping on the characteristic information representation graph to obtain a mapping representation graph with pixel level relation information.
The module map includes a global module map and a local module map.
The module mapping of the characteristic information representation graph comprises the following processes:
s141, global module mapping is carried out on the global characteristic information representation diagrams of the student model and the teacher model, global characteristic information and pixel level relation information are obtained, and global mapping representation diagrams of the student model and the teacher model are obtained respectively;
and S142, carrying out local module mapping on the local characteristic information representation diagrams of the student model and the teacher model to obtain local characteristic information and pixel level relation information, and respectively obtaining the local mapping representation diagrams of the student model and the teacher model.
The global module mapping function is Φ_global, with the concrete expression:

[equation image]

The local module mapping function is Φ_local, with the concrete expression:

[equation image]

where F^(n-1) and F^(n) denote the feature characterization maps of the (n-1)-th and n-th layers, and M^(n-1) and M^(n) denote the instance pixel-level relational mapping representations of the (n-1)-th and n-th modules, respectively.
The above is the module mapping procedure of module mapping mode one.

Based on the obtained mapping representation maps of the teacher and student models, the teacher's mapping representation maps are used as labels to supervise and constrain the training of the student model; the loss function of mapped distillation is added to the student model's original loss function so that they jointly balance the training error and enhance the robustness of the model.

Module mapping thus yields the global and local mapping representation maps of teacher and student and, combined with the loss function, constrains and penalizes the training of the student model, so that the student imitates the teacher and mapped distillation is completed; a code sketch follows.
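As an illustration of what a module mapping could compute, the sketch below builds a pairwise relation matrix from the attention characterizations of two adjacent modules. The outer-product form and its normalization are assumptions, since Φ_global and Φ_local are given only as equation images.

```python
import torch

def module_map(a_prev: torch.Tensor, a_next: torch.Tensor) -> torch.Tensor:
    # Assumed Φ: relate the attention characterizations of two adjacent
    # modules. a_prev: (N, D1), a_next: (N, D2) -> (N, D1, D2); entry
    # (i, j) encodes the pixel/channel-level relation between segment i
    # of module n-1 and segment j of module n.
    return torch.einsum('ni,nj->nij', a_prev, a_next) / a_prev.shape[1]
```

For global characterizations D1 and D2 are channel counts, and for local characterizations they are flattened spatial positions, matching the global/local variants described above.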
And S15, distilling the mapping representation containing the attention information and the pixel level relation information from the teacher model to the student model.
The embodiment also provides a training method applied to the above mapping type distillation method, and the specific implementation manner is as follows:
a training method of a mapping distillation method facing a network structure comprises the following processes:
s10, obtaining a pre-trained target detection model or an image classification model to form a teacher model and a student model;
and S20, extracting feature graphs of the feature layer modules of the teacher model and the student models, and transmitting the feature graphs into the mapping type distillation method to train with the student models.
In the mapped distillation method and its training method of this embodiment, pixel-level relation information based on the attention mechanism between adjacent modules is embedded into the mapping representation maps, transmitted from the teacher model to the student model, and used to constrain and supervise the training of the student model.
The mapped distillation method performs excellently in classification and detection tasks and combines the advantages of relationship information and attention information well, so that the expressive power of feature characterization and relationship characterization is compensated and enhanced. Compared with previous distillation methods, it achieves better performance while significantly reducing the computational workload.
Example two
Referring to fig. 6, an embodiment of the present invention is shown, which is implemented as follows:
a mapping distillation method for a network structure is based on a pre-trained image classification model as a basic data scalar, and comprises the following processes:
s21, forming a teacher model and a student model according to the pre-trained image classification model;
s22, respectively extracting feature maps of modules of the teacher model and the student model;
s23, performing attention mapping on the extracted feature map to obtain a feature information representation map with attention information;
s24, carrying out module mapping on the characteristic information representation graph to obtain a mapping representation graph with pixel level relation information;
and S25, distilling the mapping representation containing the attention information and the pixel level relation information from the teacher model to the student model.
The mapping type distillation method for network structure proposed in this embodiment is made based on the first embodiment, and has the advantages of the corresponding method embodiments, and the technical contents identical or similar to those in the first embodiment are not repeated herein.
By the embodiment, the method integrates the image classification and the target detection tasks, both of which show excellent performance, and the calculation workload is remarkably reduced.
EXAMPLE III
Referring to fig. 7 to 8, an embodiment of the present invention is shown, which is implemented as follows:
as shown in fig. 8, the teacher model and the student model extract feature profiles of feature layers, where circles, triangles, squares, and diamonds represent output examples of the feature profiles.
And forming a mapping token map matrix by the adjacent feature layers, and particularly obtaining the mapping token map matrix through further module mapping of the feature token maps of global and local attention mapping.
Wherein the square represents the mapping relation matrix of each instance in each layer. Specifically, one can express (L-1) × M × N, where L denotes the total number of layers in the feature layer module, and M, N denote the rows and columns of the mapping feature map matrix, where x denotes optional, representing the module mapping the second mapping pattern.
The module mapping of the characteristic information representation graph comprises the following processes:
s1411, carrying out global module mapping on the global characteristic information representation diagrams of the student model and the teacher model to obtain global characteristic information and pixel level relation information, and respectively obtaining global mapping representation diagrams of the student model and the teacher model.
And S1412, performing local module mapping on the local characteristic information representation diagrams of the student model and the teacher model to obtain local characteristic information and pixel level relation information, and obtaining the local mapping representation diagrams of the student model and the teacher model respectively.
And S1413, performing dimension compression on the global and local mapping representation graphs of the student model and the teacher model.
The above is module mapping mode two, which performs dimension compression on the basis of mode one to obtain the final mapping representation:

[equation image]
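One way such a compression could look is to average the relation matrix along one axis, so that the representation grows linearly rather than quadratically with the number of segments; this is purely an assumption, since the compressed representation survives only as an equation image.

```python
import torch

def compress_mapping(mapping: torch.Tensor) -> torch.Tensor:
    # Assumed mode-two compression: average the (N, D1, D2) relation
    # matrices over the second relation axis -> (N, D1), cutting memory
    # so that larger images and batch sizes fit on the GPU.
    return mapping.mean(dim=2)
```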
based on the obtained mapping representation of the teacher model and the obtained mapping representation of the student model, the mapping representation of the teacher model is used as a label to supervise and constrain the student model for training, a mapping type loss function is designed and added to the original loss function of the student model to balance training errors together, and the robustness of the model is enhanced.
The loss function of the mapped distillation in modes one and two is as follows:

L_map = Σ_{n=1}^{L-1} ‖M_n^T − M_n^S‖₂²

where T and S denote the teacher model and the student model, respectively, and M_n denotes the mapping representation of the n-th module pair.
In addition, mode two provides an extra loss term, computed from the difference between the conditional probability distributions of the teacher's and student's mapping representation maps.
the conditional probability functions for samples i and j are embodied as follows:
Figure BDA0003636438890000112
wherein t represents a teacher model, N represents the number of samples, K is a kernel function with a bandwidth of σ, and is used for measuring the similarity of the samples, and the specific kernel function is a cosine kernel function (avoiding calculating the kernel bandwidth), which is specifically represented as follows:
Figure BDA0003636438890000113
wherein a, b denote the vectors of the two groups of samples, respectively | | n 2 Representing a vector of 2 A norm;
finally, the loss function of the mapping representation chart of the teacher model and the student model is obtained as follows:
Figure BDA0003636438890000114
wherein t and s represent a teacher model and a student model, respectively;
Finally, the total distillation loss function is expressed as:

L = L_task + α·L_map + β·L_prob

where the two hyper-parameters α and β balance the different distillation losses.
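Putting the loss pieces together, a minimal PyTorch sketch might look as follows. The kernel and probability formulas mirror the expressions above, which are themselves inferred from the surrounding definitions, so every function here should be read as an assumption rather than the patent's definitive implementation.

```python
import torch
import torch.nn.functional as F

def cosine_kernel(x: torch.Tensor) -> torch.Tensor:
    # Pairwise cosine similarities shifted into [0, 1], so they can serve
    # as kernel values without choosing a bandwidth sigma. x: (N, D)
    x = F.normalize(x, dim=1)
    return 0.5 * (x @ x.t() + 1.0)

def conditional_probs(m: torch.Tensor) -> torch.Tensor:
    # Row i holds p(j|i) over the other samples' mapping representations.
    k = cosine_kernel(m)
    k = k - torch.diag(torch.diag(k))        # exclude self-similarity
    return k / k.sum(dim=1, keepdim=True)

def total_distillation_loss(task_loss, m_t, m_s, alpha=1.0, beta=1.0, eps=1e-8):
    # L = L_task + alpha * L_map + beta * L_prob, with m_t and m_s the
    # flattened teacher/student mapping representations of shape (N, D).
    l_map = F.mse_loss(m_s, m_t)
    p_t = conditional_probs(m_t.detach())
    p_s = conditional_probs(m_s)
    l_prob = (p_t * ((p_t + eps) / (p_s + eps)).log()).sum(dim=1).mean()
    return task_loss + alpha * l_map + beta * l_prob
```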
Based on the obtained mapping representation maps of the teacher and student models, the teacher's mapping representation maps are used as labels to supervise and constrain the training of the student model; the loss function of mapped distillation is added to the student model's original loss function so that they jointly balance the training error and enhance the robustness of the model.
The module mapping of the characteristic information representation provided in this embodiment is performed on the basis of the first or second embodiment, and has the advantages of the corresponding method embodiment, and detailed descriptions of the same or similar technical contents covered in the first or second embodiment are omitted here.
By the above, the problems of high memory consumption, small input batch sizes, and the inability to exploit GPU acceleration can be alleviated; since module mapping mode two performs dimension compression on the basis of mode one before producing the final mapping representation, mode two can accept larger images and larger batch sizes, which improves the universality of the method.
Examples of the experiments
Referring to figs. 9 to 15, this experimental example verifies the network-structure-oriented mapped distillation method and its training method described in embodiments one, two, and three. The verification covers an image classification experiment, an object detection experiment, a classification-task ablation experiment, and a detection-task ablation experiment, as follows:
1. image classification experiment
1.1, selecting fine-grained image classification data sets Cars196 and CUB-200-2011 from the experimental data sets.
1.2, the experimental setup was as follows:
the experiment used Resnet50 as the teacher network and Resnet18 as the student network. Experiments the same experimental setup was used for both data sets. According to the experimental setting, the initial learning rate is set to be 0.1, the batch size is set to be 128, the total time is set to be 80, the rate reduction coefficient of the learning rate is 0.1, L2 standardization is used, the loss function calculation mode adopts L2 norm and conditional probability distribution to calculate loss, and the best result is obtained. Experimental assessment indices, all using Recall @ K, once all test images are embedded by the model, each test image will be used as a query, and the top K nearest neighbor images will be retrieved from the test set that does not contain the query. If the retrieved image contains the same category as the query, the recall of the query is considered to be 1.Recall @ K was calculated by calculating the average recall for the entire test set.
1.3, the experimental results are shown in fig. 9, and the experimental summary is as follows:
the accuracy of Resnet18 extracted by AOMD in Cars196 and CUB-200-2011 data sets is 76.03% and 55.26% respectively, and the accuracy is improved by 8.2% and 3.6% respectively compared with the benchmark. Compared to Fitnet (based on response), an improvement of 4.2% and 3.4%, respectively; compared with the attention based on characteristics, the improvement is 21.4 percent and 11.2 percent respectively; compared with SP (based on the relation), the improvement is respectively 5.6 percent and 3.0 percent, and compared with ICKD, the improvement is respectively 2.8 percent and 5.0 percent.
1.4, from the above results the following can be analyzed:
(1) Attention distillation does not adapt to fine-grained classification tasks and distills poorly; relational distillation is no better than traditional response-based distillation and also performs poorly.
(2) Mapped distillation lets feature and relationship characterizations compensate for and enhance each other, and is superior to feature-based or response-based distillation alone.
(3) Mapped distillation is well suited to image classification tasks.
Further, comparing the accuracy curves of the experiments on the Cars196 and CUB-200-2011 data sets shown in figs. 10 and 11, the following can be observed: the AOMD of the invention reaches its best precision within 10 to 20 epochs. Combined with fig. 9, AOMD shows fast convergence and a significant positive gain on fine-grained classification tasks, which verifies the superiority and high performance of the AOMD method of the present invention.
2. Target detection experiment
2.1, selecting a reference data set WIDERFACE for face detection from the experimental data set.
2.2, experimental settings were as follows:
experiments multi-branch task experiments were performed on WIDERFACE, including face classification, face box regression, and face five-point regression. In the experiment, the difference of network structures is considered, the Resnet50 is set as a teacher network, and the Mobilenet V1x0.25 is set as a student network. The initial learning rate is set to be 1e-3 in the experiment, the total duration is 250epochs, the learning rate is decreased progressively when 190 and 220epochs, and the learning rate decreasing coefficient is 0.1. For the evaluation index, the detection index of Deng was used. The evaluation indexes of three difficulty levels of easy, medium and difficult are defined by gradually increasing complex samples on the basis of the detection rate of the EdgeBox.
2.3, the experimental results are shown in fig. 12, and the experimental results are summarized as follows:
experimental selection experiments were performed with distillates based on response (FT), on features (Fine-grained, zhang) and on relationships (CC, FSP, RKD, PKT). Compared with the response-based FT model, the accuracy is improved by 2% over the original student network, by about 0.4% and 3.7% over the fine-grained and Zhang attention distillation, and by about 1.5% to 4.2% over the relationship-based distillation CC, FSP, RKD and PKT.
2.4, the results were analyzed as follows:
(1) By incorporating the attention mechanism into the relationship matrix, the mapping representation is enhanced and the deficiency of relationship-based distillation in object detection is made up.
(2) AOMD compensates well between feature characterization and relationship characterization and is suitable for object detection tasks.
Further, fig. 13 shows the probability distributions of the mapping representation matrices of the teacher and student models; the trained student model exhibits a probability distribution similar to the teacher model's.
3. Ablation experiment
The ablation experiments examined the impact of the two modes (module mapping modes one and two) and the hyper-parameters (α and β) on the image classification task (Cars196, CUB-200-2011) and the object detection task (WIDERFACE).
3.1, the classification task ablation experiment is as follows:
the experimental data set selects fine-grained image classification data sets Cars196 and CUB-200-2011.
3.2, the experimental settings are as follows:
the experiment used Resnet50 as the teacher network and Resnet18 as the student network. Experiments the same experimental setup was used for both data sets. The experimental setup set the initial learning rate to 0.1, the batch size to 128, the total time to 80, the learning rate reduction factor to 0.1, all normalized using L2, where the loss calculation function all calculated the loss using L2 norm to ensure fairness. Experimental assessment indices, each using Recall @ K, once all test images are embedded by the model, each test image will be used as a query, and the top K nearest neighbor images will be retrieved from the test set that does not contain the query. If the retrieved image contains the same category as the query, the recall of the query is considered to be 1.Recall @ K was calculated by calculating the average recall for the entire test set.
3.3, the results are shown in FIG. 14, and the results are as follows:
as to modes, mode one on Cars196 is slightly higher than mode two by about 0.1%, while mode one on CUB-200-2011 is slightly lower than mode two by about 0.8%. On Cars196 and CUB-200-2011, for values of the hyper-parameters α, β, AOMD for mode one (mode two) was improved by about 7.4% (7.4%) and 0% (2.4%), respectively, for α =1 and β =0, showing a positive gain to distillation effect; the AOMD for mode one (mode two) decreased by about 14.2% (14.5%) and 8.1% (7.5%), respectively, when α =0 and β =1, indicating a negative gain in distillation effect; when α =1, β =1, AOMD of mode one (mode two) increased by about 8.2% (8.1%), 1.9% (2.7%), most suitable for distillation.
3.4, from the above results it is possible to analyse:
(1) Both AOMD modes are effective on the fine-grained classification data sets, and mode two performs best.
(2) AOMD performs best when both hyper-parameters α and β are assigned (e.g., α = 1, β = 1).
4. Detection task ablation experiment
The experimental data set selects a reference data set WIDERFACE for face detection.
4.1, experimental setup as follows:
experiments multi-branch task experiments were performed on WIDERFACE, including face classification, face frame regression, and face five point regression. In the experiment, the difference of network structures is considered, the Resnet50 is set as a teacher network, and the Mobilenet V1x0.25 is set as a student network. The initial learning rate is set to be 1e-3 in the experiment, the total duration is 250epochs, the learning rate is decreased progressively when 190 and 220epochs, and the learning rate decreasing coefficient is 0.1. For the evaluation index, the detection index of Deng was used. The evaluation indexes of three difficulty levels of easy, medium and difficult are defined by gradually increasing complex samples on the basis of the detection rate of the EdgeBox.
4.2, the results of the experiment are shown in FIG. 15, and the results are as follows:
for mode two, a little over 0.2% higher on WIDERFACE than mode one (only considering the difficult level of accuracy). For the hyperparameters α and β, the AOMD for mode one (mode two) was improved by about 1.4% (1.4%) when α =1 and β =0, exhibiting a positive distillation effect; when α =0 and β =1, the AOMD of mode one (mode two) decreases by about 1.6% (1.4%) in the precision of the difficulty level, indicating that the distillation effect is a negative gain; when α =1 and β =1, the AOMD of mode one (mode two) increases by about 2.0% (1.8%) in the precision of the difficulty level, which is the best for distillation. For memory consumption, mode two allows the addition of 20 images of 640 x 640 size to the same memory environment, 12 more batch sizes than mode one.
4.3, from the above results it is possible to analyse:
(1) Both AOMD modes are effective on the object detection data set, and mode two performs best.
(2) AOMD performs best when both hyper-parameters α and β are assigned (e.g., α = 1, β = 1).
(3) On the object detection data set, with the same computing resources, AOMD mode two can train larger batches than AOMD mode one and achieves a better distillation effect.
The invention provides a mapping type distillation method and a training method thereof, which are used for embedding pixel-level relation information between adjacent modules based on an attention mechanism into a mapping representation diagram, transmitting the information from a teacher model to a student model and restricting and supervising training and learning of the student model. The method has excellent performance in classification and detection tasks, and well combines the advantages of relationship information and attention information, so as to make up and enhance the expression capacity of feature characterization and relationship characterization.
The mapped distillation method provided by the invention handles both image classification and object detection tasks with excellent performance on each; compared with existing distillation methods it achieves better performance while markedly reducing the computational workload, and it also provides two optional module mapping methods that can accept larger image inputs and larger batch sizes, improving the applicability of the method.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts in the embodiments are referred to each other. For each of the above embodiments, since they are substantially similar to the method embodiments, the description is simple, and reference may be made to the partial description of the method embodiments for relevant points.
The present invention has been described in detail with reference to the foregoing embodiments, and the principles and embodiments of the present invention have been described herein with reference to specific examples, which are provided only to assist understanding of the methods and core concepts of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (9)

1. A network-structure-oriented mapped distillation method, characterized in that a pre-trained object detection model serves as the basic data carrier, the method comprising the following processes:
forming a teacher model and a student model according to the pre-trained target detection model;
respectively extracting feature maps of modules of each layer of the teacher model and the student model;
carrying out attention mapping on the extracted feature map to obtain a feature information representation map with attention information;
carrying out module mapping on the characteristic information representation graph to obtain a mapping representation graph with pixel level relation information;
a mapping profile containing attention information and pixel level relationship information is distilled from the teacher model to the student model.
2. The mapped distillation method of claim 1, wherein: the attention map includes a global attention map and a local attention map, and the module map includes a global module map and a local module map.
3. The mapped distillation process of claim 1 or 2, wherein: the attention mapping of the feature map comprises the following processes:
carrying out global attention mapping on the characteristic information representation diagrams of the student model and the teacher model to obtain global characteristic information, and respectively obtaining the global characteristic information representation diagrams of the student model and the teacher model;
and carrying out local attention mapping on the characteristic information representation diagrams of the student model and the teacher model to obtain local characteristic information, and respectively obtaining the local characteristic information representation diagrams of the student model and the teacher model.
4. The mapped distillation process of claim 1 or 2, wherein: the module mapping of the characteristic information representation graph comprises the following processes:
carrying out global module mapping on the global characteristic information representation diagrams of the student model and the teacher model to obtain global characteristic information and pixel level relation information, and respectively obtaining global mapping representation diagrams of the student model and the teacher model;
and carrying out local module mapping on the local characteristic information representation diagrams of the student model and the teacher model to obtain local characteristic information and pixel level relation information, and respectively obtaining the local mapping representation diagrams of the student model and the teacher model.
5. The mapped distillation process of claim 1 or 2, wherein: the module mapping of the characteristic information representation graph further comprises the following processes:
carrying out global module mapping on the global characteristic information representation diagrams of the student model and the teacher model to obtain global characteristic information and pixel level relation information, and respectively obtaining global mapping representation diagrams of the student model and the teacher model;
local module mapping is carried out on the local characteristic information representation diagrams of the student model and the teacher model to obtain local characteristic information and pixel level relation information, and local mapping representation diagrams of the student model and the teacher model are obtained respectively;
and carrying out dimension compression on the global and local mapping representation graphs of the student model and the teacher model.
6. The mapped distillation process of claim 4 or 5, wherein: the module mapping process of the characteristic information representation diagram further comprises the following steps:
the module maps to obtain global and local mapping representation maps of the student model and the teacher model, combines the loss function, restrains and punishes the student model for training, and simulates learning from the student model to the teacher model to complete mapping type distillation.
7. The mapped distillation process of claim 4, 5 or 6, wherein: the module mapping process of the characteristic information representation diagram further comprises the following steps:
and training by taking the mapping representation of the teacher model as a label to supervise and constrain the student model, and adding the loss function of the mapping distillation into the original loss function of the student model to balance the training error together, so that the robustness of the model is enhanced.
8. The mapped distillation process of claim 1, wherein: the object detection model may also be an image classification model.
9. A training method for a mapping distillation method facing a network structure is characterized by comprising the following processes:
acquiring a pre-trained target detection model or an image classification model to form a teacher model and a student model;
extracting the feature maps of the feature layer modules from the teacher model and the student model, and feeding the feature maps into the mapped distillation method of any one of claims 1 to 8 to train the student model.
CN202210507030.6A 2022-05-10 2022-05-10 Network structure-oriented mapping type distillation method and training method thereof Pending CN115577793A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210507030.6A CN115577793A (en) 2022-05-10 2022-05-10 Network structure-oriented mapping type distillation method and training method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210507030.6A CN115577793A (en) 2022-05-10 2022-05-10 Network structure-oriented mapping type distillation method and training method thereof

Publications (1)

Publication Number Publication Date
CN115577793A true CN115577793A (en) 2023-01-06

Family

ID=84579454

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210507030.6A Pending CN115577793A (en) 2022-05-10 2022-05-10 Network structure-oriented mapping type distillation method and training method thereof

Country Status (1)

Country Link
CN (1) CN115577793A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116051935A (en) * 2023-03-03 2023-05-02 北京百度网讯科技有限公司 Image detection method, training method and device of deep learning model
CN116051935B (en) * 2023-03-03 2024-03-22 北京百度网讯科技有限公司 Image detection method, training method and device of deep learning model

Similar Documents

Publication Publication Date Title
CN113221911B (en) Vehicle weight identification method and system based on dual attention mechanism
CN104866810A (en) Face recognition method of deep convolutional neural network
CN111639524B (en) Automatic driving image semantic segmentation optimization method
CN113177559B (en) Image recognition method, system, equipment and medium combining breadth and dense convolutional neural network
CN112115967B (en) Image increment learning method based on data protection
CN113420775A (en) Image classification method under extremely small quantity of training samples based on adaptive subdomain field adaptation of non-linearity
CN116110022B (en) Lightweight traffic sign detection method and system based on response knowledge distillation
CN114067385A (en) Cross-modal face retrieval Hash method based on metric learning
CN111274424A (en) Semantic enhanced hash method for zero sample image retrieval
CN115546196A (en) Knowledge distillation-based lightweight remote sensing image change detection method
CN116912585A (en) SAR target recognition method based on self-supervision learning and knowledge distillation
CN116796810A (en) Deep neural network model compression method and device based on knowledge distillation
CN115577793A (en) Network structure-oriented mapping type distillation method and training method thereof
Pang et al. Multihead attention mechanism guided ConvLSTM for pixel-level segmentation of ocean remote sensing images
CN117079276B (en) Semantic segmentation method, system, equipment and medium based on knowledge distillation
CN113706551A (en) Image segmentation method, device, equipment and storage medium
CN116797821A (en) Generalized zero sample image classification method based on fusion visual information
CN115830401A (en) Small sample image classification method
CN116306969A (en) Federal learning method and system based on self-supervision learning
CN114972282A (en) Incremental learning non-reference image quality evaluation method based on image semantic information
CN111563413B (en) Age prediction method based on mixed double models
Jain et al. Flynet–neural network model for automatic building detection from satellite images
CN113420821A (en) Multi-label learning method based on local correlation of labels and features
CN117036698B (en) Semantic segmentation method based on dual feature knowledge distillation
CN116501908B (en) Image retrieval method based on feature fusion learning graph attention network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination