CN113326941A - Knowledge distillation method, device and equipment based on multilayer multi-attention migration - Google Patents

Knowledge distillation method, device and equipment based on multilayer multi-attention migration

Info

Publication number
CN113326941A
Authority
CN
China
Prior art keywords
network
output
loss function
layer
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110713825.8A
Other languages
Chinese (zh)
Inventor
苟建平
孙立媛
欧卫华
陈潇君
夏书银
柯佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu University
Original Assignee
Jiangsu University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu University
Priority to CN202110713825.8A
Publication of CN113326941A
Priority to CN202210535550.8A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/02Knowledge representation; Symbolic representation
    • G06N5/022Knowledge engineering; Knowledge acquisition
    • G06N5/025Extracting rules from data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of knowledge distillation, and discloses a knowledge distillation method, a knowledge distillation device and knowledge distillation equipment based on multilayer multi-attention migration, wherein the knowledge distillation method comprises the following steps: constructing an untrained student network and a pre-trained teacher network; inputting training data into a student network and a teacher network to obtain a first output characteristic set of each intermediate layer of the student network and a second output characteristic set of each intermediate layer of the teacher network; determining a distillation loss function based on the first set of output characteristics and the second set of output characteristics; and iteratively training the student network based on the distillation loss function. According to the invention, different types of attention knowledge are explored in different intermediate layers of the deep neural network for migration, so that the learning of the student network is effectively guided, and the performance and generalization capability of the student network are improved.

Description

Knowledge distillation method, device and equipment based on multilayer multi-attention migration
Technical Field
The invention relates to the technical field of knowledge distillation, in particular to a knowledge distillation method, a knowledge distillation device and knowledge distillation equipment based on multilayer multi-attention migration.
Background
In recent years, deep neural networks have made breakthroughs in many fields such as computer vision, speech recognition, and natural language processing, and increasing effort has been devoted to improving their performance. However, to achieve better performance, large deep neural networks are designed with hundreds of layers and millions of parameters. As the number of parameters increases, so does the complexity of the model, which raises runtime and storage costs and makes it difficult to deploy such a large model on a mobile or embedded device. Therefore, various model compression techniques have been proposed to obtain lightweight models that perform well and are easy to deploy. Knowledge distillation has attracted wide attention because it is easy to implement, simple to operate, and can effectively improve the performance of a small model.
Knowledge distillation refers to extracting knowledge from a large-scale teacher network and transferring it to a small-scale student network during training, thereby improving the performance of the student network. In this way, a small neural network model with better performance is obtained, which can then be conveniently deployed on a mobile terminal. Since the intermediate layers of a neural network contain a large number of parameters and a complex structure, many recent knowledge distillation schemes for deep neural networks migrate the feature representations of the intermediate layers as knowledge during training. Migrating the knowledge of the teacher network's output layer to the student network is like directly telling students the answer to a problem, whereas migrating the knowledge of the teacher network's intermediate layers is like teaching students the method for solving the problem; migrating intermediate-layer knowledge is therefore undoubtedly a more effective way to improve the generalization ability of the student network. Among the various knowledge distillation methods based on intermediate-layer features, attention-based knowledge has received particular interest, because it enables the student network to obtain good performance and offers great flexibility with respect to the structures of the teacher and student networks (i.e., their spatial dimensions and numbers of features do not need to be the same).
Among them, Zagoruyko et al. propose an attention transfer scheme based on attention maps: during training, either the activation-based or the gradient-based attention maps of the teacher network are migrated to the student network as knowledge, and experiments verify that migrating activation-based attention maps gives the student network better performance than migrating gradient-based ones.
Dong et al. adopt a framework in which multiple teachers teach one student and define one of the teacher networks as an expert network, which migrates its activation-based attention knowledge to the student network during training. However, training multiple teacher networks incurs a significant time cost.
Kundu et al. combine two model compression approaches, knowledge distillation and parameter pruning, and migrate activation-based attention knowledge to the student network during training.
Among them, Qu et al. propose a hybrid attention knowledge transfer method in which, during training, the teacher network simultaneously migrates its activation-based attention knowledge and its channel-based attention knowledge to the student network. The authors argue that the two kinds of knowledge complement each other and, acting together, better improve the performance of the student network.
Among other things, Chen et al designed a new convolutional neural network, called SCA-CNN, into which both space-based attention and channel-based attention are integrated.
However, the existing attention-based knowledge distillation methods described above all migrate the same kind of attention knowledge at different layers of the network, ignoring the hierarchical structure of representation learning in deep neural networks, which limits the performance and generalization ability of the student network.
Disclosure of Invention
In view of the above technical problems, and in order to further improve the performance and generalization ability of the student network, the invention provides a knowledge distillation method, device and equipment based on multilayer multi-attention migration, which explore different types of attention knowledge in different intermediate layers of a deep neural network for migration, so as to effectively guide the learning of the student network and improve its performance and generalization ability. The technical schemes are specifically as follows:
a knowledge distillation method based on multi-layer multi-attention migration, comprising:
constructing an untrained student network and a pre-trained teacher network;
inputting training data into a student network and a teacher network to obtain a first output characteristic set of each intermediate layer of the student network and a second output characteristic set of each intermediate layer of the teacher network;
determining a distillation loss function based on the first set of output characteristics and the second set of output characteristics;
and iteratively training the student network based on the distillation loss function.
A knowledge distillation apparatus based on multi-layer multi-attention migration, comprising:
the model building module is used for building an untrained student network and a pre-trained teacher network;
the output characteristic acquisition module is used for inputting the training data into the student network and the teacher network to acquire a first output characteristic set of each middle layer of the student network and a second output characteristic set of each middle layer of the teacher network;
a loss function determination module to determine a distillation loss function based on the first set of output characteristics and the second set of output characteristics;
and the network training module is used for carrying out iterative training on the student network based on the distillation loss function.
A computer device comprising a memory in which a computer program is stored and a processor which, when executing the computer program, carries out the steps of the above-described knowledge distillation method based on multi-layer multi-attention migration.
A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the above-described multi-layer multi-attention-migration based knowledge distillation method.
Compared with the prior art, the invention has the beneficial effects that:
the invention effectively utilizes a plurality of attention knowledge and carries out knowledge migration under a unified framework. In training a student network, various layers thereof may receive attention knowledge from different layers of the teacher network. In order to transfer rich knowledge existing in a teacher network to a student network, a plurality of teacher networks are generally adopted in the training process, however, it is very difficult to determine what knowledge each teacher is suitable for transferring, and in addition, the training of a large teacher network is expensive in time. Therefore, the invention only uses one teacher network, extracts different attention knowledge from different middle layers of the teacher network in the training process, and then migrates the attention knowledge to the student network, thereby saving time and cost and simultaneously leading the student network to obtain better performance.
Drawings
The present application will be further explained by way of exemplary embodiments, which will be described in detail by way of the accompanying drawings, in which:
FIG. 1 is a schematic flow diagram of a knowledge distillation process based on multi-layer multi-attention migration.
FIG. 2 is a schematic diagram of the basic framework of the distillation method based on the knowledge of multi-layer multi-attention migration.
FIG. 3 is a graphical representation of the performance of a knowledge distillation process based on multi-layer multi-attention migration compared to other knowledge distillation processes.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings of the embodiments of the present disclosure. It is to be understood that the described embodiments are only a few embodiments of the present disclosure, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the described embodiments of the disclosure without any inventive step, are within the scope of protection of the disclosure.
The application aims to provide a knowledge distillation method, a device, equipment and a medium based on multilayer multi-attention migration, wherein the method comprises the following steps: constructing an untrained student network and a pre-trained teacher network; inputting training data into a student network and a teacher network to obtain a first output characteristic set of each intermediate layer of the student network and a second output characteristic set of each intermediate layer of the teacher network; determining a distillation loss function based on the first set of output characteristics and the second set of output characteristics; and iteratively training the student network based on the distillation loss function.
The embodiments of the present application may be applied to various application scenarios including, but not limited to, various scenarios in the field of computer vision applications, such as face recognition, image classification, object detection, semantic segmentation, etc., or to neural network model-based processing systems deployed on edge devices (e.g., mobile phones, wearable devices, computing nodes, etc.), or to application scenarios for speech signal processing, natural language processing, recommendation systems, or to application scenarios requiring compression of neural network models due to limited resources and latency requirements.
For illustrative purposes only, the embodiments of the present application may be applied to an application scenario of object detection on a mobile phone. The technical problem to be solved in this scenario is as follows: when a user takes a picture with a mobile phone, objects such as human faces and animals need to be captured automatically to help the phone focus and beautify the image automatically, so a small and fast convolutional neural network model for object detection is needed, which brings a better user experience and improves the quality of the mobile phone product.
For illustrative purposes only, the embodiments of the present application may also be used in an application scenario of autonomous driving scene segmentation. The technical problem to be solved in this scenario is as follows: after the camera of an autonomous vehicle captures a road image, the image needs to be segmented to separate different objects such as the road surface, roadbed, vehicles and pedestrians, so as to keep the vehicle driving in the correct area. A convolutional neural network model that can interpret and semantically segment a picture correctly in real time is therefore needed.
For illustrative purposes only, the embodiments of the present application may also be used in an application scenario of entrance gate face verification. The technical problem to be solved in this scenario is as follows: when passengers perform face authentication at the gates of high-speed rail stations, airports and the like, a camera captures a face image, features are extracted with a convolutional neural network, and the similarity between these features and the identity-document image features stored in the system is computed; if the similarity is high, the verification succeeds. Extracting features with the convolutional neural network is the most time-consuming step, so an efficient convolutional neural network model that can perform face verification and feature extraction quickly is required.
For illustrative purposes only, the embodiments of the present application may also be used in application scenarios of simultaneous speech interpretation. The technical problem to be solved in this scenario is as follows: for speech recognition and machine translation, speech must be recognized and translated in real time, so an efficient convolutional neural network model is required.
The embodiments of the present application may be modified and improved according to specific application environments, and are not limited herein.
In order to enable those skilled in the art to better understand the solutions of the present application, embodiments of the present application will be described below with reference to the accompanying drawings.
Referring to fig. 1, in the present embodiment, a knowledge distillation method based on multi-layer multi-attention migration includes:
s101, constructing an untrained student network and a pre-trained teacher network;
s102, inputting training data into a student network and a teacher network to obtain a first output characteristic set of each middle layer of the student network and a second output characteristic set of each middle layer of the teacher network;
wherein the training data is input data used in the training. Preferably, the training data can be preprocessed according to the input formats of the input layers of the teacher network and the student network to obtain the regularized training data;
the specific content of the training data is related to specific application scenarios of the teacher network and the student network, for example: in an application scene of object classification, the training data can be the feature data of a pre-selected sample object; in an application scenario of image classification, the training data may be a sample picture;
among them, generally, the intermediate layer of the neural network is composed of a convolutional layer, BN, Relu, or the like;
among them, generally, a neural network is composed of an intermediate layer, a pooling layer, and a full connection layer.
S103, determining a distillation loss function based on the first output characteristic set and the second output characteristic set;
and S104, iteratively training the student network based on the distillation loss function.
The distillation loss function is used to update and optimize the parameters of the student network. In each iteration of training, the distillation loss is reduced, for example by minimizing the loss function, and the parameters of the student network are updated accordingly; after many iterations of training, the parameter values of the student network gradually converge. This training process is a supervised learning process.
In this embodiment, multiple kinds of attention knowledge are effectively utilized and knowledge migration is performed under a unified framework. When training the student network, its various layers can receive attention knowledge from different layers of the teacher network. To transfer the rich knowledge in a teacher network to a student network, multiple teacher networks are usually adopted during training; however, it is very difficult to determine what knowledge each teacher is suited to transfer, and training large teacher networks is expensive in time. Therefore, the invention uses only one teacher network, extracts different attention knowledge from its different intermediate layers during training, and then migrates that knowledge to the student network, which saves time and cost while enabling the student network to obtain better performance.
Referring to fig. 2, in some embodiments, the intermediate layers of the student network and the teacher network each include lower-level intermediate layers, middle-level intermediate layers, and higher-level intermediate layers, and determining the distillation loss function based on the first set of output features and the second set of output features includes:
determining a first loss function based on the output features of the first output feature set and the second output feature set that are located in the lower-level intermediate layers;
wherein the lower-level intermediate layers are the set of intermediate layers of the student network and the teacher network used for extracting edge features;
specifically, edge features are features that essentially all samples in the training data possess, such as vertical edge features, horizontal edge features, colors, positions, and other local features;
determining a second loss function based on the output features of the first output feature set and the second output feature set that are located in the middle-level and higher-level intermediate layers;
wherein the middle-level intermediate layers are the set of intermediate layers of the student network and the teacher network used for extracting local features;
specifically, local features are, to some extent, recombinations of edge features, which makes them closer to the attributes of the samples in the training data; as a result, the middle-level features of samples from different training data may differ but may also coincide. For example, the local features that the neural network extracts from a picture of wire mesh will certainly differ from those it extracts from a picture of a cat, whereas the local features extracted from a picture of a dog and from a picture of a cat may be the same;
determining a third loss function based on the output features of the first output feature set and the second output feature set that are located in the higher-level intermediate layers;
wherein the higher-level intermediate layers are the set of intermediate layers of the student network and the teacher network used for extracting global features;
specifically, a global feature is a recombination of local features; precisely because a global feature can highly summarize the attributes of a sample in the training data, global features tend to differ from sample to sample;
and weighting and summing the first loss function, the second loss function, and the third loss function to obtain the distillation loss function.
In this embodiment, it is first noted that the structure of a deep neural network is very complex: it is built by stacking different intermediate layers, and different intermediate layers focus on different information. For example, the lower-level intermediate layers focus on the edge and position information of a sample, the middle-level intermediate layers focus on local information, and the higher-level intermediate layers focus more on the sample as a whole. Therefore, different attention knowledge suited to the lower-level, middle-level, and higher-level intermediate layers of the neural network needs to be studied for these different feature representations. In the knowledge distillation scheme based on multilayer multi-attention migration of the present application, three kinds of attention knowledge are considered and three corresponding loss functions are obtained from them, namely position-based attention knowledge, activation-based attention knowledge, and channel-based attention knowledge.
Specifically, the determining a first loss function based on the output features of the first output feature set and the second output feature set, which are located in the lower-level intermediate layers, includes:
obtaining the output features A_i^S of the first output feature set located in the lower-level intermediate layers and the output features A_i^T of the second output feature set located in the lower-level intermediate layers, wherein an output feature A ∈ R^(C×H×W) consists of C feature planes of spatial size H×W;
pooling A_i^S and A_i^T along the two spatial dimensions W and H to extract the features z_i^(S,H), z_i^(S,W), z_i^(T,H) and z_i^(T,W), specifically:
z_i^(S,H) = Σ_W A_i^S,  z_i^(S,W) = Σ_H A_i^S
z_i^(T,H) = Σ_W A_i^T,  z_i^(T,W) = Σ_H A_i^T
wherein Σ_W and Σ_H represent summation along the H and W spatial dimensions;
connecting the pooled features through a 1×1 convolution transfer function F, specifically:
f_i^S = θ(F([z_i^(S,H), z_i^(S,W)])),  f_i^T = θ(F([z_i^(T,H), z_i^(T,W)]))
wherein [·,·] denotes the concatenation operation along the spatial dimension, and θ(·) denotes a nonlinear activation function;
obtaining the first loss function based on the position-based attention knowledge, specifically:
L_PKT = Σ_{i∈I} || f_i^S/||f_i^S||_2 - f_i^T/||f_i^T||_2 ||_2
wherein L_PKT represents the first loss function, and I represents the set of lower-level intermediate layers of the student network and the teacher network.
In designing the first loss function, it is considered that the deep neural network detects edge features (such as object edges and positions) in its lower-level intermediate layers, so position-based attention knowledge is migrated at the lower-level intermediate layers of the network during training.
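For illustrative purposes only, a minimal sketch of how the position-based attention knowledge and the first loss function might be computed for one pair of lower-level intermediate layers is given below. Sharing the 1×1 transfer function F between the two networks, using an L2 distance between L2-normalized descriptors, and requiring the paired features to have matching sizes are assumptions made for this sketch, not details fixed by the present application.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PositionAttentionLoss(nn.Module):
    """Sketch of the position-based attention loss for one lower-level layer pair
    (assumes student and teacher features of the pair share the same shape)."""

    def __init__(self, channels, hidden=32):
        super().__init__()
        self.transfer = nn.Conv2d(channels, hidden, kernel_size=1)  # 1x1 transfer function F
        self.act = nn.ReLU()                                        # nonlinear activation theta(.)

    def descriptor(self, a):                                # a: (B, C, H, W)
        z_h = a.sum(dim=3, keepdim=True)                    # sum over W -> (B, C, H, 1)
        z_w = a.sum(dim=2, keepdim=True).transpose(2, 3)    # sum over H -> (B, C, W, 1)
        z = torch.cat([z_h, z_w], dim=2)                    # concatenate along the spatial dimension
        f = self.act(self.transfer(z))                      # theta(F([z_h, z_w]))
        return F.normalize(f.flatten(1), dim=1)             # L2-normalized position descriptor

    def forward(self, a_student, a_teacher):
        d_s, d_t = self.descriptor(a_student), self.descriptor(a_teacher)
        return (d_s - d_t).norm(p=2, dim=1).mean()
```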
Specifically, the determining the second loss function based on the output features of the middle-level and higher-level intermediate layers in the first output feature set and the second output feature set includes:
obtaining the output features A_j^S of the first output feature set located in the middle-level and higher-level intermediate layers and the output features A_j^T of the second output feature set located in the middle-level and higher-level intermediate layers, wherein an output feature A ∈ R^(C×H×W) consists of C feature planes of spatial size H×W;
calculating the attention maps of the student network and the teacher network, specifically:
Q_j^S = vec(Σ_C A_j^S),  Q_j^T = vec(Σ_C A_j^T)
wherein Σ_C represents summation along the channel dimension, and vec(·) represents vectorization;
obtaining the second loss function based on the activation-based attention knowledge, specifically:
L_AKT = Σ_{j∈J} || Q_j^S/||Q_j^S||_2 - Q_j^T/||Q_j^T||_2 ||_2
wherein L_AKT represents the second loss function, and J represents the set of middle-level and higher-level intermediate layers of the student network and the teacher network.
In designing the second loss function, the middle-level and higher-level intermediate layers progressively combine the edge features into a more complete representation, so the student network uses the widely used activation-based attention knowledge to mimic the deeper-layer features of the teacher network and attend to the regions on which the teacher network focuses;
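For illustrative purposes only, the activation-based attention knowledge and the second loss function might be sketched as follows; summing the feature map along the channel dimension and vectorizing it follows the description above, while the L2 distance between L2-normalized attention maps is an assumption borrowed from activation-based attention transfer.

```python
import torch.nn.functional as F

def activation_attention_loss(a_student, a_teacher):
    """Sketch of the activation-based attention loss for one middle- or higher-level layer pair
    (assumes the two feature maps have the same spatial size)."""
    q_s = F.normalize(a_student.sum(dim=1).flatten(1), dim=1)   # vec(sum over channels of A^S)
    q_t = F.normalize(a_teacher.sum(dim=1).flatten(1), dim=1)   # vec(sum over channels of A^T)
    return (q_s - q_t).norm(p=2, dim=1).mean()
```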
specifically, the determining the third loss function based on the output features of the first output feature set and the second output feature set, which are located in the higher-layer middle layer, includes:
obtaining the output characteristics of the first output characteristic set positioned in the middle layer of the high layer
Figure BDA0003133992140000076
And the output features of the second output feature set located in the higher-level middle layer
Figure BDA0003133992140000077
The method is characterized in that the output characteristics of the student network and the teacher network are subjected to global average pooling, and specifically comprises the following steps:
Figure BDA0003133992140000078
Figure BDA0003133992140000079
wherein G represents a global average pooling layer;
acquiring a third loss function based on the attention knowledge of the channel, specifically:
Figure BDA00031339921400000710
wherein L isCKTRepresenting a third loss function, and K represents a higher-level middle layer of the student network and the teacher network.
In designing the third loss function, it is noted that activation-based attention knowledge is obtained by summing the feature maps output by an intermediate layer along the channel dimension, so knowledge of the channel dimension is lost in that migration; the channel-based attention knowledge migrated at the higher-level intermediate layers supplies this missing channel information.
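For illustrative purposes only, the channel-based attention knowledge and the third loss function might be sketched as follows; global average pooling follows the description above, while the L2 distance between L2-normalized channel descriptors and the matching channel counts are assumptions of this sketch.

```python
import torch.nn.functional as F

def channel_attention_loss(a_student, a_teacher):
    """Sketch of the channel-based attention loss for one higher-level layer pair
    (assumes the two feature maps have the same number of channels)."""
    v_s = F.normalize(F.adaptive_avg_pool2d(a_student, 1).flatten(1), dim=1)  # G(A^S)
    v_t = F.normalize(F.adaptive_avg_pool2d(a_teacher, 1).flatten(1), dim=1)  # G(A^T)
    return (v_s - v_t).norm(p=2, dim=1).mean()
```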
Specifically, regarding the numbers of lower-level, middle-level, and higher-level intermediate layers used in the first loss function L_PKT, the second loss function L_AKT, and the third loss function L_CKT: when the teacher network and the student network have different numbers of intermediate layers, the same number of corresponding intermediate layers can be selected on each side to establish the loss functions, so the number of intermediate layers of the teacher network does not need to equal that of the student network.
Finally, the first loss function L_PKT, the second loss function L_AKT, and the third loss function L_CKT are weighted and summed to obtain the distillation loss function, specifically:
L_total = α·L_PKT + β·L_AKT + γ·L_CKT
wherein L_total represents the distillation loss function, and α, β, and γ represent the weight coefficients.
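For illustrative purposes only, the following sketch assembles the three losses into the distillation loss and iteratively trains the student network (step S104). The weight coefficients, layer groupings, channel count, optimizer settings, and the addition of an ordinary cross-entropy task loss are assumptions of the sketch; student, teacher and train_loader stand for the networks and training data described above, and collect_features, PositionAttentionLoss, activation_attention_loss and channel_attention_loss refer to the earlier sketches.

```python
import torch
import torch.nn.functional as F

# Hypothetical layer groupings and weight coefficients.
lower, middle, higher = ["layer1"], ["layer2"], ["layer3"]
alpha, beta, gamma = 0.5, 0.5, 0.5

position_loss = PositionAttentionLoss(channels=64)   # channel count is an assumption
optimizer = torch.optim.SGD(list(student.parameters()) + list(position_loss.parameters()),
                            lr=0.05, momentum=0.9)

for images, labels in train_loader:
    all_layers = lower + middle + higher
    s_feats = collect_features(student, all_layers, images)
    with torch.no_grad():
        t_feats = collect_features(teacher, all_layers, images)

    l_pkt = sum(position_loss(s_feats[n], t_feats[n]) for n in lower)
    l_akt = sum(activation_attention_loss(s_feats[n], t_feats[n]) for n in middle + higher)
    l_ckt = sum(channel_attention_loss(s_feats[n], t_feats[n]) for n in higher)
    l_total = alpha * l_pkt + beta * l_akt + gamma * l_ckt     # distillation loss

    loss = F.cross_entropy(student(images), labels) + l_total  # task loss + L_total (assumed combination;
    optimizer.zero_grad()                                      # the second forward pass is for brevity only)
    loss.backward()
    optimizer.step()
```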
The multi-layer attention migration based knowledge distillation method of the present invention will be further described with reference to experimental data as follows:
referring to fig. 3, a CIFAR-100 dataset is used as training data and is input into the knowledge distillation method based on multilayer multi-attention migration and other knowledge distillation schemes of the present application, and a comparison diagram of the accuracy of the knowledge distillation method based on multilayer multi-attention migration and other knowledge distillation methods in fig. 3 is finally obtained, and it can be seen from fig. 3 that the performance of the neural network is obviously improved compared with other methods after the training by the method provided by the present invention.
In particular, HMAT in fig. 3 denotes the knowledge distillation method based on multilayer multi-attention migration of the present application;
Specifically, the curve of the present method is the topmost in fig. 3, which shows that its accuracy is the highest among the compared knowledge distillation schemes;
specifically, other methods of knowledge distillation in FIG. 3 include H-AT, CCKD, VID, KDAFM, AT and KD.
In some embodiments, the present application further discloses a knowledge distillation apparatus based on multi-layer multi-attention migration, comprising:
the model building module is used for building an untrained student network and a pre-trained teacher network;
the output characteristic acquisition module is used for inputting the training data into the student network and the teacher network to acquire a first output characteristic set of each middle layer of the student network and a second output characteristic set of each middle layer of the teacher network;
a loss function determination module to determine a distillation loss function based on the first set of output characteristics and the second set of output characteristics;
and the network training module is used for carrying out iterative training on the student network based on the distillation loss function.
In some embodiments, the present application further discloses a computer device, which is characterized by comprising a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to realize the steps of the knowledge distillation method based on multi-layer multi-attention migration.
The computer device may be a desktop computer, a notebook, a palm computer, a cloud server, or other computing devices. The computer equipment can carry out man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch panel or voice control equipment and the like.
The memory includes at least one type of readable storage medium including a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or D interface display memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the storage may be an internal storage unit of the computer device, such as a hard disk or a memory of the computer device. In other embodiments, the memory may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), or the like provided on the computer device. Of course, the memory may also include both internal and external storage devices of the computer device. In this embodiment, the memory is used for storing an operating system and various application software installed in the computer device, such as program codes of a knowledge distillation method based on multi-layer multi-attention migration. In addition, the memory may also be used to temporarily store various types of data that have been output or are to be output.
The processor may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor is typically used to control the overall operation of the computer device. In this embodiment, the processor is configured to execute the program code stored in the memory or process data, for example, execute the program code of the knowledge distillation method based on multi-layer multi-attention migration.
In some embodiments, the present application further discloses a computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and when executed by a processor, the computer program implements the steps of the knowledge distillation method based on multi-layer multi-attention migration.
The computer-readable storage medium stores a computer program executable by at least one processor, so as to cause the at least one processor to perform the steps of the knowledge distillation method based on multi-layer multi-attention migration as described above.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present application.
The above is an embodiment of the present invention. The embodiments and specific parameters in the embodiments are only used for clearly illustrating the verification process of the invention and are not used for limiting the patent protection scope of the invention, which is defined by the claims, and all the equivalent structural changes made by using the contents of the description and the drawings of the present invention should be included in the protection scope of the present invention.

Claims (9)

1. Knowledge distillation method based on multilayer multi-attention migration is characterized by comprising the following steps:
constructing an untrained student network and a pre-trained teacher network;
inputting training data into the student network and the teacher network to obtain a first output feature set of each middle layer of the student network and a second output feature set of each middle layer of the teacher network;
determining a distillation loss function based on the first set of output characteristics and the second set of output characteristics;
iteratively training the student network based on the distillation loss function.
2. The multi-tier, multi-attention-migration based knowledge distillation method of claim 1, wherein the intermediate tiers of the student network and the teacher network each include a lower intermediate tier, an intermediate middle tier, and an upper intermediate tier, and wherein determining a distillation loss function based on the first set of output features and the second set of output features comprises:
determining a first loss function based on output features of the first and second sets of output features that are located in the lower intermediate layer;
determining a second loss function based on output features of the first and second sets of output features that are located in the middle and upper middle layers;
determining a third loss function based on the output features of the first and second sets of output features that are located at the higher-level middle layer;
and weighting and summing the first loss function, the second loss function and the third loss function to obtain a distillation loss function.
3. The knowledge distillation method based on multi-layer multi-attention migration according to claim 2, characterized in that:
the low-level middle layer is a set of middle layers used for extracting edge features by the student network and the teacher network;
the middle layer is a set of middle layers used for extracting local features by the student network and the teacher network;
the high-level middle layer is a set of middle layers used for extracting global features by the student network and the teacher network.
4. The method of knowledge distillation based on multi-layer multi-attention migration according to claim 2, wherein determining a first loss function based on the output features of the first and second sets of output features that are located in the lower middle layer comprises:
obtaining the output features A_i^S of the first output feature set located in the lower-level intermediate layers and the output features A_i^T of the second output feature set located in the lower-level intermediate layers, wherein an output feature A ∈ R^(C×H×W) consists of C feature planes of spatial size H×W;
pooling A_i^S and A_i^T along the two spatial dimensions W and H to extract the features z_i^(S,H), z_i^(S,W), z_i^(T,H) and z_i^(T,W), specifically:
z_i^(S,H) = Σ_W A_i^S,  z_i^(S,W) = Σ_H A_i^S
z_i^(T,H) = Σ_W A_i^T,  z_i^(T,W) = Σ_H A_i^T
wherein Σ_W and Σ_H represent summation along the H and W spatial dimensions;
connecting the pooled features through a 1×1 convolution transfer function F, specifically:
f_i^S = θ(F([z_i^(S,H), z_i^(S,W)])),  f_i^T = θ(F([z_i^(T,H), z_i^(T,W)]))
wherein [·,·] denotes the concatenation operation along the spatial dimension, and θ(·) denotes a nonlinear activation function;
obtaining the first loss function based on the position-based attention knowledge, specifically:
L_PKT = Σ_{i∈I} || f_i^S/||f_i^S||_2 - f_i^T/||f_i^T||_2 ||_2
wherein L_PKT represents the first loss function, and I represents the set of lower-level intermediate layers of the student network and the teacher network.
5. The method of knowledge distillation based on multi-layer multi-attention migration of claim 2, wherein determining a second loss function based on the output features of the middle and upper intermediate layers in the first and second sets of output features comprises:
obtaining the output features A_j^S of the first output feature set located in the middle-level and higher-level intermediate layers and the output features A_j^T of the second output feature set located in the middle-level and higher-level intermediate layers, wherein an output feature A ∈ R^(C×H×W) consists of C feature planes of spatial size H×W;
calculating the attention maps of the student network and the teacher network, specifically:
Q_j^S = vec(Σ_C A_j^S),  Q_j^T = vec(Σ_C A_j^T)
wherein Σ_C represents summation along the channel dimension, and vec(·) represents vectorization;
obtaining the second loss function based on the activation-based attention knowledge, specifically:
L_AKT = Σ_{j∈J} || Q_j^S/||Q_j^S||_2 - Q_j^T/||Q_j^T||_2 ||_2
wherein L_AKT represents the second loss function, and J represents the set of middle-level and higher-level intermediate layers of the student network and the teacher network.
6. The multi-layer multi-attention-migration based knowledge distillation method of claim 2, wherein determining a third loss function based on the output features of the first and second sets of output features at the upper-layer middle layer comprises:
obtaining the output features A_k^S of the first output feature set located in the higher-level intermediate layers and the output features A_k^T of the second output feature set located in the higher-level intermediate layers;
performing global average pooling on the output features of the student network and the teacher network, specifically:
v_k^S = G(A_k^S),  v_k^T = G(A_k^T)
wherein G represents a global average pooling layer;
obtaining the third loss function based on the channel-based attention knowledge, specifically:
L_CKT = Σ_{k∈K} || v_k^S/||v_k^S||_2 - v_k^T/||v_k^T||_2 ||_2
wherein L_CKT represents the third loss function, and K represents the set of higher-level intermediate layers of the student network and the teacher network.
7. A knowledge distillation apparatus based on multi-layer multi-attention migration, comprising:
the model building module is used for building an untrained student network and a pre-trained teacher network;
the output characteristic acquisition module is used for inputting training data into the student network and the teacher network to acquire a first output characteristic set of each middle layer of the student network and a second output characteristic set of each middle layer of the teacher network;
a loss function determination module to determine a distillation loss function based on the first set of output characteristics and the second set of output characteristics;
a network training module to iteratively train the student network based on the distillation loss function.
8. A computer arrangement, characterized by comprising a memory in which a computer program is stored and a processor which, when executing the computer program, carries out the steps of the method for knowledge distillation based on multi-layer multi-attention migration according to any one of claims 1 to 6.
9. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a computer program which, when being executed by a processor, carries out the steps of the method for knowledge distillation based on multi-layer multi-attention migration according to any one of claims 1 to 6.
CN202110713825.8A 2021-06-25 2021-06-25 Knowledge distillation method, device and equipment based on multilayer multi-attention migration Pending CN113326941A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110713825.8A CN113326941A (en) 2021-06-25 2021-06-25 Knowledge distillation method, device and equipment based on multilayer multi-attention migration
CN202210535550.8A CN114742223A (en) 2021-06-25 2022-05-17 Vehicle model identification method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110713825.8A CN113326941A (en) 2021-06-25 2021-06-25 Knowledge distillation method, device and equipment based on multilayer multi-attention migration

Publications (1)

Publication Number Publication Date
CN113326941A (en) 2021-08-31

Family

ID=77424815

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202110713825.8A Pending CN113326941A (en) 2021-06-25 2021-06-25 Knowledge distillation method, device and equipment based on multilayer multi-attention migration
CN202210535550.8A Pending CN114742223A (en) 2021-06-25 2022-05-17 Vehicle model identification method and device, computer equipment and storage medium

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN202210535550.8A Pending CN114742223A (en) 2021-06-25 2022-05-17 Vehicle model identification method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (2) CN113326941A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113487614A (en) * 2021-09-08 2021-10-08 四川大学 Training method and device for fetus ultrasonic standard section image recognition network model
CN113963022A (en) * 2021-10-20 2022-01-21 哈尔滨工业大学 Knowledge distillation-based target tracking method of multi-outlet full convolution network
CN114139703A (en) * 2021-11-26 2022-03-04 上海瑾盛通信科技有限公司 Knowledge distillation method and device, storage medium and electronic equipment
CN114298224A (en) * 2021-12-29 2022-04-08 云从科技集团股份有限公司 Image classification method, device and computer readable storage medium
CN114387447A (en) * 2021-11-22 2022-04-22 西安电子科技大学 Neural network compression method based on attention migration of embedded feature similarity
CN114742223A (en) * 2021-06-25 2022-07-12 江苏大学 Vehicle model identification method and device, computer equipment and storage medium
WO2022217853A1 (en) * 2021-04-16 2022-10-20 Huawei Technologies Co., Ltd. Methods, devices and media for improving knowledge distillation using intermediate representations
WO2023097638A1 (en) * 2021-12-03 2023-06-08 宁德时代新能源科技股份有限公司 Rapid anomaly detection method and system based on contrastive representation distillation
CN117116408A (en) * 2023-10-25 2023-11-24 湖南科技大学 Relation extraction method for electronic medical record analysis
CN117253123A (en) * 2023-08-11 2023-12-19 中国矿业大学 Knowledge distillation method based on fusion matching of intermediate layer feature auxiliary modules

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111105008A (en) * 2018-10-29 2020-05-05 富士通株式会社 Model training method, data recognition method and data recognition device
CN111144490B (en) * 2019-12-26 2022-09-06 南京邮电大学 Fine granularity identification method based on alternative knowledge distillation strategy
CN112164054B (en) * 2020-09-30 2024-07-26 交叉信息核心技术研究院(西安)有限公司 Image target detection method and detector based on knowledge distillation and training method thereof
CN112508080B (en) * 2020-12-03 2024-01-12 广州大学 Vehicle model identification method, device, equipment and medium based on experience playback
CN113326941A (en) * 2021-06-25 2021-08-31 江苏大学 Knowledge distillation method, device and equipment based on multilayer multi-attention migration

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022217853A1 (en) * 2021-04-16 2022-10-20 Huawei Technologies Co., Ltd. Methods, devices and media for improving knowledge distillation using intermediate representations
CN114742223A (en) * 2021-06-25 2022-07-12 江苏大学 Vehicle model identification method and device, computer equipment and storage medium
CN113487614A (en) * 2021-09-08 2021-10-08 四川大学 Training method and device for fetus ultrasonic standard section image recognition network model
CN113963022A (en) * 2021-10-20 2022-01-21 哈尔滨工业大学 Knowledge distillation-based target tracking method of multi-outlet full convolution network
CN113963022B (en) * 2021-10-20 2023-08-18 哈尔滨工业大学 Multi-outlet full convolution network target tracking method based on knowledge distillation
CN114387447A (en) * 2021-11-22 2022-04-22 西安电子科技大学 Neural network compression method based on attention migration of embedded feature similarity
CN114139703A (en) * 2021-11-26 2022-03-04 上海瑾盛通信科技有限公司 Knowledge distillation method and device, storage medium and electronic equipment
US12020425B2 (en) 2021-12-03 2024-06-25 Contemporary Amperex Technology Co., Limited Fast anomaly detection method and system based on contrastive representation distillation
WO2023097638A1 (en) * 2021-12-03 2023-06-08 宁德时代新能源科技股份有限公司 Rapid anomaly detection method and system based on contrastive representation distillation
CN114298224A (en) * 2021-12-29 2022-04-08 云从科技集团股份有限公司 Image classification method, device and computer readable storage medium
CN114298224B (en) * 2021-12-29 2024-06-18 云从科技集团股份有限公司 Image classification method, apparatus and computer readable storage medium
CN117253123A (en) * 2023-08-11 2023-12-19 中国矿业大学 Knowledge distillation method based on fusion matching of intermediate layer feature auxiliary modules
CN117253123B (en) * 2023-08-11 2024-05-17 中国矿业大学 Knowledge distillation method based on fusion matching of intermediate layer feature auxiliary modules
CN117116408B (en) * 2023-10-25 2024-01-26 湖南科技大学 Relation extraction method for electronic medical record analysis
CN117116408A (en) * 2023-10-25 2023-11-24 湖南科技大学 Relation extraction method for electronic medical record analysis

Also Published As

Publication number Publication date
CN114742223A (en) 2022-07-12

Similar Documents

Publication Publication Date Title
CN113326941A (en) Knowledge distillation method, device and equipment based on multilayer multi-attention migration
CN112084331B (en) Text processing and model training method and device, computer equipment and storage medium
US20210319232A1 (en) Temporally distributed neural networks for video semantic segmentation
CN111160350B (en) Portrait segmentation method, model training method, device, medium and electronic equipment
CN110782420A (en) Small target feature representation enhancement method based on deep learning
CN114596566B (en) Text recognition method and related device
WO2024041479A1 (en) Data processing method and apparatus
WO2021012493A1 (en) Short video keyword extraction method and apparatus, and storage medium
CN114742224A (en) Pedestrian re-identification method and device, computer equipment and storage medium
CN111104941B (en) Image direction correction method and device and electronic equipment
CN113011320B (en) Video processing method, device, electronic equipment and storage medium
CN116958323A (en) Image generation method, device, electronic equipment, storage medium and program product
CN117033609A (en) Text visual question-answering method, device, computer equipment and storage medium
CN112749556A (en) Multi-language model training method and device, storage medium and electronic equipment
CN113505640A (en) Small-scale pedestrian detection method based on multi-scale feature fusion
CN109657082A (en) Remote sensing images multi-tag search method and system based on full convolutional neural networks
CN117216536A (en) Model training method, device and equipment and storage medium
CN115205546A (en) Model training method and device, electronic equipment and storage medium
CN113343898B (en) Mask shielding face recognition method, device and equipment based on knowledge distillation network
CN118155231A (en) Document identification method, device, equipment, medium and product
CN115374304A (en) Data processing method, electronic device, storage medium, and computer program product
CN114022928B (en) DEEPFAKES video detection method and system based on double streams
CN114332561A (en) Super-resolution model training method, device, equipment and medium
CN113537186A (en) Text image recognition method and device, electronic equipment and storage medium
Jobin et al. Classroom slide narration system

Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20210831)