CN113326941A

CN113326941A - Knowledge distillation method, device and equipment based on multilayer multi-attention migration

Info

Publication number: CN113326941A
Application number: CN202110713825.8A
Authority: CN
Inventors: 苟建平; 孙立媛; 欧卫华; 陈潇君; 夏书银; 柯佳
Original assignee: Jiangsu University
Current assignee: Jiangsu University
Priority date: 2021-06-25
Filing date: 2021-06-25
Publication date: 2021-08-31
Also published as: CN114742223A

Abstract

The invention relates to the technical field of knowledge distillation, and discloses a knowledge distillation method, a knowledge distillation device and knowledge distillation equipment based on multilayer multi-attention migration, wherein the knowledge distillation method comprises the following steps: constructing an untrained student network and a pre-trained teacher network; inputting training data into a student network and a teacher network to obtain a first output characteristic set of each intermediate layer of the student network and a second output characteristic set of each intermediate layer of the teacher network; determining a distillation loss function based on the first set of output characteristics and the second set of output characteristics; and iteratively training the student network based on the distillation loss function. According to the invention, different types of attention knowledge are explored in different intermediate layers of the deep neural network for migration, so that the learning of the student network is effectively guided, and the performance and generalization capability of the student network are improved.

Description

Knowledge distillation method, device and equipment based on multilayer multi-attention migration

Technical Field

The invention relates to the technical field of knowledge distillation, in particular to a knowledge distillation method, a knowledge distillation device and knowledge distillation equipment based on multilayer multi-attention migration.

Background

In recent years, the deep neural network has made a breakthrough in many fields such as computer vision, speech recognition, and natural language processing, and more work is being focused on the performance of the neural network. However, to achieve better performance, large deep neural networks are designed to contain hundreds of layers and millions of parameters. As the number of parameters increases, the complexity of the model also increases, increasing runtime and storage costs, and it is difficult to deploy such a large model onto a mobile device or an embedded device. Therefore, various model compression techniques are proposed to obtain a lightweight model with good performance and easy deployment. Knowledge distillation is widely concerned due to the characteristics of easy realization, simple operation and capability of effectively improving the performance of a small model.

Knowledge distillation refers to extracting knowledge from a large-scale teacher network and transferring the knowledge to a small-scale student network in the training process, so that the performance of the small-scale student network is improved. In this way, a small neural network model with better performance can be obtained, so that the small neural network model can be conveniently deployed to a mobile terminal. Since the middle layer of the neural network contains a large number of parameters and a complex network structure, in recent years, many knowledge distillation schemes of the deep neural network migrate the feature representation of the middle layer as knowledge in the process of training the model. The knowledge in the teacher network output layer is migrated to the student network as if the answer of a certain problem is directly taught to students, and the knowledge in the teacher network intermediate layer is migrated to the student network as if the method for solving a certain problem is taught to students, so that the migration of the knowledge in the teacher network intermediate layer is undoubtedly a more effective way for improving the generalization capability of the student network. Among various knowledge distillation methods based on middle layer feature extraction, attention has been paid to the knowledge based on attention, since the knowledge enables a student network to obtain good performance and has great flexibility with respect to the structures of a teacher network and a student network (i.e., the spatial dimensions and the number of features of the teacher network and the student network do not need to be the same).

Among them, Zagoruyko et al propose an attention migration scheme based on an attention map. Namely, in the training process, the attention map based on activation or the attention map based on gradient in the teacher network is migrated into the student network as knowledge, and the fact that the attention map based on activation is migrated can enable the student network to obtain better performance compared with the attention map based on gradient is verified through experiments.

Dong et al, among other things, have adopted a framework for a plurality of teachers to teach a student, and have defined one of the teacher's networks as an expert network, which migrates itself to the student's network based on activated attention knowledge during the course of training. However, training of multiple teacher networks takes a significant time penalty.

Kundu et al, which combines two model compression modes of knowledge distillation and parameter pruning, migrate attention knowledge based on activation to a student network during training.

Among them, Qu et al propose a hybrid attention knowledge migration method, i.e., in the course of training, the teacher network migrates its own attention knowledge based on activation and attention knowledge based on channels to the student network simultaneously. The authors think that the two kinds of knowledge can complement each other and act simultaneously, so as to better improve the performance of the student network.

Among other things, Chen et al designed a new convolutional neural network, called SCA-CNN, into which both space-based attention and channel-based attention are integrated.

However, in conjunction with the knowledge distillation methods described above, existing attention-based knowledge distillation methods choose to migrate the same attention knowledge at different layers of the network, ignoring the hierarchical structure for deep neural networks representing learning, resulting in limited performance and generalization capability of the deep neural networks.

Disclosure of Invention

Based on the technical problems, in order to further improve the performance and generalization capability of the student network, the invention provides a knowledge distillation method, a knowledge distillation device and knowledge distillation equipment based on multilayer multi-attention migration, which are used for searching different types of attention knowledge in different intermediate layers of a deep neural network for migration, so as to effectively guide the learning of the student network and improve the performance and generalization capability of the student network, and specifically comprise the following technical schemes:

a knowledge distillation method based on multi-layer multi-attention migration, comprising:

constructing an untrained student network and a pre-trained teacher network;

inputting training data into a student network and a teacher network to obtain a first output characteristic set of each intermediate layer of the student network and a second output characteristic set of each intermediate layer of the teacher network;

determining a distillation loss function based on the first set of output characteristics and the second set of output characteristics;

and iteratively training the student network based on the distillation loss function.

A knowledge distillation apparatus based on multi-layer multi-attention migration, comprising:

the model building module is used for building an untrained student network and a pre-trained teacher network;

the output characteristic acquisition module is used for inputting the training data into the student network and the teacher network to acquire a first output characteristic set of each middle layer of the student network and a second output characteristic set of each middle layer of the teacher network;

a loss function determination module to determine a distillation loss function based on the first set of output characteristics and the second set of output characteristics;

and the network training module is used for carrying out iterative training on the student network based on the distillation loss function.

A computer device comprising a memory in which a computer program is stored and a processor which, when executing the computer program, carries out the steps of the above-described knowledge distillation method based on multi-layer multi-attention migration.

A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the above-described multi-layer multi-attention-migration based knowledge distillation method.

Compared with the prior art, the invention has the beneficial effects that:

the invention effectively utilizes a plurality of attention knowledge and carries out knowledge migration under a unified framework. In training a student network, various layers thereof may receive attention knowledge from different layers of the teacher network. In order to transfer rich knowledge existing in a teacher network to a student network, a plurality of teacher networks are generally adopted in the training process, however, it is very difficult to determine what knowledge each teacher is suitable for transferring, and in addition, the training of a large teacher network is expensive in time. Therefore, the invention only uses one teacher network, extracts different attention knowledge from different middle layers of the teacher network in the training process, and then migrates the attention knowledge to the student network, thereby saving time and cost and simultaneously leading the student network to obtain better performance.

Drawings

The present application will be further explained by way of exemplary embodiments, which will be described in detail by way of the accompanying drawings, in which:

FIG. 1 is a schematic flow diagram of a knowledge distillation process based on multi-layer multi-attention migration.

FIG. 2 is a schematic diagram of the basic framework of the distillation method based on the knowledge of multi-layer multi-attention migration.

FIG. 3 is a graphical representation of the performance of a knowledge distillation process based on multi-layer multi-attention migration compared to other knowledge distillation processes.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings of the embodiments of the present disclosure. It is to be understood that the described embodiments are only a few embodiments of the present disclosure, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the described embodiments of the disclosure without any inventive step, are within the scope of protection of the disclosure.

The application aims to provide a knowledge distillation method, a device, equipment and a medium based on multilayer multi-attention migration, wherein the method comprises the following steps: constructing an untrained student network and a pre-trained teacher network; inputting training data into a student network and a teacher network to obtain a first output characteristic set of each intermediate layer of the student network and a second output characteristic set of each intermediate layer of the teacher network; determining a distillation loss function based on the first set of output characteristics and the second set of output characteristics; and iteratively training the student network based on the distillation loss function.

The embodiments of the present application may be applied to various application scenarios including, but not limited to, various scenarios in the field of computer vision applications, such as face recognition, image classification, object detection, semantic segmentation, etc., or to neural network model-based processing systems deployed on edge devices (e.g., mobile phones, wearable devices, computing nodes, etc.), or to application scenarios for speech signal processing, natural language processing, recommendation systems, or to application scenarios requiring compression of neural network models due to limited resources and latency requirements.

For illustrative purposes only, the embodiments of the present application may be applied to an application scenario of object detection at a mobile phone end. The technical problems to be solved by the application scenario are as follows: when a user uses a mobile phone to take a picture, the user needs to automatically capture objects such as human faces and animals so as to help the mobile phone to automatically focus, beautify and the like, so that a convolutional neural network model for object detection, which is small in size and fast in operation, is needed, and better user experience is brought to the user and the quality of mobile phone products is improved.

For illustrative purposes only, the present application embodiments may also be used in application scenarios of autonomous driving scenario segmentation. The technical problems to be solved by the application scenario are as follows: after capturing the road image, the camera of the automatic driving vehicle needs to divide the image, and then separates different objects such as road surface, roadbed, vehicles, pedestrians and the like, so as to keep the vehicle running in a correct area. There is therefore a need for a convolutional neural network model that can provide fast real-time correct interpretation and semantic segmentation of a picture.

For illustrative purposes only, the embodiments of the present application may also be used in application scenarios of portal gate face verification. The technical problems to be solved by the application scenario are as follows: when passengers carry out face authentication on gates at entrances of high-speed rails, airports and the like, a camera can shoot a face image and extract features by using a convolutional neural network, and then similarity calculation is carried out on the face image and the image features of identity documents stored in a system; and if the similarity is high, the verification is successful. Among them, it is most time-consuming to extract features through a convolutional neural network, and thus an efficient convolutional neural network model capable of performing face verification and feature extraction quickly is required.

For illustrative purposes only, the embodiments of the present application may also be used in application scenarios where the translator is simultaneously interpreting vocals. The technical problems to be solved by the application scenario are as follows: in terms of speech recognition and machine translation problems, real-time speech recognition must be achieved and translation must be performed, so an efficient convolutional neural network model is required.

The embodiments of the present application may be modified and improved according to specific application environments, and are not limited herein.

In order to make the technical field of the present application better understand, embodiments of the present application will be described below with reference to the accompanying drawings in the embodiments of the present application.

Referring to fig. 1, in the present embodiment, a knowledge distillation method based on multi-layer multi-attention migration includes:

s101, constructing an untrained student network and a pre-trained teacher network;

s102, inputting training data into a student network and a teacher network to obtain a first output characteristic set of each middle layer of the student network and a second output characteristic set of each middle layer of the teacher network;

wherein the training data is input data used in the training. Preferably, the training data can be preprocessed according to the input formats of the input layers of the teacher network and the student network to obtain the regularized training data;

the specific content of the training data is related to specific application scenarios of the teacher network and the student network, for example: in an application scene of object classification, the training data can be the feature data of a pre-selected sample object; in an application scenario of image classification, the training data may be a sample picture;

among them, generally, the intermediate layer of the neural network is composed of a convolutional layer, BN, Relu, or the like;

among them, generally, a neural network is composed of an intermediate layer, a pooling layer, and a full connection layer.

S103, determining a distillation loss function based on the first output characteristic set and the second output characteristic set;

and S104, iteratively training the student network based on the distillation loss function.

The distillation loss function is used for updating and optimizing parameters of the student network, the values of the distillation loss function are adjusted through the minimum loss function or other modes in each iteration in the student network training process, the parameters of the student network are correspondingly updated, the student network is subjected to iterative training for multiple times, so that the parameter values of the student network tend to be fitted step by step, and the training process is a process of supervising learning.

In the embodiment, various attention knowledge is effectively utilized, and the knowledge migration is carried out under a unified framework. In training a student network, various layers thereof may receive attention knowledge from different layers of the teacher network. In order to transfer rich knowledge existing in a teacher network to a student network, a plurality of teacher networks are generally adopted in the training process, however, it is very difficult to determine what knowledge each teacher is suitable for transferring, and in addition, the training of a large teacher network is expensive in time. Therefore, the invention only uses one teacher network, extracts different attention knowledge from different middle layers of the teacher network in the training process, and then migrates the attention knowledge to the student network, thereby saving time and cost and simultaneously leading the student network to obtain better performance.

Referring to fig. 2, in some embodiments, the intermediate layers of the student network and the teacher network each include a lower intermediate layer, an intermediate layer, and a higher intermediate layer, and determining the distillation loss function based on the first set of output characteristics and the second set of output characteristics includes:

determining a first loss function based on the output features of the first output feature set and the second output feature set, which are located in the lower middle layer;

the middle layer of the lower layer is a set of middle layers used for extracting edge features of a student network and a teacher network;

specifically, the edge feature refers to features that all samples in general training data have, such as a vertical edge feature, a horizontal edge feature, a color, a position, and other local features;

determining a second loss function based on the output features of the middle and upper middle layers in the first and second output feature sets;

the middle layer is a set of middle layers used for extracting local features by a student network and a teacher network;

specifically, the local features are to some extent recombined to make the local features closer to the attributes of the samples in the training data, which results in that the intermediate-level features of the samples in different training data may be different but may also be the same. For example, the local features of a wire mesh picture obtained through the neural network are definitely different from the local features of a cat picture obtained through the neural network, but the local features of a dog picture obtained through the neural network and the local features of a cat picture obtained through the neural network may be the same.

A third loss function is determined based on the output features of the first and second sets of output features that are located in the higher-level middle layer.

The high-level middle layer is a set of middle layers used for extracting global features from a student network and a teacher network;

specifically, the global feature is a recombination of the local features. Just because of the global feature, it can highly generalize the attributes of the samples in the training data, resulting in that the global feature tends to be different from sample to sample.

And weighting and summing the first loss function, the second loss function and the third loss function to obtain a distillation loss function.

In this embodiment, first, the structure of the known deep neural network is very complex, and it is composed of different middle layers stacked, and the information of interest of different middle layers is different, for example, the lower middle layer focuses on the edge information and the position information of the sample, the middle layer focuses on the local information, and the upper middle layer focuses more on the whole sample. Therefore, different attention knowledge applicable to the lower middle layer, the middle layer, and the upper middle layer of the neural network needs to be studied for different feature representations. In the knowledge distillation scheme based on multi-layer attention migration of the present application, three different attentional knowledge are considered and from this three different loss functions are obtained, namely position-based attentional knowledge, activation-based attentional knowledge and channel-based attentional knowledge.

Specifically, the determining a first loss function based on the output features of the first output feature set and the second output feature set, which are located in the lower-level middle layer, includes:

obtaining the output characteristics of the first output characteristic set positioned in the middle layer of the lower layer

And output features of the second set of output features that are located in lower intermediate layers

Wherein, the output characteristic A belongs to R^C×H×WThe method comprises the following steps of (1) forming a characteristic plane with C H multiplied by W space dimensions;

will be provided with

Extracting features along two dimensions of W and H

And

the method specifically comprises the following steps:

therein, sigma_wSum Σ_HRepresents addition along the H and W spatial dimensions;

will be provided with

And

connected by 1 × 1 convolution transfer functions F, specifically:

where [, ] denotes the join operation along the spatial dimension, θ (-) denotes a non-linear activation function;

obtaining a first loss function based on the attention knowledge of the position, specifically:

wherein L is_PKTRepresenting a first loss function, and I represents a lower intermediate layer of the student network and the teacher network.

From the first loss function, the deep neural network is considered to detect edge features (such as features of object edges and positions) in the lower-level middle layer, so that the attention knowledge based on the positions is migrated in the lower-level middle layer of the neural network in the training process.

Specifically, the determining the second loss function based on the output features of the middle layer intermediate layer and the high layer intermediate layer in the first output feature set and the second output feature set includes:

obtaining the output characteristics of the first output characteristic set positioned in the middle layer

And the output features of the second output feature set located in the middle layer

calculating an attention mapping chart of a student network and a teacher network, specifically:

therein, sigma_CRepresenting addition along the channel dimension, vec (·) represents vectorization;

acquiring a second loss function based on the activated attention knowledge, specifically:

wherein L is_AKTAnd J represents the middle and upper middle layers of the student and teacher networks.

According to the second loss function, the middle layer and the middle layer are combined with the edge characteristics step by step to form a more complete representation, so that the student network simulates the characteristics of the deep layer of the teacher network by adopting the widely used attention-based knowledge to pay attention to the area concerned by the teacher network;

specifically, the determining the third loss function based on the output features of the first output feature set and the second output feature set, which are located in the higher-layer middle layer, includes:

obtaining the output characteristics of the first output characteristic set positioned in the middle layer of the high layer

And the output features of the second output feature set located in the higher-level middle layer

The method is characterized in that the output characteristics of the student network and the teacher network are subjected to global average pooling, and specifically comprises the following steps:

wherein G represents a global average pooling layer;

acquiring a third loss function based on the attention knowledge of the channel, specifically:

wherein L is_CKTRepresenting a third loss function, and K represents a higher-level middle layer of the student network and the teacher network.

As can be seen from the third loss function, the attention-based knowledge of activation is obtained by adding feature maps output by the network middle layer along the channel dimension, and therefore the knowledge of the channel dimension is missing in the migration process.

Specifically, in the first loss function L_PKTA second loss function L_AKTAnd a third loss function L_CKTFor the number of the middle-layer middle layer, the middle-layer middle layer and the high-layer middle layer, when the number of the middle layers of the teacher network and the student network is different, the same number of corresponding middle layers can be selected for establishing the loss function, and therefore the number of the middle layers of the teacher network and the number of the middle layers of the student network do not need to be kept the same.

Finally, the first loss function L is determined_PKTA second loss function L_AKTAnd a third loss function L_CKTAnd obtaining a distillation loss function by weighted summation, wherein the distillation loss function comprises the following specific steps:

L_total＝αL_PKT+βL_AKT+γL_CKT

wherein L is_totalRepresents a distillation loss function, and α, β, and γ represent weight coefficients.

The multi-layer attention migration based knowledge distillation method of the present invention will be further described with reference to experimental data as follows:

referring to fig. 3, a CIFAR-100 dataset is used as training data and is input into the knowledge distillation method based on multilayer multi-attention migration and other knowledge distillation schemes of the present application, and a comparison diagram of the accuracy of the knowledge distillation method based on multilayer multi-attention migration and other knowledge distillation methods in fig. 3 is finally obtained, and it can be seen from fig. 3 that the performance of the neural network is obviously improved compared with other methods after the training by the method provided by the present invention.

In particular, HMAT in fig. 3 represents the knowledge distillation method based on multiple knowledge shifts of the present application;

specifically, the knowledge distillation method based on multiple knowledge migration in fig. 3 is the top, which shows that the accuracy of the knowledge distillation method is the highest compared with other knowledge distillation schemes;

specifically, other methods of knowledge distillation in FIG. 3 include H-AT, CCKD, VID, KDAFM, AT and KD.

In some embodiments, the present application further discloses a knowledge distillation apparatus based on multi-layer multi-attention migration, comprising:

In some embodiments, the present application further discloses a computer device, which is characterized by comprising a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to realize the steps of the knowledge distillation method based on multi-layer multi-attention migration.

The computer device may be a desktop computer, a notebook, a palm computer, a cloud server, or other computing devices. The computer equipment can carry out man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch panel or voice control equipment and the like.

The memory includes at least one type of readable storage medium including a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or D interface display memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the storage may be an internal storage unit of the computer device, such as a hard disk or a memory of the computer device. In other embodiments, the memory may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), or the like provided on the computer device. Of course, the memory may also include both internal and external storage devices of the computer device. In this embodiment, the memory is used for storing an operating system and various application software installed in the computer device, such as program codes of a knowledge distillation method based on multi-layer multi-attention migration. In addition, the memory may also be used to temporarily store various types of data that have been output or are to be output.

The processor may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor is typically used to control the overall operation of the computer device. In this embodiment, the processor is configured to execute the program code stored in the memory or process data, for example, execute the program code of the knowledge distillation method based on multi-layer multi-attention migration.

In some embodiments, the present application further discloses a computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and when executed by a processor, the computer program implements the steps of the knowledge distillation method based on multi-layer multi-attention migration.

Wherein the computer readable storage medium stores an interface display program executable by at least one processor to cause the at least one processor to perform the steps of the program code of the knowledge distillation method based on multi-layer multi-attention migration as described above.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present application.

The above is an embodiment of the present invention. The embodiments and specific parameters in the embodiments are only used for clearly illustrating the verification process of the invention and are not used for limiting the patent protection scope of the invention, which is defined by the claims, and all the equivalent structural changes made by using the contents of the description and the drawings of the present invention should be included in the protection scope of the present invention.

Claims

1. Knowledge distillation method based on multilayer multi-attention migration is characterized by comprising the following steps:

constructing an untrained student network and a pre-trained teacher network;

inputting training data into the student network and the teacher network to obtain a first output feature set of each middle layer of the student network and a second output feature set of each middle layer of the teacher network;

iteratively training the student network based on the distillation loss function.

2. The multi-tier, multi-attention-migration based knowledge distillation method of claim 1, wherein the intermediate tiers of the student network and the teacher network each include a lower intermediate tier, an intermediate middle tier, and an upper intermediate tier, and wherein determining a distillation loss function based on the first set of output features and the second set of output features comprises:

determining a first loss function based on output features of the first and second sets of output features that are located in the lower intermediate layer;

determining a second loss function based on output features of the first and second sets of output features that are located in the middle and upper middle layers;

determining a third loss function based on the output features of the first and second sets of output features that are located at the higher-level middle layer;

3. The knowledge distillation method based on multi-layer multi-attention migration according to claim 2, characterized in that:

the low-level middle layer is a set of middle layers used for extracting edge features by the student network and the teacher network;

the middle layer is a set of middle layers used for extracting local features by the student network and the teacher network;

the high-level middle layer is a set of middle layers used for extracting global features by the student network and the teacher network.

4. The method of knowledge distillation based on multi-layer multi-attention migration according to claim 2, wherein determining a first loss function based on the output features of the first and second sets of output features that are located in the lower middle layer comprises:

will be provided with

Extracting features along two dimensions of W and H

And

the method specifically comprises the following steps:

will be provided with

And

connected by 1 × 1 convolution transfer functions F, specifically:

5. The method of knowledge distillation based on multi-layer multi-attention migration of claim 2, wherein determining a second loss function based on the output characteristics of the middle and upper intermediate layers in the first and second sets of output characteristics comprises:

And output features of the second set of output features that are located in a middle tier of the second set of output features

calculating an attention mapping chart of the student network and the teacher network, specifically:

6. The multi-layer multi-attention-migration based knowledge distillation method of claim 2, wherein determining a third loss function based on the output features of the first and second sets of output features at the upper-layer middle layer comprises:

And the output features of the second output feature set positioned in the higher-level middle layer

And performing global average pooling on the output characteristics of the student network and the teacher network, specifically:

wherein G represents a global average pooling layer;

7. A knowledge distillation apparatus based on multi-layer multi-attention migration, comprising:

the output characteristic acquisition module is used for inputting training data into the student network and the teacher network to acquire a first output characteristic set of each middle layer of the student network and a second output characteristic set of each middle layer of the teacher network;

a network training module to iteratively train the student network based on the distillation loss function.

8. A computer arrangement, characterized by comprising a memory in which a computer program is stored and a processor which, when executing the computer program, carries out the steps of the method for knowledge distillation based on multi-layer multi-attention migration according to any one of claims 1 to 6.

9. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a computer program which, when being executed by a processor, carries out the steps of the method for knowledge distillation based on multi-layer multi-attention migration according to any one of claims 1 to 6.