CN117009830A - Knowledge distillation method and system based on embedded feature regularization - Google Patents

Knowledge distillation method and system based on embedded feature regularization

Info

Publication number
CN117009830A
CN117009830A (application CN202311278779.9A)
Authority
CN
China
Prior art keywords
model
embedded
feature
student
regularization
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311278779.9A
Other languages
Chinese (zh)
Other versions
CN117009830B (en)
Inventor
王玉柱
段曼妮
程乐超
王永恒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab filed Critical Zhejiang Lab
Priority to CN202311278779.9A priority Critical patent/CN117009830B/en
Publication of CN117009830A publication Critical patent/CN117009830A/en
Application granted granted Critical
Publication of CN117009830B publication Critical patent/CN117009830B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 Complex mathematical operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/042 Knowledge-based neural networks; Logical representations of neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Pure & Applied Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Algebra (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Image Analysis (AREA)

Abstract

A knowledge distillation method and system based on embedded feature regularization. The method comprises: collecting annotated image data related to the recognition task, and computing the class-center features of the teacher model over the whole training set; projecting the embedded features of the student model onto the class-center direction, rotating the embedded features of the teacher model to the class-center direction, and constructing a feature regularization loss from the projected student features and the rotated teacher features: increasing the feature norm of the student model and constraining the feature direction of the student model to be consistent with the class-center direction; inserting the feature regularization loss into an existing knowledge distillation framework and training the student model; and deploying the trained student model to a terminal device, which predicts probability vectors for the new data it receives so as to complete the related recognition task. The invention increases the norm of the student features and constrains their direction to be consistent with the class center, so that knowledge distillation performance is improved.

Description

Knowledge distillation method and system based on embedded feature regularization
Technical Field
The invention relates to the field of deep neural network model compression, in particular to a knowledge distillation method and system based on embedded feature regularization.
Background
In recent years, deep neural networks have made rapid progress on a variety of computer vision tasks; however, inference with these networks incurs high computational and parameter-storage costs. As lightweight alternatives, model compression methods such as knowledge distillation, network pruning, and quantization have been identified as effective techniques for improving the parameter efficiency of neural networks, particularly on computing-resource-constrained platforms such as mobile devices. Knowledge distillation refers to training a lightweight model (called the student) under the supervision of a heavy model (called the teacher) without significant loss of performance. Compared with network pruning and quantization, knowledge distillation can realize performance migration between heterogeneous teacher and student architectures, and is therefore more flexible.
Taking the features of the teacher as knowledge, knowledge distillation methods realize knowledge transfer by encouraging the features of the student to be similar to those of the teacher. Existing mainstream knowledge distillation methods can be roughly divided into two categories: probability (logits)-based distillation and feature-based distillation. Probability-based distillation minimizes the KL divergence between the predicted probabilities (the output of the last layer) of the student and the teacher on the input data; the underlying assumption is that if the student can produce a prediction distribution similar to the teacher's, the student's performance (e.g., accuracy) should be close to the teacher's. However, probability-based distillation does not fully exploit the teacher's knowledge, such as the teacher's features at other layers. To exploit these features, feature distillation methods apply the L2 distance to intermediate-layer features, encouraging the student's features to be similar to the teacher's.
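For illustration only, the two mainstream distillation losses described above can be sketched in PyTorch as follows; this is not the patent's reference implementation, and the softening temperature T in the logit-based loss is a common convention assumed here rather than taken from the text.

```python
import torch
import torch.nn.functional as F

def logit_kd_loss(student_logits, teacher_logits, T=4.0):
    """Probability (logits)-based distillation: KL divergence between softened predictions."""
    log_p_s = F.log_softmax(student_logits / T, dim=1)
    p_t = F.softmax(teacher_logits / T, dim=1)
    # batchmean reduction gives the mean per-sample KL divergence
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * (T * T)

def feature_kd_loss(student_feat, teacher_feat):
    """Feature-based distillation: L2 distance on intermediate-layer features."""
    return F.mse_loss(student_feat, teacher_feat)
```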
Although probability-based and feature-based distillation methods have achieved impressive results, forcing the student to produce probabilities or features similar to the teacher's does not directly serve the final task, such as pedestrian and vehicle recognition and detection in traffic scenes, part-defect recognition and localization in industrial manufacturing scenes, or recognition of targets by cruise missiles.
Disclosure of Invention
The invention aims to overcome the defects in the prior art and provides a knowledge distillation method and system based on embedded feature regularization.
In order to achieve the above object, the present invention provides the following technical solutions:
a knowledge distillation method based on embedded feature regularization comprises the following steps:
s1, collecting annotation image data related to an identification task, and dividing the annotation image data into a training set and a verification set;
s2, respectively carrying out data enhancement on the training set and the verification set;
s3, loading a pre-training weight to the teacher model, freezing network parameters of the pre-training weight, and randomly initializing parameters of the chemical model;
s4, traversing the training set, feeding data into a teacher model in batches, extracting embedded features of the teacher model on each input sample, and calculating the average value of each category as a feature center;
s5, inputting the same batch of data to the student model and the teacher model, respectively extracting embedded features, and performing linear dimension transformation on the embedded features of the student model to enable the embedded features to have the same dimension with the embedded features of the teacher model;
s6, extracting a characteristic center of the category to which each input sample belongs; rotating the embedded feature of the teacher model to the same direction as the center of the feature to obtain the rotated teacher feature; projecting the embedded features of the student model along the direction of the feature center to obtain projected student features;
s7, when the norm of the projected student characteristicMinimizing the Euclidean distance between the two when the norm of the teacher characteristic after rotation is smaller than or equal to the norm, and normalizing and restraining the Euclidean distance to be recorded as lossThe method comprises the steps of carrying out a first treatment on the surface of the Otherwise, lose->Cosine similarity between the embedded feature of the maximum chemical model and the feature center;
s8, will loseInserting the knowledge distillation model into the existing knowledge distillation frame, wherein the total loss of training the student model is equal to the cross entropy loss plus the knowledge distillation loss and loss after a pair of super parameters are balanced +.>And (3) summing;
s9, adjusting the pair of super parameters, obtaining a student model with highest accuracy on the verification set, and deploying the student model to the terminal equipment; and the terminal equipment inputs the received new data into the trained model to obtain the predictive probability vector of each category.
Further, the data enhancement of the training set in step S2 includes: scaling the image data while maintaining the aspect ratio, random cropping, random horizontal flipping, mean subtraction, and adding jitter; the data enhancement of the validation set includes: scaling the image data while maintaining the aspect ratio, center cropping, and mean subtraction.
Further, in step S4, the feature center of the kth category is computed as:
$c_k = \frac{1}{|\mathcal{X}_k|} \sum_{x \in \mathcal{X}_k} f^t(x)$ (5)
where $\mathcal{X}_k$ is the set of all training samples belonging to the kth category and $f^t(x)$ is the embedded feature extracted by the teacher model for sample $x$.
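As a minimal sketch of step S4, class centers can be accumulated in one pass over the training set with a frozen teacher; the function and variable names below are illustrative assumptions (the teacher is assumed to return its penultimate-layer embedding, and the loader to yield image/label batches).

```python
import torch

@torch.no_grad()
def compute_class_centers(teacher, loader, num_classes, feat_dim, device="cuda"):
    teacher.eval()
    centers = torch.zeros(num_classes, feat_dim, device=device)
    counts = torch.zeros(num_classes, device=device)
    for images, labels in loader:
        labels = labels.to(device)
        feats = teacher(images.to(device))            # embedded features z^t, shape (B, D)
        centers.index_add_(0, labels, feats)          # sum features per class
        counts.index_add_(0, labels, torch.ones_like(labels, dtype=torch.float))
    return centers / counts.clamp(min=1).unsqueeze(1) # c_k = mean of class-k features
```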
Further, in step S4, the class centers are computed either over the whole training set, or only over the set of samples that the teacher model classifies correctly.
Further, in step S5, when the embedded features of the student model and of the teacher model have the same dimension, step S6 is entered directly; when their dimensions differ, a fully-connected layer with learnable parameters is introduced outside the student model to transform the dimension of the student model's embedded features to that of the teacher model.
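A minimal sketch of this dimension-matching projector follows, assuming the fully-connected layer is followed by batch normalization as in the embodiments; the class name is illustrative, not the patent's reference implementation.

```python
import torch.nn as nn

class EmbeddingProjector(nn.Module):
    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        # identity mapping when the dimensions already match (step S6 is entered directly)
        if student_dim == teacher_dim:
            self.proj = nn.Identity()
        else:
            self.proj = nn.Sequential(nn.Linear(student_dim, teacher_dim),
                                      nn.BatchNorm1d(teacher_dim))

    def forward(self, z_s):
        return self.proj(z_s)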
Further, in the steps S6 and S7, four alternative methods of feature regularization are additionally constructed, and the four alternative methods are divided into two types: norm regularization and direction regularization.
Still further, the norm regularization includes two types: the MSE method and the SFN method; the MSE method measures the feature difference between the embedded features of the student model and of the teacher model using the L2 distance; the SFN method gradually increases the norm of the student model's embedded features by a fixed step during training, i.e., the norm is increased by one step at each training iteration.
Further, the direction regularization includes two types: the cosine method and the InfoNCE method; the cosine method maximizes the cosine similarity between the embedded features of the student model and the feature centers described in step S4; the InfoNCE method maximizes the cosine similarity between the embedded features of the student model and the paired feature centers described in step S4, while minimizing the cosine similarity between the embedded features of the student model and the unpaired feature centers.
Further, in step S8, the specific form of the distillation loss $\mathcal{L}_{kd}$ depends on the selected distillation framework; when a probability-based distillation framework is selected, the distillation loss $\mathcal{L}_{kd}$ takes the form:
$\mathcal{L}_{kd} = \mathrm{KL}\left( p^t \,\|\, p^s \right)$ (10)
where $p^s$ and $p^t$ denote the prediction probabilities of the student model and the teacher model on the input sample, respectively, and KL denotes the KL divergence.
Further, in step S8, when a feature-based distillation framework is selected, the distillation loss $\mathcal{L}_{kd}$ takes the form:
$\mathcal{L}_{kd} = \left\| F^s - F^t \right\|_2$ (11)
where $F^s$ and $F^t$ denote the intermediate-layer features of the student model and the teacher model for the input sample, respectively.
Further, the total loss for training the student model described in step S8 is:
$\mathcal{L}_{total} = \mathcal{L}_{ce} + \alpha \mathcal{L}_{kd} + \beta \mathcal{L}_{nd}$ (4)
where $\mathcal{L}_{ce}$ is the cross-entropy loss, $\mathcal{L}_{kd}$ is the knowledge distillation loss whose specific form depends on the selected distillation framework, and the pair of hyper-parameters $\alpha$ and $\beta$ is used to balance $\mathcal{L}_{kd}$ and $\mathcal{L}_{nd}$.
Further, in step S9, the accuracy of the student model on the validation set is computed as:
$\mathrm{Acc} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left( \arg\max\left( p_i \right) = y_i \right)$, with $p_i = M(x_i)$
where $x_i$ is the ith input sample in the validation set, $\alpha$ and $\beta$ take the values used in the model training of step S8, $M$ denotes the student model, $p$ denotes the prediction probabilities of the student model over the categories of the input sample, $\arg\max(p)$ denotes the index of the maximum prediction probability, $y$ denotes the category label of the input sample, and $N$ denotes the number of samples contained in the validation set.
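The validation accuracy above can be sketched as follows; this is an illustrative helper (names assumed), where the model is taken to return per-class prediction probabilities or logits for a batch.

```python
import torch

@torch.no_grad()
def validate(model, loader, device="cuda"):
    model.eval()
    correct, total = 0, 0
    for images, labels in loader:
        p = model(images.to(device))                  # prediction probabilities p = M(x_i)
        correct += (p.argmax(dim=1) == labels.to(device)).sum().item()
        total += labels.numel()
    return correct / total                            # Acc = (1/N) * sum 1(argmax(p) == y)
```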
Further, in step S4, for the kth category, the feature center is denoted $c_k$; for convenience of description, the subscript k is omitted below;
further, in step S5, the embedded features of the student model and of the teacher model, i.e. the outputs of the penultimate layer, are extracted and denoted $z^s$ and $z^t$; a linear dimension transformation is applied to $z^s$ so that it has the same dimension as $z^t$;
further, in step S6, for each input sample, the class center $c$ of the category to which the sample belongs is taken from step S4; the unit vector of the class center, $\hat{c} = c / \left\| c \right\|_2$, is computed; the embedded feature $z^t$ of the teacher model is rotated to the same direction as the class center $c$ and denoted $\hat{z}^t$, i.e. $\hat{z}^t = \left\| z^t \right\|_2 \, \hat{c}$; the embedded feature $z^s$ of the student model is projected along the class center direction and denoted $\hat{z}^s$, i.e. $\hat{z}^s = \left( z^s \cdot \hat{c} \right) \hat{c}$;
Further, in step S7, when the norm of the projected student feature is smaller than or equal to the norm of the rotated teacher feature, i.e. $\left\| \hat{z}^s \right\|_2 \le \left\| \hat{z}^t \right\|_2$, the distance between $\hat{z}^s$ and $\hat{z}^t$, i.e. $\left\| \hat{z}^s - \hat{z}^t \right\|_2$, is minimized; since the feature norms of different samples differ greatly, a normalization constraint is applied to the minimized distance, giving the ND loss:
$\mathcal{L}_{nd} = \frac{\left\| \hat{z}^s - \hat{z}^t \right\|_2}{\left\| \hat{z}^t \right\|_2} = 1 - \frac{z^s \cdot \hat{c}}{\left\| z^t \right\|_2}$ (1)
As can be seen from equation (1), to reduce the ND loss it is necessary either to increase the feature norm (Norm) of the student or to keep the direction (Direction) of the student feature consistent with the teacher's class center; that is, the ND loss constrains both the norm and the direction of the student feature;
when $\left\| \hat{z}^s \right\|_2 > \left\| \hat{z}^t \right\|_2$, only the direction of the student feature is constrained to be consistent with the teacher's class center direction, namely:
$\mathcal{L}_{nd} = 1 - \cos\left( z^s, c \right) = 1 - \frac{z^s \cdot \hat{c}}{\left\| z^s \right\|_2}$ (2)
Equation (1) and equation (2) have the same form, and the two are fused together:
$\mathcal{L}_{nd} = 1 - \frac{z^s \cdot \hat{c}}{\max\left( \left\| z^s \right\|_2, \left\| z^t \right\|_2 \right)}$ (3)
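Under the fused form reconstructed in equation (3) above, the ND loss can be sketched as follows; this is an illustrative implementation under that assumption, not the patent's reference code, and the argument names are hypothetical.

```python
import torch

def nd_loss(z_s, z_t, centers, labels, eps=1e-8):
    """z_s: (B, D) student embedded features (after dimension matching)
       z_t: (B, D) teacher embedded features
       centers: (K, D) class feature centers from step S4
       labels: (B,) ground-truth class indices"""
    c = centers[labels]                                       # class center of each sample
    c_hat = c / c.norm(dim=1, keepdim=True).clamp(min=eps)    # unit vector of the class center
    proj = (z_s * c_hat).sum(dim=1)                           # z^s . c_hat (projection length)
    denom = torch.maximum(z_s.norm(dim=1), z_t.norm(dim=1)).clamp(min=eps)
    return (1.0 - proj / denom).mean()                        # L_nd = 1 - (z^s.c_hat)/max(||z^s||, ||z^t||)
```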
further, in the steps S6 and S7, four alternative methods of feature regularization are additionally constructed, and the four alternative methods are divided into two types: norm regularization and direction regularization, wherein the norm regularization comprises two types: MSE method and SFN method; the MSE method measures the feature difference between the embedded features of the student model and the embedded features of the teacher model by using the L2 distance, namely
(6)
The SFN method gradually increases the student feature norm:
$\mathcal{L}_{sfn} = \left( \left\| f^s(x; \theta_t) \right\|_2 - \left\| f^s(x; \theta_{t-1}) \right\|_2 - r \right)^2$ (7)
where $x$ is the input data, $f^s$ is the student model, $\theta_t$ and $\theta_{t-1}$ denote the parameters after the current and the previous iteration update respectively, and $r$ is a positive number denoting the step by which the feature norm increases at each iteration;
the direction regularization includes two types: the cosine method and the InfoNCE method; the cosine method maximizes the cosine similarity between the embedded feature of the student model and the feature center described in step S4, i.e.
$\mathcal{L}_{cos} = -\cos\left( z^s, c \right) = -\frac{z^s \cdot \hat{c}}{\left\| z^s \right\|_2}$ (8)
The InfoNCE method maximizes the cosine similarity between the student embedded feature $z^s$ and the unit vector $\hat{c}_{+}$ of the matched class center, while minimizing the cosine similarity between $z^s$ and the unit vectors $\hat{c}_k$ of the unmatched class centers:
$\mathcal{L}_{infonce} = -\log \frac{\exp\left( \cos\left( z^s, \hat{c}_{+} \right) \right)}{\sum_{k} \exp\left( \cos\left( z^s, \hat{c}_k \right) \right)}$ (9)
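The two direction-regularization baselines can be sketched as below; the InfoNCE temperature tau is a common convention assumed here and is not specified in the text, and the helper names are illustrative.

```python
import torch
import torch.nn.functional as F

def cosine_direction_loss(z_s, centers, labels):
    c_hat = F.normalize(centers[labels], dim=1)
    # maximize cos(z^s, c) by minimizing 1 - cosine similarity
    return (1.0 - (F.normalize(z_s, dim=1) * c_hat).sum(dim=1)).mean()

def infonce_direction_loss(z_s, centers, labels, tau=0.1):
    sims = F.normalize(z_s, dim=1) @ F.normalize(centers, dim=1).t()  # (B, K) cosine similarities
    # pull toward the matched class center, push away from unmatched centers
    return F.cross_entropy(sims / tau, labels)
```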
the invention also includes a knowledge distillation system based on embedded feature regularization, comprising:
the data collection module is used for collecting the labeling image data related to the identification task by the server and dividing the labeling image data into a training set and a verification set according to the proportion;
the data preprocessing module is used for respectively preprocessing different data of the training set and the verification set;
the model loading module loads the publicly available pre-training weight to the teacher model and then sets the network parameters of the teacher model into a freezing mode, namely, the parameters are not updated; randomly initializing parameters of a student model;
the class center calculating module is used for calculating the characteristic centers of the teacher model in each class;
the embedded feature extraction module is used for extracting the embedded features of the student and the teacher, $z^s$ and $z^t$, respectively, and applying a linear transformation to $z^s$ so that it has the same dimension as $z^t$;
the embedded feature projection module projects and rotates the embedded features of the student and the teacher along the corresponding class-center direction, obtaining the projected student feature $\hat{z}^s$ and the rotated teacher feature $\hat{z}^t$;
The embedded feature regularization module is used for constraining the embedded features of the students: increasing the characteristic norms, wherein the characteristic directions are consistent with the class centers of teachers;
the student model training module is used for training a student model and selecting and storing the optimal student model weight;
the model deployment module is used for deploying the student models to the terminal equipment, and the terminal equipment inputs the received new data to the trained models to obtain the predictive probability vector so as to complete related tasks.
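As an illustrative sketch of how the modules above cooperate in the student model training module, one training step could look as follows; it assumes (hypothetically) that both models return a (feature, logits) pair and reuses the logit_kd_loss, nd_loss, and EmbeddingProjector helpers sketched earlier, with alpha and beta as the balancing hyper-parameters.

```python
import torch
import torch.nn.functional as F

def train_step(student, teacher, projector, centers, images, labels,
               optimizer, alpha=1.0, beta=1.0):
    with torch.no_grad():
        t_feat, t_logits = teacher(images)            # frozen teacher: embedded features and logits
    s_feat, s_logits = student(images)
    z_s = projector(s_feat)                           # match the teacher feature dimension

    # total loss: cross entropy + alpha * KD loss + beta * ND loss
    loss = (F.cross_entropy(s_logits, labels)
            + alpha * logit_kd_loss(s_logits, t_logits)
            + beta * nd_loss(z_s, t_feat, centers, labels))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```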
The invention also comprises a knowledge distillation device based on embedded feature regularization, which comprises a memory and one or more processors, wherein executable code is stored in the memory, and the one or more processors, when executing the executable code, implement the knowledge distillation method based on embedded feature regularization.
The invention also includes a computer readable storage medium having stored thereon a program which, when executed by a processor, implements a knowledge distillation method based on regularization of embedded features of the invention.
Through intensive research and experimental verification, the invention finds that increasing the norm of the student's embedded features and constraining their direction to be consistent with the teacher's class center, i.e., constraining the student's embedded features by a regularization method, improves the knowledge distillation performance that directly serves the target task and achieves higher accuracy. The embedded features refer to the output features of the penultimate layer of the student network and of the teacher network; they are the model's representation of the input data.
The advantages of the invention lie in three aspects:
(1) the invention provides a new analysis and practice method in the field of knowledge distillation for model compression, namely regularizing student features: 1) increasing the norm of the student features; 2) aligning the feature direction with the teacher's class center;
(2) the invention provides various basic characteristic regularization methods, and experiments prove that the basic regularization methods can improve the knowledge distillation effect;
(3) the invention provides a new loss function (called the ND loss), which regularizes both the feature norm and the feature direction of the student model; the proposed ND loss can be directly inserted into existing knowledge distillation frameworks, and extensive experiments show that the method achieves a better knowledge distillation effect.
The beneficial effects of the invention are as follows:
according to the invention, by increasing the embedded feature norms of the students and restricting the directions of the embedded features of the students to be consistent with the class center of the teacher, namely restricting the embedded features of the students by a regularization method, the improvement of the knowledge distillation performance of directly serving the target task can be realized, and the higher accuracy can be realized. The embedded features refer to output features of the penultimate layers of the student network and the teacher network, and the embedded features are characterization of input data by the model.
Drawings
FIG. 1 is a schematic diagram of probability-based distillation in the knowledge distillation method based on embedded feature regularization of the present invention.
FIG. 2 is a schematic diagram of feature-based distillation in the knowledge distillation method based on embedded feature regularization of the present invention.
Fig. 3 is a schematic projection view of an embedded feature of the present invention.
FIG. 4 is a graph of the loss of the present invention on a pedestrian and vehicle identification dataset.
Fig. 5 is a graph of accuracy rate over a pedestrian vehicle identification dataset of the present invention.
FIG. 6 is a graph of the loss of the present invention over a natural scene dataset.
FIG. 7 is a graph of accuracy rate over a natural scene data set for the present invention.
Fig. 8 is a system configuration diagram of the present invention.
Detailed Description
The present invention will be described in detail below with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the detailed description and specific examples, while indicating and illustrating the invention, are not intended to limit the scope of the invention.
Example 1
Taking pedestrian and vehicle identification of traffic scenes as an example, including target categories of pedestrians, bicycles, electric vehicles, trucks, buses, automobiles and the like, the embodiment provides a pedestrian and vehicle identification method based on knowledge distillation method regularized by embedded features, referring to fig. 1, which comprises the following specific processes:
s1, constructing a data set: constructing labeled image data of traffic scenes containing pedestrians, bicycles, electric vehicles, trucks, buses, automobiles and the like, and dividing it proportionally into a training set and a validation set, where the training set contains 50,000 images and the validation set contains 10,000 images;
s2, data preprocessing: applying different data enhancement to the training set and the validation set respectively; for the training set: scaling the image data while maintaining the aspect ratio, random cropping, random horizontal flipping, mean subtraction, adding jitter, and the like; for the validation set: scaling the image data while maintaining the aspect ratio, center cropping, and mean subtraction;
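A minimal sketch of such data enhancement using torchvision is given below; the concrete sizes, jitter strengths, and normalization statistics are illustrative assumptions, not values from the patent.

```python
from torchvision import transforms

train_tf = transforms.Compose([
    transforms.Resize(256),                           # scale shorter side, keeping aspect ratio
    transforms.RandomCrop(224),                       # random cropping
    transforms.RandomHorizontalFlip(),                # random horizontal flipping
    transforms.ColorJitter(0.4, 0.4, 0.4),            # add jitter
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),  # mean subtraction
])

val_tf = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),                       # center cropping
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
```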
s3, selecting a teacher model and a student model: the teacher model and the student model may share the same architecture or have different architectures; publicly available pre-training weights are loaded into the teacher model and its network parameters are set to a frozen mode, i.e., not updated; the parameters of the student model are randomly initialized; in this embodiment, the invention uses ResNet-50 as the teacher model and MobileNet-v1 as the student model for detailed description, and it should be noted that the invention is also applicable to other model architectures;
s4, calculating the embedded feature center of each category: traversing the whole training set (50,000 annotated images), feeding the data into the teacher model ResNet-50 in batches, and extracting, for each input sample, the embedded feature of the teacher model, i.e., the output of the penultimate layer; after traversing the training set, the mean of each category is computed as its feature center, denoted $c_k$ for the kth category; for convenience of description, the subscript k is omitted below;
s5, extracting embedded features: inputting the same batch of data into the student model MobileNet-v1 and the teacher model ResNet-50, and extracting the embedded features of the student and the teacher, i.e., the outputs of the penultimate layer, denoted $z^s$ and $z^t$ respectively; a linear dimension transformation is applied to $z^s$ so that it has the same dimension as $z^t$;
s6, projecting embedded features: for each input sample, the class center $c$ of the category to which the sample belongs is taken from S4; the unit vector of the class center, $\hat{c} = c / \left\| c \right\|_2$, is computed; the teacher's embedded feature $z^t$ is rotated to the same direction as the class center $c$ and denoted $\hat{z}^t$, i.e. $\hat{z}^t = \left\| z^t \right\|_2 \, \hat{c}$; the student's embedded feature $z^s$ is projected along the class-center direction and denoted $\hat{z}^s$, i.e. $\hat{z}^s = \left( z^s \cdot \hat{c} \right) \hat{c}$; see fig. 3 for the projection and rotation of the student and teacher embedded features;
s7, regularizing embedded features: when $\left\| \hat{z}^s \right\|_2 \le \left\| \hat{z}^t \right\|_2$, the Euclidean distance between $\hat{z}^s$ and $\hat{z}^t$, i.e. $\left\| \hat{z}^s - \hat{z}^t \right\|_2$, is minimized; since the feature norms of different samples differ greatly, a normalization constraint is applied to the Euclidean distance, giving the ND loss:
$\mathcal{L}_{nd} = \frac{\left\| \hat{z}^s - \hat{z}^t \right\|_2}{\left\| \hat{z}^t \right\|_2} = 1 - \frac{z^s \cdot \hat{c}}{\left\| z^t \right\|_2}$ (1)
As can be seen from equation (1), to reduce the ND loss it is necessary either to increase the feature norm (Norm) of the student or to keep the direction (Direction) of the student feature consistent with the teacher's class center; that is, the ND loss constrains both the norm and the direction of the student feature;
when $\left\| \hat{z}^s \right\|_2 > \left\| \hat{z}^t \right\|_2$, only the direction of the student feature is constrained to be consistent with the teacher's class center, namely:
$\mathcal{L}_{nd} = 1 - \cos\left( z^s, c \right) = 1 - \frac{z^s \cdot \hat{c}}{\left\| z^s \right\|_2}$ (2)
Equation (1) and equation (2) have the same form, and the invention fuses the two together:
$\mathcal{L}_{nd} = 1 - \frac{z^s \cdot \hat{c}}{\max\left( \left\| z^s \right\|_2, \left\| z^t \right\|_2 \right)}$ (3)
s8, training the student model: the ND loss is inserted into an existing probability-based knowledge distillation framework, and the total loss for training the student model is:
$\mathcal{L}_{total} = \mathcal{L}_{ce} + \alpha \mathcal{L}_{kd} + \beta \mathcal{L}_{nd}$ (4)
where $\mathcal{L}_{ce}$ is the cross-entropy loss, $\mathcal{L}_{kd}$ is the knowledge distillation loss whose specific form depends on the selected distillation framework, and $\alpha$ and $\beta$ are hyper-parameters used to balance $\mathcal{L}_{kd}$ and $\mathcal{L}_{nd}$;
S9, model deployment: the hyper-parameters $\alpha$ and $\beta$ are adjusted, the student model with the highest accuracy on the validation set is obtained and deployed to the terminal device; the terminal device inputs the received new data into the trained model to obtain the prediction probability vector, thereby completing the recognition of pedestrians and vehicles in the traffic scene.
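For illustration, inference on the terminal device can be sketched as follows; the weight-file path and function names are hypothetical, and the student is assumed to output class logits.

```python
import torch
import torch.nn.functional as F

# student.load_state_dict(torch.load("best_student.pth", map_location="cpu"))  # hypothetical path

def predict(student, image_tensor, device="cpu"):
    student.eval()
    with torch.no_grad():
        logits = student(image_tensor.unsqueeze(0).to(device))
        return F.softmax(logits, dim=1).squeeze(0)    # prediction probability vector per class
```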
In step S4, for the kth category, the feature center is computed as
$c_k = \frac{1}{|\mathcal{X}_k|} \sum_{x \in \mathcal{X}_k} f^t(x)$ (5)
where $\mathcal{X}_k$ is the set of all training samples belonging to the kth category and $f^t(x)$ is the teacher's embedded feature for sample $x$;
in step S5, the dimension of the student's embedded feature $z^s$ may differ from that of the teacher's embedded feature $z^t$; preferably, the invention introduces a fully-connected layer with learnable parameters, followed by batch normalization, to transform the dimension of $z^s$ to be the same as that of $z^t$;
in the steps S6 and S7, the present invention also creates four other feature regularization methods, and the four alternative methods can be further divided into two types: norm regularization and direction regularization;
the alternative norm regularization described above includes two types: the MSE method and the SFN method; the MSE method measures the feature difference between the student and the teacher using the L2 distance:
$\mathcal{L}_{mse} = \left\| z^s - z^t \right\|_2^2$ (6)
the SFN method gradually increases the student feature norm:
$\mathcal{L}_{sfn} = \left( \left\| f^s(x; \theta_t) \right\|_2 - \left\| f^s(x; \theta_{t-1}) \right\|_2 - r \right)^2$ (7)
where $x$ is the input data, $f^s$ is the student model, $\theta_t$ and $\theta_{t-1}$ denote the parameters after the current and the previous iteration update respectively, and $r$ is a positive number denoting the step by which the feature norm increases at each iteration;
the alternative direction regularization described above includes two types: the cosine method and the InfoNCE method; the cosine method maximizes the cosine similarity between the student's embedded feature $z^s$ and the unit vector $\hat{c}$ of the class center:
$\mathcal{L}_{cos} = -\cos\left( z^s, c \right) = -\frac{z^s \cdot \hat{c}}{\left\| z^s \right\|_2}$ (8)
The InfoNCE method maximizes the cosine similarity between the student's embedded feature $z^s$ and the unit vector $\hat{c}_{+}$ of the matched class center, while minimizing the cosine similarity between $z^s$ and the unit vectors $\hat{c}_k$ of the unmatched class centers:
$\mathcal{L}_{infonce} = -\log \frac{\exp\left( \cos\left( z^s, \hat{c}_{+} \right) \right)}{\sum_{k} \exp\left( \cos\left( z^s, \hat{c}_k \right) \right)}$ (9)
in step S8, $\mathcal{L}_{kd}$ is the distillation loss, whose specific form depends on the selected distillation framework; when a probability (logits)-based distillation framework is selected, the distillation loss $\mathcal{L}_{kd}$ takes the form:
$\mathcal{L}_{kd} = \mathrm{KL}\left( p^t \,\|\, p^s \right)$ (10)
where $p^s$ and $p^t$ denote the prediction probabilities of the student model and the teacher model on the input sample, respectively;
when a feature-based distillation framework is selected, the distillation loss $\mathcal{L}_{kd}$ takes the form:
$\mathcal{L}_{kd} = \left\| F^s - F^t \right\|_2$ (11)
where $F^s$ and $F^t$ denote the intermediate-layer features of the student model and the teacher model for the input sample, respectively.
As shown in Table 1, the method of the present invention is compared with prior-art distillation methods. It can be seen that the four basic regularization methods created by the invention improve accuracy by 1.40%, 1.67%, 1.53% and 1.41% respectively, and adding the ND method created by the invention into the original KD framework improves accuracy by 2.45%. The training curves of the invention are shown in FIGS. 4 and 5.
Example 2
Taking outdoor natural scene recognition as an example, including target categories of animals, birds, plants, people and the like, the embodiment provides a scene recognition method based on an embedded feature regularization knowledge distillation method, and referring to fig. 2, the specific process is as follows:
s1, constructing a data set: constructing labeled image data of animals, birds, plants, people and the like, and dividing it proportionally into a training set and a validation set, where the training set contains 1.2 million images and the validation set contains 50,000 images;
s2, data preprocessing: applying different data enhancement to the training set and the validation set respectively; for the training set: scaling the image data while maintaining the aspect ratio, random cropping, random horizontal flipping, mean subtraction, adding jitter, and the like; for the validation set: scaling the image data while maintaining the aspect ratio, center cropping, and mean subtraction;
s3, selecting a teacher model and a student model: the teacher model and the student model may share the same architecture or have different architectures; publicly available pre-training weights are loaded into the teacher model and its network parameters are set to a frozen mode, i.e., not updated; the parameters of the student model are randomly initialized; in this embodiment, the invention uses ResNet-101 as the teacher model and ResNet-18 as the student model for detailed description, and it should be noted that the invention is also applicable to other model architectures;
s4, calculating the embedded feature center of each category: traversing the whole training set (1.2 million annotated images), feeding the data into the teacher model ResNet-101 in batches, and extracting, for each input sample, the embedded feature of the teacher model, i.e., the output of the penultimate layer; after traversing the training set, the mean of each category is computed as its feature center, denoted $c_k$ for the kth category; for convenience of description, the subscript k is omitted below;
s5, extracting embedded features: inputting the same batch of data into the student model ResNet-18 and the teacher model ResNet-101, and extracting the embedded features of the student and the teacher, i.e., the outputs of the penultimate layer, denoted $z^s$ and $z^t$ respectively; a linear dimension transformation is applied to $z^s$ so that it has the same dimension as $z^t$;
s6, projecting embedded features: for each input sample, the class center $c$ of the category to which the sample belongs is taken from S4; the unit vector of the class center, $\hat{c} = c / \left\| c \right\|_2$, is computed; the teacher's embedded feature $z^t$ is rotated to the same direction as the class center $c$ and denoted $\hat{z}^t$, i.e. $\hat{z}^t = \left\| z^t \right\|_2 \, \hat{c}$; the student's embedded feature $z^s$ is projected along the class-center direction and denoted $\hat{z}^s$, i.e. $\hat{z}^s = \left( z^s \cdot \hat{c} \right) \hat{c}$;
S7, regularizing embedded features: when $\left\| \hat{z}^s \right\|_2 \le \left\| \hat{z}^t \right\|_2$, the distance between $\hat{z}^s$ and $\hat{z}^t$, i.e. $\left\| \hat{z}^s - \hat{z}^t \right\|_2$, is minimized; since the feature norms of different samples differ greatly, a normalization constraint is applied, giving the ND loss:
$\mathcal{L}_{nd} = \frac{\left\| \hat{z}^s - \hat{z}^t \right\|_2}{\left\| \hat{z}^t \right\|_2} = 1 - \frac{z^s \cdot \hat{c}}{\left\| z^t \right\|_2}$ (1)
As can be seen from equation (1), to reduce the ND loss it is necessary either to increase the feature norm (Norm) of the student or to keep the direction (Direction) of the student feature consistent with the teacher's class center; that is, the ND loss constrains both the norm and the direction of the student feature;
when $\left\| \hat{z}^s \right\|_2 > \left\| \hat{z}^t \right\|_2$, only the direction of the student feature is constrained to be consistent with the teacher's class center, namely:
$\mathcal{L}_{nd} = 1 - \cos\left( z^s, c \right) = 1 - \frac{z^s \cdot \hat{c}}{\left\| z^s \right\|_2}$ (2)
Equation (1) and equation (2) have the same form, and the invention fuses the two together:
$\mathcal{L}_{nd} = 1 - \frac{z^s \cdot \hat{c}}{\max\left( \left\| z^s \right\|_2, \left\| z^t \right\|_2 \right)}$ (3)
s8, training the student model: the ND loss is inserted into an existing feature-based knowledge distillation framework, and the total loss for training the student model is:
$\mathcal{L}_{total} = \mathcal{L}_{ce} + \alpha \mathcal{L}_{kd} + \beta \mathcal{L}_{nd}$ (4)
where $\mathcal{L}_{ce}$ is the cross-entropy loss, $\mathcal{L}_{kd}$ is the knowledge distillation loss whose specific form depends on the selected distillation framework, and $\alpha$ and $\beta$ are hyper-parameters used to balance $\mathcal{L}_{kd}$ and $\mathcal{L}_{nd}$;
S9, model deployment: the hyper-parameters $\alpha$ and $\beta$ are adjusted, the student model with the highest accuracy on the validation set is obtained and deployed to the terminal device; the terminal device inputs the received new data into the trained model to obtain the prediction probability vector, thereby completing scene recognition.
In step S4, for the kth category, the feature center is computed as
$c_k = \frac{1}{|\mathcal{X}_k|} \sum_{x \in \mathcal{X}_k} f^t(x)$ (5)
where $\mathcal{X}_k$ is the set of all training samples belonging to the kth category and $f^t(x)$ is the teacher's embedded feature for sample $x$;
in step S5, the dimension of the student's embedded feature $z^s$ may differ from that of the teacher's embedded feature $z^t$; preferably, the invention introduces a fully-connected layer with learnable parameters, followed by batch normalization, to transform the dimension of $z^s$ to be the same as that of $z^t$;
in step S8, $\mathcal{L}_{kd}$ is the distillation loss, whose specific form depends on the selected distillation framework; when a feature-based distillation framework is selected,
$\mathcal{L}_{kd} = \left\| F^s - F^t \right\|_2$ (6)
where $F^s$ and $F^t$ denote the intermediate-layer features of the student model and the teacher model on the input samples, respectively;
as shown in Table 2, the process of the present invention is compared to prior art distillation processes. It can be seen that the accuracy rate is improved by 1.45%,0.16% and 0.99% respectively after the ND method is added into the original KD/DKD/ReviewKD framework respectively. The training curves of the present invention are shown in fig. 6 and 7.
Example 3
Referring to fig. 8, the present invention also includes a knowledge distillation system based on embedded feature regularization, comprising:
the data collection module is used for collecting the labeling image data related to the identification task by the server and dividing the labeling image data into a training set and a verification set according to the proportion;
the data preprocessing module is used for respectively preprocessing different data of the training set and the verification set;
the model loading module loads the publicly available pre-training weight to the teacher model and then sets the network parameters of the teacher model into a freezing mode, namely, the parameters are not updated; randomly initializing parameters of a student model;
the class center calculating module is used for calculating the characteristic centers of the teacher model in each class;
the embedded feature extraction module is used for extracting the embedded features of the student and the teacher, $z^s$ and $z^t$, respectively, and applying a linear transformation to $z^s$ so that it has the same dimension as $z^t$;
the embedded feature projection module projects and rotates the embedded features of the student and the teacher along the corresponding class-center direction, obtaining the projected student feature $\hat{z}^s$ and the rotated teacher feature $\hat{z}^t$;
The embedded feature regularization module is used for constraining the embedded features of the students: increasing the characteristic norms, wherein the characteristic directions are consistent with the class centers of teachers;
the student model training module is used for training a student model and selecting and storing the optimal student model weight;
the model deployment module is used for deploying the student models to the terminal equipment, and the terminal equipment inputs the received new data to the trained models to obtain the predictive probability vector so as to complete related tasks.
Example 4
The embodiment of the invention also comprises a knowledge distillation device based on embedded feature regularization, which comprises a memory and one or more processors, wherein executable code is stored in the memory, and the one or more processors, when executing the executable code, implement the knowledge distillation method based on embedded feature regularization.
The embodiment of the present invention also provides a computer-readable storage medium having a program stored thereon, which when executed by a processor, implements a knowledge distillation method based on regularization of embedded features of the above-described embodiments 1 and 2.
The above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced with equivalents; such modifications and substitutions do not depart from the spirit of the technical solutions according to the embodiments of the present invention.

Claims (14)

1. The knowledge distillation method based on embedded feature regularization is characterized by comprising the following steps of:
s1, collecting annotation image data related to an identification task, and dividing the annotation image data into a training set and a verification set;
s2, respectively carrying out data enhancement on the training set and the verification set;
s3, loading pre-training weights into the teacher model and freezing its network parameters, and randomly initializing the parameters of the student model;
s4, traversing the training set, feeding data into a teacher model in batches, extracting embedded features of the teacher model on each input sample, and calculating the average value of each category as a feature center;
s5, inputting the same batch of data to the student model and the teacher model, respectively extracting embedded features, and performing linear dimension transformation on the embedded features of the student model to enable the embedded features to have the same dimension with the embedded features of the teacher model;
s6, extracting a characteristic center of the category to which each input sample belongs; rotating the embedded feature of the teacher model to the same direction as the center of the feature to obtain the rotated teacher feature; projecting the embedded features of the student model along the direction of the feature center to obtain projected student features;
s7, when the norm of the projected student feature is smaller than or equal to the norm of the rotated teacher feature, minimizing the Euclidean distance between the two and applying a normalization constraint to it, recorded as the loss $\mathcal{L}_{nd}$; otherwise, the loss $\mathcal{L}_{nd}$ maximizes the cosine similarity between the embedded feature of the student model and the feature center;
s8, inserting the loss $\mathcal{L}_{nd}$ into an existing knowledge distillation framework, where the total loss for training the student model equals the cross-entropy loss plus the knowledge distillation loss and the loss $\mathcal{L}_{nd}$, the latter two balanced by a pair of hyper-parameters;
s9, adjusting the pair of super parameters, obtaining a student model with highest accuracy on the verification set, and deploying the student model to the terminal equipment; and the terminal equipment inputs the received new data into the trained model to obtain the predictive probability vector of each category.
2. The knowledge distillation method based on embedded feature regularization of claim 1, wherein the data enhancement of the training set in step S2 includes: maintaining the aspect ratio scaling image data, randomly cropping, randomly horizontally flipping, de-averaging, adding dithering; data enhancement of the validation set includes: maintaining aspect ratio scale image data, center cropping, and de-averaging operations.
3. The knowledge distillation method based on embedded feature regularization of claim 1, wherein in step S4, for the kth category, the feature center is computed as:
$c_k = \frac{1}{|\mathcal{X}_k|} \sum_{x \in \mathcal{X}_k} f^t(x)$ (5)
where $\mathcal{X}_k$ is the set of all training samples belonging to the kth category and $f^t(x)$ is the teacher's embedded feature for sample $x$.
4. A knowledge distillation method based on regularization of embedded features according to claim 1, characterized in that in step S4, the calculation of class centers is performed for the whole training set or on the correct sample set using only the teacher model.
5. The knowledge distillation method based on embedded feature regularization as recited in claim 1, wherein in step S5, when the dimensions of the embedded features of the student model and the embedded features of the teacher model are the same, step S6 is directly entered; when the dimension of the embedded feature of the student model is different from that of the teacher model, a fully-connected layer with a parameter capable of being learned is introduced outside the student model and used for transforming the dimension of the embedded feature of the student model to be the same as that of the teacher model.
6. The knowledge distillation method based on embedded feature regularization of claim 1, wherein in steps S6 and S7, four alternative methods of feature regularization are additionally constructed, and the four alternative methods are divided into two types: norm regularization and direction regularization.
7. The knowledge distillation method based on embedded feature regularization of claim 5, wherein said norm regularization comprises two types: MSE method and SFN method; the MSE method measures the characteristic difference between the embedded characteristics of the student model and the embedded characteristics of the teacher model by using the L2 distance; the SFN method is to gradually increase the norm of the embedded features of the student model with a fixed step length in the training process of the student model, namely, one step length is increased for each training step.
8. The knowledge distillation method based on embedded feature regularization of claim 5, wherein: the direction regularization includes two types: a cosine method and an InfoNCE method; the cosine method maximizes the cosine similarity between the embedded features of the student model and the feature centers described in step S4; the InfoNCE method maximizes the cosine similarity between the embedded features of the student model and the paired feature centers, while minimizing the cosine similarity between the embedded features of the student model and the unpaired feature centers.
9. The knowledge distillation method based on embedded feature regularization of claim 1, wherein: in step S8, the specific form of the distillation loss $\mathcal{L}_{kd}$ depends on the selected distillation framework; when a probability-based distillation framework is selected, the distillation loss $\mathcal{L}_{kd}$ takes the form:
$\mathcal{L}_{kd} = \mathrm{KL}\left( p^t \,\|\, p^s \right)$ (10)
where $p^s$ and $p^t$ denote the prediction probabilities of the student model and the teacher model on the input sample, respectively, and KL denotes the KL divergence.
10. The knowledge distillation method based on embedded feature regularization of claim 8, wherein: in step S8, when a feature-based distillation framework is selected, the distillation loss $\mathcal{L}_{kd}$ takes the form:
$\mathcal{L}_{kd} = \left\| F^s - F^t \right\|_2$ (11)
where $F^s$ and $F^t$ denote the intermediate-layer features of the student model and the teacher model for the input sample, respectively.
11. The knowledge distillation method based on embedded feature regularization of claim 1, wherein: the total loss for training the student model described in step S8 is:
$\mathcal{L}_{total} = \mathcal{L}_{ce} + \alpha \mathcal{L}_{kd} + \beta \mathcal{L}_{nd}$ (4)
where $\mathcal{L}_{ce}$ is the cross-entropy loss, $\mathcal{L}_{kd}$ is the knowledge distillation loss whose specific form depends on the selected distillation framework, and the pair of hyper-parameters $\alpha$ and $\beta$ is used to balance $\mathcal{L}_{kd}$ and $\mathcal{L}_{nd}$.
12. A knowledge distillation system based on embedded feature regularization, characterized by: comprising the following steps:
the data collection module is used for collecting the labeling image data related to the identification task by the server and dividing the labeling image data into a training set and a verification set according to the proportion;
the data preprocessing module is used for respectively preprocessing different data of the training set and the verification set;
the model loading module loads the publicly available pre-training weight to the teacher model and then sets the network parameters of the teacher model into a freezing mode, namely, the parameters are not updated; randomly initializing parameters of a student model;
the class center calculating module is used for calculating the characteristic centers of the teacher model in each class;
the embedded feature extraction module is used for respectively extracting the embedded features of the students and the teachers and linearly transforming the embedded features of the students so that the embedded features have the same dimension as the embedded features of the teachers;
the embedded feature projection module is used for respectively projecting and rotating the embedded features of the students and the teachers in the corresponding category center directions;
the embedded feature regularization module is used for constraining the embedded features of the students: increasing the characteristic norms, wherein the characteristic directions are consistent with the class centers of teachers;
the student model training module is used for training a student model and selecting and storing the optimal student model weight;
the model deployment module is used for deploying the student models to the terminal equipment, and the terminal equipment inputs the received new data to the trained model to obtain a predictive probability vector, so as to complete the recognition task of the input data.
13. A knowledge distillation apparatus based on embedded feature regularization, comprising a memory having executable code stored therein and one or more processors, which when executing the executable code, are operable to implement a knowledge distillation method based on embedded feature regularization of any one of claims 1-10.
14. A computer readable storage medium having stored thereon a program which, when executed by a processor, implements a regularization knowledge distillation method based on embedded features of any one of claims 1-10.
CN202311278779.9A 2023-10-07 2023-10-07 Knowledge distillation method and system based on embedded feature regularization Active CN117009830B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311278779.9A CN117009830B (en) 2023-10-07 2023-10-07 Knowledge distillation method and system based on embedded feature regularization

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311278779.9A CN117009830B (en) 2023-10-07 2023-10-07 Knowledge distillation method and system based on embedded feature regularization

Publications (2)

Publication Number Publication Date
CN117009830A true CN117009830A (en) 2023-11-07
CN117009830B CN117009830B (en) 2024-02-13

Family

ID=88576584

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311278779.9A Active CN117009830B (en) 2023-10-07 2023-10-07 Knowledge distillation method and system based on embedded feature regularization

Country Status (1)

Country Link
CN (1) CN117009830B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113591978A (en) * 2021-07-30 2021-11-02 山东大学 Image classification method, device and storage medium based on confidence penalty regularization self-knowledge distillation
CN113591731A (en) * 2021-08-03 2021-11-02 重庆大学 Knowledge distillation-based weak surveillance video time sequence behavior positioning method
WO2021248868A1 (en) * 2020-09-02 2021-12-16 之江实验室 Knowledge distillation-based compression method for pre-trained language model, and platform
WO2022016556A1 (en) * 2020-07-24 2022-01-27 华为技术有限公司 Neural network distillation method and apparatus
CN114049527B (en) * 2022-01-10 2022-06-14 湖南大学 Self-knowledge distillation method and system based on online cooperation and fusion
CN114972904A (en) * 2022-04-18 2022-08-30 北京理工大学 Zero sample knowledge distillation method and system based on triple loss resistance
CN115294407A (en) * 2022-09-30 2022-11-04 山东大学 Model compression method and system based on preview mechanism knowledge distillation
CN116206327A (en) * 2022-11-25 2023-06-02 中科(厦门)数据智能研究院 Image classification method based on online knowledge distillation
CN116205290A (en) * 2023-05-06 2023-06-02 之江实验室 Knowledge distillation method and device based on intermediate feature knowledge fusion

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022016556A1 (en) * 2020-07-24 2022-01-27 华为技术有限公司 Neural network distillation method and apparatus
WO2021248868A1 (en) * 2020-09-02 2021-12-16 之江实验室 Knowledge distillation-based compression method for pre-trained language model, and platform
CN113591978A (en) * 2021-07-30 2021-11-02 山东大学 Image classification method, device and storage medium based on confidence penalty regularization self-knowledge distillation
CN113591731A (en) * 2021-08-03 2021-11-02 重庆大学 Knowledge distillation-based weak surveillance video time sequence behavior positioning method
CN114049527B (en) * 2022-01-10 2022-06-14 湖南大学 Self-knowledge distillation method and system based on online cooperation and fusion
CN114972904A (en) * 2022-04-18 2022-08-30 北京理工大学 Zero sample knowledge distillation method and system based on triple loss resistance
CN115294407A (en) * 2022-09-30 2022-11-04 山东大学 Model compression method and system based on preview mechanism knowledge distillation
CN116206327A (en) * 2022-11-25 2023-06-02 中科(厦门)数据智能研究院 Image classification method based on online knowledge distillation
CN116205290A (en) * 2023-05-06 2023-06-02 之江实验室 Knowledge distillation method and device based on intermediate feature knowledge fusion

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HE K et al.: "Deep Residual Learning for Image Recognition", 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
葛仕明; 赵胜伟; 刘文瑜; 李晨钰: "Face Recognition Based on Deep Feature Distillation" (基于深度特征蒸馏的人脸识别), Journal of Beijing Jiaotong University, no. 06
高璇; 饶鹏; 刘高睿: "Real-time Human Action Recognition Based on Feature Distillation" (基于特征蒸馏的实时人体动作识别), Industrial Control Computer, no. 08

Also Published As

Publication number Publication date
CN117009830B (en) 2024-02-13

Similar Documents

Publication Publication Date Title
CN110516095B (en) Semantic migration-based weak supervision deep hash social image retrieval method and system
CN112232241B (en) Pedestrian re-identification method and device, electronic equipment and readable storage medium
CN115937655B (en) Multi-order feature interaction target detection model, construction method, device and application thereof
CN112541458A (en) Domain-adaptive face recognition method, system and device based on meta-learning
CN113011568B (en) Model training method, data processing method and equipment
EP4105828A1 (en) Model updating method and related device
CN113111814A (en) Regularization constraint-based semi-supervised pedestrian re-identification method and device
Tian et al. Automatic convolutional neural network selection for image classification using genetic algorithms
CN111008224A (en) Time sequence classification and retrieval method based on deep multitask representation learning
CN115063585A (en) Unsupervised semantic segmentation model training method and related device
CN117493674A (en) Label enhancement-based supervision multi-mode hash retrieval method and system
Zheng et al. Adaptive boosting for domain adaptation: Toward robust predictions in scene segmentation
CN113449676A (en) Pedestrian re-identification method based on double-path mutual promotion disentanglement learning
CN116977712B (en) Knowledge distillation-based road scene segmentation method, system, equipment and medium
CN117009830B (en) Knowledge distillation method and system based on embedded feature regularization
Obeso et al. Introduction of explicit visual saliency in training of deep cnns: Application to architectural styles classification
CN116977725A (en) Abnormal behavior identification method and device based on improved convolutional neural network
CN117197451A (en) Remote sensing image semantic segmentation method and device based on domain self-adaption
CN115830401A (en) Small sample image classification method
CN115937524A (en) Similar increment semantic segmentation method based on dynamic knowledge distillation
CN115496948A (en) Network supervision fine-grained image identification method and system based on deep learning
CN117010480A (en) Model training method, device, equipment, storage medium and program product
CN114566184A (en) Audio recognition method and related device
CN114596464A (en) Multi-feature interactive unsupervised target detection method and system, electronic device and readable storage medium
CN114120367A (en) Pedestrian re-identification method and system based on circle loss measurement under meta-learning framework

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant