CN114139703A - Knowledge distillation method and device, storage medium and electronic equipment - Google Patents

Knowledge distillation method and device, storage medium and electronic equipment

Info

Publication number
CN114139703A
Authority
CN
China
Prior art keywords
network
loss
activation value
training data
value distribution
Prior art date
Legal status
Pending
Application number
CN202111424814.4A
Other languages
Chinese (zh)
Inventor
刘飒
冯天鹏
郭彦东
Current Assignee
Shanghai Jinsheng Communication Technology Co ltd
Original Assignee
Shanghai Jinsheng Communication Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Jinsheng Communication Technology Co ltd
Priority to CN202111424814.4A
Publication of CN114139703A
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Abstract

The disclosure relates to the technical field of model compression, and in particular to a knowledge distillation method and apparatus, a computer-readable storage medium, and an electronic device. The method comprises the following steps: constructing an untrained student network and a pre-trained teacher network; inputting training data into the student network to obtain a first output feature of the training data, and obtaining a first activation value distribution of each channel of the first output feature using a normalized exponential function with a preset temperature coefficient; inputting the training data into the teacher network to obtain a second output feature of the training data, and obtaining a second activation value distribution of each channel of the second output feature using the normalized exponential function; determining a target distillation loss from the first activation value distribution and the second activation value distribution; determining a regression loss of the student network according to the training data; and updating the network parameters of the student network according to the regression loss and the distillation loss. The technical solution of the embodiments of the disclosure improves the precision of knowledge distillation, so that the resulting student network is more accurate.

Description

Knowledge distillation method and device, storage medium and electronic equipment
Technical Field
The disclosure relates to the technical field of model compression, and in particular relates to a knowledge distillation method and device, a computer-readable storage medium and electronic equipment.
Background
In recent years, with the rapid development of deep learning, algorithms for tasks such as image classification, segmentation, and object detection have advanced quickly. Deep learning has been successfully applied to a variety of pattern analysis problems, particularly in the field of computer vision.
When the actual deployment of deep learning models is considered, knowledge distillation is an effective model compression technique that has been widely studied for image classification tasks. However, existing knowledge distillation methods for target detection tasks have poor precision, so the resulting student network has low accuracy, which in turn degrades the processing precision when the student network is applied to image processing.
Disclosure of Invention
The present disclosure aims to provide a knowledge distillation method, a knowledge distillation apparatus, a computer-readable medium, and an electronic device, so as to improve the precision of knowledge distillation at least to some extent, thereby making the obtained student network more accurate.
According to a first aspect of the present disclosure, there is provided a method of knowledge distillation comprising: constructing an untrained student network and a pre-trained teacher network; inputting training data into the student network to obtain a first output characteristic of the training data, and acquiring a first activation value distribution of each channel of the first output characteristic by adopting a normalized exponential function of a preset temperature coefficient; inputting training data into the teacher network to obtain a second output characteristic of the training data, and obtaining a second activation value distribution of each channel of the second output characteristic by adopting the normalized exponential function; determining a target distillation loss from the first and second activation value distributions; determining a regression loss of the student network according to the training data; updating network parameters of the student network according to the regression loss and the distillation loss.
According to a second aspect of the present disclosure, there is provided a knowledge distillation apparatus comprising: the network construction module is used for constructing an untrained student network and a pre-trained teacher network; the first acquisition module is used for inputting training data into the student network to obtain a first output characteristic of the training data and acquiring a first activation value distribution of each channel of the first output characteristic by adopting a normalized exponential function of a preset temperature coefficient; the second acquisition module is used for inputting training data into the teacher network to obtain second output characteristics of the training data and acquiring second activation value distribution of each channel of the second output characteristics by adopting the normalized exponential function; a first determination module for determining a target distillation loss from the first and second activation value distributions; the second determining module is used for determining the regression loss of the student network according to the training data; and the parameter updating module is used for updating the network parameters of the student network according to the regression loss and the distillation loss.
According to a third aspect of the present disclosure, a computer-readable medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, is adapted to carry out the above-mentioned method.
According to a fourth aspect of the present disclosure, there is provided an electronic apparatus, comprising: one or more processors; and memory storing one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the above-described method.
The knowledge distillation method provided by an embodiment of the disclosure includes: constructing an untrained student network and a pre-trained teacher network; inputting training data into the student network to obtain a first output feature of the training data, and obtaining a first activation value distribution of each channel of the first output feature using a normalized exponential function with a preset temperature coefficient; inputting the training data into the teacher network to obtain a second output feature of the training data, and obtaining a second activation value distribution of each channel of the second output feature using the normalized exponential function; determining a target distillation loss from the first activation value distribution and the second activation value distribution; determining a regression loss of the student network according to the training data; and updating the network parameters of the student network according to the regression loss and the distillation loss. Compared with the prior art, distillation is performed channel by channel. Because each channel corresponds to one category of activation response, distilling by channel enables the student network to focus on the salient activation values related to semantic parts rather than on background regions or noise, which improves the precision of knowledge distillation. Further, the activation value distribution of the output features is obtained with a normalized exponential function having a preset temperature coefficient, so the range of attention during knowledge distillation can be controlled through the value of the preset temperature. At the same time, calculating the distillation loss from activation value distributions eliminates the influence of the activation-value magnitude differences faced by previous spatial knowledge distillation, improving the detection performance of the student network while preserving the speed of the target detection model, and thus improving the precision of image processing when the student network is applied to image processing.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty. In the drawings:
FIG. 1 is a schematic diagram showing spatial knowledge distillation in the related art;
FIG. 2 illustrates a schematic diagram of an exemplary system architecture to which embodiments of the present disclosure may be applied;
FIG. 3 shows a schematic diagram of an electronic device to which embodiments of the present disclosure may be applied;
FIG. 4 schematically illustrates a flow diagram of a knowledge distillation method in an exemplary embodiment of the disclosure;
FIG. 5 shows a schematic diagram of channel knowledge distillation in an exemplary embodiment of the present disclosure;
FIG. 6 schematically illustrates a flow chart for determining distillation loss based on a first distribution of activation values and a second distribution of activation values in an exemplary embodiment of the present disclosure;
FIG. 7 schematically illustrates a data flow diagram of a knowledge distillation method in an exemplary embodiment of the disclosure;
FIG. 8 schematically illustrates the composition of a knowledge distillation apparatus in an exemplary embodiment of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
As one of the basic tasks in the field of computer vision, target detection can be divided into target classification and target localization. The target classification task is responsible for judging whether objects of the categories of interest appear in the input image and outputting a series of scored labels indicating the likelihood that such objects appear. The target localization task is responsible for determining the location and extent of objects of the categories of interest in the input image. However, as performance increases, the size and computational cost of the models also increase, making target detection models difficult to deploy on the device side and limiting their application in scenarios such as autonomous driving and mobile devices.
Considering the actual deployment of deep learning models, knowledge distillation is an effective model compression technique that has been widely studied for image classification tasks. For feature distillation in knowledge distillation methods for target detection tasks, spatial knowledge distillation is basically adopted, including point-by-point spatial knowledge distillation and pairwise spatial knowledge distillation, as shown in FIG. 1. In point-by-point spatial knowledge distillation, the feature vectors of the teacher network and the student network are aligned point by point along the spatial dimension. In pairwise spatial knowledge distillation, the similarity between pixel points on the feature maps of the teacher network and the student network is calculated separately along the spatial dimension to obtain the spatial structure relationship information of each network, and the obtained spatial structure relationship information is then aligned so that the student network can capture the spatial structure information of the teacher network.
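For concreteness, the two related-art alignment schemes described above can be sketched as follows. This is an illustrative PyTorch sketch rather than code from the patent; the tensor layout, the use of mean-squared error, and the cosine-similarity construction of the spatial-structure matrices are assumptions.

```python
import torch
import torch.nn.functional as F


def pointwise_spatial_distillation(student_feat: torch.Tensor,
                                   teacher_feat: torch.Tensor) -> torch.Tensor:
    # Align the two feature maps point by point over the spatial dimensions.
    # student_feat, teacher_feat: [N, C, H, W] with matching shapes.
    return F.mse_loss(student_feat, teacher_feat)


def pairwise_spatial_distillation(student_feat: torch.Tensor,
                                  teacher_feat: torch.Tensor) -> torch.Tensor:
    # Compute pixel-to-pixel similarity matrices along the spatial dimension
    # for each network, then align the two spatial-structure matrices.
    s = F.normalize(student_feat.flatten(2), dim=1)   # [N, C, H*W]
    t = F.normalize(teacher_feat.flatten(2), dim=1)
    sim_s = torch.bmm(s.transpose(1, 2), s)           # [N, H*W, H*W]
    sim_t = torch.bmm(t.transpose(1, 2), t)
    return F.mse_loss(sim_s, sim_t)
```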
However, in the spatial knowledge distillation method for the target detection task, the output characteristics of the teacher network may include redundant information and noise, which results in poor precision of knowledge distillation and low accuracy of the generated student network.
In view of one or more of the above problems, exemplary embodiments of the present disclosure first provide a knowledge distillation method. The system architecture and application scenario of the operating environment of the exemplary embodiment are described below with reference to FIG. 2.
Fig. 2 shows a schematic diagram of a system architecture, which system architecture 200 may include a terminal 210 and a server 220. The terminal 210 may be a terminal device such as a smart phone, a tablet computer, a desktop computer, or a notebook computer, and the server 220 generally refers to a background system providing services related to knowledge distillation in the exemplary embodiment, and may be a server or a cluster formed by multiple servers. The terminal 210 and the server 220 may form a connection through a wired or wireless communication link for data interaction.
In one embodiment, the above-described knowledge distillation method may be performed by the server 220. For example, the user uses the terminal 210 to obtain the required training data, the terminal 210 uploads the training data to the server 220, and the server 220 performs knowledge distillation to obtain a student network with updated parameters and returns the student network to the terminal 210.
As can be seen from the above, the main body of execution of the knowledge distillation method in the present exemplary embodiment may be the terminal 210 or the server 220, which is not limited by the present disclosure.
Exemplary embodiments of the present disclosure also provide an electronic device for performing the above knowledge distillation method, which may be the above terminal 210 or the server 220. In general, the electronic device may include a processor and a memory for storing executable instructions of the processor, the processor configured to perform the above-described knowledge distillation method via execution of the executable instructions.
The structure of the electronic device will be exemplarily described below by taking the mobile terminal 300 in fig. 3 as an example. It will be appreciated by those skilled in the art that the configuration of figure 3 can also be applied to fixed type devices, in addition to components specifically intended for mobile purposes.
As shown in fig. 3, the mobile terminal 300 may specifically include: processor 301, memory 302, bus 303, mobile communication module 304, antenna 1, wireless communication module 305, antenna 2, display 306, camera module 307, audio module 308, power module 309, and sensor module 310.
Processor 301 may include one or more processing units, such as: the Processor 301 may include an AP (Application Processor), a modem Processor, a GPU (Graphics Processing Unit), an ISP (Image Signal Processor), a controller, an encoder, a decoder, a DSP (Digital Signal Processor), a baseband Processor, and/or an NPU (Neural-Network Processing Unit), etc. The knowledge distillation method in the present exemplary embodiment may be performed by an AP, a GPU, or a DSP, and when the method involves neural network related processing, may be performed by an NPU.
An encoder may encode (i.e., compress) an image or video; for example, a target image may be encoded into a particular format to reduce the data size for storage or transmission. A decoder may decode (i.e., decompress) the encoded data of an image or video to restore the image or video data; for example, the encoded data of the target image may be read and decoded by the decoder to restore the data of the target image, so that knowledge-distillation-related processing can be performed on the data. The mobile terminal 300 may support one or more encoders and decoders. In this way, the mobile terminal 300 may process images or video in a variety of encoding formats, such as image formats like JPEG (Joint Photographic Experts Group), PNG (Portable Network Graphics), and BMP (Bitmap), and video formats like MPEG-1 (Moving Picture Experts Group), MPEG-2, H.263, H.264, and HEVC (High Efficiency Video Coding).
The processor 301 may be connected to the memory 302 or other components via the bus 303.
The memory 302 may be used to store computer-executable program code, which includes instructions. The processor 301 executes various functional applications of the mobile terminal 300 and data processing by executing instructions stored in the memory 302. The memory 302 may also store application data, such as files storing images, videos, and the like.
The communication function of the mobile terminal 300 may be implemented by the mobile communication module 304, the antenna 1, the wireless communication module 305, the antenna 2, a modem processor, a baseband processor, and the like. The antennas 1 and 2 are used for transmitting and receiving electromagnetic wave signals. The mobile communication module 304 may provide mobile communication solutions such as 2G, 3G, 4G, and 5G applied to the mobile terminal 300. The wireless communication module 305 may provide wireless communication solutions such as wireless LAN, Bluetooth, and near field communication applied to the mobile terminal 300.
The sensor module 310 may include a depth sensor 3101, a pressure sensor 3102, a gyro sensor 3103, a barometric pressure sensor 3104, etc. to implement a corresponding inductive detection function.
In view of one or more of the above problems, the present disclosure also provides a knowledge distillation method that can be used in a variety of application scenarios, including but not limited to: computer vision applications such as face recognition, image classification, object detection, and semantic segmentation; processing systems based on neural network models deployed on edge devices (e.g., mobile phones, wearable devices, and computing nodes); application scenarios such as speech signal processing, natural language processing, and recommendation systems; and any scenario in which a neural network model must be compressed due to limited resources and latency requirements. FIG. 4 shows an exemplary flow of the knowledge distillation method, which may specifically include the following steps:
step S410, an untrained student network and a pre-trained teacher network are constructed;
step S420, inputting training data into the student network to obtain a first output characteristic of the training data, and obtaining a first activation value distribution of each channel of the first output characteristic by adopting a normalized exponential function of a preset temperature coefficient;
step S430, inputting training data into the teacher network to obtain a second output characteristic of the training data, and obtaining a second activation value distribution of each channel of the second output characteristic by adopting the normalized exponential function;
step S440, determining a target distillation loss according to the first activation value distribution and the second activation value distribution;
step S450, determining the regression loss of the student network according to the training data;
and step S460, updating network parameters of the student network according to the regression loss and the distillation loss.
In an exemplary embodiment, on the one hand, distillation is performed channel by channel. Since each channel corresponds to one category of activation response, distilling by channel enables the student network to focus on the salient activation values related to semantic parts rather than on background regions or noise, thereby improving the precision of knowledge distillation. On the other hand, the activation value distribution of the output features is obtained with a normalized exponential function having a preset temperature coefficient, so the range of attention during knowledge distillation can be determined through the value of the preset temperature. At the same time, calculating the distillation loss from activation value distributions eliminates the influence of the activation-value magnitude differences faced by previous spatial knowledge distillation, improving the detection performance of the student network while preserving the speed of the target detection model.
The above steps will be described in detail below.
In step S410, an untrained student network and a pre-trained teacher network are constructed.
In an exemplary embodiment of the present disclosure, an image processing task is used as an example below. Training data may first be acquired, which may typically be a collection of images labeled with the classification types of the objects appearing in them. When the training data set is prepared, ground-truth annotation can be performed on the original images by manual annotation or machine-assisted annotation to obtain real label data. For example, after an original image is acquired, the classification types of the objects appearing in it (e.g., whether an object is a person, a car, a tree, etc.) can be labeled using image annotation software, thereby obtaining a plurality of training data. When the training data are feature-encoded, a method such as one-hot encoding may be used; the present application does not limit the specific encoding method.
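As a small illustration of the label encoding mentioned above, a one-hot encoding of classification labels might look like the following sketch (the class indices and tensor names are hypothetical, chosen only for illustration):

```python
import torch
import torch.nn.functional as F

# Hypothetical class indices (e.g. 0 = person, 1 = car, 2 = tree) for three labeled objects.
labels = torch.tensor([0, 2, 1])
one_hot_labels = F.one_hot(labels, num_classes=3).float()
# tensor([[1., 0., 0.],
#         [0., 0., 1.],
#         [0., 1., 0.]])
```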
In this exemplary embodiment, the pre-training of the teacher network may be completed first. Specifically, an untrained network to be trained may be constructed and the real label data of the training data obtained; the training data are input into the network to be trained to obtain a second output result; a loss function is determined according to the second output result and the real label data; and the network to be trained is trained according to the loss function to obtain the teacher network.
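A minimal sketch of this pre-training, assuming a standard supervised PyTorch setup; the optimizer, the cross-entropy criterion, and the data-loader interface are assumptions, since the patent only specifies that a loss function is determined from the second output result and the real label data:

```python
import torch


def pretrain_teacher(network_to_train, data_loader, num_epochs=10, lr=1e-3):
    """Train the untrained network on the labeled training data; the trained
    network then serves as the pre-trained teacher network."""
    optimizer = torch.optim.SGD(network_to_train.parameters(), lr=lr)
    criterion = torch.nn.CrossEntropyLoss()  # assumed choice; the patent only says "a loss function"
    for _ in range(num_epochs):
        for inputs, labels in data_loader:
            outputs = network_to_train(inputs)     # the "second output result"
            loss = criterion(outputs, labels)      # loss from the output vs. the real label data
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return network_to_train
```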
The network complexity of the student network may be lower than that of the teacher network. The purpose of the training is to make the student network learn from the teacher network so that the output of the student network approaches that of the teacher network, thereby compressing the network.
The student network and the teacher network may be any type of network. For example, in a target detection task, the student network and the teacher network may be networks such as RCNN (Region Convolutional Neural Network) or Fast-RCNN (Fast Region Convolutional Neural Network). In an instance segmentation task, the student network and the teacher network may be Mask-RCNN (mask region convolutional neural network) networks. It should be noted that the present disclosure illustrates the network training method using an image processing task as an example; in practice, the network training method can also be applied to tasks such as word processing and speech processing. The network training method under such other tasks is not detailed in the present application.
In step S420, training data is input to the student network to obtain a first output characteristic of the training data, and a first activation value distribution of each channel of the first output characteristic is obtained by using a normalized exponential function of a preset temperature coefficient.
In an example embodiment of the disclosure, the first output feature is the output feature obtained by processing the training data through the student network. Referring to FIG. 5, the first output feature may comprise a first feature map for each of a plurality of channels, where the first feature map of each channel characterizes the image along one interpretable dimension with a specific feature meaning. For example, some first feature maps may characterize the texture features of the image, while others may characterize its contour features.
After obtaining the first output characteristic, a normalized exponential function of a preset temperature coefficient may be obtained, specifically, the normalized exponential function of the preset temperature coefficient may be a softmax function with a preset temperature coefficient, as follows:
$$\phi(x_i) = \frac{\exp(x_i / T)}{\sum_{j=1}^{H \cdot W} \exp(x_j / T)}, \quad i = 1, \dots, H \cdot W$$

where $x_i$ denotes the activation value at the $i$-th spatial position of a channel.
T is the preset temperature coefficient, whose value may be, for example, 1, 2, or 3, or 0.1, 0.2, and so on; it may also be customized according to user requirements and is not specifically limited in this example embodiment.
In this example embodiment, the first activation value distribution of each channel in the first output feature may be obtained using the normalized exponential function with the preset temperature coefficient. The size of the first output feature may be [C_S, H, W], where C_S denotes the number of channels and H and W denote the height and width of the feature map, respectively. The normalized exponential function converts the activation values of each channel into a distribution with values ranging from 0 to 1, denoted as the F_s distribution. The importance of each position of the current channel can be determined from this value, which eliminates the magnitude differences of activation values between different channels: positions with values close to 0 represent noise regions, and the closer a value is to 1, the more information the position contains.
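A minimal PyTorch sketch of this channel-wise normalized exponential function with a preset temperature coefficient; the function name and the batched [N, C, H, W] tensor layout are assumptions, as the patent describes the operation only in prose:

```python
import torch


def channel_activation_distribution(features: torch.Tensor,
                                    temperature: float = 1.0) -> torch.Tensor:
    """Convert the activation map of every channel into a distribution over its
    spatial positions with a temperature-scaled softmax.

    features: [N, C, H, W]  ->  returns [N, C, H*W], each row in (0, 1) and summing to 1.
    """
    n, c, h, w = features.shape
    flat = features.reshape(n, c, h * w)               # one vector of H*W activations per channel
    return torch.softmax(flat / temperature, dim=-1)


# e.g. the F_s distribution of the student's first output feature with temperature T = 2
# f_s = channel_activation_distribution(first_output_feature, temperature=2.0)
```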
In step S430, training data is input to the teacher network to obtain a second output characteristic of the training data, and a second activation value distribution of each channel of the second output characteristic is obtained by using the normalized exponential function.
In an example embodiment of the disclosure, the second output feature is the output feature obtained by processing the training data through the teacher network. Referring to FIG. 5, the second output feature may comprise a second feature map for each of a plurality of channels, where the second feature map of each channel characterizes the image along one interpretable dimension with a specific feature meaning. For example, some feature maps may characterize the texture features of the image, while others may characterize its contour features.
In this example embodiment, the processor may obtain a second activation value distribution of each channel in the second output feature using the same normalized exponential function with the preset temperature coefficient as for the student network. The size of the second output feature may be [C_T, H, W], where C_T denotes the number of channels and H and W denote the height and width of the feature map, respectively. The normalized exponential function converts the activation values of each channel into a distribution with values ranging from 0 to 1, denoted as the F_t distribution. The importance of each position of the current channel can be determined from this value, which eliminates the magnitude differences of activation values between different channels: positions with values close to 0 represent noise regions, and the closer a value is to 1, the more information the position contains. The range of attention of each channel of the teacher network can be controlled by adjusting the value of the temperature coefficient T.
In step S440, a target distillation loss is determined based on the first activation value distribution and the second activation value distribution.
In an exemplary embodiment of the present disclosure, referring to fig. 6, the determining the target distillation loss according to the first and second activation value distributions may include:
step S610, determining a corresponding relationship between each first activation value distribution and each second activation value distribution;
step S620, calculating a reference distillation loss of each channel according to each first activation value distribution and a second activation value distribution corresponding to the first activation value distribution;
step S630, summing the reference distillation losses of all channels to obtain the target distillation loss.
The above steps will be described in detail below.
In step S610, a corresponding relationship between each first activation value distribution and each second activation value distribution is determined.
In an example embodiment of the present disclosure, after the first output feature and the second output feature are obtained, a first activation value distribution and a second activation value distribution corresponding to the same channel between the first output feature and the second output feature may be determined.
In an example embodiment, when the number of channels of the first output feature is the same as the number of channels of the second output feature, the corresponding relationship between each first activation value distribution and each second activation value distribution may be directly determined. For example, the first output characteristic and the second output characteristic include R, G, B channels, and in this case, the first activation value distribution and the second activation value distribution of the R channel may be determined as a correspondence relationship, and the first activation value distribution and the second activation value distribution of the G channel may be determined as a correspondence relationship; and determining the first activation value distribution and the second activation value distribution of the B channel as a corresponding relation.
In this exemplary embodiment, when the number of channels of the first output feature differs from the number of channels of the second output feature, the processor may, in response to the number of channels of the second output feature being smaller than the number of channels of the first output feature, perform a convolution operation on the second output feature so that its number of channels becomes the same as that of the first output feature. Specifically, channels of the second output feature are replicated by the convolutional layer until the channel counts match.
For example, if the first output feature includes four channels R, G, B, and W, and the second output feature includes three channels R, G, and B, a convolution operation can be performed on the second output feature: a new channel W_1 is obtained by convolving the R, G, and B channels of the second output feature. Then the first activation value distribution and the second activation value distribution of the R channel are determined as a correspondence, the first activation value distribution and the second activation value distribution of the G channel are determined as a correspondence, the first activation value distribution and the second activation value distribution of the B channel are determined as a correspondence, and the first activation value distribution of the W channel and the second activation value distribution of the W_1 channel are determined as a correspondence.
In the present exemplary embodiment, if the second output feature needs several new channels before its channel count matches that of the first output feature, the new channels can be made different from one another by changing the size of the convolution kernel in the convolution operation. The specific kernel sizes may be customized according to user requirements and are not specifically limited in the present exemplary embodiment.
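One possible way to make the channel counts match is sketched below, assuming a learnable convolution (e.g., 1×1) generates the missing channels; the module name and default kernel size are assumptions, since the patent leaves the exact convolution configuration open:

```python
import torch


class ChannelAlign(torch.nn.Module):
    """Map the feature with fewer channels onto the channel count of the other
    feature so that every channel has a counterpart for distillation."""

    def __init__(self, in_channels: int, out_channels: int, kernel_size: int = 1):
        super().__init__()
        self.conv = torch.nn.Conv2d(in_channels, out_channels,
                                    kernel_size=kernel_size,
                                    padding=kernel_size // 2)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # [N, in_channels, H, W] -> [N, out_channels, H, W], spatial size unchanged
        return self.conv(features)


# e.g. expanding a 3-channel (R, G, B) feature to 4 channels (R, G, B, W_1)
align = ChannelAlign(in_channels=3, out_channels=4)
```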
In step S620, the reference distillation loss of each of the channels is calculated from each of the first activation value distributions and a second activation value distribution corresponding to the first activation value distribution using the correspondence relationship.
In an exemplary embodiment of the present disclosure, the reference distillation loss of each channel may be calculated directly from each first activation value distribution and the second activation value distribution corresponding to it according to the correspondence. For example, the reference distillation loss of the R channel may be calculated from the first and second activation value distributions of the R channel; the reference distillation loss of the G channel from the first and second activation value distributions of the G channel; and the reference distillation loss of the B channel from the first and second activation value distributions of the B channel. The reference distillation loss of the W and W_1 channels is calculated from the first activation value distribution of the W channel and the second activation value distribution of the W_1 channel, where the W and W_1 channels can be regarded as one and the same channel, i.e., their calculated reference distillation loss is the same.
When calculating the above distillation loss, the KL divergence can be used to measure, for each channel separately, the similarity between the second activation value distribution F_t and the first activation value distribution F_s, and this similarity is taken as the reference distillation loss of that channel.
In the present exemplary embodiment, the reference distillation loss of each channel can be calculated using the above-described method.
In step S630, the target distillation loss is calculated from the reference distillation losses of all the channels.
In an exemplary embodiment of the present disclosure, after obtaining the reference distillation loss of each channel, the reference distillation losses corresponding to all the channels may be first summed, and the result is taken as the target distillation loss.
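Putting steps S610 to S630 together, the following sketch computes the per-channel reference distillation loss with the KL divergence and sums it over all channels; it assumes the activation value distributions have already been channel-aligned and use the layout produced by the earlier sketch, and the divergence direction (teacher relative to student) is an assumption:

```python
import torch


def target_distillation_loss(f_s: torch.Tensor, f_t: torch.Tensor,
                             eps: float = 1e-8) -> torch.Tensor:
    """f_s, f_t: [N, C, H*W] activation value distributions of the student and
    teacher (e.g. produced by channel_activation_distribution above), with
    corresponding channels already aligned."""
    # Reference distillation loss per channel: KL divergence between the
    # teacher's and the student's distribution over spatial positions.
    kl_per_channel = (f_t * (torch.log(f_t + eps) - torch.log(f_s + eps))).sum(dim=-1)  # [N, C]
    # Target distillation loss: sum over all channels (averaged over the batch).
    return kl_per_channel.sum(dim=1).mean()
```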
In step S450, the regression loss of the student network is determined according to the training data.
In an example embodiment of the present disclosure, the real label data of the training data may be obtained first, where the obtaining manner of the real label data is described in detail above, and therefore is not described herein again.
In this example embodiment, the training data may be input into the student network to obtain a first output result, and the regression loss of the student network may then be obtained from the first output result and the real label data.
In step S460, network parameters of the student network are updated according to the regression loss and the distillation loss.
In an exemplary embodiment of the present disclosure, a weighted average of the regression loss and the distillation loss may be calculated as the target loss, where the weight of the distillation loss may be customized according to different application scenarios and user requirements and is not specifically limited in this exemplary embodiment; alternatively, the arithmetic mean of the regression loss and the distillation loss may be taken directly as the target loss. The network parameters of the student network are then updated using the target loss. The above steps are performed for each piece of training data to complete the training of the student network.
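A sketch of the combined update; the single weighting coefficient alpha on the distillation loss is an assumed parameterization of the weighted combination described above (the patent equally allows an arithmetic mean):

```python
import torch


def update_student(optimizer: torch.optim.Optimizer,
                   regression_loss: torch.Tensor,
                   distillation_loss: torch.Tensor,
                   alpha: float = 1.0) -> torch.Tensor:
    """Combine the regression loss and the target distillation loss into the
    target loss and update the student network's parameters."""
    target_loss = regression_loss + alpha * distillation_loss
    optimizer.zero_grad()
    target_loss.backward()
    optimizer.step()
    return target_loss.detach()
```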
The knowledge distillation method of the present disclosure is described systematically below with reference to FIG. 7.
First, the training data 701 may be input into the teacher network 702 and the student network 703, respectively. The training data passes through the backbone network 7021 of the teacher network 702 to output a second output feature 704, and through the backbone network 7031 of the student network 703 to output a first output feature 705. The size of the second output feature 704 is [C_T, H, W], where C_T denotes the number of channels and H and W denote the height and width of the feature map, respectively; the size of the first output feature 705 is [C_S, H, W], where C_S denotes the number of channels. The first output feature 705 is input into a channel-replication convolution layer 706 to obtain a third output feature with the same number of channels as the second output feature 704. The second output feature and the third output feature are input into a normalized exponential function layer 707 with a preset temperature coefficient to obtain a first activation value distribution 709 and a second activation value distribution 708. The first activation value distribution 709 and the second activation value distribution 708 are input into the KL-divergence distillation loss calculation layer 710 to calculate the distillation loss. The regression loss 711 of the training data after passing through the backbone network 7031, the feature pyramid 7032, and the detector 7033 of the student network is then obtained, and the target loss 712 is obtained from the distillation loss and the regression loss. Finally, the network parameters of the student network may be updated based on the target loss 712.
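Tying the data flow of FIG. 7 together, the following hedged end-to-end sketch reuses the helper functions from the earlier sketches; the module attributes (backbone, fpn, detector) and the regression-loss helper are placeholders rather than names from the patent:

```python
import torch


def distillation_iteration(training_data, labels, teacher, student, align,
                           student_optimizer, temperature=1.0, alpha=1.0):
    # Teacher branch: backbone 7021 -> second output feature (no gradients needed).
    with torch.no_grad():
        teacher_feat = teacher.backbone(training_data)          # [N, C_T, H, W]

    # Student branch: backbone 7031 -> first output feature.
    student_feat = student.backbone(training_data)              # [N, C_S, H, W]
    aligned_feat = align(student_feat)                          # "third output feature", C_T channels

    # Normalized exponential function layer 707 and KL distillation loss layer 710.
    f_t = channel_activation_distribution(teacher_feat, temperature)
    f_s = channel_activation_distribution(aligned_feat, temperature)
    distillation_loss = target_distillation_loss(f_s, f_t)

    # Student detection head: feature pyramid 7032 and detector 7033 -> regression loss 711.
    predictions = student.detector(student.fpn(student_feat))
    regression_loss = student.compute_regression_loss(predictions, labels)  # assumed helper

    # Target loss 712 and parameter update.
    return update_student(student_optimizer, regression_loss, distillation_loss, alpha)
```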
To sum up, in the exemplary embodiment, distillation is performed channel by channel. Because each channel corresponds to one category of activation response, distilling by channel enables the student network to focus on the salient activation values related to semantic parts rather than on background regions or noise, which improves the precision of knowledge distillation. Further, the activation value distribution of the output features is obtained with a normalized exponential function having a preset temperature coefficient, so the range of attention during knowledge distillation is determined by the value of the preset temperature. Calculating the distillation loss from activation value distributions eliminates the influence of the activation-value magnitude differences faced by conventional spatial knowledge distillation and improves the detection performance of the student network while maintaining the speed of the target detection model. The range of interest of each channel of the teacher network can be controlled by adjusting the value of the preset temperature coefficient T.
It is noted that the above-mentioned figures are merely schematic illustrations of processes involved in methods according to exemplary embodiments of the present disclosure, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
Further, referring to fig. 8, the embodiment of the present example also provides a knowledge distilling apparatus 800, which includes a network constructing module 810, a first obtaining module 820, a second obtaining module 830, a first determining module 840, a second determining module 850, and a parameter updating module 860. Wherein:
the network construction module 810 can be used for constructing an untrained student network and a pre-trained teacher network, and when constructing the pre-trained teacher network, firstly constructing an untrained network to be trained and acquiring real label data of training data; then inputting the training data into a network to be trained to obtain a second output result; secondly, determining a loss function according to the second output result and the real label data; and finally, training the network to be trained according to the loss function to obtain a teacher network.
The first obtaining module 820 may be configured to input the training data into the student network to obtain a first output feature of the training data, and obtain a first activation value distribution of each channel of the first output feature by using a normalized exponential function of a preset temperature coefficient.
The second obtaining module 830 may be configured to input the training data into the teacher network to obtain a second output feature of the training data, and obtain a second activation value distribution of each channel of the second output feature by using a normalized exponential function.
The first determining module 840 may be configured to determine the target distillation loss according to the first activation value distribution and the second activation value distribution. Specifically, a correspondence between each first activation value distribution and each second activation value distribution may first be determined; the reference distillation loss of each channel is then calculated, using the correspondence, from each first activation value distribution and the second activation value distribution corresponding to it; finally, the reference distillation losses of all channels are summed to obtain the target distillation loss. The similarity between the first activation value distribution and the second activation value distribution of each channel may be calculated using relative entropy, and this similarity may be used as the reference distillation loss.
Additionally, the first determining module 840 may be further configured to, in response to the number of channels of the second output feature being less than the number of channels of the first output feature, perform a convolution operation on the second output feature so that its number of channels is the same as that of the first output feature.
The second determining module 850 may be configured to determine the regression loss of the student network according to the training data, and specifically, may first obtain the real label data of the training data; then inputting the training data to a student network to obtain a first output result; and finally, calculating the regression loss according to the first output result and the real label data.
The parameter updating module 860 may be configured to update the network parameters of the student network according to the regression loss and the distillation loss, and specifically, the distillation loss and the regression loss may be first summed to obtain a target loss, and the network parameters of the student network may be updated by using the target loss.
The specific details of each module in the above apparatus have been described in detail in the method section, and details that are not disclosed may refer to the method section, and thus are not described again.
As will be appreciated by one skilled in the art, aspects of the present disclosure may be embodied as a system, method or program product. Accordingly, various aspects of the present disclosure may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.) or an embodiment combining hardware and software aspects that may all generally be referred to herein as a "circuit," "module," or "system."
Exemplary embodiments of the present disclosure also provide a computer-readable storage medium having stored thereon a program product capable of implementing the above-described method of the present specification. In some possible embodiments, various aspects of the disclosure may also be implemented in the form of a program product comprising program code for causing a terminal device to perform the steps according to various exemplary embodiments of the disclosure described in the above-mentioned "exemplary methods" section of this specification, when the program product is run on the terminal device.
It should be noted that the computer readable media shown in the present disclosure may be computer readable signal media or computer readable storage media or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Furthermore, program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is to be limited only by the terms of the appended claims.

Claims (10)

1. A method of knowledge distillation, comprising:
constructing an untrained student network and a pre-trained teacher network;
inputting training data into the student network to obtain a first output characteristic of the training data, and acquiring a first activation value distribution of each channel of the first output characteristic by adopting a normalized exponential function of a preset temperature coefficient;
inputting training data into the teacher network to obtain a second output characteristic of the training data, and obtaining a second activation value distribution of each channel of the second output characteristic by adopting the normalized exponential function;
determining a target distillation loss from the first and second activation value distributions;
determining a regression loss of the student network according to the training data;
updating network parameters of the student network according to the regression loss and the distillation loss.
2. The method of claim 1, wherein determining a target distillation loss from the first and second distributions of activation values comprises:
determining a corresponding relation between each first activation value distribution and each second activation value distribution;
calculating a reference distillation loss of each channel according to each first activation value distribution and a second activation value distribution corresponding to the first activation value distribution by using the corresponding relation;
calculating the target distillation loss from the reference distillation losses for all channels.
3. The method according to claim 2, wherein the calculating the reference distillation loss of each of the channels from each of the first activation value distributions and a second activation value distribution corresponding to the first activation value distribution using the correspondence relationship includes:
calculating a similarity between the first activation value distribution and the second activation value distribution of each of the channels using the relative entropy, and taking the similarity as the reference distillation loss.
4. The method of claim 1, further comprising:
in response to the number of channels of the second output characteristic being less than the number of channels of the first output characteristic;
performing convolution operation on the second output characteristic to enable the number of channels of the second output characteristic to be the same as the number of channels of the first output characteristic.
5. The method of claim 1, wherein determining the regression loss for the student network from the training data comprises:
acquiring real label data of the training data;
inputting the training data to the student network to obtain a first output result;
and calculating the regression loss according to the first output result and the real label data.
6. The method of claim 1, wherein the pre-training comprises:
constructing an untrained network to be trained, and acquiring real label data of the training data;
inputting the training data into the network to be trained to obtain a second output result;
determining a loss function according to the second output result and the real label data;
and training the network to be trained according to the loss function to obtain the teacher network.
7. The method of claim 1, wherein said updating network parameters of said student network based on said regression loss and said distillation loss comprises:
and calculating target loss according to the distillation loss and the regression loss, and updating network parameters of the student network by using the target loss.
8. A knowledge distillation apparatus, comprising:
the network construction module is used for constructing an untrained student network and a pre-trained teacher network;
the first acquisition module is used for inputting training data into the student network to obtain a first output characteristic of the training data and acquiring a first activation value distribution of each channel of the first output characteristic by adopting a normalized exponential function of a preset temperature coefficient;
the second acquisition module is used for inputting training data into the teacher network to obtain second output characteristics of the training data and acquiring second activation value distribution of each channel of the second output characteristics by adopting the normalized exponential function;
a first determination module for determining a target distillation loss from the first and second activation value distributions;
the second determining module is used for determining the regression loss of the student network according to the training data;
and the parameter updating module is used for updating the network parameters of the student network according to the regression loss and the distillation loss.
9. A computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, carries out the knowledge distillation method of any one of claims 1 to 7.
10. An electronic device, comprising:
one or more processors; and
a memory for storing one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the knowledge distillation method of any of claims 1 to 7.
CN202111424814.4A 2021-11-26 2021-11-26 Knowledge distillation method and device, storage medium and electronic equipment Pending CN114139703A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111424814.4A CN114139703A (en) 2021-11-26 2021-11-26 Knowledge distillation method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111424814.4A CN114139703A (en) 2021-11-26 2021-11-26 Knowledge distillation method and device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN114139703A true CN114139703A (en) 2022-03-04

Family

ID=80388637

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111424814.4A Pending CN114139703A (en) 2021-11-26 2021-11-26 Knowledge distillation method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN114139703A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115880486A (en) * 2023-02-27 2023-03-31 广东电网有限责任公司肇庆供电局 Target detection network distillation method and device, electronic equipment and storage medium


Similar Documents

Publication Publication Date Title
CN111488489B (en) Video file classification method, device, medium and electronic equipment
CN111314733A (en) Method and apparatus for evaluating video sharpness
CN110781413A (en) Interest point determining method and device, storage medium and electronic equipment
CN113435451A (en) Model, training method and device of model, and recognition and device of character sequence
CN116050496A (en) Determination method and device, medium and equipment of picture description information generation model
CN113327599B (en) Voice recognition method, device, medium and electronic equipment
CN111950700A (en) Neural network optimization method and related equipment
CN114495916B (en) Method, device, equipment and storage medium for determining insertion time point of background music
CN115563335A (en) Model training method, image-text data processing device, image-text data processing equipment and image-text data processing medium
CN115205150A (en) Image deblurring method, device, equipment, medium and computer program product
CN114494942A (en) Video classification method and device, storage medium and electronic equipment
CN113658122A (en) Image quality evaluation method, device, storage medium and electronic equipment
CN114139703A (en) Knowledge distillation method and device, storage medium and electronic equipment
CN117237761A (en) Training method of object re-recognition model, object re-recognition method and device
CN113850012A (en) Data processing model generation method, device, medium and electronic equipment
CN113762503A (en) Data processing method, device, equipment and computer readable storage medium
CN114330239A (en) Text processing method and device, storage medium and electronic equipment
CN113327265B (en) Optical flow estimation method and system based on guiding learning strategy
CN115905613A (en) Audio and video multitask learning and evaluation method, computer equipment and medium
CN113361510B (en) Hyper-distributed network model training method and device, electronic equipment and storage medium
CN116090543A (en) Model compression method and device, computer readable medium and electronic equipment
CN114399648A (en) Behavior recognition method and apparatus, storage medium, and electronic device
CN115883878A (en) Video editing method and device, electronic equipment and storage medium
CN116824308B (en) Image segmentation model training method and related method, device, medium and equipment
CN114372946A (en) Image processing method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination