CN117892831A - Task processing method and device based on data-free knowledge distillation and electronic equipment


Info

Publication number
CN117892831A
Authority
CN
China
Prior art keywords
generator
loss
model
student
data
Prior art date
Legal status
Pending
Application number
CN202211229722.5A
Other languages
Chinese (zh)
Inventor
禤韵怡
浦世亮
陈伟杰
杨世才
谢迪
Current Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Original Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Hangzhou Hikvision Digital Technology Co Ltd
Priority to CN202211229722.5A
Publication of CN117892831A


Abstract

The application provides a task processing method and device based on data-free knowledge distillation, and electronic equipment, wherein the method comprises the following steps: receiving a student model acquisition request sent by a terminal device, wherein the student model acquisition request carries a task processing requirement; obtaining, according to the task processing requirement, a teacher model meeting the task processing requirement; training a plurality of generators and a plurality of student models according to the teacher model; and sending a specified student model among the trained plurality of student models to the terminal device, the terminal device performing task processing using the specified student model. The method can improve the accuracy of the terminal device when it performs task processing using the trained student model.

Description

Task processing method and device based on data-free knowledge distillation, and electronic equipment
Technical Field
The application relates to the technical field of knowledge transfer, and in particular to a task processing method and device based on data-free knowledge distillation, and electronic equipment.
Background
In real-world scenarios, a terminal device typically does not have enough resources, due to hardware performance limitations, to run a larger and more complex model; because of these hardware resource limitations, the terminal device also cannot extract knowledge from larger and more complex models by way of knowledge distillation so as to achieve model compression and obtain a model with lower hardware resource requirements.
Moreover, for reasons such as a company's privacy policy and data security, the original training data is confidential and is not disclosed. In the absence of the original data, it is difficult for a traditional knowledge distillation framework to obtain a student model with high accuracy.
Therefore, in conventional schemes, the terminal device can only use a model with low resource requirements but relatively poor performance to perform tasks such as image processing, and there is often a problem of insufficient accuracy.
Disclosure of Invention
In view of the above, the application provides a task processing method and device based on data-free knowledge distillation, and electronic equipment, so as to solve the problem of insufficient accuracy when a terminal device performs tasks such as image processing using a model with poor performance in conventional schemes.
Specifically, the application is realized by the following technical scheme:
according to a first aspect of an embodiment of the present application, there is provided a task processing method based on no data knowledge distillation, applied to a server device, the server device implementing no data knowledge distillation using a no data knowledge distillation framework including a plurality of generators, a plurality of student models, the method comprising:
Receiving a student model acquisition request sent by a terminal device, wherein the student model acquisition request carries task processing requirements;
Obtaining a teacher model meeting the task processing requirements according to the task processing requirements;
Training the plurality of generators and the plurality of student models according to the teacher model:
In the training process of the generators, for any generator, feedback optimization is performed on the generator according to a first type of loss and a second type of loss; the first type of loss is used for characterizing the similarity between the data distribution of the generated data of the generator and the data distribution of the original training data of the teacher model, and the higher this similarity is, the smaller the first type of loss is; the second type of loss is determined according to a first prediction distance of the teacher model and the student model with respect to the generated data of the generator and a second prediction distance of different student models with respect to the generated data of the generator, and the larger the sum of the first prediction distance and the second prediction distance is, the smaller the second type of loss is;
training the student model by using the generated data of the generator in the knowledge distillation process, and carrying out feedback optimization on the student model in training according to the third type of loss; the third type of loss is determined according to a third prediction distance of the teacher model and the student model to the generated data of the generator, and the smaller the third prediction distance is, the smaller the third type of loss is;
and sending a specified student model among the trained plurality of student models to the terminal device, the terminal device performing task processing using the specified student model.
According to a second aspect of the embodiments of the present application, there is provided a task processing apparatus based on data-free knowledge distillation, deployed on a server device, the server device implementing data-free knowledge distillation using a data-free knowledge distillation framework comprising a plurality of generators and a plurality of student models, the apparatus comprising:
The receiving unit is used for receiving a student model acquisition request sent by the terminal equipment, wherein the student model acquisition request carries task processing requirements;
the acquisition unit is used for acquiring a teacher model meeting the task processing requirements according to the task processing requirements;
The training unit is used for training the generators and the student models according to the teacher model:
In the training process of the generators, for any generator, feedback optimization is performed on the generator according to a first type of loss and a second type of loss; the first type of loss is used for characterizing the similarity between the data distribution of the generated data of the generator and the data distribution of the original training data of the teacher model, and the higher this similarity is, the smaller the first type of loss is; the second type of loss is determined according to a first prediction distance of the teacher model and the student model with respect to the generated data of the generator and a second prediction distance of different student models with respect to the generated data of the generator, and the larger the sum of the first prediction distance and the second prediction distance is, the smaller the second type of loss is;
training the student model by using the generated data of the generator in the knowledge distillation process, and carrying out feedback optimization on the student model in training according to the third type of loss; the third type of loss is determined according to a third prediction distance of the teacher model and the student model to the generated data of the generator, and the smaller the third prediction distance is, the smaller the third type of loss is;
and the task processing unit is configured to send a specified student model among the trained plurality of student models to the terminal device, the terminal device performing task processing using the specified student model.
According to a third aspect of the embodiments of the present application, there is provided an electronic device comprising a processor and a memory storing machine-executable instructions that can be executed by the processor, the processor being configured to execute the machine-executable instructions to implement the method provided in the first aspect.
According to a fourth aspect of embodiments of the present application, there is provided a machine-readable storage medium having stored thereon machine-executable instructions which when executed by a processor implement the method provided in the first aspect.
According to the task processing method based on data-free knowledge distillation of the present application, when a student model acquisition request sent by a terminal device is received, a teacher model meeting the task processing requirement is obtained according to the task processing requirement carried in the student model acquisition request, and data-free knowledge distillation is implemented using a data-free knowledge distillation framework comprising a plurality of generators and a plurality of student models. In the training process of the generators, the similarity between the data distribution of the generated data of the generators and the data distribution of the original training data of the teacher model is considered, the prediction distance of the teacher model and the student models with respect to the generated data of the generators is considered, and the prediction distance of different student models with respect to the generated data of the generators is also considered, so that, while the generated data distribution remains similar to that of the original training data, the generated data tends to cause confusion between the teacher model and the student models and confusion among different student models, and the information not yet mined from the teacher model is thereby fully utilized.
Drawings
FIG. 1 is a flow diagram illustrating a task processing method based on data-free knowledge distillation according to an exemplary embodiment of the present application;
FIG. 2 is a schematic diagram of a task processing apparatus based on data-free knowledge distillation according to an exemplary embodiment of the present application;
Fig. 3 is a schematic diagram of a hardware structure of an electronic device according to an exemplary embodiment of the present application.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the application. Rather, they are merely examples of apparatus and methods consistent with aspects of the application as detailed in the accompanying claims.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
In order to better understand the technical solution provided by the embodiments of the present application and make the above objects, features and advantages of the embodiments of the present application more obvious, the technical solution in the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Referring to fig. 1, which is a flow chart of a task processing method based on data-free knowledge distillation according to an embodiment of the present application, the method may be applied to a server device, and as shown in fig. 1, the task processing method based on data-free knowledge distillation may include the following steps:
Step S100, receiving a student model acquisition request sent by a terminal device, wherein the student model acquisition request carries task processing requirements.
In the embodiment of the application, it is considered that in practical application scenarios many terminal devices have limited hardware performance; such a terminal device generally does not have enough resources to run a larger and more complex model, and generally cannot extract the knowledge of larger and more complex models by way of knowledge distillation so as to achieve model compression and obtain a model with lower hardware resource requirements.
For example, the processing chip of the terminal device generally has limited performance, such as limited computing power, and cannot run a larger and more complex model; furthermore, the performance of the GPU (Graphics Processing Unit) of the terminal device is often limited, or the terminal device may not even be provided with a GPU, and thus it cannot perform model training.
The processing chip of the terminal device may include, but is not limited to: a CPU (Central Processing Unit), a DSP (Digital Signal Processing) chip, an FPGA (Field Programmable Gate Array) chip, an AI (Artificial Intelligence) processor, and the like.
Therefore, the terminal device can send a student model acquisition request to the server device according to its task processing requirement, and acquire a student model meeting the task processing requirement from the server.
The student model acquisition request sent by the terminal device to the server device may carry the task processing requirement, and may also carry test data for performing a performance test on the trained student model.
The student model is trained according to the task processing requirement and the results of the student model processing the task, and when the processing results of the student model meet the task processing requirement, the student model is determined to have been trained.
For example, the task processing requirement of the terminal device may be determined according to the task configuration information set by relevant personnel for the terminal device.
For example, assuming that the terminal device is configured to perform an image classification task, the task processing requirements of the terminal device are image classification.
By way of example, the terminal device may include one or more of a camera, a sweeping robot, a robot, an autonomous driving device, and the like.
For example, the task processing requirements may also include processing object information.
For example, taking an image classification task as an example, assuming that the terminal device is configured to classify cats/dogs, the task processing requirements may also carry processing object information (i.e., cats and dogs).
For example, taking an action recognition task as an example, assuming that the terminal device is configured to recognize hand motions such as waving, the task-related data may be video data.
It should be noted that, the student model acquisition request may also carry test data, and when the server device obtains the trained student model, the performance of the trained student model may be tested according to the test data.
The test data may be obtained from a specified database by the terminal device or imported into the terminal device by a person involved, for example.
Step S110, obtaining a teacher model meeting the task processing requirements according to the task processing requirements.
In the embodiment of the application, when the server side equipment receives the student model acquisition request sent by the terminal equipment, the task processing requirement carried in the student model acquisition request can be acquired, and the teacher model meeting the task processing requirement can be acquired according to the task processing requirement.
By way of example, teacher models may include already trained models (typically better performing models) obtained from a network, or trained models obtained from a particular device.
Step S120, training a plurality of generators and a plurality of student models according to the acquired teacher model.
In the embodiment of the application, since the original training data of the teacher model cannot be obtained, the student models can be obtained by way of data-free knowledge distillation.
In order to fully mine the knowledge stored in the teacher model and improve the accuracy of the student models, a plurality of student models may be provided in the data-free knowledge distillation framework.
In the process of training the generators, on one hand, the similarity between the data distribution of the generated data of the generator and the data distribution of the original training data can be made as high as possible; on the other hand, the prediction distance of the teacher model and the student model with respect to the generated data of the generator, and the prediction distance of different student models with respect to the generated data of the generator, can be enlarged as much as possible, so as to increase the information content of the generated data of the generator.
In addition, in order to increase the diversity of the generated data, a plurality of generators may be provided in the data-free knowledge distillation framework, and training samples may be generated for the plurality of student models by the plurality of generators.
For example, different generators may generate training samples for different student models. For example, the generator may be in one-to-one correspondence with the student model.
That is, the data-free knowledge distillation framework in the embodiment of the application is a data-free knowledge distillation framework comprising a plurality of generators and a plurality of student models.
For example, the training of the plurality of generators and the plurality of student models according to the acquired teacher model may include the steps of:
Step S121, for any generator, in the process of training the generator, feedback optimization is performed on the generator according to the first type of loss and the second type of loss; the first type of loss is used for characterizing the similarity between the data distribution of the generated data of the generator and the data distribution of the original training data of the teacher model, and the higher this similarity is, the smaller the first type of loss is; the second type of loss is determined according to a first prediction distance of the teacher model and the student model with respect to the generated data of the generator and a second prediction distance of different student models with respect to the generated data of the generator, and the larger the sum of the first prediction distance and the second prediction distance is, the smaller the second type of loss is.
Step S122, training a student model by using the generated data of the generator in the knowledge distillation process, and performing feedback optimization on the student model in training according to the third type loss; wherein the third type of loss is determined according to a third predicted distance of the teacher model and the student model to the generated data of the generator, and the smaller the third predicted distance is, the smaller the third type of loss is.
In the embodiment of the present application, for any one of the generators, in the training process of the generator, the loss function of the generator may include a loss (referred to herein as the first type of loss) used for characterizing the similarity between the data distribution of the generated data of the generator and the data distribution of the original training data of the teacher model, and a loss (referred to herein as the second type of loss) determined according to the prediction distance (referred to herein as the first prediction distance) of the teacher model and the student model with respect to the generated data of the generator and the prediction distance (referred to herein as the second prediction distance) of different student models with respect to the generated data of the generator.
Illustratively, the predicted distance of the teacher model and the student model to the generated data of the generator may be characterized by a distance between the output of the teacher model to the generated data of the generator and the output of the student model to the generated data of the generator.
For any generator, the higher the similarity between the data distribution of the generated data of the generator and the data distribution of the original training data of the teacher model is, the smaller the loss of the first type corresponding to the generator is, that is, the data distribution of the generated data of the generator and the data distribution of the original training data of the teacher model can be made to be more similar by reducing the loss of the first type corresponding to the generator in the training process.
For any one of the generators, the larger the sum of the first predicted distance and the second predicted distance is, the smaller the second type loss corresponding to the generator is, i.e. the predicted distance (i.e. the first predicted distance) of the teacher model and the student model to the generated data of the generator can be enlarged, and the predicted distance (i.e. the second predicted distance) of the different student models to the generated data of the generator can be enlarged by reducing the second type loss corresponding to the generator in the training process of the generator.
Through the training process, the data distribution of the generated data of the generator can be more similar to the data distribution of the original training data as far as possible, and the learning of the student model on the knowledge of the teacher model is ensured; and the generated data of the generator is easier to cause confusion between the teacher model and the student model and confusion between different student models, and the information which is not mined in the teacher model is fully utilized.
Taking task processing requirements of the terminal device as image processing (such as image classification or image segmentation and the like) as an example, in the training process of the generator, on one hand, the first type of loss can be determined according to the similarity between the data distribution of the generated image data of the generator and the data distribution of the original training image data of the teacher model.
On the other hand, the generated image data of the generator may be input into the teacher model and the student model, respectively, to obtain the output of the teacher model to the generated image data of the generator and the output of the student model to the generated image data of the generator, and the prediction distances (i.e., the first prediction distances) of the teacher model and the student model to the generated image data of the generator may be determined according to the distances between the output of the teacher model to the generated image data of the generator and the output of the student model to the generated image data of the generator.
In still another aspect, the generated image data of the generator may be input into different student models, to obtain outputs of the different student models to the generated image data of the generator, and the prediction distances (i.e., the second prediction distances) of the different student models to the generated image data of the generator may be determined according to the distances between the outputs of the different student models to the generated image data of the generator.
In the embodiment of the application, under the condition that the trained generator is obtained in the mode, knowledge distillation can be performed by utilizing the generated data of the generator, namely, the student model is trained.
For example, in training a student model using the generated data of the generator, the loss function of the student model may include a loss (referred to herein as a third type of loss) determined from a teacher model and a predicted distance of the student model to the generated data of the generator, and the student model in training may be feedback optimized according to the third type of loss.
Illustratively, the smaller the predicted distance of the teacher model and the student model to the generated data of the generator, the smaller the third type of loss, i.e., the predicted distance of the teacher model and the student model to the generated data of the generator may be reduced by reducing the third type of loss during knowledge distillation.
Taking task processing requirements of the terminal device as an example of image processing, under the condition that the trained generator is obtained, generating image data of the trained generator can be respectively input into a teacher model and a student model to obtain output of the generated image data of the trained generator by the teacher model and output of the generated image data of the trained generator by the student model, and third type loss is determined according to the distance between the output of the generated image data of the trained generator by the teacher model and the output of the generated image data of the trained generator by the student model.
The data distribution of the generated data of the trained generator is higher in similarity with the data distribution of the original training data, and the generated data of the generator is easy to cause confusion between a teacher model and a student model and confusion between different student models, so that the training of the student model is performed by using the generated data of the trained generator, and the performance of the student model obtained by training can be effectively improved under the condition of getting rid of dependence on the original training data.
It should be noted that, in the embodiment of the present application, the training of the generators and the knowledge distillation may be performed alternately. Each time a generator generates a batch of data (i.e., generated data), the student models can be trained according to this batch of generated data, the second type of loss of the generator is determined using the trained student models so that feedback optimization is performed on the generator, and the next batch of data is generated using the optimized generator.
In addition, for each batch of data generated by the generator, storage may be performed, and in training the student model using the generated data of the generator, not only the generated data of the current batch but also the generated data of the historical batch may be used. For example, each time a student model is trained, the student model may be trained using all of the current generated data of the generator (including the generated data of the current batch and the generated data of all of the historical batches).
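For illustration only, the alternating schedule described above could be sketched roughly as follows. This is a minimal PyTorch-style sketch under assumptions not stated in this embodiment (names such as generator_loss, noise_dim, rounds and the per-model optimizers g_opts/s_opts are hypothetical; generator_loss stands in for the generator objective sketched later in this description):

```python
import torch
import torch.nn.functional as F

def train_data_free_kd(teacher, generators, students, g_opts, s_opts,
                       generator_loss, rounds=200, batch_size=64,
                       noise_dim=100, device="cpu"):
    """Hypothetical alternating schedule: generator step, then distillation step."""
    teacher.eval()
    replay = []  # generated data of all historical batches
    for _ in range(rounds):
        # Generator step: each generator produces one batch and is optimized.
        for i, (g, opt) in enumerate(zip(generators, g_opts)):
            z = torch.randn(batch_size, noise_dim, device=device)
            x = g(z)
            loss = generator_loss(teacher, students, i, x)  # L_inv for generator g_i
            opt.zero_grad()
            loss.backward()
            opt.step()
            replay.append(x.detach())  # keep the batch for later distillation
        # Distillation step: students learn from current + historical batches.
        x_all = torch.cat(replay, dim=0)
        with torch.no_grad():
            t_out = teacher(x_all)
        for s, opt in zip(students, s_opts):
            kd_loss = F.l1_loss(s(x_all), t_out)  # MAE between student and teacher outputs
            opt.zero_grad()
            kd_loss.backward()
            opt.step()
```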
Step S130, sending a specified student model among the trained plurality of student models to the terminal device, the terminal device performing task processing using the specified student model.
In the embodiment of the application, when the trained plurality of student models are obtained in the above manner, the specified student model among the trained plurality of student models can be sent to the terminal device, the specified student model is deployed by the terminal device, and task processing is performed using the specified student model.
For example, when the terminal device receives the specified student model sent by the server device, the specified student model may be stored in the memory.
When the terminal device needs to perform task processing, for example, when receiving a task processing instruction, a processing chip of the terminal device, such as a CPU, a DSP chip, an FPGA chip or an AI processor, may read a specified student model from a memory, and perform task processing using the specified student model.
For example, taking an image processing task as an example, the processing chip of the terminal device may perform image processing, such as image classification or image segmentation, on an acquired image, such as an image acquired by a camera of the terminal device, using the specified student model read from the memory.
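As an illustration of the deployment side only, the terminal device's use of the received model might look like the following minimal sketch (the TorchScript packaging, file names and input resolution are assumptions, not part of this embodiment):

```python
import torch
from PIL import Image
from torchvision import transforms

# Hypothetical terminal-side inference: load the specified student model
# received from the server device and classify one captured image.
student = torch.jit.load("specified_student.pt").eval()  # assumed packaging/file name
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),                        # assumed input resolution
    transforms.ToTensor(),
])
image = preprocess(Image.open("captured.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    predicted_class = student(image).argmax(dim=1).item()
print(predicted_class)
```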
For example, the specified student model may be a student model with optimal performance among a plurality of trained student models, for example, a student model with highest accuracy determined by testing with a test set.
Or the specified student model may be determined according to an actual requirement of the terminal device, for example, according to a requirement of the terminal device on the model structure, and the student model with the model structure meeting the requirement is determined as the specified student model from the trained plurality of student models.
By way of example, the generated data has the characteristics of large information quantity and high diversity, knowledge in the teacher model is fully mined, so that the student model can learn more comprehensive knowledge in the teacher model, and therefore, the precision of the student model can be effectively improved, and further, when the terminal equipment performs task processing by using the student model obtained through training, the accuracy of task processing can be effectively improved.
By way of example, the tasks may include image processing tasks, speech processing tasks, and/or text processing tasks, among others.
Taking an image processing task as an example, the terminal device may perform image processing on an image to be processed using the above-described specified student model, for example, performing image classification, object detection, image segmentation, or the like.
It should be noted that, when the student model acquisition request further carries test data, the server device may test the trained student model by using the test data carried in the received student model acquisition request, so as to determine whether the test result of the trained student model on the test data meets the requirement.
Taking an image classification task as an example, for any trained student model, the trained student model can be utilized to carry out image classification on test data, the accuracy of image classification is counted, and if the accuracy exceeds a preset accuracy threshold, the trained student model is determined to meet the requirement on the test result of the test data.
In a case where the test results of the trained plurality of student models on the test data all meet the requirement, the server device may send the specified student model among the trained plurality of student models to the terminal device, the specified student model is deployed by the terminal device, and task processing is performed using the specified student model.
In a case where the test result of any trained student model on the test data does not meet the requirement, the student model may be retrained in the manner described above, and after training is completed, the test data are used again to test the retrained student model, until the test results of the plurality of trained student models on the test data all meet the requirement; or, in a case where the test results of some of the trained student models on the test data do not meet the requirement, the specified student model may be selected from the trained student models whose test results meet the requirement and sent to the terminal device, and the terminal device performs task processing using the specified student model.
It can be seen that, in the method flow shown in fig. 1, when a student model acquisition request sent by a terminal device is received, a teacher model meeting the task processing requirement is obtained according to the task processing requirement carried in the student model acquisition request, and data-free knowledge distillation is implemented using a data-free knowledge distillation framework comprising a plurality of generators and a plurality of student models. In the training process of the generators, the similarity between the data distribution of the generated data of the generators and the data distribution of the original training data of the teacher model is considered, the prediction distance of the teacher model and the student models with respect to the generated data of the generators is considered, and the prediction distance of different student models with respect to the generated data of the generators is also considered, so that, while the generated data distribution remains similar to that of the original training data, the generated data tends to cause confusion between the teacher model and the student models and confusion among different student models, and the information not yet mined from the teacher model is thereby fully utilized. In addition, generating training samples with a plurality of generators effectively improves the diversity of the generated data, so that the accuracy of the trained student models is improved; the specified student model among the trained student models is sent to the terminal device, and the terminal device performs task processing using the specified student model, so that the accuracy of task processing by the terminal device is improved.
In some embodiments, the first type of loss described above may be determined by:
determining a batch normalization loss according to the distance between the statistical mean and variance of the generated data of the generator at the batch normalization layers of the teacher model and the mean and variance stored in the batch normalization layers of the teacher model;
determining a class prior loss according to the output of the teacher model with respect to the generated data of the generator and the predefined class labels of the generated data of the generator;
determining the first type of loss of the generator based on the batch normalization loss and the class prior loss.
For example, for any generator, the batch normalization loss may be determined according to a distance between a statistical mean and variance of the generated data of the generator at the batch normalization layer of the teacher model (i.e., a statistical mean of the generated data of the generator at the activation feature map of the batch normalization layer of the teacher model, and a statistical variance of the generated data of the generator at the activation feature map of the batch normalization layer of the teacher model), and a mean and variance of the batch normalization layer of the teacher model (i.e., a mean of the raw training data stored in the batch normalization layer of the teacher model, and a variance of the raw training data stored in the batch normalization layer of the teacher model).
Taking task processing requirements of terminal equipment as an image processing example, for any generator, the generated image data of the generator can be input into a teacher model, the statistical mean and the statistical variance of the activation feature images of the generated image data in the batch normalization layer of the teacher model are determined, the distance between the statistical mean and the statistical variance of the activation feature images of the generated image data in the batch normalization layer of the teacher model and the mean and the variance of the original training image data stored in the batch normalization layer of the teacher model is determined, and the batch normalization loss is determined according to the distance.
In one example, the batch normalization loss may be determined by the following equation:

L_{bn} = \sum_{k} \left( \left\| \mu_k(g(z)) - \hat{\mu}_k \right\|_2 + \left\| \sigma^2_k(g(z)) - \hat{\sigma}^2_k \right\|_2 \right)

where \mu_k(g(z)) is the statistical mean of the activation feature map of the generated data of the generator at the k-th batch normalization layer of the teacher model, \sigma^2_k(g(z)) is the statistical variance of the activation feature map of the generated data of the generator at the k-th batch normalization layer of the teacher model, \hat{\mu}_k is the mean of the original training data stored in the k-th batch normalization layer of the teacher model, and \hat{\sigma}^2_k is the variance of the original training data stored in the k-th batch normalization layer of the teacher model.
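As a rough illustration only, the batch normalization loss above could be computed along the following lines (a PyTorch-style sketch assuming the teacher uses BatchNorm2d layers; the function name bn_loss and the use of forward hooks are assumptions):

```python
import torch
import torch.nn as nn

def bn_loss(teacher, generated):
    """Distance between the batch statistics of the generated data at each BN
    layer of the teacher and the running statistics stored in that layer."""
    stats = []

    def hook(module, inputs, output):
        x = inputs[0]
        mu = x.mean(dim=[0, 2, 3])                  # per-channel batch mean
        var = x.var(dim=[0, 2, 3], unbiased=False)  # per-channel batch variance
        stats.append((mu, var, module.running_mean, module.running_var))

    handles = [m.register_forward_hook(hook)
               for m in teacher.modules() if isinstance(m, nn.BatchNorm2d)]
    teacher(generated)
    for h in handles:
        h.remove()
    return sum(torch.norm(mu - rm, p=2) + torch.norm(var - rv, p=2)
               for mu, var, rm, rv in stats)
```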
The class prior loss is determined by the following formula:

L_{cls} = \mathrm{CE}(f_t(g(z)), y)

where \mathrm{CE}(f_t(g(z)), y) is the cross-entropy loss between f_t(g(z)) and y, f_t(g(z)) is the output of the teacher model with respect to the generated data of the generator, and y is the predefined class label of the generated data of the generator.
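A corresponding minimal sketch of the class prior loss is given below (how the predefined labels y are sampled is an assumption):

```python
import torch
import torch.nn.functional as F

def class_prior_loss(teacher, generated, y):
    """Cross entropy between the teacher output on the generated data and the
    predefined class labels y (assumed to be sampled uniformly beforehand)."""
    return F.cross_entropy(teacher(generated), y)

# e.g. y = torch.randint(0, num_classes, (generated.size(0),), device=generated.device)
```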
In some embodiments, the second type of loss may be determined by:
determining a first adversarial loss according to the distance between the output of the teacher model with respect to the generated data of the generator and the output of a target student model with respect to the generated data of the generator;
determining a second adversarial loss according to the distance between the output of the target student model with respect to the generated data of the generator and the outputs of other student models with respect to the generated data of the generator; the target student model is the student model corresponding to the generator among the plurality of student models, and the other student models are the student models other than the target student model among the plurality of student models;
determining the second type of loss based on the first adversarial loss and the second adversarial loss.
Illustratively, in order to maximize, during the training process of the generator, the prediction distance of the teacher model and the student model with respect to the generated data of the generator and the prediction distance of different student models with respect to the generated data of the generator, the loss function of the generator may further include adversarial losses.
For example, for any generator, the adversarial losses in the generator's loss function may include an adversarial loss (referred to herein as the first adversarial loss) determined according to the distance between the output of the teacher model with respect to the generated data of the generator and the output of the student model corresponding to the generator (referred to herein as the target student model) with respect to the generated data of the generator, and an adversarial loss (referred to herein as the second adversarial loss) determined according to the distance between the output of the target student model with respect to the generated data of the generator and the outputs of the other student models with respect to the generated data of the generator.
For example, suppose that the generators in the data-free knowledge distillation framework provided by the embodiment of the present application include generators 1 to 3, the student models include student models 1 to 3, and generator 1 corresponds to student model 1, generator 2 corresponds to student model 2, and generator 3 corresponds to student model 3. In the training process of generator 1, the first adversarial loss can be determined according to the distance between the output of the teacher model with respect to the generated data of generator 1 and the output of student model 1 (i.e., the target student model corresponding to generator 1) with respect to the generated data of generator 1; the second adversarial loss is determined according to the distance between the output of student model 1 with respect to the generated data of generator 1 and the output of student model 2 with respect to the generated data of generator 1, and the distance between the output of student model 1 with respect to the generated data of generator 1 and the output of student model 3 with respect to the generated data of generator 1.
Illustratively, the greater the distance between the output of the teacher model and the output of the target student model with respect to the generated data of the generator, the greater the first adversarial loss.
Illustratively, the greater the distance between the output of the target student model with respect to the generated data of the generator and the outputs of the other student models with respect to the generated data of the generator, the greater the second adversarial loss.
For example, when the first adversarial loss and the second adversarial loss are determined, the second type of loss of the generator may be determined according to the first adversarial loss and the second adversarial loss.
Illustratively, the second type of loss is negatively correlated with the first adversarial loss and with the second adversarial loss, respectively.
In one example, for generator g_i, the first adversarial loss may be determined by the following equation:

L_{ST} = D_{KL}\left( s_i(g_i(z_i)) \,\|\, f_t(g_i(z_i)) \right)

where g_i(z_i) is the generated data of generator g_i, D_{KL}( s_i(g_i(z_i)) \| f_t(g_i(z_i)) ) is the KL divergence between the output of student model s_i with respect to the generated data of generator g_i and the output of the teacher model with respect to the generated data of generator g_i, and s_i is the target student model of generator g_i.
Illustratively, for generator g_i, the second adversarial loss may be determined by the following equation:

L_{SS} = \frac{1}{m} \sum_{j \neq i} \mathbb{1}\left[ \hat{c}_i = \hat{c}_j \right] \cdot D_{KL}\left( s_i(g_i(z_i)) \,\|\, s_j(g_i(z_i)) \right)

where \mathbb{1}[ \hat{c}_i = \hat{c}_j ] denotes that the class determination result of the target student model s_i on the generated data of generator g_i is the same as the class determination result of another student model s_j on the generated data of generator g_i, D_{KL}( s_i(g_i(z_i)) \| s_j(g_i(z_i)) ) is the KL divergence between the output of the target student model s_i and the output of the other student model s_j with respect to the generated data of generator g_i, m is the number of other student models, i ≠ j, and the indicator is 1 when \hat{c}_i = \hat{c}_j is satisfied and 0 when it is not satisfied.
For example, considering that each student model lacks knowledge in the initial stage of training, the fight loss between the student models in the initial stage of training helps less to the performance improvement of the generator, so in order to improve the training efficiency of the generator, when determining the second type loss of the generator, the fight loss between the student models can be adopted, and in the case that the class judgment results of the different student models on the generated samples of the generator are inconsistent, the fight loss between the student models is not introduced; under the condition that the class judgment results of the generated samples of the generator are consistent by different student models, the fight loss among the student models is introduced.
where \hat{c}_i = \arg\max s_i(g_i(z_i)) is the class with the highest confidence in the output of student model s_i with respect to the generated data of generator g_i, and \hat{c}_j = \arg\max s_j(g_i(z_i)) is the class with the highest confidence in the output of student model s_j with respect to the generated data of generator g_i.
Wherein generator g i is any one of the above-described plurality of generators.
In one example, determining the second type of loss from the first adversarial loss and the second adversarial loss may include:
determining the second type of loss L_adv by the following formulas:

L_{adv} = -L_{dis1}

L_{dis1} = L_{ST} + \lambda L_{SS}

where λ is a preset hyperparameter.
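The two adversarial terms and their combination into the second type of loss might be sketched as follows (a PyTorch-style sketch; the use of softmax outputs, the KL direction, and the name lam for the hyperparameter λ are assumptions):

```python
import torch
import torch.nn.functional as F

def adversarial_loss(teacher, students, i, generated, lam=1.0):
    """L_adv = -(L_ST + lam * L_SS) for generator g_i and its target student s_i."""
    t_prob = teacher(generated).softmax(dim=1)
    s_log = [s(generated).log_softmax(dim=1) for s in students]
    # First adversarial term: distance between teacher and target student outputs.
    l_st = F.kl_div(s_log[i], t_prob, reduction="batchmean")
    # Second adversarial term: distance to the other students, counted only on
    # samples where the two students predict the same class.
    target_cls = s_log[i].argmax(dim=1)
    m = len(students) - 1
    l_ss = torch.zeros((), device=generated.device)
    for j, log_j in enumerate(s_log):
        if j == i:
            continue
        agree = (target_cls == log_j.argmax(dim=1)).float()
        kl = F.kl_div(s_log[i], log_j.exp(), reduction="none").sum(dim=1)
        l_ss = l_ss + (agree * kl).mean() / m
    l_dis1 = l_st + lam * l_ss
    return -l_dis1  # minimizing L_adv maximizes the disagreement L_dis1
```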
In some embodiments, for generator g i, feedback optimization of the generator according to the first type of loss and the second type of loss as described above may include:
The generator is feedback-optimized by the following objective equation:

\theta_{g_i}^{*} = \arg\min_{\theta_{g_i}} L_{inv} = \arg\min_{\theta_{g_i}} \left( \alpha L_{bn} + \beta L_{cls} + \gamma L_{adv} \right), \qquad L_{adv} = -L_{dis1}

where L_inv is the loss function of generator g_i, L_bn is the batch normalization loss of generator g_i, L_cls is the class prior loss of the generator, L_dis1 is the output distance loss of the generator, L_adv is the adversarial loss of the generator, the first type of loss includes L_bn and L_cls, the second type of loss includes L_adv, α, β, γ are preset weights, and \theta_{g_i} is the parameter of generator g_i.
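Putting the pieces together, the generator objective above could be sketched as follows (bn_loss, class_prior_loss and adversarial_loss refer to the sketches given earlier in this description; the default weights, num_classes, and the in-place sampling of y are assumptions):

```python
import torch

def generator_loss(teacher, students, i, generated, num_classes=10,
                   alpha=1.0, beta=1.0, gamma=1.0):
    """L_inv = alpha * L_bn + beta * L_cls + gamma * L_adv for generator g_i."""
    y = torch.randint(0, num_classes, (generated.size(0),), device=generated.device)
    return (alpha * bn_loss(teacher, generated)
            + beta * class_prior_loss(teacher, generated, y)
            + gamma * adversarial_loss(teacher, students, i, generated))
```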
In some embodiments, training the student model using the generated data of the generator and performing feedback optimization on the student model in training according to the third type of loss may include:
Inputting a union of the generated data of the plurality of generators to a teacher model and a plurality of student models, respectively;
determining a third type of loss according to a distance between the output of the teacher model to the generated data of the generator and the output of the student model to the generated data of the generator;
and carrying out feedback optimization on the student model in training according to the third type of loss.
Illustratively, in order to fully utilize the diversity of the samples, in the knowledge distillation stage, the samples generated by the different generators are input, as a union, into the teacher model and the student models respectively, and the third type of loss is determined according to the distance between the output of the teacher model with respect to the generated data of the generators and the output of the student model with respect to the generated data of the generators.
Illustratively, the smaller the distance between the output of the teacher model to the generated data of the generator and the output of the student model to the generated data of the generator, the smaller the third type of loss.
Accordingly, in the knowledge distillation stage, feedback optimization can be performed on the student model according to the principle of minimizing the third type of loss.
In one example, for student model s i, feedback optimization of the student model under training according to the third type of loss described above includes:
feedback optimization is performed on the student model by the following objective equation:

\theta_{s_i}^{*} = \arg\min_{\theta_{s_i}} \; d\!\left( s_i(\bar{x}), f_t(\bar{x}) \right), \qquad \bar{x} \in \bigcup_{i} g_i(z_i)

where \bar{x} is drawn from the union of the generated data of the plurality of generators, d( s_i(\bar{x}), f_t(\bar{x}) ) is the distance between the output of student model s_i with respect to \bar{x} and the output of the teacher model with respect to \bar{x}, and \theta_{s_i} is the parameter of student model s_i.
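A corresponding sketch of one distillation update, with MAE as the assumed distance measure, is given below (the function and argument names are hypothetical):

```python
import torch
import torch.nn.functional as F

def distill_step(teacher, student, s_opt, generated_batches):
    """One feedback-optimization step of a student on the union of generated data."""
    x = torch.cat(generated_batches, dim=0)  # union of the generators' data
    with torch.no_grad():
        t_out = teacher(x)
    loss = F.l1_loss(student(x), t_out)      # MAE between student and teacher outputs
    s_opt.zero_grad()
    loss.backward()
    s_opt.step()
    return loss.item()
```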
In order to enable those skilled in the art to better understand the technical solutions provided by the embodiments of the present application, the technical solutions provided by the embodiments of the present application are described below with reference to specific examples.
This embodiment proposes a multi-student, multi-generator data-free distillation framework based on classroom learning. The framework is based on the classical adversarial data-free knowledge distillation framework and extends it into a distillation mode of multi-student interaction and multi-generator ensemble learning. Two kinds of interaction are contained in the framework, namely teacher model-student model interaction and student model-student model interaction.
By way of example, on the one hand, the generator takes the prediction distance between the teacher model and the student models as an optimization target and mines, from the teacher model, hard samples that the current student models have not yet mastered (namely, generated data that causes confusion between the teacher model and the student models); since a plurality of student model-teacher model pairs exist, more comprehensive hard samples can be obtained. On the other hand, enlarging the prediction distance between student models is taken as an optimization target, so as to mine hard samples in the teacher model that are not mastered by any of the student models. Through the interaction among the students, the generation of hard samples carrying more information can be further promoted, and more comprehensive information extraction from the teacher model is achieved.
The task processing flow based on data-free knowledge distillation is described in detail below, taking an image processing task as an example.
1. Data similar to the original training data distribution is generated.
Illustratively, a generator g(·; θ_g) is constructed, which takes random noise z sampled from a normal distribution as input, and is trained by designing a loss function so as to generate a generated image g(z; θ_g) whose distribution is similar to that of the original data.
Illustratively, the loss function may include:
1.1 batch normalization loss
The distance between the statistical mean μ and variance σ² of the activations of the generated image at the batch normalization layers in the teacher model and the stored mean \hat{\mu} and variance \hat{\sigma}^2 of the batch normalization layers in the teacher model is calculated, so that the generated image has a distribution similar to that of the original data.
By way of example, L2 distance may be used as a measure of distance.
1.2 Class prior loss
The predefined class labels y are randomly sampled from a uniform distribution, the generated image is input into the teacher model, and the cross-entropy loss between the output of the teacher model f_t(·; θ_t) and the predefined class labels is calculated, so that the output of the generated image in the pre-trained model tends toward a one-hot form, thereby obtaining a distribution similar to that of the original training data.

L_{cls} = \mathrm{CE}(f_t(g(z)), y)
1.3 Adversarial losses
1.3.1 First adversarial loss
Maximizing the distance between the outputs of the teacher model and the student model on the generated image causes confusion between the teacher model and the student model with respect to the generated samples.
Illustratively, KL divergence may be used as a measure of distance.
1.3.2 Second adversarial loss
In order to increase the information content of the samples, this embodiment extends the single teacher model-student model interaction into a mode of teacher model-multiple student model interaction, adding interaction among different student models.
When the adversarial loss is determined, the adversarial loss among different student models is added, so that the generator, on the one hand, generates images that cause confusion between the teacher model and the student model and, on the other hand, generates images that cause confusion among different student models.
For example, since each student model lacks knowledge in the initial stage of training, the adversarial loss between student models added in the initial stage of training contributes little to improving the performance of the generator; therefore, in order to improve the training efficiency of the generator, a conditional adversarial loss can be designed: the adversarial loss between student models is not introduced in a case where the class determination results of different student models on the generated samples of the generator are inconsistent, and is introduced in a case where the class determination results of different student models on the generated samples of the generator are consistent.
L_{SS} = \frac{1}{m} \sum_{j \neq i} \mathbb{1}\left[ \hat{c}_i = \hat{c}_j \right] \cdot D_{KL}\left( s_i(g_i(z_i)) \,\|\, s_j(g_i(z_i)) \right)
Total adversarial loss (the second type of loss described above):

L_{adv} = -L_{dis1}

L_{dis1} = L_{ST} + \lambda L_{SS}
Illustratively, the objective equation for training the generator may be expressed as:

\theta_{g_i}^{*} = \arg\min_{\theta_{g_i}} \left( \alpha L_{bn} + \beta L_{cls} + \gamma L_{adv} \right), \qquad L_{adv} = -L_{dis1}
2. Knowledge distillation using generated images
The generated images of each batch of the generator are stored, and in the knowledge distillation stage, the generated images of the current batch of the generator and the generated images of the historical batches are input into the teacher network and the student network respectively, so as to minimize the distance between the outputs of the teacher network and the student network.
For example, an average absolute error (Mean Absolute Error, abbreviated MAE) may be used as the distance metric.
In order to fully utilize the diversity of the samples, in the knowledge distillation stage, the samples generated by the different generators are input, as a union, into the teacher model and the student models respectively.
Illustratively, the objective equation for knowledge distillation can be expressed as:

\theta_{s_i}^{*} = \arg\min_{\theta_{s_i}} \; d\!\left( s_i(\bar{x}), f_t(\bar{x}) \right), \qquad \bar{x} \in \bigcup_{i} g_i(z_i)

where d(·,·) is the MAE distance described above.
In order to enable those skilled in the art to better understand the technical effects of the technical solutions provided by the embodiments of the present application, the effects of the embodiments of the present application are described below with reference to specific application examples.
The technical scheme provided by the embodiment of the application can be applied to image processing tasks including but not limited to image classification, object detection or image segmentation, and the like, and has a wide use scene in practical specific applications.
The following examples are illustrative.
1. Image classification
The terminal device may send a student model acquisition request to the server device, where the student model acquisition request may carry task processing requirements (such as image classification) and test data (a certain amount of labeled images to be classified).
When the server device receives the student model acquisition request of the terminal device, a trained image classification model (which may be referred to as a teacher image classification model) can be obtained from the network or from other devices; under strict data privacy protection requirements and/or with limited data transmission resources, the technical solution provided by the embodiment of the application can be adopted, so that a plurality of generators are trained only according to the obtained teacher image classification model, and the student image classification models are trained using the generated image data of the generators, thereby obtaining a plurality of trained student image classification models.
Under the condition that a plurality of trained student image classification models are obtained and the test results of the test data by each student image classification model meet the requirements, the student image classification model (which can be called as a designated student image classification model) with the highest accuracy of image classification on the test data can be sent to the terminal equipment.
And under the condition that the terminal equipment receives the student image classification model, the specified student image classification model can be utilized to carry out image classification processing on the images to be classified.
Therefore, by adopting the technical scheme provided by the embodiment of the application, the student image classification model with high performance and low resource consumption can be obtained under the condition that the original data is unavailable, and further, the terminal equipment can execute the image classification task by using the student image classification model trained by the server equipment under the condition that the hardware resources are poor, so that the accuracy of the terminal equipment for carrying out the image classification task is improved.
2. Image segmentation
The terminal device may send a student model acquisition request to the server device, where the student model acquisition request may carry a task processing requirement (such as image segmentation).
When the server side device receives a student model acquisition request of the terminal device, a trained image segmentation model (which can be called a teacher image segmentation model) can be acquired from the network, the technical scheme provided by the embodiment of the application can be adopted, a plurality of generators are trained according to the acquired teacher image segmentation model, and the student image segmentation model is trained by utilizing the generated image data of the generators, so that a plurality of trained student image segmentation models are obtained.
When a plurality of trained student image segmentation models are obtained, the student image segmentation model with the best image segmentation performance on the test data (which may be referred to as the designated student image segmentation model) may be deployed on the terminal device.
When the terminal device receives the designated student image segmentation model, it may use the designated student image segmentation model to perform image segmentation on the images to be segmented.
Therefore, by adopting the technical solution provided by the embodiments of the present application, a student image segmentation model with high performance and low resource consumption can be obtained even when the original data is unavailable; further, a terminal device with limited hardware resources can execute the image segmentation task using the student image segmentation model trained by the server device, thereby optimizing the performance of the image segmentation task performed by the terminal device.
The method provided by the application is described above. The device provided by the application is described below:
Referring to Fig. 2, which is a schematic structural diagram of a task processing device based on non-data knowledge distillation according to an embodiment of the present application, as shown in Fig. 2, the task processing device based on non-data knowledge distillation may include:
A receiving unit 210, configured to receive a student model acquisition request sent by a terminal device, where the student model acquisition request carries a task processing requirement;
an obtaining unit 220, configured to obtain a teacher model that meets the task processing requirement according to the task processing requirement;
a training unit 230, configured to train the plurality of generators and the plurality of student models according to the teacher model:
In the training process of the generators, for any generator, carrying out feedback optimization on the generator according to the first type of loss and the second type of loss; the first type loss is used for representing similarity between the data distribution of the generated data of the generator and the data distribution of the original training data of the teacher model, and the higher the data distribution similarity between the data distribution of the generated data of the generator and the data distribution of the original training data of the teacher model is, the smaller the first type loss is; the second type loss is determined according to a first predicted distance of the teacher model and the student model to the generated data of the generator and a second predicted distance of different student models to the generated data of the generator, and the larger the sum value of the first predicted distance and the second predicted distance is, the smaller the second type loss is;
training the student model by using the generated data of the generator in the knowledge distillation process, and carrying out feedback optimization on the student model in training according to the third type of loss; the third type of loss is determined according to a third prediction distance of the teacher model and the student model to the generated data of the generator, and the smaller the third prediction distance is, the smaller the third type of loss is;
And the task processing unit 240 is configured to send the trained specified student model of the plurality of student models to a terminal device, where the terminal device performs task processing using the specified student model.
In some embodiments, the first type of loss is determined by:
Determining a batch normalization loss according to the distance between the statistical mean and variance of the generated data of the generator at the batch normalization layers of the teacher model and the statistical mean and variance stored in the batch normalization layers of the teacher model;
Determining category priori loss according to the output of the teacher model to the generated data of the generator and the predefined category labels of the generated data of the generator;
Determining the first type of loss of the generator according to the batch normalization loss and the category prior loss.
In some embodiments, the batch normalization loss is determined by the following equation:
Lbn = Σ_k ( ‖ μk(g(z)) − μk ‖ + ‖ σk²(g(z)) − σk² ‖ )
wherein μk(g(z)) is the statistical mean of the activation feature map of the generated data of the generator at the k-th batch normalization layer of the teacher model, σk²(g(z)) is the statistical variance of the activation feature map of the generated data of the generator at the k-th batch normalization layer of the teacher model, μk is the mean of the original training data stored in the k-th batch normalization layer of the teacher model, and σk² is the variance of the original training data stored in the k-th batch normalization layer of the teacher model;
The category a priori loss is determined by the following formula:
Lcls = CE( ft(g(z)), y )
wherein CE( ft(g(z)), y ) is the cross-entropy loss between ft(g(z)) and y, ft(g(z)) is the output of the teacher model for the generated data of the generator, and y is the predefined class label of the generated data of the generator.
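As a non-authoritative sketch only, the batch normalization loss and the category prior loss described above could be computed in PyTorch roughly as follows. The use of 2D batch normalization layers, the choice of an L2 distance between statistics, and all function and variable names are assumptions made for illustration and do not limit this application.

import torch
import torch.nn as nn
import torch.nn.functional as F

def batch_norm_loss(teacher, fake_images):
    # Compare the batch statistics of the generated data at every BN layer of the teacher
    # with the running statistics of the original training data stored in those layers.
    stats = []

    def hook(module, inputs, output):
        x = inputs[0]                               # activation feature map entering the BN layer
        mu = x.mean(dim=[0, 2, 3])                  # statistical mean of the generated data
        var = x.var(dim=[0, 2, 3], unbiased=False)  # statistical variance of the generated data
        stats.append((mu, var, module.running_mean, module.running_var))

    handles = [m.register_forward_hook(hook)
               for m in teacher.modules() if isinstance(m, nn.BatchNorm2d)]
    teacher(fake_images)
    for h in handles:
        h.remove()

    # Sum over BN layers of the distances between generated-data statistics and stored statistics.
    return sum(torch.norm(mu - rm, 2) + torch.norm(var - rv, 2)
               for mu, var, rm, rv in stats)

def class_prior_loss(teacher_logits, fake_labels):
    # Lcls = CE(ft(g(z)), y): cross entropy between the teacher output and the predefined labels.
    return F.cross_entropy(teacher_logits, fake_labels)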
In some embodiments, the second type of loss is determined by:
Determining a first adversarial loss according to the distance between the output of the teacher model for the generated data of the generator and the output of the target student model for the generated data of the generator;
determining a second adversarial loss according to the distance between the output of the target student model for the generated data of the generator and the outputs of the other student models for the generated data of the generator; the target student model is the student model corresponding to the generator among the plurality of student models, and the other student models are the student models other than the target student model among the plurality of student models;
determining the second type of loss according to the first adversarial loss and the second adversarial loss.
In some embodiments, for generator gi, the first adversarial loss is determined by the following equation:
LST = KL( si(gi(zi)) ∥ ft(gi(zi)) )
wherein gi(zi) is the generated data of generator gi, KL( si(gi(zi)) ∥ ft(gi(zi)) ) is the KL divergence between the output of student model si for the generated data of generator gi and the output of the teacher model for the generated data of generator gi, and student model si is the target student model of generator gi;
for generator gi, the second adversarial loss is determined by the following equation:
LSS = (1/m) Σ_{j≠i} I(si, sj) · KL( si(gi(zi)) ∥ sj(gi(zi)) )
wherein I(si, sj) indicates that the class determined by student model si for the generated data of generator gi is the same as the class determined by student model sj for the generated data of generator gi, KL( si(gi(zi)) ∥ sj(gi(zi)) ) is the KL divergence between the output of student model si for the generated data of generator gi and the output of student model sj for the generated data of generator gi, m is the number of the other student models, and i ≠ j; I(si, sj) takes the value 1 when the two class determinations are the same and the value 0 otherwise;
said determining the second type of loss according to the first adversarial loss and the second adversarial loss comprises:
the second type loss L adv is determined by the following formula:
Ladv = −Ldis1
Ldis1 = LST + λ·LSS
wherein LST is the first adversarial loss, LSS is the second adversarial loss, and λ is a preset hyperparameter.
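Purely as an illustrative sketch, the first adversarial loss LST, the second adversarial loss LSS and the resulting second type of loss Ladv = −(LST + λ·LSS) could be implemented as follows. The direction of the KL divergence, the per-sample agreement indicator, the averaging over the m other student models, and all names are assumptions and do not limit this application; the teacher and the student models are assumed to be frozen during this generator update.

import torch
import torch.nn.functional as F

def kl_div(p_logits, q_logits):
    # Batch-mean KL divergence between two softmax output distributions.
    return F.kl_div(F.log_softmax(p_logits, dim=1),
                    F.softmax(q_logits, dim=1),
                    reduction="batchmean")

def adversarial_loss(teacher, students, i, fake_images, lam=1.0):
    # First adversarial term LST: distance between the target student si and the teacher.
    t_out = teacher(fake_images)          # teacher parameters assumed frozen
    si_out = students[i](fake_images)     # target student of generator gi
    l_st = kl_div(si_out, t_out)

    # Second adversarial term LSS: distance between si and each other student sj, counted
    # only on samples where si and sj predict the same class (the agreement indicator).
    others = [s for j, s in enumerate(students) if j != i]
    l_ss = 0.0
    for s_j in others:
        sj_out = s_j(fake_images)
        agree = (si_out.argmax(dim=1) == sj_out.argmax(dim=1)).float()
        per_sample_kl = F.kl_div(F.log_softmax(si_out, dim=1),
                                 F.softmax(sj_out, dim=1),
                                 reduction="none").sum(dim=1)
        l_ss = l_ss + (agree * per_sample_kl).mean()
    l_ss = l_ss / max(len(others), 1)

    l_dis1 = l_st + lam * l_ss
    return -l_dis1   # Ladv = -Ldis1: the generator maximizes the teacher/student disagreement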
In some embodiments, for generator gi, the feedback optimization of the generator according to the first type of loss and the second type of loss comprises:
feedback optimizing the generator through the following objective equation:
min_{θgi} Linv = α·Lbn + β·Lcls + γ·Ladv
Ladv = −Ldis1
wherein Linv is the loss function of generator gi, Lbn is the batch normalization loss of generator gi, Lcls is the class prior loss of the generator, Ldis1 is the output distance loss of the generator, Ladv is the adversarial loss of the generator, the first type of loss includes Lbn and Lcls, the second type of loss includes Ladv, α, β and γ are preset weights, and θgi is the parameter of generator gi.
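Combining the above, a hedged sketch of one feedback optimization step for a generator gi under Linv = αLbn + βLcls + γLadv might look as follows; it reuses the illustrative helper functions sketched earlier, and the optimizer, batch size, latent dimension, predefined labels and loss weights are arbitrary assumptions rather than prescribed values.

import torch

def generator_step(gen, gen_opt, teacher, students, i, fake_labels,
                   alpha=1.0, beta=1.0, gamma=1.0, batch_size=64, z_dim=128):
    # Sample a latent batch and synthesize data gi(zi); a flat latent vector is assumed.
    z = torch.randn(batch_size, z_dim)
    fake_images = gen(z)

    # First type of loss: batch normalization loss + class prior loss (see sketches above).
    l_bn = batch_norm_loss(teacher, fake_images)
    l_cls = class_prior_loss(teacher(fake_images), fake_labels)

    # Second type of loss: adversarial loss Ladv = -Ldis1 (see sketch above).
    l_adv = adversarial_loss(teacher, students, i, fake_images)

    l_inv = alpha * l_bn + beta * l_cls + gamma * l_adv
    gen_opt.zero_grad()
    l_inv.backward()
    gen_opt.step()        # updates the parameters of generator gi
    return l_inv.item()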
In some embodiments, the training unit 230 uses the generated data of the generator to train the student model and performs feedback optimization on the student model under training according to the third type of loss, including:
inputting a union of the generated data of the plurality of generators to the teacher model and the plurality of student models, respectively;
Determining a third type of loss according to a distance between the output of the teacher model to the generated data of the generator and the output of the student model to the generated data of the generator;
And carrying out feedback optimization on the student model in training according to the third type of loss.
In some embodiments, for the student model s i, the feedback optimization of the student model under training according to the third type of loss includes:
feedback optimization is performed on the student model by the following objective equation:
min_{θsi} D( si(x̂), ft(x̂) )
wherein x̂ is the union of the generated data of the plurality of generators, D( si(x̂), ft(x̂) ) is the distance between the output of student model si for x̂ and the output of the teacher model for x̂, and θsi is the parameter of student model si.
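For completeness, the following minimal sketch illustrates one knowledge distillation step for a student model si on the union of the generated data of the plurality of generators. The temperature-scaled KL divergence is only one possible choice of the distance D, and the generator interface, batch size, latent dimension and temperature are assumptions for illustration.

import torch
import torch.nn.functional as F

def student_step(student, stu_opt, teacher, generators, batch_size=64, z_dim=128, T=4.0):
    # Build the union of the generated data of the plurality of generators.
    with torch.no_grad():
        fake_images = torch.cat([g(torch.randn(batch_size, z_dim)) for g in generators], dim=0)
        t_out = teacher(fake_images)      # teacher output on the union, no gradient needed

    s_out = student(fake_images)
    # Third type of loss: distance between student and teacher outputs on the generated data
    # (here a temperature-scaled KL divergence is assumed as the distance D).
    l_kd = F.kl_div(F.log_softmax(s_out / T, dim=1),
                    F.softmax(t_out / T, dim=1),
                    reduction="batchmean") * (T * T)

    stu_opt.zero_grad()
    l_kd.backward()
    stu_opt.step()        # updates the parameters of student model si
    return l_kd.item()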
An embodiment of the present application provides an electronic device, including a processor and a memory, where the memory stores machine-executable instructions executable by the processor, and the processor is configured to execute the machine-executable instructions to implement the task processing method based on non-data knowledge distillation described above.
Fig. 3 is a schematic diagram of the hardware structure of an electronic device according to an embodiment of the present application. The electronic device may include a processor 301 and a memory 302 storing machine-executable instructions. The processor 301 and the memory 302 may communicate via a system bus 303. By reading and executing the machine-executable instructions in the memory 302 corresponding to the task processing logic based on non-data knowledge distillation, the processor 301 can perform the task processing method based on non-data knowledge distillation described above.
The memory 302 referred to herein may be any electronic, magnetic, optical, or other physical storage device that can contain or store information, such as executable instructions and data. For example, the machine-readable storage medium may be: RAM (Random Access Memory), volatile memory, non-volatile memory, flash memory, a storage drive (e.g., a hard drive), a solid state disk, any type of storage disk (e.g., an optical disk or DVD), a similar storage medium, or a combination thereof.
In some embodiments, a machine-readable storage medium, such as the memory 302 in Fig. 3, is also provided, having stored thereon machine-executable instructions that, when executed by a processor, implement the task processing method based on non-data knowledge distillation described above. For example, the storage medium may be a ROM, a RAM, a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The foregoing description of the preferred embodiments of the application is not intended to be limiting; any modification, equivalent replacement, improvement or the like may be made within the spirit and principles of the application.

Claims (10)

1. A task processing method based on non-data knowledge distillation, characterized by being applied to a server device, wherein the server device realizes the non-data knowledge distillation by using a non-data knowledge distillation framework comprising a plurality of generators and a plurality of student models, and the method comprises the following steps:
Receiving a student model acquisition request sent by a terminal device, wherein the student model acquisition request carries task processing requirements;
Obtaining a teacher model meeting the task processing requirements according to the task processing requirements;
Training the plurality of generators and the plurality of student models according to the teacher model:
In the training process of the generators, for any generator, carrying out feedback optimization on the generator according to the first type of loss and the second type of loss; the first type loss is used for representing similarity between the data distribution of the generated data of the generator and the data distribution of the original training data of the teacher model, and the higher the data distribution similarity between the data distribution of the generated data of the generator and the data distribution of the original training data of the teacher model is, the smaller the first type loss is; the second type loss is determined according to a first predicted distance of the teacher model and the student model to the generated data of the generator and a second predicted distance of different student models to the generated data of the generator, and the larger the sum value of the first predicted distance and the second predicted distance is, the smaller the second type loss is;
training the student model by using the generated data of the generator in the knowledge distillation process, and carrying out feedback optimization on the student model in training according to the third type of loss; the third type of loss is determined according to a third prediction distance of the teacher model and the student model to the generated data of the generator, and the smaller the third prediction distance is, the smaller the third type of loss is;
and sending the trained appointed student model in the plurality of student models to terminal equipment, and performing task processing by the terminal equipment by utilizing the appointed student model.
2. The method of claim 1, wherein the first type of loss is determined by:
Determining a batch normalization loss according to the distance between the statistical mean and variance of the generated data of the generator at the batch normalization layers of the teacher model and the statistical mean and variance stored in the batch normalization layers of the teacher model;
Determining category priori loss according to the output of the teacher model to the generated data of the generator and the predefined category labels of the generated data of the generator;
Determining the first type of loss of the generator according to the batch normalization loss and the category prior loss.
3. The method of claim 2, wherein the batch normalization loss is determined by the following equation:
Lbn = Σ_k ( ‖ μk(g(z)) − μk ‖ + ‖ σk²(g(z)) − σk² ‖ )
wherein μk(g(z)) is the statistical mean of the activation feature map of the generated data of the generator at the k-th batch normalization layer of the teacher model, σk²(g(z)) is the statistical variance of the activation feature map of the generated data of the generator at the k-th batch normalization layer of the teacher model, μk is the mean of the original training data stored in the k-th batch normalization layer of the teacher model, and σk² is the variance of the original training data stored in the k-th batch normalization layer of the teacher model;
The category a priori loss is determined by the following formula:
Lcls = CE( ft(g(z)), y )
wherein CE( ft(g(z)), y ) is the cross-entropy loss between ft(g(z)) and y, ft(g(z)) is the output of the teacher model for the generated data of the generator, and y is the predefined class label of the generated data of the generator.
4. The method of claim 1, wherein the second type of loss is determined by:
Determining a first adversarial loss according to the distance between the output of the teacher model for the generated data of the generator and the output of the target student model for the generated data of the generator;
determining a second adversarial loss according to the distance between the output of the target student model for the generated data of the generator and the outputs of the other student models for the generated data of the generator; the target student model is the student model corresponding to the generator among the plurality of student models, and the other student models are the student models other than the target student model among the plurality of student models;
determining the second type of loss according to the first adversarial loss and the second adversarial loss.
5. The method of claim 4, wherein for generator gi, the first adversarial loss is determined by the following equation:
LST = KL( si(gi(zi)) ∥ ft(gi(zi)) )
wherein gi(zi) is the generated data of generator gi, KL( si(gi(zi)) ∥ ft(gi(zi)) ) is the KL divergence between the output of student model si for the generated data of generator gi and the output of the teacher model for the generated data of generator gi, and student model si is the target student model of generator gi;
for generator gi, the second adversarial loss is determined by the following equation:
LSS = (1/m) Σ_{j≠i} I(si, sj) · KL( si(gi(zi)) ∥ sj(gi(zi)) )
wherein I(si, sj) indicates that the class determined by student model si for the generated data of generator gi is the same as the class determined by student model sj for the generated data of generator gi, KL( si(gi(zi)) ∥ sj(gi(zi)) ) is the KL divergence between the output of student model si for the generated data of generator gi and the output of student model sj for the generated data of generator gi, m is the number of the other student models, and i ≠ j; I(si, sj) takes the value 1 when the two class determinations are the same and the value 0 otherwise;
said determining the second type of loss according to the first adversarial loss and the second adversarial loss comprises:
the second type loss L adv is determined by the following formula:
Ladv = −Ldis1
Ldis1 = LST + λ·LSS
wherein LST is the first adversarial loss, LSS is the second adversarial loss, and λ is a preset hyperparameter.
6. The method of claim 1, wherein for generator gi, said feedback optimization of the generator according to the first type of loss and the second type of loss comprises:
feedback optimizing the generator through the following objective equation:
min_{θgi} Linv = α·Lbn + β·Lcls + γ·Ladv
Ladv = −Ldis1
wherein Linv is the loss function of generator gi, Lbn is the batch normalization loss of generator gi, Lcls is the class prior loss of the generator, Ldis1 is the output distance loss of the generator, Ladv is the adversarial loss of the generator, the first type of loss includes Lbn and Lcls, the second type of loss includes Ladv, α, β and γ are preset weights, and θgi is the parameter of generator gi.
7. The method of claim 1, wherein training the student model using the generated data of the generator and feedback optimizing the student model under training according to the third type of loss comprises:
inputting a union of the generated data of the plurality of generators to the teacher model and the plurality of student models, respectively;
Determining a third type of loss according to a distance between the output of the teacher model to the generated data of the generator and the output of the student model to the generated data of the generator;
And carrying out feedback optimization on the student model in training according to the third type of loss.
8. The method of claim 7, wherein for student model s i, said feedback optimizing the student model under training in accordance with the third type of loss comprises:
feedback optimization is performed on the student model by the following objective equation:
min_{θsi} D( si(x̂), ft(x̂) )
wherein x̂ is the union of the generated data of the plurality of generators, D( si(x̂), ft(x̂) ) is the distance between the output of student model si for x̂ and the output of the teacher model for x̂, and θsi is the parameter of student model si.
9. A task processing device based on non-data knowledge distillation, characterized in that the device is deployed in a server device that implements non-data knowledge distillation using a non-data knowledge distillation framework comprising a plurality of generators and a plurality of student models, the device comprising:
The receiving unit is used for receiving a student model acquisition request sent by the terminal equipment, wherein the student model acquisition request carries task processing requirements;
the acquisition unit is used for acquiring a teacher model meeting the task processing requirements according to the task processing requirements;
The training unit is used for training the generators and the student models according to the teacher model:
In the training process of the generators, for any generator, carrying out feedback optimization on the generator according to the first type of loss and the second type of loss; the first type loss is used for representing similarity between the data distribution of the generated data of the generator and the data distribution of the original training data of the teacher model, and the higher the data distribution similarity between the data distribution of the generated data of the generator and the data distribution of the original training data of the teacher model is, the smaller the first type loss is; the second type loss is determined according to a first predicted distance of the teacher model and the student model to the generated data of the generator and a second predicted distance of different student models to the generated data of the generator, and the larger the sum value of the first predicted distance and the second predicted distance is, the smaller the second type loss is;
training the student model by using the generated data of the generator in the knowledge distillation process, and carrying out feedback optimization on the student model in training according to the third type of loss; the third type of loss is determined according to a third prediction distance of the teacher model and the student model to the generated data of the generator, and the smaller the third prediction distance is, the smaller the third type of loss is;
And the task processing unit is used for transmitting the trained appointed student model in the plurality of student models to the terminal equipment, and the terminal equipment utilizes the appointed student model to process tasks.
10. An electronic device comprising a processor and a memory, the memory storing machine executable instructions executable by the processor for executing the machine executable instructions to implement the method of any of claims 1-8.