CN113762368A - Method, device, electronic equipment and storage medium for data distillation
- Publication number
- CN113762368A (application number CN202110994122.7A)
- Authority
- CN
- China
- Prior art keywords
- batch
- data
- distillation
- distillation data
- image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/082—Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
Abstract
The embodiment of the application discloses a method, a device, electronic equipment and a storage medium for data distillation, wherein the method comprises the following steps: determining at least one batch of first distillation data to be trained; determining at least two pre-training models; determining a batch normalized statistical loss of each batch of the first distillation data in the corresponding pre-trained model based on the first statistical information in each of the pre-trained models; determining a target cross-entropy loss for each batch of the first distillation data in each of the pre-trained models based on an initialization tag for each batch of the first distillation data; and carrying out back propagation training on each batch of the first distillation data based on batch normalized statistical loss and the target cross entropy loss of each batch of the first distillation data in each pre-training model to obtain target distillation data.
Description
Technical Field
The present application relates to the field of computer vision, and relates to, but is not limited to, methods, apparatus, electronic devices, and storage media for data distillation.
Background
In the big data era, deep learning models are applied more and more widely. To run such models on small devices such as mobile devices and sensors, the models often need to be compressed and pruned before they can be deployed.
Compression of neural networks typically requires the raw training data, since the compressed model usually has to be retrained to restore its previous performance. However, the raw data is in some cases private, so there is a risk that the raw data is not available.
Disclosure of Invention
The embodiment of the application provides a data distillation method and device, electronic equipment and a storage medium.
The technical scheme of the embodiment of the application is realized as follows:
in a first aspect, embodiments of the present application provide a method for data distillation, including: determining at least one batch of first distillation data to be trained; at least one mixed data including two data label information exists in each batch of the first distillation data; determining at least two pre-training models; first statistical information of original data is stored in each pre-training model; determining a batch normalized statistical loss of each batch of the first distillation data in the corresponding pre-trained model based on the first statistical information in each of the pre-trained models;
determining a target cross-entropy loss for each batch of the first distillation data in each of the pre-trained models based on an initialization tag for each batch of the first distillation data; and carrying out back propagation training on each batch of the first distillation data based on batch normalized statistical loss and the target cross entropy loss of each batch of the first distillation data in each pre-training model to obtain target distillation data.
In some possible embodiments, the determining a batch normalized statistical loss of each batch of the first distillation data in the corresponding pre-trained model based on the first statistical information in each of the pre-trained models comprises: determining second statistical information for each batch of the first distillation data in each of the pre-training models; for each of the pre-training models, determining a batch normalized statistical loss between the first statistical information and the second statistical information.
In this way, by matching the statistical loss between the first distillation data and the raw data in each pre-trained model, and further by determining that the target loss matches the first distillation data and the features of the plurality of pre-trained models simultaneously, the generic feature spaces of the pre-trained models are combined, thereby making the trained target distillation data more generic than single-model distillation data.
In some possible embodiments, the determining at least two pre-trained models comprises: at least two different types of pre-trained models are randomly selected from a pre-trained model library.
Therefore, for each batch of first distillation data to be trained, at least two different types of pre-training models are randomly sampled from the pre-training library for feature mixing, so that the distilled data are matched with the feature distribution of any pre-training model, and better training speed and effect balance can be obtained.
In some possible embodiments, the back propagation training of each batch of the first distillation data based on the batch normalized statistical loss and the target cross-entropy loss of each batch of the first distillation data in each pre-training model to obtain the target distillation data comprises: determining a first loss corresponding to each pre-training model based on the batch normalization statistical loss and the target cross entropy loss of each pre-training model; averaging the first losses corresponding to each pre-trained model to obtain target losses of each batch of the first distillation data after passing through the at least two pre-trained models; and performing back propagation training on each batch of the first distillation data based on the target loss to obtain the target distillation data.
Therefore, the first loss of a batch of first distillation data for each pre-training model is obtained by combining the target cross entropy loss and the batch normalization statistical loss, the first loss corresponding to each pre-training model is further linearly integrated to obtain the target loss, the characteristic deviation generated by each pre-training model can be averaged, and the finally obtained target distillation data are more universal.
In some possible embodiments, the determining at least one batch of first distillation data to be trained comprises: initializing at least one batch of second distillation data based on the distribution information of the raw data; in each iterative training, every two image data in each batch of the second distillation data are mixed to obtain each batch of the first distillation data.
Therefore, random data mixing is carried out on the initialized batch of second distillation data in each iterative training, the inversion solution space can be reduced, the visual information of the trained target distillation data is more robust, and the target distillation data can adapt to different scales and spatial orientations.
In some possible embodiments, the mixing every two image data in each batch of the second distillation data to obtain each batch of the first distillation data includes: randomly selecting at least one pair of a first image and a second image from each batch of the second distillation data; reducing the size of each first image according to a specific scale; and randomly overlaying the reduced first image to the corresponding second image to obtain each batch of first distillation data.
In this way, the first image is reduced in equal proportion and randomly overlaid onto the second image during data mixing, and the composite image after mixing contains information of both the first image and the second image. In the course of training each batch of first distillation data, this ensures that the information of the two mixed images can be identified and correctly optimized by the pre-trained model.
In some possible embodiments, the overlaying the scaled-down first image into the corresponding second image to obtain each batch of the first distillation data includes: randomly generating a mixed area to be covered in the corresponding second image according to the specific proportion; randomly generating a binary mask based on a bounding box of the blending region; and superposing each pixel value of the reduced first image and the corresponding pixel value in the second image through the binary mask to obtain each batch of the first distillation data.
In this way, the pixel value of the first image and the pixel value of the second image after the reduction are superimposed by randomly generating the binary mask, so that the random mixing of the two images in the first distillation data is realized, and the synthetic image still has discriminative power when being trained and optimized by the pre-training model.
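By way of illustration only, the following Python (PyTorch-style) sketch shows one possible form of such binary-mask mixing; the function name mix_pair, the tensor shapes and the default scale factor are illustrative assumptions and do not limit the embodiments of the present application. The returned area ratio corresponds to the synthesis proportion used below when the cross-entropy losses of the two original images are mixed.

```python
import torch
import torch.nn.functional as F

def mix_pair(first_img: torch.Tensor, second_img: torch.Tensor, scale: float = 0.5):
    """Illustrative sketch: shrink first_img by `scale`, paste it at a random
    position of second_img through a binary mask, and return the composite
    image together with the area ratio of the mixing region."""
    _, h, w = second_img.shape                     # images assumed to be (C, H, W)
    # reduce the first image in equal proportion
    small = F.interpolate(first_img.unsqueeze(0), scale_factor=scale,
                          mode="bilinear", align_corners=False).squeeze(0)
    sh, sw = small.shape[1], small.shape[2]
    # randomly place the bounding box of the mixing region inside the second image
    top = torch.randint(0, h - sh + 1, (1,)).item()
    left = torch.randint(0, w - sw + 1, (1,)).item()
    # binary mask over the bounding box of the mixing region
    mask = torch.zeros(1, h, w)
    mask[:, top:top + sh, left:left + sw] = 1.0
    # superimpose the shrunken first image onto the second image through the mask
    canvas = torch.zeros_like(second_img)
    canvas[:, top:top + sh, left:left + sw] = small
    composite = second_img * (1.0 - mask) + canvas * mask
    beta = float(sh * sw) / float(h * w)           # share of the second image that is covered
    return composite, mask, beta
```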
In some possible embodiments, the step of determining the target cross-entropy loss of each batch of the first distillation data in each pre-training model based on the initialization tag of each batch of the first distillation data includes: determining a hybrid cross-entropy loss for the composite image based on the initialization tag for the first image and the initialization tag for the second image; determining cumulative cross-entropy loss for other images in the batch of first distillation data, except for the first image and the second image, based on initialization tags of the other images; and determining the target cross entropy loss of each batch of the first distillation data after passing through each pre-training model based on the mixed cross entropy loss and the accumulated cross entropy loss.
In this way, the mixed cross entropy loss of the synthesized image is determined by the initialization tags of the first image and the second image, which ensures that the information of the original first image and second image can be identified and correctly optimized by the pre-trained model. The mixed cross entropy loss of the synthetic images in a batch of first distillation data and the accumulated cross entropy loss of the other images are summed and minimized to obtain the target cross entropy loss of each batch of first distillation data after each pre-training model, so that the finally obtained target distillation data is closer to the real data.
In some possible embodiments, the determining the hybrid cross-entropy loss for the composite image based on the initialization tag for the first image and the initialization tag for the second image comprises: determining a first cross-entropy loss for each batch of the first distillation data after passing through each of the pre-trained models based on the initialization tag of the first image; determining a second cross-entropy loss for each batch of the first distillation data after passing through each of the pre-trained models based on the initialization tag of the second image; and linearly summing the first cross-entropy loss and the second cross-entropy loss based on the synthesis proportion between the first image and the second image to obtain the mixed cross entropy loss of the synthesized image.
In this way, determining the mixed cross-entropy loss of the composite image by using the initialization tags of the images prior to the mixing of the data helps to generate robust features that are still discriminative during training, while being able to prevent incorrect inversion solutions, synthesizing distilled images with accurate tag information.
In some possible embodiments, the first statistical information includes a mean and a variance, each of the pre-trained models includes at least two component blocks, each of the component blocks includes a convolutional layer and a batch normalization layer, and the method further includes: extracting a mean and a variance of the raw data from the batch normalization layer of each of the component blocks, the mean and variance being statistics computed over the features of the raw data extracted by the convolutional layer; and determining the first statistical information in each pre-training model based on the means and variances of the at least two component blocks in each pre-training model.
Therefore, the mean value and the variance stored in the batch normalization layer in each component block in the pre-training model are extracted, the first statistical information corresponding to the pre-training model is determined, and the activation distribution of the original data in the pre-training network can be reasonably simulated by the target distillation data obtained through training.
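As a non-limiting illustration, the stored statistics could be read out of a standard PyTorch model as follows; the helper name and the assumption that the batch normalization layers are nn.BatchNorm2d modules are illustrative only.

```python
import torch.nn as nn

def collect_bn_statistics(model: nn.Module):
    """Illustrative sketch: return the running mean/variance stored in each
    batch normalization layer, i.e. the first statistical information of the
    raw data kept by the pre-trained model."""
    stats = []
    for module in model.modules():
        if isinstance(module, nn.BatchNorm2d):
            stats.append((module.running_mean.detach().clone(),
                          module.running_var.detach().clone()))
    return stats
```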
In some possible embodiments, the distribution information is gaussian distribution information, and the initializing at least one batch of second distillation data based on the distribution information of the raw data comprises: acquiring Gaussian distribution information of the original data; initializing at least N initial pixel value matrixes based on data randomly sampled from the Gaussian distribution information to obtain each batch of the second distillation data; n is an integer of 2 or more.
Therefore, data are randomly sampled from the Gaussian distribution information of the original data, and at least one batch of second distillation data are obtained through initialization, so that the defect that the original data cannot be directly obtained is overcome.
In some possible embodiments, the back propagation training of each batch of the first distillation data based on the batch normalized statistical loss and the target cross-entropy loss of each batch of the first distillation data in each pre-training model to obtain the target distillation data comprises: determining a target loss of each batch of the first distillation data through the at least two pre-trained models based on a batch normalized statistical loss and the target cross-entropy loss of each batch of the first distillation data in each pre-trained model; determining a gradient of the target loss for the first distillation data in each iterative training in a back propagation process; updating each of the first distillation data based on the gradient; and obtaining the target distillation data when the batch normalization statistical loss and the target cross entropy loss of the updated first distillation data in each pre-training model converge to stable values.
In this way, the first distillation data is optimized by a random gradient descent method in the back propagation process, and the iterative training process of each batch of first distillation data is accelerated.
In some possible embodiments, the method further comprises: and performing model compression on each pre-training model when the number of at least one batch of the target distillation data reaches a specific threshold value.
Therefore, the generated target distillation data has more visual information and is universal, and can be directly used for the model compression process, thereby simplifying the compression process.
In a second aspect, embodiments of the present application provide an apparatus for data distillation, comprising a first determination module, a second determination module, a third determination module, a fourth determination module, and a training module, wherein:
the first determining module is used for determining at least one batch of first distillation data to be trained; at least one mixed data including two data label information exists in each batch of the first distillation data; the second determining module is used for determining at least two pre-training models; first statistical information of original data is stored in each pre-training model; the third determining module is used for determining batch normalized statistical loss of each batch of the first distillation data in the corresponding pre-training model based on the first statistical information in each pre-training model; the fourth determination module is used for determining target cross entropy loss of each batch of the first distillation data in each pre-training model based on the initialization tag of each batch of the first distillation data; the training module is used for carrying out back propagation training on each batch of the first distillation data based on batch normalization statistical loss and the target cross entropy loss of each batch of the first distillation data in each pre-training model to obtain target distillation data.
In a third aspect, an embodiment of the present application provides an electronic device, including a memory and a processor, where the memory stores a computer program executable on the processor, and the processor implements the steps in the above-mentioned data distillation method when executing the program.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the steps in the above-mentioned method for data distillation.
The beneficial effects brought by the technical scheme provided by the embodiment of the application at least comprise:
in an embodiment of the present application, first, at least one batch of first distillation data to be trained is determined; then determining at least two pre-training models; determining batch normalized statistical loss of each batch of the first distillation data in the corresponding pre-training model based on the first statistical information in each pre-training model; determining a target cross-entropy loss for each batch of the first distillation data in each of the pre-trained models based on an initialization tag for each batch of the first distillation data; finally, performing back propagation training on each batch of the first distillation data based on batch normalization statistical loss and the target cross entropy loss of each batch of the first distillation data in each pre-training model to obtain target distillation data; therefore, robust visual information is generated through data mixing, and meanwhile, the characteristics of the original data stored in the pre-training models are utilized, so that the trained target distillation data can be matched with the characteristic distribution of the models, and therefore the target distillation data is more universal and has better effect.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, it is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings can be obtained by those skilled in the art without inventive efforts, wherein:
FIG. 1 is a schematic flow diagram of a method for distillation of data provided in an embodiment of the present application;
FIG. 2 is a schematic flow diagram of a method of data distillation provided in an embodiment of the present application;
FIG. 3 is a schematic flow diagram of a method of data distillation provided in an embodiment of the present application;
FIG. 4A is a schematic flow diagram of a method of data distillation provided in an embodiment of the present application;
FIG. 4B is a schematic diagram of a data mixing method according to an embodiment of the present application;
FIG. 5A is a logic flow diagram of a method of data distillation provided in an embodiment of the present application;
FIG. 5B is an algorithmic framework incorporating data blending and feature blending provided by an embodiment of the present application;
FIG. 5C is a schematic diagram of a training process combining data blending and feature blending according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a data distillation apparatus provided in an embodiment of the present application;
FIG. 7 is a hardware entity diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. The following examples are intended to illustrate the present application but are not intended to limit the scope of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
It should be noted that the terms "first \ second \ third" referred to in the embodiments of the present application are only used for distinguishing similar objects and do not represent a specific ordering for the objects, and it should be understood that "first \ second \ third" may be interchanged under predetermined orders or sequences where possible, so that the embodiments of the present application described herein can be implemented in an order other than that illustrated or described herein.
It will be understood by those within the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which embodiments of the present application belong. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The scheme provided by the embodiment of the application relates to the technical field of deep learning, and for facilitating understanding of the scheme of the embodiment of the application, terms related to the related technology are briefly explained at first:
artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science, attempting to understand the essence of intelligence and producing a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
The artificial intelligence technology is a comprehensive subject and relates to the field of extensive technology, namely the technology of a hardware level and the technology of a software level. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like. The embodiment of the application relates to a machine learning technology.
Machine Learning (ML) is a multi-domain cross discipline, and relates to a plurality of disciplines such as probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and the like. The special research on how a computer simulates or realizes the learning behavior of human beings so as to acquire new knowledge or skills and reorganize the existing knowledge structure to continuously improve the performance of the computer. Machine learning is the core of artificial intelligence, is the fundamental approach for computers to have intelligence, and is applied to all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and formal education learning.
Unlike knowledge distillation, which transfers knowledge from a complex network to a simpler model, data distillation leaves the model unchanged: it compresses the knowledge of a large original data set (typically containing thousands or millions of images) into a small amount of synthetic data, such that a model trained on the synthetic data achieves performance comparable to a model trained on the original data set.
The distillation data used in the data-free compression methods provided in the related art is obtained by distilling from a pre-training model, which stores some characteristics of the original training data and can reflect its distribution. Therefore, when an arbitrarily initialized picture is input to the pre-training model, one can observe whether the features of the distillation picture output by the model are consistent with those of the original training pictures. Minimizing the difference between these two sets of features, a process that may be referred to as data distillation, yields distillation pictures that are very close to the original training pictures. These distillation pictures can then be used to continue training the model and improve the performance of the compressed model.
The data distillation process can effectively remove the dependence on the original training data during model compression. However, existing data distillation methods still have two problems: (1) the resulting distillation data is not universal: the raw training data can be used to compress any pre-trained model, whereas the distillation data generated from pre-trained model A is difficult to use for compressing pre-trained model B, and vice versa; (2) the generated distillation data is inaccurate: generating distillation data from a pre-training model is equivalent to solving an inverse-function problem, but neural networks are highly irreversible and highly non-convex, so the generated distillation data is inaccurate and differs greatly from the original training pictures.
The embodiment of the application provides a data distillation method which is applied to electronic equipment. The electronic device includes, but is not limited to, a mobile phone, a laptop, a tablet and a web-enabled device, a multimedia device, a streaming media device, a mobile internet device, a wearable device, or other types of devices. The functions implemented by the method can be implemented by calling program code by a processor in an electronic device, and the program code can be stored in a computer storage medium. The processor may be configured to perform a process for generating distillation data, and the memory may be configured to store intermediate data required in generating distillation data and target distillation data generated.
Fig. 1 is a schematic flow chart of a method for distilling data provided in an embodiment of the present application, and as shown in fig. 1, the method at least comprises the following steps:
step S110, determining at least one batch of first distillation data to be trained;
here, the at least one batch of first distillation data to be trained refers to one or more data images of a batch (batch), the batch size being a hyper-parameter defining the number of samples to be processed by the pre-trained model during each training.
At least one mixed data including two kinds of data tag information exists in each batch of the first distillation data. For example, if the mixed data is a composite image obtained by mixing two images, the composite image carries all the label information of the original two images. The first distillation data to be trained can be in a matrix form, and pixel values in the matrix can be randomly initialized through distribution information of the original data. The specific values of the rows and columns of the matrix can also be determined according to actual needs. In this way, the acquisition of the first distillation data to be trained can be done quickly and easily, providing a good basis for subsequent processing.
Step S120, determining at least two pre-training models;
here, the pre-training model means that the model can be loaded with model parameters having the same network structure at the beginning of training, so as to make the model more effective through feature migration.
In some embodiments, the pre-training model includes one or more of a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), and a Deep Neural Network (DNN), which is not limited in this embodiment.
First statistical information of the raw data, such as a mean, a standard deviation, a variance and other statistics, is stored in each pre-training model. The mean is the expected value of the normal distribution curve and determines the location of the original data distribution; the variance is the square of the standard deviation and determines the spread of the original data distribution.
It should be noted that, in each neural network, the distribution features of the training data are encoded and processed in its unique feature space. Thus, in training the pre-trained model with the raw data, first statistical information of the raw data is stored in the batch normalization layer of each of the constituent blocks of the pre-trained model.
Step S130, determining batch normalized statistical loss of each batch of the first distillation data in the corresponding pre-training model based on the first statistical information in each pre-training model;
here, first, by matching the first statistical information in each of the pre-training models and the second statistical information of the first distillation data in each of the pre-training models, the statistical loss between the first distillation data and the original data in the corresponding pre-training model is determined, and then the statistical characteristics of the first distillation data and the plurality of pre-training models can be further matched, and the common characteristic spaces of the pre-training models are combined, so that the target distillation data obtained by training is more common than the data obtained by single model distillation.
It should be noted that the calculated mean and variance of the batch normalization layer in the pre-trained model are based on all the first distillation data included in the current batch.
Step S140, determining target cross entropy loss of each batch of the first distillation data in each pre-training model based on the initialization label of each batch of the first distillation data;
here, the initialization tag is randomly initialized when the first distillation data is initialized using the gaussian distribution of the raw data, and for example, any one of 1 to 1000 is set for each of a batch of the first distillation data as the initialization tag of the data. The initialization tags of different data may be the same or different.
And matching the initialization label of each data with the prediction label of the corresponding data output by the pre-trained model, and linearly summing the matching results of all the data in the batch of the first distillation data. A target cross entropy loss for a batch of the first distillation data in each of the pre-trained models can be determined.
Step S150, performing back propagation training on each batch of the first distillation data based on batch normalization statistical loss and the target cross entropy loss of each batch of the first distillation data in each pre-training model to obtain target distillation data.
Here, the target distillation data is a distillation picture very close to the original training picture. There is usually some information on the target distillation data, encoding the discriminating characteristics of each class.
The target loss of the first distillation data passing through the at least two pre-training models can be determined by performing linear processing on the batch normalized statistical loss and the target cross-entropy loss of each batch of the first distillation data in each pre-training model, for example, determining the cumulative loss of each pre-training model, then summing all pre-training models, and then averaging.
The gradient of the target loss with respect to the first distillation data is calculated in the back-propagation process, a gradient-descent update such as the Stochastic Gradient Descent (SGD) method is performed, and the first distillation data is gradually optimized; after two thousand iterations of training, the target distillation data is finally obtained.
In an embodiment of the present application, first, at least one batch of first distillation data to be trained is determined; then determining at least two pre-training models; determining batch normalized statistical loss of each batch of the first distillation data in the corresponding pre-training model based on the first statistical information in each pre-training model; determining a target cross-entropy loss for each batch of the first distillation data in each of the pre-trained models based on an initialization tag for each batch of the first distillation data; finally, performing back propagation training on each batch of the first distillation data based on batch normalization statistical loss and the target cross entropy loss of each batch of the first distillation data in each pre-training model to obtain target distillation data; therefore, robust visual information is generated through data mixing, and meanwhile, the characteristics of the original data stored in the pre-training models are utilized, so that the trained target distillation data can be matched with the characteristic distribution of the models, and therefore the target distillation data is more universal and has better effect.
Fig. 2 is a schematic flow chart of a method for distilling data provided in an embodiment of the present application, and as shown in fig. 2, the method at least includes the following steps:
step S210, determining at least one batch of first distillation data to be trained;
here, at least one mixed data including tag information of two kinds of data exists in each batch of the first distillation data.
Step S220, randomly selecting at least two pre-training models of different types from a pre-training model library;
here, first statistical information of the raw data is stored in each of the pre-training models. The different types of pre-trained models differ in their network structure.
It should be noted that the feature mixing method in the embodiment of the present application requires feature information from a plurality of models, but too many models slow down the optimization. Thus, for each batch of first distillation data to be optimized, only a small subset of the models is sampled for feature mixing. Experiments show that using three models gives the best balance between speed and effect.
Therefore, for each batch of first distillation data to be trained, at least two different types of pre-training models are randomly sampled from the pre-training library for feature mixing, so that the distilled data are matched with the feature distribution of any pre-training model, and better training speed and effect balance can be obtained.
Step S230, determining second statistical information of each batch of the first distillation data in each pre-training model;
here, since in each neural network, the distribution features of the training data are encoded and processed in its unique feature space, i.e., the Batch Normalization (BN) layer. Thus, the first distillation data can reasonably simulate the feature distribution of the original image in the pre-trained network. After each distillation training of the first distillation data, the activation mean and the variance are extracted from the batch normalization layer of each pre-training model as second statistical information of the first distillation data in the corresponding pre-training model.
Step S240, determining a batch normalization statistical loss between the first statistical information and the second statistical information for each pre-training model;
here, the batch normalized statistical loss for each of the pre-trained models is determined by matching the second statistical information of the first distillation data in each of the pre-trained models with the second statistical information stored in the corresponding pre-trained model.
In some embodiments, the first statistical information includes a mean and a variance, each of the pre-trained models includes at least two component blocks, each of the component blocks includes a convolutional layer and a batch normalization layer, and the first statistical information in each of the pre-trained models can be determined by: firstly, extracting the mean and the variance of the original data from the batch normalization layer of each component block, the mean and variance being statistics computed over the features of the original data extracted by the convolutional layer; then, determining the first statistical information in each pre-training model based on the means and variances of the at least two component blocks in each pre-training model. Therefore, the mean and the variance stored in the batch normalization layer of each component block in the pre-training model are extracted and the first statistical information corresponding to the pre-training model is determined, so that the target distillation data obtained through training can reasonably simulate the activation distribution of the original data in the pre-trained network.
Illustratively, the batch normalized statistical loss $L_{BN}$ for each of the pre-trained models can be determined by the following equation (1):

$$L_{BN} = \sum_{i} \left( \left\| \mu_i(X) - \hat{\mu}_i \right\|_2 + \left\| \sigma_i^2(X) - \hat{\sigma}_i^2 \right\|_2 \right) \tag{1}$$

where $\mu_i(X)$ and $\sigma_i^2(X)$ are respectively the mean and variance of the first distillation data $X$ in the $i$-th component block, and $\hat{\mu}_i$ and $\hat{\sigma}_i^2$ are respectively the running mean and variance stored in the BN layer of the corresponding pre-training model; $\|\cdot\|_2$ denotes the L2 norm, i.e., the two vectors are subtracted and the square root of the sum of the squares of the components is taken.
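For illustration only, a possible PyTorch-style implementation of this matching term is sketched below; it assumes the per-block batch statistics of the distillation data are captured with forward hooks, and the function name and hook mechanics are assumptions rather than part of the claimed method.

```python
import torch
import torch.nn as nn

def bn_statistical_loss(model: nn.Module, distill_batch: torch.Tensor) -> torch.Tensor:
    """Illustrative sketch of equation (1): sum, over all BN layers, the L2
    distance between the batch mean/variance of the distillation data and the
    running mean/variance stored in the layer."""
    losses = []

    def make_hook(bn):
        def hook(module, inputs, output):
            x = inputs[0]                          # features produced by the preceding conv layer
            mean = x.mean(dim=(0, 2, 3))
            var = x.var(dim=(0, 2, 3), unbiased=False)
            losses.append(torch.norm(mean - bn.running_mean, p=2) +
                          torch.norm(var - bn.running_var, p=2))
        return hook

    model.eval()                                   # keep the stored running statistics fixed
    handles = [m.register_forward_hook(make_hook(m))
               for m in model.modules() if isinstance(m, nn.BatchNorm2d)]
    model(distill_batch)                           # forward pass fills `losses`
    for h in handles:
        h.remove()
    return torch.stack(losses).sum()
```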
Step S250, determining target cross entropy loss of each batch of the first distillation data in each pre-training model based on the initialization label of each batch of the first distillation data;
here, the target cross entropy loss is the total loss of the last layer output of the pre-trained model.
The target cross entropy loss $L_{CE}$ of a batch of the first distillation data in each of the pre-trained models can be determined by the following equation (2):

$$L_{CE} = \sum_{i=1}^{m} \mathrm{CE}\left( \hat{y}_i, y_i \right) \tag{2}$$

where $m$ represents the number of first distillation data in a batch, i.e., the batch size, $y_i$ is the initialization tag of each piece of data, $\hat{y}_i$ is the prediction label output by the pre-training model, and $\mathrm{CE}(\cdot,\cdot)$ denotes the cross-entropy between a prediction and a label.
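As a hedged sketch of this term (the model, label tensor and helper name are placeholders, not part of the claims):

```python
import torch
import torch.nn.functional as F

def target_cross_entropy(model, distill_batch, init_labels):
    """Illustrative sketch of equation (2): cross-entropy between the model's
    predictions on a batch of distillation data and the randomly initialized
    labels, summed over the batch."""
    logits = model(distill_batch)                  # shape (m, num_classes)
    return F.cross_entropy(logits, init_labels, reduction="sum")
```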
It is to be noted that, for the mixed data in the batch of the first distillation data, i.e., the composite image obtained by mixing the two images, the label of the composite image is a mixed label determined based on the labels of the original two images.
Step S260, determining a first loss corresponding to each pre-training model based on the batch normalization statistical loss and the target cross entropy loss of each pre-training model;
and performing linear summation on the calculated batch normalized statistical loss and the target cross entropy loss aiming at each pre-training model to obtain a first loss corresponding to the corresponding pre-training model.
Step S270, averaging the first losses corresponding to the pre-training models to obtain target losses of each batch of the first distillation data passing through the at least two pre-training models;
here, the first losses corresponding to all the pre-trained models are summed and averaged to minimize, and the target loss of each batch of the first distillation data through the at least two pre-trained models is obtained.
By way of example, consider a model cluster $M = \{A_1, A_2, \ldots, A_m\}$ comprising at least two pre-trained models. The above target loss, which aims to optimize each batch of first distillation data by combining feature mixing and data mixing, can be found by the following equation (3):

$$L = \frac{1}{m} \sum_{i=1}^{m} \left( \lambda_1 L_{BN}^{A_i}(\tilde{X}) + \lambda_2 L_{CE}^{A_i}(\tilde{X}) \right) \tag{3}$$

where $L_{BN}^{A_i}(\tilde{X})$ is the batch normalized statistical loss of each batch of first distillation data $\tilde{X}$ on the pre-training model $A_i$, and $L_{CE}^{A_i}(\tilde{X})$ is the cross entropy loss of each batch of first distillation data $\tilde{X}$ on the pre-training model $A_i$; $\lambda_1$ and $\lambda_2$ are coefficients generally set according to actual conditions, and $m$ is the number of pre-training models.
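One way the per-model losses could be combined is sketched below, reusing the bn_statistical_loss and target_cross_entropy sketches given earlier; the coefficients, the sampling of three models and the omission of the mixed-label term for the composite image (sketched after equation (4) below) are illustrative assumptions.

```python
import random
import torch

def target_loss(model_zoo, distill_batch, init_labels,
                num_models: int = 3, lam1: float = 1.0, lam2: float = 1.0):
    """Illustrative sketch of equation (3): average, over a small randomly
    sampled model cluster, the weighted sum of the BN statistical loss and the
    target cross-entropy loss."""
    cluster = random.sample(model_zoo, k=num_models)         # feature mixing over a few models
    per_model = []
    for net in cluster:
        l_bn = bn_statistical_loss(net, distill_batch)       # sketch above
        l_ce = target_cross_entropy(net, distill_batch, init_labels)
        per_model.append(lam1 * l_bn + lam2 * l_ce)          # first loss for this model
    return torch.stack(per_model).mean()                     # target loss
```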
In some embodiments, an a priori loss may also be applied to the first distillation data to ensure that the image is substantially smooth, where the a priori loss is defined as the mean square error between the first distillation data and a gaussian-blurred version thereof. Therefore, the final minimized target is determined as the target loss by combining the three of the batch normalization statistical loss, the target cross entropy loss and the prior loss.
Here, a target loss of the first distillation data through the at least two pre-trained models is determined by matching a batch normalized statistical loss and a target cross-entropy loss between the first distillation data and features of the plurality of pre-trained models. In this way, the generic feature spaces of each pre-trained model are combined by determining the target loss, thereby making the trained target distillation data more generic than that of single model distillation.
Step S280, performing back propagation training on each batch of the first distillation data based on the target loss to obtain the target distillation data.
Here, the optimization of each batch of first distillation data uses the Stochastic Gradient Descent (SGD) method, and each batch of data goes through twenty thousand iterations from initialization to the completion of training. Stochastic gradient descent is a simple but very effective method, mainly used for learning linear classifiers under convex loss functions such as support vector machines and logistic regression; at each step, a randomly selected sample is used to estimate the gradient, and this estimate determines the descent step for the current update.
In some embodiments, step S280 may be implemented by: determining a gradient of the target loss against the first distillation data for each iterative training; updating each of the first distillation data based on the gradient; and obtaining the target distillation data when the batch normalization statistical loss and the target cross entropy loss of the updated first distillation data in each pre-training model converge to stable values. In this way, the first distillation data is optimized by a random gradient descent method in the back propagation process, and the iterative training process of each batch of first distillation data is accelerated.
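A hedged end-to-end sketch of this back-propagation loop, which treats the distillation images themselves as the optimized parameters, is given below; the iteration count, learning rate, momentum and the reuse of the target_loss sketch above are assumptions, and the per-iteration data mixing described elsewhere is omitted for brevity.

```python
import torch

def distill(model_zoo, init_batch, init_labels, steps: int = 2000, lr: float = 0.1):
    """Illustrative sketch: iteratively update a batch of first distillation
    data by stochastic gradient descent on the target loss."""
    data = init_batch.clone().detach().requires_grad_(True)  # the images are the "parameters"
    optimizer = torch.optim.SGD([data], lr=lr, momentum=0.9)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = target_loss(model_zoo, data, init_labels)     # sketch above
        loss.backward()                                      # gradient w.r.t. the images
        optimizer.step()                                     # gradient-descent update of the images
    return data.detach()                                     # target distillation data
```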
In an embodiment of the application, at least two different types of pre-trained models are randomly sampled from a pre-trained library for each batch of first distillation data to be trained, and features of the first distillation data and the plurality of pre-trained models are simultaneously matched. Therefore, the general feature spaces of the pre-training models are combined, so that the distilled data are matched with the feature distribution of any pre-training model, and better training speed and effect balance can be obtained.
Fig. 3 is a schematic flow chart of a method for distilling data provided in an embodiment of the present application, and as shown in fig. 3, the method at least includes the following steps:
step S310, determining at least one batch of second distillation data based on the distribution information of the original data;
here, the raw data may be an image in at least one of the following general data sets: ImageNet dataset, ResNet50 dataset, MobileNet V2 dataset, and the like.
The distribution information of the raw data may be a Normal Distribution, also known as a Gaussian Distribution. The distribution curve is bell-shaped, i.e., low at both ends and high in the middle, and symmetric about its center. If a random variable X follows a normal distribution with mathematical expectation μ and variance σ^2, it is denoted as N(μ, σ^2). The expected value μ of a normal distribution determines the location of its probability density function, and its standard deviation σ determines the spread of the distribution. The normal distribution with μ = 0 and σ = 1 is the standard normal distribution.
Here, the distribution information is Gaussian distribution information including the mean, variance, and so on of the original data. For visible-light (Red-Green-Blue, RGB) images, the mean and standard deviation of the three color channels are not the same. For example, the per-channel mean of the images in the ImageNet dataset is [0.485, 0.456, 0.406] and the per-channel standard deviation is [0.229, 0.224, 0.225].
In some embodiments, the step S310 may be implemented by: acquiring Gaussian distribution information of the original data; initializing at least N initial pixel value matrixes based on data randomly sampled from the Gaussian distribution information to obtain each batch of second distillation data; N is an integer of 2 or more. Therefore, data are randomly sampled from the Gaussian distribution information of the original data, and the second distillation data can be obtained through initialization, so that the defect that the original data cannot be directly obtained is overcome.
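By way of illustration only, a batch of second distillation data could be initialized as follows; the batch size, class count, image resolution and the use of the channel statistics quoted above are assumptions, not limitations of the embodiments.

```python
import torch

def init_second_distillation_data(batch_size: int = 32, num_classes: int = 1000,
                                  height: int = 224, width: int = 224):
    """Illustrative sketch: sample a batch of pixel-value matrices from the
    per-channel Gaussian statistics of the raw data and assign random
    initialization labels."""
    mean = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
    std = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)
    noise = torch.randn(batch_size, 3, height, width)   # standard normal samples
    data = noise * std + mean                            # match the raw-data channel statistics
    labels = torch.randint(0, num_classes, (batch_size,))
    return data, labels
```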
Step S320, determining at least two pre-training models;
here, first statistical information of the raw data is stored in each of the pre-training models.
Step S330, mixing every two image data in each batch of second distillation data in each iterative training to obtain a batch of first distillation data;
in each iterative training, any two image data are randomly selected for data mixing, so that the mixed image contains information of the two image data, and the generated target distillation data is more robust and has more visual information, and can adapt to different scales and spatial orientations. The target distillation data generated is therefore more efficient for model compression.
It should be noted that the execution sequence of step S320 and step S330 is not limited, and step S330 may be executed first, step S320 may be executed first, or step S330 may be executed simultaneously.
Step S340, determining batch normalized statistical loss of each batch of the first distillation data in the corresponding pre-training model based on the first statistical information in each pre-training model;
step S350, determining target cross entropy loss of each batch of the first distillation data in each pre-training model based on the initialization label of each batch of the first distillation data;
here, each batch of the first distillation data includes a composite image obtained by mixing the first image and the second image. That is, a pair of first image and second image is randomly selected from a batch of second distillation data and mixed to obtain a composite image, wherein the first image and the second image respectively have the initialization tags corresponding to the first image and the second image.
The target cross entropy loss for each batch of the first distillation data in each of the pre-trained models can be determined by:
step S3501, determining the mixed cross entropy loss of the synthetic image based on the initialization tag of the first image and the initialization tag of the second image;
here, determining the mixed cross entropy loss of the synthesized image by the initialization tags of the first image and the second image, respectively, may ensure that the information of the original first image and the second image can be recognized and correctly optimized by the pre-trained model.
Step S3502, determining cumulative cross entropy loss of other images except the first image and the second image in each batch of the first distillation data based on initialization tags of the other images;
here, the process of calculating the cumulative cross entropy loss may refer to equation (2) except that the variable i takes a value of another image other than the first image and the second image.
Step S3503, determining a target cross entropy loss of each batch of the first distillation data after passing through each of the pre-training models based on the mixed cross entropy loss and the accumulated cross entropy loss.
And summing the mixed cross entropy loss of the synthetic images in one batch of first distillation data and the accumulated cross entropy losses of other images to minimize the sum, so as to obtain the target cross entropy loss of each batch of first distillation data after each pre-training model.
In some embodiments, the step S3501 may be implemented by: determining a first cross entropy loss of each batch of the first distillation data after passing through each of the pre-trained models based on the initialization tag of the first image; determining a second cross entropy loss of each batch of the first distillation data after passing through each of the pre-trained models based on the initialization tag of the second image; and linearly summing the first cross entropy loss and the second cross entropy loss based on a composite ratio between the first image and the second image to obtain the mixed cross entropy loss. In implementation, the above mixed cross entropy loss can be determined by the following equation (4):

$$\mathcal{L}_{mix}(\tilde{X}) = \beta\,\mathcal{L}_{CE}\!\left(\tilde{X}, Y_2\right) + (1-\beta)\,\mathcal{L}_{CE}\!\left(\tilde{X}, Y_1\right) \tag{4}$$

wherein $\tilde{X}$ is the first distillation data after mixing, $\mathcal{L}_{mix}(\tilde{X})$ is the mixed cross entropy loss of $\tilde{X}$ through some pre-trained model, $Y_2$ and $Y_1$ are the initialization tags of the first image and the second image before blending, $\mathcal{L}_{CE}(\tilde{X}, Y_2)$ and $\mathcal{L}_{CE}(\tilde{X}, Y_1)$ are the first cross entropy loss corresponding to the first image and the second cross entropy loss corresponding to the second image, respectively, and $\beta$ is the proportionality coefficient between the area of the mixing region and the area of the second image.
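As a minimal sketch of equation (4), the linear combination of the two cross entropy terms might look like the following, assuming PyTorch-style tensors; the names (logits, y_first, y_second, beta) are illustrative and not taken from the patent text.

```python
import torch
import torch.nn.functional as F

def mixed_cross_entropy(logits, y_first, y_second, beta):
    """Linearly combine the cross entropy against both original labels.

    logits:   model output for the composite image, shape (1, num_classes)
    y_first:  initialization label of the reduced, overlaid first image, shape (1,)
    y_second: initialization label of the second (background) image, shape (1,)
    beta:     area of the mixing region divided by the area of the second image
    """
    loss_first = F.cross_entropy(logits, y_first)    # first cross entropy loss
    loss_second = F.cross_entropy(logits, y_second)  # second cross entropy loss
    return beta * loss_first + (1.0 - beta) * loss_second
```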
In this way, the mixed cross entropy loss is determined from the initialization label of each image before mixing, and the target cross entropy loss is then determined by combining the accumulated cross entropy losses of the other images. Robust features that still have discriminative power can thus be generated during training, incorrect inversion solutions can be avoided, and target distillation data with accurate label information can be synthesized.
Step S360, performing back propagation training on each batch of the first distillation data based on the batch normalized statistical loss and the target cross entropy loss of each batch of the first distillation data in each pre-training model, so as to obtain target distillation data.
Here, the first loss of a batch of first distillation data for each pre-training model is obtained by combining the target cross entropy loss and the batch normalization statistical loss, and then the first losses corresponding to each pre-training model are linearly integrated, so that the characteristic deviation generated by each pre-training model can be averaged, and the finally obtained target distillation data is closer to the real data and is more universal.
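A hedged sketch of this combination is shown below: a per-model "first loss" (target cross entropy loss plus batch normalized statistical loss) is computed for each sampled pre-training model and the results are averaged. The helper names (bn_stat_loss, target_ce_loss) and the weight alpha are assumptions for illustration, not the patent's exact formulation.

```python
import torch

def target_loss(batch, models, bn_stat_loss, target_ce_loss, alpha=1.0):
    """batch: a batch of first distillation data; models: sampled pre-training models."""
    first_losses = []
    for model in models:                          # each sampled pre-trained model
        logits = model(batch)                     # forward pass of the mixed batch
        loss_m = target_ce_loss(logits) + alpha * bn_stat_loss(model)
        first_losses.append(loss_m)               # the "first loss" of this model
    # Averaging (linear integration) smooths the feature bias of any single model.
    return torch.stack(first_losses).mean()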
Here, because the features of a plurality of pre-trained models are utilized, the generated target distillation data can match the feature distribution of the plurality of models; meanwhile, through data mixing, the target distillation data is more robust and carries more visual information. After the target distillation data is obtained, it can be used directly to compress the pre-trained models or other unidentified models.
In some embodiments, each of the pre-trained models is model compressed if the number of at least one batch of the target distillation data reaches a certain threshold. Therefore, the generated target distillation data has more visual information and is universal, and can be directly used for the model compression process, thereby simplifying the compression process.
In some embodiments, when a new model compression requirement arises, the generated target distillation data can be directly used for compression, without repeating the distillation data generation process. The target distillation data generated by the embodiment of the application is accurate enough, so only one target distillation data set (containing enough target distillation data) needs to be generated; it can then be applied to all pre-training models and generalized to unidentified models.
In some embodiments, the model compression operation may include Network Quantization, Network Pruning, Knowledge Distillation, and the like. In the pruning process, the pruning decision and the post-pruning reconstruction can be completed based on the target distillation data set; in model quantization, the target distillation data set can be used for quantization during training or as the calibration set for post-training quantization; in knowledge distillation, the target distillation data set can be fed to the teacher network and the student network respectively to complete the knowledge transfer process.
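As one hedged example of reuse, the generated target distillation set could drive a knowledge distillation step in which the distilled images are fed to a teacher (pre-trained) network and a student (compressed) network. The names (teacher, student, images, optimizer) and the temperature T are assumptions; this is a sketch, not the patent's prescribed procedure.

```python
import torch
import torch.nn.functional as F

def distill_step(teacher, student, images, optimizer, T=4.0):
    """One knowledge-transfer step on a batch of target distillation data."""
    teacher.eval()
    with torch.no_grad():
        t_logits = teacher(images)                 # soft targets from the teacher
    s_logits = student(images)
    # KL divergence between temperature-softened distributions transfers knowledge.
    loss = F.kl_div(F.log_softmax(s_logits / T, dim=1),
                    F.softmax(t_logits / T, dim=1),
                    reduction="batchmean") * (T * T)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```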
In the embodiment of the application, by mixing random data of a batch of first distillation data in each iterative training, the inversion solution space can be reduced, and meanwhile, the visual information of the trained target distillation data is more robust and can adapt to different scales and spatial orientations. And determining the target loss of the batch of the first distillation data through a plurality of pre-training models by combining the batch normalization statistical loss and the target cross entropy loss. Thereby enabling the finally obtained target distillation data to be closer to the real data.
Based on fig. 3, fig. 4A is a schematic flow chart of a data distillation method provided in an embodiment of the present application. As shown in fig. 4A, the above step S330, "in each iterative training, mixing every two image data in each batch of the second distillation data to obtain each batch of the first distillation data", can be implemented through the following steps:
step S410, randomly selecting at least one pair of a first image and a second image from each batch of the second distillation data;
here, each pair of the first image and the second image is arbitrary two pieces of second distillation data. Because the first image and the second image are randomly selected and mixed in each iteration process, the visual information of the trained target distillation data is more robust and can adapt to different scales and spatial orientations.
Step S420, reducing the size of each first image according to a specific scale;
here, the specific scale is the ratio between the size of the bounding box of the randomly covered region on the second image and the size of the original image data.
Step S430, randomly overlaying the reduced first image onto the corresponding second image to obtain each batch of the first distillation data.
Here, the overlaid composite image includes information of the two image data by overlaying the reduced first image on an arbitrary area of the second image.
As shown in fig. 4B, the first image 41 is scaled down to a fixed size, and then the scaled-down first image 42 is randomly overlaid on the second image 43. The composite image 44 covered at this time will contain all the information of the first image 41 and the second image 43.
In some embodiments, the step S430 may be implemented by: randomly generating a mixed area to be covered in the corresponding second image according to the specific proportion; randomly generating a binary mask based on a bounding box of the blending region; and superposing each pixel value of the reduced first image and the corresponding pixel value in the second image through the binary mask to obtain each batch of the first distillation data. In implementation, the blending of any two image data is achieved by the following equations (5) and (6):
$$\alpha_{ij} = \begin{cases} 1, & x_l \le i \le x_r \ \text{and}\ y_d \le j \le y_u \\ 0, & \text{otherwise} \end{cases} \tag{5}$$

wherein $x_l, x_r, y_d, y_u$ are, in sequence, the left, right, lower and upper boundaries of the bounding box of the mixing region, and $\alpha_{ij}$ is an element of the binary mask: $\alpha_{ij}$ takes the value 1 if position $(i, j)$ lies inside the bounding box, and 0 if it lies outside the bounding box.

$$\tilde{X} = \alpha \odot g(X_2) + (1-\alpha) \odot X_1 \tag{6}$$

wherein $X_2$ and $X_1$ are respectively the first image and the second image to be mixed, $\tilde{X}$ is the composite image obtained by overlaying the scaled-down first image onto the second image, and $g(\cdot)$ is a linear interpolation function that adjusts the first image $X_2$ to the same size as the bounding box.
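A minimal sketch of equations (5) and (6) follows: reduce the first image with a linear interpolation g(·), build a binary mask from a random bounding box, and blend it onto the second image. The tensor layout (C, H, W), the function name make_composite, and the default scale are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def make_composite(x_first, x_second, scale=0.5):
    """x_first, x_second: tensors of shape (C, H, W) with the same size."""
    _, H, W = x_second.shape
    bh, bw = int(H * scale), int(W * scale)            # bounding box size (specific scale)
    y_u = torch.randint(0, H - bh + 1, (1,)).item()    # upper boundary
    x_l = torch.randint(0, W - bw + 1, (1,)).item()    # left boundary
    y_d, x_r = y_u + bh, x_l + bw                      # lower / right boundaries

    # Equation (5): alpha_ij = 1 inside the bounding box, 0 outside.
    alpha = torch.zeros(1, H, W)
    alpha[:, y_u:y_d, x_l:x_r] = 1.0

    # g(.): bilinear interpolation resizes the first image to the bounding box size.
    reduced = F.interpolate(x_first.unsqueeze(0), size=(bh, bw),
                            mode="bilinear", align_corners=False).squeeze(0)
    placed = torch.zeros_like(x_second)
    placed[:, y_u:y_d, x_l:x_r] = reduced

    # Equation (6): pixel-wise blend through the binary mask.
    composite = alpha * placed + (1.0 - alpha) * x_second
    beta = (bh * bw) / float(H * W)                    # area ratio used in equation (4)
    return composite, beta
```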
In this way, the pixel value of the first image and the pixel value of the second image after the reduction are superposed by randomly generating the binary mask, and the random mixing of the two images in the batch of second distillation data is realized, so that the mixed data, namely the batch of first distillation data still has discriminative power when being trained and optimized by the pre-training model.
In the embodiment of the application, the first image is subjected to equal-scale reduction in data mixing and randomly overlaid into the second image, and the mixed composite image contains information of the first image and the second image. In training a batch of first distillation data, it is ensured that the two mixed image information can be identified and correctly optimized by the pre-trained model.
The method of the above data distillation is described below with reference to a specific example, however, it should be noted that the specific example is only for better describing the present application and is not to be construed as limiting the present application.
Fig. 5A is a logic flow diagram of a method for data distillation provided in an embodiment of the present application, as shown in fig. 5A, the method at least includes the following steps:
step S510, initializing a batch of distillation data by using Gaussian distribution of the original data;
here, the distillation process first initializes a batch of distillation data (corresponding to the second distillation data) with a Gaussian distribution constructed from the statistical values of the raw data.
In practice, the distribution information of the original training data is obtained first, and then a batch of distillation data is initialized by using the statistical value of the gaussian distribution.
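A sketch of this initialization, assuming per-channel mean/std statistics of the original data, might look as follows; the names (mean, std, batch_size, image_shape) are illustrative assumptions.

```python
import torch

def init_distillation_batch(mean, std, batch_size=32, image_shape=(3, 224, 224)):
    """mean, std: per-channel statistics of the original training data, shape (C,)."""
    c, h, w = image_shape
    noise = torch.randn(batch_size, c, h, w)                 # standard Gaussian samples
    batch = noise * std.view(1, c, 1, 1) + mean.view(1, c, 1, 1)
    return batch.requires_grad_(True)                        # the data itself is trained
```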
Step S520, randomly sampling three pre-training models from a pre-training model cluster;
step S530, training a batch of distillation data through three pre-training models by using data mixing and feature mixing;
here, the present example employs stochastic gradient descent to train the mixed batch of distillation data (corresponding to the first distillation data).
In step S540, the above training process is repeated until a specified amount of target distillation data is generated.
Here, the process exits when the generation of all the target distillation data is completed, and otherwise, the steps S520 to S530 are repeated.
The target distillation data generated by the embodiment of the application can be generalized even to unidentified models, because the generated target distillation data is accurate and universal enough. Therefore, when a new model compression requirement is set, the generated target distillation data can be directly used for compression.
Fig. 5B is an algorithm framework combining data mixing and feature mixing provided in the embodiment of the present application, and as shown in fig. 5B, a solid arrow in the framework represents a forward training flow direction, and a dashed arrow represents a flow direction of a backward inversion process. In the training process, there are two major processes, namely data mixing 51 and feature mixing 52. Where the data mix 51 is used to generate robust visual information and the feature mix 52 is used to generate generic generalizable pictures.
In the forward training process, a data mixing 51 process is first performed: a first image 501 and a second image 502 are randomly selected from a batch of distillation data and mixed to obtain a composite image 503. Then a feature mixing 52 process is performed: the batch of distillation data including the composite image 503 is input into a pre-training model 504, a pre-training model 505 and a pre-training model 506 respectively (three models are shown for illustration; the number of models is not limited), and the target cross entropy loss 507 and the batch normalized statistical loss 508 after passing through the three pre-training models are calculated. Each component block of each pre-training model comprises an activation layer 61, a batch normalization layer 62 and a convolution layer 63; the mean and variance (μ, σ²) of a batch of distillation data are counted in the batch normalization layer 62, and the mean and variance stored in all the component blocks of each pre-training model can be extracted to determine the above batch normalized statistical loss 508.
In the backward inversion process, the target loss is further determined using the target cross entropy loss 507 and the batch normalized statistical loss 508 determined in the last iterative training process to optimize the composite image 503 using the target loss. Since the pre-trained model already captures the image class information of the distillation data, the knowledge can be inverted by assigning a mixture label to the composite image 503.
As shown in fig. 5C, the forward training process includes the following steps:
step S5301, mixing random data of a batch of distillation data;
in the data mixing, random mixing is performed on any two image data from a batch of initialized distillation data. The composite image will now contain information of both image data. In training a mixed batch of distillation data, it is ensured that the information of the two image data can be identified and correctly optimized by the model.
Step S5302, inputting the mixed batch of distillation data into three pre-training models respectively, and determining the difference between the current statistic and the original statistic stored by the models;
here, the current statistic is second statistic information of a batch of distillation data, and the raw statistic is first statistic information of raw data stored in the model.
In feature mixing, the statistical information of a batch of distillation data in a pre-trained model is calculated. The first statistical information of the raw data, i.e., the mean and variance, is stored in each layer of the pre-trained model. When a batch of distillation data is input, if its mean and variance in the pre-trained model differ little from those of the raw data, the batch of distillation data is considered very similar to the raw data. If the difference is still large, the error between them needs to be further minimized.
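A hedged sketch of this batch normalized statistical loss is given below: the running mean/variance stored in every BatchNorm2d layer (first statistical information) is compared with the statistics of the current distillation batch (second statistical information), collected through forward hooks. The class name BNStatLoss and the L2 distance are assumptions, not the patent's exact formulation.

```python
import torch
import torch.nn as nn

class BNStatLoss:
    """Accumulates per-layer statistic mismatches during a forward pass."""
    def __init__(self, model):
        self.losses, self.handles = [], []
        for m in model.modules():
            if isinstance(m, nn.BatchNorm2d):
                self.handles.append(m.register_forward_hook(self._hook))

    def _hook(self, module, inputs, output):
        x = inputs[0]                                   # features entering the BN layer
        mean = x.mean(dim=(0, 2, 3))                    # second statistics of the batch
        var = x.var(dim=(0, 2, 3), unbiased=False)
        # Distance to the stored first statistics of the original data.
        self.losses.append(torch.norm(mean - module.running_mean, 2)
                           + torch.norm(var - module.running_var, 2))

    def pop(self):
        loss = torch.stack(self.losses).sum()
        self.losses = []
        return loss
```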
Notably, in feature mixing a batch of distillation data needs to match the features of the three pre-trained models simultaneously, which makes the generated target distillation data more versatile than that obtained by single-model distillation. According to the method and the device, the characteristics of a plurality of pre-training models are utilized, so that the generated target distillation data are matched with the characteristic distribution of the plurality of models. This makes the target distillation data more versatile and effective.
Step S5303, back-propagating the difference to calculate a gradient, and performing a gradient descent update on the mixed batch of distillation data;
in step S5304, it is determined whether the iterative training is completed.
Here, after about twenty thousand iterations, the target distillation data can be made to look consistent with the original data, and the iterative training ends; otherwise, the process returns to step S5301 and continues training.
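The sketch below ties the earlier illustrative pieces together into the forward training / backward inversion loop of fig. 5C. It assumes the BNStatLoss helper from the previous sketch, folds the per-iteration data mixing and mixed-label term into a plain cross entropy against the initialization labels for brevity, and the hyper-parameters (iterations, lr) are assumptions.

```python
import torch
import torch.nn.functional as F

def train_distillation_batch(batch, labels, models, iterations=20000, lr=0.1):
    """batch: initialized distillation data with requires_grad=True; labels: initialization labels."""
    for m in models:
        m.eval()                                         # keep stored running statistics fixed
    optimizer = torch.optim.SGD([batch], lr=lr)          # stochastic gradient descent on the data
    bn_losses = [BNStatLoss(m) for m in models]          # hooks on each pre-trained model
    for _ in range(iterations):
        optimizer.zero_grad()
        total = 0.0
        for model, bn_loss in zip(models, bn_losses):    # feature mixing over several models
            logits = model(batch)
            ce = F.cross_entropy(logits, labels)
            total = total + ce + bn_loss.pop()           # "first loss" of this model
        (total / len(models)).backward()                 # averaged target loss
        optimizer.step()                                 # gradient descent update of the data
    return batch.detach()
```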
In the related art, distillation data generated by a data-free compression method can only be used for compression of the model, and because the distillation process is irreversible operation and has a lot of unreasonable visual information, the generated distillation data cannot be migrated and needs to be generated repeatedly.
The distillation data are trained by using data mixing and feature mixing, so that the generated target distillation data have more robust visual information and features, and therefore the target distillation data are very universal and can be used for different models. Meanwhile, the target distillation data can be used all the time only by generating once.
According to the embodiment of the application, the universality of distillation data generation is improved by using a plurality of pre-training models, and robust visual information is generated by mixing data information. In practical use, for common raw data sets, distillation data sets can be generated in advance and directly used for model compression, such as quantization and pruning. For unseen raw data sets, distillation data would need to be generated for use.
Based on the foregoing embodiments, the present application further provides an apparatus for data distillation, where the apparatus includes modules, sub-modules included in the modules, and units, and can be implemented by a processor in an electronic device; of course, the implementation can also be realized through a specific logic circuit; in the implementation process, the Processor may be a Central Processing Unit (CPU), a microprocessor Unit (MPU), a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA), or the like.
Fig. 6 is a schematic structural diagram of an apparatus for data distillation according to an embodiment of the present application, and as shown in fig. 6, the apparatus 600 includes a first determining module 610, a second determining module 620, a third determining module 630, a fourth determining module 640, and a training module 650, where:
the first determining module 610 is configured to determine at least one batch of first distillation data to be trained; at least one mixed data including two data label information exists in each batch of the first distillation data;
the second determining module 620 is configured to determine at least two pre-training models; first statistical information of original data is stored in each pre-training model;
the third determining module 630, configured to determine a batch normalized statistical loss of each batch of the first distillation data in the corresponding pre-trained model based on the first statistical information in each pre-trained model;
the fourth determining module 640, configured to determine a target cross-entropy loss of each batch of the first distillation data in each pre-training model based on an initialization tag of each batch of the first distillation data;
the training module 650 is configured to perform back propagation training on each batch of the first distillation data based on a batch normalized statistical loss and the target cross entropy loss of each batch of the first distillation data in each pre-training model, so as to obtain target distillation data.
In some possible embodiments, the third determining module 630 includes a first determining sub-module and a second determining sub-module, wherein: the first determining sub-module is configured to determine second statistical information of each batch of the first distillation data in each pre-training model; the second determining sub-module is configured to determine, for each of the pre-training models, a batch normalized statistical loss between the first statistical information and the second statistical information.
In some possible embodiments, the second determining module 620 is further configured to randomly select at least two different types of pre-training models from a pre-training model library.
In some possible embodiments, the training module 650 includes a third determining sub-module, a processing sub-module, and a training sub-module, wherein: the third determining submodule is used for determining a first loss corresponding to each pre-training model based on the batch normalization statistical loss and the target cross entropy loss of each pre-training model; the processing submodule is used for averaging the first loss corresponding to each pre-training model to obtain the target loss of each batch of the first distillation data passing through the at least two pre-training models; and the training submodule is used for carrying out back propagation training on each batch of the first distillation data based on the target loss to obtain the target distillation data.
In some possible embodiments, the first determination module comprises an initialization sub-module and a mixing sub-module, wherein: the initialization sub-module is used for initializing at least one batch of second distillation data based on the distribution information of the original data; and the mixing submodule is used for mixing every two image data in each batch of the second distillation data in each iterative training to obtain each batch of the first distillation data.
In some possible embodiments, the mixing submodule includes a selecting unit, a scaling unit, and an overlaying unit, where: the selecting unit is used for randomly selecting at least one pair of first image and second image from each batch of the second distillation data; the scaling unit is used for reducing the size of each first image according to a specific scale; the covering unit is used for randomly covering the reduced first image into the corresponding second image to obtain each batch of first distillation data.
In some possible embodiments, the overlay unit comprises a first generation subunit, a second generation subunit, and an overlay subunit, wherein: the first generating subunit is configured to randomly generate a mixed region to be covered in the corresponding second image according to the specific proportion; the second generating subunit is configured to randomly generate a binary mask based on the bounding box of the mixed region; the superposition subunit is configured to superpose, through the binary mask, each pixel value of the reduced first image and a corresponding pixel value in the second image, so as to obtain each batch of the first distillation data.
In some possible embodiments, each batch of the first distillation data includes a composite image obtained by mixing the first image and the second image, and the fourth determining module 640 includes a fourth determining sub-module, a fifth determining sub-module, and a sixth determining sub-module, where: the fourth determination submodule is configured to determine a mixed cross-entropy loss of the composite image based on the initialization tag of the first image and the initialization tag of the second image; the fifth determining sub-module is used for determining the accumulated cross entropy loss of other images except the first image and the second image in each batch of the first distillation data based on the initialization tags of the other images; the sixth determining submodule is configured to determine a target cross entropy loss of each batch of the first distillation data after passing through each pre-training model based on the mixed cross entropy loss and the accumulated cross entropy loss.
In some possible embodiments, the fourth determination submodule comprises a second determination unit, a third determination unit and a summation unit, wherein: the second determining unit is used for determining a first cross entropy loss of each batch of the first distillation data after passing through each pre-training model based on the initialization label of the first image; a third determining unit, configured to determine a second cross-entropy loss after each batch of the first distillation data passes through each pre-training model based on an initialization tag of the second image; the summation unit is configured to perform linear summation on the first cross entropy loss and the second cross entropy loss based on a synthesis ratio between the first image and the second image, so as to obtain the target cross entropy loss.
In some possible embodiments, the first statistical information includes a mean and a variance, each of the pre-trained models includes at least two component blocks, each of the component blocks includes a convolutional layer and a batch normalization layer, and the third determining module 630 further includes an extracting sub-module and an accumulating sub-module, where: the extraction submodule is used for extracting the mean value and the variance of the original data from the batch normalization layer of each composition block; the mean and variance are obtained by counting the characteristics of the original data extracted by the convolutional layer; the accumulation submodule is used for determining first statistical information in each pre-training model based on the mean and the variance of the at least two component blocks in each pre-training model.
In some possible embodiments, the distribution information is gaussian distribution information, and the initialization submodule includes an obtaining unit and an initialization unit, where: the acquisition unit is used for acquiring Gaussian distribution information of the original data; the initialization unit is used for initializing at least N initial pixel value matrixes based on data randomly sampled from the Gaussian distribution information to obtain each batch of second distillation data; n is an integer of 2 or more.
In some possible embodiments, the training submodule comprises a fourth determining unit, an updating unit and a training unit, wherein: the fourth determination unit is used for determining the gradient of the target loss for the first distillation data in each iteration training in the back propagation process; the updating unit is used for updating each data in the first distillation data based on the gradient; the training unit is used for obtaining the target distillation data when the batch normalization statistical loss and the target cross entropy loss of the updated first distillation data in each pre-training model converge to stable values.
In some possible embodiments, the apparatus further comprises a compression module for performing model compression on each of the pre-trained models if the number of at least one batch of the target distillation data reaches a certain threshold.
Here, it should be noted that: the above description of the apparatus embodiments, similar to the above description of the method embodiments, has similar beneficial effects as the method embodiments. For technical details not disclosed in the embodiments of the apparatus of the present application, reference is made to the description of the embodiments of the method of the present application for understanding.
It should be noted that, in the embodiment of the present application, if the method for data distillation is implemented in the form of a software functional module and is sold or used as a stand-alone product, it may also be stored in a computer readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for enabling an electronic device (which may be a smartphone with a camera, a tablet computer, etc.) to execute all or part of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read Only Memory (ROM), a magnetic disk, or an optical disk. Thus, embodiments of the present application are not limited to any specific combination of hardware and software.
Accordingly, embodiments of the present application provide a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the steps in the method for data distillation described in any of the above embodiments.
Correspondingly, in an embodiment of the present application, there is also provided a chip, which includes a programmable logic circuit and/or program instructions, and when the chip is operated, is used to implement the steps in the method for data distillation described in any of the above embodiments.
Correspondingly, in an embodiment of the present application, there is also provided a computer program product, which is used to implement the steps in the method for data distillation described in any of the above embodiments when the computer program product is executed by a processor of an electronic device.
Based on the same technical concept, the embodiment of the present application provides an electronic device for implementing the method for distilling data described in the above method embodiment. Fig. 7 is a hardware entity diagram of an electronic device according to an embodiment of the present application, as shown in fig. 7, the electronic device 700 includes a memory 710 and a processor 720, the memory 710 stores a computer program that can be executed on the processor 720, and the processor 720 executes the computer program to implement steps in any of the methods for data distillation according to the embodiments of the present application.
The Memory 710 is configured to store instructions and applications executable by the processor 720, and may also buffer data (e.g., image data, audio data, voice communication data, and video communication data) to be processed or already processed by the processor 720 and modules in the electronic device, and may be implemented by a FLASH Memory (FLASH) or a Random Access Memory (RAM).
Processor 720, when executing the program, performs the steps of any of the methods of data distillation described above. The processor 720 generally controls the overall operation of the electronic device 700.
The Processor may be at least one of an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a Central Processing Unit (CPU), a controller, a microcontroller, and a microprocessor. It is understood that the electronic device implementing the above-mentioned processor function may be other electronic devices, and the embodiments of the present application are not particularly limited.
The computer storage medium/Memory may be a Read Only Memory (ROM), a Programmable Read Only Memory (PROM), an Erasable Programmable Read Only Memory (EPROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a magnetic Random Access Memory (FRAM), a Flash Memory (Flash Memory), a magnetic surface Memory, an optical Disc, or a Compact Disc Read-Only Memory (CD-ROM), and the like; and may be various electronic devices such as mobile phones, computers, tablet devices, personal digital assistants, etc., including one or any combination of the above-mentioned memories.
Here, it should be noted that: the above description of the storage medium and device embodiments is similar to the description of the method embodiments above, with similar advantageous effects as the method embodiments. For technical details not disclosed in the embodiments of the storage medium and apparatus of the present application, reference is made to the description of the embodiments of the method of the present application for understanding.
It should be appreciated that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present application. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. It should be understood that, in the various embodiments of the present application, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application. The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the unit is only a logical functional division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms. The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; can be located in one place or distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiments of the present application.
In addition, all functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit. Alternatively, the integrated units described above in the present application may be stored in a computer-readable storage medium if they are implemented in the form of software functional modules and sold or used as independent products. Based on such understanding, the technical solutions of the embodiments of the present application may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing an automatic test line of a device to perform all or part of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a removable storage device, a ROM, a magnetic or optical disk, or other various media that can store program code. The methods disclosed in the several method embodiments provided in the present application may be combined arbitrarily without conflict to obtain the method embodiments. The features disclosed in several of the method or apparatus embodiments provided herein may be combined in any combination to yield a method embodiment or an apparatus embodiment without conflict. The above description is only for the embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (16)
1. A method of data distillation, the method comprising:
determining at least one batch of first distillation data to be trained; at least one mixed data including two data label information exists in each batch of the first distillation data;
determining at least two pre-training models; first statistical information of original data is stored in each pre-training model;
determining a batch normalized statistical loss of each batch of the first distillation data in the corresponding pre-trained model based on the first statistical information in each of the pre-trained models;
determining a target cross-entropy loss for each batch of the first distillation data in each of the pre-trained models based on an initialization tag for each batch of the first distillation data;
and carrying out back propagation training on each batch of the first distillation data based on batch normalized statistical loss and the target cross entropy loss of each batch of the first distillation data in each pre-training model to obtain target distillation data.
2. The method of claim 1, wherein determining a batch normalized statistical loss for each batch of the first distillation data in the respective pre-trained model based on the first statistical information in each of the pre-trained models comprises:
determining second statistical information for each batch of the first distillation data in each of the pre-training models;
for each of the pre-training models, determining a batch normalized statistical loss between the first statistical information and the second statistical information.
3. The method of claim 1 or 2, wherein the determining at least two pre-trained models comprises:
at least two different types of pre-trained models are randomly selected from a pre-trained model library.
4. The method of any of claims 1 to 3, wherein the back propagation training of each batch of the first distillation data based on a batch normalized statistical loss and the target cross-entropy loss of each batch of the first distillation data in each of the pre-trained models to obtain target distillation data comprises:
determining a first loss corresponding to each pre-training model based on the batch normalization statistical loss and the target cross entropy loss of each pre-training model;
averaging the first losses corresponding to each pre-trained model to obtain target losses of each batch of the first distillation data after passing through the at least two pre-trained models;
and performing back propagation training on each batch of the first distillation data based on the target loss to obtain the target distillation data.
5. The method of any one of claims 1 to 4, wherein determining at least one batch of first distillation data to be trained comprises:
initializing at least one batch of second distillation data based on the distribution information of the raw data;
in each iterative training, every two image data in each batch of the second distillation data are mixed to obtain each batch of the first distillation data.
6. The method of claim 5, wherein the blending each two image data of each batch of the second distillation data to obtain each batch of the first distillation data comprises:
randomly selecting at least one pair of a first image and a second image from each batch of the second distillation data;
reducing the size of each first image according to a specific scale;
and randomly overlaying the reduced first image to the corresponding second image to obtain each batch of first distillation data.
7. The method of claim 6, wherein overlaying the scaled-down first image onto the corresponding second image to obtain each batch of the first distillation data comprises:
randomly generating a mixed area to be covered in the corresponding second image according to the specific proportion;
randomly generating a binary mask based on a bounding box of the blending region;
and superposing each pixel value of the reduced first image and the corresponding pixel value in the second image through the binary mask to obtain each batch of the first distillation data.
8. The method of any one of claims 1 to 7, wherein each batch of the first distillation data comprises a composite image obtained by mixing a first image and a second image, and wherein determining the target cross-entropy loss of each batch of the first distillation data in each pre-training model based on the initialized label of each batch of the first distillation data comprises:
determining a hybrid cross-entropy loss for the composite image based on the initialization tag for the first image and the initialization tag for the second image;
determining cumulative cross-entropy loss for other images in the batch of first distillation data, except for the first image and the second image, based on initialization tags of the other images;
and determining the target cross entropy loss of each batch of the first distillation data after passing through each pre-training model based on the mixed cross entropy loss and the accumulated cross entropy loss.
9. The method of claim 8, wherein determining the hybrid cross-entropy loss for the composite image based on the initialization tag for the first image and the initialization tag for the second image comprises:
determining a first cross-entropy loss for each batch of the first distillation data after passing through each of the pre-trained models based on an initialization label for the first image;
determining a second cross-entropy loss for each batch of the first distillation data after passing through each of the pre-trained models based on the initialization label of the second image;
and linearly summing the first cross entropy loss and the second cross entropy loss based on the synthesis proportion between the first image and the second image to obtain the mixed cross entropy loss of the synthesized image.
10. The method of any of claims 1 to 9, wherein the first statistical information comprises a mean and a variance, each of the pre-trained models comprises at least two component blocks, each of the component blocks comprises a convolutional layer and a batch normalization layer, the method further comprising:
extracting a mean and a variance of the raw data from a batch normalization layer of each of the component blocks; the mean and variance are obtained by counting the characteristics of the original data extracted by the convolutional layer;
and determining first statistical information in each pre-training model based on the mean and the variance of the at least two component blocks in each pre-training model.
11. The method of claim 5, wherein the distribution information is gaussian distribution information, and wherein initializing at least one batch of second distillation data based on the distribution information of the raw data comprises:
acquiring Gaussian distribution information of the original data;
initializing at least N initial pixel value matrixes based on data randomly sampled from the Gaussian distribution information to obtain each batch of the second distillation data; n is an integer of 2 or more.
12. The method of claim 4, wherein the back propagation training of each batch of the first distillation data based on the target loss to obtain the target distillation data comprises:
determining a gradient of the target loss for the first distillation data in each iterative training in a back propagation process;
updating each of the first distillation data based on the gradient;
and obtaining the target distillation data when the batch normalization statistical loss and the target cross entropy loss of the updated first distillation data in each pre-training model converge to stable values.
13. The method of any of claims 1 to 12, further comprising:
and performing model compression on each pre-training model when the number of at least one batch of the target distillation data reaches a specific threshold value.
14. An apparatus for data distillation, comprising a first determination module, a second determination module, a third determination module, a fourth determination module, and a training module, wherein:
the first determining module is used for determining at least one batch of first distillation data to be trained; at least one mixed data including two data label information exists in each batch of the first distillation data;
the second determining module is used for determining at least two pre-training models; first statistical information of original data is stored in each pre-training model;
the third determining module is used for determining batch normalized statistical loss of each batch of the first distillation data in the corresponding pre-training model based on the first statistical information in each pre-training model;
the fourth determination module is used for determining target cross entropy loss of each batch of the first distillation data in each pre-training model based on the initialization tag of each batch of the first distillation data;
the training module is used for carrying out back propagation training on each batch of the first distillation data based on batch normalization statistical loss and the target cross entropy loss of each batch of the first distillation data in each pre-training model to obtain target distillation data.
15. An electronic device comprising a memory and a processor, the memory storing a computer program operable on the processor, wherein the processor implements the steps of the method of any one of claims 1 to 13 when executing the program.
16. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 13.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110994122.7A CN113762368A (en) | 2021-08-27 | 2021-08-27 | Method, device, electronic equipment and storage medium for data distillation |
PCT/CN2022/071121 WO2023024406A1 (en) | 2021-08-27 | 2022-01-10 | Data distillation method and apparatus, device, storage medium, computer program, and product |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110994122.7A CN113762368A (en) | 2021-08-27 | 2021-08-27 | Method, device, electronic equipment and storage medium for data distillation |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113762368A true CN113762368A (en) | 2021-12-07 |
Family
ID=78791493
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110994122.7A Pending CN113762368A (en) | 2021-08-27 | 2021-08-27 | Method, device, electronic equipment and storage medium for data distillation |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN113762368A (en) |
WO (1) | WO2023024406A1 (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2023024406A1 (en) * | 2021-08-27 | 2023-03-02 | 上海商汤智能科技有限公司 | Data distillation method and apparatus, device, storage medium, computer program, and product |
CN116630724A (en) * | 2023-07-24 | 2023-08-22 | 美智纵横科技有限责任公司 | Data model generation method, image processing method, device and chip |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116486285B (en) * | 2023-03-15 | 2024-03-19 | 中国矿业大学 | Aerial image target detection method based on class mask distillation |
CN117576518B (en) * | 2024-01-15 | 2024-04-23 | 第六镜科技(成都)有限公司 | Image distillation method, apparatus, electronic device, and computer-readable storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111738436A (en) * | 2020-06-28 | 2020-10-02 | 电子科技大学中山学院 | Model distillation method and device, electronic equipment and storage medium |
CN111950638A (en) * | 2020-08-14 | 2020-11-17 | 厦门美图之家科技有限公司 | Image classification method and device based on model distillation and electronic equipment |
CN112766463A (en) * | 2021-01-25 | 2021-05-07 | 上海有个机器人有限公司 | Method for optimizing neural network model based on knowledge distillation technology |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11568306B2 (en) * | 2019-02-25 | 2023-01-31 | Salesforce.Com, Inc. | Data privacy protected machine learning systems |
CN112446476A (en) * | 2019-09-04 | 2021-03-05 | 华为技术有限公司 | Neural network model compression method, device, storage medium and chip |
CN111008693B (en) * | 2019-11-29 | 2024-01-26 | 小米汽车科技有限公司 | Network model construction method, system and medium based on data compression |
CN111860572B (en) * | 2020-06-04 | 2024-01-26 | 北京百度网讯科技有限公司 | Data set distillation method, device, electronic equipment and storage medium |
CN113762368A (en) * | 2021-08-27 | 2021-12-07 | 北京市商汤科技开发有限公司 | Method, device, electronic equipment and storage medium for data distillation |
- 2021-08-27: CN application CN202110994122.7A filed (CN113762368A, status: Pending)
- 2022-01-10: WO application PCT/CN2022/071121 filed (WO2023024406A1, Application Filing)
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111738436A (en) * | 2020-06-28 | 2020-10-02 | 电子科技大学中山学院 | Model distillation method and device, electronic equipment and storage medium |
CN111950638A (en) * | 2020-08-14 | 2020-11-17 | 厦门美图之家科技有限公司 | Image classification method and device based on model distillation and electronic equipment |
CN112766463A (en) * | 2021-01-25 | 2021-05-07 | 上海有个机器人有限公司 | Method for optimizing neural network model based on knowledge distillation technology |
Non-Patent Citations (1)
Title |
---|
YUHANG LI et al.: "MixMix: All You Need for Data-Free Compression Are Feature and Data Mixing", arXiv *
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2023024406A1 (en) * | 2021-08-27 | 2023-03-02 | 上海商汤智能科技有限公司 | Data distillation method and apparatus, device, storage medium, computer program, and product |
CN116630724A (en) * | 2023-07-24 | 2023-08-22 | 美智纵横科技有限责任公司 | Data model generation method, image processing method, device and chip |
CN116630724B (en) * | 2023-07-24 | 2023-10-10 | 美智纵横科技有限责任公司 | Data model generation method, image processing method, device and chip |
Also Published As
Publication number | Publication date |
---|---|
WO2023024406A1 (en) | 2023-03-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113762368A (en) | Method, device, electronic equipment and storage medium for data distillation | |
Wang et al. | Exploring linear relationship in feature map subspace for convnets compression | |
US11514694B2 (en) | Teaching GAN (generative adversarial networks) to generate per-pixel annotation | |
Messaoud et al. | Structural consistency and controllability for diverse colorization | |
CN111523546A (en) | Image semantic segmentation method, system and computer storage medium | |
CN109410261A (en) | Monocular image depth estimation method based on pyramid pond module | |
CN113222123B (en) | Model training method, device, equipment and computer storage medium | |
CN106203487A (en) | A kind of image classification method based on Multiple Kernel Learning Multiple Classifier Fusion and device | |
Ji et al. | Colorformer: Image colorization via color memory assisted hybrid-attention transformer | |
CN113902921A (en) | Image processing method, device, equipment and storage medium | |
CN112801104B (en) | Image pixel level pseudo label determination method and system based on semantic segmentation | |
CN111242068B (en) | Behavior recognition method and device based on video, electronic equipment and storage medium | |
CN113111716A (en) | Remote sensing image semi-automatic labeling method and device based on deep learning | |
CN114821058A (en) | Image semantic segmentation method and device, electronic equipment and storage medium | |
CN113096001A (en) | Image processing method, electronic device and readable storage medium | |
CN114419406A (en) | Image change detection method, training method, device and computer equipment | |
Klingner et al. | Continual BatchNorm adaptation (CBNA) for semantic segmentation | |
Yuan et al. | Fully extracting feature correlation between and within stages for semantic segmentation | |
Reddy et al. | Effect of image colourspace on performance of convolution neural networks | |
CN113505838B (en) | Image clustering method and device, electronic equipment and storage medium | |
CN111325068A (en) | Video description method and device based on convolutional neural network | |
CN114332481A (en) | Blind-end element extraction and spectrum unmixing method based on nonnegative sparse self-encoder | |
WO2021179117A1 (en) | Method and apparatus for searching number of neural network channels | |
CN113496228A (en) | Human body semantic segmentation method based on Res2Net, TransUNet and cooperative attention | |
CN113177462A (en) | Target detection method suitable for court trial monitoring |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
REG | Reference to a national code | Ref country code: HK; Ref legal event code: DE; Ref document number: 40065243; Country of ref document: HK |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20211207 |
RJ01 | Rejection of invention patent application after publication | |