WO2023024406A1 - Data distillation method and apparatus, device, storage medium, computer program, and product - Google Patents

Data distillation method and apparatus, device, storage medium, computer program, and product

Info

Publication number
WO2023024406A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
batch
distillation
training
image
Prior art date
Application number
PCT/CN2022/071121
Other languages
French (fr)
Chinese (zh)
Inventor
李雨杭
龚睿昊
沈明珠
余锋伟
路少卿
Original Assignee
上海商汤智能科技有限公司
Priority date
Filing date
Publication date
Application filed by 上海商汤智能科技有限公司
Publication of WO2023024406A1 publication Critical patent/WO2023024406A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/082Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections

Definitions

  • The present disclosure relates to the field of computer vision, and in particular to, but not limited to, a data distillation method, apparatus, device, storage medium, computer program, and product.
  • the compression of neural networks usually requires the original training data, because the compressed model generally needs to be trained to restore the previous performance.
  • the original data is private in some cases, that is, the original data faces the risk of being unobtainable.
  • Embodiments of the present disclosure provide a data distillation method, device, device, storage medium, computer program, and product.
  • an embodiment of the present disclosure provides a data distillation method, including: determining at least one batch of first distillation data to be trained, where each batch of the first distillation data contains at least one mixed data item carrying label information of two kinds of data; determining at least two pre-training models, where each of the pre-training models stores first statistical information of the original data; determining, based on the first statistical information of each of the pre-training models, the batch normalized statistical loss of each batch of the first distillation data in the corresponding pre-training model; determining, based on the initialization label of each data item in each batch of the first distillation data, the target cross-entropy loss of each batch of the first distillation data in each of the pre-training models; and performing backpropagation training on each batch of the first distillation data based on the batch normalized statistical loss and the target cross-entropy loss of that batch in each of the pre-training models, to obtain target distillation data.
  • an embodiment of the present disclosure provides a data distillation device, where the device includes a first determination module, a second determination module, a third determination module, a fourth determination module, and a training module, wherein:
  • the first determination module is configured to determine at least one batch of first distillation data to be trained, where each batch of the first distillation data contains at least one mixed data item carrying label information of two kinds of data; the second determination module is configured to determine at least two pre-training models, where each of the pre-training models stores first statistical information of the original data; the third determination module is configured to determine, based on the first statistical information of each of the pre-training models, the batch normalized statistical loss of each batch of the first distillation data in the corresponding pre-training model; the fourth determination module is configured to determine, based on the initialization label of each data item in each batch of the first distillation data, the target cross-entropy loss of each batch of the first distillation data in each of the pre-training models; the training module is configured to perform backpropagation training on each batch of the first distillation data based on the batch normalized statistical loss and the target cross-entropy loss of that batch in each of the pre-training models, to obtain the target distillation data.
  • an embodiment of the present disclosure provides an electronic device, including a memory and a processor, where the memory stores a computer program executable on the processor, and the processor implements the steps of the above data distillation method when executing the program.
  • an embodiment of the present disclosure provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the steps in the above data distillation method are implemented.
  • the present disclosure further provides a computer program, including computer-readable code; when the computer-readable code runs in an electronic device, a processor in the electronic device executes the steps of the data distillation method in the first aspect above.
  • the present disclosure further provides a computer program product, the computer program product including one or more instructions, and the one or more instructions are suitable for being loaded by a processor and executing the steps in the first aspect above.
  • In the embodiments of the present disclosure, at least one batch of first distillation data to be trained is determined; then at least two pre-training models are determined; the batch normalized statistical loss of each batch of the first distillation data in the corresponding pre-training model is determined based on the first statistical information in each of the pre-training models; the target cross-entropy loss of each batch of the first distillation data in each of the pre-training models is determined based on the initialization label of each data item in each batch of the first distillation data; finally, backpropagation training is performed on each batch of the first distillation data based on the batch normalized statistical loss and the target cross-entropy loss of that batch in each of the pre-training models, to obtain the target distillation data. In this way, robust visual information is generated through data mixing, and at the same time the features of the original data stored in multiple pre-training models are used, so that the trained target distillation data can match the feature distributions of multiple models; the target distillation data is therefore more versatile and more effective.
  • FIG. 1 is a schematic flowchart of a data distillation method provided by an embodiment of the present disclosure
  • FIG. 2 is a schematic flow chart of a data distillation method provided by an embodiment of the present disclosure
  • FIG. 3 is a schematic flowchart of a data distillation method provided by an embodiment of the present disclosure
  • FIG. 4A is a schematic flow diagram of a data distillation method provided by an embodiment of the present disclosure.
  • FIG. 4B is a schematic diagram of a data mixing method provided by an embodiment of the present disclosure.
  • FIG. 5A is a logic flow diagram of a data distillation method provided by an embodiment of the present disclosure.
  • FIG. 5B is an algorithm framework combining data mixing and feature mixing provided by an embodiment of the present disclosure.
  • FIG. 5C is a schematic diagram of a training process combining data mixing and feature mixing provided by an embodiment of the present disclosure
  • FIG. 6 is a schematic diagram of the composition and structure of a data distillation device provided by an embodiment of the present disclosure.
  • FIG. 7 is a schematic diagram of a hardware entity of an electronic device provided by an embodiment of the present disclosure.
  • The terms "first/second/third" involved in the embodiments of the present disclosure are only used to distinguish similar objects and do not represent a specific ordering of objects. It can be understood that, where permitted, "first/second/third" may be interchanged in a specific order or sequence, so that the embodiments of the present disclosure described herein can be implemented in an order other than that illustrated or described herein.
  • Artificial Intelligence is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
  • artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can respond in a similar way to human intelligence.
  • Artificial intelligence is to study the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
  • Artificial intelligence technology is a comprehensive subject that involves a wide range of fields, including both hardware-level technology and software-level technology.
  • Artificial intelligence basic technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics.
  • Artificial intelligence software technology mainly includes several major directions such as computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
  • Machine Learning is a multi-field interdisciplinary subject, involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in studying how computers simulate or implement human learning behaviors to acquire new knowledge or skills, and how they reorganize existing knowledge structures to continuously improve their performance.
  • Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and its application pervades all fields of artificial intelligence.
  • Machine learning and deep learning usually include techniques such as artificial neural network, belief network, reinforcement learning, transfer learning, inductive learning, and teaching learning.
  • Dataset distillation (Dataset Distillation) is different from knowledge distillation, which transfers knowledge from a complex network to a simpler model. Dataset distillation keeps the model unchanged and compresses the knowledge of a large original dataset (usually containing thousands or millions of images) into a small amount of synthetic data; the performance of a model trained on the synthetic data is almost the same as that of a model trained on the original dataset.
  • the distillation data used in the data-free compression method provided in the related art is trained from a pre-training model through distillation.
  • This pre-training model stores some characteristics of the original training data, which reflect the distribution properties of the original training data. Therefore, when an arbitrarily initialized picture is input to the pre-training model, it can be observed whether the features of the distilled picture output by the model are consistent with those of the original training pictures. By minimizing the difference between these two sets of features (a process that can be called data distillation), a distilled picture that is very close to the original training pictures can be obtained.
  • the distilled images can continue to be used to train the model to improve the performance of the compressed model.
  • the data distillation process can effectively get rid of the dependence on the original training data in the model compression process.
  • the generated distillation data is not universal.
  • the original training data can be used to compress any pre-trained model, however, the distilled data generated by pre-trained model A is difficult to be used to compress pre-trained model B, and vice versa.
  • the generated distillation data is inaccurate. Generating distillation data from the pre-trained model is equivalent to solving the problem of the inverse function.
  • the neural network is highly irreversible and highly non-convex, resulting in inaccurate distillation data generated, which is quite different from the original training pictures.
  • An embodiment of the present disclosure provides a method for data distillation, which is applied to an electronic device.
  • the electronic devices include, but are not limited to, mobile phones, laptop computers, tablet computers and PDAs, multimedia devices, streaming media devices, mobile Internet devices, wearable devices or other types of devices.
  • the functions realized by the method can be realized by calling the program codes by the processor in the electronic device, and of course the program codes can be stored in the computer storage medium.
  • the electronic device at least includes a processor and a storage medium.
  • the processor can be used to process the process of generating distillation data
  • the memory can be used to store intermediate data required in the process of generating distillation data and generated target distillation data.
  • Fig. 1 is a schematic flow chart of a data distillation method provided by an embodiment of the present disclosure. As shown in Fig. 1, the method at least includes the following steps:
  • Step S110 determining at least one batch of first distillation data to be trained
  • the at least one batch of first distillation data to be trained refers to one or more batches of data images
  • the batch size is a hyperparameter used to define the number of samples processed by the pre-training model in each training pass.
  • In each batch of the first distillation data, there is at least one mixed data item containing label information of two kinds of data.
  • the mixed data is a composite image obtained by mixing two images, and the composite image carries all label information of the original two images.
  • the first distillation data to be trained may be in the form of a matrix, and the pixel values in the matrix may be randomly initialized through the distribution information of the original data. The specific values of the rows and columns of the matrix can also be determined according to actual needs. In this way, the first distillation data to be trained can be obtained quickly and conveniently, thereby providing a good basis for subsequent processing.
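  • As an illustration only, a minimal sketch of this initialization in PyTorch, assuming ImageNet-style channel statistics, a 224×224 resolution, 1000 classes, and the helper name init_distillation_batch (all example values and names, not requirements of the embodiment):

```python
import torch

def init_distillation_batch(batch_size=32, num_classes=1000,
                            mean=(0.485, 0.456, 0.406),
                            std=(0.229, 0.224, 0.225),
                            height=224, width=224):
    """Initialize one batch of distillation data (pixel-value matrices) by sampling
    from a Gaussian that mimics the original data's channel statistics, together
    with randomly assigned initialization labels (cf. the 1-1000 labels mentioned below)."""
    mean = torch.tensor(mean).view(1, 3, 1, 1)
    std = torch.tensor(std).view(1, 3, 1, 1)
    # N(mean, std^2) per channel; requires_grad so the pixels can be optimized later
    data = (torch.randn(batch_size, 3, height, width) * std + mean).requires_grad_(True)
    # one random label per image, fixed during the optimization
    labels = torch.randint(0, num_classes, (batch_size,))
    return data, labels

data, labels = init_distillation_batch()
```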
  • Step S120 determining at least two pre-training models
  • A pre-trained model means that, at the beginning of training, the model can be loaded with model parameters from a model having the same network structure, for the purpose of feature transfer, so that the resulting model is better.
  • the pre-training model includes a convolutional neural network (Convolutional Neural Network, CNN), a recursive neural network (Recursive Neural Network, RNN), a recurrent neural network (Recurrent Neural Network, RNN), a deep neural network (Deep Neural Network, DNN), etc., which is not limited in this embodiment of the present disclosure.
  • Each of the pre-trained models stores first statistical information of the original data, such as statistical values including the mean, standard deviation, and variance.
  • the mean is the expected value of the normal distribution curve, which determines the position of the original data distribution
  • the variance is the square of the standard deviation
  • the standard deviation determines the magnitude of the original data distribution.
  • For a pre-training model, the distribution features of its training data are encoded in its unique feature space. Therefore, in the process of training the pre-training model with the original data, the first statistical information of the original data is stored in the batch normalization layer of each component block of the pre-training model.
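  • For illustration, a small sketch of how this stored first statistical information (the running mean and variance of each batch normalization layer) can be read out of a pre-trained model; the use of a torchvision ResNet50 here is only an example, not part of the embodiment:

```python
import torch
import torchvision

def collect_bn_statistics(model):
    """Collect the first statistical information (running mean and variance of the
    original training data) stored in every batch normalization layer of a model."""
    stats = []
    for module in model.modules():
        if isinstance(module, torch.nn.BatchNorm2d):
            stats.append((module.running_mean.detach().clone(),
                          module.running_var.detach().clone()))
    return stats

# example: a pre-trained ImageNet classifier (requires torchvision >= 0.13)
model = torchvision.models.resnet50(weights="IMAGENET1K_V1")
bn_stats = collect_bn_statistics(model)
print(len(bn_stats), "BN layers with stored (mean, variance) pairs")
```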
  • Step S130 based on the first statistical information in each pre-training model, determine the batch normalized statistical loss of each batch of the first distillation data in the corresponding pre-training model;
  • the difference between the first distillation data and the original data in the corresponding pre-training model is determined.
  • In this way, the statistics of the first distillation data can be matched against the statistical features of multiple pre-training models, and the general feature spaces of the pre-training models can be combined, so that the target distillation data obtained by training is more general than single-model distillation data.
  • the mean and variance calculated by the batch normalization layer in the pre-training model are based on all the first distillation data included in the current batch.
  • Step S140 based on the initialization label of each data in each batch of the first distillation data, determine the target cross-entropy loss of each batch of the first distillation data in each of the pre-training models;
  • the initialization label is randomly assigned when the first distillation data is initialized using the Gaussian distribution of the original data; for example, an arbitrary value from 1 to 1000 is set as the initialization label for each data item in a batch of first distillation data.
  • the initialization tags of different data may be the same or different.
  • a target cross-entropy loss in each of the pre-trained models may be determined for a batch of the first distilled data.
  • Step S150 based on the batch normalized statistical loss and the target cross-entropy loss of each batch of the first distillation data in each of the pre-training models, performing backpropagation training on each batch of the first distillation data to obtain target distillation data.
  • the target distillation data is a distillation picture that is very close to the original training picture.
  • the target distillation data which encodes the discriminative features of each category.
  • the gradient of the target loss with respect to the first distillation data is calculated, and a gradient-descent update such as stochastic gradient descent (Stochastic Gradient Descent, SGD) is performed, so that the first distillation data is gradually optimized; after about 20,000 training iterations, the target distillation data is finally obtained.
  • In the embodiments of the present disclosure, at least one batch of first distillation data to be trained is determined; then at least two pre-training models are determined; the batch normalized statistical loss of each batch of the first distillation data in the corresponding pre-training model is determined based on the first statistical information in each of the pre-training models; the target cross-entropy loss of each batch of the first distillation data in each of the pre-training models is determined based on the initialization label of each data item in each batch of the first distillation data; finally, backpropagation training is performed on each batch of the first distillation data based on the batch normalized statistical loss and the target cross-entropy loss of that batch in each of the pre-training models, to obtain the target distillation data. In this way, robust visual information is generated through data mixing, and at the same time the features of the original data stored in multiple pre-training models are used, so that the trained target distillation data can match the feature distributions of multiple models; the target distillation data is therefore more versatile and more effective.
  • Fig. 2 is a schematic flow chart of a data distillation method provided by an embodiment of the present disclosure. As shown in Fig. 2, the method at least includes the following steps:
  • Step S210 determining at least one batch of first distillation data to be trained
  • Step S220 randomly selecting at least two different types of pre-training models from the pre-training model library
  • the first statistical information of the original data is stored in each pre-training model.
  • Different types of pre-trained models have different network structures.
  • the feature mixing method in the embodiments of the present disclosure requires feature information from multiple models, but too many models will slow down the optimization. Therefore, for each batch of first distillation data to be optimized, only a small subset of the models is sampled for feature mixing. Experiments show that three models provide the best balance between speed and effect.
  • Step S230 determining the second statistical information of each batch of the first distillation data in each of the pre-training models
  • For a pre-training model, the distribution features of its training data are encoded in its unique feature space, namely the batch normalization (Batch Normalization, BN) layers. Therefore, the first distillation data can reasonably simulate the feature distribution of the original images in the pre-trained network.
  • the activation mean and variance are extracted from the batch normalization layer of each pre-training model as the second statistical information of the first distillation data in the corresponding pre-training model.
  • Step S240 determining a batch normalized statistical loss between the first statistical information and the second statistical information for each of the pre-trained models
  • In this way, the batch normalized statistical loss for each pre-training model is determined by matching the second statistical information of the first distillation data in each pre-training model with the first statistical information stored in the corresponding pre-training model.
  • the first statistical information includes mean and variance
  • each of the pre-training models includes at least two building blocks
  • each of the building blocks includes a convolutional layer and a batch normalization layer, and the first statistical information in each of the pre-training models can be determined through the following steps: first, the mean and variance of the raw data are extracted from the batch normalization layer of each of the constituent blocks; the mean and variance are obtained statistically from the features of the original data extracted by the convolutional layer; then, based on the means and variances of the at least two constituent blocks in each of the pre-training models, the first statistical information in each of the pre-training models is determined.
  • In this way, the first statistical information of the corresponding pre-training model is determined, which ensures that the target distillation data obtained by training can reasonably simulate the activation distribution of the pre-trained network on the raw data.
  • the batch normalization statistical loss L_BN for each of the pre-trained models can be determined by the following formula (1):

    L_BN = Σ_i ( ‖μ_i(X) − μ_i^BN‖₂ + ‖σ²_i(X) − σ²_i^BN‖₂ )    (1)

  • where μ_i(X) and σ²_i(X) are respectively the mean and variance of the first distillation data X in the i-th component block, and μ_i^BN and σ²_i^BN are respectively the running mean and variance stored in the BN layer of the corresponding pre-trained model; ‖·‖₂ denotes the L2 norm, that is, the square root of the sum of the squares of each component of the difference between the two vectors.
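  • One common way to compute such a loss in practice is with forward hooks that read the batch statistics of the features entering every BN layer and compare them with the stored running statistics; the sketch below follows the form of formula (1) but is an illustrative assumption, not the patent's reference code:

```python
import torch
import torch.nn as nn

class BNStatsLoss:
    """Accumulate, per BN layer, the L2 distance between the batch statistics of the
    current (distillation) input and the running statistics stored in the layer."""
    def __init__(self, model):
        self.diffs = []
        self.hooks = [m.register_forward_hook(self._hook)
                      for m in model.modules() if isinstance(m, nn.BatchNorm2d)]

    def _hook(self, module, inputs, output):
        x = inputs[0]                                # features entering the BN layer
        mean = x.mean(dim=(0, 2, 3))                 # per-channel batch mean
        var = x.var(dim=(0, 2, 3), unbiased=False)   # per-channel batch variance
        self.diffs.append(torch.norm(mean - module.running_mean, 2) +
                          torch.norm(var - module.running_var, 2))

    def compute(self):
        loss = torch.stack(self.diffs).sum()         # sum over all BN layers, as in (1)
        self.diffs = []                              # reset for the next forward pass
        return loss

    def remove(self):
        for h in self.hooks:
            h.remove()
```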
  • Step S250 based on the initialization label of each data in each batch of the first distillation data, determine the target cross-entropy loss of each batch of the first distillation data in each of the pre-training models;
  • the target cross-entropy loss is the total loss output by the last layer of the pre-trained model.
  • the target cross-entropy loss L_CE of a batch of the first distillation data in each of the pre-training models can be determined by the following formula (2):

    L_CE = −(1/m) Σ_{i=1}^{m} y_i · log(ŷ_i)    (2)

  • where m represents the number of data items in a batch of first distillation data, that is, the batch size; y_i is the initialization label of each data item, and ŷ_i is the predicted label output by the pre-trained model.
  • the label of the synthesized image is a mixed label determined based on the labels of the original two images.
  • Step S260 based on the batch normalized statistical loss of each pre-training model and the target cross-entropy loss, determine the first loss corresponding to the corresponding pre-training model;
  • the calculated batch normalization statistical loss and the target cross-entropy loss are linearly summed to obtain the first loss corresponding to the corresponding pre-training model.
  • Step S270 calculating the average value of the first loss corresponding to each of the pre-training models to obtain the target loss of each batch of the first distillation data passing through the at least two pre-training models;
  • the first losses corresponding to all the pre-training models are first summed, averaged, and then minimized to obtain the target loss of each batch of the first distillation data passing through the at least two pre-trained models.
  • the above target loss can be calculated by the following formula (3), which aims to optimize each batch of first distillation data by combining feature mixing and data mixing:
  • a prior loss may also be applied to the first distillation data to ensure that the image is generally smooth, where the prior loss is defined as the mean square error between the first distillation data and its Gaussian-blurred version. Therefore, the final minimization objective, i.e. the target loss, is determined by combining the batch normalized statistical loss, the target cross-entropy loss, and the prior loss.
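  • A plausible written-out form of this minimization objective, consistent with the description above (the linear-combination coefficients are taken as 1 and the prior-loss weight α is an assumed symbol, since formula (3) itself is not reproduced in this text):

```latex
\mathcal{L}_{\text{target}}
  = \frac{1}{K} \sum_{k=1}^{K} \left( \mathcal{L}_{BN}^{(k)} + \mathcal{L}_{CE}^{(k)} \right)
  + \alpha \, \mathcal{L}_{\text{prior}},
\qquad
\mathcal{L}_{\text{prior}} = \left\lVert X - \operatorname{GaussianBlur}(X) \right\rVert_2^2
```

  • Here K is the number of sampled pre-training models (three in the described experiments), and L_BN^(k) and L_CE^(k) are the batch normalization statistical loss and the target cross-entropy loss of the batch in the k-th model.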
  • the target loss of the first distillation data passing through the at least two pre-training models is determined by matching the batch normalized statistical loss and the target cross-entropy loss between the features of the first distillation data and the plurality of pre-training models.
  • the general feature space of each pre-trained model is combined by determining the target loss, so that the target distillation data obtained by training is more general than the single-model distillation data.
  • Step S280 based on the target loss, perform backpropagation training on each batch of the first distillation data to obtain the target distillation data.
  • each batch of data will go through 20,000 iterations from initialization to the end of training.
  • Stochastic gradient descent is a simple but very effective method, which is mostly used for learning linear classifiers under convex loss functions such as support vector machines and logistic regression.
  • step S280 can be implemented through the following process: in each iterative training, determine the gradient of the target loss with respect to the first distillation data; update each data item in the first distillation data based on the gradient; when the batch normalization statistical loss and the target cross-entropy loss of the updated first distillation data in each of the pre-training models converge to a stable value, the target distillation data is obtained.
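  • A condensed sketch of this optimization loop, reusing the helper names from the earlier sketches (init_distillation_batch, BNStatsLoss) and assumed hyper-parameters such as the learning rate; it illustrates the averaging over models and the gradient-descent update of the pixels, and is not the patent's reference implementation:

```python
import torch
import torch.nn.functional as F

def distill_batch(models, data, labels, iterations=20000, lr=0.05):
    """Optimize one batch of distillation data by back-propagating a target loss built
    from the BN statistical loss and the cross-entropy loss, averaged over the models."""
    models = [m.eval() for m in models]            # keep running BN statistics frozen
    bn_losses = [BNStatsLoss(m) for m in models]   # forward hooks on every BN layer
    optimizer = torch.optim.SGD([data], lr=lr, momentum=0.9)

    for _ in range(iterations):
        optimizer.zero_grad()
        total = 0.0
        for model, bn_loss in zip(models, bn_losses):
            logits = model(data)                       # forward pass fills the hooks
            loss_ce = F.cross_entropy(logits, labels)  # target cross-entropy loss
            total = total + bn_loss.compute() + loss_ce  # first loss for this model
        (total / len(models)).backward()               # average over models, then backprop
        optimizer.step()                               # gradient-descent update of the pixels

    for bn_loss in bn_losses:                          # clean up the hooks
        bn_loss.remove()
    return data.detach()
```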
  • the stochastic gradient descent method is used to optimize the first distillation data during the backpropagation process, and the iterative training process of each batch of first distillation data is accelerated.
  • For each batch of first distillation data to be trained, at least two different types of pre-training models are randomly sampled from the pre-training model library, and the features of the first distillation data are matched with those of multiple pre-training models at the same time.
  • the general feature space of each pre-training model is combined, so that the distilled data can match the feature distribution of any pre-training model, so that a better balance between training speed and effect can be obtained.
  • Fig. 3 is a schematic flow chart of a method for data distillation provided by an embodiment of the present disclosure. As shown in Fig. 3, the method at least includes the following steps:
  • Step S310 based on the distribution information of the original data, determine at least one batch of second distillation data
  • the original data can be images in a general data set such as the ImageNet data set (for example, the data used to pre-train models such as ResNet50 and MobileNet V2).
  • the distribution information of the raw data may be a normal distribution (Normal Distribution), also known as a Gaussian distribution (Gaussian Distribution).
  • the distribution curve is bell-shaped, that is, the two ends are low and the middle is high, and the left and right sides are symmetrical.
  • If the random variable X obeys a normal distribution with mathematical expectation μ and variance σ², it is denoted as N(μ, σ²).
  • The expected value μ of the normal distribution determines the position of its probability density function, and its standard deviation σ determines the magnitude (spread) of the distribution.
  • the distribution information is Gaussian distribution information, including the mean value, variance, etc. of the original data.
  • the mean and standard deviation of the three color channels are different.
  • the mean value of each image in the ImageNet dataset is [0.485, 0.456, 0.406]
  • the standard deviation is [0.229, 0.224, 0.225].
  • the above step S310 can be implemented through the following process: obtaining the Gaussian distribution information of the original data; based on data randomly sampled from the Gaussian distribution information, initializing at least N initial pixel value matrices to obtain each batch of second distillation data; N is an integer greater than or equal to 2.
  • the data is randomly sampled from the Gaussian distribution information of the original data, and initialized to obtain the second distillation data, which makes up for the inability to directly obtain the original data.
  • Step S320 determine at least two pre-training models
  • the first statistical information of the original data is stored in each pre-training model.
  • Step S330 in each iterative training, mix every two image data in each batch of second distillation data to obtain a batch of first distillation data;
  • In this way, any two image data are randomly selected for data mixing, so that the mixed image contains the information of both image data; the generated target distillation data is therefore more robust, carries more visual information, and can adapt to different scales and spatial orientations. The generated target distillation data is thus more effective for model compression.
  • In the embodiments of the present disclosure, step S320 and step S330 may be executed in either order, or executed at the same time.
  • Step S340 based on the first statistical information in each pre-training model, determine the batch normalized statistical loss of each batch of the first distillation data in the corresponding pre-training model;
  • Step S350 based on the initialization label of each data in each batch of the first distillation data, determine the target cross-entropy loss of each batch of the first distillation data in each of the pre-training models;
  • each batch of first distillation data includes a composite image obtained by mixing the first image and the second image. That is to say, a pair of the first image and the second image is randomly selected from a batch of second distillation data and mixed to obtain a composite image, wherein the first image and the second image respectively have corresponding initialization labels.
  • the target cross-entropy loss of each batch of the first distillation data in each of the pre-training models can be determined by the following steps:
  • Step S3501 based on the initialization label of the first image and the initialization label of the second image, determine the hybrid cross-entropy loss of the composite image;
  • the hybrid cross-entropy loss of the synthetic image is determined by the respective initialization labels of the first image and the second image, which can ensure that the information of the original first image and the second image can be recognized and correctly optimized by the pre-trained model.
  • Step S3502 based on the initialization labels of other images in each batch of first distillation data except the first image and the second image, determine the cumulative cross-entropy loss of the other images;
  • the process of calculating the cumulative cross-entropy loss may refer to formula (2), except that the value of the variable i is other images than the first image and the second image.
  • Step S3503 based on the mixed cross-entropy loss and the cumulative cross-entropy loss, determine the target cross-entropy loss of each batch of the first distillation data after passing through each of the pre-trained models.
  • In this way, the mixed cross-entropy loss of the synthetic image in a batch of first distillation data and the cumulative cross-entropy loss of the other images are summed and then minimized to obtain the target cross-entropy loss of each batch of the first distillation data after passing through each of the pre-trained models.
  • the above step S3501 can be implemented through the following process: based on the initialization label of the first image, determine the first cross-entropy loss after each batch of the first distillation data passes through each of the pre-training models ; Based on the initialization label of the second image, determine the second cross-entropy loss after each batch of the first distillation data passes through each of the pre-training models; based on the difference between the first image and the second image Combining ratios, linearly summing the first cross-entropy loss and the second cross-entropy loss to obtain the mixed cross-entropy loss.
  • the above mixed cross-entropy loss can be determined by the following formula (4):

    L_mix = λ · L_CE^(1) + (1 − λ) · L_CE^(2)    (4)

  • where Y_2 and Y_1 are the initialization labels of the first image and the second image before mixing, respectively; L_CE^(1) and L_CE^(2) are respectively the first cross-entropy loss corresponding to the first image (computed with Y_2) and the second cross-entropy loss corresponding to the second image (computed with Y_1); and λ is the proportional coefficient between the area of the mixed region and the area of the second image.
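  • For illustration, the mixed cross-entropy loss can be written as a small helper in the spirit of formula (4); the function name and argument order are assumptions:

```python
import torch.nn.functional as F

def mixed_cross_entropy(logits, label_first, label_second, lam):
    """Mixed cross-entropy loss for a composite image: lam is the ratio of the pasted
    (mixed) region's area to the second image's area."""
    loss_first = F.cross_entropy(logits, label_first)    # w.r.t. the first image's label
    loss_second = F.cross_entropy(logits, label_second)  # w.r.t. the second image's label
    return lam * loss_first + (1.0 - lam) * loss_second
```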
  • Step S360 based on the batch normalized statistical loss and the target cross-entropy loss of each pre-trained model, perform backpropagation training on each batch of the first distillation data to obtain target distillation data.
  • In this way, the first loss of a batch of first distillation data for each pre-training model is obtained by combining the target cross-entropy loss and the batch normalization statistical loss, and the first losses corresponding to the pre-training models are then linearly combined; this averages out the feature deviation produced by each pre-training model, so that the final target distillation data is closer to the real data and more general.
  • the generated target distillation data can match the feature distribution of multiple models, and at the same time, the target distillation data is more robust and has more visual information through data mixing. After the target distillation data is obtained, it can be directly used to compress the pre-trained model or other unidentified models.
  • In some embodiments, when the quantity of the target distillation data reaches a specific threshold, model compression is performed on each of the pre-trained models. In this way, since the generated target distillation data has more visual information and is universal, it can be directly used in the model compression process, simplifying the compression process.
  • the generated target distillation data can be directly used for compression, without further performing the distillation data generation process.
  • the target distillation data generated by the embodiment of the present disclosure is accurate enough, and it only needs to generate a target distillation data set (including a sufficient amount of target distillation data) once to apply to all pre-trained models and generalize to unidentified models.
  • the model compression operation may include network quantization (Network Quantization), network pruning (Network Pruning), knowledge distillation (Knowledge Distillation) and the like.
  • the pruning process can complete the pruning decision and post-pruning reconstruction based on the target distillation data set;
  • the model quantization process can complete the training quantization process based on the target distillation data set or the calibration process of post-training quantization based on the target distillation data set;
  • the knowledge distillation process can send the target distillation data set to the teacher network and the student network to complete the process of knowledge transfer.
  • In this way, the inversion solution space can be reduced; at the same time, the visual information of the trained target distillation data is more robust and can adapt to different scales and spatial orientations. The batch normalized statistical loss and the target cross-entropy loss are combined to determine the target loss of a batch of first distillation data through multiple pre-trained models, so that the final target distillation data is closer to the real data.
  • Fig. 4A is a schematic flowchart of the data distillation method provided by the embodiment of the present disclosure.
  • the above step S330 (in each iterative training, mixing every two image data in each batch of the second distillation data to obtain a batch of first distillation data) can be implemented through the following steps:
  • Step S410 randomly selecting at least one pair of the first image and the second image from each batch of the second distillation data
  • each pair of the first image and the second image is any two second distillation data. Because the first image and the second image are randomly selected and mixed during each round of iteration, the visual information of the target distillation data obtained after training will be more robust and able to adapt to different scales and spatial orientations.
  • Step S420 reducing the size of each of the first images according to a specific ratio
  • the specific ratio is a ratio between the bounding box of the random coverage area on the second image and the size of the original image data.
  • step S430 the reduced first image is randomly overlaid on the corresponding second image to obtain each batch of the first distillation data.
  • the overlaid composite image includes information of two image data.
  • the first image 41 is proportionally reduced to a fixed size, and then the reduced first image 42 is randomly overlaid on the second image 43 .
  • the overlaid composite image 44 will contain all the information of the first image 41 and the second image 43 .
  • the step S430 can be implemented through the following process: according to the specific ratio, randomly generate a mixed region to be covered in the corresponding second image; based on the bounding box of the mixed region, randomly generate a binary mask; superimpose each pixel value of the reduced first image with the corresponding pixel value in the second image by using the binary mask, to obtain each batch of the first distillation data.
  • the mixing of any two image data is realized by the following formulas (5) and (6):

    M_ij = 1, if x_l ≤ j ≤ x_r and y_d ≤ i ≤ y_u; otherwise M_ij = 0    (5)

    X_syn = M ⊙ g(X_2) + (1 − M) ⊙ X_1    (6)

  • where x_l, x_r, y_d, y_u are in turn the left, right, lower, and upper boundaries of the bounding box of the mixed region, and M_ij is an element of the binary mask M: its value is 1 if position (i, j) lies inside the bounding box and 0 if it lies outside the bounding box;
  • X_2 and X_1 are respectively the first image and the second image to be mixed, and ⊙ denotes element-wise multiplication;
  • g(·) is a linear interpolation function that resizes the first image X_2 to the same size as the bounding box, after which the resized image is placed inside the bounding box region.
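  • A sketch of this mixing step for a single image pair, assuming CxHxW tensors and a fixed shrink ratio; the helper name mix_images and the returned area ratio lam are illustrative choices:

```python
import random
import torch
import torch.nn.functional as F

def mix_images(first, second, ratio=0.5):
    """Shrink the first image by `ratio` and paste it at a random location of the second
    image, in the spirit of formulas (5) and (6). Both inputs are CxHxW tensors."""
    _, h, w = second.shape
    bh, bw = int(h * ratio), int(w * ratio)            # bounding-box size
    # resize the first image to the bounding-box size with linear interpolation (g(.))
    small = F.interpolate(first.unsqueeze(0), size=(bh, bw),
                          mode="bilinear", align_corners=False).squeeze(0)
    # random bounding-box position inside the second image
    y_u = random.randint(0, h - bh)
    x_l = random.randint(0, w - bw)
    mask = torch.zeros(1, h, w)                        # binary mask M, formula (5)
    mask[:, y_u:y_u + bh, x_l:x_l + bw] = 1.0
    padded = torch.zeros_like(second)
    padded[:, y_u:y_u + bh, x_l:x_l + bw] = small      # place g(first) inside the box
    mixed = mask * padded + (1.0 - mask) * second      # composite image, formula (6)
    lam = (bh * bw) / float(h * w)                     # area ratio used by the mixed label
    return mixed, lam
```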
  • the first image is scaled down and randomly overlaid into the second image during data mixing, and the mixed composite image will contain information of the first image and the second image.
  • the two mixed image information can be recognized and optimized correctly by the pre-trained model.
  • FIG. 5A is a logic flow diagram of a data distillation method provided by an embodiment of the present disclosure. As shown in FIG. 5A, the method includes at least the following steps:
  • Step S510 using the Gaussian distribution of the original data to initialize a batch of distillation data
  • the distillation process first uses a Gaussian distribution to initialize a batch of distillation data (equivalent to the second distillation data); the Gaussian distribution uses the statistical values of the original data.
  • the distribution information of the original training data is obtained first, and then a batch of distilled data is initialized with the statistical value of the Gaussian distribution.
  • Step S520 randomly sampling three pre-training models from the pre-training model cluster
  • Step S530 using data mixing and feature mixing to train a batch of distillation data through three pre-trained models
  • the embodiment of the present disclosure adopts a stochastic gradient descent method to train a batch of mixed distillation data (equivalent to the first distillation data).
  • Step S540 repeating the above training process until a specified amount of target distillation data is generated.
  • Since the generated target distillation data is sufficiently accurate and general, the target distillation data generated in the embodiments of the present disclosure can even generalize to unidentified models. Therefore, when a new model compression requirement is proposed, the already generated target distillation data can be directly used for compression.
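  • Putting the steps of Fig. 5A together, a compact sketch of the overall generation loop; the choice of torchvision architectures for the pre-training model cluster, the batch count, and the reuse of the helpers sketched earlier (init_distillation_batch, distill_batch, mix_images) are all illustrative assumptions:

```python
import random
import torchvision

# an illustrative pre-training model "cluster" (any BN-based architectures work)
model_zoo = [
    torchvision.models.resnet50(weights="IMAGENET1K_V1"),
    torchvision.models.resnet18(weights="IMAGENET1K_V1"),
    torchvision.models.mobilenet_v2(weights="IMAGENET1K_V1"),
    torchvision.models.vgg16_bn(weights="IMAGENET1K_V1"),
]

target_dataset = []
num_batches = 100                       # repeat until the specified amount is generated

for _ in range(num_batches):
    # Step S510: initialize a batch of distillation data from the original data's Gaussian statistics
    data, labels = init_distillation_batch()
    # Step S520: randomly sample three pre-training models from the cluster
    models = random.sample(model_zoo, 3)
    # Step S530: train the batch through the three models (in the full method, mix_images is
    # also applied to random pairs within the batch during the iterations, i.e. data mixing)
    distilled = distill_batch(models, data, labels)
    # Step S540: accumulate the batch of target distillation data
    target_dataset.append((distilled, labels))
```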
  • FIG. 5B is an algorithm framework combining data mixing and feature mixing provided by an embodiment of the present disclosure.
  • the solid arrows in the framework reflect the forward training flow
  • the dotted arrows reflect the backward inversion process flow.
  • the data mixing 51 process is performed first: the first image 501 and the second image 502 are randomly selected from a batch of distillation data and mixed to obtain a composite image 503. Then the feature mixing 52 process is carried out: a batch of distillation data including the composite image 503 is input into the pre-training model 504, the pre-training model 505, and the pre-training model 506 respectively (three models are used here as an example, but the number is not limited and can be larger), and the target cross-entropy loss 507 and the batch normalization statistical loss 508 after the three pre-trained models are calculated.
  • each component block of each pre-training model includes an activation layer 61, a batch normalization layer 62, and a convolutional layer 63; the batch normalization layer 62 records the mean and variance (μ, σ²), and the means and variances stored in all the blocks of each pre-training model can be extracted to determine the above-mentioned batch normalization statistical loss 508.
  • the target loss is further determined using the target cross-entropy loss 507 and the batch normalization statistical loss 508 determined in the previous iterative training process, so as to optimize the composite image 503 using the target loss. Since the pre-trained model already captures the image category information of the distilled data, knowledge can be retrieved by assigning mixed labels to the synthesized images 503 .
  • the forward training process includes the following steps:
  • Step S5301 performing random data mixing on a batch of distillation data
  • Step S5302 input the mixed batch of distillation data into the three pre-training models respectively, and determine the difference between the current statistic and the original statistic stored in the model;
  • the current statistic is the second statistic of a batch of distillation data
  • the original statistic is the first statistic of the original data stored in the model.
  • Feature mixing refers to computing the statistics of a batch of distillation data in a pre-trained model.
  • These statistics are compared with the first statistics of the original data, namely the mean and variance. If the mean and variance of a batch of distillation data in the pre-training model are not much different from the mean and variance of the original data, the batch of distillation data is considered to be very similar to the original data; if the difference is still large, the error between them needs to be further minimized.
  • Step S5303 using difference backpropagation to calculate the gradient, and performing gradient descent update on the mixed batch of distillation data
  • Step S5304 judging whether the iterative training is completed.
  • When the target distillation data looks consistent with the original data, the iterative training is completed; otherwise, the process returns to step S5301 and continues training.
  • the distillation data generated by the data-free compression method in the related art can only be used for the compression of the model it was generated from; moreover, because the distillation process is an irreversible operation, the data contains a lot of unreasonable visual information, so the generated distillation data cannot be transferred and needs to be generated repeatedly.
  • the generated target distillation data has more robust visual information and features, so the target distillation data is very versatile and can be used for different models.
  • the embodiment of the present disclosure only needs to generate the target distillation data once, and then it can be used all the time.
  • the embodiments of the present disclosure use multiple pre-trained models to improve the versatility of distillation data generation, and at the same time use data information mixture to generate robust visual information.
  • distillation data sets can be generated in advance and directly used for model compression, such as quantization and pruning.
  • In the related art, by contrast, distillation data needs to be generated anew before each use.
  • the present disclosure further provides a device for data distillation.
  • the device includes all of the modules, the sub-modules included in each module, and the units; these can be implemented by a processor in an electronic device, or by a specific logic circuit. In the process of implementation, the processor can be a central processing unit (Central Processing Unit, CPU), a microprocessor (Micro Processing Unit, MPU), a digital signal processor (Digital Signal Processor, DSP), a field programmable gate array (Field Programmable Gate Array, FPGA), etc.
  • Fig. 6 is a schematic diagram of the composition and structure of a data distillation device provided by an embodiment of the present disclosure.
  • the device 600 includes a first determination module 610, a second determination module 620, a third determination module 630, a fourth determination module 640, and a training module 650, wherein:
  • the first determination module 610 is configured to determine at least one batch of first distillation data to be trained; in each batch of the first distillation data, there is at least one mixed data including two kinds of data label information;
  • the second determination module 620 is configured to determine at least two pre-training models; wherein, each of the pre-training models stores the first statistical information of the original data;
  • the third determination module 630 is configured to determine the batch normalized statistical loss of each batch of the first distillation data in the corresponding pre-training model based on the first statistical information in each of the pre-training models;
  • the fourth determination module 640 is configured to determine the target intersection of each batch of the first distillation data in each of the pre-training models based on the initialization label of each data in each batch of the first distillation data entropy loss;
  • the training module 650 is configured to perform backpropagation training on each batch of the first distillation data based on the batch normalized statistical loss and the target cross-entropy loss of that batch in each of the pre-training models, to obtain the target distillation data.
  • the third determining module 630 includes a first determining submodule and a second determining submodule, wherein: the first determining submodule is configured to determine the second statistical information of each batch of the first distillation data in each of the pre-training models; the second determining submodule is configured to determine, for each of the pre-training models, the batch normalization statistical loss between the first statistical information and the second statistical information.
  • the second determining module 620 is further configured to randomly select at least two different types of pre-training models from the pre-training model library.
  • the training module 650 includes a third determination submodule, a processing submodule, and a training submodule, wherein: the third determination submodule is configured to determine, based on the batch normalized statistical loss and the target cross-entropy loss of each pre-training model, the first loss corresponding to the corresponding pre-training model; the processing submodule is configured to average the first losses corresponding to the pre-training models to obtain the target loss of each batch of the first distillation data passing through the at least two pre-trained models; the training submodule is configured to perform backpropagation training on each batch of the first distillation data based on the target loss, to obtain the target distillation data.
  • the first determination module includes an initialization submodule and a mixing submodule, wherein: the initialization submodule is configured to initialize at least one batch of second distillation data based on the distribution information of the original data;
  • the mixing sub-module is configured to mix every two image data in each batch of the second distillation data to obtain each batch of the first distillation data in each iterative training.
  • the mixing submodule includes a selection unit, a scaling unit, and a covering unit, wherein: the selection unit is configured to randomly select at least one pair of a first image and a second image from each batch of the second distillation data; the scaling unit is configured to reduce the size of each of the first images according to a specific ratio; the covering unit is configured to randomly overlay the reduced first image onto the corresponding second image to obtain each batch of the first distillation data.
  • the covering unit includes a first generating subunit, a second generating subunit, and a superimposing subunit, wherein: the first generating subunit is configured to randomly generate, according to the specific ratio, a mixed region to be covered in the corresponding second image; the second generating subunit is configured to randomly generate a binary mask based on the bounding box of the mixed region; the superimposing subunit is configured to superimpose, by using the binary mask, each pixel value of the reduced first image with the corresponding pixel value in the second image, to obtain each batch of the first distillation data.
  • each batch of first distillation data includes a synthetic image obtained by mixing the first image and the second image
  • the fourth determining module 640 includes a fourth determining submodule, a fifth determining submodule, and a sixth determining submodule, wherein: the fourth determining submodule is configured to determine the hybrid cross-entropy loss of the composite image based on the initialization label of the first image and the initialization label of the second image;
  • the fifth determination submodule is configured to determine the accumulation of the other images based on the initialization labels of the other images in each batch of first distillation data except the first image and the second image Cross-entropy loss;
  • the sixth determining submodule is configured to determine the target of each batch of the first distillation data after passing through each of the pre-training models based on the mixed cross-entropy loss and the cumulative cross-entropy loss Cross entropy loss.
  • the fourth determining submodule includes a first determining unit, a second determining unit, and a summing unit, wherein: the first determining unit is configured to determine, based on the initialization label of the first image, the first cross-entropy loss after each batch of the first distillation data passes through each of the pre-training models; the second determining unit is configured to determine, based on the initialization label of the second image, the second cross-entropy loss after each batch of the first distillation data passes through each of the pre-training models; the summing unit is configured to linearly sum the first cross-entropy loss and the second cross-entropy loss based on the synthesis ratio between the first image and the second image, to obtain the mixed cross-entropy loss.
  • the first statistical information includes mean and variance
  • each of the pre-training models includes at least two building blocks
  • each of the building blocks includes a convolutional layer and a batch normalization layer
  • the third determination module 630 also includes an extraction submodule and an accumulation submodule, wherein: the extraction submodule is configured to extract the mean and variance of the original data from the batch normalization layer of each of the constituent blocks; the mean and variance are obtained statistically from the features of the original data extracted by the convolutional layer; the accumulation submodule is configured to determine the first statistical information in each of the pre-training models based on the means and variances of the at least two constituent blocks in each of the pre-training models.
  • the distribution information is Gaussian distribution information
  • the initialization submodule includes an acquisition unit and an initialization unit, wherein: the acquisition unit is configured to acquire the Gaussian distribution information of the original data; the The initialization unit is configured to initialize at least N initial pixel value matrices based on randomly sampled data from the Gaussian distribution information to obtain each batch of the second distillation data; N is an integer greater than or equal to 2.
  • the training submodule includes a third determination unit, an update unit, and a training unit, wherein: the third determination unit is configured to determine, in each iterative training, the gradient of the target loss with respect to the first distillation data; the update unit is configured to update each data item in the first distillation data based on the gradient; the training unit is configured to obtain the target distillation data when the batch normalization statistical loss and the target cross-entropy loss of the updated first distillation data in each of the pre-training models converge to a stable value.
  • the device further includes a compression module configured to perform model compression on each of the pre-trained models when the quantity of at least one batch of the target distillation data reaches a specific threshold.
  • the description of the above device embodiment is similar to the description of the above method embodiment, and has beneficial effects similar to those of the method embodiment.
  • for technical details not disclosed in the device embodiments of the present disclosure, please refer to the description of the method embodiments of the present disclosure.
  • if the above data distillation method is implemented in the form of software function modules and sold or used as an independent product, it may also be stored in a computer-readable storage medium.
  • the essence of the technical solutions of the embodiments of the present disclosure, or the part contributing to the related art, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing an electronic device (which may be a smartphone with a camera, a tablet computer, etc.) to execute all or part of the methods described in the various embodiments of the present disclosure.
  • the aforementioned storage medium includes various media that can store program codes, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disc.
  • an embodiment of the present disclosure provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the steps in any one of the methods for data distillation described in the above-mentioned embodiments are implemented.
  • a chip is also provided; the chip includes programmable logic circuits and/or program instructions, and when the chip runs, it is used to implement the steps in the data distillation method described in any of the above embodiments.
  • a computer program product is also provided; when the computer program product is executed by the processor of the electronic device, it is used to implement the steps of the data distillation method in any of the above embodiments.
  • An embodiment of the present disclosure further provides a computer program product, the computer program product carries a program code, and instructions included in the program code can be used to execute the steps in any one of the data distillation methods in the above method embodiments.
  • the above-mentioned computer program product may be specifically implemented by means of hardware, software or a combination thereof.
  • the computer program product is embodied as a computer storage medium, and in another optional embodiment, the computer program product is embodied as a software product, such as a software development kit (Software Development Kit, SDK), and so on.
  • An embodiment of the present disclosure also provides a computer program, including computer readable codes.
  • when the computer-readable codes run in the electronic device, the processor in the electronic device executes the steps of any one of the above method embodiments.
  • the data distillation method is not limited to the above.
  • FIG. 7 is a schematic diagram of hardware entities of an electronic device provided by an embodiment of the present disclosure.
  • the electronic device 700 includes a memory 710 and a processor 720; the memory 710 stores a computer program executable on the processor 720, and the processor 720 implements the steps in any one of the data distillation methods in the embodiments of the present disclosure when executing the program.
  • the memory 710 is configured to store instructions and applications executable by the processor 720, and can also cache data to be processed or already processed by the processor 720 and by various modules in the electronic device (for example, image data, audio data, voice communication data and video data), and can be implemented by a flash memory (FLASH) or a random access memory (RAM).
  • when the processor 720 executes the program, the steps of any one of the data distillation methods described above are implemented.
  • the processor 720 generally controls the overall operation of the electronic device 700 .
  • the above-mentioned processor may be at least one of an application-specific integrated circuit (ASIC), a digital signal processor (DSP), a digital signal processing device (DSPD), a programmable logic device (PLD), a field-programmable gate array (FPGA), a central processing unit (CPU), a controller, a microcontroller, or a microprocessor. Understandably, the electronic device that implements the above processor function may also be another device, which is not specifically limited in the embodiments of the present disclosure.
  • the above-mentioned computer storage medium/memory may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), a ferromagnetic random access memory (FRAM), a flash memory, a magnetic surface memory, an optical disc, or a compact disc read-only memory (CD-ROM); it may also be various electronic devices including one or any combination of the above-mentioned memories, such as a mobile phone, a computer, a tablet device, a personal digital assistant, and the like.
  • the disclosed devices and methods may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division.
  • the coupling, or direct coupling, or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be electrical, mechanical, or in other forms.
  • the units described above as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; they may be located in one place or distributed to multiple network units; Part or all of the units can be selected according to actual needs to achieve the purpose of the solutions of the embodiments of the present disclosure.
  • each functional unit in each embodiment of the present disclosure may be integrated into one processing unit, or each unit may serve as a single unit, or two or more units may be integrated into one unit; the above integrated unit can be implemented in the form of hardware or in the form of hardware plus a software functional unit.
  • if the above-mentioned integrated units of the present disclosure are implemented in the form of software function modules and sold or used as independent products, they may also be stored in a computer-readable storage medium.
  • the essence of the technical solutions of the embodiments of the present disclosure, or the part contributing to the related art, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a device to execute all or part of the methods described in the various embodiments of the present disclosure.
  • the aforementioned storage medium includes various media capable of storing program codes, such as a removable storage device, a ROM, a magnetic disk, or an optical disc.
  • the methods disclosed in the several method embodiments provided in the present disclosure can be combined arbitrarily, without conflict, to obtain new method embodiments.
  • the features disclosed in the several method or device embodiments provided in the present disclosure may be combined arbitrarily, without conflict, to obtain new method embodiments or device embodiments.
  • At least one batch of first distillation data to be trained is first determined; then at least two pre-training models are determined; based on the first statistical information in each of the pre-training models, the batch normalization statistical loss of each batch of the first distillation data in the corresponding pre-training model is determined; then, based on the initialization label of each piece of data in each batch of the first distillation data, the target cross-entropy loss of each batch of the first distillation data in each of the pre-training models is determined; finally, based on the batch normalization statistical loss and the target cross-entropy loss of each batch of the first distillation data in each of the pre-training models, back-propagation training is performed on each batch of the first distillation data to obtain the target distillation data. In this way, robust visual information is generated through data mixing, and the characteristics of the original data stored in multiple pre-training models are utilized, so that the trained target distillation data can match the feature distributions of multiple models, making the target distillation data more general and more effective.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

A data distillation method and apparatus, a device, a storage medium, a computer program, and a product. The method comprises: determining at least one batch of first distillation data to be trained (S110); determining at least two pretraining models (S120); determining a batch normalization statistical loss of each batch of first distillation data in the corresponding pretraining model on the basis of first statistical information in each pretraining model (S130); determining a target cross entropy loss of each batch of first distillation data in each pretraining model on the basis of an initialized label of each piece of data in each batch of first distillation data (S140); performing back propagation training on each batch of first distillation data on the basis of the batch normalization statistical loss and the target cross entropy loss of each batch of first distillation data in each pretraining model, to obtain target distillation data (S150).

Description

Data distillation method, apparatus, device, storage medium, computer program, and product
Cross-Reference to Related Applications
The present disclosure is based on, and claims priority to, the Chinese patent application with application number 202110994122.7, filed on August 27, 2021 and entitled "Data Distillation Method, Apparatus, Electronic Device, and Storage Medium", the entire contents of which are incorporated into the present disclosure by reference.
Technical Field
The present disclosure relates to the field of computer vision, and relates to, but is not limited to, data distillation methods, apparatuses, devices, storage media, computer programs, and products.
Background
In the era of big data, deep learning models are used more and more frequently. To apply deep learning models to small devices such as mobile devices and sensors, the models must sometimes be compressed and pruned before they can be deployed.
Compressing a neural network usually requires the original training data, because the compressed model generally needs to be retrained to recover its previous performance. However, the original data is private in some cases; that is, there is a risk that the original data cannot be obtained.
Summary
Embodiments of the present disclosure provide a data distillation method, apparatus, device, storage medium, computer program, and product.
The technical solutions of the embodiments of the present disclosure are implemented as follows:
In a first aspect, an embodiment of the present disclosure provides a data distillation method, including: determining at least one batch of first distillation data to be trained, where each batch of the first distillation data contains at least one piece of mixed data that includes label information of two pieces of data; determining at least two pre-training models, where each of the pre-training models stores first statistical information of original data; determining, based on the first statistical information in each of the pre-training models, a batch normalization statistical loss of each batch of the first distillation data in the corresponding pre-training model; determining, based on an initialization label of each piece of data in each batch of the first distillation data, a target cross-entropy loss of each batch of the first distillation data in each of the pre-training models; and performing back-propagation training on each batch of the first distillation data based on the batch normalization statistical loss and the target cross-entropy loss of each batch of the first distillation data in each of the pre-training models, to obtain target distillation data.
In a second aspect, an embodiment of the present disclosure provides a data distillation apparatus, including a first determining module, a second determining module, a third determining module, a fourth determining module, and a training module, wherein:
the first determining module is configured to determine at least one batch of first distillation data to be trained, where each batch of the first distillation data contains at least one piece of mixed data that includes label information of two pieces of data; the second determining module is configured to determine at least two pre-training models, where each of the pre-training models stores first statistical information of original data; the third determining module is configured to determine, based on the first statistical information in each of the pre-training models, a batch normalization statistical loss of each batch of the first distillation data in the corresponding pre-training model; the fourth determining module is configured to determine, based on an initialization label of each piece of data in each batch of the first distillation data, a target cross-entropy loss of each batch of the first distillation data in each of the pre-training models; and the training module is configured to perform back-propagation training on each batch of the first distillation data based on the batch normalization statistical loss and the target cross-entropy loss of each batch of the first distillation data in each of the pre-training models, to obtain target distillation data.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including a memory and a processor, where the memory stores a computer program executable on the processor, and the processor implements the steps in the above data distillation method when executing the program.
In a fourth aspect, an embodiment of the present disclosure provides a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the steps in the above data distillation method.
In a fifth aspect, the present disclosure further provides a computer program, including computer-readable codes; when the computer-readable codes run in an electronic device, a processor in the electronic device executes the steps for implementing the implementations of the above first aspect.
In a sixth aspect, the present disclosure further provides a computer program product, where the computer program product includes one or more instructions, and the one or more instructions are suitable for being loaded by a processor to execute the steps in the above first aspect.
The beneficial effects brought by the technical solutions provided by the embodiments of the present disclosure include at least the following:
In the embodiments of the present disclosure, at least one batch of first distillation data to be trained is first determined; then at least two pre-training models are determined; next, based on the first statistical information in each of the pre-training models, the batch normalization statistical loss of each batch of the first distillation data in the corresponding pre-training model is determined; based on the initialization label of each piece of data in each batch of the first distillation data, the target cross-entropy loss of each batch of the first distillation data in each of the pre-training models is determined; finally, based on the batch normalization statistical loss and the target cross-entropy loss of each batch of the first distillation data in each of the pre-training models, back-propagation training is performed on each batch of the first distillation data to obtain target distillation data. In this way, robust visual information is generated through data mixing, and the characteristics of the original data stored in multiple pre-training models are utilized, so that the trained target distillation data can match the feature distributions of multiple models; as a result, the target distillation data is more general and performs better.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present disclosure more clearly, the accompanying drawings required in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present disclosure, and those of ordinary skill in the art can obtain other drawings based on these drawings without creative effort, wherein:
FIG. 1 is a schematic flowchart of a data distillation method provided by an embodiment of the present disclosure;
FIG. 2 is a schematic flowchart of a data distillation method provided by an embodiment of the present disclosure;
FIG. 3 is a schematic flowchart of a data distillation method provided by an embodiment of the present disclosure;
FIG. 4A is a schematic flowchart of a data distillation method provided by an embodiment of the present disclosure;
FIG. 4B is a schematic diagram of a data mixing method provided by an embodiment of the present disclosure;
FIG. 5A is a logic flowchart of a data distillation method provided by an embodiment of the present disclosure;
FIG. 5B shows an algorithm framework combining data mixing and feature mixing provided by an embodiment of the present disclosure;
FIG. 5C is a schematic diagram of a training process combining data mixing and feature mixing provided by an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of the composition and structure of a data distillation apparatus provided by an embodiment of the present disclosure;
FIG. 7 is a schematic diagram of a hardware entity of an electronic device provided by an embodiment of the present disclosure.
Detailed Description
To make the objectives, technical solutions, and advantages of the embodiments of the present disclosure clearer, the technical solutions in the embodiments of the present disclosure are described clearly and completely below with reference to the accompanying drawings in the embodiments of the present disclosure. Obviously, the described embodiments are some, rather than all, of the embodiments of the present disclosure. The following embodiments are used to illustrate the present disclosure, but not to limit its scope. Based on the embodiments of the present disclosure, all other embodiments obtained by persons of ordinary skill in the art without creative effort fall within the protection scope of the present disclosure.
In the following description, references to "some embodiments" describe a subset of all possible embodiments; it should be understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with one another without conflict.
It should be noted that the terms "first\second\third" in the embodiments of the present disclosure are only used to distinguish similar objects and do not represent a specific ordering of the objects. It can be understood that, where permitted, "first\second\third" may be interchanged in a specific order or sequence, so that the embodiments of the present disclosure described herein can be implemented in an order other than that illustrated or described herein.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meanings as commonly understood by those of ordinary skill in the art to which the embodiments of the present disclosure belong. It should also be understood that terms such as those defined in commonly used dictionaries should be understood as having meanings consistent with their meanings in the context of the related art, and unless specifically defined as herein, should not be interpreted in an idealized or overly formal sense.
The solutions provided by the embodiments of the present disclosure relate to the field of deep learning. To facilitate understanding of the solutions of the embodiments of the present disclosure, the terms involved in the related art are first briefly described:
Artificial intelligence (AI) is a theory, method, technology, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can respond in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
Artificial intelligence technology is a comprehensive discipline involving a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Machine learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It studies how computers simulate or implement human learning behaviors to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve their performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and its applications pervade all fields of artificial intelligence. Machine learning and deep learning usually include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.
Dataset distillation differs from knowledge distillation, which transfers knowledge from a complex network to a simpler model: dataset distillation keeps the model unchanged and compresses the knowledge of a large data set (usually containing thousands or millions of images) in the original data set into a small amount of synthetic data, while the performance of a model trained on the synthetic data is nearly the same as that of a model trained on the original data set.
The distillation data used in the data-free compression methods provided in the related art is obtained by distillation from a single pre-training model. This pre-training model stores some features of the original training data, which can reflect the distribution properties of the original training data. Therefore, when an arbitrary initialization image is input into the pre-training model, one can observe whether the features of the distilled image output by the model are consistent with the features of the original training images. By minimizing the difference between these two sets of features (a process that can be called data distillation), a distilled image very close to the original training images can be obtained. The distilled image can then be used to train the model to improve the performance of the compressed model.
The data distillation process can effectively remove the dependence on the original training data during model compression. However, existing data distillation methods still have two problems: (1) the generated distillation data is not general; the original training data can be used to compress any pre-training model, but the distillation data generated by pre-training model A is difficult to use for compressing pre-training model B, and vice versa; (2) the generated distillation data is inaccurate; generating distillation data from a pre-training model is equivalent to solving an inverse-function problem, but neural networks are highly irreversible and highly non-convex, so the generated distillation data is inaccurate and differs considerably from the original training images.
An embodiment of the present disclosure provides a data distillation method, which is applied to an electronic device. The electronic device includes, but is not limited to, a mobile phone, a laptop computer, a tablet computer, a handheld Internet device, a multimedia device, a streaming media device, a mobile Internet device, a wearable device, or other types of devices. The functions implemented by the method can be realized by a processor in the electronic device calling program code, and the program code can of course be stored in a computer storage medium; it can be seen that the electronic device includes at least a processor and a storage medium. The processor can be used to carry out the process of generating distillation data, and the memory can be used to store intermediate data required in the process of generating distillation data as well as the generated target distillation data.
FIG. 1 is a schematic flowchart of a data distillation method provided by an embodiment of the present disclosure. As shown in FIG. 1, the method includes at least the following steps:
Step S110: determine at least one batch of first distillation data to be trained;
Here, the at least one batch of first distillation data to be trained refers to one or more batches of data images; the batch size is a hyperparameter that defines the number of samples to be processed by a pre-training model in each training pass.
In each batch of the first distillation data, there is at least one piece of mixed data that includes label information of two pieces of data. For example, if the mixed data is a composite image obtained by mixing two images, the composite image carries all the label information of the two original images. The first distillation data to be trained may be in matrix form, and the pixel values in the matrix may be randomly initialized using the distribution information of the original data. The specific numbers of rows and columns of the matrix may also be determined according to actual needs. In this way, the first distillation data to be trained can be obtained quickly and conveniently, providing a good basis for subsequent processing.
Step S120: determine at least two pre-training models;
Here, a pre-training model means that, at the beginning of training, a model can load the parameters of a model with the same network structure; the purpose is feature transfer, which makes the model perform better.
In some embodiments, the pre-training model includes one or more of a convolutional neural network (CNN), a recursive neural network, a recurrent neural network (RNN), and a deep neural network (DNN), which is not limited in the embodiments of the present disclosure.
Each of the pre-training models stores first statistical information of the original data, such as statistical values including the mean, standard deviation, and variance. The mean is the expected value of the normal distribution curve, which determines the location of the original data distribution; the variance is the square of the standard deviation, and the standard deviation determines the spread of the original data distribution.
It should be noted that in each neural network, the distribution features of the training data are encoded and processed in its own feature space. Therefore, in the process of training a pre-training model with the original data, the first statistical information of the original data is stored in the batch normalization layer of each constituent block of the pre-training model.
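As an illustration, the minimal sketch below (assuming a PyTorch/torchvision implementation; ResNet-50 is only an example of a pre-training model) reads out the first statistical information that each batch normalization layer accumulated over the original training data:

```python
# Minimal sketch, assuming PyTorch/torchvision: read the running statistics that
# each BatchNorm layer of a pre-trained model stores about the original data.
import torch.nn as nn
import torchvision.models as models

model = models.resnet50(pretrained=True)   # illustrative choice of pre-training model
model.eval()

first_statistics = []
for module in model.modules():
    if isinstance(module, nn.BatchNorm2d):
        # running_mean / running_var were accumulated over the original training data,
        # so they summarize the feature distribution of that data per constituent block.
        first_statistics.append((module.running_mean.clone(), module.running_var.clone()))

print(f"collected statistics from {len(first_statistics)} batch normalization layers")
```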
Step S130: determine, based on the first statistical information in each of the pre-training models, the batch normalization statistical loss of each batch of the first distillation data in the corresponding pre-training model;
Here, by matching the first statistical information in each pre-training model with the second statistical information of the first distillation data in that pre-training model, the statistical loss between the first distillation data and the original data in the corresponding pre-training model is determined. The statistical features of the first distillation data can then be matched against multiple pre-training models, combining the general feature spaces of the pre-training models, so that the trained target distillation data is more general than data distilled from a single model.
It should be noted that the mean and variance computed by a batch normalization layer in the pre-training model are based on all the first distillation data included in the current batch.
Step S140: determine, based on the initialization label of each piece of data in each batch of the first distillation data, the target cross-entropy loss of each batch of the first distillation data in each of the pre-training models;
Here, the initialization labels are assigned randomly when the first distillation data is initialized using the Gaussian distribution of the original data; for example, each piece of data in a batch of first distillation data is assigned any value from 1 to 1000 as its initialization label. The initialization labels of different pieces of data may be the same or different.
By matching the initialization label of each piece of data against the predicted label output by the pre-training model for the corresponding data, and linearly summing the matching results of all the data in a batch of first distillation data, the target cross-entropy loss of the batch of first distillation data in each of the pre-training models can be determined.
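A minimal sketch of this step, assuming PyTorch and 1000 classes (both illustrative), assigns a random initialization label to each image in the batch and computes the batch-averaged cross-entropy between the pre-training model's predictions and these fixed labels:

```python
# Minimal sketch, assuming PyTorch: random initialization labels per image and the
# batch-averaged cross-entropy against the pre-training model's predictions.
import torch
import torch.nn.functional as F
import torchvision.models as models

batch_size, num_classes = 16, 1000                         # illustrative values
model = models.resnet18(pretrained=True).eval()            # illustrative pre-training model
first_distillation_data = torch.randn(batch_size, 3, 224, 224, requires_grad=True)
init_labels = torch.randint(0, num_classes, (batch_size,)) # fixed once at initialization

logits = model(first_distillation_data)
target_ce_loss = F.cross_entropy(logits, init_labels)      # mean over the batch, as in the text
```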
Step S150: perform back-propagation training on each batch of the first distillation data based on the batch normalization statistical loss and the target cross-entropy loss of each batch of the first distillation data in each of the pre-training models, to obtain target distillation data.
Here, the target distillation data is a set of distilled images very close to the original training images. The target distillation data usually carries certain information that encodes the discriminative features of each category.
The batch normalization statistical loss and the target cross-entropy loss of each batch of the first distillation data in each of the pre-training models are processed linearly; for example, the cumulative loss of each pre-training model is determined first, and the losses of all the pre-training models are then summed and averaged, so that the target loss of the first distillation data over the at least two pre-training models can be determined.
During back-propagation, the gradient of the target loss with respect to the first distillation data is computed, and gradient-descent updates, such as stochastic gradient descent (SGD), are performed to progressively optimize the first distillation data; after about twenty thousand iterations of training, the target distillation data is finally obtained.
In the embodiments of the present disclosure, at least one batch of first distillation data to be trained is first determined; then at least two pre-training models are determined; next, based on the first statistical information in each of the pre-training models, the batch normalization statistical loss of each batch of the first distillation data in the corresponding pre-training model is determined; based on the initialization label of each piece of data in each batch of the first distillation data, the target cross-entropy loss of each batch of the first distillation data in each of the pre-training models is determined; finally, based on the batch normalization statistical loss and the target cross-entropy loss of each batch of the first distillation data in each of the pre-training models, back-propagation training is performed on each batch of the first distillation data to obtain the target distillation data. In this way, robust visual information is generated through data mixing, and the features of the original data stored in multiple pre-training models are utilized, so that the trained target distillation data can match the feature distributions of multiple models; the target distillation data is therefore more general and performs better.
FIG. 2 is a schematic flowchart of a data distillation method provided by an embodiment of the present disclosure. As shown in FIG. 2, the method includes at least the following steps:
Step S210: determine at least one batch of first distillation data to be trained;
Here, in each batch of the first distillation data there is at least one piece of mixed data that includes label information of two pieces of data.
Step S220: randomly select at least two pre-training models of different types from a pre-training model library;
Here, each of the pre-training models stores the first statistical information of the original data. Pre-training models of different types have different network structures.
It is worth noting that the feature mixing method in the embodiments of the present disclosure requires the feature information of multiple models, but too many models would slow down the optimization. Therefore, for each batch of first distillation data to be optimized, only a small subset of models is sampled for feature mixing. Experiments show that three models provide the best balance between speed and effectiveness.
In this way, for each batch of first distillation data to be trained, at least two pre-training models of different types are randomly sampled from the pre-training library for feature mixing, so that the distilled data matches the feature distribution of any pre-training model, and a better balance between training speed and effectiveness can be obtained, as illustrated by the sketch below.
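For illustration only, the following sketch (assuming PyTorch/torchvision; the model names are placeholders for whatever the pre-training library actually contains) samples a small subset of three pre-training models per batch:

```python
# Minimal sketch: sample a small subset (three, per the experiments above) of
# pre-training models of different types from a larger model library.
import random
import torchvision.models as models

model_library = {
    "resnet18": lambda: models.resnet18(pretrained=True),
    "resnet50": lambda: models.resnet50(pretrained=True),
    "mobilenet_v2": lambda: models.mobilenet_v2(pretrained=True),
    "densenet121": lambda: models.densenet121(pretrained=True),
}

def sample_pretrained_models(k=3):
    names = random.sample(list(model_library), k)
    return [model_library[name]().eval() for name in names]

ensemble = sample_pretrained_models(k=3)   # re-drawn for every batch of first distillation data
```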
Step S230: determine the second statistical information of each batch of the first distillation data in each of the pre-training models;
Here, in each neural network, the distribution features of the training data are encoded and processed in its own feature space, namely the batch normalization (BN) layers. The first distillation data can therefore reasonably simulate the feature distribution of the original images in the pre-training network. After each round of distillation training on the first distillation data, the activation mean and variance are extracted from the batch normalization layers of each pre-training model as the second statistical information of the first distillation data in the corresponding pre-training model.
Step S240: for each of the pre-training models, determine the batch normalization statistical loss between the first statistical information and the second statistical information;
Here, the batch normalization statistical loss for each pre-training model is determined by matching the second statistical information of the first distillation data in that pre-training model against the first statistical information stored in the corresponding pre-training model.
In some embodiments, the first statistical information includes a mean and a variance, each of the pre-training models includes at least two constituent blocks, and each of the constituent blocks includes a convolutional layer and a batch normalization layer. The first statistical information in each of the pre-training models can be determined through the following steps: first, the mean and variance of the original data are extracted from the batch normalization layer of each of the constituent blocks, where the mean and variance are statistics of the features of the original data extracted by the convolutional layer; then, based on the means and variances of the at least two constituent blocks in each of the pre-training models, the first statistical information in each of the pre-training models is determined. In this way, by extracting the mean and variance stored in the batch normalization layer of each constituent block of a pre-training model, the first statistical information of the corresponding pre-training model is determined, ensuring that the trained target distillation data can reasonably simulate the activation distribution of the original data in the pre-training network.
Exemplarily, the batch normalization statistical loss L_BN for each of the pre-training models can be determined by the following formula (1):
$$L_{BN} = \sum_{i} \left( \left\| \mu_i(X) - \mu_i^{BN} \right\|_2 + \left\| \sigma_i^2(X) - \left(\sigma_i^2\right)^{BN} \right\|_2 \right) \qquad (1)$$

where $\mu_i(X)$ and $\sigma_i^2(X)$ are, respectively, the mean and variance of the first distillation data $X$ in the $i$-th constituent block, and $\mu_i^{BN}$ and $\left(\sigma_i^2\right)^{BN}$ are, respectively, the running mean and variance stored in the BN layer of the corresponding pre-training model; the expression $\|\cdot\|_2$ denotes taking the difference of the two vectors and then the square root of the sum of the squares of its components.
Step S250: determine, based on the initialization label of each piece of data in each batch of the first distillation data, the target cross-entropy loss of each batch of the first distillation data in each of the pre-training models;
Here, the target cross-entropy loss is the total loss output by the last layer of the pre-training model.
The target cross-entropy loss L_CE of a batch of the first distillation data in each of the pre-training models can be determined by the following formula (2):
$$L_{CE} = \frac{1}{m} \sum_{i=1}^{m} \mathrm{CE}\left( \hat{y}_i, y_i \right) \qquad (2)$$

where $m$ is the number of pieces of data in a batch of first distillation data, i.e., the batch size, $y_i$ is the initialization label of each piece of data, and $\hat{y}_i$ is the predicted label output by the pre-training model.
It is worth noting that, for the mixed data in a batch of first distillation data, namely a composite image obtained by mixing two images, the label of the composite image is a mixed label determined based on the labels of the two original images.
Step S260: determine, based on the batch normalization statistical loss and the target cross-entropy loss of each of the pre-training models, the first loss corresponding to that pre-training model;
Here, for each pre-training model, the computed batch normalization statistical loss and target cross-entropy loss are linearly summed to obtain the first loss corresponding to that pre-training model.
Step S270: average the first losses corresponding to the respective pre-training models, to obtain the target loss of each batch of the first distillation data over the at least two pre-training models;
Here, the first losses corresponding to all the pre-training models are first summed, then averaged, and then minimized, giving the target loss of each batch of the first distillation data over the at least two pre-training models.
Exemplarily, considering a model cluster M = {A_1, A_2, ..., A_m} containing at least two pre-training models, the above target loss, which aims to optimize each batch of first distillation data by combining feature mixing and data mixing, can be written as the following formula (3):
$$\min_{\tilde{X}} \; \frac{1}{m} \sum_{i=1}^{m} \left( \lambda_1 \, L_{BN}^{A_i}\!\left(\tilde{X}\right) + \lambda_2 \, L_{CE}^{A_i}\!\left(\tilde{X}\right) \right) \qquad (3)$$

where $\tilde{X}$ denotes each batch of first distillation data, $L_{BN}^{A_i}(\tilde{X})$ is the batch normalization statistical loss of the batch of first distillation data through pre-training model $A_i$, $L_{CE}^{A_i}(\tilde{X})$ is the cross-entropy loss of the batch of first distillation data through pre-training model $A_i$, $\lambda_1$ and $\lambda_2$ are coefficients that are generally set according to the actual situation, and $m$ is the number of pre-training models.
In some embodiments, a prior loss may also be applied to the first distillation data to ensure that the images are generally smooth, where the prior loss is defined as the mean squared error between the first distillation data and its Gaussian-blurred version. Accordingly, the batch normalization statistical loss, the target cross-entropy loss, and the prior loss are combined to determine the final minimization objective as the target loss.
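A hedged sketch of the combined objective of formula (3), reusing bn_statistics_loss from the earlier sketch and assuming PyTorch; the coefficient values are illustrative, and the smoothness prior is approximated here with a simple average-pooling blur rather than a true Gaussian blur:

```python
# Minimal sketch of the target loss: average, over the sampled pre-training models,
# of the weighted BN statistical loss and cross-entropy loss, plus an optional prior.
import torch
import torch.nn.functional as F

def target_loss(ensemble, batch, init_labels, lam1=1.0, lam2=0.1, lam3=0.01):
    per_model = []
    for model in ensemble:
        logits = model(batch)
        ce = F.cross_entropy(logits, init_labels)
        bn = bn_statistics_loss(model, batch)        # defined in the earlier sketch
        per_model.append(lam1 * bn + lam2 * ce)
    loss = torch.stack(per_model).mean()             # average over the model cluster

    # crude smoothness prior: distance to a blurred copy of the batch (approximation)
    blurred = F.avg_pool2d(batch, kernel_size=3, stride=1, padding=1)
    return loss + lam3 * F.mse_loss(batch, blurred)
```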
Here, by matching the batch normalization statistical losses and target cross-entropy losses between the first distillation data and the features of the multiple pre-training models, the target loss of the first distillation data over the at least two pre-training models is determined. In this way, determining the target loss combines the general feature spaces of the pre-training models, so that the trained target distillation data is more general than data distilled from a single model.
Step S280: perform back-propagation training on each batch of the first distillation data based on the target loss, to obtain the target distillation data.
Here, the optimization of each batch of first distillation data uses the stochastic gradient descent (SGD) method, and each batch of data goes through twenty thousand iterations from initialization to the end of training. Stochastic gradient descent is a simple but very effective method, mostly used for learning linear classifiers under convex loss functions such as support vector machines and logistic regression; each time, only one piece of data in one dimension is randomly selected to compute the gradient, and the result is used as the step size of gradient descent in that dimension.
In some embodiments, step S280 can be implemented through the following process: in each training iteration, the gradient of the target loss with respect to the first distillation data is determined; based on the gradient, each piece of data in the first distillation data is updated; and when the batch normalization statistical loss and the target cross-entropy loss of the updated first distillation data in each of the pre-training models converge to stable values, the target distillation data is obtained. In this way, the first distillation data is optimized by stochastic gradient descent during back-propagation, which speeds up the iterative training of each batch of first distillation data.
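A hedged sketch of this optimization loop, assuming PyTorch and reusing sample_pretrained_models and target_loss from the sketches above; the learning rate is illustrative, and only the pixels of the first distillation data are updated while the pre-training models stay frozen:

```python
# Minimal sketch of back-propagation training of a batch of first distillation data.
import torch

batch = torch.randn(16, 3, 224, 224, requires_grad=True)    # first distillation data
init_labels = torch.randint(0, 1000, (16,))
ensemble = sample_pretrained_models(k=3)                     # from the earlier sketch
optimizer = torch.optim.SGD([batch], lr=0.05, momentum=0.9)  # the pixels are the parameters

for step in range(20000):                                    # roughly 20,000 iterations, per the text
    optimizer.zero_grad()
    loss = target_loss(ensemble, batch, init_labels)         # from the earlier sketch
    loss.backward()                                          # gradient w.r.t. the distillation data
    optimizer.step()

target_distillation_data = batch.detach()
```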
In the embodiments of the present disclosure, for each batch of first distillation data to be trained, at least two pre-training models of different types are randomly sampled from the pre-training library, and the first distillation data is matched against the features of multiple pre-training models simultaneously. In this way, the general feature spaces of the pre-training models are combined, so that the distilled data matches the feature distribution of any pre-training model, and a better balance between training speed and effectiveness can be obtained.
FIG. 3 is a schematic flowchart of a data distillation method provided by an embodiment of the present disclosure. As shown in FIG. 3, the method includes at least the following steps:
Step S310: determine at least one batch of second distillation data based on the distribution information of the original data;
Here, the original data may be images from at least one of the following general data sets: the ImageNet data set, the ResNet50 data set, the MobileNet V2 data set, and the like.
The distribution information of the original data may be a normal distribution, also known as a Gaussian distribution. The distribution curve is bell-shaped, i.e., low at both ends and high in the middle, and symmetric about its center. If a random variable X follows a normal distribution with mathematical expectation μ and variance σ^2, it is denoted N(μ, σ^2). For its probability density function, the expected value μ of the normal distribution determines its location, and its standard deviation σ determines the spread of the distribution. The normal distribution with μ = 0 and σ = 1 is the standard normal distribution.
Here, the distribution information is Gaussian distribution information, including the mean, variance, and so on of the original data. For visible-light (RGB) images, the mean and standard deviation of the three color channels are different; for example, the per-channel mean of the images in the ImageNet data set is [0.485, 0.456, 0.406] and the variance is [0.229, 0.224, 0.225].
In some embodiments, the above step S310 can be implemented through the following process: obtaining the Gaussian distribution information of the original data; and initializing at least N initial pixel value matrices based on data randomly sampled from the Gaussian distribution information, to obtain each batch of second distillation data, where N is an integer greater than or equal to 2. In this way, data is randomly sampled from the Gaussian distribution information of the original data and initialized to obtain the second distillation data, which compensates for the inability to obtain the original data directly.
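A minimal sketch of this initialization, assuming PyTorch; the per-channel statistics quoted above are used as the scale and shift of the Gaussian samples, and the batch size and resolution are illustrative:

```python
# Minimal sketch: initialize a batch of second distillation data by sampling from
# the per-channel Gaussian statistics of the original data (ImageNet values above).
import torch

channel_mean = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
channel_scale = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)

def init_second_distillation_data(n=16, height=224, width=224):
    noise = torch.randn(n, 3, height, width)          # standard normal samples
    data = noise * channel_scale + channel_mean       # match the original data statistics
    return data.requires_grad_(True)

second_distillation_data = init_second_distillation_data()
```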
步骤S320,确定至少两个预训练模型;Step S320, determine at least two pre-training models;
这里,每一所述预训练模型中存储所述原始数据的第一统计信息。Here, the first statistical information of the original data is stored in each pre-training model.
步骤S330,在每一次迭代训练中,对每一批第二蒸馏数据中每两个图像数据进行混合,得到一批第一蒸馏数据;Step S330, in each iterative training, mix every two image data in each batch of second distillation data to obtain a batch of first distillation data;
这里,在每一迭代训练中,随机选取任意两个图像数据进行数据混合,使得混合后的图像包含两个图像数据的信息,从而生成的目标蒸馏数据更加鲁棒并且拥有更多视觉信息,能够适应不同尺度和空间方位。因此生成的目标蒸馏数据用于模型压缩更加有效。Here, in each iterative training, any two image data are randomly selected for data mixing, so that the mixed image contains the information of the two image data, so that the generated target distillation data is more robust and has more visual information, which can Adapt to different scales and spatial orientations. Therefore, the generated target distillation data is more effective for model compression.
值得注意的是,上述步骤S320和步骤S330的执行顺序不限定,可以先执行步骤S330,也可以先执行步骤S320,也可以同时执行。It should be noted that, the execution order of the above step S320 and step S330 is not limited, step S330 may be executed first, step S320 may be executed first, or may be executed at the same time.
步骤S340,基于每一所述预训练模型中的第一统计信息,确定每一批所述第一蒸馏数据在相应预训练模型中的批归一化统计损失;Step S340, based on the first statistical information in each pre-training model, determine the batch normalized statistical loss of each batch of the first distillation data in the corresponding pre-training model;
步骤S350,基于每一批所述第一蒸馏数据中每一数据的初始化标签,确定每一批所述第一蒸馏数据在每一所述预训练模型中的目标交叉熵损失;Step S350, based on the initialization label of each data in each batch of the first distillation data, determine the target cross-entropy loss of each batch of the first distillation data in each of the pre-training models;
这里,所述每一批第一蒸馏数据中包括由第一图像和第二图像混合得到的合成图像。也就是说,从一批第二蒸馏数据中随机选择一对第一图像和第二图像进行混合得到合成图像,其中第一图像和第二图像分别有各自对应的初始化标签。Here, each batch of first distillation data includes a composite image obtained by mixing the first image and the second image. That is to say, a pair of the first image and the second image is randomly selected from a batch of second distillation data and mixed to obtain a composite image, wherein the first image and the second image respectively have corresponding initialization labels.
可以通过以下步骤确定每一批所述第一蒸馏数据在每一所述预训练模型中的目标交叉熵损失:The target cross-entropy loss of each batch of the first distillation data in each of the pre-training models can be determined by the following steps:
步骤S3501,基于所述第一图像的初始化标签和所述第二图像的初始化标签,确定所述合成图像的混合交叉熵损失;Step S3501, based on the initialization label of the first image and the initialization label of the second image, determine the hybrid cross-entropy loss of the composite image;
这里,通过第一图像和第二图像各自的初始化标签确定合成图像的混合交叉熵损失,可以确保原始第一图像和第二图像的信息能够被预训练模型识别并正确优化。Here, the hybrid cross-entropy loss of the synthetic image is determined by the respective initialization labels of the first image and the second image, which can ensure that the information of the original first image and the second image can be recognized and correctly optimized by the pre-trained model.
步骤S3502,基于所述每一批第一蒸馏数据中除所述第一图像和所述第二图像之外的其他图像的初始化标签,确定所述其他图像的累计交叉熵损失;Step S3502, based on the initialization labels of other images in each batch of first distillation data except the first image and the second image, determine the cumulative cross-entropy loss of the other images;
这里,计算累计交叉熵损失的过程可以参照公式(2),只是其中变量i的取值为除所述第一图像和所述第二图像之外的其他图像。Here, the process of calculating the cumulative cross-entropy loss may refer to formula (2), except that the value of the variable i is other images than the first image and the second image.
步骤S3503，基于所述混合交叉熵损失和所述累计交叉熵损失，确定每一批所述第一蒸馏数据经过每一所述预训练模型后的目标交叉熵损失。Step S3503, based on the mixed cross-entropy loss and the cumulative cross-entropy loss, determine the target cross-entropy loss of each batch of the first distillation data after passing through each of the pre-trained models.
这里，对一批第一蒸馏数据中合成图像的混合交叉熵损失和其他图像的累计交叉熵损失求和后最小化，得到每一批所述第一蒸馏数据经过每一所述预训练模型后的目标交叉熵损失。Here, the mixed cross-entropy loss of the synthetic image and the cumulative cross-entropy loss of the other images in a batch of first distillation data are summed and then minimized, giving the target cross-entropy loss of each batch of the first distillation data after passing through each of the pre-trained models.
在一些实施方式中，上述步骤S3501可以通过以下过程实现：基于所述第一图像的初始化标签，确定每一批所述第一蒸馏数据通过每一所述预训练模型后的第一交叉熵损失；基于所述第二图像的初始化标签，确定每一批所述第一蒸馏数据通过每一所述预训练模型后的第二交叉熵损失；基于第一图像与所述第二图像之间的合成比例，对所述第一交叉熵损失和所述第二交叉熵损失进行线性求和，得到所述混合交叉熵损失。在实施中，可以通过以下公式(4)确定上述混合交叉熵损失：In some implementations, the above step S3501 can be implemented through the following process: based on the initialization label of the first image, determining the first cross-entropy loss after each batch of the first distillation data passes through each of the pre-training models; based on the initialization label of the second image, determining the second cross-entropy loss after each batch of the first distillation data passes through each of the pre-training models; and, based on the composition ratio between the first image and the second image, linearly summing the first cross-entropy loss and the second cross-entropy loss to obtain the mixed cross-entropy loss. In an implementation, the mixed cross-entropy loss can be determined by the following formula (4):

$$\mathcal{L}_{CE}^{mix}\big(p(\tilde{X})\big)=\beta\,\mathcal{L}_{CE}\big(p(\tilde{X}),Y_{2}\big)+(1-\beta)\,\mathcal{L}_{CE}\big(p(\tilde{X}),Y_{1}\big)\qquad(4)$$

其中，$\tilde{X}$为混合后的第一蒸馏数据，$\mathcal{L}_{CE}^{mix}(p(\tilde{X}))$为$\tilde{X}$经过某个预训练模型的混合交叉熵损失，$Y_2$和$Y_1$分别为混合前第一图像和第二图像的初始化标签，$\mathcal{L}_{CE}(p(\tilde{X}),Y_2)$和$\mathcal{L}_{CE}(p(\tilde{X}),Y_1)$分别为第一图像对应的第一交叉熵损失和第二图像对应的第二交叉熵损失，β为混合区域的面积与第二图像的面积之间的比例系数。Here, $\tilde{X}$ is the first distillation data after mixing, $\mathcal{L}_{CE}^{mix}(p(\tilde{X}))$ is the mixed cross-entropy loss of $\tilde{X}$ after passing through a certain pre-trained model, $Y_2$ and $Y_1$ are the initialization labels of the first image and the second image before mixing respectively, $\mathcal{L}_{CE}(p(\tilde{X}),Y_2)$ and $\mathcal{L}_{CE}(p(\tilde{X}),Y_1)$ are the first cross-entropy loss corresponding to the first image and the second cross-entropy loss corresponding to the second image respectively, and β is the proportional coefficient between the area of the mixed region and the area of the second image.
这样,通过使用混合数据前各图像的初始化标签确定混合交叉熵损失,进而结合其他图像的累加交叉熵损失确定目标交叉熵损失,有助于在训练时生成仍然具有辨别力的鲁棒特性,同时能够防止不正确的反演解,合成具有精确标签信息的目标蒸馏数据。In this way, by using the initialization labels of each image before mixing data to determine the mixed cross-entropy loss, and then combined with the accumulated cross-entropy loss of other images to determine the target cross-entropy loss, it helps to generate robust features that are still discriminative during training, while It can prevent incorrect inversion solutions and synthesize target distillation data with accurate label information.
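A sketch of how the mixed cross-entropy loss of formula (4) could be evaluated; `model`, the label tensors and `beta` are placeholder names, and the weighting follows the linear combination described above:

```python
import torch.nn.functional as F

def mixed_cross_entropy(model, mixed_images, y_first, y_second, beta):
    """Mixed CE loss of formula (4): a beta-weighted sum of the two images' losses.

    mixed_images: the synthetic image(s) obtained by mixing a first and a second image.
    y_first / y_second: initialization labels of the first / second image (Y_2 / Y_1).
    beta: ratio between the area of the mixed region and the area of the second image.
    """
    logits = model(mixed_images)
    loss_first = F.cross_entropy(logits, y_first)     # first cross-entropy loss
    loss_second = F.cross_entropy(logits, y_second)   # second cross-entropy loss
    return beta * loss_first + (1.0 - beta) * loss_second
```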
步骤S360,基于每一所述预训练模型的批归一化统计损失和所述目标交叉熵损失,对每一批所述第一蒸馏数据进行反向传播训练,得到目标蒸馏数据。Step S360, based on the batch normalized statistical loss and the target cross-entropy loss of each pre-trained model, perform backpropagation training on each batch of the first distillation data to obtain target distillation data.
这里，通过结合目标交叉熵损失和批归一化统计损失得到一批第一蒸馏数据针对每一预训练模型的第一损失，进而对每一所述预训练模型对应的第一损失进行线性整合，可以平均每个预训练模型产生的特征偏差，从而使得最终得到的目标蒸馏数据更接近真实数据且更通用。Here, the target cross-entropy loss and the batch normalization statistical loss are combined to obtain the first loss of a batch of first distillation data for each pre-training model, and the first losses corresponding to the pre-training models are then linearly integrated. This averages the feature deviation produced by each pre-training model, so that the resulting target distillation data is closer to the real data and more general.
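A sketch of this linear integration across the sampled pre-trained models; `ce_loss_fn` and `bn_loss_fn` are placeholder callables for the target cross-entropy loss and the batch normalization statistical loss, and the simple mean follows the averaging described above:

```python
def target_loss(models, distill_batch, ce_loss_fn, bn_loss_fn):
    """Average the per-model first losses over all sampled pre-trained models."""
    per_model_losses = []
    for model in models:
        # First loss of this batch for one pre-trained model.
        first_loss = ce_loss_fn(model, distill_batch) + bn_loss_fn(model, distill_batch)
        per_model_losses.append(first_loss)
    return sum(per_model_losses) / len(per_model_losses)   # linear integration (mean)
```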
这里,由于利用多个预训练模型的特征,使得生成的目标蒸馏数据可以匹配多个模型的特征分布,同时通过数据混合使得目标蒸馏数据更加鲁棒且拥有更多视觉信息。在得到目标蒸馏数据之后,可以直接用于对预训练模型或其他未识别的模型进行压缩。Here, due to the use of the features of multiple pre-trained models, the generated target distillation data can match the feature distribution of multiple models, and at the same time, the target distillation data is more robust and has more visual information through data mixing. After the target distillation data is obtained, it can be directly used to compress the pre-trained model or other unidentified models.
在一些实施方式中,在至少一批所述目标蒸馏数据的数量达到特定阈值的情况下,对每一所述预训练模型进行模型压缩。这样,由于生成的目标蒸馏数据拥有更多视觉信息且通用,可以直接用于模型压缩过程,简化压缩流程。In some implementations, when the quantity of at least one batch of the target distillation data reaches a specific threshold, model compression is performed on each of the pre-trained models. In this way, since the generated target distillation data has more visual information and is universal, it can be directly used in the model compression process, simplifying the compression process.
在一些实施方式中,在一个新模型压缩需要提出的情况下,可以直接利用已生成的目标蒸馏数据进行压缩,无需再进行蒸馏数据的生成过程。本公开实施例生成的目标蒸馏数据足够准确,只需要生成一次目标蒸馏数据集(包括足够量的目标蒸馏数据)就可以适用所有预训练模型以及泛化到未识别的模型中。In some implementations, when a new model compression needs to be proposed, the generated target distillation data can be directly used for compression, without further performing the distillation data generation process. The target distillation data generated by the embodiment of the present disclosure is accurate enough, and it only needs to generate a target distillation data set (including a sufficient amount of target distillation data) once to apply to all pre-trained models and generalize to unidentified models.
在一些实施方式中，所述模型压缩操作可以包括网络量化(Network Quantization)、网络剪枝(Network Pruning)、知识蒸馏(Knowledge Distillation)等。剪枝技术过程可以基于目标蒸馏数据集完成剪枝决策和剪枝后重建；模型量化过程可以基于目标蒸馏数据集来完成训练中量化过程或者基于目标蒸馏数据集实现后训练量化的校准过程；知识蒸馏过程则可以将目标蒸馏数据集分别送入教师网络和学生网络完成知识迁移的过程。In some implementations, the model compression operation may include network quantization, network pruning, knowledge distillation, and the like. The pruning process can complete the pruning decision and post-pruning reconstruction based on the target distillation data set; the model quantization process can complete quantization during training based on the target distillation data set, or perform the calibration of post-training quantization based on the target distillation data set; and the knowledge distillation process can feed the target distillation data set into the teacher network and the student network respectively to complete the knowledge transfer process.
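As one illustration of the knowledge distillation use case, the target distillation data set can be fed to a teacher network and a student network; the KL-divergence objective and temperature below are common choices for knowledge transfer, not requirements of the present disclosure:

```python
import torch
import torch.nn.functional as F

def kd_step(teacher, student, distilled_batch, temperature=4.0):
    """One knowledge-transfer step driven purely by target distillation data."""
    with torch.no_grad():
        teacher_logits = teacher(distilled_batch)
    student_logits = student(distilled_batch)
    # Soft-label distillation loss between teacher and student predictions.
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean",
    ) * temperature ** 2
    return loss
```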
在本公开实施例中,通过在每一次迭代训练中对一批第一蒸馏数据进行随机数据混合,可以缩小反演解空间,同时使得训练出来的目标蒸馏数据的视觉信息更加鲁棒,能够适应不同尺度和空间方位。同时结合批归一化统计损失和目标交叉熵损失确定一批第一蒸馏数据通过多个预训练模型的目标损失。从而使得最终得到的目标蒸馏数据更接近真实数据。In the embodiment of the present disclosure, by randomly mixing a batch of first distillation data in each iterative training, the inversion solution space can be reduced, and at the same time, the visual information of the trained target distillation data is more robust and adaptable to Different scales and spatial orientations. Simultaneously combine the batch normalized statistical loss and the target cross-entropy loss to determine the target loss of a batch of first-distilled data through multiple pre-trained models. So that the final target distillation data is closer to the real data.
基于图3，图4A为本公开实施例提供的数据蒸馏的方法的流程示意图，如图4A所示，上述步骤S330“在每一次迭代训练中，对每一批第二蒸馏数据中每两个图像数据进行混合，得到一批第一蒸馏数据”可以通过以下步骤实现：Based on Fig. 3, Fig. 4A is a schematic flowchart of the data distillation method provided by an embodiment of the present disclosure. As shown in Fig. 4A, the above step S330 of "in each iterative training, mixing every two image data in each batch of the second distillation data to obtain a batch of first distillation data" can be implemented through the following steps:
步骤S410,从每一批所述第二蒸馏数据中随机选取至少一对第一图像和第二图像;Step S410, randomly selecting at least one pair of the first image and the second image from each batch of the second distillation data;
这里,每一对第一图像和第二图像是任意的两个第二蒸馏数据。因为在每轮迭代过程中,第一图像和第二图像都是随机选取并混合的,因此经过训练得到的目标蒸馏数据的视觉信息会更鲁棒,能够适应不同尺度和空间方位。Here, each pair of the first image and the second image is any two second distillation data. Because the first image and the second image are randomly selected and mixed during each round of iteration, the visual information of the target distillation data obtained after training will be more robust and able to adapt to different scales and spatial orientations.
步骤S420,将每一所述第一图像的尺寸按照特定比例缩小;Step S420, reducing the size of each of the first images according to a specific ratio;
这里,所述特定比例为第二图像上随机覆盖区域的边界框与原始图像数据的尺寸大小之间的比值。Here, the specific ratio is a ratio between the bounding box of the random coverage area on the second image and the size of the original image data.
步骤S430,将缩小后的所述第一图像随机覆盖到对应的所述第二图像中,得到每一批所述第一蒸馏数据。In step S430, the reduced first image is randomly overlaid on the corresponding second image to obtain each batch of the first distillation data.
这里,通过将缩小后的第一图像覆盖的第二图像的任意区域,使得覆盖后的合成图像包括两个图像数据的信息。Here, by covering any area of the second image with the reduced first image, the overlaid composite image includes information of two image data.
如图4B所示,将第一图像41进行等比例缩小到固定尺寸,然后将缩小后的第一图像42随机覆盖到第二图像43上。此时覆盖后的合成图像44将包含第一图像41和第二图像43的所有信息。As shown in FIG. 4B , the first image 41 is proportionally reduced to a fixed size, and then the reduced first image 42 is randomly overlaid on the second image 43 . At this time, the overlaid composite image 44 will contain all the information of the first image 41 and the second image 43 .
在一些实施方式中,所述步骤S430可以通过以下过程实现:按照所述特定比例,在对应的所述第二图像中随机生成待覆盖的混合区域;基于所述混合区域的边界框,随机生成二进制掩码;通过所述二进制掩码对缩小后的所述第一图像的每一像素值和所述第二图像中对应的像素值进行叠加,得到每一批所述第一蒸馏数据。在实施中,通过以下公式(5)和(6)实现任意两个图像数据的混合:In some implementation manners, the step S430 can be implemented through the following process: according to the specific ratio, randomly generate a mixed area to be covered in the corresponding second image; based on the bounding box of the mixed area, randomly generate A binary mask: superimposing each pixel value of the reduced first image with a corresponding pixel value in the second image by using the binary mask to obtain each batch of the first distillation data. In implementation, the mixing of any two image data is realized by the following formulas (5) and (6):
$$\alpha_{ij}=\begin{cases}1,&(i,j)\ \text{inside the bounding box}\\[2pt]0,&(i,j)\ \text{outside the bounding box}\end{cases}\qquad(5)$$

其中，$x_l$、$x_r$、$y_d$、$y_u$依次是混合区域的边界框的左框、右框、上框、下框的边界，$\alpha_{ij}$为二进制掩码中的元素：如果位置$(i,j)$位于边界框中则取值为1，如果位于边界框之外则取值为0。Here, $x_l$, $x_r$, $y_d$, $y_u$ are, in order, the left, right, top and bottom boundaries of the bounding box of the mixed region, and $\alpha_{ij}$ is an element of the binary mask: it takes the value 1 if position $(i,j)$ lies inside the bounding box, and 0 if it lies outside the bounding box.

$$\tilde{X}=(1-\alpha)\odot X_{1}+\alpha\odot g(X_{2})\qquad(6)$$

其中，$X_2$、$X_1$分别是待混合的第一图像和第二图像，$\tilde{X}$为将缩小后的第一图像覆盖到第二图像上得到的合成图像，$g(\cdot)$是一个线性插值函数，可以将第一图像$X_2$调整为与边界框相同的大小并置于边界框所在区域，$\odot$表示逐元素相乘。Here, $X_2$ and $X_1$ are the first image and the second image to be mixed respectively, $\tilde{X}$ is the composite image obtained by overlaying the reduced first image onto the second image, $g(\cdot)$ is a linear interpolation function that resizes the first image $X_2$ to the same size as the bounding box (placed within the bounding-box region), and $\odot$ denotes element-wise multiplication.
这样，通过随机生成二进制掩码叠加缩小后的第一图像的像素值和第二图像的像素值，实现一批第二蒸馏数据中随机的两张图像的混合，使得混合数据即一批第一蒸馏数据经过预训练模型训练优化时仍然具有辨别力。In this way, by randomly generating a binary mask and superimposing the pixel values of the reduced first image and the pixel values of the second image, two randomly chosen images in a batch of second distillation data are mixed, so that the mixed data, i.e. a batch of first distillation data, remains discriminative when optimized through training with the pre-trained models.
在本公开实施例中,在数据混合中将第一图像进行等比例缩小并随机覆盖到第二图像中,此时混合后的合成图像将包含第一图像和第二图像的信息。在训练一批第一蒸馏数据的过程中,确保两个混合的图像信息能够被预训练模型识别并正确优化。In the embodiment of the present disclosure, the first image is scaled down and randomly overlaid into the second image during data mixing, and the mixed composite image will contain information of the first image and the second image. In the process of training a batch of first-distilled data, it is ensured that the two mixed image information can be recognized and optimized correctly by the pre-trained model.
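A sketch of this mixing operation, corresponding to formulas (5) and (6): the first image is resized by linear interpolation and pasted into a randomly placed bounding box of the second image. The image shapes, the scale factor and the box sampling are illustrative assumptions:

```python
import random
import torch
import torch.nn.functional as F

def mix_pair(x_first, x_second, scale=0.5):
    """Overlay a downscaled copy of x_first onto a random region of x_second.

    x_first, x_second: tensors of shape (3, H, W). Returns the composite image
    and beta, the area ratio of the mixed region to the second image.
    """
    _, h, w = x_second.shape
    bh, bw = int(h * scale), int(w * scale)                # bounding-box size
    top = random.randint(0, h - bh)                        # random box position
    left = random.randint(0, w - bw)
    # g(.): linear interpolation resizing the first image to the box size.
    patch = F.interpolate(x_first.unsqueeze(0), size=(bh, bw),
                          mode="bilinear", align_corners=False).squeeze(0)
    alpha = torch.zeros(1, h, w)                           # binary mask of formula (5)
    alpha[:, top:top + bh, left:left + bw] = 1.0
    padded = torch.zeros_like(x_second)                    # resized first image placed in the box
    padded[:, top:top + bh, left:left + bw] = patch
    mixed = (1.0 - alpha) * x_second + alpha * padded      # composite image of formula (6)
    beta = (bh * bw) / (h * w)
    return mixed, beta
```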
下面结合一个具体实施例对上述数据蒸馏的方法进行说明,然而值得注意的是,该具体实施例仅是为了更好地说明本公开,并不构成对本公开的不当限定。The above data distillation method will be described below in conjunction with a specific embodiment, however, it should be noted that this specific embodiment is only for better illustrating the present disclosure, and does not constitute an improper limitation of the present disclosure.
图5A为本公开实施例提供的数据蒸馏的方法的逻辑流程图,如图5A所示,所述方法至少包括以下步骤:FIG. 5A is a logic flow diagram of a data distillation method provided by an embodiment of the present disclosure. As shown in FIG. 5A, the method includes at least the following steps:
步骤S510,用原始数据的高斯分布初始化一批蒸馏数据;Step S510, using the Gaussian distribution of the original data to initialize a batch of distillation data;
这里，蒸馏过程需要先使用高斯分布初始化一批蒸馏数据(相当于第二蒸馏数据)，高斯分布采用原始数据的统计值对一批蒸馏数据进行初始化。Here, the distillation process needs to first initialize a batch of distillation data (equivalent to the second distillation data) with a Gaussian distribution; the Gaussian distribution uses the statistical values of the original data to initialize the batch of distillation data.
在实施中,首先获取原始训练数据的分布信息,然后利用高斯分布的统计值对一批蒸馏数据进行初始化。In the implementation, the distribution information of the original training data is obtained first, and then a batch of distilled data is initialized with the statistical value of the Gaussian distribution.
步骤S520,从预训练模型簇里面随机采样三个预训练模型;Step S520, randomly sampling three pre-training models from the pre-training model cluster;
步骤S530,使用数据混合和特征混合,通过三个预训练模型训练一批蒸馏数据;Step S530, using data mixing and feature mixing to train a batch of distillation data through three pre-trained models;
这里,本公开实施例采用随机梯度下降方法来训练混合后的一批蒸馏数据(相当于第一蒸馏数据)。Here, the embodiment of the present disclosure adopts a stochastic gradient descent method to train a batch of mixed distillation data (equivalent to the first distillation data).
步骤S540,重复上述训练过程,直到生成指定量的目标蒸馏数据。Step S540, repeating the above training process until a specified amount of target distillation data is generated.
这里,若完成所有目标蒸馏数据的生成则退出,反之则重复步骤S520至步骤S530。Here, if the generation of all target distillation data is completed, exit, otherwise, repeat steps S520 to S530.
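Putting the steps of Fig. 5A together, a highly simplified outer loop might look as follows; the model pool, batch size, iteration count and the helper callables are assumptions carried over from the earlier sketches rather than part of the disclosed method:

```python
import random
import torch

def generate_distillation_dataset(model_pool, init_batch_fn, mix_fn, loss_fn,
                                  num_batches, iters_per_batch=20000, lr=0.25):
    """Repeat the per-batch training process until the specified amount of
    target distillation data has been generated (steps S510-S540)."""
    dataset = []
    for _ in range(num_batches):
        batch = init_batch_fn()                                 # S510: Gaussian initialization
        models = random.sample(model_pool, 3)                   # S520: sample three models
        for _ in range(iters_per_batch):                        # S530: train this batch
            mixed_batch, labels, beta = mix_fn(batch)           # data mixing
            loss = loss_fn(models, mixed_batch, labels, beta)   # feature mixing (target loss)
            grad = torch.autograd.grad(loss, batch)[0]
            with torch.no_grad():
                batch -= lr * grad                              # gradient-descent update
        dataset.append(batch.detach())                          # S540: keep the trained batch
    return dataset
```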
由于生成的目标蒸馏数据足够准确且通用,本公开实施例生成的目标蒸馏数据甚至可以泛化到未识别过的模型中。因此,当一个新的模型压缩需求提出时,可以直接用已生成的目标蒸馏数据进行压缩。Since the generated target distillation data is sufficiently accurate and general, the target distillation data generated in the embodiments of the present disclosure can even be generalized to unidentified models. Therefore, when a new model compression requirement is proposed, the generated target distillation data can be directly used for compression.
图5B为本公开实施例提供的结合数据混合和特征混合的算法框架,如图5B所示,该框架中的实线箭头体现了前向训练流向,虚线箭头体现了后向反演过程的流向。训练流程中会有数据混合51和特征混合52两大流程。其中数据混合51用于生成鲁棒的视觉信息,特征混合52用于生成通用可泛化的图片。FIG. 5B is an algorithm framework combining data mixing and feature mixing provided by an embodiment of the present disclosure. As shown in FIG. 5B , the solid arrows in the framework reflect the forward training flow, and the dotted arrows reflect the backward inversion process flow. . There are two major processes of data mixing 51 and feature mixing 52 in the training process. Among them, the data mixture 51 is used to generate robust visual information, and the feature mixture 52 is used to generate generalizable images.
在前向训练过程中,首先进行数据混合51流程,从一批蒸馏数据中随机选取第一图像501和第二图像502进行混合,得到合成图像503。然后进行特征混合52流程,将包括合成图像503在内的一批蒸馏数据分别输入到预训练模型504、预训练模型505和预训练模型506(示例的三个模型,但不限定,可以为多个)中,计算经过三个预训练模型之后的目标交叉熵损失507和批归一化统计损失508。其中,每一预训练模型的每个组成块中均包括激活层61、批归一化层62和卷积层63,批归一化层62中统计了一批蒸馏数据的均值和方差(μ,σ2),可以将每一预训练模型的所有组成块中存储的均值和方差,提取出来确定上述批归一化统计损失508。In the forward training process, the data mixing 51 process is firstly performed, and the first image 501 and the second image 502 are randomly selected from a batch of distillation data for mixing to obtain a composite image 503 . Then carry out the process of feature mixing 52, and input a batch of distillation data including the composite image 503 into the pre-training model 504, the pre-training model 505 and the pre-training model 506 respectively (the three models of the example, but not limited, can be multiple ), calculate the target cross-entropy loss 507 and the batch normalization statistical loss 508 after three pre-trained models. Wherein, each component block of each pre-training model includes an activation layer 61, a batch normalization layer 62 and a convolutional layer 63, and the batch normalization layer 62 counts the mean and variance (μ , σ2), the mean and variance stored in all the blocks of each pre-training model can be extracted to determine the above-mentioned batch normalization statistical loss 508 .
在后向反演过程中,利用上一次迭代训练过程确定的目标交叉熵损失507和批归一化统计损失508进一步确定目标损失,以利用目标损失优化合成图像503。由于预训练模型已经捕获了蒸馏数据的图像类别信息,因此可以通过为合成图像503分配混合标签来反演知识。In the backward inversion process, the target loss is further determined using the target cross-entropy loss 507 and the batch normalization statistical loss 508 determined in the previous iterative training process, so as to optimize the composite image 503 using the target loss. Since the pre-trained model already captures the image category information of the distilled data, knowledge can be retrieved by assigning mixed labels to the synthesized images 503 .
如图5C所示,该前向训练流程包括以下步骤:As shown in Figure 5C, the forward training process includes the following steps:
步骤S5301,将一批蒸馏数据进行随机数据混合;Step S5301, performing random data mixing on a batch of distillation data;
在数据混合中,从初始化后的一批蒸馏数据中任取两个图像数据进行随机混合。此时合成图像将包含两个图像数据的信息。在训练混合后的一批蒸馏数据的过程中,确保两个图像数据的信息能够被模型识别并正确优化。In data mixing, two image data are randomly selected from the initialized batch of distilled data for random mixing. At this time, the composite image will contain the information of the two image data. In the process of training the mixed batch of distilled data, it is ensured that the information of the two image data can be recognized by the model and optimized correctly.
步骤S5302,将混合后的一批蒸馏数据分别输入三个预训练模型中,确定当前统计量与模型存储的原始统计量之间的差异;Step S5302, input the mixed batch of distillation data into the three pre-training models respectively, and determine the difference between the current statistic and the original statistic stored in the model;
这里,当前统计量为一批蒸馏数据的第二统计信息,原始统计量为模型中存储的原始数据的第一统计信息。Here, the current statistic is the second statistic of a batch of distillation data, and the original statistic is the first statistic of the original data stored in the model.
在特征混合中,计算一批蒸馏数据在预训练模型中的统计信息。预训练模型的每一层中都存储了用于原始数据的第一统计信息即均值和方差。当输入一批蒸馏数据时,如果一批蒸馏数据在预训练模型中的均值方差和原始数据的均值方差相差不大,那么认为一批蒸馏数据和原始数据已经非常相似。若相差仍然很大,则需要继续最小化它们之间的误差。In Feature Mixing, compute statistics on a batch of distilled data in a pretrained model. In each layer of the pretrained model, the first statistics for the original data, namely the mean and variance, are stored. When a batch of distillation data is input, if the mean variance of a batch of distillation data in the pre-training model is not much different from the mean variance of the original data, then the batch of distillation data is considered to be very similar to the original data. If the difference is still large, you need to continue to minimize the error between them.
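One common way to realize this statistics matching (an assumption, not necessarily the exact implementation of the present disclosure) is to read the running mean and variance stored in every batch normalization layer and compare them with the statistics of the distillation data at the same layer, for example via forward hooks:

```python
import torch
import torch.nn as nn

def bn_statistical_loss(model, distill_batch):
    """Sum of distances between the batch statistics of the distillation data and
    the original-data statistics stored in each BatchNorm2d layer."""
    losses = []

    def make_hook(bn):
        def hook(module, inputs, output):
            x = inputs[0]
            mean = x.mean(dim=(0, 2, 3))                     # second statistics (current batch)
            var = x.var(dim=(0, 2, 3), unbiased=False)
            losses.append(torch.norm(mean - bn.running_mean)
                          + torch.norm(var - bn.running_var))
        return hook

    handles = [m.register_forward_hook(make_hook(m))
               for m in model.modules() if isinstance(m, nn.BatchNorm2d)]
    model(distill_batch)                                      # forward pass collects the losses
    for h in handles:
        h.remove()
    return sum(losses)
```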
值得注意的是,在特征混合中需要同时匹配一批蒸馏数据和三个预训练模型的特征,这使得生成出来的目标蒸馏数据相比于单模型蒸馏能够更加通用。本公开实施例利用多个预训练模型的特征,使得生成的目标蒸馏数据匹配出多个模型的特征分布。这使得目标蒸馏数据更通用,效果也更好。It is worth noting that in feature mixing, it is necessary to match a batch of distillation data and the features of the three pre-trained models at the same time, which makes the generated target distillation data more versatile than single-model distillation. The embodiment of the present disclosure utilizes the features of multiple pre-trained models, so that the generated target distillation data matches the feature distribution of multiple models. This makes target distillation data more general and effective.
步骤S5303,利用差异反向传播计算梯度,对混合后的一批蒸馏数据进行梯度下降更新;Step S5303, using difference backpropagation to calculate the gradient, and performing gradient descent update on the mixed batch of distillation data;
步骤S5304,判断迭代训练是否完毕。Step S5304, judging whether the iterative training is completed.
这里,在迭代两万轮左右的情况下可以使得目标蒸馏数据与原始数据看起来一致,从而迭代训练完毕;否则,每一次训练之后继续执行步骤S5301。Here, in the case of iterating about 20,000 rounds, the target distillation data can be made to look consistent with the original data, so that the iterative training is completed; otherwise, step S5301 is continued after each training.
相关技术中无数据压缩方法生成的蒸馏数据只能用于本模型的压缩,而且由于蒸馏过程是不可逆运算,存在很多不合理的视觉信息,使得生成的蒸馏数据无法迁移,需要反复生成。The distillation data generated by the non-data compression method in the related art can only be used for the compression of this model, and because the distillation process is an irreversible operation, there is a lot of unreasonable visual information, so that the generated distillation data cannot be transferred and needs to be generated repeatedly.
本公开实施例通过使用数据混合和特征混合训练蒸馏数据,使得生成的目标蒸馏数据具有更加鲁 棒的视觉信息和特征,因此该目标蒸馏数据非常通用,可以用于不同模型。同时本公开实施例只需生成一次目标蒸馏数据就可以一直使用。In the embodiments of the present disclosure, by using data mixing and feature mixing to train distillation data, the generated target distillation data has more robust visual information and features, so the target distillation data is very versatile and can be used for different models. At the same time, the embodiment of the present disclosure only needs to generate the target distillation data once, and then it can be used all the time.
本公开实施例利用多个预训练模型提升蒸馏数据生成时的通用性,同时利用数据信息混合生成鲁棒视觉信息。在实际使用中,对于常见的原始数据集,可以提前生成蒸馏数据集,直接用于模型压缩,如量化和剪枝。对于未见过的原始数据集,则需要生成蒸馏数据才能使用。The embodiments of the present disclosure use multiple pre-trained models to improve the versatility of distillation data generation, and at the same time use data information mixture to generate robust visual information. In practical use, for common raw data sets, distillation data sets can be generated in advance and directly used for model compression, such as quantization and pruning. For unseen raw data sets, distillation data needs to be generated before use.
基于前述的实施例，本公开实施例再提供一种数据蒸馏的装置，所述装置包括所包括的各模块、以及各模块所包括的各子模块以及各单元，可以通过电子设备中的处理器来实现；当然也可通过具体的逻辑电路实现；在实施的过程中，处理器可以为中央处理器(Central Processing Unit，CPU)、微处理器(Micro Processing Unit，MPU)、数字信号处理器(Digital Signal Processor，DSP)或现场可编程门阵列(Field Programmable Gate Array，FPGA)等。Based on the foregoing embodiments, an embodiment of the present disclosure further provides a data distillation apparatus. The modules included in the apparatus, the sub-modules included in each module, and the units may be implemented by a processor in an electronic device, and of course may also be implemented by specific logic circuits. In the implementation process, the processor may be a Central Processing Unit (CPU), a Micro Processing Unit (MPU), a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA), or the like.
图6为本公开实施例提供的一种数据蒸馏的装置的组成结构示意图,如图6所示,所述装置600包括第一确定模块610、第二确定模块620、第三确定模块630、第四确定模块640和训练模块650,其中:Fig. 6 is a schematic diagram of the composition and structure of a data distillation device provided by an embodiment of the present disclosure. As shown in Fig. 6, the device 600 includes a first determination module 610, a second determination module 620, a third determination module 630, a Four determination module 640 and training module 650, wherein:
所述第一确定模块610,配置为确定至少一批待训练的第一蒸馏数据;每一批所述第一蒸馏数据中存在至少一个包括两种数据标签信息的混合数据;The first determination module 610 is configured to determine at least one batch of first distillation data to be trained; in each batch of the first distillation data, there is at least one mixed data including two kinds of data label information;
所述第二确定模块620,配置为确定至少两个预训练模型;其中,每一所述预训练模型中存储原始数据的第一统计信息;The second determination module 620 is configured to determine at least two pre-training models; wherein, each of the pre-training models stores the first statistical information of the original data;
所述第三确定模块630,配置为基于每一所述预训练模型中的第一统计信息,确定每一批所述第一蒸馏数据在相应预训练模型中的批归一化统计损失;The third determination module 630 is configured to determine the batch normalized statistical loss of each batch of the first distillation data in the corresponding pre-training model based on the first statistical information in each of the pre-training models;
所述第四确定模块640,配置为基于每一批所述第一蒸馏数据中每一数据的初始化标签,确定每一批所述第一蒸馏数据在每一所述预训练模型中的目标交叉熵损失;The fourth determination module 640 is configured to determine the target intersection of each batch of the first distillation data in each of the pre-training models based on the initialization label of each data in each batch of the first distillation data entropy loss;
所述训练模块650,配置为基于每一批所述第一蒸馏数据在每一所述预训练模型中的批归一化统计损失和所述目标交叉熵损失,对每一批所述第一蒸馏数据进行反向传播训练,得到目标蒸馏数据。The training module 650 is configured to, based on the batch normalized statistical loss and the target cross-entropy loss of each batch of the first distillation data in each of the pre-training models, for each batch of the first distillation data The distillation data is backpropagated for training to obtain the target distillation data.
在一些可能的实施例中,所述第三确定模块630包括第一确定子模块和第二确定子模块,其中:所述第一确定子模块,确定每一批所述第一蒸馏数据在每一所述预训练模型中的第二统计信息;所述第二确定子模块,针对每一所述预训练模型,确定所述第一统计信息与所述第二统计信息之间的批归一化统计损失。In some possible embodiments, the third determining module 630 includes a first determining submodule and a second determining submodule, wherein: the first determining submodule determines that each batch of the first distillation data is The second statistical information in the pre-training model; the second determination submodule, for each of the pre-training models, determines the batch normalization between the first statistical information and the second statistical information Statistical loss.
在一些可能的实施例中,所述第二确定模块620还配置为从预训练模型库中随机选择至少两个不同类型的预训练模型。In some possible embodiments, the second determining module 620 is further configured to randomly select at least two different types of pre-training models from the pre-training model library.
在一些可能的实施例中,所述训练模块650包括第三确定子模块、处理子模块和训练子模块,其中:所述第三确定子模块,配置为基于每一所述预训练模型的批归一化统计损失和所述目标交叉熵损失,确定相应预训练模型对应的第一损失;所述处理子模块,配置为对各个所述预训练模型对应的所述第一损失求均值,得到每一批所述第一蒸馏数据经过所述至少两个预训练模型的目标损失;所述训练子模块,配置为基于所述目标损失,对每一批所述第一蒸馏数据进行反向传播训练,得到所述目标蒸馏数据。In some possible embodiments, the training module 650 includes a third determination submodule, a processing submodule and a training submodule, wherein: the third determination submodule is configured to Normalizing the statistical loss and the target cross-entropy loss to determine the first loss corresponding to the corresponding pre-training model; the processing submodule is configured to average the first loss corresponding to each of the pre-training models to obtain Each batch of the first distillation data undergoes the target loss of the at least two pre-trained models; the training submodule is configured to perform backpropagation on each batch of the first distillation data based on the target loss Training, obtain the target distillation data.
在一些可能的实施例中,所述第一确定模块包括初始化子模块和混合子模块,其中:所述初始化子模块,配置为基于原始数据的分布信息,初始化至少一批第二蒸馏数据;所述混合子模块,配置为在每一次迭代训练中,对每一批所述第二蒸馏数据中每两个图像数据进行混合,得到每一批所述第一蒸馏数据。In some possible embodiments, the first determination module includes an initialization submodule and a mixing submodule, wherein: the initialization submodule is configured to initialize at least one batch of second distillation data based on the distribution information of the original data; The mixing sub-module is configured to mix every two image data in each batch of the second distillation data to obtain each batch of the first distillation data in each iterative training.
在一些可能的实施例中,所述混合子模块包括选取单元、缩放单元和覆盖单元,其中:所述选取单元,配置为从每一批所述第二蒸馏数据中随机选取至少一对第一图像和第二图像;所述缩放单元,配置为将每一所述第一图像的尺寸按照特定比例缩小;所述覆盖单元,配置为将缩小后的所述第一图像随机覆盖到对应的所述第二图像中,得到每一批所述第一蒸馏数据。In some possible embodiments, the mixing submodule includes a selection unit, a scaling unit, and a covering unit, wherein: the selection unit is configured to randomly select at least one pair of first distillation data from each batch of the second distillation data image and a second image; the scaling unit is configured to reduce the size of each of the first images according to a specific ratio; the covering unit is configured to randomly cover the reduced first image to the corresponding In the second image, each batch of the first distillation data is obtained.
在一些可能的实施例中,所述覆盖单元包括第一生成子单元、第二生成子单元和叠加子单元,其中:所述第一生成子单元,配置为按照所述特定比例,在对应的所述第二图像中随机生成待覆盖的混 合区域;所述第二生成子单元,配置为基于所述混合区域的边界框,随机生成二进制掩码;所述叠加子单元,配置为通过所述二进制掩码对缩小后的所述第一图像的每一像素值和所述第二图像中对应的像素值进行叠加,得到每一批所述第一蒸馏数据。In some possible embodiments, the covering unit includes a first generating subunit, a second generating subunit, and a superposition subunit, wherein: the first generating subunit is configured to, according to the specific ratio, in the corresponding The mixed region to be covered is randomly generated in the second image; the second generating subunit is configured to randomly generate a binary mask based on the bounding box of the mixed region; the superimposing subunit is configured to pass the The binary mask superimposes each pixel value of the reduced first image and the corresponding pixel value in the second image to obtain each batch of the first distillation data.
在一些可能的实施例中,所述每一批第一蒸馏数据中包括由第一图像和第二图像混合得到的合成图像,所述第四确定模块640包括第四确定子模块、第五确定子模块和第六确定子模块,其中:所述第四确定子模块,配置为基于所述第一图像的初始化标签和所述第二图像的初始化标签,确定所述合成图像的混合交叉熵损失;所述第五确定子模块,配置为基于所述每一批第一蒸馏数据中除所述第一图像和所述第二图像之外的其他图像的初始化标签,确定所述其他图像的累计交叉熵损失;所述第六确定子模块,配置为基于所述混合交叉熵损失和所述累计交叉熵损失,确定每一批所述第一蒸馏数据经过每一所述预训练模型后的目标交叉熵损失。In some possible embodiments, each batch of first distillation data includes a synthetic image obtained by mixing the first image and the second image, and the fourth determining module 640 includes a fourth determining submodule, a fifth determining The submodule and the sixth determination submodule, wherein: the fourth determination submodule is configured to determine the hybrid cross-entropy loss of the composite image based on the initialization label of the first image and the initialization label of the second image ; The fifth determination submodule is configured to determine the accumulation of the other images based on the initialization labels of the other images in each batch of first distillation data except the first image and the second image Cross-entropy loss; the sixth determining submodule is configured to determine the target of each batch of the first distillation data after passing through each of the pre-training models based on the mixed cross-entropy loss and the cumulative cross-entropy loss Cross entropy loss.
在一些可能的实施例中,所述第四确定子模块包括第一确定单元、第二确定单元和求和单元,其中:所述第一确定单元,配置为基于所述第一图像的初始化标签,确定每一批所述第一蒸馏数据通过每一所述预训练模型后的第一交叉熵损失;第二确定单元,配置为基于所述第二图像的初始化标签,确定每一批所述第一蒸馏数据通过每一所述预训练模型后的第二交叉熵损失;所述求和单元,配置为基于所述第一图像与所述第二图像之间的合成比例,对所述第一交叉熵损失和所述第二交叉损失进行线性求和,得到所述目标交叉熵损失。In some possible embodiments, the fourth determining submodule includes a first determining unit, a second determining unit, and a summing unit, wherein: the first determining unit is configured to initialize labels based on the first image , to determine the first cross-entropy loss after each batch of the first distillation data passes through each of the pre-training models; the second determining unit is configured to determine each batch of the The second cross-entropy loss after the first distillation data passes through each of the pre-training models; the summation unit is configured to calculate the first image based on the synthesis ratio between the first image and the second image. A cross-entropy loss and the second cross-entropy loss are linearly summed to obtain the target cross-entropy loss.
在一些可能的实施例中,所述第一统计信息包括均值和方差,每一所述预训练模型包括至少两个组成块,每一所述组成块包括卷积层和批归一化层,所述第三确定模块630还包括提取子模块和累加子模块,其中:所述提取子模块,配置为从每一所述组成块的批归一化层中提取所述原始数据的均值和方差;所述均值和方差是通过所述卷积层提取的所述原始数据的特征进行统计得到的;所述累加子模块,配置为基于每一所述预训练模型中所述至少两个组成块的均值和方差,确定每一所述预训练模型中的第一统计信息。In some possible embodiments, the first statistical information includes mean and variance, each of the pre-training models includes at least two building blocks, and each of the building blocks includes a convolutional layer and a batch normalization layer, The third determination module 630 also includes an extraction submodule and an accumulation submodule, wherein: the extraction submodule is configured to extract the mean value and variance of the original data from the batch normalization layer of each of the constituent blocks ; The mean value and variance are statistically obtained through the features of the original data extracted by the convolution layer; the accumulation sub-module is configured to be based on the at least two constituent blocks in each of the pre-training models The mean and variance of , determine the first statistical information in each of said pre-trained models.
在一些可能的实施例中,所述分布信息为高斯分布信息,所述初始化子模块包括获取单元和初始化单元,其中:所述获取单元,配置为获取所述原始数据的高斯分布信息;所述初始化单元,配置为基于从所述高斯分布信息中随机采样的数据,对至少N个初始像素值矩阵进行初始化,得到每批所述第二蒸馏数据;N为大于等于2的整数。In some possible embodiments, the distribution information is Gaussian distribution information, and the initialization submodule includes an acquisition unit and an initialization unit, wherein: the acquisition unit is configured to acquire the Gaussian distribution information of the original data; the The initialization unit is configured to initialize at least N initial pixel value matrices based on randomly sampled data from the Gaussian distribution information to obtain each batch of the second distillation data; N is an integer greater than or equal to 2.
在一些可能的实施例中，所述训练子模块包括第三确定单元、更新单元和训练单元，其中：所述第三确定单元，配置为在反向传播过程中的每一次迭代训练中，确定所述目标损失针对所述第一蒸馏数据的梯度；所述更新单元，配置为基于所述梯度，对所述第一蒸馏数据中每一数据进行更新；所述训练单元，配置为在更新后的第一蒸馏数据在每一所述预训练模型中的批归一化统计损失和所述目标交叉熵损失收敛到稳定值时，得到所述目标蒸馏数据。In some possible embodiments, the training submodule includes a third determination unit, an update unit and a training unit, wherein: the third determination unit is configured to, in each iterative training during backpropagation, determine the gradient of the target loss with respect to the first distillation data; the update unit is configured to update each piece of data in the first distillation data based on the gradient; and the training unit is configured to obtain the target distillation data when the batch normalization statistical loss and the target cross-entropy loss of the updated first distillation data in each of the pre-training models converge to stable values.
在一些可能的实施例中,所述装置还包括压缩模块,配置为在至少一批所述目标蒸馏数据的数量达到特定阈值的情况下,对每一所述预训练模型进行模型压缩。In some possible embodiments, the device further includes a compression module configured to perform model compression on each of the pre-trained models when the quantity of at least one batch of the target distillation data reaches a specific threshold.
这里需要指出的是:以上装置实施例的描述,与上述方法实施例的描述是类似的,具有同方法实施例相似的有益效果。对于本公开装置实施例中未披露的技术细节,请参照本公开方法实施例的描述而理解。It should be noted here that: the description of the above device embodiment is similar to the description of the above method embodiment, and has similar beneficial effects as the method embodiment. For technical details not disclosed in the device embodiments of the present disclosure, please refer to the description of the method embodiments of the present disclosure for understanding.
需要说明的是,本公开实施例中,如果以软件功能模块的形式实现上述数据蒸馏的方法,并作为独立的产品销售或使用时,也可以存储在一个计算机可读取存储介质中。基于这样的理解,本公开实施例的技术方案本质上或者说对相关技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得电子设备(可以是具有摄像头的智能手机、平板电脑等)执行本公开各个实施例所述方法的全部或部分。而前述的存储介质包括:U盘、移动硬盘、只读存储器(Read Only Memory,ROM)、磁碟或者光盘等各种可以存储程序代码的介质。这样,本公开实施例不限制于任何预设的硬件和软件结合。It should be noted that, in the embodiments of the present disclosure, if the above data distillation method is implemented in the form of software function modules and sold or used as an independent product, it can also be stored in a computer-readable storage medium. Based on this understanding, the essence of the technical solutions of the embodiments of the present disclosure or the part that contributes to the related technologies can be embodied in the form of software products, the computer software products are stored in a storage medium, and include several instructions to make An electronic device (which may be a smart phone with a camera, a tablet computer, etc.) executes all or part of the methods described in various embodiments of the present disclosure. The aforementioned storage medium includes: various media that can store program codes such as U disk, mobile hard disk, read-only memory (Read Only Memory, ROM), magnetic disk or optical disk. As such, embodiments of the present disclosure are not limited to any predetermined combination of hardware and software.
对应地,本公开实施例提供一种计算机可读存储介质,其上存储有计算机程序,该计算机程序被处理器执行时实现上述实施例中任一所述数据蒸馏的方法中的步骤。对应地,本公开实施例中,还提 供了一种芯片,所述芯片包括可编程逻辑电路和/或程序指令,当所述芯片运行时,用于实现上述实施例中任一所述数据蒸馏的方法中的步骤。对应地,本公开实施例中,还提供了一种计算机程序产品,当该计算机程序产品被电子设备的处理器执行时,其用于实现上述实施例中任一所述数据蒸馏的方法中的步骤。Correspondingly, an embodiment of the present disclosure provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the steps in any one of the methods for data distillation described in the above-mentioned embodiments are implemented. Correspondingly, in the embodiments of the present disclosure, a chip is also provided, the chip includes programmable logic circuits and/or program instructions, and when the chip is running, it is used to implement the data distillation described in any of the above embodiments steps in the method. Correspondingly, in the embodiments of the present disclosure, a computer program product is also provided, and when the computer program product is executed by the processor of the electronic device, it is used to implement the data distillation method in any of the above embodiments. step.
本公开实施例还提供一种计算机程序产品,该计算机程序产品承载有程序代码,所述程序代码包括的指令可用于执行上述方法实施例中任一所述数据蒸馏方法中的步骤。其中,上述计算机程序产品可以具体通过硬件、软件或其结合的方式实现。在一个可选实施例中,所述计算机程序产品具体体现为计算机存储介质,在另一个可选实施例中,计算机程序产品具体体现为软件产品,例如软件开发包(Software Development Kit,SDK)等等。An embodiment of the present disclosure further provides a computer program product, the computer program product carries a program code, and instructions included in the program code can be used to execute the steps in any one of the data distillation methods in the above method embodiments. Wherein, the above-mentioned computer program product may be specifically implemented by means of hardware, software or a combination thereof. In an optional embodiment, the computer program product is embodied as a computer storage medium, and in another optional embodiment, the computer program product is embodied as a software product, such as a software development kit (Software Development Kit, SDK) etc. wait.
本公开实施例还提供一种计算机程序,包括计算机可读代码,当所述计算机可读代码在电子设备中运行时,所述电子设备中的处理器执行用于实现上述方法实施例中任一所述数据蒸馏方法。An embodiment of the present disclosure also provides a computer program, including computer readable codes. When the computer readable codes are run in an electronic device, the processor in the electronic device executes any one of the above method embodiments. The data distillation method.
基于同一技术构思,本公开实施例提供一种电子设备,配置为实施上述方法实施例记载的数据蒸馏的方法。图7为本公开实施例提供的一种电子设备的硬件实体示意图,如图7所示,所述电子设备700包括存储器710和处理器720,所述存储器710存储有可在处理器720上运行的计算机程序,所述处理器720执行所述程序时实现本公开实施例任一所述数据蒸馏的方法中的步骤。Based on the same technical concept, an embodiment of the present disclosure provides an electronic device configured to implement the data distillation method described in the above method embodiment. FIG. 7 is a schematic diagram of hardware entities of an electronic device provided by an embodiment of the present disclosure. As shown in FIG. 7 , the electronic device 700 includes a memory 710 and a processor 720, and the memory 710 stores a A computer program, the processor 720 implements the steps in any one of the data distillation methods in the embodiments of the present disclosure when executing the program.
存储器710配置为存储由处理器720可执行的指令和应用,还可以缓存待处理器720以及电子设备中各模块待处理或已经处理的数据(例如,图像数据、音频数据、语音通信数据和视频通信数据),可以通过闪存(FLASH)或随机访问存储器(Random Access Memory,RAM)实现。The memory 710 is configured to store instructions and applications executable by the processor 720, and can also cache data to be processed or processed by the processor 720 and various modules in the electronic device (for example, image data, audio data, voice communication data and video data). Communication data), which can be realized by flash memory (FLASH) or random access memory (Random Access Memory, RAM).
处理器720执行程序时实现上述任一项的数据蒸馏的方法的步骤。处理器720通常控制电子设备700的总体操作。When the processor 720 executes the program, the steps of any one of the data distillation methods described above are realized. The processor 720 generally controls the overall operation of the electronic device 700 .
上述处理器可以为特定用途集成电路(Application Specific Integrated Circuit,ASIC)、数字信号处理器(Digital Signal Processor,DSP)、数字信号处理装置(Digital Signal Processing Device,DSPD)、可编程逻辑装置(Programmable Logic Device,PLD)、现场可编程门阵列(Field Programmable Gate Array,FPGA)、中央处理器(Central Processing Unit,CPU)、控制器、微控制器、微处理器中的至少一种。可以理解地,实现上述处理器功能的电子器件还可以为其它,本公开实施例不作具体限定。The above-mentioned processor can be an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a digital signal processor (Digital Signal Processor, DSP), a digital signal processing device (Digital Signal Processing Device, DSPD), a programmable logic device (Programmable Logic At least one of Device, PLD), Field Programmable Gate Array (Field Programmable Gate Array, FPGA), Central Processing Unit (Central Processing Unit, CPU), controller, microcontroller, microprocessor. Understandably, the electronic device that implements the above processor function may also be other, which is not specifically limited in this embodiment of the present disclosure.
上述计算机存储介质/存储器可以是只读存储器(Read Only Memory,ROM)、可编程只读存储器(Programmable Read-Only Memory,PROM)、可擦除可编程只读存储器(Erasable Programmable Read-Only Memory,EPROM)、电可擦除可编程只读存储器(Electrically Erasable Programmable Read-Only Memory,EEPROM)、磁性随机存取存储器(Ferromagnetic Random Access Memory,FRAM)、快闪存储器(Flash Memory)、磁表面存储器、光盘、或只读光盘(Compact Disc Read-Only Memory,CD-ROM)等存储器;也可以是包括上述存储器之一或任意组合的各种电子设备,如移动电话、计算机、平板设备、个人数字助理等。The above-mentioned computer storage medium/memory can be read-only memory (Read Only Memory, ROM), programmable read-only memory (Programmable Read-Only Memory, PROM), erasable programmable read-only memory (Erasable Programmable Read-Only Memory, EPROM), Electrically Erasable Programmable Read-Only Memory (Electrically Erasable Programmable Read-Only Memory, EEPROM), Magnetic Random Access Memory (Ferromagnetic Random Access Memory, FRAM), Flash Memory (Flash Memory), Magnetic Surface Memory, CD-ROM, or CD-ROM (Compact Disc Read-Only Memory, CD-ROM) and other memories; it can also be various electronic devices including one or any combination of the above-mentioned memories, such as mobile phones, computers, tablet devices, personal digital assistants wait.
这里需要指出的是:以上存储介质和设备实施例的描述,与上述方法实施例的描述是类似的,具有同方法实施例相似的有益效果。对于本公开存储介质和设备实施例中未披露的技术细节,请参照本公开方法实施例的描述而理解。It should be pointed out here that: the descriptions of the above storage medium and device embodiments are similar to the descriptions of the above method embodiments, and have similar beneficial effects to those of the method embodiments. For technical details not disclosed in the storage medium and device embodiments of the present disclosure, please refer to the description of the method embodiments of the present disclosure for understanding.
应理解,说明书通篇中提到的“一个实施例”或“一实施例”意味着与实施例有关的特定特征、结构或特性包括在本公开的至少一个实施例中。因此,在整个说明书各处出现的“在一个实施例中”或“在一实施例中”未必一定指相同的实施例。此外,这些预设的特征、结构或特性可以任意适合的方式结合在一个或多个实施例中。应理解,在本公开的各种实施例中,上述各过程的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其功能和内在逻辑确定,而不应对本公开实施例的实施过程构成任何限定。上述本公开实施例序号仅仅为了描述,不代表实施例的优劣。It should be understood that reference throughout the specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic related to the embodiment is included in at least one embodiment of the present disclosure. Thus, appearances of "in one embodiment" or "in an embodiment" in various places throughout the specification are not necessarily referring to the same embodiment. Furthermore, the predetermined features, structures or characteristics may be combined in any suitable manner in one or more embodiments. It should be understood that in various embodiments of the present disclosure, the sequence numbers of the above-mentioned processes do not mean the order of execution, and the execution order of the processes should be determined by their functions and internal logic, rather than by the embodiments of the present disclosure. The implementation process constitutes any limitation. The serial numbers of the above-mentioned embodiments of the present disclosure are for description only, and do not represent the advantages and disadvantages of the embodiments.
需要说明的是,在本文中,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者装置不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者装置所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括该要素的过程、方法、物品或者装置中还存 在另外的相同要素。It should be noted that, in this document, the term "comprising", "comprising" or any other variation thereof is intended to cover a non-exclusive inclusion such that a process, method, article or apparatus comprising a set of elements includes not only those elements, It also includes other elements not expressly listed, or elements inherent in the process, method, article, or device. Without further limitations, an element defined by the phrase "comprising a ..." does not preclude the presence of additional identical elements in the process, method, article, or apparatus comprising that element.
在本公开所提供的几个实施例中,应该理解到,所揭露的设备和方法,可以通过其它的方式实现。以上所描述的设备实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,如:多个单元或组件可以结合,或可以集成到另一个系统,或一些特征可以忽略,或不执行。另外,所显示或讨论的各组成部分相互之间的耦合、或直接耦合、或通信连接可以是通过一些接口,设备或单元的间接耦合或通信连接,可以是电性的、机械的或其它形式的。上述作为分离部件说明的单元可以是、或也可以不是物理上分开的,作为单元显示的部件可以是、或也可以不是物理单元;既可以位于一个地方,也可以分布到多个网络单元上;可以根据实际的需要选择其中的部分或全部单元来实现本公开实施例方案的目的。In the several embodiments provided in the present disclosure, it should be understood that the disclosed devices and methods may be implemented in other ways. The device embodiments described above are only illustrative. For example, the division of the units is only a logical function division. In actual implementation, there may be other division methods, such as: multiple units or components can be combined, or May be integrated into another system, or some features may be ignored, or not implemented. In addition, the coupling, or direct coupling, or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be electrical, mechanical or other forms of. The units described above as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; they may be located in one place or distributed to multiple network units; Part or all of the units can be selected according to actual needs to achieve the purpose of the solutions of the embodiments of the present disclosure.
另外,在本公开各实施例中的各功能单元可以全部集成在一个处理单元中,也可以是各单元分别单独作为一个单元,也可以两个或两个以上单元集成在一个单元中;上述集成的单元既可以采用硬件的形式实现,也可以采用硬件加软件功能单元的形式实现。或者,本公开上述集成的单元如果以软件功能模块的形式实现并作为独立的产品销售或使用时,也可以存储在一个计算机可读取存储介质中。基于这样的理解,本公开实施例的技术方案本质上或者说对相关技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得设备自动测试线执行本公开各个实施例所述方法的全部或部分。而前述的存储介质包括:移动存储设备、ROM、磁碟或者光盘等各种可以存储程序代码的介质。本公开所提供的几个方法实施例中所揭露的方法,在不冲突的情况下可以任意组合,得到方法实施例。本公开所提供的几个方法或设备实施例中所揭露的特征,在不冲突的情况下可以任意组合,得到方法实施例或设备实施例。以上所述,仅为本公开的实施方式,但本公开的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本公开揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本公开的保护范围之内。因此,本公开的保护范围应以所述权利要求的保护范围为准。In addition, each functional unit in each embodiment of the present disclosure may be integrated into one processing unit, or each unit may be used as a single unit, or two or more units may be integrated into one unit; the above-mentioned integration The unit can be realized in the form of hardware or in the form of hardware plus software functional unit. Alternatively, if the above-mentioned integrated units of the present disclosure are realized in the form of software function modules and sold or used as independent products, they may also be stored in a computer-readable storage medium. Based on this understanding, the essence of the technical solutions of the embodiments of the present disclosure or the part that contributes to the related technologies can be embodied in the form of software products, the computer software products are stored in a storage medium, and include several instructions to make The equipment automatic test line executes all or part of the methods described in various embodiments of the present disclosure. The aforementioned storage medium includes various media capable of storing program codes such as removable storage devices, ROMs, magnetic disks or optical disks. The methods disclosed in the several method embodiments provided in the present disclosure can be combined arbitrarily to obtain the method embodiments if there is no conflict. The features disclosed in several method or device embodiments provided in the present disclosure may be combined arbitrarily without conflict to obtain method embodiments or device embodiments. The above is only the embodiment of the present disclosure, but the scope of protection of the present disclosure is not limited thereto. Anyone skilled in the art can easily think of changes or substitutions within the technical scope of the present disclosure, and should within the protection scope of the present disclosure. Therefore, the protection scope of the present disclosure should be determined by the protection scope of the claims.
工业实用性Industrial Applicability
本公开实施例公开的数据蒸馏的方法,首先确定至少一批待训练的第一蒸馏数据;然后确定至少两个预训练模型;再基于每一所述预训练模型中的第一统计信息,确定每一批所述第一蒸馏数据在相应预训练模型中的批归一化统计损失;再基于每一批所述第一蒸馏数据中每一数据的初始化标签,确定每一批所述第一蒸馏数据在每一所述预训练模型中的目标交叉熵损失;最后基于每一批所述第一蒸馏数据在每一所述预训练模型中的批归一化统计损失和所述目标交叉熵损失,对每一批所述第一蒸馏数据进行反向传播训练,得到目标蒸馏数据;如此,通过数据混合生成鲁棒的视觉信息,同时利用多个预训练模型中存储的原始数据的特征,使得训练出来的目标蒸馏数据能够匹配出多个模型的特征分布,从而目标蒸馏数据更通用,效果也更好。In the method of data distillation disclosed in the embodiments of the present disclosure, at least one batch of first distillation data to be trained is first determined; then at least two pre-training models are determined; and based on the first statistical information in each of the pre-training models, it is determined The batch normalized statistical loss of each batch of the first distillation data in the corresponding pre-training model; then, based on the initialization label of each data in each batch of the first distillation data, determine each batch of the first distillation data The target cross-entropy loss of the distillation data in each of the pre-training models; finally, based on the batch normalized statistical loss and the target cross-entropy of each batch of the first distillation data in each of the pre-training models Loss, performing backpropagation training on each batch of the first distillation data to obtain the target distillation data; in this way, robust visual information is generated through data mixing, while utilizing the characteristics of the original data stored in multiple pre-training models, The trained target distillation data can match the feature distribution of multiple models, so that the target distillation data is more versatile and the effect is better.

Claims (30)

  1. 一种数据蒸馏的方法,所述方法包括:A method of data distillation, the method comprising:
    确定至少一批待训练的第一蒸馏数据;每一批所述第一蒸馏数据中存在至少一个包括两种数据标签信息的混合数据;Determine at least one batch of first distillation data to be trained; in each batch of the first distillation data, there is at least one mixed data including two kinds of data label information;
    确定至少两个预训练模型;其中,每一所述预训练模型中存储原始数据的第一统计信息;Determine at least two pre-training models; wherein, each of the pre-training models stores the first statistical information of the original data;
    基于每一所述预训练模型中的第一统计信息,确定每一批所述第一蒸馏数据在相应预训练模型中的批归一化统计损失;determining a batch normalized statistical loss of each batch of the first distillation data in the corresponding pre-training model based on the first statistical information in each of the pre-training models;
    基于每一批所述第一蒸馏数据中每一数据的初始化标签,确定每一批所述第一蒸馏数据在每一所述预训练模型中的目标交叉熵损失;determining a target cross-entropy loss in each of the pre-trained models for each of the batches of the first distillation data based on an initialization label for each of the data in each of the batches of the first distillation data;
    基于每一批所述第一蒸馏数据在每一所述预训练模型中的批归一化统计损失和所述目标交叉熵损失,对每一批所述第一蒸馏数据进行反向传播训练,得到目标蒸馏数据。performing backpropagation training on each batch of the first distillation data based on the batch normalized statistical loss and the target cross-entropy loss in each batch of the first distillation data in each of the pre-training models, Obtain the target distillation data.
2. The method according to claim 1, wherein determining, based on the first statistical information in each pre-trained model, the batch normalization statistical loss of each batch of the first distillation data in the corresponding pre-trained model comprises:
    determining second statistical information of each batch of the first distillation data in each pre-trained model; and
    determining, for each pre-trained model, a batch normalization statistical loss between the first statistical information and the second statistical information.
3. The method according to claim 1 or 2, wherein determining the at least two pre-trained models comprises:
    randomly selecting at least two pre-trained models of different types from a pre-trained model library.
4. The method according to any one of claims 1 to 3, wherein performing back-propagation training on each batch of the first distillation data based on the batch normalization statistical loss and the target cross-entropy loss of that batch in each pre-trained model, to obtain the target distillation data, comprises:
    determining, based on the batch normalization statistical loss and the target cross-entropy loss of each pre-trained model, a first loss corresponding to that pre-trained model;
    averaging the first losses corresponding to the respective pre-trained models to obtain a target loss of each batch of the first distillation data over the at least two pre-trained models; and
    performing back-propagation training on each batch of the first distillation data based on the target loss, to obtain the target distillation data.
5. The method according to any one of claims 1 to 4, wherein determining the at least one batch of first distillation data to be trained comprises:
    initializing at least one batch of second distillation data based on distribution information of the original data; and
    mixing, in each training iteration, every two pieces of image data in each batch of the second distillation data to obtain each batch of the first distillation data.
6. The method according to claim 5, wherein mixing every two pieces of image data in each batch of the second distillation data to obtain each batch of the first distillation data comprises:
    randomly selecting at least one pair of a first image and a second image from each batch of the second distillation data;
    reducing the size of each first image by a specific ratio; and
    randomly overlaying the reduced first image onto the corresponding second image to obtain each batch of the first distillation data.
7. The method according to claim 6, wherein overlaying the reduced first image onto the corresponding second image to obtain each batch of the first distillation data comprises:
    randomly generating, according to the specific ratio, a mixing region to be covered in the corresponding second image;
    randomly generating a binary mask based on a bounding box of the mixing region; and
    superimposing, through the binary mask, each pixel value of the reduced first image and the corresponding pixel value in the second image, to obtain each batch of the first distillation data.
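For illustration only, the following is a minimal sketch of this kind of shrink-and-paste mixing, assuming PyTorch tensors of shape (channels, height, width) and an arbitrary 0.5 scale ratio; the function name and helper logic are assumptions rather than the claimed implementation.

```python
# Illustrative sketch of claims 5-7 style mixing; names and ratio are assumptions.
import torch
import torch.nn.functional as F

def mix_pair(first_img, second_img, ratio=0.5):
    """Shrink `first_img` by `ratio` and paste it at a random position of `second_img`
    through a binary mask; returns the composite image and the mask."""
    _, h, w = second_img.shape
    new_h, new_w = int(h * ratio), int(w * ratio)

    # Reduce the size of the first image by the specific ratio.
    small = F.interpolate(first_img.unsqueeze(0), size=(new_h, new_w),
                          mode="bilinear", align_corners=False).squeeze(0)

    # Randomly generate the mixing region (its bounding box) in the second image.
    top = torch.randint(0, h - new_h + 1, (1,)).item()
    left = torch.randint(0, w - new_w + 1, (1,)).item()

    # Binary mask derived from the bounding box of the mixing region.
    mask = torch.zeros(1, h, w)
    mask[:, top:top + new_h, left:left + new_w] = 1.0

    # Superimpose the shrunk first image onto the second image through the mask.
    padded = torch.zeros_like(second_img)
    padded[:, top:top + new_h, left:left + new_w] = small
    return mask * padded + (1.0 - mask) * second_img, mask

# Example: mix two randomly selected images from one batch of second distillation data.
batch = torch.randn(8, 3, 224, 224)
i, j = torch.randperm(batch.size(0))[:2].tolist()
mixed, mask = mix_pair(batch[i], batch[j], ratio=0.5)
```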
8. The method according to any one of claims 1 to 7, wherein each batch of the first distillation data includes a composite image obtained by mixing a first image and a second image, and determining, based on the initialization label of each piece of data in each batch of the first distillation data, the target cross-entropy loss of each batch of the first distillation data in each pre-trained model comprises:
    determining a mixed cross-entropy loss of the composite image based on an initialization label of the first image and an initialization label of the second image;
    determining a cumulative cross-entropy loss of the images in each batch of the first distillation data other than the first image and the second image, based on initialization labels of those other images; and
    determining, based on the mixed cross-entropy loss and the cumulative cross-entropy loss, the target cross-entropy loss of each batch of the first distillation data after passing through each pre-trained model.
9. The method according to claim 8, wherein determining the mixed cross-entropy loss of the composite image based on the initialization label of the first image and the initialization label of the second image comprises:
    determining, based on the initialization label of the first image, a first cross-entropy loss of each batch of the first distillation data after passing through each pre-trained model;
    determining, based on the initialization label of the second image, a second cross-entropy loss of each batch of the first distillation data after passing through each pre-trained model; and
    linearly summing the first cross-entropy loss and the second cross-entropy loss based on a mixing ratio between the first image and the second image, to obtain the mixed cross-entropy loss of the composite image.
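As a rough illustration of the loss composition in claims 8 and 9 (not the claimed implementation), the sketch below linearly combines two cross-entropy terms by an assumed mixing ratio `lam` and adds the plain cross-entropy of the unmixed images; all tensor names are assumptions.

```python
# Illustrative sketch of the mixed and cumulative cross-entropy terms of claims 8-9.
import torch
import torch.nn.functional as F

def target_ce_loss(logits, labels_a, labels_b, mixed_idx, lam):
    """logits: outputs of one pre-trained model for a batch of first distillation data.
    labels_a / labels_b: initialization labels of the pasted (first) and host (second)
    images; for unmixed images only labels_a is meaningful.
    mixed_idx: boolean mask marking composite images; lam: assumed mixing ratio."""
    if mixed_idx.any():
        # Mixed cross-entropy: linear sum of the two terms weighted by the ratio.
        mixed = (lam * F.cross_entropy(logits[mixed_idx], labels_a[mixed_idx]) +
                 (1.0 - lam) * F.cross_entropy(logits[mixed_idx], labels_b[mixed_idx]))
    else:
        mixed = logits.new_zeros(())
    if (~mixed_idx).any():
        # Cumulative cross-entropy of the remaining, unmixed images.
        cumulative = F.cross_entropy(logits[~mixed_idx], labels_a[~mixed_idx])
    else:
        cumulative = logits.new_zeros(())
    return mixed + cumulative

# Tiny usage example with random logits and labels.
logits = torch.randn(8, 1000)
labels_a = torch.randint(0, 1000, (8,))
labels_b = torch.randint(0, 1000, (8,))
mixed_idx = torch.zeros(8, dtype=torch.bool)
mixed_idx[0] = True
loss = target_ce_loss(logits, labels_a, labels_b, mixed_idx, lam=0.25)
```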
10. The method according to any one of claims 1 to 9, wherein the first statistical information includes a mean and a variance, each pre-trained model includes at least two building blocks, each building block includes a convolutional layer and a batch normalization layer, and the method further comprises:
    extracting the mean and the variance of the original data from the batch normalization layer of each building block, wherein the mean and the variance are obtained by computing statistics over features of the original data extracted by the convolutional layer; and
    determining the first statistical information in each pre-trained model based on the means and variances of the at least two building blocks in that pre-trained model.
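The first statistical information described here corresponds to what standard batch normalization layers already hold as running statistics; a minimal sketch of reading them out of a PyTorch model (the choice of resnet18 is an assumption) could look as follows.

```python
# Illustrative sketch: read the stored first statistical information (running mean
# and variance) out of every batch normalization layer of a pre-trained model.
import torch
from torchvision import models

def first_statistics(model):
    """Return {layer_name: (running_mean, running_var)} for each BN layer."""
    stats = {}
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.BatchNorm2d):
            stats[name] = (module.running_mean.clone(), module.running_var.clone())
    return stats

teacher = models.resnet18(pretrained=True).eval()
bn_stats = first_statistics(teacher)  # one (mean, variance) pair per building block
```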
11. The method according to claim 5, wherein the distribution information is Gaussian distribution information, and initializing the at least one batch of second distillation data based on the distribution information of the original data comprises:
    acquiring the Gaussian distribution information of the original data; and
    initializing at least N initial pixel-value matrices based on data randomly sampled from the Gaussian distribution information, to obtain each batch of the second distillation data, where N is an integer greater than or equal to 2.
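A minimal sketch of such Gaussian initialization, assuming PyTorch and per-channel mean and standard deviation values (the ImageNet statistics below are an illustrative assumption), follows.

```python
# Illustrative sketch of claim 11: initialize N pixel-value matrices from a Gaussian.
import torch

def init_second_distillation_batch(n, channels=3, height=224, width=224,
                                   mean=(0.485, 0.456, 0.406),
                                   std=(0.229, 0.224, 0.225)):
    """Sample N images whose pixels follow assumed per-channel Gaussian statistics
    of the original data; N >= 2 so that image pairs can later be mixed."""
    assert n >= 2
    mean = torch.tensor(mean).view(1, channels, 1, 1)
    std = torch.tensor(std).view(1, channels, 1, 1)
    return torch.randn(n, channels, height, width) * std + mean

second_batch = init_second_distillation_batch(32)  # one batch of second distillation data
```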
12. The method according to claim 4, wherein performing back-propagation training on each batch of the first distillation data based on the target loss, to obtain the target distillation data, comprises:
    determining, in each training iteration of the back-propagation process, a gradient of the target loss with respect to the first distillation data;
    updating each piece of data in the first distillation data based on the gradient; and
    obtaining the target distillation data when the batch normalization statistical loss and the target cross-entropy loss of the updated first distillation data in each pre-trained model converge to stable values.
13. The method according to any one of claims 1 to 12, further comprising:
    performing model compression on each pre-trained model when the quantity of at least one batch of the target distillation data reaches a specific threshold.
14. An apparatus for data distillation, the apparatus comprising a first determination module, a second determination module, a third determination module, a fourth determination module, and a training module, wherein:
    the first determination module is configured to determine at least one batch of first distillation data to be trained, wherein each batch of the first distillation data contains at least one piece of mixed data that includes label information of two kinds of data;
    the second determination module is configured to determine at least two pre-trained models, wherein each pre-trained model stores first statistical information of original data;
    the third determination module is configured to determine, based on the first statistical information in each pre-trained model, a batch normalization statistical loss of each batch of the first distillation data in the corresponding pre-trained model;
    the fourth determination module is configured to determine, based on an initialization label of each piece of data in each batch of the first distillation data, a target cross-entropy loss of each batch of the first distillation data in each pre-trained model; and
    the training module is configured to perform back-propagation training on each batch of the first distillation data based on the batch normalization statistical loss and the target cross-entropy loss of that batch in each pre-trained model, to obtain target distillation data.
15. The apparatus according to claim 14, wherein the third determination module includes:
    a first determination sub-module, configured to determine second statistical information of each batch of the first distillation data in each pre-trained model; and
    a second determination sub-module, configured to determine, for each pre-trained model, a batch normalization statistical loss between the first statistical information and the second statistical information.
16. The apparatus according to claim 14 or 15, wherein the second determination module is further configured to randomly select at least two pre-trained models of different types from a pre-trained model library.
17. The apparatus according to any one of claims 14 to 16, wherein the training module includes:
    a third determination sub-module, configured to determine, based on the batch normalization statistical loss and the target cross-entropy loss of each pre-trained model, a first loss corresponding to that pre-trained model;
    a processing sub-module, configured to average the first losses corresponding to the respective pre-trained models to obtain a target loss of each batch of the first distillation data over the at least two pre-trained models; and
    a training sub-module, configured to perform back-propagation training on each batch of the first distillation data based on the target loss, to obtain the target distillation data.
18. The apparatus according to any one of claims 14 to 17, wherein the first determination module includes:
    an initialization sub-module, configured to initialize at least one batch of second distillation data based on distribution information of the original data; and
    a mixing sub-module, configured to mix, in each training iteration, every two pieces of image data in each batch of the second distillation data to obtain each batch of the first distillation data.
19. The apparatus according to claim 18, wherein the mixing sub-module includes:
    a selection unit, configured to randomly select at least one pair of a first image and a second image from each batch of the second distillation data;
    a scaling unit, configured to reduce the size of each first image by a specific ratio; and
    an overlay unit, configured to randomly overlay the reduced first image onto the corresponding second image to obtain each batch of the first distillation data.
20. The apparatus according to claim 19, wherein the overlay unit includes:
    a first generation sub-unit, configured to randomly generate, according to the specific ratio, a mixing region to be covered in the corresponding second image;
    a second generation sub-unit, configured to randomly generate a binary mask based on a bounding box of the mixing region; and
    a superimposition sub-unit, configured to superimpose, through the binary mask, each pixel value of the reduced first image and the corresponding pixel value in the second image, to obtain each batch of the first distillation data.
21. The apparatus according to any one of claims 14 to 20, wherein each batch of the first distillation data includes a composite image obtained by mixing a first image and a second image, and the fourth determination module includes:
    a fourth determination sub-module, configured to determine a mixed cross-entropy loss of the composite image based on an initialization label of the first image and an initialization label of the second image;
    a fifth determination sub-module, configured to determine a cumulative cross-entropy loss of the images in each batch of the first distillation data other than the first image and the second image, based on initialization labels of those other images; and
    a sixth determination sub-module, configured to determine, based on the mixed cross-entropy loss and the cumulative cross-entropy loss, the target cross-entropy loss of each batch of the first distillation data after passing through each pre-trained model.
22. The apparatus according to claim 21, wherein the fourth determination sub-module includes:
    a first determination unit, configured to determine, based on the initialization label of the first image, a first cross-entropy loss of each batch of the first distillation data after passing through each pre-trained model;
    a second determination unit, configured to determine, based on the initialization label of the second image, a second cross-entropy loss of each batch of the first distillation data after passing through each pre-trained model; and
    a summation unit, configured to linearly sum the first cross-entropy loss and the second cross-entropy loss based on a mixing ratio between the first image and the second image, to obtain the mixed cross-entropy loss of the composite image.
23. The apparatus according to any one of claims 14 to 22, wherein the first statistical information includes a mean and a variance, each pre-trained model includes at least two building blocks, each building block includes a convolutional layer and a batch normalization layer, and the third determination module further includes:
    an extraction sub-module, configured to extract the mean and the variance of the original data from the batch normalization layer of each building block, wherein the mean and the variance are obtained by computing statistics over features of the original data extracted by the convolutional layer; and
    an accumulation sub-module, configured to determine the first statistical information in each pre-trained model based on the means and variances of the at least two building blocks in that pre-trained model.
24. The apparatus according to claim 19, wherein the distribution information is Gaussian distribution information, and the initialization sub-module includes:
    an acquisition unit, configured to acquire the Gaussian distribution information of the original data; and
    an initialization unit, configured to initialize at least N initial pixel-value matrices based on data randomly sampled from the Gaussian distribution information, to obtain each batch of the second distillation data, where N is an integer greater than or equal to 2.
25. The apparatus according to claim 18, wherein the training sub-module includes:
    a third determination unit, configured to determine, in each training iteration of the back-propagation process, a gradient of the target loss with respect to the first distillation data;
    an updating unit, configured to update each piece of data in the first distillation data based on the gradient; and
    a training unit, configured to obtain the target distillation data when the batch normalization statistical loss and the target cross-entropy loss of the updated first distillation data in each pre-trained model converge to stable values.
26. The apparatus according to any one of claims 14 to 25, further comprising a compression module configured to perform model compression on each pre-trained model when the quantity of at least one batch of the target distillation data reaches a specific threshold.
27. An electronic device, comprising a memory and a processor, wherein the memory stores a computer program executable on the processor, and the processor, when executing the program, implements the steps of the method according to any one of claims 1 to 13.
28. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 13.
29. A computer program, comprising computer-readable code, wherein when the computer-readable code runs in an electronic device, a processor in the electronic device performs the method according to any one of claims 1 to 13.
30. A computer program product, comprising one or more instructions adapted to be loaded by a processor to perform the steps of the method according to any one of claims 1 to 13.
PCT/CN2022/071121 2021-08-27 2022-01-10 Data distillation method and apparatus, device, storage medium, computer program, and product WO2023024406A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110994122.7A CN113762368A (en) 2021-08-27 2021-08-27 Method, device, electronic equipment and storage medium for data distillation
CN202110994122.7 2021-08-27

Publications (1)

Publication Number Publication Date
WO2023024406A1

Family

ID=78791493

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/071121 WO2023024406A1 (en) 2021-08-27 2022-01-10 Data distillation method and apparatus, device, storage medium, computer program, and product

Country Status (2)

Country Link
CN (1) CN113762368A (en)
WO (1) WO2023024406A1 (en)


Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113762368A (en) * 2021-08-27 2021-12-07 北京市商汤科技开发有限公司 Method, device, electronic equipment and storage medium for data distillation
CN116630724B (en) * 2023-07-24 2023-10-10 美智纵横科技有限责任公司 Data model generation method, image processing method, device and chip

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111008693A (en) * 2019-11-29 2020-04-14 深动科技(北京)有限公司 Network model construction method, system and medium based on data compression
US20200272940A1 (en) * 2019-02-25 2020-08-27 Salesforce.Com, Inc. Data privacy protected machine learning systems
CN111860572A (en) * 2020-06-04 2020-10-30 北京百度网讯科技有限公司 Data set distillation method, device, electronic equipment and storage medium
CN112446476A (en) * 2019-09-04 2021-03-05 华为技术有限公司 Neural network model compression method, device, storage medium and chip
CN112766463A (en) * 2021-01-25 2021-05-07 上海有个机器人有限公司 Method for optimizing neural network model based on knowledge distillation technology
CN113762368A (en) * 2021-08-27 2021-12-07 北京市商汤科技开发有限公司 Method, device, electronic equipment and storage medium for data distillation

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111738436B (en) * 2020-06-28 2023-07-18 电子科技大学中山学院 Model distillation method and device, electronic equipment and storage medium
CN111950638B (en) * 2020-08-14 2024-02-06 厦门美图之家科技有限公司 Image classification method and device based on model distillation and electronic equipment


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116486285A (en) * 2023-03-15 2023-07-25 中国矿业大学 Aerial image target detection method based on class mask distillation
CN116486285B (en) * 2023-03-15 2024-03-19 中国矿业大学 Aerial image target detection method based on class mask distillation
CN117576518A (en) * 2024-01-15 2024-02-20 第六镜科技(成都)有限公司 Image distillation method, apparatus, electronic device, and computer-readable storage medium
CN117576518B (en) * 2024-01-15 2024-04-23 第六镜科技(成都)有限公司 Image distillation method, apparatus, electronic device, and computer-readable storage medium

Also Published As

Publication number Publication date
CN113762368A (en) 2021-12-07


Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE