WO2023024406A1 - Data distillation method and apparatus, device, storage medium, computer program, and product - Google Patents

Data distillation method and apparatus, device, storage medium, computer program, and product

Info

Publication number
WO2023024406A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
batch
distillation
training
image
Prior art date
Application number
PCT/CN2022/071121
Other languages
French (fr)
Chinese (zh)
Inventor
李雨杭
龚睿昊
沈明珠
余锋伟
路少卿
Original Assignee
上海商汤智能科技有限公司
Priority date
Filing date
Publication date
Application filed by 上海商汤智能科技有限公司
Publication of WO2023024406A1 publication Critical patent/WO2023024406A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/082Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections

Definitions

  • The present disclosure relates to the field of computer vision, and in particular to, but not limited to, a data distillation method, apparatus, device, storage medium, computer program, and product.
  • the compression of neural networks usually requires the original training data, because the compressed model generally needs to be trained to restore the previous performance.
  • the original data is private in some cases, that is, the original data faces the risk of being unobtainable.
  • Embodiments of the present disclosure provide a data distillation method, device, device, storage medium, computer program, and product.
  • an embodiment of the present disclosure provides a data distillation method, including: determining at least one batch of first distillation data to be trained, where each batch of the first distillation data contains at least one mixed data item carrying label information of two kinds of data; determining at least two pre-training models, where each of the pre-training models stores first statistical information of the original data; determining, based on the first statistical information of each of the pre-training models, the batch normalized statistical loss of each batch of the first distillation data in the corresponding pre-training model; determining, based on the initialization label of each data item in each batch of the first distillation data, the target cross-entropy loss of each batch of the first distillation data in each of the pre-training models; and performing backpropagation training on each batch of the first distillation data based on the batch normalized statistical loss and the target cross-entropy loss of that batch in each of the pre-training models, to obtain target distillation data.
  • an embodiment of the present disclosure provides a data distillation device, where the device includes a first determination module, a second determination module, a third determination module, a fourth determination module, and a training module, wherein:
  • the first determination module is configured to determine at least one batch of first distillation data to be trained, where each batch of the first distillation data contains at least one mixed data item carrying label information of two kinds of data; the second determination module is configured to determine at least two pre-training models, where each of the pre-training models stores first statistical information of the original data; the third determination module is configured to determine, based on the first statistical information of each of the pre-training models, the batch normalized statistical loss of each batch of the first distillation data in the corresponding pre-training model; the fourth determination module is configured to determine, based on the initialization label of each data item in each batch of the first distillation data, the target cross-entropy loss of each batch of the first distillation data in each of the pre-training models; the training module is configured to perform backpropagation training on each batch of the first distillation data based on the batch normalized statistical loss and the target cross-entropy loss of that batch in each of the pre-training models, to obtain the target distillation data.
  • an embodiment of the present disclosure provides an electronic device, including a memory and a processor, where the memory stores a computer program executable on the processor, and the processor implements the steps of the above data distillation method when executing the program.
  • an embodiment of the present disclosure provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the steps in the above data distillation method are implemented.
  • the present disclosure further provides a computer program, including computer-readable code; when the computer-readable code runs in an electronic device, a processor in the electronic device executes the steps of the data distillation method in the first aspect above.
  • the present disclosure further provides a computer program product, the computer program product including one or more instructions, and the one or more instructions are suitable for being loaded by a processor and executing the steps in the first aspect above.
  • In the embodiments of the present disclosure, at least one batch of first distillation data to be trained is determined; then at least two pre-training models are determined; the batch normalized statistical loss of each batch of the first distillation data in the corresponding pre-training model is determined based on the first statistical information in each of the pre-training models; the target cross-entropy loss of each batch of the first distillation data in each of the pre-training models is determined based on the initialization label of each data item in each batch of the first distillation data; finally, backpropagation training is performed on each batch of the first distillation data based on the batch normalized statistical loss and the target cross-entropy loss of that batch in each of the pre-training models, to obtain the target distillation data. In this way, robust visual information is generated through data mixing, and at the same time the features of the original data stored in multiple pre-training models are used, so that the trained target distillation data can match the feature distributions of multiple models; the target distillation data is therefore more versatile and more effective.
  • FIG. 1 is a schematic flowchart of a data distillation method provided by an embodiment of the present disclosure
  • FIG. 2 is a schematic flow chart of a data distillation method provided by an embodiment of the present disclosure
  • FIG. 3 is a schematic flowchart of a data distillation method provided by an embodiment of the present disclosure
  • FIG. 4A is a schematic flow diagram of a data distillation method provided by an embodiment of the present disclosure.
  • FIG. 4B is a schematic diagram of a data mixing method provided by an embodiment of the present disclosure.
  • FIG. 5A is a logic flow diagram of a data distillation method provided by an embodiment of the present disclosure.
  • FIG. 5B is an algorithm framework combining data mixing and feature mixing provided by an embodiment of the present disclosure.
  • FIG. 5C is a schematic diagram of a training process combining data mixing and feature mixing provided by an embodiment of the present disclosure
  • FIG. 6 is a schematic diagram of the composition and structure of a data distillation device provided by an embodiment of the present disclosure.
  • FIG. 7 is a schematic diagram of a hardware entity of an electronic device provided by an embodiment of the present disclosure.
  • The terms "first/second/third" involved in the embodiments of the present disclosure are only used to distinguish similar objects and do not represent a specific ordering of objects. It can be understood that, where permitted, "first/second/third" may be interchanged in a specific order or sequence, so that the embodiments of the present disclosure described herein can be implemented in an order other than that illustrated or described herein.
  • Artificial Intelligence is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
  • artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can respond in a similar way to human intelligence.
  • Artificial intelligence is to study the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
  • Artificial intelligence technology is a comprehensive subject that involves a wide range of fields, including both hardware-level technology and software-level technology.
  • Artificial intelligence basic technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics.
  • Artificial intelligence software technology mainly includes several major directions such as computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
  • Machine Learning is a multi-field interdisciplinary subject, involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in studying how computers simulate or implement human learning behaviors to acquire new knowledge or skills, and how they reorganize existing knowledge structures to continuously improve their performance.
  • Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and its application pervades all fields of artificial intelligence.
  • Machine learning and deep learning usually include techniques such as artificial neural network, belief network, reinforcement learning, transfer learning, inductive learning, and teaching learning.
  • Dataset distillation (Dataset Distillation) is different from knowledge distillation, which transfers knowledge from a complex network to a simpler model. Dataset distillation keeps the model unchanged and compresses the knowledge of a large original dataset (usually containing thousands or millions of images) into a small amount of synthetic data; the performance of a model trained on the synthetic data is almost the same as that of a model trained on the original dataset.
  • the distillation data used in the data-free compression method provided in the related art is trained from a pre-training model through distillation.
  • This pre-training model stores some characteristics of the original training data, which reflect the distribution properties of the original training data. Therefore, when an arbitrarily initialized picture is input to the pre-training model, it can be observed whether the features of the distilled picture output by the model are consistent with those of the original training pictures. By minimizing the difference between these two sets of features (a process that can be called data distillation), a distilled picture that is very close to the original training pictures can be obtained.
  • the distilled images can continue to be used to train the model to improve the performance of the compressed model.
  • the data distillation process can effectively get rid of the dependence on the original training data in the model compression process.
  • the generated distillation data is not universal.
  • the original training data can be used to compress any pre-trained model, however, the distilled data generated by pre-trained model A is difficult to be used to compress pre-trained model B, and vice versa.
  • the generated distillation data is inaccurate. Generating distillation data from the pre-trained model is equivalent to solving the problem of the inverse function.
  • the neural network is highly irreversible and highly non-convex, resulting in inaccurate distillation data generated, which is quite different from the original training pictures.
  • An embodiment of the present disclosure provides a method for data distillation, which is applied to an electronic device.
  • the electronic devices include, but are not limited to, mobile phones, laptop computers, tablet computers and PDAs, multimedia devices, streaming media devices, mobile Internet devices, wearable devices or other types of devices.
  • the functions realized by the method can be realized by calling the program codes by the processor in the electronic device, and of course the program codes can be stored in the computer storage medium.
  • the electronic device at least includes a processor and a storage medium.
  • the processor can be used to process the process of generating distillation data
  • the memory can be used to store intermediate data required in the process of generating distillation data and generated target distillation data.
  • Fig. 1 is a schematic flow chart of a data distillation method provided by an embodiment of the present disclosure. As shown in Fig. 1, the method at least includes the following steps:
  • Step S110 determining at least one batch of first distillation data to be trained
  • the at least one batch of first distillation data to be trained refers to one or more batches of data images
  • the batch size is a hyperparameter used to define the number of samples processed by the pre-training model in each training pass.
  • In each batch of the first distillation data, there is at least one mixed data item containing label information of two kinds of data.
  • the mixed data is a composite image obtained by mixing two images, and the composite image carries all label information of the original two images.
  • the first distillation data to be trained may be in the form of a matrix, and the pixel values in the matrix may be randomly initialized through the distribution information of the original data. The specific values of the rows and columns of the matrix can also be determined according to actual needs. In this way, the first distillation data to be trained can be obtained quickly and conveniently, thereby providing a good basis for subsequent processing.
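  • As an illustration only, a minimal sketch of this initialization in PyTorch, assuming ImageNet-style channel statistics, a 224×224 resolution, 1000 classes, and the helper name init_distillation_batch (all example values and names, not requirements of the embodiment):

```python
import torch

def init_distillation_batch(batch_size=32, num_classes=1000,
                            mean=(0.485, 0.456, 0.406),
                            std=(0.229, 0.224, 0.225),
                            height=224, width=224):
    """Initialize one batch of distillation data (pixel-value matrices) by sampling
    from a Gaussian that mimics the original data's channel statistics, together
    with randomly assigned initialization labels (cf. the 1-1000 labels mentioned below)."""
    mean = torch.tensor(mean).view(1, 3, 1, 1)
    std = torch.tensor(std).view(1, 3, 1, 1)
    # N(mean, std^2) per channel; requires_grad so the pixels can be optimized later
    data = (torch.randn(batch_size, 3, height, width) * std + mean).requires_grad_(True)
    # one random label per image, fixed during the optimization
    labels = torch.randint(0, num_classes, (batch_size,))
    return data, labels

data, labels = init_distillation_batch()
```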
  • Step S120 determining at least two pre-training models
  • A pre-trained model means that, at the beginning of training, the model can be loaded with model parameters from a model having the same network structure, for the purpose of feature transfer, so that the resulting model is better.
  • the pre-training model includes a convolutional neural network (Convolutional Neural Network, CNN), a recursive neural network (Recursive Neural Network, RNN), a recurrent neural network (Recurrent Neural Network, RNN), a deep neural network (Deep Neural Network, DNN), etc., which is not limited in this embodiment of the present disclosure.
  • Each of the pre-trained models stores first statistical information of the original data, such as statistical values including the mean, standard deviation, and variance.
  • the mean is the expected value of the normal distribution curve, which determines the position of the original data distribution
  • the variance is the square of the standard deviation
  • the standard deviation determines the magnitude of the original data distribution.
  • For a pre-training model, the distribution features of its training data are encoded in its unique feature space. Therefore, in the process of training the pre-training model with the original data, the first statistical information of the original data is stored in the batch normalization layer of each component block of the pre-training model.
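  • For illustration, a small sketch of how this stored first statistical information (the running mean and variance of each batch normalization layer) can be read out of a pre-trained model; the use of a torchvision ResNet50 here is only an example, not part of the embodiment:

```python
import torch
import torchvision

def collect_bn_statistics(model):
    """Collect the first statistical information (running mean and variance of the
    original training data) stored in every batch normalization layer of a model."""
    stats = []
    for module in model.modules():
        if isinstance(module, torch.nn.BatchNorm2d):
            stats.append((module.running_mean.detach().clone(),
                          module.running_var.detach().clone()))
    return stats

# example: a pre-trained ImageNet classifier (requires torchvision >= 0.13)
model = torchvision.models.resnet50(weights="IMAGENET1K_V1")
bn_stats = collect_bn_statistics(model)
print(len(bn_stats), "BN layers with stored (mean, variance) pairs")
```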
  • Step S130 based on the first statistical information in each pre-training model, determine the batch normalized statistical loss of each batch of the first distillation data in the corresponding pre-training model;
  • the difference between the first distillation data and the original data in the corresponding pre-training model is determined.
  • In this way, the statistics of the first distillation data can be matched against the statistical features of multiple pre-training models, and the general feature spaces of the pre-training models can be combined, so that the target distillation data obtained by training is more general than single-model distillation data.
  • the mean and variance calculated by the batch normalization layer in the pre-training model are based on all the first distillation data included in the current batch.
  • Step S140 based on the initialization label of each data in each batch of the first distillation data, determine the target cross-entropy loss of each batch of the first distillation data in each of the pre-training models;
  • the initialization label is randomly assigned when the first distillation data is initialized using the Gaussian distribution of the original data; for example, an arbitrary value from 1 to 1000 is set as the initialization label for each data item in a batch of first distillation data.
  • the initialization tags of different data may be the same or different.
  • a target cross-entropy loss in each of the pre-trained models may be determined for a batch of the first distilled data.
  • Step S150 based on the batch normalized statistical loss and the target cross-entropy loss of each batch of the first distillation data in each of the pre-training models, performing backpropagation training on each batch of the first distillation data to obtain target distillation data.
  • the target distillation data is a distillation picture that is very close to the original training picture.
  • the target distillation data which encodes the discriminative features of each category.
  • the gradient of the target loss with respect to the first distillation data is calculated, and a gradient-descent update such as stochastic gradient descent (Stochastic Gradient Descent, SGD) is performed, so that the first distillation data is gradually optimized; after about 20,000 training iterations, the target distillation data is finally obtained.
  • In the embodiments of the present disclosure, at least one batch of first distillation data to be trained is determined; then at least two pre-training models are determined; the batch normalized statistical loss of each batch of the first distillation data in the corresponding pre-training model is determined based on the first statistical information in each of the pre-training models; the target cross-entropy loss of each batch of the first distillation data in each of the pre-training models is determined based on the initialization label of each data item in each batch of the first distillation data; finally, backpropagation training is performed on each batch of the first distillation data based on the batch normalized statistical loss and the target cross-entropy loss of that batch in each of the pre-training models, to obtain the target distillation data. In this way, robust visual information is generated through data mixing, and at the same time the features of the original data stored in multiple pre-training models are used, so that the trained target distillation data can match the feature distributions of multiple models; the target distillation data is therefore more versatile and more effective.
  • Fig. 2 is a schematic flow chart of a data distillation method provided by an embodiment of the present disclosure. As shown in Fig. 2, the method at least includes the following steps:
  • Step S210 determining at least one batch of first distillation data to be trained
  • Step S220 randomly selecting at least two different types of pre-training models from the pre-training model library
  • the first statistical information of the original data is stored in each pre-training model.
  • Different types of pre-trained models have different network structures.
  • the feature mixing method in the embodiments of the present disclosure requires feature information from multiple models, but too many models will slow down the optimization. Therefore, for each batch of first distillation data to be optimized, only a small subset of the models is sampled for feature mixing. Experiments show that three models provide the best balance between speed and effect.
  • Step S230 determining the second statistical information of each batch of the first distillation data in each of the pre-training models
  • For a pre-training model, the distribution features of its training data are encoded in its unique feature space, namely the batch normalization (Batch Normalization, BN) layers. Therefore, the first distillation data can reasonably simulate the feature distribution of the original images in the pre-trained network.
  • the activation mean and variance are extracted from the batch normalization layer of each pre-training model as the second statistical information of the first distillation data in the corresponding pre-training model.
  • Step S240 determining a batch normalized statistical loss between the first statistical information and the second statistical information for each of the pre-trained models
  • In this way, the batch normalized statistical loss for each pre-training model is determined by matching the second statistical information of the first distillation data in each pre-training model with the first statistical information stored in the corresponding pre-training model.
  • the first statistical information includes mean and variance
  • each of the pre-training models includes at least two building blocks
  • each of the building blocks includes a convolutional layer and a batch normalization layer, and the first statistical information in each of the pre-training models can be determined through the following steps: first, the mean and variance of the raw data are extracted from the batch normalization layer of each of the constituent blocks; the mean and variance are obtained statistically from the features of the original data extracted by the convolutional layer; then, based on the means and variances of the at least two constituent blocks in each of the pre-training models, the first statistical information in each of the pre-training models is determined.
  • In this way, the first statistical information of the corresponding pre-training model is determined, which ensures that the target distillation data obtained by training can reasonably simulate the activation distribution of the pre-trained network on the raw data.
  • the batch normalization statistical loss L_BN for each of the pre-trained models can be determined by the following formula (1):

    L_BN = Σ_i ( ‖μ_i(X) − μ_i^BN‖₂ + ‖σ²_i(X) − σ²_i^BN‖₂ )    (1)

  • where μ_i(X) and σ²_i(X) are respectively the mean and variance of the first distillation data X in the i-th component block, and μ_i^BN and σ²_i^BN are respectively the running mean and variance stored in the BN layer of the corresponding pre-trained model; ‖·‖₂ denotes the L2 norm, that is, the square root of the sum of the squares of each component of the difference between the two vectors.
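  • One common way to compute such a loss in practice is with forward hooks that read the batch statistics of the features entering every BN layer and compare them with the stored running statistics; the sketch below follows the form of formula (1) but is an illustrative assumption, not the patent's reference code:

```python
import torch
import torch.nn as nn

class BNStatsLoss:
    """Accumulate, per BN layer, the L2 distance between the batch statistics of the
    current (distillation) input and the running statistics stored in the layer."""
    def __init__(self, model):
        self.diffs = []
        self.hooks = [m.register_forward_hook(self._hook)
                      for m in model.modules() if isinstance(m, nn.BatchNorm2d)]

    def _hook(self, module, inputs, output):
        x = inputs[0]                                # features entering the BN layer
        mean = x.mean(dim=(0, 2, 3))                 # per-channel batch mean
        var = x.var(dim=(0, 2, 3), unbiased=False)   # per-channel batch variance
        self.diffs.append(torch.norm(mean - module.running_mean, 2) +
                          torch.norm(var - module.running_var, 2))

    def compute(self):
        loss = torch.stack(self.diffs).sum()         # sum over all BN layers, as in (1)
        self.diffs = []                              # reset for the next forward pass
        return loss

    def remove(self):
        for h in self.hooks:
            h.remove()
```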
  • Step S250 based on the initialization label of each data in each batch of the first distillation data, determine the target cross-entropy loss of each batch of the first distillation data in each of the pre-training models;
  • the target cross-entropy loss is the total loss output by the last layer of the pre-trained model.
  • the target cross-entropy loss L_CE of a batch of the first distillation data in each of the pre-training models can be determined by the following formula (2):

    L_CE = −(1/m) Σ_{i=1}^{m} y_i · log(ŷ_i)    (2)

  • where m represents the number of data items in a batch of first distillation data, that is, the batch size; y_i is the initialization label of each data item, and ŷ_i is the predicted label output by the pre-trained model.
  • the label of the synthesized image is a mixed label determined based on the labels of the original two images.
  • Step S260 based on the batch normalized statistical loss of each pre-training model and the target cross-entropy loss, determine the first loss corresponding to the corresponding pre-training model;
  • the calculated batch normalization statistical loss and the target cross-entropy loss are linearly summed to obtain the first loss corresponding to the corresponding pre-training model.
  • Step S270 calculating the average value of the first loss corresponding to each of the pre-training models to obtain the target loss of each batch of the first distillation data passing through the at least two pre-training models;
  • the first losses corresponding to all the pre-training models are first summed, averaged, and then minimized to obtain the target loss of each batch of the first distillation data passing through the at least two pre-trained models.
  • the above target loss can be calculated by the following formula (3), which aims to optimize each batch of first distillation data by combining feature mixing and data mixing:
  • a prior loss may also be applied to the first distillation data to ensure that the image is generally smooth, where the prior loss is defined as the mean square error between the first distillation data and its Gaussian-blurred version. Therefore, the final minimization objective, i.e. the target loss, is determined by combining the batch normalized statistical loss, the target cross-entropy loss, and the prior loss.
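  • A plausible written-out form of this minimization objective, consistent with the description above (the linear-combination coefficients are taken as 1 and the prior-loss weight α is an assumed symbol, since formula (3) itself is not reproduced in this text):

```latex
\mathcal{L}_{\text{target}}
  = \frac{1}{K} \sum_{k=1}^{K} \left( \mathcal{L}_{BN}^{(k)} + \mathcal{L}_{CE}^{(k)} \right)
  + \alpha \, \mathcal{L}_{\text{prior}},
\qquad
\mathcal{L}_{\text{prior}} = \left\lVert X - \operatorname{GaussianBlur}(X) \right\rVert_2^2
```

  • Here K is the number of sampled pre-training models (three in the described experiments), and L_BN^(k) and L_CE^(k) are the batch normalization statistical loss and the target cross-entropy loss of the batch in the k-th model.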
  • the target loss of the first distillation data passing through the at least two pre-training models is determined by matching the batch normalized statistical loss and the target cross-entropy loss between the features of the first distillation data and the plurality of pre-training models.
  • the general feature space of each pre-trained model is combined by determining the target loss, so that the target distillation data obtained by training is more general than the single-model distillation data.
  • Step S280 based on the target loss, perform backpropagation training on each batch of the first distillation data to obtain the target distillation data.
  • each batch of data will go through 20,000 iterations from initialization to the end of training.
  • Stochastic gradient descent is a simple but very effective method, which is mostly used for learning linear classifiers under convex loss functions such as support vector machines and logistic regression.
  • step S280 can be implemented through the following process: in each iterative training, determine the gradient of the target loss with respect to the first distillation data; update each data item in the first distillation data based on the gradient; when the batch normalization statistical loss and the target cross-entropy loss of the updated first distillation data in each of the pre-training models converge to a stable value, the target distillation data is obtained.
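  • A condensed sketch of this optimization loop, reusing the helper names from the earlier sketches (init_distillation_batch, BNStatsLoss) and assumed hyper-parameters such as the learning rate; it illustrates the averaging over models and the gradient-descent update of the pixels, and is not the patent's reference implementation:

```python
import torch
import torch.nn.functional as F

def distill_batch(models, data, labels, iterations=20000, lr=0.05):
    """Optimize one batch of distillation data by back-propagating a target loss built
    from the BN statistical loss and the cross-entropy loss, averaged over the models."""
    models = [m.eval() for m in models]            # keep running BN statistics frozen
    bn_losses = [BNStatsLoss(m) for m in models]   # forward hooks on every BN layer
    optimizer = torch.optim.SGD([data], lr=lr, momentum=0.9)

    for _ in range(iterations):
        optimizer.zero_grad()
        total = 0.0
        for model, bn_loss in zip(models, bn_losses):
            logits = model(data)                       # forward pass fills the hooks
            loss_ce = F.cross_entropy(logits, labels)  # target cross-entropy loss
            total = total + bn_loss.compute() + loss_ce  # first loss for this model
        (total / len(models)).backward()               # average over models, then backprop
        optimizer.step()                               # gradient-descent update of the pixels

    for bn_loss in bn_losses:                          # clean up the hooks
        bn_loss.remove()
    return data.detach()
```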
  • the stochastic gradient descent method is used to optimize the first distillation data during the backpropagation process, and the iterative training process of each batch of first distillation data is accelerated.
  • For each batch of first distillation data to be trained, at least two different types of pre-training models are randomly sampled from the pre-training model library, and the features of the first distillation data are matched with those of multiple pre-training models at the same time.
  • the general feature space of each pre-training model is combined, so that the distilled data can match the feature distribution of any pre-training model, so that a better balance between training speed and effect can be obtained.
  • Fig. 3 is a schematic flow chart of a method for data distillation provided by an embodiment of the present disclosure. As shown in Fig. 3, the method at least includes the following steps:
  • Step S310 based on the distribution information of the original data, determine at least one batch of second distillation data
  • the original data can be images in a general data set such as the ImageNet data set (for example, the data used to pre-train models such as ResNet50 and MobileNet V2).
  • the distribution information of the raw data may be a normal distribution (Normal Distribution), also known as a Gaussian distribution (Gaussian Distribution).
  • the distribution curve is bell-shaped, that is, the two ends are low and the middle is high, and the left and right sides are symmetrical.
  • If the random variable X obeys a normal distribution with mathematical expectation μ and variance σ², it is denoted as N(μ, σ²).
  • The expected value μ of the normal distribution determines the position of its probability density function, and its standard deviation σ determines the magnitude (spread) of the distribution.
  • the distribution information is Gaussian distribution information, including the mean value, variance, etc. of the original data.
  • the mean and standard deviation of the three color channels are different.
  • the mean value of each image in the ImageNet dataset is [0.485, 0.456, 0.406]
  • the standard deviation is [0.229, 0.224, 0.225].
  • the above step S310 can be implemented through the following process: obtaining the Gaussian distribution information of the original data; based on data randomly sampled from the Gaussian distribution information, initializing at least N initial pixel value matrices to obtain each batch of second distillation data; N is an integer greater than or equal to 2.
  • the data is randomly sampled from the Gaussian distribution information of the original data, and initialized to obtain the second distillation data, which makes up for the inability to directly obtain the original data.
  • Step S320 determine at least two pre-training models
  • the first statistical information of the original data is stored in each pre-training model.
  • Step S330 in each iterative training, mix every two image data in each batch of second distillation data to obtain a batch of first distillation data;
  • In this way, any two image data are randomly selected for data mixing, so that the mixed image contains the information of both image data; the generated target distillation data is therefore more robust, carries more visual information, and can adapt to different scales and spatial orientations. The generated target distillation data is thus more effective for model compression.
  • In the embodiments of the present disclosure, step S320 and step S330 may be executed in either order, or executed at the same time.
  • Step S340 based on the first statistical information in each pre-training model, determine the batch normalized statistical loss of each batch of the first distillation data in the corresponding pre-training model;
  • Step S350 based on the initialization label of each data in each batch of the first distillation data, determine the target cross-entropy loss of each batch of the first distillation data in each of the pre-training models;
  • each batch of first distillation data includes a composite image obtained by mixing the first image and the second image. That is to say, a pair of the first image and the second image is randomly selected from a batch of second distillation data and mixed to obtain a composite image, wherein the first image and the second image respectively have corresponding initialization labels.
  • the target cross-entropy loss of each batch of the first distillation data in each of the pre-training models can be determined by the following steps:
  • Step S3501 based on the initialization label of the first image and the initialization label of the second image, determine the hybrid cross-entropy loss of the composite image;
  • the hybrid cross-entropy loss of the synthetic image is determined by the respective initialization labels of the first image and the second image, which can ensure that the information of the original first image and the second image can be recognized and correctly optimized by the pre-trained model.
  • Step S3502 based on the initialization labels of other images in each batch of first distillation data except the first image and the second image, determine the cumulative cross-entropy loss of the other images;
  • the process of calculating the cumulative cross-entropy loss may refer to formula (2), except that the value of the variable i is other images than the first image and the second image.
  • Step S3503 based on the mixed cross-entropy loss and the cumulative cross-entropy loss, determine the target cross-entropy loss of each batch of the first distillation data after passing through each of the pre-trained models.
  • In this way, the mixed cross-entropy loss of the synthetic image in a batch of first distillation data and the cumulative cross-entropy loss of the other images are summed and then minimized to obtain the target cross-entropy loss of each batch of the first distillation data after passing through each of the pre-trained models.
  • the above step S3501 can be implemented through the following process: based on the initialization label of the first image, determine the first cross-entropy loss after each batch of the first distillation data passes through each of the pre-training models ; Based on the initialization label of the second image, determine the second cross-entropy loss after each batch of the first distillation data passes through each of the pre-training models; based on the difference between the first image and the second image Combining ratios, linearly summing the first cross-entropy loss and the second cross-entropy loss to obtain the mixed cross-entropy loss.
  • the above mixed cross-entropy loss can be determined by the following formula (4):

    L_mix = λ · L_CE^(1) + (1 − λ) · L_CE^(2)    (4)

  • where Y_2 and Y_1 are the initialization labels of the first image and the second image before mixing, respectively; L_CE^(1) and L_CE^(2) are respectively the first cross-entropy loss corresponding to the first image (computed with Y_2) and the second cross-entropy loss corresponding to the second image (computed with Y_1); and λ is the proportional coefficient between the area of the mixed region and the area of the second image.
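  • For illustration, the mixed cross-entropy loss can be written as a small helper in the spirit of formula (4); the function name and argument order are assumptions:

```python
import torch.nn.functional as F

def mixed_cross_entropy(logits, label_first, label_second, lam):
    """Mixed cross-entropy loss for a composite image: lam is the ratio of the pasted
    (mixed) region's area to the second image's area."""
    loss_first = F.cross_entropy(logits, label_first)    # w.r.t. the first image's label
    loss_second = F.cross_entropy(logits, label_second)  # w.r.t. the second image's label
    return lam * loss_first + (1.0 - lam) * loss_second
```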
  • Step S360 based on the batch normalized statistical loss and the target cross-entropy loss of each pre-trained model, perform backpropagation training on each batch of the first distillation data to obtain target distillation data.
  • In this way, the first loss of a batch of first distillation data for each pre-training model is obtained by combining the target cross-entropy loss and the batch normalization statistical loss, and the first losses corresponding to the pre-training models are then linearly combined; this averages out the feature deviation produced by each pre-training model, so that the final target distillation data is closer to the real data and more general.
  • the generated target distillation data can match the feature distribution of multiple models, and at the same time, the target distillation data is more robust and has more visual information through data mixing. After the target distillation data is obtained, it can be directly used to compress the pre-trained model or other unidentified models.
  • In some embodiments, when the quantity of the target distillation data reaches a specific threshold, model compression is performed on each of the pre-trained models. In this way, since the generated target distillation data has more visual information and is universal, it can be directly used in the model compression process, simplifying the compression process.
  • the generated target distillation data can be directly used for compression, without further performing the distillation data generation process.
  • the target distillation data generated by the embodiment of the present disclosure is accurate enough, and it only needs to generate a target distillation data set (including a sufficient amount of target distillation data) once to apply to all pre-trained models and generalize to unidentified models.
  • the model compression operation may include network quantization (Network Quantization), network pruning (Network Pruning), knowledge distillation (Knowledge Distillation) and the like.
  • the pruning process can complete the pruning decision and post-pruning reconstruction based on the target distillation data set;
  • the model quantization process can complete the training quantization process based on the target distillation data set or the calibration process of post-training quantization based on the target distillation data set;
  • the knowledge distillation process can send the target distillation data set to the teacher network and the student network to complete the process of knowledge transfer.
  • In this way, the inversion solution space can be reduced; at the same time, the visual information of the trained target distillation data is more robust and can adapt to different scales and spatial orientations. The batch normalized statistical loss and the target cross-entropy loss are combined to determine the target loss of a batch of first distillation data through multiple pre-trained models, so that the final target distillation data is closer to the real data.
  • Fig. 4A is a schematic flowchart of the data distillation method provided by the embodiment of the present disclosure.
  • the above step S330 (in each iterative training, mixing every two image data in each batch of the second distillation data to obtain a batch of first distillation data) can be implemented through the following steps:
  • Step S410 randomly selecting at least one pair of the first image and the second image from each batch of the second distillation data
  • each pair of the first image and the second image is any two second distillation data. Because the first image and the second image are randomly selected and mixed during each round of iteration, the visual information of the target distillation data obtained after training will be more robust and able to adapt to different scales and spatial orientations.
  • Step S420 reducing the size of each of the first images according to a specific ratio
  • the specific ratio is a ratio between the bounding box of the random coverage area on the second image and the size of the original image data.
  • step S430 the reduced first image is randomly overlaid on the corresponding second image to obtain each batch of the first distillation data.
  • the overlaid composite image includes information of two image data.
  • the first image 41 is proportionally reduced to a fixed size, and then the reduced first image 42 is randomly overlaid on the second image 43 .
  • the overlaid composite image 44 will contain all the information of the first image 41 and the second image 43 .
  • the step S430 can be implemented through the following process: according to the specific ratio, randomly generate a mixed region to be covered in the corresponding second image; based on the bounding box of the mixed region, randomly generate a binary mask; superimpose each pixel value of the reduced first image with the corresponding pixel value in the second image by using the binary mask, to obtain each batch of the first distillation data.
  • the mixing of any two image data is realized by the following formulas (5) and (6):

    M_ij = 1, if x_l ≤ j ≤ x_r and y_d ≤ i ≤ y_u; otherwise M_ij = 0    (5)

    X_syn = M ⊙ g(X_2) + (1 − M) ⊙ X_1    (6)

  • where x_l, x_r, y_d, y_u are in turn the left, right, lower, and upper boundaries of the bounding box of the mixed region, and M_ij is an element of the binary mask M: its value is 1 if position (i, j) lies inside the bounding box and 0 if it lies outside the bounding box;
  • X_2 and X_1 are respectively the first image and the second image to be mixed, and ⊙ denotes element-wise multiplication;
  • g(·) is a linear interpolation function that resizes the first image X_2 to the same size as the bounding box, after which the resized image is placed inside the bounding box region.
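  • A sketch of this mixing step for a single image pair, assuming CxHxW tensors and a fixed shrink ratio; the helper name mix_images and the returned area ratio lam are illustrative choices:

```python
import random
import torch
import torch.nn.functional as F

def mix_images(first, second, ratio=0.5):
    """Shrink the first image by `ratio` and paste it at a random location of the second
    image, in the spirit of formulas (5) and (6). Both inputs are CxHxW tensors."""
    _, h, w = second.shape
    bh, bw = int(h * ratio), int(w * ratio)            # bounding-box size
    # resize the first image to the bounding-box size with linear interpolation (g(.))
    small = F.interpolate(first.unsqueeze(0), size=(bh, bw),
                          mode="bilinear", align_corners=False).squeeze(0)
    # random bounding-box position inside the second image
    y_u = random.randint(0, h - bh)
    x_l = random.randint(0, w - bw)
    mask = torch.zeros(1, h, w)                        # binary mask M, formula (5)
    mask[:, y_u:y_u + bh, x_l:x_l + bw] = 1.0
    padded = torch.zeros_like(second)
    padded[:, y_u:y_u + bh, x_l:x_l + bw] = small      # place g(first) inside the box
    mixed = mask * padded + (1.0 - mask) * second      # composite image, formula (6)
    lam = (bh * bw) / float(h * w)                     # area ratio used by the mixed label
    return mixed, lam
```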
  • the first image is scaled down and randomly overlaid into the second image during data mixing, and the mixed composite image will contain information of the first image and the second image.
  • the two mixed image information can be recognized and optimized correctly by the pre-trained model.
  • FIG. 5A is a logic flow diagram of a data distillation method provided by an embodiment of the present disclosure. As shown in FIG. 5A, the method includes at least the following steps:
  • Step S510 using the Gaussian distribution of the original data to initialize a batch of distillation data
  • the distillation process first uses a Gaussian distribution to initialize a batch of distillation data (equivalent to the second distillation data); the Gaussian distribution uses the statistical values of the original data.
  • the distribution information of the original training data is obtained first, and then a batch of distilled data is initialized with the statistical value of the Gaussian distribution.
  • Step S520 randomly sampling three pre-training models from the pre-training model cluster
  • Step S530 using data mixing and feature mixing to train a batch of distillation data through three pre-trained models
  • the embodiment of the present disclosure adopts a stochastic gradient descent method to train a batch of mixed distillation data (equivalent to the first distillation data).
  • Step S540 repeating the above training process until a specified amount of target distillation data is generated.
  • Since the generated target distillation data is sufficiently accurate and general, the target distillation data generated in the embodiments of the present disclosure can even generalize to unidentified models. Therefore, when a new model compression requirement is proposed, the already generated target distillation data can be directly used for compression.
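  • Putting the steps of Fig. 5A together, a compact sketch of the overall generation loop; the choice of torchvision architectures for the pre-training model cluster, the batch count, and the reuse of the helpers sketched earlier (init_distillation_batch, distill_batch, mix_images) are all illustrative assumptions:

```python
import random
import torchvision

# an illustrative pre-training model "cluster" (any BN-based architectures work)
model_zoo = [
    torchvision.models.resnet50(weights="IMAGENET1K_V1"),
    torchvision.models.resnet18(weights="IMAGENET1K_V1"),
    torchvision.models.mobilenet_v2(weights="IMAGENET1K_V1"),
    torchvision.models.vgg16_bn(weights="IMAGENET1K_V1"),
]

target_dataset = []
num_batches = 100                       # repeat until the specified amount is generated

for _ in range(num_batches):
    # Step S510: initialize a batch of distillation data from the original data's Gaussian statistics
    data, labels = init_distillation_batch()
    # Step S520: randomly sample three pre-training models from the cluster
    models = random.sample(model_zoo, 3)
    # Step S530: train the batch through the three models (in the full method, mix_images is
    # also applied to random pairs within the batch during the iterations, i.e. data mixing)
    distilled = distill_batch(models, data, labels)
    # Step S540: accumulate the batch of target distillation data
    target_dataset.append((distilled, labels))
```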
  • FIG. 5B is an algorithm framework combining data mixing and feature mixing provided by an embodiment of the present disclosure.
  • the solid arrows in the framework reflect the forward training flow
  • the dotted arrows reflect the backward inversion process flow.
  • the data mixing 51 process is performed first: the first image 501 and the second image 502 are randomly selected from a batch of distillation data and mixed to obtain a composite image 503. Then the feature mixing 52 process is carried out: a batch of distillation data including the composite image 503 is input into the pre-training model 504, the pre-training model 505, and the pre-training model 506 respectively (three models are used here as an example, but the number is not limited and can be larger), and the target cross-entropy loss 507 and the batch normalization statistical loss 508 after the three pre-trained models are calculated.
  • each component block of each pre-training model includes an activation layer 61, a batch normalization layer 62, and a convolutional layer 63; the batch normalization layer 62 records the mean and variance (μ, σ²), and the means and variances stored in all the blocks of each pre-training model can be extracted to determine the above-mentioned batch normalization statistical loss 508.
  • the target loss is further determined using the target cross-entropy loss 507 and the batch normalization statistical loss 508 determined in the previous iterative training process, so as to optimize the composite image 503 using the target loss. Since the pre-trained model already captures the image category information of the distilled data, knowledge can be retrieved by assigning mixed labels to the synthesized images 503 .
  • the forward training process includes the following steps:
  • Step S5301 performing random data mixing on a batch of distillation data
  • Step S5302 input the mixed batch of distillation data into the three pre-training models respectively, and determine the difference between the current statistic and the original statistic stored in the model;
  • the current statistic is the second statistic of a batch of distillation data
  • the original statistic is the first statistic of the original data stored in the model.
  • Feature mixing refers to computing the statistics of a batch of distillation data in a pre-trained model.
  • These statistics are compared with the first statistics of the original data, namely the mean and variance. If the mean and variance of a batch of distillation data in the pre-training model are not much different from the mean and variance of the original data, the batch of distillation data is considered to be very similar to the original data; if the difference is still large, the error between them needs to be further minimized.
  • Step S5303 using difference backpropagation to calculate the gradient, and performing gradient descent update on the mixed batch of distillation data
  • Step S5304 judging whether the iterative training is completed.
  • When the target distillation data looks consistent with the original data, the iterative training is completed; otherwise, the process returns to step S5301 and continues training.
  • the distillation data generated by the data-free compression method in the related art can only be used for the compression of the model it was generated from; moreover, because the distillation process is an irreversible operation, the data contains a lot of unreasonable visual information, so the generated distillation data cannot be transferred and needs to be generated repeatedly.
  • the generated target distillation data has more robust visual information and features, so the target distillation data is very versatile and can be used for different models.
  • the embodiment of the present disclosure only needs to generate the target distillation data once, and then it can be used all the time.
  • the embodiments of the present disclosure use multiple pre-trained models to improve the versatility of distillation data generation, and at the same time use data information mixture to generate robust visual information.
  • distillation data sets can be generated in advance and directly used for model compression, such as quantization and pruning.
  • In the related art, by contrast, distillation data needs to be generated anew before each use.
  • the present disclosure further provides a device for data distillation.
  • the device includes all of the modules, the sub-modules included in each module, and the units; these can be implemented by a processor in an electronic device, or by a specific logic circuit. In the process of implementation, the processor can be a central processing unit (Central Processing Unit, CPU), a microprocessor (Micro Processing Unit, MPU), a digital signal processor (Digital Signal Processor, DSP), a field programmable gate array (Field Programmable Gate Array, FPGA), etc.
  • Fig. 6 is a schematic diagram of the composition and structure of a data distillation device provided by an embodiment of the present disclosure.
  • the device 600 includes a first determination module 610, a second determination module 620, a third determination module 630, a fourth determination module 640, and a training module 650, wherein:
  • the first determination module 610 is configured to determine at least one batch of first distillation data to be trained; in each batch of the first distillation data, there is at least one mixed data including two kinds of data label information;
  • the second determination module 620 is configured to determine at least two pre-training models; wherein, each of the pre-training models stores the first statistical information of the original data;
  • the third determination module 630 is configured to determine the batch normalized statistical loss of each batch of the first distillation data in the corresponding pre-training model based on the first statistical information in each of the pre-training models;
  • the fourth determination module 640 is configured to determine the target intersection of each batch of the first distillation data in each of the pre-training models based on the initialization label of each data in each batch of the first distillation data entropy loss;
  • the training module 650 is configured to perform backpropagation training on each batch of the first distillation data based on the batch normalized statistical loss and the target cross-entropy loss of that batch in each of the pre-training models, to obtain the target distillation data.
  • the third determining module 630 includes a first determining submodule and a second determining submodule, wherein: the first determining submodule is configured to determine the second statistical information of each batch of the first distillation data in each of the pre-training models; the second determining submodule is configured to determine, for each of the pre-training models, the batch normalization statistical loss between the first statistical information and the second statistical information.
  • the second determining module 620 is further configured to randomly select at least two different types of pre-training models from the pre-training model library.
  • the training module 650 includes a third determination submodule, a processing submodule, and a training submodule, wherein: the third determination submodule is configured to determine, based on the batch normalized statistical loss and the target cross-entropy loss of each pre-training model, the first loss corresponding to the corresponding pre-training model; the processing submodule is configured to average the first losses corresponding to the pre-training models to obtain the target loss of each batch of the first distillation data passing through the at least two pre-trained models; the training submodule is configured to perform backpropagation training on each batch of the first distillation data based on the target loss, to obtain the target distillation data.
  • the first determination module includes an initialization submodule and a mixing submodule, wherein: the initialization submodule is configured to initialize at least one batch of second distillation data based on the distribution information of the original data;
  • the mixing sub-module is configured to mix every two image data in each batch of the second distillation data to obtain each batch of the first distillation data in each iterative training.
  • the mixing submodule includes a selection unit, a scaling unit, and a covering unit, wherein: the selection unit is configured to randomly select at least one pair of a first image and a second image from each batch of the second distillation data; the scaling unit is configured to reduce the size of each of the first images according to a specific ratio; the covering unit is configured to randomly overlay the reduced first image onto the corresponding second image to obtain each batch of the first distillation data.
  • the covering unit includes a first generating subunit, a second generating subunit, and a superimposing subunit, wherein: the first generating subunit is configured to randomly generate, according to the specific ratio, a mixed region to be covered in the corresponding second image; the second generating subunit is configured to randomly generate a binary mask based on the bounding box of the mixed region; the superimposing subunit is configured to superimpose, by using the binary mask, each pixel value of the reduced first image with the corresponding pixel value in the second image, to obtain each batch of the first distillation data.
  • each batch of first distillation data includes a synthetic image obtained by mixing the first image and the second image
  • the fourth determining module 640 includes a fourth determining submodule, a fifth determining submodule, and a sixth determining submodule, wherein: the fourth determining submodule is configured to determine the hybrid cross-entropy loss of the composite image based on the initialization label of the first image and the initialization label of the second image;
  • the fifth determination submodule is configured to determine the accumulation of the other images based on the initialization labels of the other images in each batch of first distillation data except the first image and the second image Cross-entropy loss;
  • the sixth determining submodule is configured to determine the target of each batch of the first distillation data after passing through each of the pre-training models based on the mixed cross-entropy loss and the cumulative cross-entropy loss Cross entropy loss.
  • the fourth determining submodule includes a first determining unit, a second determining unit, and a summing unit, wherein: the first determining unit is configured to determine, based on the initialization label of the first image, the first cross-entropy loss after each batch of the first distillation data passes through each of the pre-training models; the second determining unit is configured to determine, based on the initialization label of the second image, the second cross-entropy loss after each batch of the first distillation data passes through each of the pre-training models; the summing unit is configured to linearly sum the first cross-entropy loss and the second cross-entropy loss based on the synthesis ratio between the first image and the second image, to obtain the mixed cross-entropy loss.
  • the first statistical information includes mean and variance
  • each of the pre-training models includes at least two building blocks
  • each of the building blocks includes a convolutional layer and a batch normalization layer
  • the third determination module 630 also includes an extraction submodule and an accumulation submodule, wherein: the extraction submodule is configured to extract the mean and variance of the original data from the batch normalization layer of each of the constituent blocks; the mean and variance are obtained statistically from the features of the original data extracted by the convolutional layer; the accumulation submodule is configured to determine the first statistical information in each of the pre-training models based on the means and variances of the at least two constituent blocks in each of the pre-training models.
  • the distribution information is Gaussian distribution information
  • the initialization submodule includes an acquisition unit and an initialization unit, wherein: the acquisition unit is configured to acquire the Gaussian distribution information of the original data; the The initialization unit is configured to initialize at least N initial pixel value matrices based on randomly sampled data from the Gaussian distribution information to obtain each batch of the second distillation data; N is an integer greater than or equal to 2.
  • the training submodule includes a third determination unit, an update unit, and a training unit, wherein: the third determination unit is configured to determine, in each iterative training, the gradient of the target loss with respect to the first distillation data; the update unit is configured to update each data item in the first distillation data based on the gradient; the training unit is configured to obtain the target distillation data when the batch normalization statistical loss and the target cross-entropy loss of the updated first distillation data in each of the pre-training models converge to a stable value.
  • the device further includes a compression module configured to perform model compression on each of the pre-trained models when the quantity of at least one batch of the target distillation data reaches a specific threshold.
  • the description of the above device embodiment is similar to the description of the above method embodiment, and has beneficial effects similar to those of the method embodiment.
  • for technical details not disclosed in the device embodiments of the present disclosure, please refer to the description of the method embodiments of the present disclosure.
  • if the above data distillation method is implemented in the form of software function modules and sold or used as an independent product, it may also be stored in a computer-readable storage medium.
  • the essence of the technical solutions of the embodiments of the present disclosure, or the part contributing to the related art, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing an electronic device (which may be a smartphone with a camera, a tablet computer, etc.) to execute all or part of the methods described in the various embodiments of the present disclosure.
  • the aforementioned storage medium includes various media that can store program codes, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disc.
  • an embodiment of the present disclosure provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the steps in any one of the methods for data distillation described in the above-mentioned embodiments are implemented.
  • a chip is also provided; the chip includes programmable logic circuits and/or program instructions, and when the chip runs, it is used to implement the steps in the data distillation method described in any of the above embodiments.
  • a computer program product is also provided; when the computer program product is executed by the processor of the electronic device, it is used to implement the steps of the data distillation method in any of the above embodiments.
  • An embodiment of the present disclosure further provides a computer program product, the computer program product carries a program code, and instructions included in the program code can be used to execute the steps in any one of the data distillation methods in the above method embodiments.
  • the above-mentioned computer program product may be specifically implemented by means of hardware, software or a combination thereof.
  • the computer program product is embodied as a computer storage medium, and in another optional embodiment, the computer program product is embodied as a software product, such as a software development kit (Software Development Kit, SDK), and so on.
  • An embodiment of the present disclosure also provides a computer program, including computer readable codes.
  • when the computer-readable codes run in the electronic device, the processor in the electronic device executes the steps of any one of the above method embodiments.
  • the data distillation method is not limited to the above.
  • FIG. 7 is a schematic diagram of hardware entities of an electronic device provided by an embodiment of the present disclosure.
  • the electronic device 700 includes a memory 710 and a processor 720; the memory 710 stores a computer program executable on the processor 720, and the processor 720 implements the steps in any one of the data distillation methods in the embodiments of the present disclosure when executing the program.
  • the memory 710 is configured to store instructions and applications executable by the processor 720, and can also cache data to be processed or already processed by the processor 720 and by various modules in the electronic device (for example, image data, audio data, voice communication data and video data), and can be implemented by a flash memory (FLASH) or a random access memory (RAM).
  • when the processor 720 executes the program, the steps of any one of the data distillation methods described above are implemented.
  • the processor 720 generally controls the overall operation of the electronic device 700 .
  • the above-mentioned processor may be at least one of an application-specific integrated circuit (ASIC), a digital signal processor (DSP), a digital signal processing device (DSPD), a programmable logic device (PLD), a field-programmable gate array (FPGA), a central processing unit (CPU), a controller, a microcontroller, or a microprocessor. Understandably, the electronic device that implements the above processor function may also be another device, which is not specifically limited in the embodiments of the present disclosure.
  • the above-mentioned computer storage medium/memory may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), a ferromagnetic random access memory (FRAM), a flash memory, a magnetic surface memory, an optical disc, or a compact disc read-only memory (CD-ROM); it may also be various electronic devices including one or any combination of the above-mentioned memories, such as a mobile phone, a computer, a tablet device, a personal digital assistant, and the like.
  • the disclosed devices and methods may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division.
  • the coupling, or direct coupling, or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be electrical, mechanical, or in other forms.
  • the units described above as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; they may be located in one place or distributed to multiple network units; Part or all of the units can be selected according to actual needs to achieve the purpose of the solutions of the embodiments of the present disclosure.
  • each functional unit in each embodiment of the present disclosure may be integrated into one processing unit, or each unit may serve as a single unit, or two or more units may be integrated into one unit; the above integrated unit can be implemented in the form of hardware or in the form of hardware plus a software functional unit.
  • if the above-mentioned integrated units of the present disclosure are implemented in the form of software function modules and sold or used as independent products, they may also be stored in a computer-readable storage medium.
  • the essence of the technical solutions of the embodiments of the present disclosure, or the part contributing to the related art, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a device to execute all or part of the methods described in the various embodiments of the present disclosure.
  • the aforementioned storage medium includes various media capable of storing program codes, such as a removable storage device, a ROM, a magnetic disk, or an optical disc.
  • the methods disclosed in the several method embodiments provided in the present disclosure can be combined arbitrarily, without conflict, to obtain new method embodiments.
  • the features disclosed in the several method or device embodiments provided in the present disclosure may be combined arbitrarily, without conflict, to obtain new method embodiments or device embodiments.
  • At least one batch of first distillation data to be trained is first determined; then at least two pre-training models are determined; based on the first statistical information in each of the pre-training models, the batch normalization statistical loss of each batch of the first distillation data in the corresponding pre-training model is determined; then, based on the initialization label of each piece of data in each batch of the first distillation data, the target cross-entropy loss of each batch of the first distillation data in each of the pre-training models is determined; finally, based on the batch normalization statistical loss and the target cross-entropy loss of each batch of the first distillation data in each of the pre-training models, back-propagation training is performed on each batch of the first distillation data to obtain the target distillation data. In this way, robust visual information is generated through data mixing, and the characteristics of the original data stored in multiple pre-training models are utilized, so that the trained target distillation data can match the feature distributions of multiple models, making the target distillation data more general and more effective.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

A data distillation method and apparatus, a device, a storage medium, a computer program, and a product. The method comprises: determining at least one batch of first distillation data to be trained (S110); determining at least two pretraining models (S120); determining a batch normalization statistical loss of each batch of first distillation data in the corresponding pretraining model on the basis of first statistical information in each pretraining model (S130); determining a target cross entropy loss of each batch of first distillation data in each pretraining model on the basis of an initialized label of each piece of data in each batch of first distillation data (S140); performing back propagation training on each batch of first distillation data on the basis of the batch normalization statistical loss and the target cross entropy loss of each batch of first distillation data in each pretraining model, to obtain target distillation data (S150).

Description

Data distillation method, apparatus, device, storage medium, computer program, and product
Cross-Reference to Related Applications
The present disclosure is based on, and claims priority to, the Chinese patent application with application number 202110994122.7, filed on August 27, 2021 and entitled "Data Distillation Method, Apparatus, Electronic Device, and Storage Medium", the entire contents of which are incorporated into the present disclosure by reference.
Technical Field
The present disclosure relates to the field of computer vision, and relates to, but is not limited to, data distillation methods, apparatuses, devices, storage media, computer programs, and products.
Background
In the era of big data, deep learning models are used more and more frequently. To apply deep learning models to small devices such as mobile devices and sensors, the models must sometimes be compressed and pruned before they can be deployed.
Compressing a neural network usually requires the original training data, because the compressed model generally needs to be retrained to recover its previous performance. However, the original data is private in some cases; that is, there is a risk that the original data cannot be obtained.
Summary
Embodiments of the present disclosure provide a data distillation method, apparatus, device, storage medium, computer program, and product.
The technical solutions of the embodiments of the present disclosure are implemented as follows:
In a first aspect, an embodiment of the present disclosure provides a data distillation method, including: determining at least one batch of first distillation data to be trained, where each batch of the first distillation data contains at least one piece of mixed data that includes label information of two pieces of data; determining at least two pre-training models, where each of the pre-training models stores first statistical information of original data; determining, based on the first statistical information in each of the pre-training models, a batch normalization statistical loss of each batch of the first distillation data in the corresponding pre-training model; determining, based on an initialization label of each piece of data in each batch of the first distillation data, a target cross-entropy loss of each batch of the first distillation data in each of the pre-training models; and performing back-propagation training on each batch of the first distillation data based on the batch normalization statistical loss and the target cross-entropy loss of each batch of the first distillation data in each of the pre-training models, to obtain target distillation data.
In a second aspect, an embodiment of the present disclosure provides a data distillation apparatus, including a first determining module, a second determining module, a third determining module, a fourth determining module, and a training module, wherein:
the first determining module is configured to determine at least one batch of first distillation data to be trained, where each batch of the first distillation data contains at least one piece of mixed data that includes label information of two pieces of data; the second determining module is configured to determine at least two pre-training models, where each of the pre-training models stores first statistical information of original data; the third determining module is configured to determine, based on the first statistical information in each of the pre-training models, a batch normalization statistical loss of each batch of the first distillation data in the corresponding pre-training model; the fourth determining module is configured to determine, based on an initialization label of each piece of data in each batch of the first distillation data, a target cross-entropy loss of each batch of the first distillation data in each of the pre-training models; and the training module is configured to perform back-propagation training on each batch of the first distillation data based on the batch normalization statistical loss and the target cross-entropy loss of each batch of the first distillation data in each of the pre-training models, to obtain target distillation data.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including a memory and a processor, where the memory stores a computer program executable on the processor, and the processor implements the steps in the above data distillation method when executing the program.
In a fourth aspect, an embodiment of the present disclosure provides a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the steps in the above data distillation method.
In a fifth aspect, the present disclosure further provides a computer program, including computer-readable codes; when the computer-readable codes run in an electronic device, a processor in the electronic device executes the steps for implementing the implementations of the above first aspect.
In a sixth aspect, the present disclosure further provides a computer program product, where the computer program product includes one or more instructions, and the one or more instructions are suitable for being loaded by a processor to execute the steps in the above first aspect.
The beneficial effects brought by the technical solutions provided by the embodiments of the present disclosure include at least the following:
In the embodiments of the present disclosure, at least one batch of first distillation data to be trained is first determined; then at least two pre-training models are determined; next, based on the first statistical information in each of the pre-training models, the batch normalization statistical loss of each batch of the first distillation data in the corresponding pre-training model is determined; based on the initialization label of each piece of data in each batch of the first distillation data, the target cross-entropy loss of each batch of the first distillation data in each of the pre-training models is determined; finally, based on the batch normalization statistical loss and the target cross-entropy loss of each batch of the first distillation data in each of the pre-training models, back-propagation training is performed on each batch of the first distillation data to obtain target distillation data. In this way, robust visual information is generated through data mixing, and the characteristics of the original data stored in multiple pre-training models are utilized, so that the trained target distillation data can match the feature distributions of multiple models; as a result, the target distillation data is more general and performs better.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present disclosure more clearly, the accompanying drawings required in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present disclosure, and those of ordinary skill in the art can obtain other drawings based on these drawings without creative effort, wherein:
FIG. 1 is a schematic flowchart of a data distillation method provided by an embodiment of the present disclosure;
FIG. 2 is a schematic flowchart of a data distillation method provided by an embodiment of the present disclosure;
FIG. 3 is a schematic flowchart of a data distillation method provided by an embodiment of the present disclosure;
FIG. 4A is a schematic flowchart of a data distillation method provided by an embodiment of the present disclosure;
FIG. 4B is a schematic diagram of a data mixing method provided by an embodiment of the present disclosure;
FIG. 5A is a logic flowchart of a data distillation method provided by an embodiment of the present disclosure;
FIG. 5B shows an algorithm framework combining data mixing and feature mixing provided by an embodiment of the present disclosure;
FIG. 5C is a schematic diagram of a training process combining data mixing and feature mixing provided by an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of the composition and structure of a data distillation apparatus provided by an embodiment of the present disclosure;
FIG. 7 is a schematic diagram of a hardware entity of an electronic device provided by an embodiment of the present disclosure.
Detailed Description
To make the objectives, technical solutions, and advantages of the embodiments of the present disclosure clearer, the technical solutions in the embodiments of the present disclosure are described clearly and completely below with reference to the accompanying drawings in the embodiments of the present disclosure. Obviously, the described embodiments are some, rather than all, of the embodiments of the present disclosure. The following embodiments are used to illustrate the present disclosure, but not to limit its scope. Based on the embodiments of the present disclosure, all other embodiments obtained by persons of ordinary skill in the art without creative effort fall within the protection scope of the present disclosure.
In the following description, references to "some embodiments" describe a subset of all possible embodiments; it should be understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with one another without conflict.
It should be noted that the terms "first\second\third" in the embodiments of the present disclosure are only used to distinguish similar objects and do not represent a specific ordering of the objects. It can be understood that, where permitted, "first\second\third" may be interchanged in a specific order or sequence, so that the embodiments of the present disclosure described herein can be implemented in an order other than that illustrated or described herein.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meanings as commonly understood by those of ordinary skill in the art to which the embodiments of the present disclosure belong. It should also be understood that terms such as those defined in commonly used dictionaries should be understood as having meanings consistent with their meanings in the context of the related art, and unless specifically defined as herein, should not be interpreted in an idealized or overly formal sense.
The solutions provided by the embodiments of the present disclosure relate to the field of deep learning. To facilitate understanding of the solutions of the embodiments of the present disclosure, the terms involved in the related art are first briefly described:
Artificial intelligence (AI) is a theory, method, technology, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can respond in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
Artificial intelligence technology is a comprehensive discipline involving a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Machine learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It studies how computers simulate or implement human learning behaviors to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve their performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and its applications pervade all fields of artificial intelligence. Machine learning and deep learning usually include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.
Dataset distillation differs from knowledge distillation, which transfers knowledge from a complex network to a simpler model: dataset distillation keeps the model unchanged and compresses the knowledge of a large data set (usually containing thousands or millions of images) in the original data set into a small amount of synthetic data, while the performance of a model trained on the synthetic data is nearly the same as that of a model trained on the original data set.
The distillation data used in the data-free compression methods provided in the related art is obtained by distillation from a single pre-training model. This pre-training model stores some features of the original training data, which can reflect the distribution properties of the original training data. Therefore, when an arbitrary initialization image is input into the pre-training model, one can observe whether the features of the distilled image output by the model are consistent with the features of the original training images. By minimizing the difference between these two sets of features (a process that can be called data distillation), a distilled image very close to the original training images can be obtained. The distilled image can then be used to train the model to improve the performance of the compressed model.
The data distillation process can effectively remove the dependence on the original training data during model compression. However, existing data distillation methods still have two problems: (1) the generated distillation data is not general; the original training data can be used to compress any pre-training model, but the distillation data generated by pre-training model A is difficult to use for compressing pre-training model B, and vice versa; (2) the generated distillation data is inaccurate; generating distillation data from a pre-training model is equivalent to solving an inverse-function problem, but neural networks are highly irreversible and highly non-convex, so the generated distillation data is inaccurate and differs considerably from the original training images.
An embodiment of the present disclosure provides a data distillation method, which is applied to an electronic device. The electronic device includes, but is not limited to, a mobile phone, a laptop computer, a tablet computer, a handheld Internet device, a multimedia device, a streaming media device, a mobile Internet device, a wearable device, or other types of devices. The functions implemented by the method can be realized by a processor in the electronic device calling program code, and the program code can of course be stored in a computer storage medium; it can be seen that the electronic device includes at least a processor and a storage medium. The processor can be used to carry out the process of generating distillation data, and the memory can be used to store intermediate data required in the process of generating distillation data as well as the generated target distillation data.
FIG. 1 is a schematic flowchart of a data distillation method provided by an embodiment of the present disclosure. As shown in FIG. 1, the method includes at least the following steps:
Step S110: determine at least one batch of first distillation data to be trained;
Here, the at least one batch of first distillation data to be trained refers to one or more batches of data images; the batch size is a hyperparameter that defines the number of samples to be processed by a pre-training model in each training pass.
In each batch of the first distillation data, there is at least one piece of mixed data that includes label information of two pieces of data. For example, if the mixed data is a composite image obtained by mixing two images, the composite image carries all the label information of the two original images. The first distillation data to be trained may be in matrix form, and the pixel values in the matrix may be randomly initialized using the distribution information of the original data. The specific numbers of rows and columns of the matrix may also be determined according to actual needs. In this way, the first distillation data to be trained can be obtained quickly and conveniently, providing a good basis for subsequent processing.
Step S120: determine at least two pre-training models;
Here, a pre-training model means that, at the beginning of training, a model can load the parameters of a model with the same network structure; the purpose is feature transfer, which makes the model perform better.
In some embodiments, the pre-training model includes one or more of a convolutional neural network (CNN), a recursive neural network, a recurrent neural network (RNN), and a deep neural network (DNN), which is not limited in the embodiments of the present disclosure.
Each of the pre-training models stores first statistical information of the original data, such as statistical values including the mean, standard deviation, and variance. The mean is the expected value of the normal distribution curve, which determines the location of the original data distribution; the variance is the square of the standard deviation, and the standard deviation determines the spread of the original data distribution.
It should be noted that in each neural network, the distribution features of the training data are encoded and processed in its own feature space. Therefore, in the process of training a pre-training model with the original data, the first statistical information of the original data is stored in the batch normalization layer of each constituent block of the pre-training model.
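As an illustration, the minimal sketch below (assuming a PyTorch/torchvision implementation; ResNet-50 is only an example of a pre-training model) reads out the first statistical information that each batch normalization layer accumulated over the original training data:

```python
# Minimal sketch, assuming PyTorch/torchvision: read the running statistics that
# each BatchNorm layer of a pre-trained model stores about the original data.
import torch.nn as nn
import torchvision.models as models

model = models.resnet50(pretrained=True)   # illustrative choice of pre-training model
model.eval()

first_statistics = []
for module in model.modules():
    if isinstance(module, nn.BatchNorm2d):
        # running_mean / running_var were accumulated over the original training data,
        # so they summarize the feature distribution of that data per constituent block.
        first_statistics.append((module.running_mean.clone(), module.running_var.clone()))

print(f"collected statistics from {len(first_statistics)} batch normalization layers")
```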
Step S130: determine, based on the first statistical information in each of the pre-training models, the batch normalization statistical loss of each batch of the first distillation data in the corresponding pre-training model;
Here, by matching the first statistical information in each pre-training model with the second statistical information of the first distillation data in that pre-training model, the statistical loss between the first distillation data and the original data in the corresponding pre-training model is determined. The statistical features of the first distillation data can then be matched against multiple pre-training models, combining the general feature spaces of the pre-training models, so that the trained target distillation data is more general than data distilled from a single model.
It should be noted that the mean and variance computed by a batch normalization layer in the pre-training model are based on all the first distillation data included in the current batch.
Step S140: determine, based on the initialization label of each piece of data in each batch of the first distillation data, the target cross-entropy loss of each batch of the first distillation data in each of the pre-training models;
Here, the initialization labels are assigned randomly when the first distillation data is initialized using the Gaussian distribution of the original data; for example, each piece of data in a batch of first distillation data is assigned any value from 1 to 1000 as its initialization label. The initialization labels of different pieces of data may be the same or different.
By matching the initialization label of each piece of data against the predicted label output by the pre-training model for the corresponding data, and linearly summing the matching results of all the data in a batch of first distillation data, the target cross-entropy loss of the batch of first distillation data in each of the pre-training models can be determined.
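A minimal sketch of this step, assuming PyTorch and 1000 classes (both illustrative), assigns a random initialization label to each image in the batch and computes the batch-averaged cross-entropy between the pre-training model's predictions and these fixed labels:

```python
# Minimal sketch, assuming PyTorch: random initialization labels per image and the
# batch-averaged cross-entropy against the pre-training model's predictions.
import torch
import torch.nn.functional as F
import torchvision.models as models

batch_size, num_classes = 16, 1000                         # illustrative values
model = models.resnet18(pretrained=True).eval()            # illustrative pre-training model
first_distillation_data = torch.randn(batch_size, 3, 224, 224, requires_grad=True)
init_labels = torch.randint(0, num_classes, (batch_size,)) # fixed once at initialization

logits = model(first_distillation_data)
target_ce_loss = F.cross_entropy(logits, init_labels)      # mean over the batch, as in the text
```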
Step S150: perform back-propagation training on each batch of the first distillation data based on the batch normalization statistical loss and the target cross-entropy loss of each batch of the first distillation data in each of the pre-training models, to obtain target distillation data.
Here, the target distillation data is a set of distilled images very close to the original training images. The target distillation data usually carries certain information that encodes the discriminative features of each category.
The batch normalization statistical loss and the target cross-entropy loss of each batch of the first distillation data in each of the pre-training models are processed linearly; for example, the cumulative loss of each pre-training model is determined first, and the losses of all the pre-training models are then summed and averaged, so that the target loss of the first distillation data over the at least two pre-training models can be determined.
During back-propagation, the gradient of the target loss with respect to the first distillation data is computed, and gradient-descent updates, such as stochastic gradient descent (SGD), are performed to progressively optimize the first distillation data; after about twenty thousand iterations of training, the target distillation data is finally obtained.
In the embodiments of the present disclosure, at least one batch of first distillation data to be trained is first determined; then at least two pre-training models are determined; next, based on the first statistical information in each of the pre-training models, the batch normalization statistical loss of each batch of the first distillation data in the corresponding pre-training model is determined; based on the initialization label of each piece of data in each batch of the first distillation data, the target cross-entropy loss of each batch of the first distillation data in each of the pre-training models is determined; finally, based on the batch normalization statistical loss and the target cross-entropy loss of each batch of the first distillation data in each of the pre-training models, back-propagation training is performed on each batch of the first distillation data to obtain the target distillation data. In this way, robust visual information is generated through data mixing, and the features of the original data stored in multiple pre-training models are utilized, so that the trained target distillation data can match the feature distributions of multiple models; the target distillation data is therefore more general and performs better.
FIG. 2 is a schematic flowchart of a data distillation method provided by an embodiment of the present disclosure. As shown in FIG. 2, the method includes at least the following steps:
Step S210: determine at least one batch of first distillation data to be trained;
Here, in each batch of the first distillation data there is at least one piece of mixed data that includes label information of two pieces of data.
Step S220: randomly select at least two pre-training models of different types from a pre-training model library;
Here, each of the pre-training models stores the first statistical information of the original data. Pre-training models of different types have different network structures.
It is worth noting that the feature mixing method in the embodiments of the present disclosure requires the feature information of multiple models, but too many models would slow down the optimization. Therefore, for each batch of first distillation data to be optimized, only a small subset of models is sampled for feature mixing. Experiments show that three models provide the best balance between speed and effectiveness.
In this way, for each batch of first distillation data to be trained, at least two pre-training models of different types are randomly sampled from the pre-training library for feature mixing, so that the distilled data matches the feature distribution of any pre-training model, and a better balance between training speed and effectiveness can be obtained, as illustrated by the sketch below.
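For illustration only, the following sketch (assuming PyTorch/torchvision; the model names are placeholders for whatever the pre-training library actually contains) samples a small subset of three pre-training models per batch:

```python
# Minimal sketch: sample a small subset (three, per the experiments above) of
# pre-training models of different types from a larger model library.
import random
import torchvision.models as models

model_library = {
    "resnet18": lambda: models.resnet18(pretrained=True),
    "resnet50": lambda: models.resnet50(pretrained=True),
    "mobilenet_v2": lambda: models.mobilenet_v2(pretrained=True),
    "densenet121": lambda: models.densenet121(pretrained=True),
}

def sample_pretrained_models(k=3):
    names = random.sample(list(model_library), k)
    return [model_library[name]().eval() for name in names]

ensemble = sample_pretrained_models(k=3)   # re-drawn for every batch of first distillation data
```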
Step S230: determine the second statistical information of each batch of the first distillation data in each of the pre-training models;
Here, in each neural network, the distribution features of the training data are encoded and processed in its own feature space, namely the batch normalization (BN) layers. The first distillation data can therefore reasonably simulate the feature distribution of the original images in the pre-training network. After each round of distillation training on the first distillation data, the activation mean and variance are extracted from the batch normalization layers of each pre-training model as the second statistical information of the first distillation data in the corresponding pre-training model.
Step S240: for each of the pre-training models, determine the batch normalization statistical loss between the first statistical information and the second statistical information;
Here, the batch normalization statistical loss for each pre-training model is determined by matching the second statistical information of the first distillation data in that pre-training model against the first statistical information stored in the corresponding pre-training model.
In some embodiments, the first statistical information includes a mean and a variance, each of the pre-training models includes at least two constituent blocks, and each of the constituent blocks includes a convolutional layer and a batch normalization layer. The first statistical information in each of the pre-training models can be determined through the following steps: first, the mean and variance of the original data are extracted from the batch normalization layer of each of the constituent blocks, where the mean and variance are statistics of the features of the original data extracted by the convolutional layer; then, based on the means and variances of the at least two constituent blocks in each of the pre-training models, the first statistical information in each of the pre-training models is determined. In this way, by extracting the mean and variance stored in the batch normalization layer of each constituent block of a pre-training model, the first statistical information of the corresponding pre-training model is determined, ensuring that the trained target distillation data can reasonably simulate the activation distribution of the original data in the pre-training network.
Exemplarily, the batch normalization statistical loss L_BN for each of the pre-training models can be determined by the following formula (1):
$$L_{BN} = \sum_{i} \left( \left\| \mu_i(X) - \mu_i^{BN} \right\|_2 + \left\| \sigma_i^2(X) - \left(\sigma_i^2\right)^{BN} \right\|_2 \right) \qquad (1)$$

where $\mu_i(X)$ and $\sigma_i^2(X)$ are, respectively, the mean and variance of the first distillation data $X$ in the $i$-th constituent block, and $\mu_i^{BN}$ and $\left(\sigma_i^2\right)^{BN}$ are, respectively, the running mean and variance stored in the BN layer of the corresponding pre-training model; the expression $\|\cdot\|_2$ denotes taking the difference of the two vectors and then the square root of the sum of the squares of its components.
Step S250: determine, based on the initialization label of each piece of data in each batch of the first distillation data, the target cross-entropy loss of each batch of the first distillation data in each of the pre-training models;
Here, the target cross-entropy loss is the total loss output by the last layer of the pre-training model.
The target cross-entropy loss L_CE of a batch of the first distillation data in each of the pre-training models can be determined by the following formula (2):
$$L_{CE} = \frac{1}{m} \sum_{i=1}^{m} \mathrm{CE}\left( \hat{y}_i, y_i \right) \qquad (2)$$

where $m$ is the number of pieces of data in a batch of first distillation data, i.e., the batch size, $y_i$ is the initialization label of each piece of data, and $\hat{y}_i$ is the predicted label output by the pre-training model.
It is worth noting that, for the mixed data in a batch of first distillation data, namely a composite image obtained by mixing two images, the label of the composite image is a mixed label determined based on the labels of the two original images.
Step S260: determine, based on the batch normalization statistical loss and the target cross-entropy loss of each of the pre-training models, the first loss corresponding to that pre-training model;
Here, for each pre-training model, the computed batch normalization statistical loss and target cross-entropy loss are linearly summed to obtain the first loss corresponding to that pre-training model.
Step S270: average the first losses corresponding to the respective pre-training models, to obtain the target loss of each batch of the first distillation data over the at least two pre-training models;
Here, the first losses corresponding to all the pre-training models are first summed, then averaged, and then minimized, giving the target loss of each batch of the first distillation data over the at least two pre-training models.
Exemplarily, considering a model cluster M = {A_1, A_2, ..., A_m} containing at least two pre-training models, the above target loss, which aims to optimize each batch of first distillation data by combining feature mixing and data mixing, can be written as the following formula (3):
$$\min_{\tilde{X}} \; \frac{1}{m} \sum_{i=1}^{m} \left( \lambda_1 \, L_{BN}^{A_i}\!\left(\tilde{X}\right) + \lambda_2 \, L_{CE}^{A_i}\!\left(\tilde{X}\right) \right) \qquad (3)$$

where $\tilde{X}$ denotes each batch of first distillation data, $L_{BN}^{A_i}(\tilde{X})$ is the batch normalization statistical loss of the batch of first distillation data through pre-training model $A_i$, $L_{CE}^{A_i}(\tilde{X})$ is the cross-entropy loss of the batch of first distillation data through pre-training model $A_i$, $\lambda_1$ and $\lambda_2$ are coefficients that are generally set according to the actual situation, and $m$ is the number of pre-training models.
In some embodiments, a prior loss may also be applied to the first distillation data to ensure that the images are generally smooth, where the prior loss is defined as the mean squared error between the first distillation data and its Gaussian-blurred version. Accordingly, the batch normalization statistical loss, the target cross-entropy loss, and the prior loss are combined to determine the final minimization objective as the target loss.
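A hedged sketch of the combined objective of formula (3), reusing bn_statistics_loss from the earlier sketch and assuming PyTorch; the coefficient values are illustrative, and the smoothness prior is approximated here with a simple average-pooling blur rather than a true Gaussian blur:

```python
# Minimal sketch of the target loss: average, over the sampled pre-training models,
# of the weighted BN statistical loss and cross-entropy loss, plus an optional prior.
import torch
import torch.nn.functional as F

def target_loss(ensemble, batch, init_labels, lam1=1.0, lam2=0.1, lam3=0.01):
    per_model = []
    for model in ensemble:
        logits = model(batch)
        ce = F.cross_entropy(logits, init_labels)
        bn = bn_statistics_loss(model, batch)        # defined in the earlier sketch
        per_model.append(lam1 * bn + lam2 * ce)
    loss = torch.stack(per_model).mean()             # average over the model cluster

    # crude smoothness prior: distance to a blurred copy of the batch (approximation)
    blurred = F.avg_pool2d(batch, kernel_size=3, stride=1, padding=1)
    return loss + lam3 * F.mse_loss(batch, blurred)
```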
Here, by matching the batch normalization statistical losses and target cross-entropy losses between the first distillation data and the features of the multiple pre-training models, the target loss of the first distillation data over the at least two pre-training models is determined. In this way, determining the target loss combines the general feature spaces of the pre-training models, so that the trained target distillation data is more general than data distilled from a single model.
Step S280: perform back-propagation training on each batch of the first distillation data based on the target loss, to obtain the target distillation data.
Here, the optimization of each batch of first distillation data uses the stochastic gradient descent (SGD) method, and each batch of data goes through twenty thousand iterations from initialization to the end of training. Stochastic gradient descent is a simple but very effective method, mostly used for learning linear classifiers under convex loss functions such as support vector machines and logistic regression; each time, only one piece of data in one dimension is randomly selected to compute the gradient, and the result is used as the step size of gradient descent in that dimension.
In some embodiments, step S280 can be implemented through the following process: in each training iteration, the gradient of the target loss with respect to the first distillation data is determined; based on the gradient, each piece of data in the first distillation data is updated; and when the batch normalization statistical loss and the target cross-entropy loss of the updated first distillation data in each of the pre-training models converge to stable values, the target distillation data is obtained. In this way, the first distillation data is optimized by stochastic gradient descent during back-propagation, which speeds up the iterative training of each batch of first distillation data.
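A hedged sketch of this optimization loop, assuming PyTorch and reusing sample_pretrained_models and target_loss from the sketches above; the learning rate is illustrative, and only the pixels of the first distillation data are updated while the pre-training models stay frozen:

```python
# Minimal sketch of back-propagation training of a batch of first distillation data.
import torch

batch = torch.randn(16, 3, 224, 224, requires_grad=True)    # first distillation data
init_labels = torch.randint(0, 1000, (16,))
ensemble = sample_pretrained_models(k=3)                     # from the earlier sketch
optimizer = torch.optim.SGD([batch], lr=0.05, momentum=0.9)  # the pixels are the parameters

for step in range(20000):                                    # roughly 20,000 iterations, per the text
    optimizer.zero_grad()
    loss = target_loss(ensemble, batch, init_labels)         # from the earlier sketch
    loss.backward()                                          # gradient w.r.t. the distillation data
    optimizer.step()

target_distillation_data = batch.detach()
```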
In the embodiments of the present disclosure, for each batch of first distillation data to be trained, at least two pre-training models of different types are randomly sampled from the pre-training library, and the first distillation data is matched against the features of multiple pre-training models simultaneously. In this way, the general feature spaces of the pre-training models are combined, so that the distilled data matches the feature distribution of any pre-training model, and a better balance between training speed and effectiveness can be obtained.
FIG. 3 is a schematic flowchart of a data distillation method provided by an embodiment of the present disclosure. As shown in FIG. 3, the method includes at least the following steps:
Step S310: determine at least one batch of second distillation data based on the distribution information of the original data;
Here, the original data may be images from at least one of the following general data sets: the ImageNet data set, the ResNet50 data set, the MobileNet V2 data set, and the like.
The distribution information of the original data may be a normal distribution, also known as a Gaussian distribution. The distribution curve is bell-shaped, i.e., low at both ends and high in the middle, and symmetric about its center. If a random variable X follows a normal distribution with mathematical expectation μ and variance σ^2, it is denoted N(μ, σ^2). For its probability density function, the expected value μ of the normal distribution determines its location, and its standard deviation σ determines the spread of the distribution. The normal distribution with μ = 0 and σ = 1 is the standard normal distribution.
Here, the distribution information is Gaussian distribution information, including the mean, variance, and so on of the original data. For visible-light (RGB) images, the mean and standard deviation of the three color channels are different; for example, the per-channel mean of the images in the ImageNet data set is [0.485, 0.456, 0.406] and the variance is [0.229, 0.224, 0.225].
In some embodiments, the above step S310 can be implemented through the following process: obtaining the Gaussian distribution information of the original data; and initializing at least N initial pixel value matrices based on data randomly sampled from the Gaussian distribution information, to obtain each batch of second distillation data, where N is an integer greater than or equal to 2. In this way, data is randomly sampled from the Gaussian distribution information of the original data and initialized to obtain the second distillation data, which compensates for the inability to obtain the original data directly.
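A minimal sketch of this initialization, assuming PyTorch; the per-channel statistics quoted above are used as the scale and shift of the Gaussian samples, and the batch size and resolution are illustrative:

```python
# Minimal sketch: initialize a batch of second distillation data by sampling from
# the per-channel Gaussian statistics of the original data (ImageNet values above).
import torch

channel_mean = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
channel_scale = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)

def init_second_distillation_data(n=16, height=224, width=224):
    noise = torch.randn(n, 3, height, width)          # standard normal samples
    data = noise * channel_scale + channel_mean       # match the original data statistics
    return data.requires_grad_(True)

second_distillation_data = init_second_distillation_data()
```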
步骤S320,确定至少两个预训练模型;Step S320, determine at least two pre-training models;
这里,每一所述预训练模型中存储所述原始数据的第一统计信息。Here, the first statistical information of the original data is stored in each pre-training model.
步骤S330,在每一次迭代训练中,对每一批第二蒸馏数据中每两个图像数据进行混合,得到一批第一蒸馏数据;Step S330, in each iterative training, mix every two image data in each batch of second distillation data to obtain a batch of first distillation data;
这里,在每一迭代训练中,随机选取任意两个图像数据进行数据混合,使得混合后的图像包含两个图像数据的信息,从而生成的目标蒸馏数据更加鲁棒并且拥有更多视觉信息,能够适应不同尺度和空间方位。因此生成的目标蒸馏数据用于模型压缩更加有效。Here, in each iterative training, any two image data are randomly selected for data mixing, so that the mixed image contains the information of the two image data, so that the generated target distillation data is more robust and has more visual information, which can Adapt to different scales and spatial orientations. Therefore, the generated target distillation data is more effective for model compression.
值得注意的是,上述步骤S320和步骤S330的执行顺序不限定,可以先执行步骤S330,也可以先执行步骤S320,也可以同时执行。It should be noted that, the execution order of the above step S320 and step S330 is not limited, step S330 may be executed first, step S320 may be executed first, or may be executed at the same time.
步骤S340,基于每一所述预训练模型中的第一统计信息,确定每一批所述第一蒸馏数据在相应预训练模型中的批归一化统计损失;Step S340, based on the first statistical information in each pre-training model, determine the batch normalized statistical loss of each batch of the first distillation data in the corresponding pre-training model;
步骤S350,基于每一批所述第一蒸馏数据中每一数据的初始化标签,确定每一批所述第一蒸馏数据在每一所述预训练模型中的目标交叉熵损失;Step S350, based on the initialization label of each data in each batch of the first distillation data, determine the target cross-entropy loss of each batch of the first distillation data in each of the pre-training models;
这里,所述每一批第一蒸馏数据中包括由第一图像和第二图像混合得到的合成图像。也就是说,从一批第二蒸馏数据中随机选择一对第一图像和第二图像进行混合得到合成图像,其中第一图像和第二图像分别有各自对应的初始化标签。Here, each batch of first distillation data includes a composite image obtained by mixing the first image and the second image. That is to say, a pair of the first image and the second image is randomly selected from a batch of second distillation data and mixed to obtain a composite image, wherein the first image and the second image respectively have corresponding initialization labels.
可以通过以下步骤确定每一批所述第一蒸馏数据在每一所述预训练模型中的目标交叉熵损失:The target cross-entropy loss of each batch of the first distillation data in each of the pre-training models can be determined by the following steps:
步骤S3501,基于所述第一图像的初始化标签和所述第二图像的初始化标签,确定所述合成图像的混合交叉熵损失;Step S3501, based on the initialization label of the first image and the initialization label of the second image, determine the hybrid cross-entropy loss of the composite image;
这里,通过第一图像和第二图像各自的初始化标签确定合成图像的混合交叉熵损失,可以确保原始第一图像和第二图像的信息能够被预训练模型识别并正确优化。Here, the hybrid cross-entropy loss of the synthetic image is determined by the respective initialization labels of the first image and the second image, which can ensure that the information of the original first image and the second image can be recognized and correctly optimized by the pre-trained model.
步骤S3502,基于所述每一批第一蒸馏数据中除所述第一图像和所述第二图像之外的其他图像的初始化标签,确定所述其他图像的累计交叉熵损失;Step S3502, based on the initialization labels of other images in each batch of first distillation data except the first image and the second image, determine the cumulative cross-entropy loss of the other images;
这里,计算累计交叉熵损失的过程可以参照公式(2),只是其中变量i的取值为除所述第一图像和所述第二图像之外的其他图像。Here, the process of calculating the cumulative cross-entropy loss may refer to formula (2), except that the value of the variable i is other images than the first image and the second image.
步骤S3503，基于所述混合交叉熵损失和所述累计交叉熵损失，确定每一批所述第一蒸馏数据经过每一所述预训练模型后的目标交叉熵损失。Step S3503, based on the mixed cross-entropy loss and the cumulative cross-entropy loss, determine the target cross-entropy loss of each batch of the first distillation data after passing through each of the pre-trained models.
这里，对一批第一蒸馏数据中合成图像的混合交叉熵损失和其他图像的累计交叉熵损失求和后最小化，得到每一批所述第一蒸馏数据经过每一所述预训练模型后的目标交叉熵损失。Here, the mixed cross-entropy loss of the synthetic image and the cumulative cross-entropy loss of the other images in a batch of first distillation data are summed and then minimized, giving the target cross-entropy loss of each batch of the first distillation data after passing through each of the pre-trained models.
在一些实施方式中，上述步骤S3501可以通过以下过程实现：基于所述第一图像的初始化标签，确定每一批所述第一蒸馏数据通过每一所述预训练模型后的第一交叉熵损失；基于所述第二图像的初始化标签，确定每一批所述第一蒸馏数据通过每一所述预训练模型后的第二交叉熵损失；基于第一图像与所述第二图像之间的合成比例，对所述第一交叉熵损失和所述第二交叉熵损失进行线性求和，得到所述混合交叉熵损失。在实施中，可以通过以下公式(4)确定上述混合交叉熵损失：In some implementations, the above step S3501 can be implemented through the following process: based on the initialization label of the first image, determining the first cross-entropy loss after each batch of the first distillation data passes through each of the pre-training models; based on the initialization label of the second image, determining the second cross-entropy loss after each batch of the first distillation data passes through each of the pre-training models; and, based on the composition ratio between the first image and the second image, linearly summing the first cross-entropy loss and the second cross-entropy loss to obtain the mixed cross-entropy loss. In an implementation, the mixed cross-entropy loss can be determined by the following formula (4):

$$\mathcal{L}_{CE}^{mix}\big(p(\tilde{X})\big)=\beta\,\mathcal{L}_{CE}\big(p(\tilde{X}),Y_{2}\big)+(1-\beta)\,\mathcal{L}_{CE}\big(p(\tilde{X}),Y_{1}\big)\qquad(4)$$

其中，$\tilde{X}$为混合后的第一蒸馏数据，$\mathcal{L}_{CE}^{mix}(p(\tilde{X}))$为$\tilde{X}$经过某个预训练模型的混合交叉熵损失，$Y_2$和$Y_1$分别为混合前第一图像和第二图像的初始化标签，$\mathcal{L}_{CE}(p(\tilde{X}),Y_2)$和$\mathcal{L}_{CE}(p(\tilde{X}),Y_1)$分别为第一图像对应的第一交叉熵损失和第二图像对应的第二交叉熵损失，β为混合区域的面积与第二图像的面积之间的比例系数。Here, $\tilde{X}$ is the first distillation data after mixing, $\mathcal{L}_{CE}^{mix}(p(\tilde{X}))$ is the mixed cross-entropy loss of $\tilde{X}$ after passing through a certain pre-trained model, $Y_2$ and $Y_1$ are the initialization labels of the first image and the second image before mixing respectively, $\mathcal{L}_{CE}(p(\tilde{X}),Y_2)$ and $\mathcal{L}_{CE}(p(\tilde{X}),Y_1)$ are the first cross-entropy loss corresponding to the first image and the second cross-entropy loss corresponding to the second image respectively, and β is the proportional coefficient between the area of the mixed region and the area of the second image.
这样,通过使用混合数据前各图像的初始化标签确定混合交叉熵损失,进而结合其他图像的累加交叉熵损失确定目标交叉熵损失,有助于在训练时生成仍然具有辨别力的鲁棒特性,同时能够防止不正确的反演解,合成具有精确标签信息的目标蒸馏数据。In this way, by using the initialization labels of each image before mixing data to determine the mixed cross-entropy loss, and then combined with the accumulated cross-entropy loss of other images to determine the target cross-entropy loss, it helps to generate robust features that are still discriminative during training, while It can prevent incorrect inversion solutions and synthesize target distillation data with accurate label information.
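A sketch of how the mixed cross-entropy loss of formula (4) could be evaluated; `model`, the label tensors and `beta` are placeholder names, and the weighting follows the linear combination described above:

```python
import torch.nn.functional as F

def mixed_cross_entropy(model, mixed_images, y_first, y_second, beta):
    """Mixed CE loss of formula (4): a beta-weighted sum of the two images' losses.

    mixed_images: the synthetic image(s) obtained by mixing a first and a second image.
    y_first / y_second: initialization labels of the first / second image (Y_2 / Y_1).
    beta: ratio between the area of the mixed region and the area of the second image.
    """
    logits = model(mixed_images)
    loss_first = F.cross_entropy(logits, y_first)     # first cross-entropy loss
    loss_second = F.cross_entropy(logits, y_second)   # second cross-entropy loss
    return beta * loss_first + (1.0 - beta) * loss_second
```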
步骤S360,基于每一所述预训练模型的批归一化统计损失和所述目标交叉熵损失,对每一批所述第一蒸馏数据进行反向传播训练,得到目标蒸馏数据。Step S360, based on the batch normalized statistical loss and the target cross-entropy loss of each pre-trained model, perform backpropagation training on each batch of the first distillation data to obtain target distillation data.
这里，通过结合目标交叉熵损失和批归一化统计损失得到一批第一蒸馏数据针对每一预训练模型的第一损失，进而对每一所述预训练模型对应的第一损失进行线性整合，可以平均每个预训练模型产生的特征偏差，从而使得最终得到的目标蒸馏数据更接近真实数据且更通用。Here, the target cross-entropy loss and the batch normalization statistical loss are combined to obtain the first loss of a batch of first distillation data for each pre-training model, and the first losses corresponding to the pre-training models are then linearly integrated. This averages the feature deviation produced by each pre-training model, so that the resulting target distillation data is closer to the real data and more general.
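A sketch of this linear integration across the sampled pre-trained models; `ce_loss_fn` and `bn_loss_fn` are placeholder callables for the target cross-entropy loss and the batch normalization statistical loss, and the simple mean follows the averaging described above:

```python
def target_loss(models, distill_batch, ce_loss_fn, bn_loss_fn):
    """Average the per-model first losses over all sampled pre-trained models."""
    per_model_losses = []
    for model in models:
        # First loss of this batch for one pre-trained model.
        first_loss = ce_loss_fn(model, distill_batch) + bn_loss_fn(model, distill_batch)
        per_model_losses.append(first_loss)
    return sum(per_model_losses) / len(per_model_losses)   # linear integration (mean)
```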
这里,由于利用多个预训练模型的特征,使得生成的目标蒸馏数据可以匹配多个模型的特征分布,同时通过数据混合使得目标蒸馏数据更加鲁棒且拥有更多视觉信息。在得到目标蒸馏数据之后,可以直接用于对预训练模型或其他未识别的模型进行压缩。Here, due to the use of the features of multiple pre-trained models, the generated target distillation data can match the feature distribution of multiple models, and at the same time, the target distillation data is more robust and has more visual information through data mixing. After the target distillation data is obtained, it can be directly used to compress the pre-trained model or other unidentified models.
在一些实施方式中,在至少一批所述目标蒸馏数据的数量达到特定阈值的情况下,对每一所述预训练模型进行模型压缩。这样,由于生成的目标蒸馏数据拥有更多视觉信息且通用,可以直接用于模型压缩过程,简化压缩流程。In some implementations, when the quantity of at least one batch of the target distillation data reaches a specific threshold, model compression is performed on each of the pre-trained models. In this way, since the generated target distillation data has more visual information and is universal, it can be directly used in the model compression process, simplifying the compression process.
在一些实施方式中,在一个新模型压缩需要提出的情况下,可以直接利用已生成的目标蒸馏数据进行压缩,无需再进行蒸馏数据的生成过程。本公开实施例生成的目标蒸馏数据足够准确,只需要生成一次目标蒸馏数据集(包括足够量的目标蒸馏数据)就可以适用所有预训练模型以及泛化到未识别的模型中。In some implementations, when a new model compression needs to be proposed, the generated target distillation data can be directly used for compression, without further performing the distillation data generation process. The target distillation data generated by the embodiment of the present disclosure is accurate enough, and it only needs to generate a target distillation data set (including a sufficient amount of target distillation data) once to apply to all pre-trained models and generalize to unidentified models.
在一些实施方式中，所述模型压缩操作可以包括网络量化(Network Quantization)、网络剪枝(Network Pruning)、知识蒸馏(Knowledge Distillation)等。剪枝技术过程可以基于目标蒸馏数据集完成剪枝决策和剪枝后重建；模型量化过程可以基于目标蒸馏数据集来完成训练中量化过程或者基于目标蒸馏数据集实现后训练量化的校准过程；知识蒸馏过程则可以将目标蒸馏数据集分别送入教师网络和学生网络完成知识迁移的过程。In some implementations, the model compression operation may include network quantization, network pruning, knowledge distillation, and the like. The pruning process can complete the pruning decision and post-pruning reconstruction based on the target distillation data set; the model quantization process can complete quantization during training based on the target distillation data set, or perform the calibration of post-training quantization based on the target distillation data set; and the knowledge distillation process can feed the target distillation data set into the teacher network and the student network respectively to complete the knowledge transfer process.
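As one illustration of the knowledge distillation use case, the target distillation data set can be fed to a teacher network and a student network; the KL-divergence objective and temperature below are common choices for knowledge transfer, not requirements of the present disclosure:

```python
import torch
import torch.nn.functional as F

def kd_step(teacher, student, distilled_batch, temperature=4.0):
    """One knowledge-transfer step driven purely by target distillation data."""
    with torch.no_grad():
        teacher_logits = teacher(distilled_batch)
    student_logits = student(distilled_batch)
    # Soft-label distillation loss between teacher and student predictions.
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean",
    ) * temperature ** 2
    return loss
```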
在本公开实施例中,通过在每一次迭代训练中对一批第一蒸馏数据进行随机数据混合,可以缩小反演解空间,同时使得训练出来的目标蒸馏数据的视觉信息更加鲁棒,能够适应不同尺度和空间方位。同时结合批归一化统计损失和目标交叉熵损失确定一批第一蒸馏数据通过多个预训练模型的目标损失。从而使得最终得到的目标蒸馏数据更接近真实数据。In the embodiment of the present disclosure, by randomly mixing a batch of first distillation data in each iterative training, the inversion solution space can be reduced, and at the same time, the visual information of the trained target distillation data is more robust and adaptable to Different scales and spatial orientations. Simultaneously combine the batch normalized statistical loss and the target cross-entropy loss to determine the target loss of a batch of first-distilled data through multiple pre-trained models. So that the final target distillation data is closer to the real data.
基于图3，图4A为本公开实施例提供的数据蒸馏的方法的流程示意图，如图4A所示，上述步骤S330“在每一次迭代训练中，对每一批第二蒸馏数据中每两个图像数据进行混合，得到一批第一蒸馏数据”可以通过以下步骤实现：Based on Fig. 3, Fig. 4A is a schematic flowchart of the data distillation method provided by an embodiment of the present disclosure. As shown in Fig. 4A, the above step S330 of "in each iterative training, mixing every two image data in each batch of the second distillation data to obtain a batch of first distillation data" can be implemented through the following steps:
步骤S410,从每一批所述第二蒸馏数据中随机选取至少一对第一图像和第二图像;Step S410, randomly selecting at least one pair of the first image and the second image from each batch of the second distillation data;
这里,每一对第一图像和第二图像是任意的两个第二蒸馏数据。因为在每轮迭代过程中,第一图像和第二图像都是随机选取并混合的,因此经过训练得到的目标蒸馏数据的视觉信息会更鲁棒,能够适应不同尺度和空间方位。Here, each pair of the first image and the second image is any two second distillation data. Because the first image and the second image are randomly selected and mixed during each round of iteration, the visual information of the target distillation data obtained after training will be more robust and able to adapt to different scales and spatial orientations.
步骤S420,将每一所述第一图像的尺寸按照特定比例缩小;Step S420, reducing the size of each of the first images according to a specific ratio;
这里,所述特定比例为第二图像上随机覆盖区域的边界框与原始图像数据的尺寸大小之间的比值。Here, the specific ratio is a ratio between the bounding box of the random coverage area on the second image and the size of the original image data.
步骤S430,将缩小后的所述第一图像随机覆盖到对应的所述第二图像中,得到每一批所述第一蒸馏数据。In step S430, the reduced first image is randomly overlaid on the corresponding second image to obtain each batch of the first distillation data.
这里,通过将缩小后的第一图像覆盖的第二图像的任意区域,使得覆盖后的合成图像包括两个图像数据的信息。Here, by covering any area of the second image with the reduced first image, the overlaid composite image includes information of two image data.
如图4B所示,将第一图像41进行等比例缩小到固定尺寸,然后将缩小后的第一图像42随机覆盖到第二图像43上。此时覆盖后的合成图像44将包含第一图像41和第二图像43的所有信息。As shown in FIG. 4B , the first image 41 is proportionally reduced to a fixed size, and then the reduced first image 42 is randomly overlaid on the second image 43 . At this time, the overlaid composite image 44 will contain all the information of the first image 41 and the second image 43 .
在一些实施方式中,所述步骤S430可以通过以下过程实现:按照所述特定比例,在对应的所述第二图像中随机生成待覆盖的混合区域;基于所述混合区域的边界框,随机生成二进制掩码;通过所述二进制掩码对缩小后的所述第一图像的每一像素值和所述第二图像中对应的像素值进行叠加,得到每一批所述第一蒸馏数据。在实施中,通过以下公式(5)和(6)实现任意两个图像数据的混合:In some implementation manners, the step S430 can be implemented through the following process: according to the specific ratio, randomly generate a mixed area to be covered in the corresponding second image; based on the bounding box of the mixed area, randomly generate A binary mask: superimposing each pixel value of the reduced first image with a corresponding pixel value in the second image by using the binary mask to obtain each batch of the first distillation data. In implementation, the mixing of any two image data is realized by the following formulas (5) and (6):
$$\alpha_{ij}=\begin{cases}1,&(i,j)\ \text{inside the bounding box}\\[2pt]0,&(i,j)\ \text{outside the bounding box}\end{cases}\qquad(5)$$

其中，$x_l$、$x_r$、$y_d$、$y_u$依次是混合区域的边界框的左框、右框、上框、下框的边界，$\alpha_{ij}$为二进制掩码中的元素：如果位置$(i,j)$位于边界框中则取值为1，如果位于边界框之外则取值为0。Here, $x_l$, $x_r$, $y_d$, $y_u$ are, in order, the left, right, top and bottom boundaries of the bounding box of the mixed region, and $\alpha_{ij}$ is an element of the binary mask: it takes the value 1 if position $(i,j)$ lies inside the bounding box, and 0 if it lies outside the bounding box.

$$\tilde{X}=(1-\alpha)\odot X_{1}+\alpha\odot g(X_{2})\qquad(6)$$

其中，$X_2$、$X_1$分别是待混合的第一图像和第二图像，$\tilde{X}$为将缩小后的第一图像覆盖到第二图像上得到的合成图像，$g(\cdot)$是一个线性插值函数，可以将第一图像$X_2$调整为与边界框相同的大小并置于边界框所在区域，$\odot$表示逐元素相乘。Here, $X_2$ and $X_1$ are the first image and the second image to be mixed respectively, $\tilde{X}$ is the composite image obtained by overlaying the reduced first image onto the second image, $g(\cdot)$ is a linear interpolation function that resizes the first image $X_2$ to the same size as the bounding box (placed within the bounding-box region), and $\odot$ denotes element-wise multiplication.
这样，通过随机生成二进制掩码叠加缩小后的第一图像的像素值和第二图像的像素值，实现一批第二蒸馏数据中随机的两张图像的混合，使得混合数据即一批第一蒸馏数据经过预训练模型训练优化时仍然具有辨别力。In this way, by randomly generating a binary mask and superimposing the pixel values of the reduced first image and the pixel values of the second image, two randomly chosen images in a batch of second distillation data are mixed, so that the mixed data, i.e. a batch of first distillation data, remains discriminative when optimized through training with the pre-trained models.
在本公开实施例中,在数据混合中将第一图像进行等比例缩小并随机覆盖到第二图像中,此时混合后的合成图像将包含第一图像和第二图像的信息。在训练一批第一蒸馏数据的过程中,确保两个混合的图像信息能够被预训练模型识别并正确优化。In the embodiment of the present disclosure, the first image is scaled down and randomly overlaid into the second image during data mixing, and the mixed composite image will contain information of the first image and the second image. In the process of training a batch of first-distilled data, it is ensured that the two mixed image information can be recognized and optimized correctly by the pre-trained model.
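A sketch of this mixing operation, corresponding to formulas (5) and (6): the first image is resized by linear interpolation and pasted into a randomly placed bounding box of the second image. The image shapes, the scale factor and the box sampling are illustrative assumptions:

```python
import random
import torch
import torch.nn.functional as F

def mix_pair(x_first, x_second, scale=0.5):
    """Overlay a downscaled copy of x_first onto a random region of x_second.

    x_first, x_second: tensors of shape (3, H, W). Returns the composite image
    and beta, the area ratio of the mixed region to the second image.
    """
    _, h, w = x_second.shape
    bh, bw = int(h * scale), int(w * scale)                # bounding-box size
    top = random.randint(0, h - bh)                        # random box position
    left = random.randint(0, w - bw)
    # g(.): linear interpolation resizing the first image to the box size.
    patch = F.interpolate(x_first.unsqueeze(0), size=(bh, bw),
                          mode="bilinear", align_corners=False).squeeze(0)
    alpha = torch.zeros(1, h, w)                           # binary mask of formula (5)
    alpha[:, top:top + bh, left:left + bw] = 1.0
    padded = torch.zeros_like(x_second)                    # resized first image placed in the box
    padded[:, top:top + bh, left:left + bw] = patch
    mixed = (1.0 - alpha) * x_second + alpha * padded      # composite image of formula (6)
    beta = (bh * bw) / (h * w)
    return mixed, beta
```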
下面结合一个具体实施例对上述数据蒸馏的方法进行说明,然而值得注意的是,该具体实施例仅是为了更好地说明本公开,并不构成对本公开的不当限定。The above data distillation method will be described below in conjunction with a specific embodiment, however, it should be noted that this specific embodiment is only for better illustrating the present disclosure, and does not constitute an improper limitation of the present disclosure.
图5A为本公开实施例提供的数据蒸馏的方法的逻辑流程图,如图5A所示,所述方法至少包括以下步骤:FIG. 5A is a logic flow diagram of a data distillation method provided by an embodiment of the present disclosure. As shown in FIG. 5A, the method includes at least the following steps:
步骤S510,用原始数据的高斯分布初始化一批蒸馏数据;Step S510, using the Gaussian distribution of the original data to initialize a batch of distillation data;
这里，蒸馏过程需要先使用高斯分布初始化一批蒸馏数据(相当于第二蒸馏数据)，高斯分布采用原始数据的统计值对一批蒸馏数据进行初始化。Here, the distillation process needs to first initialize a batch of distillation data (equivalent to the second distillation data) with a Gaussian distribution; the Gaussian distribution uses the statistical values of the original data to initialize the batch of distillation data.
在实施中,首先获取原始训练数据的分布信息,然后利用高斯分布的统计值对一批蒸馏数据进行初始化。In the implementation, the distribution information of the original training data is obtained first, and then a batch of distilled data is initialized with the statistical value of the Gaussian distribution.
步骤S520,从预训练模型簇里面随机采样三个预训练模型;Step S520, randomly sampling three pre-training models from the pre-training model cluster;
步骤S530,使用数据混合和特征混合,通过三个预训练模型训练一批蒸馏数据;Step S530, using data mixing and feature mixing to train a batch of distillation data through three pre-trained models;
这里,本公开实施例采用随机梯度下降方法来训练混合后的一批蒸馏数据(相当于第一蒸馏数据)。Here, the embodiment of the present disclosure adopts a stochastic gradient descent method to train a batch of mixed distillation data (equivalent to the first distillation data).
步骤S540,重复上述训练过程,直到生成指定量的目标蒸馏数据。Step S540, repeating the above training process until a specified amount of target distillation data is generated.
这里,若完成所有目标蒸馏数据的生成则退出,反之则重复步骤S520至步骤S530。Here, if the generation of all target distillation data is completed, exit, otherwise, repeat steps S520 to S530.
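Putting the steps of Fig. 5A together, a highly simplified outer loop might look as follows; the model pool, batch size, iteration count and the helper callables are assumptions carried over from the earlier sketches rather than part of the disclosed method:

```python
import random
import torch

def generate_distillation_dataset(model_pool, init_batch_fn, mix_fn, loss_fn,
                                  num_batches, iters_per_batch=20000, lr=0.25):
    """Repeat the per-batch training process until the specified amount of
    target distillation data has been generated (steps S510-S540)."""
    dataset = []
    for _ in range(num_batches):
        batch = init_batch_fn()                                 # S510: Gaussian initialization
        models = random.sample(model_pool, 3)                   # S520: sample three models
        for _ in range(iters_per_batch):                        # S530: train this batch
            mixed_batch, labels, beta = mix_fn(batch)           # data mixing
            loss = loss_fn(models, mixed_batch, labels, beta)   # feature mixing (target loss)
            grad = torch.autograd.grad(loss, batch)[0]
            with torch.no_grad():
                batch -= lr * grad                              # gradient-descent update
        dataset.append(batch.detach())                          # S540: keep the trained batch
    return dataset
```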
由于生成的目标蒸馏数据足够准确且通用,本公开实施例生成的目标蒸馏数据甚至可以泛化到未识别过的模型中。因此,当一个新的模型压缩需求提出时,可以直接用已生成的目标蒸馏数据进行压缩。Since the generated target distillation data is sufficiently accurate and general, the target distillation data generated in the embodiments of the present disclosure can even be generalized to unidentified models. Therefore, when a new model compression requirement is proposed, the generated target distillation data can be directly used for compression.
图5B为本公开实施例提供的结合数据混合和特征混合的算法框架,如图5B所示,该框架中的实线箭头体现了前向训练流向,虚线箭头体现了后向反演过程的流向。训练流程中会有数据混合51和特征混合52两大流程。其中数据混合51用于生成鲁棒的视觉信息,特征混合52用于生成通用可泛化的图片。FIG. 5B is an algorithm framework combining data mixing and feature mixing provided by an embodiment of the present disclosure. As shown in FIG. 5B , the solid arrows in the framework reflect the forward training flow, and the dotted arrows reflect the backward inversion process flow. . There are two major processes of data mixing 51 and feature mixing 52 in the training process. Among them, the data mixture 51 is used to generate robust visual information, and the feature mixture 52 is used to generate generalizable images.
在前向训练过程中,首先进行数据混合51流程,从一批蒸馏数据中随机选取第一图像501和第二图像502进行混合,得到合成图像503。然后进行特征混合52流程,将包括合成图像503在内的一批蒸馏数据分别输入到预训练模型504、预训练模型505和预训练模型506(示例的三个模型,但不限定,可以为多个)中,计算经过三个预训练模型之后的目标交叉熵损失507和批归一化统计损失508。其中,每一预训练模型的每个组成块中均包括激活层61、批归一化层62和卷积层63,批归一化层62中统计了一批蒸馏数据的均值和方差(μ,σ2),可以将每一预训练模型的所有组成块中存储的均值和方差,提取出来确定上述批归一化统计损失508。In the forward training process, the data mixing 51 process is firstly performed, and the first image 501 and the second image 502 are randomly selected from a batch of distillation data for mixing to obtain a composite image 503 . Then carry out the process of feature mixing 52, and input a batch of distillation data including the composite image 503 into the pre-training model 504, the pre-training model 505 and the pre-training model 506 respectively (the three models of the example, but not limited, can be multiple ), calculate the target cross-entropy loss 507 and the batch normalization statistical loss 508 after three pre-trained models. Wherein, each component block of each pre-training model includes an activation layer 61, a batch normalization layer 62 and a convolutional layer 63, and the batch normalization layer 62 counts the mean and variance (μ , σ2), the mean and variance stored in all the blocks of each pre-training model can be extracted to determine the above-mentioned batch normalization statistical loss 508 .
在后向反演过程中,利用上一次迭代训练过程确定的目标交叉熵损失507和批归一化统计损失508进一步确定目标损失,以利用目标损失优化合成图像503。由于预训练模型已经捕获了蒸馏数据的图像类别信息,因此可以通过为合成图像503分配混合标签来反演知识。In the backward inversion process, the target loss is further determined using the target cross-entropy loss 507 and the batch normalization statistical loss 508 determined in the previous iterative training process, so as to optimize the composite image 503 using the target loss. Since the pre-trained model already captures the image category information of the distilled data, knowledge can be retrieved by assigning mixed labels to the synthesized images 503 .
如图5C所示,该前向训练流程包括以下步骤:As shown in Figure 5C, the forward training process includes the following steps:
步骤S5301,将一批蒸馏数据进行随机数据混合;Step S5301, performing random data mixing on a batch of distillation data;
在数据混合中,从初始化后的一批蒸馏数据中任取两个图像数据进行随机混合。此时合成图像将包含两个图像数据的信息。在训练混合后的一批蒸馏数据的过程中,确保两个图像数据的信息能够被模型识别并正确优化。In data mixing, two image data are randomly selected from the initialized batch of distilled data for random mixing. At this time, the composite image will contain the information of the two image data. In the process of training the mixed batch of distilled data, it is ensured that the information of the two image data can be recognized by the model and optimized correctly.
步骤S5302,将混合后的一批蒸馏数据分别输入三个预训练模型中,确定当前统计量与模型存储的原始统计量之间的差异;Step S5302, input the mixed batch of distillation data into the three pre-training models respectively, and determine the difference between the current statistic and the original statistic stored in the model;
这里,当前统计量为一批蒸馏数据的第二统计信息,原始统计量为模型中存储的原始数据的第一统计信息。Here, the current statistic is the second statistic of a batch of distillation data, and the original statistic is the first statistic of the original data stored in the model.
在特征混合中,计算一批蒸馏数据在预训练模型中的统计信息。预训练模型的每一层中都存储了用于原始数据的第一统计信息即均值和方差。当输入一批蒸馏数据时,如果一批蒸馏数据在预训练模型中的均值方差和原始数据的均值方差相差不大,那么认为一批蒸馏数据和原始数据已经非常相似。若相差仍然很大,则需要继续最小化它们之间的误差。In Feature Mixing, compute statistics on a batch of distilled data in a pretrained model. In each layer of the pretrained model, the first statistics for the original data, namely the mean and variance, are stored. When a batch of distillation data is input, if the mean variance of a batch of distillation data in the pre-training model is not much different from the mean variance of the original data, then the batch of distillation data is considered to be very similar to the original data. If the difference is still large, you need to continue to minimize the error between them.
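One common way to realize this statistics matching (an assumption, not necessarily the exact implementation of the present disclosure) is to read the running mean and variance stored in every batch normalization layer and compare them with the statistics of the distillation data at the same layer, for example via forward hooks:

```python
import torch
import torch.nn as nn

def bn_statistical_loss(model, distill_batch):
    """Sum of distances between the batch statistics of the distillation data and
    the original-data statistics stored in each BatchNorm2d layer."""
    losses = []

    def make_hook(bn):
        def hook(module, inputs, output):
            x = inputs[0]
            mean = x.mean(dim=(0, 2, 3))                     # second statistics (current batch)
            var = x.var(dim=(0, 2, 3), unbiased=False)
            losses.append(torch.norm(mean - bn.running_mean)
                          + torch.norm(var - bn.running_var))
        return hook

    handles = [m.register_forward_hook(make_hook(m))
               for m in model.modules() if isinstance(m, nn.BatchNorm2d)]
    model(distill_batch)                                      # forward pass collects the losses
    for h in handles:
        h.remove()
    return sum(losses)
```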
值得注意的是,在特征混合中需要同时匹配一批蒸馏数据和三个预训练模型的特征,这使得生成出来的目标蒸馏数据相比于单模型蒸馏能够更加通用。本公开实施例利用多个预训练模型的特征,使得生成的目标蒸馏数据匹配出多个模型的特征分布。这使得目标蒸馏数据更通用,效果也更好。It is worth noting that in feature mixing, it is necessary to match a batch of distillation data and the features of the three pre-trained models at the same time, which makes the generated target distillation data more versatile than single-model distillation. The embodiment of the present disclosure utilizes the features of multiple pre-trained models, so that the generated target distillation data matches the feature distribution of multiple models. This makes target distillation data more general and effective.
步骤S5303,利用差异反向传播计算梯度,对混合后的一批蒸馏数据进行梯度下降更新;Step S5303, using difference backpropagation to calculate the gradient, and performing gradient descent update on the mixed batch of distillation data;
步骤S5304,判断迭代训练是否完毕。Step S5304, judging whether the iterative training is completed.
这里,在迭代两万轮左右的情况下可以使得目标蒸馏数据与原始数据看起来一致,从而迭代训练完毕;否则,每一次训练之后继续执行步骤S5301。Here, in the case of iterating about 20,000 rounds, the target distillation data can be made to look consistent with the original data, so that the iterative training is completed; otherwise, step S5301 is continued after each training.
相关技术中无数据压缩方法生成的蒸馏数据只能用于本模型的压缩,而且由于蒸馏过程是不可逆运算,存在很多不合理的视觉信息,使得生成的蒸馏数据无法迁移,需要反复生成。The distillation data generated by the non-data compression method in the related art can only be used for the compression of this model, and because the distillation process is an irreversible operation, there is a lot of unreasonable visual information, so that the generated distillation data cannot be transferred and needs to be generated repeatedly.
本公开实施例通过使用数据混合和特征混合训练蒸馏数据,使得生成的目标蒸馏数据具有更加鲁 棒的视觉信息和特征,因此该目标蒸馏数据非常通用,可以用于不同模型。同时本公开实施例只需生成一次目标蒸馏数据就可以一直使用。In the embodiments of the present disclosure, by using data mixing and feature mixing to train distillation data, the generated target distillation data has more robust visual information and features, so the target distillation data is very versatile and can be used for different models. At the same time, the embodiment of the present disclosure only needs to generate the target distillation data once, and then it can be used all the time.
本公开实施例利用多个预训练模型提升蒸馏数据生成时的通用性,同时利用数据信息混合生成鲁棒视觉信息。在实际使用中,对于常见的原始数据集,可以提前生成蒸馏数据集,直接用于模型压缩,如量化和剪枝。对于未见过的原始数据集,则需要生成蒸馏数据才能使用。The embodiments of the present disclosure use multiple pre-trained models to improve the versatility of distillation data generation, and at the same time use data information mixture to generate robust visual information. In practical use, for common raw data sets, distillation data sets can be generated in advance and directly used for model compression, such as quantization and pruning. For unseen raw data sets, distillation data needs to be generated before use.
基于前述的实施例，本公开实施例再提供一种数据蒸馏的装置，所述装置包括所包括的各模块、以及各模块所包括的各子模块以及各单元，可以通过电子设备中的处理器来实现；当然也可通过具体的逻辑电路实现；在实施的过程中，处理器可以为中央处理器(Central Processing Unit，CPU)、微处理器(Micro Processing Unit，MPU)、数字信号处理器(Digital Signal Processor，DSP)或现场可编程门阵列(Field Programmable Gate Array，FPGA)等。Based on the foregoing embodiments, an embodiment of the present disclosure further provides a data distillation apparatus. The modules included in the apparatus, the sub-modules included in each module, and the units may be implemented by a processor in an electronic device, and of course may also be implemented by specific logic circuits. In the implementation process, the processor may be a Central Processing Unit (CPU), a Micro Processing Unit (MPU), a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA), or the like.
图6为本公开实施例提供的一种数据蒸馏的装置的组成结构示意图,如图6所示,所述装置600包括第一确定模块610、第二确定模块620、第三确定模块630、第四确定模块640和训练模块650,其中:Fig. 6 is a schematic diagram of the composition and structure of a data distillation device provided by an embodiment of the present disclosure. As shown in Fig. 6, the device 600 includes a first determination module 610, a second determination module 620, a third determination module 630, a Four determination module 640 and training module 650, wherein:
所述第一确定模块610,配置为确定至少一批待训练的第一蒸馏数据;每一批所述第一蒸馏数据中存在至少一个包括两种数据标签信息的混合数据;The first determination module 610 is configured to determine at least one batch of first distillation data to be trained; in each batch of the first distillation data, there is at least one mixed data including two kinds of data label information;
所述第二确定模块620,配置为确定至少两个预训练模型;其中,每一所述预训练模型中存储原始数据的第一统计信息;The second determination module 620 is configured to determine at least two pre-training models; wherein, each of the pre-training models stores the first statistical information of the original data;
所述第三确定模块630,配置为基于每一所述预训练模型中的第一统计信息,确定每一批所述第一蒸馏数据在相应预训练模型中的批归一化统计损失;The third determination module 630 is configured to determine the batch normalized statistical loss of each batch of the first distillation data in the corresponding pre-training model based on the first statistical information in each of the pre-training models;
所述第四确定模块640,配置为基于每一批所述第一蒸馏数据中每一数据的初始化标签,确定每一批所述第一蒸馏数据在每一所述预训练模型中的目标交叉熵损失;The fourth determination module 640 is configured to determine the target intersection of each batch of the first distillation data in each of the pre-training models based on the initialization label of each data in each batch of the first distillation data entropy loss;
所述训练模块650,配置为基于每一批所述第一蒸馏数据在每一所述预训练模型中的批归一化统计损失和所述目标交叉熵损失,对每一批所述第一蒸馏数据进行反向传播训练,得到目标蒸馏数据。The training module 650 is configured to, based on the batch normalized statistical loss and the target cross-entropy loss of each batch of the first distillation data in each of the pre-training models, for each batch of the first distillation data The distillation data is backpropagated for training to obtain the target distillation data.
在一些可能的实施例中,所述第三确定模块630包括第一确定子模块和第二确定子模块,其中:所述第一确定子模块,确定每一批所述第一蒸馏数据在每一所述预训练模型中的第二统计信息;所述第二确定子模块,针对每一所述预训练模型,确定所述第一统计信息与所述第二统计信息之间的批归一化统计损失。In some possible embodiments, the third determining module 630 includes a first determining submodule and a second determining submodule, wherein: the first determining submodule determines that each batch of the first distillation data is The second statistical information in the pre-training model; the second determination submodule, for each of the pre-training models, determines the batch normalization between the first statistical information and the second statistical information Statistical loss.
在一些可能的实施例中,所述第二确定模块620还配置为从预训练模型库中随机选择至少两个不同类型的预训练模型。In some possible embodiments, the second determining module 620 is further configured to randomly select at least two different types of pre-training models from the pre-training model library.
在一些可能的实施例中,所述训练模块650包括第三确定子模块、处理子模块和训练子模块,其中:所述第三确定子模块,配置为基于每一所述预训练模型的批归一化统计损失和所述目标交叉熵损失,确定相应预训练模型对应的第一损失;所述处理子模块,配置为对各个所述预训练模型对应的所述第一损失求均值,得到每一批所述第一蒸馏数据经过所述至少两个预训练模型的目标损失;所述训练子模块,配置为基于所述目标损失,对每一批所述第一蒸馏数据进行反向传播训练,得到所述目标蒸馏数据。In some possible embodiments, the training module 650 includes a third determination submodule, a processing submodule and a training submodule, wherein: the third determination submodule is configured to Normalizing the statistical loss and the target cross-entropy loss to determine the first loss corresponding to the corresponding pre-training model; the processing submodule is configured to average the first loss corresponding to each of the pre-training models to obtain Each batch of the first distillation data undergoes the target loss of the at least two pre-trained models; the training submodule is configured to perform backpropagation on each batch of the first distillation data based on the target loss Training, obtain the target distillation data.
在一些可能的实施例中,所述第一确定模块包括初始化子模块和混合子模块,其中:所述初始化子模块,配置为基于原始数据的分布信息,初始化至少一批第二蒸馏数据;所述混合子模块,配置为在每一次迭代训练中,对每一批所述第二蒸馏数据中每两个图像数据进行混合,得到每一批所述第一蒸馏数据。In some possible embodiments, the first determination module includes an initialization submodule and a mixing submodule, wherein: the initialization submodule is configured to initialize at least one batch of second distillation data based on the distribution information of the original data; The mixing sub-module is configured to mix every two image data in each batch of the second distillation data to obtain each batch of the first distillation data in each iterative training.
在一些可能的实施例中,所述混合子模块包括选取单元、缩放单元和覆盖单元,其中:所述选取单元,配置为从每一批所述第二蒸馏数据中随机选取至少一对第一图像和第二图像;所述缩放单元,配置为将每一所述第一图像的尺寸按照特定比例缩小;所述覆盖单元,配置为将缩小后的所述第一图像随机覆盖到对应的所述第二图像中,得到每一批所述第一蒸馏数据。In some possible embodiments, the mixing submodule includes a selection unit, a scaling unit, and a covering unit, wherein: the selection unit is configured to randomly select at least one pair of first distillation data from each batch of the second distillation data image and a second image; the scaling unit is configured to reduce the size of each of the first images according to a specific ratio; the covering unit is configured to randomly cover the reduced first image to the corresponding In the second image, each batch of the first distillation data is obtained.
在一些可能的实施例中,所述覆盖单元包括第一生成子单元、第二生成子单元和叠加子单元,其中:所述第一生成子单元,配置为按照所述特定比例,在对应的所述第二图像中随机生成待覆盖的混 合区域;所述第二生成子单元,配置为基于所述混合区域的边界框,随机生成二进制掩码;所述叠加子单元,配置为通过所述二进制掩码对缩小后的所述第一图像的每一像素值和所述第二图像中对应的像素值进行叠加,得到每一批所述第一蒸馏数据。In some possible embodiments, the covering unit includes a first generating subunit, a second generating subunit, and a superposition subunit, wherein: the first generating subunit is configured to, according to the specific ratio, in the corresponding The mixed region to be covered is randomly generated in the second image; the second generating subunit is configured to randomly generate a binary mask based on the bounding box of the mixed region; the superimposing subunit is configured to pass the The binary mask superimposes each pixel value of the reduced first image and the corresponding pixel value in the second image to obtain each batch of the first distillation data.
在一些可能的实施例中,所述每一批第一蒸馏数据中包括由第一图像和第二图像混合得到的合成图像,所述第四确定模块640包括第四确定子模块、第五确定子模块和第六确定子模块,其中:所述第四确定子模块,配置为基于所述第一图像的初始化标签和所述第二图像的初始化标签,确定所述合成图像的混合交叉熵损失;所述第五确定子模块,配置为基于所述每一批第一蒸馏数据中除所述第一图像和所述第二图像之外的其他图像的初始化标签,确定所述其他图像的累计交叉熵损失;所述第六确定子模块,配置为基于所述混合交叉熵损失和所述累计交叉熵损失,确定每一批所述第一蒸馏数据经过每一所述预训练模型后的目标交叉熵损失。In some possible embodiments, each batch of first distillation data includes a synthetic image obtained by mixing the first image and the second image, and the fourth determining module 640 includes a fourth determining submodule, a fifth determining The submodule and the sixth determination submodule, wherein: the fourth determination submodule is configured to determine the hybrid cross-entropy loss of the composite image based on the initialization label of the first image and the initialization label of the second image ; The fifth determination submodule is configured to determine the accumulation of the other images based on the initialization labels of the other images in each batch of first distillation data except the first image and the second image Cross-entropy loss; the sixth determining submodule is configured to determine the target of each batch of the first distillation data after passing through each of the pre-training models based on the mixed cross-entropy loss and the cumulative cross-entropy loss Cross entropy loss.
在一些可能的实施例中,所述第四确定子模块包括第一确定单元、第二确定单元和求和单元,其中:所述第一确定单元,配置为基于所述第一图像的初始化标签,确定每一批所述第一蒸馏数据通过每一所述预训练模型后的第一交叉熵损失;第二确定单元,配置为基于所述第二图像的初始化标签,确定每一批所述第一蒸馏数据通过每一所述预训练模型后的第二交叉熵损失;所述求和单元,配置为基于所述第一图像与所述第二图像之间的合成比例,对所述第一交叉熵损失和所述第二交叉损失进行线性求和,得到所述目标交叉熵损失。In some possible embodiments, the fourth determining submodule includes a first determining unit, a second determining unit, and a summing unit, wherein: the first determining unit is configured to initialize labels based on the first image , to determine the first cross-entropy loss after each batch of the first distillation data passes through each of the pre-training models; the second determining unit is configured to determine each batch of the The second cross-entropy loss after the first distillation data passes through each of the pre-training models; the summation unit is configured to calculate the first image based on the synthesis ratio between the first image and the second image. A cross-entropy loss and the second cross-entropy loss are linearly summed to obtain the target cross-entropy loss.
在一些可能的实施例中,所述第一统计信息包括均值和方差,每一所述预训练模型包括至少两个组成块,每一所述组成块包括卷积层和批归一化层,所述第三确定模块630还包括提取子模块和累加子模块,其中:所述提取子模块,配置为从每一所述组成块的批归一化层中提取所述原始数据的均值和方差;所述均值和方差是通过所述卷积层提取的所述原始数据的特征进行统计得到的;所述累加子模块,配置为基于每一所述预训练模型中所述至少两个组成块的均值和方差,确定每一所述预训练模型中的第一统计信息。In some possible embodiments, the first statistical information includes mean and variance, each of the pre-training models includes at least two building blocks, and each of the building blocks includes a convolutional layer and a batch normalization layer, The third determination module 630 also includes an extraction submodule and an accumulation submodule, wherein: the extraction submodule is configured to extract the mean value and variance of the original data from the batch normalization layer of each of the constituent blocks ; The mean value and variance are statistically obtained through the features of the original data extracted by the convolution layer; the accumulation sub-module is configured to be based on the at least two constituent blocks in each of the pre-training models The mean and variance of , determine the first statistical information in each of said pre-trained models.
在一些可能的实施例中,所述分布信息为高斯分布信息,所述初始化子模块包括获取单元和初始化单元,其中:所述获取单元,配置为获取所述原始数据的高斯分布信息;所述初始化单元,配置为基于从所述高斯分布信息中随机采样的数据,对至少N个初始像素值矩阵进行初始化,得到每批所述第二蒸馏数据;N为大于等于2的整数。In some possible embodiments, the distribution information is Gaussian distribution information, and the initialization submodule includes an acquisition unit and an initialization unit, wherein: the acquisition unit is configured to acquire the Gaussian distribution information of the original data; the The initialization unit is configured to initialize at least N initial pixel value matrices based on randomly sampled data from the Gaussian distribution information to obtain each batch of the second distillation data; N is an integer greater than or equal to 2.
在一些可能的实施例中，所述训练子模块包括第三确定单元、更新单元和训练单元，其中：所述第三确定单元，配置为在反向传播过程中的每一次迭代训练中，确定所述目标损失针对所述第一蒸馏数据的梯度；所述更新单元，配置为基于所述梯度，对所述第一蒸馏数据中每一数据进行更新；所述训练单元，配置为在更新后的第一蒸馏数据在每一所述预训练模型中的批归一化统计损失和所述目标交叉熵损失收敛到稳定值时，得到所述目标蒸馏数据。In some possible embodiments, the training submodule includes a third determination unit, an update unit and a training unit, wherein: the third determination unit is configured to, in each iterative training during backpropagation, determine the gradient of the target loss with respect to the first distillation data; the update unit is configured to update each piece of data in the first distillation data based on the gradient; and the training unit is configured to obtain the target distillation data when the batch normalization statistical loss and the target cross-entropy loss of the updated first distillation data in each of the pre-training models converge to stable values.
在一些可能的实施例中,所述装置还包括压缩模块,配置为在至少一批所述目标蒸馏数据的数量达到特定阈值的情况下,对每一所述预训练模型进行模型压缩。In some possible embodiments, the device further includes a compression module configured to perform model compression on each of the pre-trained models when the quantity of at least one batch of the target distillation data reaches a specific threshold.
这里需要指出的是:以上装置实施例的描述,与上述方法实施例的描述是类似的,具有同方法实施例相似的有益效果。对于本公开装置实施例中未披露的技术细节,请参照本公开方法实施例的描述而理解。It should be noted here that: the description of the above device embodiment is similar to the description of the above method embodiment, and has similar beneficial effects as the method embodiment. For technical details not disclosed in the device embodiments of the present disclosure, please refer to the description of the method embodiments of the present disclosure for understanding.
需要说明的是,本公开实施例中,如果以软件功能模块的形式实现上述数据蒸馏的方法,并作为独立的产品销售或使用时,也可以存储在一个计算机可读取存储介质中。基于这样的理解,本公开实施例的技术方案本质上或者说对相关技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得电子设备(可以是具有摄像头的智能手机、平板电脑等)执行本公开各个实施例所述方法的全部或部分。而前述的存储介质包括:U盘、移动硬盘、只读存储器(Read Only Memory,ROM)、磁碟或者光盘等各种可以存储程序代码的介质。这样,本公开实施例不限制于任何预设的硬件和软件结合。It should be noted that, in the embodiments of the present disclosure, if the above data distillation method is implemented in the form of software function modules and sold or used as an independent product, it can also be stored in a computer-readable storage medium. Based on this understanding, the essence of the technical solutions of the embodiments of the present disclosure or the part that contributes to the related technologies can be embodied in the form of software products, the computer software products are stored in a storage medium, and include several instructions to make An electronic device (which may be a smart phone with a camera, a tablet computer, etc.) executes all or part of the methods described in various embodiments of the present disclosure. The aforementioned storage medium includes: various media that can store program codes such as U disk, mobile hard disk, read-only memory (Read Only Memory, ROM), magnetic disk or optical disk. As such, embodiments of the present disclosure are not limited to any predetermined combination of hardware and software.
对应地,本公开实施例提供一种计算机可读存储介质,其上存储有计算机程序,该计算机程序被处理器执行时实现上述实施例中任一所述数据蒸馏的方法中的步骤。对应地,本公开实施例中,还提 供了一种芯片,所述芯片包括可编程逻辑电路和/或程序指令,当所述芯片运行时,用于实现上述实施例中任一所述数据蒸馏的方法中的步骤。对应地,本公开实施例中,还提供了一种计算机程序产品,当该计算机程序产品被电子设备的处理器执行时,其用于实现上述实施例中任一所述数据蒸馏的方法中的步骤。Correspondingly, an embodiment of the present disclosure provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the steps in any one of the methods for data distillation described in the above-mentioned embodiments are implemented. Correspondingly, in the embodiments of the present disclosure, a chip is also provided, the chip includes programmable logic circuits and/or program instructions, and when the chip is running, it is used to implement the data distillation described in any of the above embodiments steps in the method. Correspondingly, in the embodiments of the present disclosure, a computer program product is also provided, and when the computer program product is executed by the processor of the electronic device, it is used to implement the data distillation method in any of the above embodiments. step.
本公开实施例还提供一种计算机程序产品,该计算机程序产品承载有程序代码,所述程序代码包括的指令可用于执行上述方法实施例中任一所述数据蒸馏方法中的步骤。其中,上述计算机程序产品可以具体通过硬件、软件或其结合的方式实现。在一个可选实施例中,所述计算机程序产品具体体现为计算机存储介质,在另一个可选实施例中,计算机程序产品具体体现为软件产品,例如软件开发包(Software Development Kit,SDK)等等。An embodiment of the present disclosure further provides a computer program product, the computer program product carries a program code, and instructions included in the program code can be used to execute the steps in any one of the data distillation methods in the above method embodiments. Wherein, the above-mentioned computer program product may be specifically implemented by means of hardware, software or a combination thereof. In an optional embodiment, the computer program product is embodied as a computer storage medium, and in another optional embodiment, the computer program product is embodied as a software product, such as a software development kit (Software Development Kit, SDK) etc. wait.
本公开实施例还提供一种计算机程序,包括计算机可读代码,当所述计算机可读代码在电子设备中运行时,所述电子设备中的处理器执行用于实现上述方法实施例中任一所述数据蒸馏方法。An embodiment of the present disclosure also provides a computer program, including computer readable codes. When the computer readable codes are run in an electronic device, the processor in the electronic device executes any one of the above method embodiments. The data distillation method.
基于同一技术构思,本公开实施例提供一种电子设备,配置为实施上述方法实施例记载的数据蒸馏的方法。图7为本公开实施例提供的一种电子设备的硬件实体示意图,如图7所示,所述电子设备700包括存储器710和处理器720,所述存储器710存储有可在处理器720上运行的计算机程序,所述处理器720执行所述程序时实现本公开实施例任一所述数据蒸馏的方法中的步骤。Based on the same technical concept, an embodiment of the present disclosure provides an electronic device configured to implement the data distillation method described in the above method embodiment. FIG. 7 is a schematic diagram of hardware entities of an electronic device provided by an embodiment of the present disclosure. As shown in FIG. 7 , the electronic device 700 includes a memory 710 and a processor 720, and the memory 710 stores a A computer program, the processor 720 implements the steps in any one of the data distillation methods in the embodiments of the present disclosure when executing the program.
存储器710配置为存储由处理器720可执行的指令和应用,还可以缓存待处理器720以及电子设备中各模块待处理或已经处理的数据(例如,图像数据、音频数据、语音通信数据和视频通信数据),可以通过闪存(FLASH)或随机访问存储器(Random Access Memory,RAM)实现。The memory 710 is configured to store instructions and applications executable by the processor 720, and can also cache data to be processed or processed by the processor 720 and various modules in the electronic device (for example, image data, audio data, voice communication data and video data). Communication data), which can be realized by flash memory (FLASH) or random access memory (Random Access Memory, RAM).
处理器720执行程序时实现上述任一项的数据蒸馏的方法的步骤。处理器720通常控制电子设备700的总体操作。When the processor 720 executes the program, the steps of any one of the data distillation methods described above are realized. The processor 720 generally controls the overall operation of the electronic device 700 .
上述处理器可以为特定用途集成电路(Application Specific Integrated Circuit,ASIC)、数字信号处理器(Digital Signal Processor,DSP)、数字信号处理装置(Digital Signal Processing Device,DSPD)、可编程逻辑装置(Programmable Logic Device,PLD)、现场可编程门阵列(Field Programmable Gate Array,FPGA)、中央处理器(Central Processing Unit,CPU)、控制器、微控制器、微处理器中的至少一种。可以理解地,实现上述处理器功能的电子器件还可以为其它,本公开实施例不作具体限定。The above-mentioned processor can be an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a digital signal processor (Digital Signal Processor, DSP), a digital signal processing device (Digital Signal Processing Device, DSPD), a programmable logic device (Programmable Logic At least one of Device, PLD), Field Programmable Gate Array (Field Programmable Gate Array, FPGA), Central Processing Unit (Central Processing Unit, CPU), controller, microcontroller, microprocessor. Understandably, the electronic device that implements the above processor function may also be other, which is not specifically limited in this embodiment of the present disclosure.
上述计算机存储介质/存储器可以是只读存储器(Read Only Memory,ROM)、可编程只读存储器(Programmable Read-Only Memory,PROM)、可擦除可编程只读存储器(Erasable Programmable Read-Only Memory,EPROM)、电可擦除可编程只读存储器(Electrically Erasable Programmable Read-Only Memory,EEPROM)、磁性随机存取存储器(Ferromagnetic Random Access Memory,FRAM)、快闪存储器(Flash Memory)、磁表面存储器、光盘、或只读光盘(Compact Disc Read-Only Memory,CD-ROM)等存储器;也可以是包括上述存储器之一或任意组合的各种电子设备,如移动电话、计算机、平板设备、个人数字助理等。The above-mentioned computer storage medium/memory can be read-only memory (Read Only Memory, ROM), programmable read-only memory (Programmable Read-Only Memory, PROM), erasable programmable read-only memory (Erasable Programmable Read-Only Memory, EPROM), Electrically Erasable Programmable Read-Only Memory (Electrically Erasable Programmable Read-Only Memory, EEPROM), Magnetic Random Access Memory (Ferromagnetic Random Access Memory, FRAM), Flash Memory (Flash Memory), Magnetic Surface Memory, CD-ROM, or CD-ROM (Compact Disc Read-Only Memory, CD-ROM) and other memories; it can also be various electronic devices including one or any combination of the above-mentioned memories, such as mobile phones, computers, tablet devices, personal digital assistants wait.
这里需要指出的是:以上存储介质和设备实施例的描述,与上述方法实施例的描述是类似的,具有同方法实施例相似的有益效果。对于本公开存储介质和设备实施例中未披露的技术细节,请参照本公开方法实施例的描述而理解。It should be pointed out here that: the descriptions of the above storage medium and device embodiments are similar to the descriptions of the above method embodiments, and have similar beneficial effects to those of the method embodiments. For technical details not disclosed in the storage medium and device embodiments of the present disclosure, please refer to the description of the method embodiments of the present disclosure for understanding.
应理解,说明书通篇中提到的“一个实施例”或“一实施例”意味着与实施例有关的特定特征、结构或特性包括在本公开的至少一个实施例中。因此,在整个说明书各处出现的“在一个实施例中”或“在一实施例中”未必一定指相同的实施例。此外,这些预设的特征、结构或特性可以任意适合的方式结合在一个或多个实施例中。应理解,在本公开的各种实施例中,上述各过程的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其功能和内在逻辑确定,而不应对本公开实施例的实施过程构成任何限定。上述本公开实施例序号仅仅为了描述,不代表实施例的优劣。It should be understood that reference throughout the specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic related to the embodiment is included in at least one embodiment of the present disclosure. Thus, appearances of "in one embodiment" or "in an embodiment" in various places throughout the specification are not necessarily referring to the same embodiment. Furthermore, the predetermined features, structures or characteristics may be combined in any suitable manner in one or more embodiments. It should be understood that in various embodiments of the present disclosure, the sequence numbers of the above-mentioned processes do not mean the order of execution, and the execution order of the processes should be determined by their functions and internal logic, rather than by the embodiments of the present disclosure. The implementation process constitutes any limitation. The serial numbers of the above-mentioned embodiments of the present disclosure are for description only, and do not represent the advantages and disadvantages of the embodiments.
需要说明的是,在本文中,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者装置不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者装置所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括该要素的过程、方法、物品或者装置中还存 在另外的相同要素。It should be noted that, in this document, the term "comprising", "comprising" or any other variation thereof is intended to cover a non-exclusive inclusion such that a process, method, article or apparatus comprising a set of elements includes not only those elements, It also includes other elements not expressly listed, or elements inherent in the process, method, article, or device. Without further limitations, an element defined by the phrase "comprising a ..." does not preclude the presence of additional identical elements in the process, method, article, or apparatus comprising that element.
在本公开所提供的几个实施例中,应该理解到,所揭露的设备和方法,可以通过其它的方式实现。以上所描述的设备实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,如:多个单元或组件可以结合,或可以集成到另一个系统,或一些特征可以忽略,或不执行。另外,所显示或讨论的各组成部分相互之间的耦合、或直接耦合、或通信连接可以是通过一些接口,设备或单元的间接耦合或通信连接,可以是电性的、机械的或其它形式的。上述作为分离部件说明的单元可以是、或也可以不是物理上分开的,作为单元显示的部件可以是、或也可以不是物理单元;既可以位于一个地方,也可以分布到多个网络单元上;可以根据实际的需要选择其中的部分或全部单元来实现本公开实施例方案的目的。In the several embodiments provided in the present disclosure, it should be understood that the disclosed devices and methods may be implemented in other ways. The device embodiments described above are only illustrative. For example, the division of the units is only a logical function division. In actual implementation, there may be other division methods, such as: multiple units or components can be combined, or May be integrated into another system, or some features may be ignored, or not implemented. In addition, the coupling, or direct coupling, or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be electrical, mechanical or other forms of. The units described above as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; they may be located in one place or distributed to multiple network units; Part or all of the units can be selected according to actual needs to achieve the purpose of the solutions of the embodiments of the present disclosure.
另外,在本公开各实施例中的各功能单元可以全部集成在一个处理单元中,也可以是各单元分别单独作为一个单元,也可以两个或两个以上单元集成在一个单元中;上述集成的单元既可以采用硬件的形式实现,也可以采用硬件加软件功能单元的形式实现。或者,本公开上述集成的单元如果以软件功能模块的形式实现并作为独立的产品销售或使用时,也可以存储在一个计算机可读取存储介质中。基于这样的理解,本公开实施例的技术方案本质上或者说对相关技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得设备自动测试线执行本公开各个实施例所述方法的全部或部分。而前述的存储介质包括:移动存储设备、ROM、磁碟或者光盘等各种可以存储程序代码的介质。本公开所提供的几个方法实施例中所揭露的方法,在不冲突的情况下可以任意组合,得到方法实施例。本公开所提供的几个方法或设备实施例中所揭露的特征,在不冲突的情况下可以任意组合,得到方法实施例或设备实施例。以上所述,仅为本公开的实施方式,但本公开的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本公开揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本公开的保护范围之内。因此,本公开的保护范围应以所述权利要求的保护范围为准。In addition, each functional unit in each embodiment of the present disclosure may be integrated into one processing unit, or each unit may be used as a single unit, or two or more units may be integrated into one unit; the above-mentioned integration The unit can be realized in the form of hardware or in the form of hardware plus software functional unit. Alternatively, if the above-mentioned integrated units of the present disclosure are realized in the form of software function modules and sold or used as independent products, they may also be stored in a computer-readable storage medium. Based on this understanding, the essence of the technical solutions of the embodiments of the present disclosure or the part that contributes to the related technologies can be embodied in the form of software products, the computer software products are stored in a storage medium, and include several instructions to make The equipment automatic test line executes all or part of the methods described in various embodiments of the present disclosure. The aforementioned storage medium includes various media capable of storing program codes such as removable storage devices, ROMs, magnetic disks or optical disks. The methods disclosed in the several method embodiments provided in the present disclosure can be combined arbitrarily to obtain the method embodiments if there is no conflict. The features disclosed in several method or device embodiments provided in the present disclosure may be combined arbitrarily without conflict to obtain method embodiments or device embodiments. The above is only the embodiment of the present disclosure, but the scope of protection of the present disclosure is not limited thereto. Anyone skilled in the art can easily think of changes or substitutions within the technical scope of the present disclosure, and should within the protection scope of the present disclosure. Therefore, the protection scope of the present disclosure should be determined by the protection scope of the claims.
工业实用性Industrial Applicability
本公开实施例公开的数据蒸馏的方法,首先确定至少一批待训练的第一蒸馏数据;然后确定至少两个预训练模型;再基于每一所述预训练模型中的第一统计信息,确定每一批所述第一蒸馏数据在相应预训练模型中的批归一化统计损失;再基于每一批所述第一蒸馏数据中每一数据的初始化标签,确定每一批所述第一蒸馏数据在每一所述预训练模型中的目标交叉熵损失;最后基于每一批所述第一蒸馏数据在每一所述预训练模型中的批归一化统计损失和所述目标交叉熵损失,对每一批所述第一蒸馏数据进行反向传播训练,得到目标蒸馏数据;如此,通过数据混合生成鲁棒的视觉信息,同时利用多个预训练模型中存储的原始数据的特征,使得训练出来的目标蒸馏数据能够匹配出多个模型的特征分布,从而目标蒸馏数据更通用,效果也更好。In the method of data distillation disclosed in the embodiments of the present disclosure, at least one batch of first distillation data to be trained is first determined; then at least two pre-training models are determined; and based on the first statistical information in each of the pre-training models, it is determined The batch normalized statistical loss of each batch of the first distillation data in the corresponding pre-training model; then, based on the initialization label of each data in each batch of the first distillation data, determine each batch of the first distillation data The target cross-entropy loss of the distillation data in each of the pre-training models; finally, based on the batch normalized statistical loss and the target cross-entropy of each batch of the first distillation data in each of the pre-training models Loss, performing backpropagation training on each batch of the first distillation data to obtain the target distillation data; in this way, robust visual information is generated through data mixing, while utilizing the characteristics of the original data stored in multiple pre-training models, The trained target distillation data can match the feature distribution of multiple models, so that the target distillation data is more versatile and the effect is better.

Claims (30)

  1. 一种数据蒸馏的方法,所述方法包括:A method of data distillation, the method comprising:
    确定至少一批待训练的第一蒸馏数据;每一批所述第一蒸馏数据中存在至少一个包括两种数据标签信息的混合数据;Determine at least one batch of first distillation data to be trained; in each batch of the first distillation data, there is at least one mixed data including two kinds of data label information;
    确定至少两个预训练模型;其中,每一所述预训练模型中存储原始数据的第一统计信息;Determine at least two pre-training models; wherein, each of the pre-training models stores the first statistical information of the original data;
    基于每一所述预训练模型中的第一统计信息,确定每一批所述第一蒸馏数据在相应预训练模型中的批归一化统计损失;determining a batch normalized statistical loss of each batch of the first distillation data in the corresponding pre-training model based on the first statistical information in each of the pre-training models;
    基于每一批所述第一蒸馏数据中每一数据的初始化标签,确定每一批所述第一蒸馏数据在每一所述预训练模型中的目标交叉熵损失;determining a target cross-entropy loss in each of the pre-trained models for each of the batches of the first distillation data based on an initialization label for each of the data in each of the batches of the first distillation data;
    基于每一批所述第一蒸馏数据在每一所述预训练模型中的批归一化统计损失和所述目标交叉熵损失,对每一批所述第一蒸馏数据进行反向传播训练,得到目标蒸馏数据。performing backpropagation training on each batch of the first distillation data based on the batch normalized statistical loss and the target cross-entropy loss in each batch of the first distillation data in each of the pre-training models, Obtain the target distillation data.
2. The method according to claim 1, wherein determining, based on the first statistical information in each pre-trained model, the batch normalization statistical loss of each batch of the first distillation data in the corresponding pre-trained model comprises:
    determining second statistical information of each batch of the first distillation data in each pre-trained model; and
    determining, for each pre-trained model, a batch normalization statistical loss between the first statistical information and the second statistical information.
3. The method according to claim 1 or 2, wherein determining the at least two pre-trained models comprises:
    randomly selecting at least two pre-trained models of different types from a pre-trained model library.
4. The method according to any one of claims 1 to 3, wherein performing back-propagation training on each batch of the first distillation data based on the batch normalization statistical loss and the target cross-entropy loss of that batch in each pre-trained model, to obtain the target distillation data, comprises:
    determining, based on the batch normalization statistical loss and the target cross-entropy loss of each pre-trained model, a first loss corresponding to that pre-trained model;
    averaging the first losses corresponding to the respective pre-trained models to obtain a target loss of each batch of the first distillation data over the at least two pre-trained models; and
    performing back-propagation training on each batch of the first distillation data based on the target loss, to obtain the target distillation data.
5. The method according to any one of claims 1 to 4, wherein determining the at least one batch of first distillation data to be trained comprises:
    initializing at least one batch of second distillation data based on distribution information of the original data; and
    mixing, in each training iteration, every two pieces of image data in each batch of the second distillation data to obtain each batch of the first distillation data.
6. The method according to claim 5, wherein mixing every two pieces of image data in each batch of the second distillation data to obtain each batch of the first distillation data comprises:
    randomly selecting at least one pair of a first image and a second image from each batch of the second distillation data;
    reducing the size of each first image by a specific ratio; and
    randomly overlaying the reduced first image onto the corresponding second image to obtain each batch of the first distillation data.
7. The method according to claim 6, wherein overlaying the reduced first image onto the corresponding second image to obtain each batch of the first distillation data comprises:
    randomly generating, according to the specific ratio, a mixing region to be covered in the corresponding second image;
    randomly generating a binary mask based on a bounding box of the mixing region; and
    superimposing, through the binary mask, each pixel value of the reduced first image and the corresponding pixel value in the second image, to obtain each batch of the first distillation data.
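For illustration only, the following is a minimal sketch of this kind of shrink-and-paste mixing, assuming PyTorch tensors of shape (channels, height, width) and an arbitrary 0.5 scale ratio; the function name and helper logic are assumptions rather than the claimed implementation.

```python
# Illustrative sketch of claims 5-7 style mixing; names and ratio are assumptions.
import torch
import torch.nn.functional as F

def mix_pair(first_img, second_img, ratio=0.5):
    """Shrink `first_img` by `ratio` and paste it at a random position of `second_img`
    through a binary mask; returns the composite image and the mask."""
    _, h, w = second_img.shape
    new_h, new_w = int(h * ratio), int(w * ratio)

    # Reduce the size of the first image by the specific ratio.
    small = F.interpolate(first_img.unsqueeze(0), size=(new_h, new_w),
                          mode="bilinear", align_corners=False).squeeze(0)

    # Randomly generate the mixing region (its bounding box) in the second image.
    top = torch.randint(0, h - new_h + 1, (1,)).item()
    left = torch.randint(0, w - new_w + 1, (1,)).item()

    # Binary mask derived from the bounding box of the mixing region.
    mask = torch.zeros(1, h, w)
    mask[:, top:top + new_h, left:left + new_w] = 1.0

    # Superimpose the shrunk first image onto the second image through the mask.
    padded = torch.zeros_like(second_img)
    padded[:, top:top + new_h, left:left + new_w] = small
    return mask * padded + (1.0 - mask) * second_img, mask

# Example: mix two randomly selected images from one batch of second distillation data.
batch = torch.randn(8, 3, 224, 224)
i, j = torch.randperm(batch.size(0))[:2].tolist()
mixed, mask = mix_pair(batch[i], batch[j], ratio=0.5)
```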
8. The method according to any one of claims 1 to 7, wherein each batch of the first distillation data includes a composite image obtained by mixing a first image and a second image, and determining, based on the initialization label of each piece of data in each batch of the first distillation data, the target cross-entropy loss of each batch of the first distillation data in each pre-trained model comprises:
    determining a mixed cross-entropy loss of the composite image based on an initialization label of the first image and an initialization label of the second image;
    determining a cumulative cross-entropy loss of the images in each batch of the first distillation data other than the first image and the second image, based on initialization labels of those other images; and
    determining, based on the mixed cross-entropy loss and the cumulative cross-entropy loss, the target cross-entropy loss of each batch of the first distillation data after passing through each pre-trained model.
9. The method according to claim 8, wherein determining the mixed cross-entropy loss of the composite image based on the initialization label of the first image and the initialization label of the second image comprises:
    determining, based on the initialization label of the first image, a first cross-entropy loss of each batch of the first distillation data after passing through each pre-trained model;
    determining, based on the initialization label of the second image, a second cross-entropy loss of each batch of the first distillation data after passing through each pre-trained model; and
    linearly summing the first cross-entropy loss and the second cross-entropy loss based on a mixing ratio between the first image and the second image, to obtain the mixed cross-entropy loss of the composite image.
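As a rough illustration of the loss composition in claims 8 and 9 (not the claimed implementation), the sketch below linearly combines two cross-entropy terms by an assumed mixing ratio `lam` and adds the plain cross-entropy of the unmixed images; all tensor names are assumptions.

```python
# Illustrative sketch of the mixed and cumulative cross-entropy terms of claims 8-9.
import torch
import torch.nn.functional as F

def target_ce_loss(logits, labels_a, labels_b, mixed_idx, lam):
    """logits: outputs of one pre-trained model for a batch of first distillation data.
    labels_a / labels_b: initialization labels of the pasted (first) and host (second)
    images; for unmixed images only labels_a is meaningful.
    mixed_idx: boolean mask marking composite images; lam: assumed mixing ratio."""
    if mixed_idx.any():
        # Mixed cross-entropy: linear sum of the two terms weighted by the ratio.
        mixed = (lam * F.cross_entropy(logits[mixed_idx], labels_a[mixed_idx]) +
                 (1.0 - lam) * F.cross_entropy(logits[mixed_idx], labels_b[mixed_idx]))
    else:
        mixed = logits.new_zeros(())
    if (~mixed_idx).any():
        # Cumulative cross-entropy of the remaining, unmixed images.
        cumulative = F.cross_entropy(logits[~mixed_idx], labels_a[~mixed_idx])
    else:
        cumulative = logits.new_zeros(())
    return mixed + cumulative

# Tiny usage example with random logits and labels.
logits = torch.randn(8, 1000)
labels_a = torch.randint(0, 1000, (8,))
labels_b = torch.randint(0, 1000, (8,))
mixed_idx = torch.zeros(8, dtype=torch.bool)
mixed_idx[0] = True
loss = target_ce_loss(logits, labels_a, labels_b, mixed_idx, lam=0.25)
```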
10. The method according to any one of claims 1 to 9, wherein the first statistical information includes a mean and a variance, each pre-trained model includes at least two building blocks, each building block includes a convolutional layer and a batch normalization layer, and the method further comprises:
    extracting the mean and the variance of the original data from the batch normalization layer of each building block, wherein the mean and the variance are obtained by computing statistics over features of the original data extracted by the convolutional layer; and
    determining the first statistical information in each pre-trained model based on the means and variances of the at least two building blocks in that pre-trained model.
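The first statistical information described here corresponds to what standard batch normalization layers already hold as running statistics; a minimal sketch of reading them out of a PyTorch model (the choice of resnet18 is an assumption) could look as follows.

```python
# Illustrative sketch: read the stored first statistical information (running mean
# and variance) out of every batch normalization layer of a pre-trained model.
import torch
from torchvision import models

def first_statistics(model):
    """Return {layer_name: (running_mean, running_var)} for each BN layer."""
    stats = {}
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.BatchNorm2d):
            stats[name] = (module.running_mean.clone(), module.running_var.clone())
    return stats

teacher = models.resnet18(pretrained=True).eval()
bn_stats = first_statistics(teacher)  # one (mean, variance) pair per building block
```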
11. The method according to claim 5, wherein the distribution information is Gaussian distribution information, and initializing the at least one batch of second distillation data based on the distribution information of the original data comprises:
    acquiring the Gaussian distribution information of the original data; and
    initializing at least N initial pixel-value matrices based on data randomly sampled from the Gaussian distribution information, to obtain each batch of the second distillation data, where N is an integer greater than or equal to 2.
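A minimal sketch of such Gaussian initialization, assuming PyTorch and per-channel mean and standard deviation values (the ImageNet statistics below are an illustrative assumption), follows.

```python
# Illustrative sketch of claim 11: initialize N pixel-value matrices from a Gaussian.
import torch

def init_second_distillation_batch(n, channels=3, height=224, width=224,
                                   mean=(0.485, 0.456, 0.406),
                                   std=(0.229, 0.224, 0.225)):
    """Sample N images whose pixels follow assumed per-channel Gaussian statistics
    of the original data; N >= 2 so that image pairs can later be mixed."""
    assert n >= 2
    mean = torch.tensor(mean).view(1, channels, 1, 1)
    std = torch.tensor(std).view(1, channels, 1, 1)
    return torch.randn(n, channels, height, width) * std + mean

second_batch = init_second_distillation_batch(32)  # one batch of second distillation data
```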
12. The method according to claim 4, wherein performing back-propagation training on each batch of the first distillation data based on the target loss, to obtain the target distillation data, comprises:
    determining, in each training iteration of the back-propagation process, a gradient of the target loss with respect to the first distillation data;
    updating each piece of data in the first distillation data based on the gradient; and
    obtaining the target distillation data when the batch normalization statistical loss and the target cross-entropy loss of the updated first distillation data in each pre-trained model converge to stable values.
13. The method according to any one of claims 1 to 12, further comprising:
    performing model compression on each pre-trained model when the quantity of at least one batch of the target distillation data reaches a specific threshold.
14. An apparatus for data distillation, the apparatus comprising a first determination module, a second determination module, a third determination module, a fourth determination module, and a training module, wherein:
    the first determination module is configured to determine at least one batch of first distillation data to be trained, wherein each batch of the first distillation data contains at least one piece of mixed data that includes label information of two kinds of data;
    the second determination module is configured to determine at least two pre-trained models, wherein each pre-trained model stores first statistical information of original data;
    the third determination module is configured to determine, based on the first statistical information in each pre-trained model, a batch normalization statistical loss of each batch of the first distillation data in the corresponding pre-trained model;
    the fourth determination module is configured to determine, based on an initialization label of each piece of data in each batch of the first distillation data, a target cross-entropy loss of each batch of the first distillation data in each pre-trained model; and
    the training module is configured to perform back-propagation training on each batch of the first distillation data based on the batch normalization statistical loss and the target cross-entropy loss of that batch in each pre-trained model, to obtain target distillation data.
15. The apparatus according to claim 14, wherein the third determination module includes:
    a first determination sub-module, configured to determine second statistical information of each batch of the first distillation data in each pre-trained model; and
    a second determination sub-module, configured to determine, for each pre-trained model, a batch normalization statistical loss between the first statistical information and the second statistical information.
16. The apparatus according to claim 14 or 15, wherein the second determination module is further configured to randomly select at least two pre-trained models of different types from a pre-trained model library.
17. The apparatus according to any one of claims 14 to 16, wherein the training module includes:
    a third determination sub-module, configured to determine, based on the batch normalization statistical loss and the target cross-entropy loss of each pre-trained model, a first loss corresponding to that pre-trained model;
    a processing sub-module, configured to average the first losses corresponding to the respective pre-trained models to obtain a target loss of each batch of the first distillation data over the at least two pre-trained models; and
    a training sub-module, configured to perform back-propagation training on each batch of the first distillation data based on the target loss, to obtain the target distillation data.
18. The apparatus according to any one of claims 14 to 17, wherein the first determination module includes:
    an initialization sub-module, configured to initialize at least one batch of second distillation data based on distribution information of the original data; and
    a mixing sub-module, configured to mix, in each training iteration, every two pieces of image data in each batch of the second distillation data to obtain each batch of the first distillation data.
19. The apparatus according to claim 18, wherein the mixing sub-module includes:
    a selection unit, configured to randomly select at least one pair of a first image and a second image from each batch of the second distillation data;
    a scaling unit, configured to reduce the size of each first image by a specific ratio; and
    an overlay unit, configured to randomly overlay the reduced first image onto the corresponding second image to obtain each batch of the first distillation data.
20. The apparatus according to claim 19, wherein the overlay unit includes:
    a first generation sub-unit, configured to randomly generate, according to the specific ratio, a mixing region to be covered in the corresponding second image;
    a second generation sub-unit, configured to randomly generate a binary mask based on a bounding box of the mixing region; and
    a superimposition sub-unit, configured to superimpose, through the binary mask, each pixel value of the reduced first image and the corresponding pixel value in the second image, to obtain each batch of the first distillation data.
21. The apparatus according to any one of claims 14 to 20, wherein each batch of the first distillation data includes a composite image obtained by mixing a first image and a second image, and the fourth determination module includes:
    a fourth determination sub-module, configured to determine a mixed cross-entropy loss of the composite image based on an initialization label of the first image and an initialization label of the second image;
    a fifth determination sub-module, configured to determine a cumulative cross-entropy loss of the images in each batch of the first distillation data other than the first image and the second image, based on initialization labels of those other images; and
    a sixth determination sub-module, configured to determine, based on the mixed cross-entropy loss and the cumulative cross-entropy loss, the target cross-entropy loss of each batch of the first distillation data after passing through each pre-trained model.
22. The apparatus according to claim 21, wherein the fourth determination sub-module includes:
    a first determination unit, configured to determine, based on the initialization label of the first image, a first cross-entropy loss of each batch of the first distillation data after passing through each pre-trained model;
    a second determination unit, configured to determine, based on the initialization label of the second image, a second cross-entropy loss of each batch of the first distillation data after passing through each pre-trained model; and
    a summation unit, configured to linearly sum the first cross-entropy loss and the second cross-entropy loss based on a mixing ratio between the first image and the second image, to obtain the mixed cross-entropy loss of the composite image.
23. The apparatus according to any one of claims 14 to 22, wherein the first statistical information includes a mean and a variance, each pre-trained model includes at least two building blocks, each building block includes a convolutional layer and a batch normalization layer, and the third determination module further includes:
    an extraction sub-module, configured to extract the mean and the variance of the original data from the batch normalization layer of each building block, wherein the mean and the variance are obtained by computing statistics over features of the original data extracted by the convolutional layer; and
    an accumulation sub-module, configured to determine the first statistical information in each pre-trained model based on the means and variances of the at least two building blocks in that pre-trained model.
24. The apparatus according to claim 19, wherein the distribution information is Gaussian distribution information, and the initialization sub-module includes:
    an acquisition unit, configured to acquire the Gaussian distribution information of the original data; and
    an initialization unit, configured to initialize at least N initial pixel-value matrices based on data randomly sampled from the Gaussian distribution information, to obtain each batch of the second distillation data, where N is an integer greater than or equal to 2.
25. The apparatus according to claim 18, wherein the training sub-module includes:
    a third determination unit, configured to determine, in each training iteration of the back-propagation process, a gradient of the target loss with respect to the first distillation data;
    an updating unit, configured to update each piece of data in the first distillation data based on the gradient; and
    a training unit, configured to obtain the target distillation data when the batch normalization statistical loss and the target cross-entropy loss of the updated first distillation data in each pre-trained model converge to stable values.
26. The apparatus according to any one of claims 14 to 25, further comprising a compression module configured to perform model compression on each pre-trained model when the quantity of at least one batch of the target distillation data reaches a specific threshold.
27. An electronic device, comprising a memory and a processor, wherein the memory stores a computer program executable on the processor, and the processor, when executing the program, implements the steps of the method according to any one of claims 1 to 13.
28. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 13.
29. A computer program, comprising computer-readable code, wherein when the computer-readable code runs in an electronic device, a processor in the electronic device performs the method according to any one of claims 1 to 13.
30. A computer program product, comprising one or more instructions adapted to be loaded by a processor to perform the steps of the method according to any one of claims 1 to 13.
PCT/CN2022/071121 2021-08-27 2022-01-10 Data distillation method and apparatus, device, storage medium, computer program, and product WO2023024406A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110994122.7A CN113762368A (en) 2021-08-27 2021-08-27 Method, device, electronic equipment and storage medium for data distillation
CN202110994122.7 2021-08-27

Publications (1)

Publication Number Publication Date
WO2023024406A1

Family

ID=78791493

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/071121 WO2023024406A1 (en) 2021-08-27 2022-01-10 Data distillation method and apparatus, device, storage medium, computer program, and product

Country Status (2)

Country Link
CN (1) CN113762368A (en)
WO (1) WO2023024406A1 (en)


Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113762368A (en) * 2021-08-27 2021-12-07 北京市商汤科技开发有限公司 Method, device, electronic equipment and storage medium for data distillation
CN116630724B (en) * 2023-07-24 2023-10-10 美智纵横科技有限责任公司 Data model generation method, image processing method, device and chip

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111008693A (en) * 2019-11-29 2020-04-14 深动科技(北京)有限公司 Network model construction method, system and medium based on data compression
US20200272940A1 (en) * 2019-02-25 2020-08-27 Salesforce.Com, Inc. Data privacy protected machine learning systems
CN111860572A (en) * 2020-06-04 2020-10-30 北京百度网讯科技有限公司 Data set distillation method, device, electronic equipment and storage medium
CN112446476A (en) * 2019-09-04 2021-03-05 华为技术有限公司 Neural network model compression method, device, storage medium and chip
CN112766463A (en) * 2021-01-25 2021-05-07 上海有个机器人有限公司 Method for optimizing neural network model based on knowledge distillation technology
CN113762368A (en) * 2021-08-27 2021-12-07 北京市商汤科技开发有限公司 Method, device, electronic equipment and storage medium for data distillation

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111738436B (en) * 2020-06-28 2023-07-18 电子科技大学中山学院 Model distillation method and device, electronic equipment and storage medium
CN111950638B (en) * 2020-08-14 2024-02-06 厦门美图之家科技有限公司 Image classification method and device based on model distillation and electronic equipment


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116486285A (en) * 2023-03-15 2023-07-25 中国矿业大学 Aerial image target detection method based on class mask distillation
CN116486285B (en) * 2023-03-15 2024-03-19 中国矿业大学 Aerial image target detection method based on class mask distillation
CN117576518A (en) * 2024-01-15 2024-02-20 第六镜科技(成都)有限公司 Image distillation method, apparatus, electronic device, and computer-readable storage medium
CN117576518B (en) * 2024-01-15 2024-04-23 第六镜科技(成都)有限公司 Image distillation method, apparatus, electronic device, and computer-readable storage medium

Also Published As

Publication number Publication date
CN113762368A (en) 2021-12-07


Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE