CN111860572B - Data set distillation method, device, electronic equipment and storage medium - Google Patents

Data set distillation method, device, electronic equipment and storage medium

Info

Publication number
CN111860572B
CN111860572B (application CN202010498711.1A)
Authority
CN
China
Prior art keywords
data
distillation
distilled
data set
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010498711.1A
Other languages
Chinese (zh)
Other versions
CN111860572A (en)
Inventor
彭启明
路华
曾凯
罗斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202010498711.1A
Publication of CN111860572A
Application granted
Publication of CN111860572B
Legal status: Active
Anticipated expiration

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/217 Validation; Performance evaluation; Active pattern learning techniques
    • G06F18/24 Classification techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Vaporization, Distillation, Condensation, Sublimation, And Cold Traps (AREA)

Abstract

The application discloses a data set distillation method, a data set distillation device, an electronic device and a storage medium, relating to the field of deep learning. The method may include: randomly initializing N distilled data for an original data set to be processed, N being a positive integer greater than one, and performing the following predetermined processing: training a data authenticity discrimination model using the original data set and the N distilled data, and updating the N distilled data according to the data authenticity discrimination model; training a classification model using the N distilled data, the classification model being the one corresponding to the classification task of the original data set, and updating the N distilled data according to the original data set and the classification model; and, if a termination condition is met, taking the most recently obtained N distilled data as the required data set distillation result, otherwise repeating the predetermined processing. Applying this scheme can improve the distillation effect, among other benefits.

Description

Data set distillation method, device, electronic equipment and storage medium
Technical Field
The present application relates to computer application technologies, and in particular, to a data set distillation method, apparatus, electronic device, and storage medium in the field of deep learning.
Background
With the continuous improvement of computing power, deep learning techniques have been rapidly developed. Generally, a large-scale data set is required to be used as a training set for training a deep learning model.
Growth in data volume brings a series of problems. For example, a large amount of redundant information is usually present in a large-scale data set; excessive redundant information can cause the model's learning result to deviate from the expected result and can greatly increase the model's training time.
For this reason, the data set can be distilled, but no good implementation is currently available. For example, an active learning approach is mostly adopted at present: in each cycle, the subset of data that improves the model the most is screened from the unlabeled original data set and sent to labeling experts for annotation, so that the best possible model performance is obtained from the least data, indirectly achieving data compression.
Disclosure of Invention
The application provides a data set distillation method, a data set distillation device, an electronic device and a storage medium.
A data set distillation method comprising:
randomly initializing N distilled data with respect to an original data set to be processed, wherein N is a positive integer greater than one, and performing the following predetermined processing:
training a data authenticity judging model by using the original data set and the N pieces of distillation data, and updating the N pieces of distillation data according to the data authenticity judging model;
training a classification model by using N distilled data, wherein the classification model is a classification model corresponding to classification tasks corresponding to the original data set, and updating the N distilled data according to the original data set and the classification model;
and if the termination condition is met, taking the latest N pieces of distillation data as the required data set distillation results, otherwise, repeatedly executing the preset processing.
A dataset distillation apparatus comprising: an initialization module and a distillation module;
the initialization module is used for randomly initializing N distilled data aiming at an original data set to be processed, wherein N is a positive integer greater than one;
the distillation module is used for executing the following predetermined treatment: training a data authenticity judging model by using the original data set and the N pieces of distillation data, and updating the N pieces of distillation data according to the data authenticity judging model; training a classification model by using N distilled data, wherein the classification model is a classification model corresponding to classification tasks corresponding to the original data set, and updating the N distilled data according to the original data set and the classification model; and if the termination condition is met, taking the latest N pieces of distillation data as the required data set distillation results, otherwise, repeatedly executing the preset processing.
An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described above.
A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method as described above.
A computer program product comprising a computer program which, when executed by a processor, implements a method as described above.
One embodiment of the above application has the following advantages or benefits: a plurality of distilled data are randomly initialized, and knowledge in the original data set is distilled into them, so that redundant information in the original data set is reduced and distilled data of reduced size are obtained; the process can be completed quickly and automatically, saving labor and time costs. It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for better understanding of the present solution and do not constitute a limitation of the present application. Wherein:
FIG. 1 is a flow chart of a first embodiment of a data set distillation method described herein;
FIG. 2 is a flow chart of a second embodiment of a data set distillation method described herein;
FIG. 3 is a schematic diagram showing the composition of an embodiment of a dataset distillation device 30 according to the present application;
fig. 4 is a block diagram of an electronic device according to a method according to an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present application to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In addition, it should be understood that the term "and/or" herein merely describes an association relationship between associated objects and means that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist together, or B exists alone. In addition, the character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship.
FIG. 1 is a flow chart of a first embodiment of a data set distillation method described herein. As shown in fig. 1, the following detailed implementation is included.
At 101, N distilled data are randomly initialized for the raw data set to be processed, N being a positive integer greater than one.
In 102, the following predetermined processing is performed: training a data authenticity judging model by using the original data set and the N pieces of distillation data, and updating the N pieces of distillation data according to the data authenticity judging model; and training a classification model by using the N distilled data, wherein the classification model is a classification model corresponding to a classification task corresponding to the original data set, and updating the N distilled data according to the original data set and the classification model.
In 103, if it is determined that the termination condition is met, the most recently obtained N distilled data are taken as the required data set distillation result; otherwise, the predetermined processing described in 102 is performed again.
In this embodiment, N distilled data may first be randomly initialized for the original data set to be processed, where N is a positive integer greater than one whose specific value can be determined according to actual needs. The distilled data may be in the form of a matrix, and randomly initializing the distilled data means randomly initializing the element values of the corresponding matrix; the numbers of rows and columns of the matrix can likewise be determined according to actual needs. In this way, the initialization of the distilled data can be completed quickly and conveniently, laying a good foundation for subsequent processing.
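For illustration only, a minimal sketch of this initialization step is shown below; it assumes PyTorch, and the values of N and the matrix shape are hypothetical choices, not values fixed by the present application.

```python
# Minimal sketch of the random initialization step (assumption: PyTorch is used;
# N and the matrix shape are illustrative values chosen per actual needs).
import torch

N, ROWS, COLS = 10, 28, 28  # hypothetical values
# requires_grad=True so the distilled data can later be updated by gradient descent
distilled_data = torch.randn(N, ROWS, COLS, requires_grad=True)
```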
Thereafter, the predetermined processing may be performed, in which the data authenticity discrimination model is first trained using the original data set and the N distilled data. For example, a first label may be assigned to each piece of original data in the original data set to identify it as original data, and a second label may be assigned to each piece of distilled data to identify it as distilled data; the data authenticity discrimination model can then be trained according to each piece of original data, each piece of distilled data and the corresponding labels. That is, the original data and the distilled data are used as training data, and the model is trained with their corresponding labels.
The specific forms of the first label and the second label are not limited; for example, the first label may be 0 and the second label 1, or the first label may be 1 and the second label 0. The data authenticity discrimination model can be used to evaluate the authenticity of input data.
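A minimal sketch of this training step is given below, assuming PyTorch and a hypothetical discriminator network; the label assignment (1 for original data, 0 for distilled data), the optimizer, and the number of steps are illustrative assumptions rather than details stated in the present application.

```python
# Sketch of training the data authenticity discrimination model.
# Assumptions: PyTorch; "discriminator" is any nn.Module mapping a batch of data
# to one logit per sample; original data get label 1, distilled data get label 0.
import torch
import torch.nn as nn

def train_discriminator(discriminator, original_data, distilled_data,
                        steps=100, lr=1e-3):
    opt = torch.optim.Adam(discriminator.parameters(), lr=lr)
    bce = nn.BCEWithLogitsLoss()
    first_labels = torch.ones(original_data.size(0), 1)     # first label: original data
    second_labels = torch.zeros(distilled_data.size(0), 1)  # second label: distilled data
    for _ in range(steps):
        opt.zero_grad()
        # detach() so this step trains only the discriminator, not the distilled data
        loss = (bce(discriminator(original_data), first_labels)
                + bce(discriminator(distilled_data.detach()), second_labels))
        loss.backward()
        opt.step()
    return discriminator
```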
Further, the N distilled data may be updated according to the data authenticity discrimination model. For each piece of distilled data, the following processing can be performed separately: the distilled data is input into the data authenticity discrimination model, and a gradient update is applied to the distilled data according to the output loss value.
For example, for a piece of distilled data a, it may be input into the data authenticity discrimination model to obtain the model's output, which is typically a loss value; the loss value can be differentiated with respect to the distilled data a, and the distilled data a can then be updated with a gradient descent algorithm according to the derivative, using an existing implementation.
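For illustration only, a minimal sketch of this per-sample update follows; the function name, the use of the "original data" label as target (so the gradient pulls the sample toward real data), the input shape handling, and the learning rate are assumptions for the sketch, not part of the present application.

```python
# Sketch of one gradient update of a single distilled sample against the
# discriminator (assumptions: PyTorch; label 1 of original data is used as the
# target so the update makes the sample look more "real"; lr is illustrative).
import torch
import torch.nn as nn

def update_distilled_sample(discriminator, sample, lr=0.1):
    sample = sample.clone().detach().requires_grad_(True)
    target = torch.ones(1, 1)  # label of original ("real") data
    loss = nn.BCEWithLogitsLoss()(discriminator(sample.unsqueeze(0)), target)
    loss.backward()                 # derive the loss with respect to the sample
    with torch.no_grad():
        sample -= lr * sample.grad  # one gradient-descent step on the sample itself
    return sample.detach()
```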
In this way, the distilled data become more similar to the real original data and thus retain the characteristics of the real original data as much as possible. This improves the distillation effect, creates conditions for analyzing the data characteristics of the distilled data, and improves the training effect when the distilled data are later used to train models.
The classification model may then be trained using the N distilled data (i.e., the N distilled data as just updated); the classification model is the one corresponding to the classification task of the original data set. For example, a classification label may be assigned to each piece of distilled data to identify the class to which it belongs, and the classification model can then be trained according to each piece of distilled data and the corresponding classification label.
For example, if the classification task corresponding to the original data set is to distinguish whether the animal in a picture is a cat or a dog, the classification labels may be cat and dog; in general, the classification labels assigned to the N distilled data need to cover all the classes of the classification task.
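A minimal sketch of this step is shown below, assuming PyTorch, a hypothetical classifier network, and integer class labels that cover every class of the task (e.g. 0 for cat, 1 for dog); the optimizer and number of steps are illustrative assumptions.

```python
# Sketch of training the classification model on the N distilled data only
# (assumptions: PyTorch; "classifier" is an nn.Module producing class logits;
# class_labels are integer labels covering every class of the original task).
import torch
import torch.nn as nn

def train_classifier(classifier, distilled_data, class_labels, steps=100, lr=1e-2):
    opt = torch.optim.SGD(classifier.parameters(), lr=lr)
    ce = nn.CrossEntropyLoss()
    for _ in range(steps):
        opt.zero_grad()
        # detach() so only the classifier parameters are trained in this step
        loss = ce(classifier(distilled_data.detach()), class_labels)
        loss.backward()
        opt.step()
    return classifier
```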
Further, the N distilled data may be updated according to the original data set and the classification model. For example, each piece of original data in the original data set can be input into the classification model, and gradient updates can be applied to the N distilled data according to the output loss values. That is, the classification model is used to evaluate all the original data, and the evaluation results are then used to update the distilled data with a gradient descent algorithm.
How the gradient updates are applied to the N distilled data according to the output loss values is not limited. For example, assume the original data set contains 100 pieces of original data, numbered 1 to 100. The outputs corresponding to pieces 1-10 can be obtained, the 10 loss values added, the distilled data differentiated with respect to the sum, and a gradient update performed according to the result of the differentiation; the outputs corresponding to pieces 11-20 are then obtained and the same processing applied, then pieces 21-30, and so on.
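The sketch below illustrates the batched update just described (summing the loss over groups of 10 original samples and taking one gradient step per group). One detail the text leaves open is how the loss computed on the original data becomes differentiable with respect to the distilled data; here a hypothetical helper build_classifier is assumed to train the classifier on the distilled data in a differentiable way (e.g. one unrolled gradient step), which is an implementation assumption rather than something stated in the present application.

```python
# Sketch of updating the distilled data from the original data set and the
# classification model, summing losses over batches of 10 original samples as in
# the 1-10 / 11-20 / ... example above.
# Assumptions: PyTorch; build_classifier(data, labels) returns a function x -> logits
# whose parameters come from a differentiable training pass on (data, labels), so
# that gradients of the original-data loss can flow back to the distilled data.
import torch
import torch.nn.functional as F

def update_distilled_from_original(distilled_data, distilled_labels,
                                   original_data, original_labels,
                                   build_classifier, lr=0.1, batch_size=10):
    data = distilled_data.clone().detach().requires_grad_(True)
    for start in range(0, original_data.size(0), batch_size):
        classifier_fn = build_classifier(data, distilled_labels)  # depends on data
        batch_x = original_data[start:start + batch_size]
        batch_y = original_labels[start:start + batch_size]
        # add up the loss values of this batch, then derive with respect to the distilled data
        loss = F.cross_entropy(classifier_fn(batch_x), batch_y, reduction="sum")
        grad, = torch.autograd.grad(loss, data)
        with torch.no_grad():
            data -= lr * grad                                     # one gradient-descent step
    return data.detach()
```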
After the N distilled data are updated according to the original data set and the classification model, the value of a count parameter (initial value 0) may be incremented by one. If the incremented value equals a predetermined threshold, it may be determined that the termination condition is met, and the most recently obtained N distilled data are taken as the required data set distillation result; otherwise, the predetermined processing described in 102 is performed again. The predetermined processing is thus repeated until the termination condition is met. The specific value of the predetermined threshold can be determined according to actual needs.
In the existing active learning approach, the selected data are all subsets of the original data set, i.e., every selected piece belongs to the original data set, which to a certain extent limits how much the data set can be compressed. By contrast, suppose redundant information exists between two or more pieces of text in the original data set; a single piece of text that does not itself belong to the original data set could be used to summarize them, which in theory reduces the redundant information in the data set to the greatest extent. The method described in this embodiment can achieve this effect.
In the method described in the above embodiment, a plurality of distillation data may be initialized randomly, knowledge in the original data set may be distilled into the distillation data to the greatest extent, so that redundant information in the original data set may be reduced, and distillation data with reduced data size may be obtained.
After the distilled data are obtained, they can be used to train models; performance similar to that obtained by training on the original data set can be achieved, so model performance is preserved while the model training speed is significantly improved.
Based on the above description, FIG. 2 is a flow chart of a second embodiment of the dataset distillation method described herein. As shown in fig. 2, the following detailed implementation is included.
In 201, a raw data set to be processed is acquired.
At 202, N distilled data are randomly initialized, N being a positive integer greater than one.
The distilled data may be in the form of a matrix, with the element values in the matrix randomly initialized.
At 203, a data authenticity discrimination model is trained using the raw dataset and the N distilled data.
For example, a first label is assigned to each piece of original data in the original data set to identify it as original data, and a second label is assigned to each piece of distilled data to identify it as distilled data; the data authenticity discrimination model can then be trained according to each piece of original data, each piece of distilled data and the corresponding labels.
In 204, each piece of distilled data is input into the data authenticity discrimination model, and the distilled data is gradient-updated according to the output loss value.
At 205, a classification model is trained using the N distillation data.
For example, a classification label is assigned to each piece of distilled data to identify the class to which it belongs, and the classification model is trained according to each piece of distilled data and the corresponding classification label.
At 206, each piece of original data in the original data set is input into the classification model, and the distilled data are gradient-updated according to the output loss values.
In 207, the value of the set count parameter is incremented by one, and it is determined whether the incremented value is equal to a predetermined threshold; if so, 208 is executed, otherwise 203 is repeated.
At 208, the most recently obtained N distilled data are taken as the required data set distillation result, and the flow ends.
In the above process flow, the distilled data used in each step are the latest distilled data obtained from the preceding processing.
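For readability, the whole flow of 201-208 can be pieced together from the sketches above; the helper names, the maximum-iteration threshold, and the folding of 205-206 into a single helper are assumptions made for this sketch, not a definitive implementation of the present application.

```python
# Sketch of the overall flow 201-208, reusing the helpers sketched earlier
# (train_discriminator, update_distilled_sample, update_distilled_from_original).
# make_discriminator() builds a fresh authenticity model; max_iterations is the
# predetermined threshold compared against the count parameter.
import torch

def distill_dataset(original_data, original_labels, N, distilled_labels,
                    make_discriminator, build_classifier, max_iterations=50):
    # 201-202: acquire the original data set and randomly initialize N distilled data
    distilled = torch.randn(N, *original_data.shape[1:], requires_grad=True)
    count = 0                                        # count parameter, initial value 0
    while True:
        # 203-204: train the authenticity model, then update each distilled sample
        disc = train_discriminator(make_discriminator(), original_data, distilled)
        distilled = torch.stack([update_distilled_sample(disc, s) for s in distilled])
        distilled.requires_grad_(True)
        # 205-206: in this sketch, build_classifier trains the classification model on
        # the distilled data (differentiably) inside update_distilled_from_original,
        # which then updates the distilled data from the original-data loss values
        distilled = update_distilled_from_original(distilled, distilled_labels,
                                                   original_data, original_labels,
                                                   build_classifier)
        distilled.requires_grad_(True)
        # 207-208: increment the count parameter and check the termination condition
        count += 1
        if count == max_iterations:
            return distilled.detach()
```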
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of action combinations, but it should be understood by those skilled in the art that the present application is not limited by the order of actions described, as some steps may take other order or occur simultaneously in accordance with the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required in the present application. In addition, parts of a certain embodiment, which are not described in detail, may be referred to the related description of other embodiments, and will not be described in detail.
The foregoing is a description of embodiments of the method, and the following further describes embodiments of the device.
Fig. 3 is a schematic diagram showing the composition and structure of an embodiment of a dataset distillation apparatus 30 according to the present application. As shown in fig. 3, the apparatus includes an initialization module 301 and a distillation module 302.
An initialization module 301 is configured to randomly initialize N distillation data for an original data set to be processed, where N is a positive integer greater than one.
A distillation module 302 for performing the following predetermined processes: training a data authenticity judging model by using the original data set and the N pieces of distillation data, and updating the N pieces of distillation data according to the data authenticity judging model; training a classification model by using the N distilled data, wherein the classification model is a classification model corresponding to a classification task corresponding to the original data set, and updating the N distilled data according to the original data set and the classification model; and if the termination condition is met, taking the latest N pieces of distillation data as the required data set distillation results, otherwise, repeatedly executing the preset processing.
The distilled data may be in matrix form, and the initialization module 301 may randomly initialize the element values in the matrix.
When training the data authenticity discrimination model by using the original data set and the N pieces of distilled data, the distillation module 302 may assign a first tag to each piece of original data in the original data set, where the first tag is used to identify the corresponding piece of data as the original data, and assign a second tag to each piece of distilled data, where the second tag is used to identify the corresponding piece of data as the distilled data, and train the data authenticity discrimination model according to each piece of original data, each piece of distilled data, and the corresponding tag.
When updating the N distillation data according to the data authenticity discrimination model, the distillation module 302 may perform the following processing for any one distillation data: inputting the distilled data into a data authenticity judging model, and carrying out gradient updating on the distilled data according to the output loss value.
When training the classification model using N pieces of distillation data, the distillation module 302 may assign a classification label to each piece of distillation data, where the classification label is used to identify the classification to which the corresponding piece of distillation data belongs, and train the classification model according to each piece of distillation data and the corresponding classification label.
When updating the N pieces of distillation data according to the original data set and the classification model, the distillation module 302 may input each piece of original data in the original data set into the classification model, and perform gradient update on the N pieces of distillation data according to each loss value output.
After updating the N distilled data according to the original data set and the classification model, the distillation module 302 may further increment the value of a count parameter (initial value 0) by one. If the incremented value equals the predetermined threshold, it may determine that the termination condition is met and take the most recently obtained N distilled data as the required data set distillation result; otherwise, it may perform the predetermined processing again.
The specific workflow of the embodiment of the apparatus shown in fig. 3 is referred to the related description in the foregoing method embodiment, and will not be repeated.
In summary, with the solution of the apparatus embodiment of the present application, a plurality of distilled data can be randomly initialized and knowledge in the original data set distilled into them to the greatest extent, so that redundant information in the original data set is reduced and distilled data of reduced size are obtained; the process can be completed quickly and automatically, saving labor and time costs. After the distilled data are obtained, they can be used to train models, achieving performance similar to that obtained by training on the original data set, so model performance is preserved while the model training speed is significantly improved.
According to embodiments of the present application, an electronic device and a readable storage medium are also provided.
As shown in fig. 4, is a block diagram of an electronic device according to a method according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the application described and/or claimed herein.
As shown in fig. 4, the electronic device includes: one or more processors Y01, memory Y02, and interfaces for connecting the components, including high-speed interfaces and low-speed interfaces. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executing within the electronic device, including instructions stored in or on memory to display graphical information of a graphical user interface on an external input/output device, such as a display device coupled to the interface. In other embodiments, multiple processors and/or multiple buses may be used, if desired, along with multiple memories. Also, multiple electronic devices may be connected, each providing a portion of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system). In fig. 4, a processor Y01 is taken as an example.
The memory Y02 is a non-transitory computer readable storage medium provided in the present application. Wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the methods provided herein. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to perform the methods provided herein.
The memory Y02 serves as a non-transitory computer readable storage medium, and may be used to store a non-transitory software program, a non-transitory computer executable program, and modules, such as program instructions/modules corresponding to the methods in the embodiments of the present application. The processor Y01 executes various functional applications of the server and data processing, i.e., implements the methods in the above-described method embodiments, by running non-transitory software programs, instructions, and modules stored in the memory Y02.
The memory Y02 may include a memory program area that may store an operating system, at least one application program required for functions, and a memory data area; the storage data area may store data created according to the use of the electronic device, etc. In addition, memory Y02 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, memory Y02 may optionally include memory located remotely from processor Y01, which may be connected to the electronic device via a network. Examples of such networks include, but are not limited to, the internet, intranets, blockchain networks, local area networks, mobile communication networks, and combinations thereof.
The electronic device may further include: an input device Y03 and an output device Y04. The processor Y01, memory Y02, input device Y03, and output device Y04 may be connected by a bus or otherwise, for example in fig. 4.
The input device Y03 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device, such as a touch screen, keypad, mouse, trackpad, touchpad, pointer stick, one or more mouse buttons, trackball, joystick, and like input devices. The output means Y04 may include a display device, an auxiliary lighting means, a tactile feedback means (e.g., a vibration motor), and the like. The display device may include, but is not limited to, a liquid crystal display, a light emitting diode display, and a plasma display. In some implementations, the display device may be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application specific integrated circuitry, computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs, the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
These computing programs (also referred to as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. The terms "machine-readable medium" and "computer-readable medium" as used herein refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, programmable logic devices) for providing machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a cathode ray tube or a liquid crystal display monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks, wide area networks, blockchain networks, and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps described in the present application may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions disclosed in the present application can be achieved, and are not limited herein.
The above embodiments do not limit the scope of the application. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present application are intended to be included within the scope of the present application.

Claims (14)

1. A data set distillation method comprising:
randomly initializing N distilled data with respect to an original data set to be processed, wherein N is a positive integer greater than one, and performing the following predetermined processing:
training a data authenticity discrimination model using the raw dataset and the N distilled data, comprising: respectively allocating a first label for each piece of original data in the original data set, wherein the first label is used for identifying the corresponding piece of data as the original data, respectively allocating a second label for each piece of distilled data, wherein the second label is used for identifying the corresponding piece of data as the distilled data, and training the data authenticity judging model according to each piece of original data, each piece of distilled data and the corresponding label; updating the N distillation data according to the data authenticity judging model;
training a classification model by using N distilled data, wherein the classification model is a classification model corresponding to classification tasks corresponding to the original data set, and updating the N distilled data according to the original data set and the classification model;
and if the termination condition is met, taking the latest N pieces of distillation data as the required data set distillation results, otherwise, repeatedly executing the preset processing.
2. The method of claim 1, wherein the randomly initializing N distillation data comprises: the distillation data is in a matrix form, and element values in the matrix are randomly initialized.
3. The method of claim 1, wherein the updating the N distilled data according to the data authenticity discrimination model comprises:
for any distilled data, the following treatments were performed separately: inputting the distillation data into the data authenticity judging model, and carrying out gradient updating on the distillation data according to the output loss value.
4. The method of claim 1, wherein the training a classification model using N distillation data comprises:
respectively assigning a classification label to each distillation data, wherein the classification label is used for identifying the classification to which the corresponding distillation data belongs; and training the classification model according to each distillation data and the corresponding classification label.
5. The method of claim 1, wherein the updating the N distillation data according to the raw dataset and the classification model comprises:
and respectively inputting each original data in the original data set into the classification model, and carrying out gradient update on the N distilled data according to each loss value output.
6. The method of claim 1, further comprising: after updating the N distillation data according to the original data set and the classification model, adding one to the value of the set counting parameter;
the determining that the termination condition is met includes: if the value of the counting parameter after adding one is equal to the preset threshold value, determining that the ending condition is met, wherein the initial value of the counting parameter is 0.
7. A dataset distillation apparatus comprising: an initialization module and a distillation module;
the initialization module is used for randomly initializing N distilled data aiming at an original data set to be processed, wherein N is a positive integer greater than one;
the distillation module is used for executing the following predetermined treatment: training a data authenticity judging model by using the original data set and the N pieces of distillation data, and updating the N pieces of distillation data according to the data authenticity judging model; training a classification model by using N distilled data, wherein the classification model is a classification model corresponding to classification tasks corresponding to the original data set, and updating the N distilled data according to the original data set and the classification model; if the termination condition is met, N pieces of distillation data which are obtained latest are used as required data set distillation results, otherwise, the preset processing is repeatedly executed;
the distillation module respectively allocates a first label for each piece of original data in the original data set, wherein the first label is used for identifying the corresponding piece of data as the original data, respectively allocates a second label for each piece of distillation data, and the second label is used for identifying the corresponding piece of data as the distillation data, and trains the data authenticity judging model according to each piece of original data, each piece of distillation data and the corresponding label.
8. The apparatus of claim 7, wherein the distillation data is in a matrix form, and the initialization module randomly initializes the element values in the matrix.
9. The apparatus of claim 7, wherein the distillation module performs the following processing for any one of the distillation data: inputting the distillation data into the data authenticity judging model, and carrying out gradient updating on the distillation data according to the output loss value.
10. The apparatus of claim 7, wherein the distillation module assigns a classification tag to each distillation data, the classification tag identifying a class to which the corresponding distillation data belongs, and trains the classification model based on each distillation data and the corresponding classification tag.
11. The apparatus of claim 7, wherein the distillation module inputs each raw data in the raw data set into the classification model, and performs gradient update on the N distilled data according to each loss value output.
12. The apparatus of claim 7, wherein the distillation module is further configured to increment the set count parameter by one after updating the N distillation data according to the raw data set and the classification model, and determine that a termination condition is met if the incremented count parameter is equal to a predetermined threshold, and the initial count parameter is 0.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
14. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-6.
CN202010498711.1A 2020-06-04 2020-06-04 Data set distillation method, device, electronic equipment and storage medium Active CN111860572B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010498711.1A CN111860572B (en) 2020-06-04 2020-06-04 Data set distillation method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010498711.1A CN111860572B (en) 2020-06-04 2020-06-04 Data set distillation method, device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111860572A CN111860572A (en) 2020-10-30
CN111860572B true CN111860572B (en) 2024-01-26

Family

ID=72985457

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010498711.1A Active CN111860572B (en) 2020-06-04 2020-06-04 Data set distillation method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111860572B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113762368A (en) * 2021-08-27 2021-12-07 Beijing SenseTime Technology Development Co., Ltd. Method, device, electronic equipment and storage medium for data distillation

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019100724A1 (en) * 2017-11-24 2019-05-31 Huawei Technologies Co., Ltd. Method and device for training multi-label classification model
CN110196908A (en) * 2019-04-17 2019-09-03 OneConnect Smart Technology Co., Ltd. (Shenzhen) Data classification method, device, computer apparatus and storage medium
CN111008693A (en) * 2019-11-29 2020-04-14 DeepMotion Technology (Beijing) Co., Ltd. Network model construction method, system and medium based on data compression

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11921473B2 (en) * 2019-06-28 2024-03-05 Intel Corporation Methods and apparatus to generate acceptability criteria for autonomous systems plans

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019100724A1 (en) * 2017-11-24 2019-05-31 Huawei Technologies Co., Ltd. Method and device for training multi-label classification model
CN110196908A (en) * 2019-04-17 2019-09-03 OneConnect Smart Technology Co., Ltd. (Shenzhen) Data classification method, device, computer apparatus and storage medium
CN111008693A (en) * 2019-11-29 2020-04-14 DeepMotion Technology (Beijing) Co., Ltd. Network model construction method, system and medium based on data compression

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Face Recognition Based on Deep Feature Distillation; Ge Shiming; Zhao Shengwei; Liu Wenyu; Li Chenyu; Journal of Beijing Jiaotong University (Issue 06); full text *

Also Published As

Publication number Publication date
CN111860572A (en) 2020-10-30

Similar Documents

Publication Publication Date Title
CN111667054B (en) Method, device, electronic equipment and storage medium for generating neural network model
US11176206B2 (en) Incremental generation of models with dynamic clustering
EP3916630A1 (en) Method and apparatus for identifying video
CN111104514B (en) Training method and device for document tag model
CN112036509A (en) Method and apparatus for training image recognition models
CN111506401B (en) Automatic driving simulation task scheduling method and device, electronic equipment and storage medium
CN110852379B (en) Training sample generation method and device for target object recognition
CN112016633A (en) Model training method and device, electronic equipment and storage medium
CN111582477B (en) Training method and device for neural network model
CN111738419A (en) Quantification method and device of neural network model
CN111756832B (en) Method and device for pushing information, electronic equipment and computer readable storage medium
CN111860572B (en) Data set distillation method, device, electronic equipment and storage medium
CN111782181A (en) Code generation method and device, electronic equipment and storage medium
CN111783949A (en) Deep neural network training method and device based on transfer learning
US20200301997A1 (en) Fuzzy Cohorts for Provenance Chain Exploration
CN111310058A (en) Information theme recommendation method and device, terminal and storage medium
CN111061743A (en) Data processing method and device and electronic equipment
CN112100530B (en) Webpage classification method and device, electronic equipment and storage medium
CN111858927B (en) Data testing method and device, electronic equipment and storage medium
CN111316191A (en) Prediction engine for multi-level pattern discovery and visual analysis recommendation
CN112560936A (en) Model parallel training method, device, equipment, storage medium and program product
CN112347101A (en) Tag data storage method, computer device, and storage medium
CN111177479A (en) Method and device for acquiring feature vectors of nodes in relational network graph
CN112560928B (en) Negative sample mining method and device, electronic equipment and storage medium
CN112270412B (en) Network operator processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant