CN111126573A - Model distillation improvement method and device based on individual learning and storage medium - Google Patents

Model distillation improvement method and device based on individual learning and storage medium

Info

Publication number
CN111126573A
Authority
CN
China
Prior art keywords
loss function
network
sample
student network
representative
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911387562.5A
Other languages
Chinese (zh)
Other versions
CN111126573B (en)
Inventor
尉桦
李一力
邵新庆
刘强
徐�明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen ZNV Technology Co Ltd
Nanjing ZNV Software Co Ltd
Original Assignee
Shenzhen ZNV Technology Co Ltd
Nanjing ZNV Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen ZNV Technology Co Ltd, Nanjing ZNV Software Co Ltd filed Critical Shenzhen ZNV Technology Co Ltd
Priority to CN201911387562.5A priority Critical patent/CN111126573B/en
Publication of CN111126573A publication Critical patent/CN111126573A/en
Application granted granted Critical
Publication of CN111126573B publication Critical patent/CN111126573B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Abstract

The invention discloses a model distillation improvement method based on individual learning, which comprises the following steps: inputting a preselected training set into a teacher network, and generating representative samples and non-representative samples after screening by the teacher network; determining a first loss function for the student network to process the representative samples and a second loss function for the student network to process the non-representative samples; and inputting the preselected training set into the student network, and training with the first loss function and the second loss function to obtain a trained student network. The invention also discloses an intelligent device and a computer-readable storage medium. By training the representative samples and the non-representative samples with different loss functions, the characterization capability of the student network for the category to which the representative samples belong is improved, and the characterization capability of the student network for the category to which a single sample belongs is thereby improved.

Description

Model distillation improvement method and device based on individual learning and storage medium
Technical Field
The invention relates to the technical field of deep learning, in particular to a model distillation improvement method based on individual learning, intelligent equipment and a computer readable storage medium.
Background
In recent years, deep learning has received more and more attention and has been successfully applied in many fields. However, as mobile terminal devices become more flexible and their range of applications broadens, increasingly deep network models place ever higher demands on the computing power and storage capacity of computing devices. How to compress deep neural network models so that they suit more portable mobile devices and real-time application scenarios has therefore become a research hotspot in many fields.
Knowledge distillation is a major research direction in model compression. Its main idea is to train a small model (student network) to learn the capability of a large model (teacher network). Currently, the main learning methods are as follows: the first learns from the cross-entropy loss built on conventional 0/1 labels; the second uses the soft labels of the teacher network to help the student network learn the capability of the teacher network; the third directly uses the student network to fit an intermediate layer of the teacher network; the fourth uses the student network to learn multiple intermediate layers of the teacher network to improve the learning ability of the student network; the fifth uses the student network to learn the relationships among multiple samples in the teacher network. However, these distillation methods usually distill on the basis of a single sample in the training set, whereas for some recognition and classification tasks, especially face recognition, what matters is whether a given picture belongs to a given person. Thus, distillation methods at the current stage suffer from poor characterization of the class to which a single sample belongs.
The above is only for the purpose of assisting understanding of the technical aspects of the present invention, and does not represent an admission that the above is prior art.
Disclosure of Invention
The invention mainly aims to provide a model distillation improvement method based on individual learning, an intelligent device and a computer-readable storage medium, so as to solve the problem that distillation learning methods in the prior art have poor characterization capability for the category to which a single sample belongs.
To achieve the above object, the present invention provides a model distillation improvement method based on individual learning, the method comprising the steps of:
inputting a preselected training set into a teacher network, and generating a representative sample and a non-representative sample after being screened by the teacher network;
determining a first loss function for the student network to process representative samples and a second loss function for the student network to process non-representative samples;
and inputting the preselected training set into a student network, and training through the first loss function and the second loss function to obtain a trained student network.
Optionally, the step of generating a representative sample and a non-representative sample after the teacher network screening comprises:
calculating a first Euclidean distance according to the sample characteristics of two samples of different classes, and calculating a second Euclidean distance according to the sample characteristic of one sample and the sample central characteristic of the class to which the sample belongs;
determining a selection factor of the sample according to the minimum first Euclidean distance and the second Euclidean distance;
and screening out representative samples and non-representative samples in the preselected training set according to the selection factor.
Optionally, the step of screening out the representative samples and the non-representative samples in the training set according to the selection factor includes:
judging whether the selection factor is smaller than a preset threshold value or not;
if the selection factor is smaller than a preset threshold value, judging that the current sample is a representative sample;
and if the selection factor is larger than or equal to a preset threshold value, judging that the current sample is a non-representative sample.
Optionally, the step of determining a first loss function for the student network to process representative samples and a second loss function for the student network to process non-representative samples comprises:
determining a first loss function of the student network for processing the representative sample according to the loss function of the teacher network, the characteristic layer of the teacher network and the characteristic layer of the student network.
Optionally, the step of determining, according to the loss function of the teacher network, the feature layer of the teacher network, and the feature layer of the student network, a first loss function of the student network for processing the representative sample includes:
calculating the fitting degree of the characteristic layer of the student network fitting the characteristic layer of the teacher network as a first part of a first loss function, and taking the loss function same as that of the teacher network as a second part of the first loss function;
determining a sum of a product of a first parameter and the first portion and a product of a second parameter and the second portion as the first loss function; the first parameter and the second parameter are both larger than zero, the sum of the first parameter and the second parameter is 1, and the first parameter is larger than the second parameter.
Optionally, the step of determining a first loss function for processing representative samples and a second loss function for processing non-representative samples of the student network comprises:
the same loss function as the teacher network is determined as a second loss function for the student network to process the non-representative samples.
Optionally, the step of determining a first loss function for processing the representative sample and a second loss function for processing the non-representative sample of the student network further comprises:
defining the feature layer of the teacher network as f_T, the feature layer of the student network as f_S, and the first parameter as λ;
if the loss function of the teacher network is determined to be cosface, the first loss function is determined to be λ × Σ_k‖f_T − f_S‖² + (1 − λ) × cosface, and the second loss function is determined to be cosface.
Optionally, the step of obtaining the trained student network through the training of the first loss function and the second loss function includes:
adjusting the parameters of the student network by using the first loss function and the second loss function, wherein during the training process of adjusting the parameters of the student network, the weight vector of the loss layer of the student network is kept consistent with the weight vector of the loss layer of the teacher network;
and when the parameters of the student network are adjusted so that the loss function reaches its minimum, the trained student network is obtained.
In addition, to achieve the above object, the present invention further provides an intelligent device, which includes a memory, a processor and a model distillation improvement program based on individual learning stored on the memory and executable on the processor, wherein the processor implements the steps of the model distillation improvement method based on individual learning as described above when executing the model distillation improvement program based on individual learning.
Further, to achieve the above object, the present invention also provides a computer-readable storage medium having stored thereon a model distillation improvement program based on individual learning, which when executed by a processor, realizes the steps of the model distillation improvement method based on individual learning as described above.
In the embodiment of the invention, the representative samples and the non-representative samples in the preselected training set are screened out through the teacher network, the preselected training set is then input into the student network, and the student network trains the representative samples and the non-representative samples in the preselected training set with different loss functions. This avoids the problem that, when the same loss function treats samples far from the class center and samples close to the class center equally, the class of the samples far from the center becomes hard to distinguish, and it thereby improves the characterization capability for the class to which a single sample belongs.
Drawings
Fig. 1 is a schematic structural diagram of an intelligent device in a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a schematic flow diagram of a first embodiment of the model distillation improvement method based on individual learning according to the present invention;
FIG. 3 is a diagram of the overall architecture of the teacher network in an embodiment of the model distillation improvement method based on individual learning according to the present invention;
FIG. 4 is a diagram of the overall architecture of the student network in an embodiment of the model distillation improvement method based on individual learning according to the present invention;
FIG. 5 is a schematic flow chart of a second embodiment of the model distillation improvement method based on individual learning according to the present invention;
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Inputting a preselected training set into a teacher network, and generating a representative sample and a non-representative sample after being screened by the teacher network; determining a first loss function for the student network to process representative samples and a second loss function for the student network to process non-representative samples; and inputting the preselected training set into a student network, and training through the first loss function and the second loss function to obtain a trained student network.
Most current knowledge distillation methods distill on the basis of a single sample in the training set; that is, they only train the student network to learn the teacher network's characterization of a single sample, and their characterization of the class to which a single sample belongs is poor. The invention provides a model distillation improvement method based on individual learning, an intelligent device and a computer-readable storage medium. Representative samples and non-representative samples in a preselected training set are screened out by a trained teacher network model, and different loss functions are determined according to the screened representative and non-representative samples to train the preselected training set input into the student network, rather than training the whole preselected training set directly with the same loss function. This avoids the problem that the category of the representative samples is difficult to judge when they are trained with the same loss function, and it improves the characterization capability for the class to which a single sample belongs.
Referring to fig. 1, fig. 1 is a schematic structural diagram of an intelligent device in a hardware operating environment according to an embodiment of the present invention.
The intelligent device of the embodiment of the invention can be a PC, a smartphone, a tablet computer, or another terminal device with image recognition and/or processing functions, such as a supermarket barcode scanner or face-scanning payment device.
As shown in fig. 1, the smart device may include: a communication bus 1002, a processor 1001 such as a CPU, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard); optionally, the user interface 1003 may also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., a magnetic disk memory). The memory 1005 may alternatively be a storage device separate from the processor 1001.
Optionally, the smart device may further include a camera, a sensor, an audio circuit, a WiFi module, and the like. The sensors may include light sensors as well as other sensors. Specifically, the light sensors may include an ambient light sensor and a proximity sensor: the ambient light sensor can start the camera to collect a target image when a target (e.g., a person, a human face, or a human gesture) is detected, and the proximity sensor can trigger image recognition and processing of the collected target image to realize target detection when the target approaches the device. The audio circuit can collect voice information of the target, control the smart device to execute a corresponding operation according to the voice information of the target (for example, recognizing Zhang San), and can also be used to compare the voice-matched target with the detection result of the target detection device to verify the reliability of that result. The WiFi module can connect the smart device to a terminal, upload the target detection result to the terminal device, and control the terminal to execute a corresponding functional operation. Of course, the smart device may also be configured with other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, an infrared sensor, and a temperature sensor, which are not described here again.
Those skilled in the art will appreciate that the smart device architecture shown in FIG. 1 does not constitute a limitation of the smart device and may include more or fewer components than shown, or some components in combination, or a different arrangement of components.
As shown in fig. 1, a memory 1005, which is a kind of computer storage medium, may include therein an operating system, a network communication module, a user interface module, and an individual learning-based model distillation improvement program.
In the terminal shown in fig. 1, the network interface 1004 is mainly used for connecting to a backend server and performing data communication with the backend server; the user interface 1003 is mainly used for connecting a client (user side) and performing data communication with the client; and the processor 1001 may be configured to invoke the individual learning based model distillation improvement program stored in the memory 1005 and perform the following operations:
inputting a preselected training set into a teacher network, and generating a representative sample and a non-representative sample after being screened by the teacher network;
determining a first loss function for the student network to process representative samples and a second loss function for the student network to process non-representative samples;
and inputting the preselected training set into a student network, and training through the first loss function and the second loss function to obtain a trained student network.
Alternatively, the processor 1001 may call an individual learning based model distillation improvement program stored in the memory 1005, and also perform the following operations:
calculating a first Euclidean distance according to the sample characteristics of two samples of different classes, and calculating a second Euclidean distance according to the sample characteristic of one sample and the sample central characteristic of the class to which the sample belongs;
determining a selection factor of the sample according to the minimum first Euclidean distance and the second Euclidean distance;
and screening out representative samples and non-representative samples in the preselected training set according to the selection factor.
Alternatively, the processor 1001 may call an individual learning based model distillation improvement program stored in the memory 1005, and also perform the following operations:
judging whether the selection factor is smaller than a preset threshold value or not;
if the selection factor is smaller than a preset threshold value, judging that the current sample is a representative sample;
and if the selection factor is larger than or equal to a preset threshold value, judging that the current sample is a non-representative sample.
Alternatively, the processor 1001 calls the model distillation improvement program based on individual learning stored in the memory 1005 and performs the following operations:
determining a first loss function of the student network for processing the representative sample according to the loss function of the teacher network, the characteristic layer of the teacher network and the characteristic layer of the student network.
Alternatively, the processor 1001 may call an individual learning based model distillation improvement program stored in the memory 1005, and also perform the following operations:
calculating the fitting degree of the characteristic layer of the student network fitting the characteristic layer of the teacher network as a first part of a first loss function, and taking the loss function same as that of the teacher network as a second part of the first loss function;
determining a sum of a product of a first parameter and the first portion and a product of a second parameter and the second portion as the first loss function; the first parameter and the second parameter are both larger than zero, the sum of the first parameter and the second parameter is 1, and the first parameter is larger than the second parameter.
Alternatively, the processor 1001 may call an individual learning based model distillation improvement program stored in the memory 1005, and also perform the following operations:
the same loss function as the teacher network is determined as a second loss function for the student network to process the non-representative samples.
Alternatively, the processor 1001 may call an individual learning based model distillation improvement program stored in the memory 1005, and also perform the following operations:
defining the feature layer of the teacher network as f_T, the feature layer of the student network as f_S, and the first parameter as λ;
if the loss function of the teacher network is determined to be cosface, the first loss function is determined to be λ × Σ_k‖f_T − f_S‖² + (1 − λ) × cosface, and the second loss function is determined to be cosface.
Alternatively, the processor 1001 may invoke an individual learning based model distillation improvement program in the memory 1005, further performing the following operations:
adjusting the parameters of the student network by using the first loss function and the second loss function, wherein during the training process of adjusting the parameters of the student network, the weight vector of the loss layer of the student network is kept consistent with the weight vector of the loss layer of the teacher network;
and when the parameters of the student network are adjusted so that the loss function reaches its minimum, the trained student network is obtained.
Referring to fig. 2, fig. 2 is a schematic flow chart of a first embodiment of the model distillation improvement method based on individual learning according to the present invention. In this embodiment, the step of the model distillation improvement method based on individual learning includes:
step S10: inputting a preselected training set into a teacher network, and generating a representative sample and a non-representative sample after the preselected training set is screened by the teacher network;
in this embodiment, the preselected training set is input to a teacher network, and after being screened by the teacher network, the samples in the preselected training set can be distinguished into representative samples and non-representative samples. For different target detection tasks, the preselected training set may be a sample image set containing portrait information (such as a human face, a gesture, a posture and the like), a sample image set containing vehicle information (such as a license plate, appearance information of a vehicle and the like), a sample image set containing barcode information (such as a commodity barcode scanned by a barcode scanner in a supermarket or a payment two-dimensional code of a user and the like), and the like. The number of the samples in the preselected training set can be determined in a user-defined mode according to specific requirements, or can be determined according to training effects of a teacher network and/or a student network (for example, when the training effects of the teacher network and/or the student network are not good, the training samples in the training set can be properly increased, the samples with poor effects can be removed while the training samples are increased, and when the training effects are good, the current sample amount can be kept or the sample amount can be properly reduced, and the like). After the preselected training set is determined by combining the sample information and the number of samples in the training set, because some samples (such as face images with low resolution and the like in a face recognition task) which are difficult to identify the class of the preselected training set or other samples which need special processing always exist in the training set, the preselected training set is input into a trained teacher network, and representative samples and non-representative samples in the preselected training set can be screened out according to a feature layer of the teacher network. Before screening, a teacher network is constructed, and the teacher network is trained through a loss function of the teacher network. In one embodiment, the overall architecture of the teacher network is shown in FIG. 3. By adopting the resnet50 network as the network structure of the teacher network, the cosface can realize the maximization of the difference between classes and the minimization of the difference in classes through the normalization and the maximization of the cosine decision boundary, so that the cosface is used as the loss function of the teacher network. For some recognition tasks, a sample is usually represented by using a one-dimensional feature, a layer representing the sample in the teacher network is recorded as a feature layer, and the feature layer takes 256-dimensional features, that is, the network output of resnet50 is set as 256-dimensional feature output; the teacher network training system is characterized in that a loss layer for training the teacher network is arranged behind a characteristic layer of the teacher network, the loss layer is located on the last layer of the teacher network structure, and parameters of other layers of the student network can be adjusted through a loss function of the loss layer. The cosface loss used in this example is calculated as follows:
L = (1/N) × Σ_i −log( e^{s(cos θ_{yi,i} − m)} / ( e^{s(cos θ_{yi,i} − m)} + Σ_{j≠yi} e^{s·cos θ_{j,i}} ) )

wherein

cos θ_{j,i} = W_j^T · x_i
where W and x are both normalized, x_i denotes the i-th training sample corresponding to label y_i (the same label y_i may cover many samples x_i), W_j is the weight vector of class j, and θ_j is the angle between W_j and the current sample x_i. N represents the total number of samples in the currently trained mini-batch, s is a scaling factor, and m represents the margin imposed between the central vectors of the two classes; in this embodiment s is 64 and m is 0.35, and the modulus of W and x is 1 because both are normalized. That is, when the resnet50 network is used, the network output is set to 256-dimensional features, the feature layer is denoted f_T, and the teacher network is trained with the cosface loss until it converges, yielding a trained teacher network model. After the teacher network model is trained, the preselected training set is input into the teacher network and, for different recognition or classification target detection tasks, the samples whose class is difficult to distinguish, that is, the samples with low output probability within their own class, are screened out as representative samples. For example, if ten thousand face images covering one thousand persons are input, the face images within a person's class (say, the class of Zhang San or Li Si) that are hard to assign to that class are screened out as representative samples; the remaining samples of the preselected training set are then treated as non-representative samples, and the representative samples are marked. The representative samples need not only be samples whose class is hard to distinguish; they may also be other samples that are difficult to process for a given task (e.g., samples with heavy noise or very large size), and such samples are likewise screened out as representative samples.
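For concreteness, the following is a minimal PyTorch sketch (not part of the original disclosure) of the cosface loss and the 256-dimensional feature layer described above; s = 64 and m = 0.35 follow the values of this embodiment, while the class name, the default class count and all other details are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosFaceLoss(nn.Module):
    """Sketch of the large-margin cosine (cosface) loss used to train the teacher network.
    Features and class weight vectors are L2-normalized, so cos(theta_{j,i}) = W_j^T x_i;
    a margin m is subtracted from the target-class cosine and the result is scaled by s
    before the softmax cross-entropy."""

    def __init__(self, feat_dim=256, num_classes=1000, s=64.0, m=0.35):
        super().__init__()
        self.s, self.m = s, m
        # W: one weight vector per class (the "loss layer" placed after the feature layer)
        self.W = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, features, labels):
        # cosine similarity between each sample feature and each class weight vector
        cos = F.linear(F.normalize(features), F.normalize(self.W))   # (N, num_classes)
        # subtract the margin m only from the target-class cosine
        one_hot = F.one_hot(labels, cos.size(1)).to(cos.dtype)
        logits = self.s * (cos - self.m * one_hot)
        return F.cross_entropy(logits, labels)
```

In such a sketch, the teacher backbone (for example a resnet50 whose head is replaced by a 256-dimensional feature layer) would produce `features`, and this loss would be minimized until the teacher network converges.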
Step S20: determining a first loss function for the student network to process representative samples and a second loss function for the student network to process non-representative samples;
After the representative samples and the non-representative samples in the preselected training set are screened out through the teacher network, a first loss function used by the student network to process the representative samples and a second loss function used to process the non-representative samples are determined. In some recognition and classification tasks, owing to the differences among samples, when samples of the same class are trained with the same loss function, the classes of the samples that differ more from the rest of their class (and thus have relatively lower probability) are not easy to recognize; such samples are therefore screened out as representative samples, and their characterization can be improved by training them with a different loss function. The loss function reflects how well the model fits the data: the worse the fit, the larger the value of the loss function should be, and a larger loss should also correspond to a larger gradient. Choosing a loss function therefore has two main requirements: it must reflect the true labels of the problem to be solved, and it must have a reasonable gradient so that the weights and parameters can be updated during optimization. Different loss functions can be chosen for different target detection tasks; for example, a regression loss such as Mean Square Error (MSE) or Mean Absolute Error (MAE) can be selected for a regression task, and a classification loss such as the cross-entropy loss or the Hinge loss can be selected for a classification task. In this embodiment, for a sample classification or recognition task, the first loss function for processing the representative samples is determined from the loss function of the teacher network, the feature layer of the teacher network and the feature layer of the student network, while the same loss function as the teacher network is used as the second loss function with which the student network processes the non-representative samples. Because the teacher network is trained with the cosface loss, the second loss function of the student network for processing the non-representative samples can be determined as the cosface loss, and the loss for processing the representative samples can be determined by combining how well the student network learns from the teacher network (the degree to which the feature layer of the student network fits the feature layer of the teacher network) with the cosface loss. In this embodiment, MobileNetV3 is selected as the network structure of the student network, its feature layer is designed as a 256-dimensional vector, and distillation loss + cosface loss is selected as the loss function to train the student network. The distillation loss reflects how well the student network learns from the teacher network, and the overall structure of the student network is shown in FIG. 4.
Specifically, the determination of the first loss function includes: calculating the degree to which the feature layer of the student network fits the feature layer of the teacher network as the first part of the first loss function, taking the same loss function as the teacher network as the second part of the first loss function, and determining the sum of the product of a first parameter and the first part and the product of a second parameter and the second part as the first loss function; the first parameter and the second parameter are both greater than zero, their sum is 1, and the first parameter is greater than the second parameter. Defining the feature layer of the teacher network as f_T, the feature layer of the student network as f_S and the first parameter as λ, if the loss function of the teacher network is determined to be cosface, the first loss function of the student network may be determined as λ × Σ_k‖f_T − f_S‖² + (1 − λ) × cosface, and the second loss function of the student network is determined as cosface; the first half of the formula represents the degree to which the feature layer of the student network fits the feature layer of the teacher network, and the second half is the cosface loss. Since it is desirable that the feature layer of the student network fits the feature layer of the teacher network more closely while training on the representative samples, the first half is weighted more than the second half, i.e., λ is a value greater than 0.5 and less than 1; for example, λ is 0.8 in this embodiment. After the first loss function with which the student network processes the representative samples and the second loss function with which it processes the non-representative samples are determined, the determined loss functions of the student network (including the first loss function and the second loss function) are added to the last layer of the student network model as its loss layer, and the parameters of the student network are trained.
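To make the split between the two loss functions concrete, the following is a minimal sketch (assumed code, not taken from the patent) of how a batch of student features could be scored with the first loss λ × Σ_k‖f_T − f_S‖² + (1 − λ) × cosface on the representative samples and with the cosface-only second loss on the remaining samples. It reuses the CosFaceLoss sketch above; the function name, the is_rep mask and λ = 0.8 are illustrative.

```python
import torch

def student_batch_loss(student_feat, teacher_feat, labels, is_rep, cosface_loss, lam=0.8):
    """Sketch: apply the first loss to representative samples and the second loss to the rest.

    student_feat / teacher_feat: (N, 256) feature-layer outputs of student and teacher.
    is_rep: boolean mask (N,) marking the representative samples screened by the teacher.
    cosface_loss: a CosFaceLoss-style module shared by both loss functions.
    """
    teacher_feat = teacher_feat.detach()        # no gradient flows into the teacher
    losses = []
    if is_rep.any():
        # first loss: lam * sum_k ||f_T - f_S||^2 + (1 - lam) * cosface
        fit = ((student_feat[is_rep] - teacher_feat[is_rep]) ** 2).sum(dim=1).mean()
        cls_rep = cosface_loss(student_feat[is_rep], labels[is_rep])
        losses.append(lam * fit + (1.0 - lam) * cls_rep)
    if (~is_rep).any():
        # second loss: the same cosface loss as the teacher network
        losses.append(cosface_loss(student_feat[~is_rep], labels[~is_rep]))
    return sum(losses)
```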
Step S30: and inputting the preselected training set into a student network, and training through the first loss function and the second loss function to obtain a trained student network.
The preselected training set, with the representative samples marked, is input into the student network and trained with the first loss function and the second loss function to obtain the trained student network. The training process comprises: training the representative samples in the training set with the first loss function, training the non-representative samples with the second loss function, and adjusting the parameters of the student network by combining the first loss function and the second loss function; when the parameters of the student network are adjusted so that the loss function reaches its minimum, the trained student network is obtained. Throughout the training process of adjusting the parameters of the student network, the weight vector of the loss layer of the student network is kept consistent with the weight vector of the loss layer of the teacher network. Specifically, after the preselected training set is input into the student network, the output information of the student network (such as face classification information) is obtained through forward propagation; the loss function of the student network is calculated from this output to evaluate the difference between the predicted value and the actual value of the student network's output. When the difference is large, the loss is large, and the loss is then propagated back layer by layer using gradient descent on the total loss, so that the parameters (such as the weights) of the student network are adjusted; this forms the back-propagation mechanism and makes the output of the next round more accurate. When the parameters of the student network have been adjusted until the loss function no longer decreases (the loss function equals zero or reaches its minimum value), the predicted value output by the student network is close to the actual value, and training can be stopped to obtain the trained student network. In addition, because the weight vectors learned by the teacher network distribute the classes evenly, the weight vector of the loss layer of the student network is kept consistent with the weight vector of the loss layer of the teacher network during training; that is, when the loss function of the teacher network is cosface, the weight vectors in the loss layer of the student network (covering both the first loss function and the second loss function) are kept consistent with the weight vectors in the teacher network's cosface loss, so that through this weight matrix the student network learns the teacher network's ability to characterize individuals.
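A hedged sketch of this training procedure is given below (illustrative only; it assumes the CosFaceLoss and student_batch_loss sketches above and a data loader that already carries the representative-sample mask). It shows the forward pass through both networks, the screening-dependent loss, back-propagation, and the copying of the teacher's loss-layer weight vectors into the student's loss layer so that the two stay consistent.

```python
import torch

def train_student(student, teacher, cosface_student, cosface_teacher,
                  loader, epochs=10, lr=0.1, lam=0.8):
    """Sketch of Step S30: train the student network with the first and second loss functions."""
    # keep the weight vectors of the student's loss layer consistent with the teacher's
    with torch.no_grad():
        cosface_student.W.copy_(cosface_teacher.W)
    cosface_student.W.requires_grad_(False)

    teacher.eval()
    optimizer = torch.optim.SGD(student.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for images, labels, is_rep in loader:       # is_rep comes from the teacher-side screening
            with torch.no_grad():
                f_t = teacher(images)               # teacher 256-dimensional feature layer
            f_s = student(images)                   # student 256-dimensional feature layer
            loss = student_batch_loss(f_s, f_t, labels, is_rep, cosface_student, lam)
            optimizer.zero_grad()
            loss.backward()                         # propagate the loss back layer by layer
            optimizer.step()
    return student
```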
In this embodiment, the representative samples and the non-representative samples in the preselected training set are screened out by inputting the preselected training set into the teacher network, and reasonable loss functions are then determined according to the specific target detection task and the corresponding training set to train the representative samples and the non-representative samples respectively. Because the representative samples and the non-representative samples are trained with different loss functions, rather than all samples in the training set being trained directly with the same loss function, the problem of the student network characterizing the representative samples poorly is avoided, the characterization capability of the student network for the class to which the representative samples belong is improved, and the characterization capability of the student network for the class to which a single sample belongs is thereby improved.
Referring to fig. 5, fig. 5 is a schematic flow chart of a second embodiment of the model distillation improvement method based on individual learning according to the present invention. In this embodiment, the step of the model distillation improvement method based on individual learning includes:
step S11: inputting a preselected training set into a teacher network, and determining a selection factor of a sample according to the characteristics of the sample output by the teacher network;
step S12: screening out representative samples and non-representative samples in the preselected training set according to the selection factors;
step S13: determining a first loss function for the student network to process representative samples and a second loss function for the student network to process non-representative samples;
step S14: and inputting the preselected training set into a student network, and training through the first loss function and the second loss function to obtain a trained student network.
In this embodiment, after the preselected training set is input into the teacher network, the selection factor of each sample is determined according to the sample features output by the teacher network. Specifically, from the sample features output by the teacher network, the sample center feature of each class and the sample features of the individual samples of the same or different classes can be obtained. A first Euclidean distance is calculated from the sample features of two samples of different classes, a second Euclidean distance is calculated from the sample feature of one of those samples and the sample center feature of the class to which it belongs, and the selection factor of the sample is determined from the first Euclidean distance and the second Euclidean distance, namely as the quotient of the first Euclidean distance and the second Euclidean distance. Specifically, in one embodiment, the minimum Euclidean distance between the sample features of samples of different classes is taken as the first Euclidean distance, with the calculation formula min d(f_Ti, f_Tj), where f_Ti is the 256-dimensional feature of the i-th sample of the teacher network and f_Tj is the 256-dimensional feature of the j-th sample belonging to a class other than that of sample i; the Euclidean distance between a sample and its sample center is calculated from the sample feature of the sample and the sample center feature of its class, with the calculation formula d(f_Ti, F_Mi), where F_Mi is the sample center feature (center vector) of the class to which sample i belongs, calculated as

F_Mi = (1/n) × Σ_{k=1}^{n} f_Tk

where n is the total number of samples of the class to which sample i belongs. The selection factor of sample i is then determined as

min d(f_Ti, f_Tj) / d(f_Ti, F_Mi)
After the selection factor is determined, the representative samples and the non-representative samples are screened out according to the selection factor. The screening can be done by presetting a suitable threshold and judging whether the selection factor is smaller than the preset threshold; if so, the current sample is judged to be a representative sample, and if the selection factor is larger than or equal to the preset threshold, the current sample is judged to be a non-representative sample. The preset threshold may be set manually, or set according to the specific outputs of the teacher network. In this embodiment, the threshold ε is chosen as a statistical result determined by the training set and the teacher network: the whole training set is input into the teacher network to calculate the selection factors of all training samples, and after eliminating erroneous results, the selection-factor value that accounts for 30% of the total number of training samples (roughly the 30th-percentile value) is taken as the preset threshold for screening representative samples, so that the representative samples have the characteristic of being far from the sample center of their own class but close to samples of other classes. After the representative samples are screened out, a first loss function for the student network to process the representative samples and a second loss function for the student network to process the non-representative samples are determined, the preselected training set is then input into the student network, and the trained student network is obtained through the training described above.
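The screening step itself can be sketched as follows (assumed code, not from the patent): given the teacher's 256-dimensional features and the labels of the whole preselected training set, it computes each sample's distance to its class center and its minimum distance to samples of other classes, takes their quotient as the selection factor, and thresholds it at the value below which roughly 30% of the samples fall.

```python
import torch

def screen_representative(features, labels, quantile=0.30):
    """Sketch of the teacher-side screening; returns a boolean mask of representative samples.

    features: (N, 256) teacher feature-layer outputs for the preselected training set.
    labels:   (N,) integer class labels.
    """
    n = features.size(0)
    # sample center feature F_M of each class: the mean feature of that class
    classes = labels.unique()
    centers = torch.stack([features[labels == c].mean(dim=0) for c in classes])
    center_of = {int(c): centers[i] for i, c in enumerate(classes)}

    pair_dist = torch.cdist(features, features)          # all pairwise Euclidean distances
    same_class = labels.unsqueeze(0) == labels.unsqueeze(1)
    pair_dist[same_class] = float("inf")                 # ignore same-class pairs
    min_other = pair_dist.min(dim=1).values              # first distance: min d(f_Ti, f_Tj)

    # second distance: d(f_Ti, F_Mi), distance of each sample to its own class center
    to_center = torch.stack(
        [torch.dist(features[i], center_of[int(labels[i])]) for i in range(n)])

    factor = min_other / to_center                       # selection factor (quotient of the two)
    eps = torch.quantile(factor, quantile)               # preset threshold epsilon
    return factor < eps                                  # True = representative sample
```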
In this embodiment, the preselected training set is input into the teacher network, the selection factor of each sample is determined from the feature layer of the teacher network, and the representative samples in the training set are screened out according to the selection factor; a first loss function for processing the representative samples and a second loss function for processing the non-representative samples are then determined, and the representative samples and non-representative samples in the training set input into the student network are trained respectively to obtain a trained student network. By screening out the samples whose class is difficult to distinguish as representative samples and training them with a dedicated loss function, the characterization of the classes to which the representative samples belong can be improved, and thus the characterization capability of the student network for the class to which a single sample belongs is improved.
In addition, an embodiment of the present invention also provides an intelligent device, which includes a memory, a processor and an individual learning based model distillation improvement program stored on the memory and executable on the processor; the processor implements the steps of the individual learning based model distillation improvement method when executing the individual learning based model distillation improvement program. The intelligent device may be a face recognition device or a mobile terminal device with a face recognition function (such as a mobile phone, a computer, or a tablet), or another terminal device (such as an intelligent door lock or a supermarket face-scanning payment device).
Furthermore, an embodiment of the present invention also provides a computer-readable storage medium, on which a model distillation improvement program based on individual learning is stored, which when executed by a processor implements the steps of the model distillation improvement method based on individual learning as described above.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, a television, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A model distillation improvement method based on individual learning, characterized in that the model distillation improvement method based on individual learning comprises the following steps:
inputting a preselected training set into a teacher network, and generating a representative sample and a non-representative sample after being screened by the teacher network;
determining a first loss function for the student network to process representative samples and a second loss function for the student network to process non-representative samples;
and inputting the preselected training set into a student network, and training through the first loss function and the second loss function to obtain a trained student network.
2. The method of claim 1, wherein the step of generating representative samples and non-representative samples after screening by the teacher network comprises:
calculating a first Euclidean distance according to the sample characteristics of two samples of different classes, and calculating a second Euclidean distance according to the sample characteristic of one sample and the sample central characteristic of the class to which the sample belongs;
determining a selection factor of the sample according to the minimum first Euclidean distance and the second Euclidean distance;
and screening out representative samples and non-representative samples in the preselected training set according to the selection factor.
3. The method of claim 2, wherein the step of screening representative and non-representative samples in the training set based on the selection factor comprises:
judging whether the selection factor is smaller than a preset threshold value or not;
if the selection factor is smaller than a preset threshold value, judging that the current sample is a representative sample;
and if the selection factor is larger than or equal to a preset threshold value, judging that the current sample is a non-representative sample.
4. The method of model distillation improvement based on individual learning of claim 1, wherein the step of determining a first loss function for processing representative samples and a second loss function for processing non-representative samples of a student network comprises:
determining a first loss function of the student network for processing the representative sample according to the loss function of the teacher network, the characteristic layer of the teacher network and the characteristic layer of the student network.
5. The method of model distillation improvement based on individual learning of claim 4, wherein the step of determining a first loss function for the student network to process the representative sample based on the loss function of the teacher network, the feature layer of the teacher network, and the feature layer of the student network comprises:
calculating the fitting degree of the characteristic layer of the student network fitting the characteristic layer of the teacher network as a first part of a first loss function, and taking the loss function same as that of the teacher network as a second part of the first loss function;
determining a sum of a product of a first parameter and the first portion and a product of a second parameter and the second portion as the first loss function; the first parameter and the second parameter are both larger than zero, the sum of the first parameter and the second parameter is 1, and the first parameter is larger than the second parameter.
6. The method of model distillation improvement based on individual learning of claim 1, wherein the step of determining a first loss function for processing representative samples and a second loss function for processing non-representative samples of a student network comprises:
the same loss function as the teacher network is determined as a second loss function for the student network to process the non-representative samples.
7. The method of model distillation improvement based on individual learning of claim 5 or 6, wherein the step of determining a first loss function for processing representative samples and a second loss function for processing non-representative samples of a student network comprises:
defining the feature layer of the teacher network as f_T, the feature layer of the student network as f_S, and the first parameter as λ;
if the loss function of the teacher network is determined to be cosface, the first loss function is determined to be λ × Σ_k‖f_T − f_S‖² + (1 − λ) × cosface, and the second loss function is determined to be cosface.
8. The method of claim 1, wherein the step of training the trained student network with the first and second loss functions comprises:
adjusting the parameters of the student network by using the first loss function and the second loss function, wherein during the training process of adjusting the parameters of the student network, the weight vector of the loss layer of the student network is kept consistent with the weight vector of the loss layer of the teacher network;
and when the parameters of the student network are adjusted so that the loss function reaches its minimum, the trained student network is obtained.
9. An intelligent device, comprising a memory, a processor, and an individual learning based model distillation improvement program stored on the memory and executable on the processor, the processor implementing the steps of the individual learning based model distillation improvement method of any one of claims 1-8 when executing the individual learning based model distillation improvement program.
10. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon an individual learning based model distillation improvement program which, when executed by a processor, implements the steps of the individual learning based model distillation improvement method according to any one of claims 1 to 8.
CN201911387562.5A 2019-12-27 2019-12-27 Model distillation improvement method, device and storage medium based on individual learning Active CN111126573B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911387562.5A CN111126573B (en) 2019-12-27 2019-12-27 Model distillation improvement method, device and storage medium based on individual learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911387562.5A CN111126573B (en) 2019-12-27 2019-12-27 Model distillation improvement method, device and storage medium based on individual learning

Publications (2)

Publication Number Publication Date
CN111126573A true CN111126573A (en) 2020-05-08
CN111126573B CN111126573B (en) 2023-06-09

Family

ID=70504479

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911387562.5A Active CN111126573B (en) 2019-12-27 2019-12-27 Model distillation improvement method, device and storage medium based on individual learning

Country Status (1)

Country Link
CN (1) CN111126573B (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111738436A (en) * 2020-06-28 2020-10-02 电子科技大学中山学院 Model distillation method and device, electronic equipment and storage medium
CN111753793A (en) * 2020-06-30 2020-10-09 重庆紫光华山智安科技有限公司 Model training method and device, face screening method and electronic equipment
CN112381209A (en) * 2020-11-13 2021-02-19 平安科技(深圳)有限公司 Model compression method, system, terminal and storage medium
CN112560631A (en) * 2020-12-09 2021-03-26 昆明理工大学 Knowledge distillation-based pedestrian re-identification method
CN112801298A (en) * 2021-01-20 2021-05-14 北京百度网讯科技有限公司 Abnormal sample detection method, device, equipment and storage medium
CN113095251A (en) * 2021-04-20 2021-07-09 清华大学深圳国际研究生院 Human body posture estimation method and system
CN113221964A (en) * 2021-04-22 2021-08-06 华南师范大学 Single sample image classification method, system, computer device and storage medium
CN113344213A (en) * 2021-05-25 2021-09-03 北京百度网讯科技有限公司 Knowledge distillation method, knowledge distillation device, electronic equipment and computer readable storage medium
CN113449610A (en) * 2021-06-08 2021-09-28 杭州格像科技有限公司 Gesture recognition method and system based on knowledge distillation and attention mechanism
CN113947590A (en) * 2021-10-26 2022-01-18 四川大学 Surface defect detection method based on multi-scale attention guidance and knowledge distillation
WO2022057468A1 (en) * 2020-09-18 2022-03-24 苏州浪潮智能科技有限公司 Deep learning model inference acceleration method and system, and device and medium
CN114639001A (en) * 2022-04-22 2022-06-17 武汉中科通达高新技术股份有限公司 Training method and recognition method of face attribute recognition network and related equipment
WO2024036847A1 (en) * 2022-08-16 2024-02-22 北京百度网讯科技有限公司 Image processing method and apparatus, and electronic device and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106408012A (en) * 2016-09-09 2017-02-15 江苏大学 Tea infrared spectrum classification method of fuzzy discrimination clustering
CN107087256A (en) * 2017-03-17 2017-08-22 上海斐讯数据通信技术有限公司 A kind of fingerprint cluster method and device based on WiFi indoor positionings
CN108108807A (en) * 2017-12-29 2018-06-01 北京达佳互联信息技术有限公司 Learning-oriented image processing method, system and server
CN109272115A (en) * 2018-09-05 2019-01-25 宽凳(北京)科技有限公司 A kind of neural network training method and device, equipment, medium
CN110009052A (en) * 2019-04-11 2019-07-12 腾讯科技(深圳)有限公司 A kind of method of image recognition, the method and device of image recognition model training

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111738436A (en) * 2020-06-28 2020-10-02 电子科技大学中山学院 Model distillation method and device, electronic equipment and storage medium
CN111738436B (en) * 2020-06-28 2023-07-18 电子科技大学中山学院 Model distillation method and device, electronic equipment and storage medium
CN111753793A (en) * 2020-06-30 2020-10-09 重庆紫光华山智安科技有限公司 Model training method and device, face screening method and electronic equipment
CN111753793B (en) * 2020-06-30 2022-11-22 重庆紫光华山智安科技有限公司 Model training method and device, face screening method and electronic equipment
WO2022057468A1 (en) * 2020-09-18 2022-03-24 苏州浪潮智能科技有限公司 Deep learning model inference acceleration method and system, and device and medium
CN112381209A (en) * 2020-11-13 2021-02-19 平安科技(深圳)有限公司 Model compression method, system, terminal and storage medium
CN112381209B (en) * 2020-11-13 2023-12-22 平安科技(深圳)有限公司 Model compression method, system, terminal and storage medium
CN112560631A (en) * 2020-12-09 2021-03-26 昆明理工大学 Knowledge distillation-based pedestrian re-identification method
CN112801298A (en) * 2021-01-20 2021-05-14 北京百度网讯科技有限公司 Abnormal sample detection method, device, equipment and storage medium
CN112801298B (en) * 2021-01-20 2023-09-01 北京百度网讯科技有限公司 Abnormal sample detection method, device, equipment and storage medium
CN113095251A (en) * 2021-04-20 2021-07-09 清华大学深圳国际研究生院 Human body posture estimation method and system
CN113221964B (en) * 2021-04-22 2022-06-24 华南师范大学 Single sample image classification method, system, computer device and storage medium
CN113221964A (en) * 2021-04-22 2021-08-06 华南师范大学 Single sample image classification method, system, computer device and storage medium
CN113344213A (en) * 2021-05-25 2021-09-03 北京百度网讯科技有限公司 Knowledge distillation method, knowledge distillation device, electronic equipment and computer readable storage medium
CN113449610A (en) * 2021-06-08 2021-09-28 杭州格像科技有限公司 Gesture recognition method and system based on knowledge distillation and attention mechanism
CN113947590A (en) * 2021-10-26 2022-01-18 四川大学 Surface defect detection method based on multi-scale attention guidance and knowledge distillation
CN114639001A (en) * 2022-04-22 2022-06-17 武汉中科通达高新技术股份有限公司 Training method and recognition method of face attribute recognition network and related equipment
WO2024036847A1 (en) * 2022-08-16 2024-02-22 北京百度网讯科技有限公司 Image processing method and apparatus, and electronic device and storage medium

Also Published As

Publication number Publication date
CN111126573B (en) 2023-06-09

Similar Documents

Publication Publication Date Title
CN111126573B (en) Model distillation improvement method, device and storage medium based on individual learning
CN111476306B (en) Object detection method, device, equipment and storage medium based on artificial intelligence
US11216690B2 (en) System and method for performing image processing based on a damage assessment image judgement model
CN108875522B (en) Face clustering method, device and system and storage medium
CN106547744B (en) Image retrieval method and system
US8750573B2 (en) Hand gesture detection
US8463025B2 (en) Distributed artificial intelligence services on a cell phone
CN109002766B (en) Expression recognition method and device
KR102629380B1 (en) Method for Distinguishing a Real Three-Dimensional Object from a Two-Dimensional Spoof of the Real Object
CN110866471A (en) Face image quality evaluation method and device, computer readable medium and communication terminal
US20120027252A1 (en) Hand gesture detection
CN110175298B (en) User matching method
CN107679475B (en) Store monitoring and evaluating method and device and storage medium
CN112036331B (en) Living body detection model training method, device, equipment and storage medium
CN109241888B (en) Neural network training and object recognition method, device and system and storage medium
WO2021090771A1 (en) Method, apparatus and system for training a neural network, and storage medium storing instructions
CN112329679A (en) Face recognition method, face recognition system, electronic equipment and storage medium
CN110717407A (en) Human face recognition method, device and storage medium based on lip language password
CN111784665A (en) OCT image quality assessment method, system and device based on Fourier transform
CN107944363A (en) Face image processing process, system and server
CN113255557B (en) Deep learning-based video crowd emotion analysis method and system
CN109963072B (en) Focusing method, focusing device, storage medium and electronic equipment
CN111476144B (en) Pedestrian attribute identification model determining method and device and computer readable storage medium
CN110232417B (en) Image recognition method and device, computer equipment and computer readable storage medium
Demir et al. Phase correlation based redundancy removal in feature weighting band selection for hyperspectral images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant