CN112749800A - Neural network model training method, device and storage medium

Info

Publication number
CN112749800A
Authority
CN
China
Prior art keywords
training
network
neural network
network model
process models
Prior art date
Legal status
Pending
Application number
CN202110004383.XA
Other languages
Chinese (zh)
Inventor
黄高 (Gao Huang)
王朝飞 (Chaofei Wang)
宋士吉 (Shiji Song)
杨琪森 (Qisen Yang)
Current Assignee
Tsinghua University
Original Assignee
Tsinghua University
Priority date
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN202110004383.XA
Publication of CN112749800A
Legal status: Pending (current)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 50/00: Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q 50/10: Services
    • G06Q 50/20: Education
    • G06Q 50/205: Education administration or guidance


Abstract

A neural network model training method, apparatus and storage medium are provided. The neural network model training method comprises the following steps: during the training of a teacher network, saving the teacher network at preselected different time nodes as process models; integrating the saved process models to form a new teacher network; and training a student network with the new teacher network.

Description

Neural network model training method, device and storage medium
Technical Field
The present disclosure relates to the field of neural network model compression, and in particular, to a neural network model training method, apparatus, and storage medium.
Background
Compared with large deep neural network models, lightweight neural network models generally perform worse and have difficulty meeting applications with high performance requirements. Model compression is the most common approach to this problem and generally includes model pruning, parameter quantization, knowledge distillation, and similar methods.
Knowledge distillation is a concept proposed by Hinton in 2015. Its aim is to migrate the knowledge of a pre-trained teacher network (generally a large network with superior performance and high complexity) to a student network (a lightweight network with limited performance and low complexity, to be deployed on the application side) by introducing the teacher's knowledge as part of the training loss function used to train the student network. Over years of development, many researchers have proposed various ways to represent the teacher network's knowledge, including matching the softened classification labels (i.e., soft labels) of the teacher and student networks, intermediate-layer features, attention maps, relationships between instances, or relationships between layers in the network structure. However, in these methods the knowledge learned by the student network is only that of the fully trained teacher network; it does not include the knowledge generated during the teacher network's own training, so the knowledge migration is not complete enough.
Disclosure of Invention
The following is a summary of the subject matter described in detail herein. This summary is not intended to limit the scope of the claims.
The embodiment of the application provides a neural network model training method which can expand the knowledge that a student network can learn.
The embodiment of the application provides a neural network model training method, which comprises the following steps:
during the training of the teacher network, saving the teacher network at preselected different time nodes as process models;
integrating the stored multiple process models to form a new teacher network;
and training a student network by using the new teacher network.
The embodiment of the application also provides a neural network model training device, which comprises a memory and a processor, wherein the memory is used for storing a program for training the neural network model; the processor is used for reading the program for training the neural network model and executing the neural network model training method.
Embodiments of the present application further provide a computer-readable storage medium storing computer-executable instructions for performing the neural network model training method described above.
The technical solution of the embodiments of the present application enables the student network to learn not only the knowledge of the teacher network after it has converged (solidified) but also the process experience the teacher network accumulated while learning, so that the knowledge transfer is more complete and the student network learns richer and more generalizable knowledge.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the application. Other advantages of the present application may be realized and attained by the instrumentalities and combinations particularly pointed out in the specification and the drawings.
Other aspects will be apparent upon reading and understanding the attached drawings and detailed description.
Drawings
The accompanying drawings are included to provide an understanding of the present disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the examples serve to explain the principles of the disclosure and not to limit the disclosure.
FIG. 1 is a flow chart of a neural network model training method in an embodiment of the present application;
FIG. 2 is a second flowchart of a neural network model training method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an embodiment of a neural network model training apparatus;
FIG. 4 is a second schematic diagram of the neural network model training apparatus in an embodiment of the present application.
Detailed Description
The present application describes embodiments, but the description is illustrative rather than limiting and it will be apparent to those of ordinary skill in the art that many more embodiments and implementations are possible within the scope of the embodiments described herein. Although many possible combinations of features are shown in the drawings and discussed in the detailed description, many other combinations of the disclosed features are possible. Any feature or element of any embodiment may be used in combination with or instead of any other feature or element in any other embodiment, unless expressly limited otherwise.
The present application includes and contemplates combinations of features and elements known to those of ordinary skill in the art. The embodiments, features and elements disclosed in this application may also be combined with any conventional features or elements to form a unique inventive concept as defined by the claims. Any feature or element of any embodiment may also be combined with features or elements from other inventive aspects to form yet another unique inventive aspect, as defined by the claims. Thus, it should be understood that any of the features shown and/or discussed in this application may be implemented alone or in any suitable combination. Accordingly, the embodiments are not limited except as by the appended claims and their equivalents. Furthermore, various modifications and changes may be made within the scope of the appended claims.
Further, in describing representative embodiments, the specification may have presented the method and/or process as a particular sequence of steps. However, to the extent that the method or process does not rely on the particular order of steps set forth herein, the method or process should not be limited to the particular sequence of steps described. Other orders of steps are possible as will be understood by those of ordinary skill in the art. Therefore, the particular order of the steps set forth in the specification should not be construed as limitations on the claims. Further, the claims directed to the method and/or process should not be limited to the performance of their steps in the order written, and one skilled in the art can readily appreciate that the sequences may be varied and still remain within the spirit and scope of the embodiments of the present application.
As shown in fig. 1, an embodiment of the present application provides a neural network model training method, including:
s100, in the training process of the teacher network, the teacher networks of different time nodes selected in advance are respectively stored as process models;
s101, integrating a plurality of stored process models to form a new teacher network;
s102, the new teacher network is used for training the student network.
In step S100, the teacher network may be any type of neural network model; a high-performance, high-complexity model is generally selected as the teacher network. The different training time nodes may be selected manually or automatically, uniformly or non-uniformly, and the process models may be chosen as needed.
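As a minimal sketch of step S100, the following PyTorch-style training loop saves a copy of the teacher at a fixed epoch interval; the function name, optimizer choice, and save schedule are illustrative assumptions rather than details fixed by the patent.

```python
import copy
import torch

def train_teacher_and_save_process_models(teacher, train_loader, epochs=200,
                                           save_every=20, lr=1e-3, device="cuda"):
    """Train the teacher network and snapshot it at preselected epochs (step S100)."""
    teacher = teacher.to(device)
    optimizer = torch.optim.Adam(teacher.parameters(), lr=lr)
    criterion = torch.nn.CrossEntropyLoss()
    process_models = []  # copies of the teacher saved at different time nodes

    for epoch in range(1, epochs + 1):
        teacher.train()
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = criterion(teacher(x), y)
            loss.backward()
            optimizer.step()

        if epoch % save_every == 0:  # uniform selection; any schedule can be substituted
            process_models.append(copy.deepcopy(teacher).eval())

    return teacher, process_models
```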
In step S101, the stored process models are integrated into a new teacher network. There are multiple ways of integrating or fusing the process models, and the embodiments of the present application do not limit the choice.
In step S102, the student network may be any type of neural network model; it is generally a model with limited performance and low complexity.
The inventors of the present application found in practice that existing knowledge distillation methods are all based on learning the knowledge of the fully trained teacher network and do not make use of the knowledge generated during the teacher network's own training. By analogy with a real educational scenario, the teacher only passes on to the student the knowledge that has already solidified after the teacher's own learning, and does not pass on the knowledge gained during the learning process; in reality, process experience is often more important than result knowledge.
In the solution of the embodiments of the present application, models from the teacher network's training process are integrated. These process models reflect the learning trajectory of the teacher network and contain its learning experience. They are integrated into a new teacher network, which is then used to train the student network. The student network thus learns not only the knowledge of the converged teacher network but also the process experience the teacher network gained while learning, so the knowledge migration is more complete and the student network learns richer, more generalizable knowledge.
In some applications, deep neural networks have moved from large devices to edge devices such as embedded and mobile devices. Such devices have limited computing capability and storage space, making it difficult to deploy large deep neural networks on them. Knowledge distillation can generally improve the performance of the student network to a certain extent and thereby improve the deployment effect of lightweight networks on edge devices. In such application scenarios, the solution of the embodiments of the present application helps further improve the deployment effect of lightweight networks on edge devices.
In an exemplary embodiment, training a student network with a new teacher network includes:
performing knowledge distillation between the new teacher network and the student network according to the selected knowledge distillation method, and constructing a training loss function corresponding to that knowledge distillation method;
and training the student network by using the loss function.
The knowledge of the new teacher network can be transferred to the student network by knowledge distillation; the embodiments of the present application do not limit which distillation method is used.
In an exemplary embodiment, integrating the stored process models includes:
assigning a weight value ω_j to each stored process model and integrating the process models according to the assigned weight values ω_j;
wherein the weight values ω_j corresponding to the process models sum to 1.
In an exemplary embodiment, after the process models have each been assigned a weight value ω_j, the output of each process model may be multiplied by its weight value and the results summed, the sum being taken as the output of the new teacher network. However, the embodiments of the present application do not limit how the weight values are used.
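One possible realization of this weighted integration (step S101) is the small PyTorch module below; the class name and the assumption that each process model returns logits are illustrative choices, not mandated by the patent.

```python
import torch
import torch.nn as nn

class EnsembleTeacher(nn.Module):
    """New teacher T'_θ: weighted sum of the logits of the saved process models."""

    def __init__(self, process_models, weights=None):
        super().__init__()
        self.process_models = nn.ModuleList(process_models)
        n = len(process_models)
        if weights is None:
            weights = torch.full((n,), 1.0 / n)  # uniform weights that sum to 1
        self.register_buffer("weights", torch.as_tensor(weights, dtype=torch.float32))

    @torch.no_grad()
    def forward(self, x):
        outputs = torch.stack([m(x) for m in self.process_models])  # (n, batch, classes)
        w = self.weights.view(-1, 1, 1)
        return (w * outputs).sum(dim=0)  # T'(x) = sum_j ω_j * T_j(x)
```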
In an exemplary embodiment, assigning a weight value ω_j to each stored process model comprises any one of the following modes:
the weight values are preset for the process models, or the process models obtain their respective weight values through training in a self-learning manner.
The weight values ω_j may be distributed uniformly over the process models: for example, if a total of n process models are selected, each process model gets ω_j = 1/n. The weight value ω_j assigned to each process model may also be tuned to different values according to the actual effect, or the process models may obtain their respective weight values through training in a self-learning manner. The embodiments of the present application do not limit how the weight values are set.
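The three weight-setting modes can be sketched as follows; the softmax parameterization used for the self-learning case is one convenient way to keep the weights summing to 1 and is an assumption of this sketch, not a requirement of the patent.

```python
import torch
import torch.nn as nn

n = 10  # number of process models

# Uniform mode: every process model gets weight 1/n.
uniform_weights = torch.full((n,), 1.0 / n)

# Manual mode: hand-tuned weights (must sum to 1), e.g. emphasizing later process models.
manual_weights = torch.tensor([0.05, 0.05, 0.05, 0.05, 0.10, 0.10, 0.10, 0.15, 0.15, 0.20])

# Self-learning mode: learn unconstrained logits and normalize with softmax,
# so the effective weights always sum to 1 while being trained end to end.
class LearnableWeights(nn.Module):
    def __init__(self, n):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(n))

    def forward(self):
        return torch.softmax(self.logits, dim=0)
```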
In an exemplary embodiment, integrating the plurality of process models according to the assigned weight values ω_j comprises:
for an arbitrary input sample x_i,
T'_θ(x_i) = ∑_{j=1}^{n} ω_j · T_θ^(j)(x_i),
where T'_θ represents the new teacher network, T_θ^(j) represents the j-th process model, and n represents the number of process models.
In an exemplary embodiment, the teacher networks at the preselected different time nodes include:
teacher networks at different epoch times selected according to a preset interval.
The network models at n moments in the training process can be selected and stored according to actual needs (such as computational overhead and running speed). For example, if the complete training process contains 200 epochs, the 10 models generated at the 20th, 40th, 60th, 80th, 100th, 120th, 140th, 160th, 180th and 200th epochs can be selected uniformly as the process models T_θ^(1), T_θ^(2), …, T_θ^(10).
The preset interval may be the time interval between epochs, or the interval in the number of epochs, i.e., selecting a model every several epochs. The models can be selected at a single fixed interval, or different intervals can be set; the embodiments of the present application do not limit the meaning of the interval or how it is set.
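A tiny sketch of this interval-based selection follows; the concrete numbers simply mirror the 200-epoch example above, and the non-uniform schedule is an illustrative alternative.

```python
# Uniform selection: every 20th epoch of a 200-epoch run gives 10 process models.
total_epochs, interval = 200, 20
uniform_save_epochs = list(range(interval, total_epochs + 1, interval))
# -> [20, 40, 60, 80, 100, 120, 140, 160, 180, 200]

# Non-uniform selection is equally valid, e.g. denser snapshots late in training.
nonuniform_save_epochs = [50, 100, 150, 170, 180, 190, 195, 200]
```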
In an exemplary embodiment, the knowledge distillation method employed comprises: matching the softened classification labels (i.e., soft labels, KD) of the teacher network and the student network, attention maps (AT), intermediate-layer features (Hints), relationships between instances, or relationships between layers in the network structure, and the like.
In an exemplary embodiment, performing knowledge distillation according to the selected knowledge distillation method and constructing the training loss function corresponding to that method includes:
when the soft-label knowledge distillation method is selected, a soft-label loss term is added on top of the cross-entropy loss function when training the student network, and the training loss function takes the form
L_KD = L_CE(σ(z_s), y) + α · T² · L_CE(σ(z_t/T), σ(z_s/T)),
where L_KD represents the training loss function, L_CE represents the cross-entropy loss function, σ(·) represents the softmax function, y represents the true class label, z_s and z_t represent the logits outputs of the student network and the new teacher network respectively, T represents the temperature parameter, and α represents the loss-term adjustment coefficient.
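A sketch of this soft-label loss in PyTorch is shown below; the function name is assumed, and the softened term is written with a KL divergence, which differs from the cross-entropy form above only by a constant that does not affect the student's gradients.

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.5):
    """Hard-label cross-entropy plus a softened-label term weighted by alpha * T^2."""
    hard_loss = F.cross_entropy(student_logits, targets)
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    return hard_loss + alpha * soft_loss
```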
As shown in fig. 2, the scheme of the embodiments of the present application is described below, taking the training of a student network by a knowledge distillation method as an example:
First, a teacher network T_θ is trained on a known data set; a larger network such as ResNet50 or DenseNet110 is generally selected. According to actual needs (such as computational overhead and running speed), the network models at n epoch moments during training are selected and stored as process models.
Secondly, the n process models are integrated to obtain a stronger teacher network T'_θ. For example, a weight value ω_j may be assigned to each stored process model and the process models integrated according to the assigned weights, so that for any input sample x_i,
T'_θ(x_i) = ∑_{j=1}^{n} ω_j · T_θ^(j)(x_i),
where ω_j is a weight and ∑_j ω_j = 1.
Thirdly, the student network S to be deployed is selected, typically a small network such as ShuffleNet V2 or MobileNet V2. With T'_θ as the teacher network and S as the student network, knowledge distillation is carried out using a conventional distillation method, the training loss function L_KD of the student network is constructed, and the student network S is trained using the constructed loss function.
Finally, the trained student network S can be tested and deployed.
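Putting the three steps together, the condensed sketch below trains the student under the integrated teacher, reusing the EnsembleTeacher and kd_loss sketches above; model constructors, the data loader, and the hyperparameter defaults are illustrative placeholders.

```python
import torch

def distill_student(student, ensemble_teacher, train_loader, epochs=200,
                    T=4.0, alpha=0.5, lr=1e-3, device="cuda"):
    """Step S102: train the student network under the integrated (new) teacher network."""
    student = student.to(device)
    ensemble_teacher = ensemble_teacher.to(device).eval()
    optimizer = torch.optim.Adam(student.parameters(), lr=lr)

    for _ in range(epochs):
        student.train()
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            with torch.no_grad():
                teacher_logits = ensemble_teacher(x)  # T'(x): weighted ensemble output
            loss = kd_loss(student(x), teacher_logits, y, T=T, alpha=alpha)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    return student  # ready for testing and deployment
```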
Compared with other knowledge-distillation-based methods for training a student network, in the method provided by the embodiments of the present application the knowledge the student network can learn is not limited to that of the converged teacher network but also includes the experience of the teacher network's learning process, so the knowledge the student network can learn is expanded and the student network generalizes better.
As shown in fig. 3, an embodiment of the present application provides a neural network model training apparatus, including a memory and a processor, where the memory is used to store a program for performing neural network model training; the processor is used for reading a program for training the neural network model and can execute the neural network model training method in the embodiment.
As shown in fig. 4, an embodiment of the present application further provides a neural network model training apparatus, including:
the teacher network training module is internally provided with a linear classifier, a cosine distance classifier and a cross entropy loss function, has a process model storage function, and can respectively store the preselected teacher networks with different time nodes as process models according to setting in the training process of the teacher network;
the process model integration module is internally provided with an integration algorithm and is used for integrating a plurality of stored process models to form a new teacher network;
and the student network training module is internally provided with the setting of training parameters and is set to train the student network by utilizing a new teacher network.
In an exemplary embodiment, the neural network model training apparatus further includes:
and the loss function reconstruction module is internally provided with a plurality of knowledge distillation modes, and can automatically reconstruct the loss function of the student network according to the corresponding knowledge distillation modes.
In an exemplary embodiment, the process model integration module further includes selectable modes for the weight parameters used to integrate the different process models, and is configured to assign a weight value ω_j to each stored process model and to integrate the process models according to the assigned weight values ω_j;
wherein the weight values ω_j corresponding to the process models sum to 1.
In an exemplary embodiment, the selectable modes for the weight parameters used to integrate the different process models may include:
a uniform mode: the weight values are distributed equally over the process models;
a manual mode: a weight value is preset for each process model;
a self-learning mode: the process models obtain their respective weight values through training.
In an exemplary embodiment, the teacher web training module is further configured to: teacher networks at different epoch times are selected at preselected intervals.
In an exemplary embodiment, the knowledge distillation methods built into the loss function reconstruction module include: matching the softened classification labels (i.e., soft labels, KD), attention maps (AT), intermediate-layer features (Hints), instance-to-instance relationships, or layer-to-layer relationships in the network structure of the teacher network and the student network, and the like.
In an exemplary embodiment, the neural network model training apparatus further includes:
the data set preprocessing module is internally provided with a data preprocessing function and can be used for cleaning and standardizing the training data set, increasing samples, inputting a shuffle input sequence, dividing the mini-batch into different sizes and the like.
In an exemplary embodiment, the neural network model training apparatus further includes:
the teacher network selection module is internally provided with a plurality of network structures with higher complexity, and a user can select different teacher networks.
In an exemplary embodiment, the neural network model training apparatus further includes:
the student network selection module is internally provided with various lightweight network structures, and a user can select different student networks to perform learning training.
In an exemplary embodiment, the student network training module is further configured to train the student network in conjunction with the reconstructed loss function.
In an exemplary embodiment, the neural network model training apparatus further includes:
and the testing and deploying module is used for testing the network model obtained by training, storing the model after the test is finished and passed and implementing deployment.
The neural network model training apparatus can implement the neural network model training method of any of the above embodiments; implementation details are not repeated here. Modules of the apparatus can be removed or added according to the actual application scenario, and the embodiments of the present application do not limit the module configuration.
The embodiment of the present application further provides a computer-readable storage medium, which stores computer-executable instructions for executing the neural network model training method in the foregoing embodiment.
The neural network model training method of the embodiments of the present application is described below with Example 1:
example 1
Taking an image classification application scenario as an example: the benchmark data set is CIFAR-100, with the training and test sets split in the standard way into 50,000 and 10,000 pictures respectively; the evaluation criterion is top-1 accuracy; ResNet18, ResNet50 and DenseNet121 are selected as teacher networks; MobileNetV2 and ShuffleNetV2 are selected as student networks; a linear classifier is used as the classifier; and random cropping and horizontal flipping are used as the sample augmentation strategy.
The main parameters are set as follows:
the batch size is selected to be 128, the iteration number is 200, the optimizer selects Adam, the coding tool is Pythrch, and the Titan Xp video card is adopted for model training. The time for selecting and storing the process models is integral multiple epoch of 20, namely 10 process models are stored in one teacher network training, the average value 1/10 is selected by integrating weights, the temperature T is 4, and the loss term coefficient alpha is 0.5.
The results of the experimental comparison on the CIFAR-100 dataset are shown in the following table, where Teacher denotes the teacher network, Student denotes the student network, Baseline denotes the result of training the student network alone without introducing a teacher network (i.e., without knowledge distillation), KD denotes the result of supervising a single student with the soft-label output of a conventional single teacher, and Ours denotes the result obtained with the method of the embodiments of the present application.
[Table: top-1 accuracy on CIFAR-100 for each teacher-student pair under Baseline, KD and Ours]
As can be seen from the table, compared with training the student network alone, the method of the embodiments of the present application improves accuracy by 4.68% on average, and compared with the conventional KD method it improves accuracy by 2.41% on average, which fully demonstrates the effectiveness and generalization of the method of the embodiments of the present application.
It will be understood by those of ordinary skill in the art that all or some of the steps of the methods, systems and functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components in cooperation. Some or all of the components may be implemented as software executed by a processor, such as a digital signal processor or microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data, as is well known to those of ordinary skill in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embodies computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media, as is well known to those skilled in the art.

Claims (10)

1. A neural network model training method is characterized by comprising the following steps:
in the training process of a teacher network, saving the teacher network at preselected different time nodes as process models;
integrating the stored multiple process models to form a new teacher network;
and training a student network by using the new teacher network.
2. The neural network model training method of claim 1, wherein said training a student network with the new teacher network comprises:
performing knowledge distillation between the new teacher network and the student network according to the selected knowledge distillation method, and constructing a training loss function corresponding to the knowledge distillation method;
and training the student network by using the loss function.
3. The neural network model training method of claim 1, wherein the integrating the stored plurality of process models comprises:
assigning a weight value ω_j to each of the stored process models, and integrating the plurality of process models according to the assigned weight values ω_j;
wherein the weight values ω_j corresponding to the process models sum to 1.
4. The neural network model training method of claim 3, wherein assigning the weight values ω_j to the stored process models comprises any one of the following modes:
presetting a weight value for each process model, or having the process models obtain their respective weight values through training in a self-learning manner.
5. The neural network model training method of claim 3, wherein integrating the plurality of process models according to the assigned weight values ω_j comprises:
for an arbitrary input sample x_i,
T'_θ(x_i) = ∑_{j=1}^{n} ω_j · T_θ^(j)(x_i),
wherein T'_θ represents the new teacher network, T_θ^(j) represents the j-th process model, and n represents the number of process models.
6. The neural network model training method of claim 1, wherein the teacher networks at the preselected different time nodes comprise:
teacher networks at different epoch times selected according to a preset interval.
7. The neural network model training method of claim 2, wherein the knowledge distillation method comprises: matching softened classification labels, attention maps, intermediate-layer features, instance-to-instance relationships, or layer-to-layer relationships in the network structure of the new teacher network and the student network.
8. The neural network model training method of claim 7, wherein performing the knowledge distillation according to the selected knowledge distillation method and constructing the training loss function corresponding to the knowledge distillation method comprises:
when the soft-label knowledge distillation method is selected, the training loss function takes the form
L_KD = L_CE(σ(z_s), y) + α · T² · L_CE(σ(z_t/T), σ(z_s/T)),
wherein L_KD represents the training loss function, L_CE represents the cross-entropy loss function, σ(·) represents the softmax function, y represents the true class label, z_s and z_t represent the logits outputs of the student network and the new teacher network respectively, T represents the temperature parameter, and α represents the loss-term adjustment coefficient.
9. A neural network model training apparatus, comprising a memory and a processor, wherein the memory is configured to store a program for training a neural network model, and the processor is configured to read the program for training the neural network model and execute the neural network model training method according to any one of claims 1 to 8.
10. A computer-readable storage medium storing computer-executable instructions for performing the neural network model training method of any one of claims 1-8.
CN202110004383.XA, priority date 2021-01-04, filing date 2021-01-04: Neural network model training method, device and storage medium; published as CN112749800A (en), status Pending

Priority Applications (1)

CN202110004383.XA (CN112749800A): priority date 2021-01-04, filing date 2021-01-04, Neural network model training method, device and storage medium


Publications (1)

CN112749800A (en): published 2021-05-04

Family

ID=75649886

Family Applications (1)

CN202110004383.XA, CN112749800A (en), Pending: Neural network model training method, device and storage medium

Country Status (1)

CN: CN112749800A (en)

Cited By (2)

* Cited by examiner, † Cited by third party

CN113920574A * (priority 2021-12-15, published 2022-01-11, 深圳市视美泰技术股份有限公司): Training method and device for picture quality evaluation model, computer equipment and medium
CN114118061A * (priority 2021-11-30, published 2022-03-01, 深圳市北科瑞声科技股份有限公司): Lightweight intention recognition model training method, device, equipment and storage medium



Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
RJ01: Rejection of invention patent application after publication

Application publication date: 2021-05-04