CN111126599A - Neural network weight initialization method based on transfer learning - Google Patents

Neural network weight initialization method based on transfer learning

Info

Publication number
CN111126599A
Authority
CN
China
Prior art keywords
model
student
student model
teacher
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911321102.2A
Other languages
Chinese (zh)
Other versions
CN111126599B (en)
Inventor
范益波
刘超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University filed Critical Fudan University
Priority to CN201911321102.2A priority Critical patent/CN111126599B/en
Publication of CN111126599A publication Critical patent/CN111126599A/en
Application granted granted Critical
Publication of CN111126599B publication Critical patent/CN111126599B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/042Knowledge-based neural networks; Logical representations of neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T9/00Image coding
    • G06T9/002Image coding using neural networks
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/80Details of filtering operations specially adapted for video compression, e.g. for pixel interpolation
    • H04N19/82Details of filtering operations specially adapted for video compression, e.g. for pixel interpolation involving filtering within a prediction loop
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of neural network models, and in particular relates to a neural network weight initialization method based on transfer learning. In the method, for a specified target task, a neural network model of higher complexity, the teacher model, is designed and trained; after training, the feature maps it generates are used to guide the weight initialization of a student model. The difference between feature maps is computed either directly or after mapping the feature maps into a reproducing kernel Hilbert space, where a kernel function method simplifies the computation. The simpler student model thereby obtains a good weight initialization; after the initialization is completed, the student model is trained in the normal way, so that it reaches a better global convergence point and performs better. The invention effectively improves the performance of the student model without increasing its complexity.

Description

Neural network weight initialization method based on transfer learning
Technical Field
The invention belongs to the technical field of neural network models, and particularly relates to a neural network weight initialization method based on knowledge transfer learning.
Background
Neural networks have developed rapidly in recent years, especially in computer vision and natural language processing, where their performance on many tasks exceeds that of humans. However, their high computational cost and heavy training requirements are major obstacles to practical deployment. How to make a lightweight model perform better has therefore become a pressing problem.
Over the past few years, researchers have proposed various schemes to help neural networks reach better convergence. They fall mainly into the following classes. The first is based on knowledge distillation and knowledge transfer: a trained teacher model helps the student model learn better by adding extra loss terms during the student's training, improving the student's performance without increasing its complexity. The second is model quantization and pruning. Quantization converts the original 32-bit arithmetic on the network weights into 8-bit or even 1-bit arithmetic, greatly reducing the complexity of the weights and hence the amount of computation. Pruning directly deletes some of the connections of the neural network and then evaluates whether the resulting loss in model quality is negligible, thereby effectively reducing model complexity.
Disclosure of Invention
The invention aims to provide a neural network weight initialization method that effectively improves the performance of a model without increasing its complexity.
The neural network weight initialization method provided by the invention is based on knowledge transfer learning. A neural network model of higher complexity is called the teacher model (such a high-complexity teacher model is difficult to deploy in practical engineering). Prior knowledge learned from the relatively complex teacher model helps a neural network model of low complexity (called the student model, which offers a good balance between complexity and performance in practical applications) start from a good initialization state and escape local optima during training, so that a better training result is achieved.
First, for the specified task, a neural network model of higher complexity, the teacher model, is designed and trained; after training, the feature maps it generates are used to guide the weight initialization of the student model. The difference between feature maps is computed either directly or after mapping the feature maps into a reproducing kernel Hilbert space, where a kernel function method simplifies the computation. The simpler student model thereby obtains a good weight initialization; after the initialization is completed, the student model is trained in the normal way, so that it reaches a better global convergence point and performs better.
The invention effectively keeps the student model from converging to a poor local optimum caused by its dependence on the initial parameters at the start of training.
The invention provides a neural network weight initialization method, which comprises the following specific steps:
(1) A specific learning task usually comes with a conventional loss function and a model structure. First, a teacher model is designed for the target task and trained with the conventional loss function;
(2) the intermediate-layer outputs of the trained teacher model are then exported, and feature maps are obtained from them through a mapping; the mapping may be attention transfer [Sergey Zagoruyko and Nikos Komodakis, "Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer," arXiv preprint arXiv:1612.03928, 2016], or a mapping into a reproducing kernel Hilbert space using a kernel function [Zehao Huang and Naiyan Wang, "Like what you like: Knowledge distill via neuron selectivity transfer," arXiv preprint arXiv:1707.01219, 2017]; the specific formulas are given as (2) and (3) below;
(3) a student model with a simpler structure is designed; the student model and the teacher model should share the same kind of network structure, i.e., the basic network layers that make up the two networks should be of the same type. For example, if the network structure is a serial network built from convolutional layers, the teacher model has more convolutional layers and more feature maps, while the student model has fewer convolutional layers and fewer feature maps;
(4) the student model is trained with a loss function defined as the mean square error between the feature maps computed in step (2) and the feature maps obtained from the student model through the same mapping; after this training, the weights of the student model are no longer an initialization drawn from a normal or uniform distribution in the traditional sense; instead, they have been adjusted by learning knowledge from the teacher model, so the student model is specifically initialized and gains the ability to approach the teacher model's performance;
(5) after the initialization is completed, the student model is trained one final time with the conventional loss function, yielding a usable student model.
Here, the conventional loss function refers to the mean square error commonly used for the task at hand.
In the invention, the complexity of the teacher model is higher than that of the student model, so that the student model can learn the teacher model's features well.
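To make steps (1) through (5) concrete, the following is a minimal PyTorch-style sketch of the three training stages. It is an illustration only, not the patent's reference implementation: the names train_teacher, init_student and finetune_student, the intermediate() method for exporting intermediate-layer outputs, the map_fn mapping argument, and the optimizer settings are all assumptions.

```python
# Minimal sketch of the three training stages, assuming PyTorch, models that
# expose an intermediate() method returning their intermediate-layer outputs,
# and a DataLoader yielding (reconstructed, original) pixel pairs.
import torch
import torch.nn.functional as F


def train_teacher(teacher, loader, epochs=1, lr=1e-4):
    # Step (1): train the high-complexity teacher with the conventional MSE loss.
    opt = torch.optim.Adam(teacher.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            loss = F.mse_loss(teacher(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()


def init_student(student, teacher, loader, map_fn, epochs=1, lr=1e-4):
    # Steps (2)-(4): match the mapped intermediate feature maps of teacher and
    # student; this is the weight initialization phase.
    teacher.eval()
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(epochs):
        for x, _ in loader:
            with torch.no_grad():
                targets = [map_fn(a) for a in teacher.intermediate(x)]
            outputs = [map_fn(a) for a in student.intermediate(x)]
            loss = sum(F.mse_loss(o, t) for o, t in zip(outputs, targets))
            opt.zero_grad()
            loss.backward()
            opt.step()


def finetune_student(student, loader, epochs=1, lr=1e-4):
    # Step (5): final training of the initialized student with the conventional loss.
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            loss = F.mse_loss(student(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
```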
Drawings
FIG. 1 is a schematic view of the process of the present invention.
FIG. 2 shows the training loss and test loss of the experiment (QP 22).
FIG. 3 shows the training loss and test loss of the experiment (QP 37).
Detailed Description
The present invention is further described below by taking the neural network-based loop filtering task in video coding as an example.
For the target task, a neural network module is added to a conventional video encoder such as HEVC; its function is loop filtering. The neural-network-based loop filter improves the performance of the video encoder, and the task can be understood as a denoising filtering problem that removes the artifacts and noise introduced by the conventional video encoder. First, a teacher model of higher complexity is designed; its complexity is significantly higher than that of the model intended for the actual target application, for example its computational complexity and resource consumption are more than twice those of the intended design. The teacher model is then trained with the conventional loss function to obtain a trained teacher model. For the neural-network-based loop filtering task, a convolutional neural network is designed whose structure is shown in FIG. 1: the upper half is the designed teacher model structure and the lower half is the designed student model structure. The conventional loss function here is the mean square error commonly used for this task; this loss function is used to train both the teacher and the student model.
In terms of model structure, depthwise separable convolution and batch normalization are adopted as the main layers of the teacher model, with 64 feature channels and a 3×3 convolution kernel. The backbone of the teacher model consists of 24 depthwise separable convolutional layers, divided into three parts: the first part has 10 layers, the second 8 layers, and the third 6 layers. The last layer of the model is an ordinary convolutional layer with 1 feature channel and a 1×1 convolution kernel. All depthwise separable convolutions use the ReLU activation function. The input of the model is connected to the final output through a skip connection, so the network is in a residual-learning setting and converges faster.
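As an illustrative reconstruction of the teacher structure just described (not the patent's code), a parameterized depthwise-separable network with a global skip connection could be sketched as follows; the class name LoopFilterNet, the 3×3 head convolution that lifts the single-channel input to the feature width, and the single-channel input itself are assumptions not spelled out in the text.

```python
import torch
import torch.nn as nn


class DSConv(nn.Module):
    """Depthwise separable 3x3 convolution with batch normalization and ReLU."""
    def __init__(self, channels):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, 1)
        self.bn = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn(self.pointwise(self.depthwise(x))))


class LoopFilterNet(nn.Module):
    """Backbone of three parts of depthwise-separable layers, a 1x1 output
    convolution, and a global skip connection (residual learning).
    Teacher configuration: parts=(10, 8, 6), channels=64."""
    def __init__(self, parts=(10, 8, 6), channels=64):
        super().__init__()
        self.head = nn.Conv2d(1, channels, 3, padding=1)   # assumed input-lifting layer
        self.parts = nn.ModuleList(
            [nn.Sequential(*[DSConv(channels) for _ in range(n)]) for n in parts])
        self.tail = nn.Conv2d(channels, 1, 1)              # ordinary 1x1 output layer

    def intermediate(self, x):
        # Outputs of the three sub-parts, used later for knowledge transfer.
        feats = []
        h = self.head(x)
        for part in self.parts:
            h = part(h)
            feats.append(h)
        return feats

    def forward(self, x):
        # Global skip connection: the network learns the filtering residual.
        return x + self.tail(self.intermediate(x)[-1])
```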
The input and output of the model are the reconstructed pixel map and the filtered pixel map of the video encoder, respectively. The loss function $L_T$ is chosen as the mean square error between the output of the neural network $\hat{y}_T$ and the original pixels $y$:

$$ L_T = \frac{1}{N}\sum_{i=1}^{N}\bigl(\hat{y}_{T,i} - y_i\bigr)^2 \qquad (1) $$
After the teacher model has been trained, the intermediate-layer outputs of its three sub-parts are obtained on the training data set, and the teacher model's output results $F_T$ at these positions are computed via the neural network attention map or the reproducing kernel Hilbert space mapping. The intermediate-layer results $F_T$ of the teacher model, together with the data set fed to the teacher model, form a new data set used to train the student model.
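One hedged way to materialize this new data set, assuming the trained teacher exposes the intermediate() method from the earlier sketch and map_fn is the chosen attention or kernel mapping, is to run the teacher once over the training inputs and cache the mapped outputs:

```python
import torch


def build_transfer_dataset(teacher, inputs, map_fn):
    """Cache the mapped intermediate outputs of the trained teacher for every
    training input; the cached targets then guide the student initialization."""
    teacher.eval()
    dataset = []
    with torch.no_grad():
        for x in inputs:                                   # x: reconstructed pixel patch
            targets = [map_fn(a) for a in teacher.intermediate(x)]
            dataset.append((x, targets))
    return dataset
```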
The structure of the student model needs to be similar to that of the teacher model to ensure that knowledge transfer succeeds. A similar network structure is therefore adopted: 9 depthwise separable convolutional layers form the backbone of the student model, divided into three parts of 3 layers each, with 32 feature channels and a 3×3 convolution kernel. Because the student model and the teacher model are designed for the same target, their inputs and outputs are also the same, and a skip-connection structure is likewise adopted so that the student model can learn the residual well.
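Under the illustrative LoopFilterNet class sketched earlier (an assumption, not the patent's code), the teacher and the student are simply two instances with different depth and width:

```python
# Assumes the illustrative LoopFilterNet class from the earlier sketch.
teacher = LoopFilterNet(parts=(10, 8, 6), channels=64)   # 24 depthwise-separable layers, 64 channels
student = LoopFilterNet(parts=(3, 3, 3), channels=32)    # 9 depthwise-separable layers, 32 channels
```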
The intermediate-layer outputs of the teacher model and of the student model undergo the same mapping, and the mean square error between them is computed; the loss function is given by formulas (2) and (3). The attention map is computed, and the linear kernel $k(x, y) = x^{\mathsf T} y$ is used to approximate the result of the reproducing kernel Hilbert space mapping. The calculation is:

$$ F_T = \sum_{i=1}^{C_T}\bigl|F_T^{\,i}\bigr|^{p}, \qquad F_S = \sum_{i=1}^{C_S}\bigl|F_S^{\,i}\bigr|^{p} \qquad (2) $$

$$ L_{\mathrm{init}} = \bigl\|F_T - F_S\bigr\|_2^{2} \qquad (3) $$

Here $F_T$ and $F_S$ denote the attention maps of the teacher model and the student model respectively, $F_T^{\,i}$ and $F_S^{\,i}$ are their $i$-th feature maps, and $C_T$ and $C_S$ are the numbers of feature channels of the teacher model and the student model. In practical applications, $p$ is a positive integer. In this way an initialized student model is obtained.
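A minimal sketch of formulas (2) and (3) as reconstructed above; the exponent p = 2, the absence of normalization, and the function names are assumptions, since the original images of the formulas are not available here:

```python
import torch


def attention_map(feats, p=2):
    # Formula (2): sum the element-wise |.|**p over the channel dimension,
    # collapsing (N, C, H, W) feature maps into one (N, H, W) attention map.
    return feats.abs().pow(p).sum(dim=1)


def init_loss(teacher_feats, student_feats, p=2):
    # Formula (3): mean square error between teacher and student attention maps.
    f_t = attention_map(teacher_feats, p)
    f_s = attention_map(student_feats, p)
    return torch.mean((f_t - f_s) ** 2)
```

Because the channel dimension is summed out, the attention map of the 64-channel teacher and that of the 32-channel student have the same spatial shape and can be compared directly, which is what makes the two models commensurable despite their different widths.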
Then the student model is trained one final time with the standard mean square error $L_S$ given in formula (4), where $\hat{y}_S$ denotes the output of the student model, yielding the trained student model:

$$ L_S = \frac{1}{N}\sum_{i=1}^{N}\bigl(\hat{y}_{S,i} - y_i\bigr)^2 \qquad (4) $$
After such an initialization, a lightweight model tends to perform better than without the initialization method. Once the student model has been obtained, it is integrated back into the video encoder, which realizes the neural-network-based loop filtering in video encoding. The training and test losses from the experiment are shown in FIG. 2 and FIG. 3; the loss function clearly decreases faster when the initialization is used.
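As a final illustration of plugging the student back into the encoder loop (the interface below is an assumption; the actual HEVC integration is not described in code in this text), the trained student maps each reconstructed frame to a filtered frame:

```python
import torch


@torch.no_grad()
def loop_filter(student, reconstructed_frame):
    """Apply the trained student model to one reconstructed luma frame.
    reconstructed_frame: float tensor of shape (H, W), pixel values in [0, 1]."""
    student.eval()
    x = reconstructed_frame.unsqueeze(0).unsqueeze(0)   # -> (1, 1, H, W)
    filtered = student(x).clamp(0.0, 1.0)               # residual-corrected output
    return filtered.squeeze(0).squeeze(0)               # -> (H, W)
```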

Claims (4)

1. A neural network weight initialization algorithm based on transfer learning, characterized in that, for a specified target task, a neural network model of higher complexity, the teacher model, is designed and trained; after training, the feature maps it generates are used to guide the weight initialization of a student model; the difference between feature maps is computed either directly or after mapping the feature maps into a reproducing kernel Hilbert space, where a kernel function method simplifies the computation; the simpler student model thereby obtains a good weight initialization, and after the weight initialization is completed the student model is trained in the normal way, so that it reaches a better global convergence point and performs better; the method comprises the following specific steps:
(1) for a specific learning task, a conventional loss function and a model structure are available; first, a teacher model is designed for the target task and trained with the conventional loss function;
(2) the intermediate-layer outputs of the trained teacher model are then exported, and feature maps are obtained through a mapping, where the mapping is attention transfer or a mapping into a reproducing kernel Hilbert space using a kernel function;
(3) a student model with a simpler structure is designed, where the student model and the teacher model are required to share the same kind of network structure, i.e., the basic network layers making up the two networks are of the same type; when both network structures are serial networks built from convolutional layers, the teacher model has more convolutional layers and more feature maps, while the student model has fewer convolutional layers and fewer feature maps;
(4) the student model is trained with a loss function defined as the mean square error between the feature maps computed in step (2) and the feature maps obtained from the student model through the same mapping; after this training, the weights of the student model have been adjusted by learning knowledge from the teacher model, so that the student model is specifically initialized and gains the ability to approach the teacher model's performance;
(5) after the initialization is completed, the student model is trained with the conventional loss function, yielding a usable student model.
2. The neural network weight initialization algorithm based on transfer learning of claim 1, characterized in that depthwise separable convolution and batch normalization are adopted as the main layers of the teacher model, with 64 feature channels and a 3×3 convolution kernel; the backbone of the teacher model consists of 24 depthwise separable convolutional layers divided into three parts, the first part having 10 layers, the second 8 layers, and the third 6 layers; the last layer of the model is an ordinary convolutional layer with 1 feature channel and a 1×1 convolution kernel; all depthwise separable convolutions use the ReLU activation function; the input of the model is connected to the final output through a skip connection, so that the network is in a residual-learning setting and converges faster.
3. The neural network weight initialization algorithm based on transfer learning of claim 2, characterized in that the input and output of the model are the reconstructed pixel map and the filtered pixel map of a video encoder, respectively, and the loss function $L_T$ is chosen as the mean square error between the output of the neural network $\hat{y}_T$ and the original pixels $y$:

$$ L_T = \frac{1}{N}\sum_{i=1}^{N}\bigl(\hat{y}_{T,i} - y_i\bigr)^2 \qquad (1) $$
after the teacher model has been trained, the intermediate-layer outputs of its three sub-parts are obtained on the training data set, and the teacher model's output results $F_T$ at these positions are computed via the neural network attention map or the reproducing kernel Hilbert space mapping; the intermediate-layer results $F_T$ of the teacher model, together with the data set fed to the teacher model, form a new data set used to train the student model;
the student model adopts a network structure similar to the teacher model, using 9 depthwise separable convolutional layers as its backbone, divided into three parts of 3 layers each, with 32 feature channels and a 3×3 convolution kernel; the student model and the teacher model are designed for the same target, so their inputs and outputs are the same, and a skip-connection structure is likewise adopted so that the student model can learn the residual well;
the middle layer output of the teacher model and the middle layer output of the student model are subjected to the same mapping, and the mean square error between the middle layer output of the teacher model and the middle layer output of the student model is calculated; the loss function is expressed as formulas (2) and (3); calculating an attention diagram of the system by using a linear kernel function k (x, y) ═ xTy to approximate the result of the hilbert space mapping at the regenerating kernel; the calculation formula is as follows:
Figure FDA0002327174750000022
Figure FDA0002327174750000023
here, FT,FSRepresenting the attention diagrams of the teacher and student models respectively,
Figure FDA0002327174750000024
and
Figure FDA0002327174750000025
the ith feature map, CTAnd CSRespectively representing the number of characteristic layers of a teacher model and a student model, and taking a positive integer as p; thus, an initialized student model is obtained.
4. The neural network weight initialization algorithm based on transfer learning of claim 3, characterized in that the initialized student model is trained one final time with the standard mean square error $L_S$ given in formula (4), where $\hat{y}_S$ denotes the output of the student model, yielding the trained student model:

$$ L_S = \frac{1}{N}\sum_{i=1}^{N}\bigl(\hat{y}_{S,i} - y_i\bigr)^2 \qquad (4) $$

the student model is then integrated back into the video encoder, which realizes the neural-network-based loop filtering in video encoding.
CN201911321102.2A 2019-12-20 2019-12-20 Neural network weight initialization method based on transfer learning Active CN111126599B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911321102.2A CN111126599B (en) 2019-12-20 2019-12-20 Neural network weight initialization method based on transfer learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911321102.2A CN111126599B (en) 2019-12-20 2019-12-20 Neural network weight initialization method based on transfer learning

Publications (2)

Publication Number Publication Date
CN111126599A true CN111126599A (en) 2020-05-08
CN111126599B CN111126599B (en) 2023-09-05

Family

ID=70500352

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911321102.2A Active CN111126599B (en) 2019-12-20 2019-12-20 Neural network weight initialization method based on transfer learning

Country Status (1)

Country Link
CN (1) CN111126599B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111554268A (en) * 2020-07-13 2020-08-18 腾讯科技(深圳)有限公司 Language identification method based on language model, text classification method and device
CN112464959A (en) * 2020-12-12 2021-03-09 中南民族大学 Plant phenotype detection system and method based on attention and multiple knowledge migration
CN112929663A (en) * 2021-04-08 2021-06-08 中国科学技术大学 Knowledge distillation-based image compression quality enhancement method
CN113469977A (en) * 2021-07-06 2021-10-01 浙江霖研精密科技有限公司 Flaw detection device and method based on distillation learning mechanism and storage medium
WO2022001805A1 (en) * 2020-06-30 2022-01-06 华为技术有限公司 Neural network distillation method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2017167347A (en) * 2016-03-16 2017-09-21 日本電信電話株式会社 Acoustic signal analysis device, method, and program
CN109087303A (en) * 2018-08-15 2018-12-25 中山大学 The frame of semantic segmentation modelling effect is promoted based on transfer learning
CN110163110A (en) * 2019-04-23 2019-08-23 中电科大数据研究院有限公司 A kind of pedestrian's recognition methods again merged based on transfer learning and depth characteristic
US20190287515A1 (en) * 2018-03-16 2019-09-19 Microsoft Technology Licensing, Llc Adversarial Teacher-Student Learning for Unsupervised Domain Adaptation

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2017167347A (en) * 2016-03-16 2017-09-21 日本電信電話株式会社 Acoustic signal analysis device, method, and program
US20190287515A1 (en) * 2018-03-16 2019-09-19 Microsoft Technology Licensing, Llc Adversarial Teacher-Student Learning for Unsupervised Domain Adaptation
CN109087303A (en) * 2018-08-15 2018-12-25 中山大学 The frame of semantic segmentation modelling effect is promoted based on transfer learning
CN110163110A (en) * 2019-04-23 2019-08-23 中电科大数据研究院有限公司 A kind of pedestrian's recognition methods again merged based on transfer learning and depth characteristic

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
张振宇: "Robust adaptive boosting algorithm with multiple support vector machines" (稳健的多支持向量机自适应提升算法), Journal of Dalian Jiaotong University (大连交通大学学报)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022001805A1 (en) * 2020-06-30 2022-01-06 华为技术有限公司 Neural network distillation method and device
EP4163831A4 (en) * 2020-06-30 2023-12-06 Huawei Technologies Co., Ltd. Neural network distillation method and device
CN111554268A (en) * 2020-07-13 2020-08-18 腾讯科技(深圳)有限公司 Language identification method based on language model, text classification method and device
CN111554268B (en) * 2020-07-13 2020-11-03 腾讯科技(深圳)有限公司 Language identification method based on language model, text classification method and device
CN112464959A (en) * 2020-12-12 2021-03-09 中南民族大学 Plant phenotype detection system and method based on attention and multiple knowledge migration
CN112464959B (en) * 2020-12-12 2023-12-19 中南民族大学 Plant phenotype detection system and method based on attention and multiple knowledge migration
CN112929663A (en) * 2021-04-08 2021-06-08 中国科学技术大学 Knowledge distillation-based image compression quality enhancement method
CN113469977A (en) * 2021-07-06 2021-10-01 浙江霖研精密科技有限公司 Flaw detection device and method based on distillation learning mechanism and storage medium
CN113469977B (en) * 2021-07-06 2024-01-12 浙江霖研精密科技有限公司 Flaw detection device, method and storage medium based on distillation learning mechanism

Also Published As

Publication number Publication date
CN111126599B (en) 2023-09-05

Similar Documents

Publication Publication Date Title
CN111126599A (en) Neural network weight initialization method based on transfer learning
CN113159173B (en) Convolutional neural network model compression method combining pruning and knowledge distillation
CN111784602A (en) Method for generating countermeasure network for image restoration
CN113313644B (en) Underwater image enhancement method based on residual double-attention network
CN112489164B (en) Image coloring method based on improved depth separable convolutional neural network
CN109005398B (en) Stereo image parallax matching method based on convolutional neural network
CN113595993B (en) Vehicle-mounted sensing equipment joint learning method for model structure optimization under edge calculation
CN113947680A (en) Image semantic segmentation method based on cascade multi-scale vision Transformer
CN112017116B (en) Image super-resolution reconstruction network based on asymmetric convolution and construction method thereof
CN113163203A (en) Deep learning feature compression and decompression method, system and terminal
CN115330620A (en) Image defogging method based on cyclic generation countermeasure network
CN111861886A (en) Image super-resolution reconstruction method based on multi-scale feedback network
CN112651360A (en) Skeleton action recognition method under small sample
CN116958534A (en) Image processing method, training method of image processing model and related device
CN113989283B (en) 3D human body posture estimation method and device, electronic equipment and storage medium
CN108629374A (en) A kind of unsupervised multi-modal Subspace clustering method based on convolutional neural networks
CN109448039B (en) Monocular vision depth estimation method based on deep convolutional neural network
CN109063834B (en) Neural network pruning method based on convolution characteristic response graph
CN110752894A (en) CNN-based LDPC code blind channel decoding method and decoder
CN113938254A (en) Attention mechanism-based layered source-channel joint coding transmission system and transmission method thereof
CN110223224A (en) A kind of Image Super-resolution realization algorithm based on information filtering network
CN113807497A (en) Non-paired image translation method for enhancing texture details
CN116030537B (en) Three-dimensional human body posture estimation method based on multi-branch attention-seeking convolution
CN117078539A (en) CNN-transducer-based local global interactive image restoration method
CN116600119A (en) Video encoding method, video decoding method, video encoding device, video decoding device, computer equipment and storage medium

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant