CN111126599B - Neural network weight initialization method based on transfer learning - Google Patents


Info

Publication number
CN111126599B
Authority
CN
China
Prior art keywords
model
student model
student
teacher
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911321102.2A
Other languages
Chinese (zh)
Other versions
CN111126599A (en)
Inventor
范益波
刘超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University filed Critical Fudan University
Priority to CN201911321102.2A priority Critical patent/CN111126599B/en
Publication of CN111126599A publication Critical patent/CN111126599A/en
Application granted granted Critical
Publication of CN111126599B publication Critical patent/CN111126599B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/042Knowledge-based neural networks; Logical representations of neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T9/00Image coding
    • G06T9/002Image coding using neural networks
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/80Details of filtering operations specially adapted for video compression, e.g. for pixel interpolation
    • H04N19/82Details of filtering operations specially adapted for video compression, e.g. for pixel interpolation involving filtering within a prediction loop
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of neural network models, and in particular relates to a neural network weight initialization method based on transfer learning. In the method, for a designated target task, a neural network model of higher complexity, namely a teacher model, is designed and trained; after training is complete, the feature maps it generates are used to guide the weight initialization of a student model. The difference between the feature maps is calculated directly, or the feature maps are mapped into a reproducing kernel Hilbert space where their difference is calculated, with a kernel function method used to simplify the computation. The simple student model thereby achieves a better weight initialization; after initialization is completed, the student model is trained in the usual way, so that it reaches a better global convergence point and performs better. The invention can effectively improve the performance of the student model without increasing its complexity.

Description

Neural network weight initialization method based on transfer learning
Technical Field
The invention belongs to the technical field of neural network models, and particularly relates to a neural network weight initialization method based on knowledge transfer learning.
Background
In recent years, neural networks have advanced greatly, especially in the fields of computer vision and natural language processing, where they outperform humans on many tasks. However, their excessive computational cost and heavy training requirements pose major barriers to practical application. How to make a lightweight model perform better has therefore become a hot problem to be solved.
Over the past several years, many researchers have proposed various schemes to help neural networks reach a better convergence point. The first category is knowledge distillation and transfer, which adds extra loss terms during the training of a student model and uses a trained teacher model to help the student perform better, thereby improving performance without increasing the student's complexity. The second category is model quantization and pruning. Quantization converts the original 32-bit arithmetic of the network weights into 8-bit or even 1-bit arithmetic, greatly reducing the complexity of the weights and hence the amount of computation. Pruning directly deletes some of the connections of the neural network and then evaluates whether the resulting loss in model accuracy is negligible, thereby effectively reducing model complexity.
Disclosure of Invention
The invention aims to provide a neural network weight initialization method that effectively improves model performance without increasing model complexity.
The neural network weight initialization method provided by the invention is based on knowledge transfer learning. A neural network model of higher complexity is called the teacher model (because of its high complexity, the teacher model is difficult to apply in actual engineering); a neural network model of lower complexity is called the student model (which offers a better trade-off between complexity and performance in practical applications). The teacher model is used to give the student model a good initialization state, so that the student can move away from local optima during training and achieve a better training result.
In the invention, for a designated task, a neural network model of higher complexity, namely the teacher model, is first designed and trained; after training is complete, the feature maps it generates are used to guide the weight initialization of the student model. The difference between the feature maps is calculated directly, or the feature maps are mapped into a reproducing kernel Hilbert space where their difference is calculated, with a kernel function method used to simplify the computation. The simple student model thereby achieves a better weight initialization; after initialization is completed, the student model is trained in the usual way, so that it reaches a better global convergence point and performs better.
The invention effectively avoids the problem of the student model converging to a local optimum caused by its dependence on the initial parameters during training.
The invention provides a neural network weight initialization method, which comprises the following specific steps:
(1) A specific learning task usually comes with a conventional loss function and model structure. First, a teacher model is designed for the target task and trained with the conventional loss function;
(2) The intermediate-layer outputs of the trained teacher model are then exported, and feature maps are obtained through a mapping; the mapping can be attention transfer [Sergey Zagoruyko and Nikos Komodakis, "Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer," arXiv preprint arXiv:1612.03928, 2016] or a mapping into a reproducing kernel Hilbert space using a kernel function [Zehao Huang and Naiyan Wang, "Like what you like: Knowledge distill via neuron selectivity transfer," arXiv preprint arXiv:1707.01219, 2017]; specifically, as shown in formula (2) and formula (3);
(3) A student model with a simpler structure is designed, built from the same kind of network structure as the teacher model, that is, from the same basic network layers. For example, if both networks are serial stacks of convolution layers, the teacher model has more convolution layers and more feature maps while the student model has fewer convolution layers and fewer feature maps;
(4) The student model is trained with a loss function that is the mean square error between the feature maps calculated in step (2) and the feature maps of the student model obtained through the same mapping. After this training, the weights of the student model are no longer the traditional normal- or uniform-distribution initialization; instead they have been adjusted by learning knowledge from the teacher model, giving the student model a task-specific initialization and the ability to approach the performance of the teacher model;
(5) After initialization is complete, the student model is trained with the conventional loss function, yielding a usable student model; a minimal code sketch of these steps is given below.
Here, the conventional loss function refers to the mean square error commonly used for the task at hand.
In the invention, the complexity of the teacher model is higher than that of the student model, so that the student model can learn the characteristics of the teacher model well.
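The following is a minimal PyTorch-style sketch of steps (1)-(5), given only as an illustration. The helper names, the assumption that the teacher and student models return both their output and a list of intermediate feature maps, the epoch counts, and the choice of the Adam optimizer are all assumptions that the invention does not specify; the attention mapping of formulas (2) and (3) is used here, and the reproducing-kernel alternative would simply replace the mapping function.

```python
import torch
import torch.nn.functional as F

def attention(feats, p=2):
    # map a feature tensor (N, C, H, W) to its attention map: sum_i |f^i|^p over channels
    return feats.abs().pow(p).sum(dim=1).flatten(1)

def train(model, loader, epochs, loss_fn):
    opt = torch.optim.Adam(model.parameters())        # optimizer choice is an assumption
    for _ in range(epochs):
        for recon, orig in loader:                    # (reconstructed, original) pixel pairs
            loss = loss_fn(model, recon, orig)
            opt.zero_grad(); loss.backward(); opt.step()

def initialize_and_train(teacher, student, loader):
    # (1) train the teacher with the conventional loss (mean square error)
    train(teacher, loader, epochs=50, loss_fn=lambda m, x, y: F.mse_loss(m(x)[0], y))
    teacher.eval()
    # (2)-(4) weight initialization: match the mapped feature maps of the frozen teacher
    def init_loss(m, x, _):
        with torch.no_grad():
            t_feats = teacher(x)[1]
        s_feats = m(x)[1]
        return sum(F.mse_loss(attention(fs), attention(ft))
                   for fs, ft in zip(s_feats, t_feats))
    train(student, loader, epochs=20, loss_fn=init_loss)
    # (5) fine-tune the initialized student with the conventional loss
    train(student, loader, epochs=50, loss_fn=lambda m, x, y: F.mse_loss(m(x)[0], y))
```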
Drawings
FIG. 1 is a schematic diagram of the process of the present invention.
FIG. 2 shows the training loss and test loss of the experiment with QP = 22.
FIG. 3 shows the training loss and test loss of the experiment with QP = 37.
Detailed Description
The present invention is further described below with reference to the task of loop filtering based on neural networks in video coding.
For this target task, a neural network module is added to a conventional video encoder such as HEVC, and the function of the module is loop filtering: a neural-network-based loop filter improves the performance of the video encoder. The task can be understood as a denoising and filtering problem, removing the artifacts and noise introduced by the traditional video encoder. First, a teacher model of higher complexity is designed, whose complexity is clearly higher than what the final target application allows; for example, its computational complexity and consumed computing resources are more than twice those of the intended final model. The teacher model is trained with the conventional loss function to obtain a trained teacher model. For the neural-network-based loop filtering task we designed a convolutional neural network whose structure is shown in FIG. 1: the upper half is the teacher model structure and the lower half is the student model structure. The conventional loss function here is the mean square error commonly used for this task; it is used to train both the teacher and the student models.
In terms of model structure, depthwise separable convolutions with batch normalization are used as the main layers of the teacher model, with 64 feature maps per layer and a 3x3 convolution kernel. Twenty-four depthwise separable convolution layers form the trunk of the teacher model and are divided into three parts: the first part consists of 10 depthwise separable convolution layers, the second of 8, and the third of 6. The last layer of the model is an ordinary convolution layer with 1 output feature map and a 1x1 convolution kernel. All depthwise separable convolutions use a ReLU activation function. The input of the model is connected to the final output through a skip connection, so that the network learns a residual and converges faster.
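As a hedged illustration, the teacher structure described above could be written in PyTorch roughly as follows; the single-channel input (e.g. the luma plane) and the initial 3x3 convolution that lifts the input to 64 channels are assumptions, since the text only specifies the 24-layer trunk and the final 1x1 layer.

```python
import torch
import torch.nn as nn

class DSConv(nn.Module):
    """Depthwise-separable 3x3 convolution + batch normalization + ReLU."""
    def __init__(self, channels):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, 1)
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

class TeacherModel(nn.Module):
    """24 depthwise-separable layers (10 + 8 + 6), 64 feature maps, final 1x1 conv, global skip."""
    def __init__(self, channels=64):
        super().__init__()
        self.head = nn.Conv2d(1, channels, 3, padding=1)   # assumed input-lifting layer
        self.part1 = nn.Sequential(*[DSConv(channels) for _ in range(10)])
        self.part2 = nn.Sequential(*[DSConv(channels) for _ in range(8)])
        self.part3 = nn.Sequential(*[DSConv(channels) for _ in range(6)])
        self.tail = nn.Conv2d(channels, 1, 1)              # ordinary 1x1 conv, 1 output map

    def forward(self, x):
        f1 = self.part1(self.head(x))
        f2 = self.part2(f1)
        f3 = self.part3(f2)
        # global skip connection: the network learns the filtering residual
        return x + self.tail(f3), (f1, f2, f3)
```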
The input and output of the model are, respectively, the reconstructed pixel map of the video encoder and the filtered pixel map. The loss function L_T is the mean square error between the output of the neural network, Y_T, and the original pixels Y:

L_T = (1/N) Σ_{i=1..N} (Y_T^i − Y^i)²    (1)

where Y_T^i denotes the output of the neural network at pixel i, Y^i the corresponding original pixel, and N the number of pixels.
after the teacher model is trained, the middle layer output of three sub-parts of the teacher model is obtained from the trained data set, the mapping result of the attention map of the neural network or the regenerated kernel Hilbert space is used for calculating the output result F of the teacher model at the places T . Intermediate layer calculation result F using teacher model T And the data set input by the teacher model form a new data set to train the student model.
The student model must be constructed similarly to the teacher model to ensure successful knowledge transfer. We therefore use a similar network structure with 9 depthwise separable convolution layers as the backbone of the student model, divided into three parts of 3 layers each, with 32 feature maps per layer and a 3x3 convolution kernel. Since the student model and the teacher model target the same task, the student's input and output are the same as the teacher's, and the same skip-connection structure is adopted so that the student model also learns the residual well.
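Under the same assumptions as the teacher sketch (single-channel input, input-lifting convolution, reuse of the DSConv block defined above), the student structure could be sketched as:

```python
class StudentModel(nn.Module):
    """9 depthwise-separable layers (3 + 3 + 3), 32 feature maps, final 1x1 conv, global skip."""
    def __init__(self, channels=32):
        super().__init__()
        self.head = nn.Conv2d(1, channels, 3, padding=1)
        self.part1 = nn.Sequential(*[DSConv(channels) for _ in range(3)])
        self.part2 = nn.Sequential(*[DSConv(channels) for _ in range(3)])
        self.part3 = nn.Sequential(*[DSConv(channels) for _ in range(3)])
        self.tail = nn.Conv2d(channels, 1, 1)

    def forward(self, x):
        f1 = self.part1(self.head(x))
        f2 = self.part2(f1)
        f3 = self.part3(f2)
        return x + self.tail(f3), (f1, f2, f3)
```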
The intermediate-layer outputs of the teacher model and of the student model are passed through the same mapping, and the mean square error between them is computed. The mapping is computed either as the attention map or, using the linear kernel function k(x, y) = x^T y, as an approximation of the mapping into the reproducing kernel Hilbert space; the mapped outputs F_T and F_S are given by formulas (2) and (3):
F_T = Σ_{i=1..C_T} |f_T^i|^p    (2)
F_S = Σ_{i=1..C_S} |f_S^i|^p    (3)

Here, F_T and F_S denote the attention maps of the teacher and student models respectively, f_T^i and f_S^i denote the i-th feature map of each model, and C_T and C_S denote the number of feature maps of the teacher model and the student model respectively. For practical applications, p is a positive integer. In this way an initialized student model is obtained.
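The two mappings and the resulting initialization loss can be sketched as follows. The per-map L2 normalization is a common practice borrowed from the attention-transfer and neuron-selectivity-transfer literature and is an assumption here; the formulas above only require the mean square error between the identically mapped outputs.

```python
import torch
import torch.nn.functional as F

def attention_map(feats, p=2):
    # formulas (2)/(3): F = sum_i |f^i|^p over the channel dimension, then flatten
    amap = feats.abs().pow(p).sum(dim=1).flatten(1)        # (N, H*W)
    return F.normalize(amap, dim=1)                        # normalization is an assumption

def linear_kernel_mmd(ft, fs):
    # alternative mapping: with the linear kernel k(x, y) = x^T y, the squared MMD between
    # the sets of (normalized) feature maps reduces to the squared distance of their means
    mu_t = F.normalize(ft.flatten(2), dim=2).mean(dim=1)   # (N, H*W)
    mu_s = F.normalize(fs.flatten(2), dim=2).mean(dim=1)
    return (mu_t - mu_s).pow(2).sum(dim=1).mean()

def initialization_loss(teacher_feats, student_feats, p=2):
    """Mean square error between identically mapped teacher/student intermediate outputs."""
    return sum(F.mse_loss(attention_map(fs, p), attention_map(ft, p))
               for ft, fs in zip(teacher_feats, student_feats))
```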
Then the standard mean square error L_S shown in formula (4) is used to perform the final training of the student model, after which the trained student model is obtained:

L_S = (1/N) Σ_{i=1..N} (Y_S^i − Y^i)²    (4)

where Y_S^i correspondingly denotes the output of the student model.
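Finally, a short sketch of this fine-tuning stage with the standard mean square error L_S of formula (4), reusing the student model and the train_loader placeholder from above; the learning rate and the single-epoch loop are assumptions.

```python
import torch
import torch.nn.functional as F

opt = torch.optim.Adam(student.parameters(), lr=1e-4)     # learning rate is an assumption
for recon, orig in train_loader:
    out, _ = student(recon)
    loss = F.mse_loss(out, orig)                           # L_S of formula (4)
    opt.zero_grad()
    loss.backward()
    opt.step()
```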
After such initialization, the lightweight model typically performs better than without the initialization method. Once the student model is obtained, it is integrated back into the video encoder, realizing neural-network-based loop filtering in video coding. FIGS. 2 and 3 show the training and test losses of the experiments; it is evident that the loss function decreases faster when the initialization is used.

Claims (1)

1. A neural-network-based loop filter optimization method in video coding, in which a neural network module is added to a traditional video encoder and the function of the module is loop filtering; for the loop filtering target task, the neural network module comprises a neural network model of higher complexity, namely a teacher model, and a simpler student model;
the teacher model is trained, and after training is completed the feature maps it generates are used to guide the weight initialization of the student model; the difference between the feature maps is calculated directly, or the feature maps are mapped into a reproducing kernel Hilbert space where their difference is calculated, with a kernel function method adopted to simplify the computation; the simple student model thereby achieves a better weight initialization effect, and after the weight initialization is completed the student model is trained in the usual way, so that it reaches a better global convergence point and performs better; the method comprises the following specific steps:
(1) for a specific learning task with its conventional loss function and model structure, a teacher model is first designed for the target task and trained with the conventional loss function;
(2) the intermediate-layer outputs of the trained teacher model are then exported, and feature maps are obtained through a mapping; the mapping is attention transfer or a mapping into a reproducing kernel Hilbert space using a kernel function;
(3) a student model with a simpler structure is designed, built from the same kind of network structure as the teacher model, that is, from the same basic network layers; when both network structures are serial stacks of convolution layers, the teacher model has more convolution layers and more feature maps while the student model has fewer convolution layers and fewer feature maps;
(4) the student model is trained with a loss function that is the mean square error between the feature maps calculated in step (2) and the feature maps of the student model obtained through the same mapping; after this training, the weights of the student model have been adjusted by learning knowledge from the teacher model, so that the weights of the student model are specifically initialized and the student model has the capability of approaching the performance of the teacher model;
(5) after initialization is completed, the student model is trained with the conventional loss function to obtain a usable student model;
the depth separable convolution and batch normalization are adopted as the main layer of a teacher model, the number of the characteristic layers is 64, and the convolution kernel size is 3x3; the method comprises the steps of using 24 depth separable convolution layers as a trunk of a teacher model, dividing the trunk into three parts, wherein the first part consists of 10 depth separable convolution layers, the second part consists of 8 depth separable convolution layers, and the third part consists of 6 depth separable convolution layers; the last layer of the model is a common convolution layer, the number of the layers is 1, and the size of a convolution kernel is 1x1; all depth separable convolutions have a ReLU activation function; the input of the model is connected to the final output through a direct connection edge, so that the neural network is in a residual error learning state, and the model has faster convergence;
the input and output of the model are respectively a reconstructed pixel diagram and a filtered pixel diagram of the video encoderPixel map, loss function L T Selecting output of neural networkAnd original pixel->Mean square error between:
after the teacher model is trained, the intermediate-layer outputs of its three sub-parts are computed on the training data set, and the attention map of the neural network or its mapping into the reproducing kernel Hilbert space is used to calculate the teacher's mapped outputs F_T at these locations; the intermediate results F_T of the teacher model, together with the inputs of the teacher's data set, form a new data set used to train the student model;
the student model adopts a network structure similar to that of the teacher model, with 9 depthwise separable convolution layers as its trunk, divided into three parts each consisting of 3 depthwise separable convolution layers, the number of feature maps being 32 and the convolution kernel size 3x3; the designed targets of the student model and the teacher model are the same, so the input and output of the student model are the same as the teacher's, and the same skip-connection structure is adopted so that the student model learns the residual well;
the intermediate-layer outputs of the teacher model and of the student model are passed through the same mapping, and the mean square error between them is calculated; the mapping is computed either as the attention map or, using the linear kernel function k(x, y) = x^T y, as an approximation of the mapping into the reproducing kernel Hilbert space; the mapped outputs are given by formulas (2) and (3) as follows:
F_T = Σ_{i=1..C_T} |f_T^i|^p    (2)
F_S = Σ_{i=1..C_S} |f_S^i|^p    (3)

here, F_T and F_S denote the attention maps of the teacher and student models respectively, f_T^i and f_S^i denote the i-th feature map of each model, C_T and C_S denote the number of feature maps of the teacher model and the student model respectively, and p is a positive integer; thus an initialized student model is obtained;
the standard mean square error L_S shown in formula (4) is used to perform the final training of the initialized student model, and the trained student model is obtained:

L_S = (1/N) Σ_{i=1..N} (Y_S^i − Y^i)²    (4)

where Y_S^i correspondingly denotes the output of the student model;
the student model is integrated back into the video encoder, thereby realizing neural-network-based loop filtering in video coding.
CN201911321102.2A 2019-12-20 2019-12-20 Neural network weight initialization method based on transfer learning Active CN111126599B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911321102.2A CN111126599B (en) 2019-12-20 2019-12-20 Neural network weight initialization method based on transfer learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911321102.2A CN111126599B (en) 2019-12-20 2019-12-20 Neural network weight initialization method based on transfer learning

Publications (2)

Publication Number Publication Date
CN111126599A CN111126599A (en) 2020-05-08
CN111126599B true CN111126599B (en) 2023-09-05

Family

ID=70500352

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911321102.2A Active CN111126599B (en) 2019-12-20 2019-12-20 Neural network weight initialization method based on transfer learning

Country Status (1)

Country Link
CN (1) CN111126599B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111882031A (en) * 2020-06-30 2020-11-03 华为技术有限公司 Neural network distillation method and device
CN111554268B (en) * 2020-07-13 2020-11-03 腾讯科技(深圳)有限公司 Language identification method based on language model, text classification method and device
CN112464959B (en) * 2020-12-12 2023-12-19 中南民族大学 Plant phenotype detection system and method based on attention and multiple knowledge migration
CN112929663B (en) * 2021-04-08 2022-07-15 中国科学技术大学 Knowledge distillation-based image compression quality enhancement method
CN113469977B (en) * 2021-07-06 2024-01-12 浙江霖研精密科技有限公司 Flaw detection device, method and storage medium based on distillation learning mechanism

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2017167347A (en) * 2016-03-16 2017-09-21 日本電信電話株式会社 Acoustic signal analysis device, method, and program
CN109087303A (en) * 2018-08-15 2018-12-25 中山大学 The frame of semantic segmentation modelling effect is promoted based on transfer learning
CN110163110A (en) * 2019-04-23 2019-08-23 中电科大数据研究院有限公司 A kind of pedestrian's recognition methods again merged based on transfer learning and depth characteristic

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10643602B2 (en) * 2018-03-16 2020-05-05 Microsoft Technology Licensing, Llc Adversarial teacher-student learning for unsupervised domain adaptation

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2017167347A (en) * 2016-03-16 2017-09-21 日本電信電話株式会社 Acoustic signal analysis device, method, and program
CN109087303A (en) * 2018-08-15 2018-12-25 中山大学 The frame of semantic segmentation modelling effect is promoted based on transfer learning
CN110163110A (en) * 2019-04-23 2019-08-23 中电科大数据研究院有限公司 A kind of pedestrian's recognition methods again merged based on transfer learning and depth characteristic

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Robust adaptive boosting algorithm with multiple support vector machines; Zhang Zhenyu; Journal of Dalian Jiaotong University; Vol. 31, No. 02; pp. 98-100 *

Also Published As

Publication number Publication date
CN111126599A (en) 2020-05-08

Similar Documents

Publication Publication Date Title
CN111126599B (en) Neural network weight initialization method based on transfer learning
CN110599409B (en) Convolutional neural network image denoising method based on multi-scale convolutional groups and parallel
CN112150521B (en) Image stereo matching method based on PSMNet optimization
CN109949214A (en) A kind of image Style Transfer method and system
CN108764471A (en) The neural network cross-layer pruning method of feature based redundancy analysis
CN113362250B (en) Image denoising method and system based on dual-tree quaternary wavelet and deep learning
CN113595993B (en) Vehicle-mounted sensing equipment joint learning method for model structure optimization under edge calculation
CN113111760B (en) Light-weight graph convolution human skeleton action recognition method based on channel attention
CN109598732B (en) Medical image segmentation method based on three-dimensional space weighting
CN109614968A (en) A kind of car plate detection scene picture generation method based on multiple dimensioned mixed image stylization
CN107967516A (en) A kind of acceleration of neutral net based on trace norm constraint and compression method
CN112967178A (en) Image conversion method, device, equipment and storage medium
CN112767286A (en) Dark light image self-adaptive enhancement method based on intensive deep learning
CN112598602A (en) Mask-based method for removing Moire of deep learning video
CN111882053B (en) Neural network model compression method based on splicing convolution
CN108629374A (en) A kind of unsupervised multi-modal Subspace clustering method based on convolutional neural networks
Hui et al. Two-stage convolutional network for image super-resolution
CN116958534A (en) Image processing method, training method of image processing model and related device
CN113989283B (en) 3D human body posture estimation method and device, electronic equipment and storage medium
CN110288603B (en) Semantic segmentation method based on efficient convolutional network and convolutional conditional random field
CN106296583B (en) Based on image block group sparse coding and the noisy high spectrum image ultra-resolution ratio reconstructing method that in pairs maps
CN116596764A (en) Lightweight image super-resolution method based on transform and convolution interaction
CN109448039B (en) Monocular vision depth estimation method based on deep convolutional neural network
CN111640087A (en) Image change detection method based on SAR (synthetic aperture radar) deep full convolution neural network
CN113436101B (en) Method for removing rain by Dragon lattice tower module based on efficient channel attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant