CN111126599B - Neural network weight initialization method based on transfer learning - Google Patents


Info

Publication number
CN111126599B
Authority
CN
China
Prior art keywords
model
student model
student
teacher
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911321102.2A
Other languages
Chinese (zh)
Other versions
CN111126599A (en)
Inventor
范益波
刘超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University filed Critical Fudan University
Priority to CN201911321102.2A priority Critical patent/CN111126599B/en
Publication of CN111126599A publication Critical patent/CN111126599A/en
Application granted granted Critical
Publication of CN111126599B publication Critical patent/CN111126599B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/042Knowledge-based neural networks; Logical representations of neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T9/00Image coding
    • G06T9/002Image coding using neural networks
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/80Details of filtering operations specially adapted for video compression, e.g. for pixel interpolation
    • H04N19/82Details of filtering operations specially adapted for video compression, e.g. for pixel interpolation involving filtering within a prediction loop
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of neural network models, and in particular relates to a neural network weight initialization method based on transfer learning. In the method, for a designated target task, a neural network model of higher complexity, namely a teacher model, is designed and trained; after training is complete, the feature maps it generates are used to guide the weight initialization of a student model. The difference between the feature maps is calculated directly, or the feature maps are mapped into a reproducing kernel Hilbert space where their difference is calculated, with a kernel function method used to simplify the computation. The simple student model thereby achieves a better weight initialization; after initialization is completed, the student model is trained in the usual way, so that it reaches a better global convergence point and performs better. The invention can effectively improve the performance of the student model without increasing its complexity.

Description

Neural network weight initialization method based on transfer learning
Technical Field
The invention belongs to the technical field of neural network models, and particularly relates to a neural network weight initialization method based on knowledge transfer learning.
Background
In recent years, neural networks have advanced greatly, especially in the fields of computer vision and natural language processing, where they outperform humans on many tasks. However, their excessive computational cost and heavy training requirements pose major barriers to practical application. How to make a lightweight model perform better has therefore become a hot problem to be solved.
Over the past several years, many researchers have proposed various schemes to help neural networks reach a better convergence point. The first category is knowledge distillation and transfer, which adds extra loss terms during the training of a student model and uses a trained teacher model to help the student perform better, thereby improving performance without increasing the student's complexity. The second category is model quantization and pruning. Quantization converts the original 32-bit arithmetic of the network weights into 8-bit or even 1-bit arithmetic, greatly reducing the complexity of the weights and hence the amount of computation. Pruning directly deletes some of the connections of the neural network and then evaluates whether the resulting loss in model accuracy is negligible, thereby effectively reducing model complexity.
Disclosure of Invention
The invention aims to provide a neural network weight initialization method that effectively improves model performance without increasing model complexity.
The neural network weight initialization method provided by the invention is based on knowledge transfer learning. A neural network model of higher complexity is called the teacher model (because of its high complexity, the teacher model is difficult to apply in actual engineering); a neural network model of lower complexity is called the student model (which offers a better trade-off between complexity and performance in practical applications). The teacher model is used to give the student model a good initialization state, so that the student can move away from local optima during training and achieve a better training result.
In the invention, for a designated task, a neural network model of higher complexity, namely the teacher model, is first designed and trained; after training is complete, the feature maps it generates are used to guide the weight initialization of the student model. The difference between the feature maps is calculated directly, or the feature maps are mapped into a reproducing kernel Hilbert space where their difference is calculated, with a kernel function method used to simplify the computation. The simple student model thereby achieves a better weight initialization; after initialization is completed, the student model is trained in the usual way, so that it reaches a better global convergence point and performs better.
The invention effectively avoids the problem of the student model converging to a local optimum caused by its dependence on the initial parameters during training.
The invention provides a neural network weight initialization method, which comprises the following specific steps:
(1) A specific learning task usually comes with a conventional loss function and model structure. First, a teacher model is designed for the target task and trained with the conventional loss function;
(2) The intermediate-layer outputs of the trained teacher model are then exported, and feature maps are obtained through a mapping; the mapping can be attention transfer [Sergey Zagoruyko and Nikos Komodakis, "Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer," arXiv preprint arXiv:1612.03928, 2016] or a mapping into a reproducing kernel Hilbert space using a kernel function [Zehao Huang and Naiyan Wang, "Like what you like: Knowledge distill via neuron selectivity transfer," arXiv preprint arXiv:1707.01219, 2017]; specifically, as shown in formula (2) and formula (3);
(3) A student model with a simpler structure is designed, built from the same kind of network structure as the teacher model, that is, from the same basic network layers. For example, if both networks are serial stacks of convolution layers, the teacher model has more convolution layers and more feature maps while the student model has fewer convolution layers and fewer feature maps;
(4) The student model is trained with a loss function that is the mean square error between the feature maps calculated in step (2) and the feature maps of the student model obtained through the same mapping. After this training, the weights of the student model are no longer the traditional normal- or uniform-distribution initialization; instead they have been adjusted by learning knowledge from the teacher model, giving the student model a task-specific initialization and the ability to approach the performance of the teacher model;
(5) After initialization is complete, the student model is trained with the conventional loss function, yielding a usable student model; a minimal code sketch of these steps is given below.
Here, the conventional loss function refers to the mean square error commonly used for the task at hand.
In the invention, the complexity of the teacher model is higher than that of the student model, so that the student model can learn the characteristics of the teacher model well.
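The following is a minimal PyTorch-style sketch of steps (1)-(5), given only as an illustration. The helper names, the assumption that the teacher and student models return both their output and a list of intermediate feature maps, the epoch counts, and the choice of the Adam optimizer are all assumptions that the invention does not specify; the attention mapping of formulas (2) and (3) is used here, and the reproducing-kernel alternative would simply replace the mapping function.

```python
import torch
import torch.nn.functional as F

def attention(feats, p=2):
    # map a feature tensor (N, C, H, W) to its attention map: sum_i |f^i|^p over channels
    return feats.abs().pow(p).sum(dim=1).flatten(1)

def train(model, loader, epochs, loss_fn):
    opt = torch.optim.Adam(model.parameters())        # optimizer choice is an assumption
    for _ in range(epochs):
        for recon, orig in loader:                    # (reconstructed, original) pixel pairs
            loss = loss_fn(model, recon, orig)
            opt.zero_grad(); loss.backward(); opt.step()

def initialize_and_train(teacher, student, loader):
    # (1) train the teacher with the conventional loss (mean square error)
    train(teacher, loader, epochs=50, loss_fn=lambda m, x, y: F.mse_loss(m(x)[0], y))
    teacher.eval()
    # (2)-(4) weight initialization: match the mapped feature maps of the frozen teacher
    def init_loss(m, x, _):
        with torch.no_grad():
            t_feats = teacher(x)[1]
        s_feats = m(x)[1]
        return sum(F.mse_loss(attention(fs), attention(ft))
                   for fs, ft in zip(s_feats, t_feats))
    train(student, loader, epochs=20, loss_fn=init_loss)
    # (5) fine-tune the initialized student with the conventional loss
    train(student, loader, epochs=50, loss_fn=lambda m, x, y: F.mse_loss(m(x)[0], y))
```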
Drawings
FIG. 1 is a schematic diagram of the process of the present invention.
FIG. 2 shows the training loss and test loss of the experiment with QP = 22.
FIG. 3 shows the training loss and test loss of the experiment with QP = 37.
Detailed Description
The present invention is further described below with reference to the task of loop filtering based on neural networks in video coding.
For this target task, a neural network module is added to a conventional video encoder such as HEVC, and the function of the module is loop filtering: a neural-network-based loop filter improves the performance of the video encoder. The task can be understood as a denoising and filtering problem, removing the artifacts and noise introduced by the traditional video encoder. First, a teacher model of higher complexity is designed, whose complexity is clearly higher than what the final target application allows; for example, its computational complexity and consumed computing resources are more than twice those of the intended final model. The teacher model is trained with the conventional loss function to obtain a trained teacher model. For the neural-network-based loop filtering task we designed a convolutional neural network whose structure is shown in FIG. 1: the upper half is the teacher model structure and the lower half is the student model structure. The conventional loss function here is the mean square error commonly used for this task; it is used to train both the teacher and the student models.
In terms of model structure, depthwise separable convolutions with batch normalization are used as the main layers of the teacher model, with 64 feature maps per layer and a 3x3 convolution kernel. Twenty-four depthwise separable convolution layers form the trunk of the teacher model and are divided into three parts: the first part consists of 10 depthwise separable convolution layers, the second of 8, and the third of 6. The last layer of the model is an ordinary convolution layer with 1 output feature map and a 1x1 convolution kernel. All depthwise separable convolutions use a ReLU activation function. The input of the model is connected to the final output through a skip connection, so that the network learns a residual and converges faster.
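As a hedged illustration, the teacher structure described above could be written in PyTorch roughly as follows; the single-channel input (e.g. the luma plane) and the initial 3x3 convolution that lifts the input to 64 channels are assumptions, since the text only specifies the 24-layer trunk and the final 1x1 layer.

```python
import torch
import torch.nn as nn

class DSConv(nn.Module):
    """Depthwise-separable 3x3 convolution + batch normalization + ReLU."""
    def __init__(self, channels):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, 1)
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

class TeacherModel(nn.Module):
    """24 depthwise-separable layers (10 + 8 + 6), 64 feature maps, final 1x1 conv, global skip."""
    def __init__(self, channels=64):
        super().__init__()
        self.head = nn.Conv2d(1, channels, 3, padding=1)   # assumed input-lifting layer
        self.part1 = nn.Sequential(*[DSConv(channels) for _ in range(10)])
        self.part2 = nn.Sequential(*[DSConv(channels) for _ in range(8)])
        self.part3 = nn.Sequential(*[DSConv(channels) for _ in range(6)])
        self.tail = nn.Conv2d(channels, 1, 1)              # ordinary 1x1 conv, 1 output map

    def forward(self, x):
        f1 = self.part1(self.head(x))
        f2 = self.part2(f1)
        f3 = self.part3(f2)
        # global skip connection: the network learns the filtering residual
        return x + self.tail(f3), (f1, f2, f3)
```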
The input and output of the model are, respectively, the reconstructed pixel map of the video encoder and the filtered pixel map. The loss function L_T is the mean square error between the output of the neural network, Y_T, and the original pixels Y:

L_T = (1/N) Σ_{i=1..N} (Y_T^i − Y^i)²    (1)

where Y_T^i denotes the output of the neural network at pixel i, Y^i the corresponding original pixel, and N the number of pixels.
after the teacher model is trained, the middle layer output of three sub-parts of the teacher model is obtained from the trained data set, the mapping result of the attention map of the neural network or the regenerated kernel Hilbert space is used for calculating the output result F of the teacher model at the places T . Intermediate layer calculation result F using teacher model T And the data set input by the teacher model form a new data set to train the student model.
The student model must be constructed similarly to the teacher model to ensure successful knowledge transfer. We therefore use a similar network structure with 9 depthwise separable convolution layers as the backbone of the student model, divided into three parts of 3 layers each, with 32 feature maps per layer and a 3x3 convolution kernel. Since the student model and the teacher model target the same task, the student's input and output are the same as the teacher's, and the same skip-connection structure is adopted so that the student model also learns the residual well.
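Under the same assumptions as the teacher sketch (single-channel input, input-lifting convolution, reuse of the DSConv block defined above), the student structure could be sketched as:

```python
class StudentModel(nn.Module):
    """9 depthwise-separable layers (3 + 3 + 3), 32 feature maps, final 1x1 conv, global skip."""
    def __init__(self, channels=32):
        super().__init__()
        self.head = nn.Conv2d(1, channels, 3, padding=1)
        self.part1 = nn.Sequential(*[DSConv(channels) for _ in range(3)])
        self.part2 = nn.Sequential(*[DSConv(channels) for _ in range(3)])
        self.part3 = nn.Sequential(*[DSConv(channels) for _ in range(3)])
        self.tail = nn.Conv2d(channels, 1, 1)

    def forward(self, x):
        f1 = self.part1(self.head(x))
        f2 = self.part2(f1)
        f3 = self.part3(f2)
        return x + self.tail(f3), (f1, f2, f3)
```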
The intermediate-layer outputs of the teacher model and of the student model are passed through the same mapping, and the mean square error between them is computed. The mapping is computed either as the attention map or, using the linear kernel function k(x, y) = x^T y, as an approximation of the mapping into the reproducing kernel Hilbert space; the mapped outputs F_T and F_S are given by formulas (2) and (3):
F_T = Σ_{i=1..C_T} |f_T^i|^p    (2)
F_S = Σ_{i=1..C_S} |f_S^i|^p    (3)

Here, F_T and F_S denote the attention maps of the teacher and student models respectively, f_T^i and f_S^i denote the i-th feature map of each model, and C_T and C_S denote the number of feature maps of the teacher model and the student model respectively. For practical applications, p is a positive integer. In this way an initialized student model is obtained.
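The two mappings and the resulting initialization loss can be sketched as follows. The per-map L2 normalization is a common practice borrowed from the attention-transfer and neuron-selectivity-transfer literature and is an assumption here; the formulas above only require the mean square error between the identically mapped outputs.

```python
import torch
import torch.nn.functional as F

def attention_map(feats, p=2):
    # formulas (2)/(3): F = sum_i |f^i|^p over the channel dimension, then flatten
    amap = feats.abs().pow(p).sum(dim=1).flatten(1)        # (N, H*W)
    return F.normalize(amap, dim=1)                        # normalization is an assumption

def linear_kernel_mmd(ft, fs):
    # alternative mapping: with the linear kernel k(x, y) = x^T y, the squared MMD between
    # the sets of (normalized) feature maps reduces to the squared distance of their means
    mu_t = F.normalize(ft.flatten(2), dim=2).mean(dim=1)   # (N, H*W)
    mu_s = F.normalize(fs.flatten(2), dim=2).mean(dim=1)
    return (mu_t - mu_s).pow(2).sum(dim=1).mean()

def initialization_loss(teacher_feats, student_feats, p=2):
    """Mean square error between identically mapped teacher/student intermediate outputs."""
    return sum(F.mse_loss(attention_map(fs, p), attention_map(ft, p))
               for ft, fs in zip(teacher_feats, student_feats))
```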
Then the standard mean square error L_S shown in formula (4) is used to perform the final training of the student model, after which the trained student model is obtained:

L_S = (1/N) Σ_{i=1..N} (Y_S^i − Y^i)²    (4)

where Y_S^i correspondingly denotes the output of the student model.
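Finally, a short sketch of this fine-tuning stage with the standard mean square error L_S of formula (4), reusing the student model and the train_loader placeholder from above; the learning rate and the single-epoch loop are assumptions.

```python
import torch
import torch.nn.functional as F

opt = torch.optim.Adam(student.parameters(), lr=1e-4)     # learning rate is an assumption
for recon, orig in train_loader:
    out, _ = student(recon)
    loss = F.mse_loss(out, orig)                           # L_S of formula (4)
    opt.zero_grad()
    loss.backward()
    opt.step()
```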
After such initialization, the lightweight model typically performs better than without the initialization method. Once the student model is obtained, it is integrated back into the video encoder, realizing neural-network-based loop filtering in video coding. FIGS. 2 and 3 show the training and test losses of the experiments; it is evident that the loss function decreases faster when the initialization is used.

Claims (1)

1. A neural-network-based loop filter optimization method in video coding, in which a neural network module is added to a traditional video encoder and the function of the module is loop filtering; for the loop filtering target task, the neural network module comprises a neural network model of higher complexity, namely a teacher model, and a simpler student model;
the teacher model is trained, and after training is completed the feature maps it generates are used to guide the weight initialization of the student model; the difference between the feature maps is calculated directly, or the feature maps are mapped into a reproducing kernel Hilbert space where their difference is calculated, with a kernel function method adopted to simplify the computation; the simple student model thereby achieves a better weight initialization effect, and after the weight initialization is completed the student model is trained in the usual way, so that it reaches a better global convergence point and performs better; the method comprises the following specific steps:
(1) for a specific learning task with its conventional loss function and model structure, a teacher model is first designed for the target task and trained with the conventional loss function;
(2) the intermediate-layer outputs of the trained teacher model are then exported, and feature maps are obtained through a mapping; the mapping is attention transfer or a mapping into a reproducing kernel Hilbert space using a kernel function;
(3) a student model with a simpler structure is designed, built from the same kind of network structure as the teacher model, that is, from the same basic network layers; when both network structures are serial stacks of convolution layers, the teacher model has more convolution layers and more feature maps while the student model has fewer convolution layers and fewer feature maps;
(4) the student model is trained with a loss function that is the mean square error between the feature maps calculated in step (2) and the feature maps of the student model obtained through the same mapping; after this training, the weights of the student model have been adjusted by learning knowledge from the teacher model, so that the weights of the student model are specifically initialized and the student model has the capability of approaching the performance of the teacher model;
(5) after initialization is completed, the student model is trained with the conventional loss function to obtain a usable student model;
the depth separable convolution and batch normalization are adopted as the main layer of a teacher model, the number of the characteristic layers is 64, and the convolution kernel size is 3x3; the method comprises the steps of using 24 depth separable convolution layers as a trunk of a teacher model, dividing the trunk into three parts, wherein the first part consists of 10 depth separable convolution layers, the second part consists of 8 depth separable convolution layers, and the third part consists of 6 depth separable convolution layers; the last layer of the model is a common convolution layer, the number of the layers is 1, and the size of a convolution kernel is 1x1; all depth separable convolutions have a ReLU activation function; the input of the model is connected to the final output through a direct connection edge, so that the neural network is in a residual error learning state, and the model has faster convergence;
the input and output of the model are respectively a reconstructed pixel diagram and a filtered pixel diagram of the video encoderPixel map, loss function L T Selecting output of neural networkAnd original pixel->Mean square error between:
after the teacher model is trained, the intermediate-layer outputs of its three sub-parts are computed on the training data set, and the attention map of the neural network or its mapping into the reproducing kernel Hilbert space is used to calculate the teacher's mapped outputs F_T at these locations; the intermediate results F_T of the teacher model, together with the inputs of the teacher's data set, form a new data set used to train the student model;
the student model adopts a network structure similar to that of the teacher model, with 9 depthwise separable convolution layers as its trunk, divided into three parts each consisting of 3 depthwise separable convolution layers, the number of feature maps being 32 and the convolution kernel size 3x3; the designed targets of the student model and the teacher model are the same, so the input and output of the student model are the same as the teacher's, and the same skip-connection structure is adopted so that the student model learns the residual well;
the intermediate-layer outputs of the teacher model and of the student model are passed through the same mapping, and the mean square error between them is calculated; the mapping is computed either as the attention map or, using the linear kernel function k(x, y) = x^T y, as an approximation of the mapping into the reproducing kernel Hilbert space; the mapped outputs are given by formulas (2) and (3) as follows:
F_T = Σ_{i=1..C_T} |f_T^i|^p    (2)
F_S = Σ_{i=1..C_S} |f_S^i|^p    (3)

here, F_T and F_S denote the attention maps of the teacher and student models respectively, f_T^i and f_S^i denote the i-th feature map of each model, C_T and C_S denote the number of feature maps of the teacher model and the student model respectively, and p is a positive integer; thus an initialized student model is obtained;
the standard mean square error L_S shown in formula (4) is used to perform the final training of the initialized student model, and the trained student model is obtained:

L_S = (1/N) Σ_{i=1..N} (Y_S^i − Y^i)²    (4)

where Y_S^i correspondingly denotes the output of the student model;
the student model is integrated back into the video encoder, thereby realizing neural-network-based loop filtering in video coding.
CN201911321102.2A 2019-12-20 2019-12-20 Neural network weight initialization method based on transfer learning Active CN111126599B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911321102.2A CN111126599B (en) 2019-12-20 2019-12-20 Neural network weight initialization method based on transfer learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911321102.2A CN111126599B (en) 2019-12-20 2019-12-20 Neural network weight initialization method based on transfer learning

Publications (2)

Publication Number Publication Date
CN111126599A CN111126599A (en) 2020-05-08
CN111126599B true CN111126599B (en) 2023-09-05

Family

ID=70500352

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911321102.2A Active CN111126599B (en) 2019-12-20 2019-12-20 Neural network weight initialization method based on transfer learning

Country Status (1)

Country Link
CN (1) CN111126599B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111882031A (en) * 2020-06-30 2020-11-03 华为技术有限公司 Neural network distillation method and device
CN111554268B (en) * 2020-07-13 2020-11-03 腾讯科技(深圳)有限公司 Language identification method based on language model, text classification method and device
CN112464959B (en) * 2020-12-12 2023-12-19 中南民族大学 Plant phenotype detection system and method based on attention and multiple knowledge migration
CN112929663B (en) * 2021-04-08 2022-07-15 中国科学技术大学 Knowledge distillation-based image compression quality enhancement method
CN113469977B (en) * 2021-07-06 2024-01-12 浙江霖研精密科技有限公司 Flaw detection device, method and storage medium based on distillation learning mechanism

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2017167347A (en) * 2016-03-16 2017-09-21 日本電信電話株式会社 Acoustic signal analysis device, method, and program
CN109087303A (en) * 2018-08-15 2018-12-25 中山大学 The frame of semantic segmentation modelling effect is promoted based on transfer learning
CN110163110A (en) * 2019-04-23 2019-08-23 中电科大数据研究院有限公司 A kind of pedestrian's recognition methods again merged based on transfer learning and depth characteristic

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10643602B2 (en) * 2018-03-16 2020-05-05 Microsoft Technology Licensing, Llc Adversarial teacher-student learning for unsupervised domain adaptation

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2017167347A (en) * 2016-03-16 2017-09-21 日本電信電話株式会社 Acoustic signal analysis device, method, and program
CN109087303A (en) * 2018-08-15 2018-12-25 中山大学 The frame of semantic segmentation modelling effect is promoted based on transfer learning
CN110163110A (en) * 2019-04-23 2019-08-23 中电科大数据研究院有限公司 A kind of pedestrian's recognition methods again merged based on transfer learning and depth characteristic

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Robust adaptive boosting algorithm with multiple support vector machines; Zhang Zhenyu; Journal of Dalian Jiaotong University; Vol. 31, No. 02; pp. 98-100 *

Also Published As

Publication number Publication date
CN111126599A (en) 2020-05-08

Similar Documents

Publication Publication Date Title
CN111126599B (en) Neural network weight initialization method based on transfer learning
CN110599409B (en) Convolutional neural network image denoising method based on multi-scale convolutional groups and parallel
CN112150521B (en) Image stereo matching method based on PSMNet optimization
CN109949214A (en) A kind of image Style Transfer method and system
CN108764471A (en) The neural network cross-layer pruning method of feature based redundancy analysis
CN113362250B (en) Image denoising method and system based on dual-tree quaternary wavelet and deep learning
CN113595993B (en) Vehicle-mounted sensing equipment joint learning method for model structure optimization under edge calculation
CN113111760B (en) Light-weight graph convolution human skeleton action recognition method based on channel attention
CN109598732B (en) Medical image segmentation method based on three-dimensional space weighting
CN109614968A (en) A kind of car plate detection scene picture generation method based on multiple dimensioned mixed image stylization
CN107967516A (en) A kind of acceleration of neutral net based on trace norm constraint and compression method
CN112967178A (en) Image conversion method, device, equipment and storage medium
CN112767286A (en) Dark light image self-adaptive enhancement method based on intensive deep learning
CN112598602A (en) Mask-based method for removing Moire of deep learning video
CN111882053B (en) Neural network model compression method based on splicing convolution
CN108629374A (en) A kind of unsupervised multi-modal Subspace clustering method based on convolutional neural networks
Hui et al. Two-stage convolutional network for image super-resolution
CN116958534A (en) Image processing method, training method of image processing model and related device
CN113989283B (en) 3D human body posture estimation method and device, electronic equipment and storage medium
CN110288603B (en) Semantic segmentation method based on efficient convolutional network and convolutional conditional random field
CN106296583B (en) Based on image block group sparse coding and the noisy high spectrum image ultra-resolution ratio reconstructing method that in pairs maps
CN116596764A (en) Lightweight image super-resolution method based on transform and convolution interaction
CN109448039B (en) Monocular vision depth estimation method based on deep convolutional neural network
CN111640087A (en) Image change detection method based on SAR (synthetic aperture radar) deep full convolution neural network
CN113436101B (en) Method for removing rain by Dragon lattice tower module based on efficient channel attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant