CN108830813B - Knowledge distillation-based image super-resolution enhancement method
- Publication number: CN108830813B (application CN201810603516.3A)
- Authority: CN (China)
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06T5/00 — Image enhancement or restoration
- G06N3/045 — Neural network architectures; combinations of networks
- G06T3/4053 — Scaling of whole images or parts thereof based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
- G06T3/4076 — Super-resolution scaling using the original low-resolution images to iteratively correct the high-resolution images
- G06T7/90 — Image analysis; determination of colour characteristics
- G06T2207/20081 — Training; Learning
- G06T2207/20084 — Artificial neural networks [ANN]
Abstract
The invention discloses a knowledge distillation-based image super-resolution enhancement method, which comprises the following steps: 1) acquiring training data and test data; 2) training a teacher network, the teacher network being a neural network model with more convolutional layers; 3) training a student network; 4) the teacher network guiding the student network to learn, the feature maps of the teacher network being absorbed by the student network through three groups of guiding experiments; 5) testing and evaluating the image reconstruction effect; 6) further guiding the student network according to different matrix relations among the output feature maps. The invention uses the idea of knowledge distillation to transfer the performance of the teacher network to the student network. The student network model can run efficiently on mobile and embedded devices with low-power-consumption constraints, and, without any change to the student network structure, the PSNR of the student network guided by the teacher network is significantly improved, thereby obtaining a better reconstruction effect.
Description
Technical Field
The invention relates to the field of computer vision and deep learning, in particular to an image super-resolution enhancement method based on knowledge distillation.
Background
Super-Resolution (SR) is a classic problem in computer vision. Single Image Super-Resolution (SISR) aims to recover the High-Resolution (HR) image corresponding to a single Low-Resolution (LR) image from that LR image by means of digital image processing and related methods. In the super-resolution problem, assuming the low-resolution image is X, the goal is to recover a super-resolution image Y' that is as similar as possible to the real ground-truth (GT) image Y.
Conventional interpolation-based upscaling methods include Bilinear Interpolation and Bicubic Interpolation. They compute the missing intermediate pixels of the enlarged high-resolution image with a fixed formula that takes a weighted average of neighboring pixels in the low-resolution image, but such simple interpolation algorithms cannot generate additional image details carrying high-frequency information.
The Super-Resolution Convolutional Neural Network (SRCNN) proposed by Dong et al. [1] first applied a convolutional neural network to image super-resolution; it directly learns the end-to-end mapping between an input low-resolution image and the corresponding high-resolution image. SRCNN demonstrated that deep learning is effective for the super-resolution problem and can reconstruct much of the high-frequency image detail. Inspired by VGG-net [3], Kim et al. [2] applied a Very Deep convolutional network for Super-Resolution (VDSR); the VDSR structure consists of 20 convolutional layers, and more convolutional layers provide larger receptive fields, so more image neighborhood information can be used to predict high-frequency details, yielding a better super-resolution reconstruction effect. Inspired by SRResNet [5], Lim et al. [4] proposed the deeper Enhanced Deep Super-Resolution network (EDSR), which optimizes the SRResNet structure and obtains a still better reconstruction effect.
As can be seen from prior academic work, image reconstruction quality improves as network depth increases. However, increasing the depth of the network also increases computation and memory consumption, and a deep convolutional neural network model cannot run in real time in many practical application scenarios (for example, under low-power-consumption constraints such as mobile terminals and embedded devices).
Disclosure of Invention
The invention aims to provide an image super-resolution enhancement method based on knowledge distillation which improves the image super-resolution reconstruction effect of a network model without changing the structure of a small convolutional neural network model, so that a super-resolution model based on a convolutional neural network can run efficiently on mobile and embedded terminals.
The technical scheme adopted by the invention is as follows:
a knowledge distillation-based image super-resolution enhancement method comprises the following steps:
1) acquiring training data and testing data;
1-1) selecting DIV2K and Flickr2K as the training sets, which together contain 3450 real images; the test sets are the international public data sets Set5, Set14, BSDS100 and Urban100;
1-2) performing 3× down-sampling on the real images of the training set with the Bicubic down-sampling method to obtain a group of corresponding low-resolution images;
1-3) reading the real images and the low-resolution images with the imread() function of the opencv library, the images being in BGR format, where B, G and R represent the blue, green and red portions of the color space respectively;
1-4) converting the images from BGR space to YCrCb space, where Y represents brightness (the gray-level value), Cr the difference between the red portion and the luminance of the signal, and Cb the difference between the blue portion and the luminance of the signal;
1-5) performing channel separation on the YCrCb images, selecting only the Y-channel data for training, and normalizing the Y-channel data;
1-6) cropping the Y-channel images, taking the cropped real image blocks as the training targets and the cropped low-resolution image blocks as the network input during training; each iteration requires 32 pairs of training data.
2) Training a teacher network; the teacher network is a neural network model with more convolutional layers.
2-1) The first layer of the teacher network is a feature extraction and representation layer composed of a convolutional layer and a nonlinear activation layer; the nonlinear activation layer uses ReLU as activation function, and the operation of the first layer is expressed by the formula
$F_1(X) = \max(0, W_1 * X + b_1)$
where $W_1$ and $b_1$ are the weights and biases of the first convolutional layer, "*" denotes the convolution operation, and the ReLU function is defined as max(0, x);
2-2) the middle of the teacher network consists of 10 residual blocks; each residual block has two convolutional layers, and each convolutional layer is followed by a nonlinear activation layer with ReLU activation; a skip connection adds the input of the first convolutional layer to the output of the second convolutional layer, so residual learning is performed only on the input of the first convolutional layer; each residual block is expressed by the formula
$F_{2n+1}(X) = \max(0, W_{2n+1} * F_n(X) + b_{2n+1}) + F_{2n-1}(X) \quad (1 \le n \le 10)$
where n is the residual block index, $F_n(X)$ is the output of the first convolutional layer and its nonlinear activation layer in the residual block, $W_{2n+1}$ and $b_{2n+1}$ are the weights and biases of the second convolutional layer in the residual block, and $F_{2n-1}(X)$ is the input of the residual block;
2-3) the reconstruction layer of the teacher network is a deconvolution layer, which up-samples the output of the preceding network layers so that the output super-resolution image is equal in size to the training target;
2-4) for training the teacher network, the learning rate is set to 0.0001 and the MSE function is used as the loss function between the training target and the network output:
$L = \frac{1}{n}\sum_{i=1}^{n} \| Y'_i - Y_i \|^2$
where n is the number of training samples, $Y_i$ is an input image and $Y'_i$ is a predicted image;
2-5) minimizing the loss function with the Adam optimization method.
3) Training a student network;
to achieve a better reconstruction effect, the invention removes the batch normalization (BN) layers from the student network structure.
3-1) The first layer of the student network is a feature extraction and representation layer whose parameter settings are the same as those of the first layer of the teacher network;
3-2) the middle of the student network consists of 3 depthwise separable convolution modules, each composed of a 3 × 3 depthwise convolution layer and a 1 × 1 convolutional layer; the depthwise convolution layer and the convolutional layer are each followed by a nonlinear activation layer with ReLU activation, and the depthwise convolution is expressed by the formula
$G_{k,l,m} = \sum_{i,j} K_{i,j,m} \cdot F_{k+i-1,\,l+j-1,\,m}$
where K is a $D_k \times D_k \times M$ depthwise convolution kernel; the m-th filter in K is applied to the m-th feature map of F to produce the m-th feature map of the filtered output feature map G;
3-3) the parameter settings of the student network reconstruction layer are the same as those of the teacher network reconstruction layer;
3-4) the learning rate, loss function and optimization method of the student network are the same as those of the teacher network;
4) The teacher network guides the student network to learn; the feature maps of the teacher network are absorbed by the student network through three groups of guiding experiments.
In the guiding experiments of step 4), the MSE function is used as the loss function between the training target and the network output, denoted loss0:
$\text{loss0} = \frac{1}{n}\sum_{i=1}^{n} \| Y'_i - Y_i \|^2$
where n is the number of experimental samples, $Y_i$ is an input image and $Y'_i$ is a predicted image.
Guiding experiment one: the output feature maps of the 1st depthwise separable convolution module of the student network are extracted and averaged, denoted $S_1$:
$S_1 = \frac{1}{n_1}\sum_{i=1}^{n_1} s_i$
where $n_1$ is the number of feature maps and $s_i$ is the i-th feature map output by the 1st depthwise separable convolution module of the student network;
the output feature maps of the 4th residual block of the teacher network are extracted and averaged, denoted $T_1$:
$T_1 = \frac{1}{n_1}\sum_{i=1}^{n_1} t_i$
where $n_1$ is the number of feature maps and $t_i$ is the i-th feature map output by the 4th residual block of the teacher network;
the MSE function is used as the loss function between $T_1$ and $S_1$ so that the student network learns the content of the teacher network's feature maps, denoted loss1:
$\text{loss1} = \| T_1 - S_1 \|^2$
where $T_1$ is the mean of the output feature maps of the teacher network's 4th residual block and $S_1$ is the mean of the output feature maps of the student network's 1st depthwise separable convolution module.
The total loss function of guiding experiment one is loss0 + loss1, minimized with the Adam optimization method.
Guiding experiment two: the output feature maps of the 2nd depthwise separable convolution module of the student network are extracted and averaged, denoted $S_2$:
$S_2 = \frac{1}{n_2}\sum_{i=1}^{n_2} s_{2i}$
where $n_2$ is the number of feature maps and $s_{2i}$ is the i-th feature map output by the 2nd depthwise separable convolution module of the student network;
the output feature maps of the 7th residual block of the teacher network are extracted and averaged, denoted $T_2$:
$T_2 = \frac{1}{n_2}\sum_{i=1}^{n_2} t_{2i}$
where $n_2$ is the number of feature maps and $t_{2i}$ is the i-th feature map output by the 7th residual block of the teacher network;
the MSE function is used as the loss function between $T_2$ and $S_2$, denoted loss2:
$\text{loss2} = \| T_2 - S_2 \|^2$
The total loss function of guiding experiment two is loss0 + loss2, minimized with the Adam optimization method.
Guiding experiment three: the output feature maps of the 3rd depthwise separable convolution module of the student network are extracted and averaged, denoted $S_3$:
$S_3 = \frac{1}{n_3}\sum_{i=1}^{n_3} s_{3i}$
where $n_3$ is the number of feature maps and $s_{3i}$ is the i-th feature map output by the 3rd depthwise separable convolution module of the student network;
the output feature maps of the 10th residual block of the teacher network are extracted and averaged, denoted $T_3$:
$T_3 = \frac{1}{n_3}\sum_{i=1}^{n_3} t_{3i}$
where $n_3$ is the number of feature maps and $t_{3i}$ is the i-th feature map output by the 10th residual block of the teacher network;
the MSE function is used as the loss function between $T_3$ and $S_3$, denoted loss3:
$\text{loss3} = \| T_3 - S_3 \|^2$
The total loss function of guiding experiment three is loss0 + loss3, minimized with the Adam optimization method.
5) Testing and evaluating the image reconstruction effect.
The real images of the test set are read with the imread() function of the opencv library; the image format is BGR data. The BGR images are converted to YCrCb space and channel-separated, and only the Y-channel data, whose gray values lie in [0, 255], are used for testing. The test-set gray images are 3× down-sampled with the Bicubic down-sampling method to obtain the corresponding low-resolution images, whose Y-channel data are normalized so that the gray values lie in [0, 1] and used as network input. Finally, the PSNR between the network output and the gray image of the real image is computed to measure the super-resolution reconstruction effect.
Generally, the Peak Signal-to-Noise Ratio (PSNR) is used to evaluate the quality of the image reconstruction; the higher the PSNR value, the better the reconstruction effect.
6) Further guiding the student network according to different matrix relations among the output feature maps.
Let the output tensor of an activation layer of the convolutional neural network be $A \in \mathbb{R}^{C \times H \times W}$, where C is the number of feature maps and H and W are their height and width.
The function M takes the tensor A as input and outputs a two-dimensional matrix, i.e. $M: \mathbb{R}^{C \times H \times W} \to \mathbb{R}^{H \times W}$, and the output feature maps satisfy the following relations:
maximum of the feature maps: $M_{\max}(A) = \max_{i=1,\dots,C} A_i$;
minimum of the feature maps: $M_{\min}(A) = \min_{i=1,\dots,C} A_i$;
power mean of the feature maps: $M_{mean}^{p}(A) = \frac{1}{C}\sum_{i=1}^{C} A_i^{p}$.
In these formulas, M is the mapping function, p the power coefficient, A the output tensor of an activation layer of the convolutional neural network, C the number of feature maps, and i the feature map index.
By adopting the above technical scheme, the invention uses knowledge distillation [6] to make a smaller neural network model learn the characteristics of a deeper network model, improving the image super-resolution enhancement effect of the small network model without changing its structure or increasing its computation, so that a super-resolution model with a better effect can run efficiently on mobile or embedded terminals with low-power-consumption constraints.
Drawings
The invention is described in further detail below with reference to the accompanying drawings and the detailed description;
FIG. 1 is a schematic diagram of a teacher network structure of a knowledge distillation-based image super-resolution enhancement method of the present invention;
FIG. 2 is a schematic diagram of a student network structure of an image super-resolution enhancement method based on knowledge distillation according to the present invention;
FIG. 3 is a schematic diagram of a teaching process of a teacher network to a student network of the knowledge distillation-based image super-resolution enhancement method of the present invention;
FIG. 4 shows comparison results for part of the experiments of the knowledge distillation-based image super-resolution enhancement method.
Detailed Description
As shown in figs. 1 to 4, an object of the present invention is to provide a super-resolution reconstruction method based on knowledge distillation which improves the image super-resolution reconstruction effect of a network model without changing the structure of a small convolutional neural network model, so that a super-resolution model based on a convolutional neural network can run efficiently on mobile and embedded terminals.
The invention discloses a super-resolution reconstruction method based on knowledge distillation, the specific embodiments of which are as follows:
(1) Acquiring a training set and a test set.
The training sets are DIV2K and Flickr2K: DIV2K has 800 real images and Flickr2K has 2650, for a total of 3450 images.
The test sets are the international public data sets Set5, Set14, BSDS100 and Urban100: Set5 has 5 test images, Set14 has 14, and BSDS100 and Urban100 each have 100.
The real images of the training set are 3× down-sampled with the Bicubic down-sampling method to obtain a group of corresponding low-resolution images.
The real images and the low-resolution images are read separately with the imread() function of the opencv library. The images are formatted as BGR data, where B, G and R represent the blue, green and red portions of the color space respectively. The BGR images are then converted to YCrCb space, where Y represents brightness (the gray-level value), Cr the difference between the red portion and the luminance of the signal, and Cb the difference between the blue portion and the luminance of the signal.
The conversion formula from the BGR space to the YCrCb space is as follows:
Y=0.097906×B+0.504129×G+0.256789×R+16.0
Cr=-0.071246×B-0.367789×G+0.439215×R+128.0
Cb=0.439215×B-0.290992×G-0.148223×R+128.0
The YCrCb images are channel-separated and only the Y-channel data, whose gray values lie in [0, 255], are selected for training; the Y-channel data are normalized so that the gray values lie in [0, 1].
The Y-channel images are then cropped: with a down-sampling factor of 3, the Y-channel image corresponding to the real image is cut into 120 × 120 blocks used as training targets, and the Y-channel image corresponding to the low-resolution image is cut into 40 × 40 blocks used as input during network training. Each iteration requires 32 pairs of training data.
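For illustration, the data preparation described above can be sketched in Python with OpenCV and NumPy. The patent names only the opencv imread() function; the helper below and its names are assumptions, not the patent's code:

```python
import cv2
import numpy as np

def prepare_pair(hr_path, scale=3, hr_crop=120):
    # Read the real (high-resolution) image; opencv returns BGR data
    hr_bgr = cv2.imread(hr_path)
    # Convert BGR -> YCrCb and keep only the Y (luminance) channel
    hr_y = cv2.cvtColor(hr_bgr, cv2.COLOR_BGR2YCrCb)[:, :, 0]
    # Bicubic 3x down-sampling to synthesize the low-resolution image
    h, w = hr_y.shape
    lr_y = cv2.resize(hr_y, (w // scale, h // scale),
                      interpolation=cv2.INTER_CUBIC)
    # Normalize gray values from [0, 255] to [0, 1]
    hr_y = hr_y.astype(np.float32) / 255.0
    lr_y = lr_y.astype(np.float32) / 255.0
    # Crop an aligned pair: 40x40 input block, 120x120 target block (scale 3);
    # assumes the image is at least hr_crop pixels in each dimension
    lr_crop = hr_crop // scale
    top = np.random.randint(0, lr_y.shape[0] - lr_crop + 1)
    left = np.random.randint(0, lr_y.shape[1] - lr_crop + 1)
    lr_block = lr_y[top:top + lr_crop, left:left + lr_crop]
    hr_block = hr_y[top * scale:top * scale + hr_crop,
                    left * scale:left * scale + hr_crop]
    return lr_block, hr_block
```

Each training iteration would draw 32 such pairs to form a batch.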
(2) Training the teacher network.
The teacher network is a neural network model with more convolutional layers, as shown in fig. 1. The first layer of the teacher network is a feature extraction and representation layer consisting of a convolutional layer of 64 filters of size 3 × 3 and a nonlinear activation layer. The padding mode of the convolutional layer is set to 'SAME' and the sliding step (stride) of the convolution kernel to 1, so the image sizes before and after the convolution operation are equal; the weight initialization method is the Xavier method, the bias terms are initialized to 0, and the nonlinear activation layer uses ReLU as activation function. The operation of the first layer is expressed by the formula
$F_1(X) = \max(0, W_1 * X + b_1)$
where $W_1$ and $b_1$ are the weights and biases of the first convolutional layer, "*" denotes the convolution operation, and the ReLU function is defined as max(0, x).
The middle of the teacher network consists of 10 residual blocks; each residual block has two convolutional layers, and each convolutional layer is followed by a nonlinear activation layer with ReLU activation. A skip connection adds the input of the first convolutional layer to the output of the second convolutional layer, so residual learning is performed only on the input of the first convolutional layer. Each convolutional layer consists of 64 filters of size 3 × 3, padding is set to 'SAME', stride to 1, the weight initialization method is the Xavier method, and the biases are initialized to 0. Each residual block is expressed by the formula
$F_{2n+1}(X) = \max(0, W_{2n+1} * F_n(X) + b_{2n+1}) + F_{2n-1}(X) \quad (1 \le n \le 10)$
where n is the residual block index, $F_n(X)$ is the output of the first convolutional layer and its nonlinear activation layer in the residual block, $W_{2n+1}$ and $b_{2n+1}$ are the weights and biases of the second convolutional layer in the residual block, and $F_{2n-1}(X)$ is the input of the residual block.
When the up-sampling factor is 3, the reconstruction layer of the teacher network is one deconvolution layer with a filter of size 3 × 3 and stride set to 3. The purpose of the deconvolution layer is to up-sample the output of the preceding network layers so that the output super-resolution image is equal in size to the training target.
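The patent does not name an implementation framework. As a rough sketch under that caveat, the teacher structure could look like this in PyTorch (class names are illustrative):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    # Two 3x3 convs (64 filters, 'SAME' padding, stride 1), ReLU after each;
    # a skip connection adds the block input to the second conv's output
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, stride=1, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.conv1(x))          # F_n(X)
        return self.relu(self.conv2(out)) + x   # F_{2n+1}(X)

class TeacherNet(nn.Module):
    def __init__(self, num_blocks=10, channels=64, scale=3):
        super().__init__()
        # Feature extraction and representation layer
        self.head = nn.Sequential(nn.Conv2d(1, channels, 3, padding=1),
                                  nn.ReLU(inplace=True))
        self.body = nn.Sequential(*[ResidualBlock(channels)
                                    for _ in range(num_blocks)])
        # Reconstruction layer: a 3x3 deconvolution with stride 3 maps an
        # HxW input to 3Hx3W, matching the training target size
        self.tail = nn.ConvTranspose2d(channels, 1, 3, stride=scale)

    def forward(self, x):
        return self.tail(self.body(self.head(x)))
```

The Xavier weight initialization and zero biases specified above could be applied to each Conv2d weight with nn.init.xavier_uniform_.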
For training the teacher network, the learning rate is set to 0.0001 and the MSE function is used as the loss function between the training target and the network output:
$L = \frac{1}{n}\sum_{i=1}^{n} \| Y'_i - Y_i \|^2$
where n is the number of experimental samples, $Y_i$ is an input image and $Y'_i$ is a predicted image.
The loss function is minimized with the Adam optimization method.
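One training iteration under these settings might be sketched as follows, continuing the illustrative PyTorch code above:

```python
teacher = TeacherNet()
optimizer = torch.optim.Adam(teacher.parameters(), lr=1e-4)  # learning rate 0.0001
criterion = nn.MSELoss()  # MSE between network output and training target

def train_step(lr_batch, hr_batch):
    # lr_batch: (32, 1, 40, 40) low-resolution Y blocks,
    # hr_batch: (32, 1, 120, 120) real-image Y blocks
    optimizer.zero_grad()
    loss = criterion(teacher(lr_batch), hr_batch)
    loss.backward()
    optimizer.step()
    return loss.item()
```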
(3) Training the student network
The structure of the student network is shown in fig. 2; to achieve a better reconstruction effect, the invention removes the batch normalization (BN) layers from the network structure. The first layer of the student network is a feature extraction and representation layer whose parameter settings are the same as those of the first layer of the teacher network.
The middle of the student network consists of 3 depthwise separable convolution modules; each module consists of a 3 × 3 depthwise convolution layer with 64 filters and a 1 × 1 convolutional layer with 64 filters, and the depthwise convolution layer and the convolutional layer are each followed by a nonlinear activation layer with ReLU activation. The padding mode of the depthwise convolution is set to 'SAME' and its stride to 1; the stride of the convolutional layer is set to 1, the weight initialization method is the Xavier method, and the biases are initialized to 0.
The operation of the depthwise convolution is expressed by the formula
$G_{k,l,m} = \sum_{i,j} K_{i,j,m} \cdot F_{k+i-1,\,l+j-1,\,m}$
where K is a $D_k \times D_k \times M$ depthwise convolution kernel; the m-th filter in K is applied to the m-th feature map of F to produce the m-th feature map of the filtered output feature map G.
The parameter setting of the student network reconstruction layer is the same as that of the teacher network reconstruction layer.
The learning rate, loss function and optimization method of the student network are the same as those of the teacher network.
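Under the same caveat as before, a depthwise separable module and the student network could be sketched as:

```python
class DepthwiseSeparableBlock(nn.Module):
    # 3x3 depthwise conv (one filter per channel, groups=channels) followed
    # by a 1x1 pointwise conv; each is followed by a ReLU activation
    def __init__(self, channels=64):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, 3, padding=1,
                                   groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, 1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.pointwise(self.relu(self.depthwise(x))))

class StudentNet(nn.Module):
    # Same head and reconstruction layer as the teacher, but only 3 cheap
    # depthwise separable modules in the middle and no BN layers
    def __init__(self, num_blocks=3, channels=64, scale=3):
        super().__init__()
        self.head = nn.Sequential(nn.Conv2d(1, channels, 3, padding=1),
                                  nn.ReLU(inplace=True))
        self.body = nn.Sequential(*[DepthwiseSeparableBlock(channels)
                                    for _ in range(num_blocks)])
        self.tail = nn.ConvTranspose2d(channels, 1, 3, stride=scale)

    def forward(self, x):
        return self.tail(self.body(self.head(x)))
```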
(4) The teacher network guides the student network to learn.
The process by which the teacher network guides the student network is shown in fig. 3.
Experiment one:
the output feature maps of the 1st module of the student network are extracted and averaged, denoted $S_1$:
$S_1 = \frac{1}{n}\sum_{i=1}^{n} s_i$
where n is the number of feature maps and $s_i$ is the i-th feature map output by the 1st module of the student network.
The output feature maps of the 4th module of the teacher network are extracted and averaged, denoted $T_1$:
$T_1 = \frac{1}{n}\sum_{i=1}^{n} t_i$
where n is the number of feature maps and $t_i$ is the i-th feature map output by the 4th module of the teacher network.
The MSE function is used as the loss function between $T_1$ and $S_1$ so that the student network learns the content of the teacher network's feature maps, denoted loss1:
$\text{loss1} = \| T_1 - S_1 \|^2$
The MSE function is also used as the loss function between the training target and the network output, denoted loss0:
$\text{loss0} = \frac{1}{n}\sum_{i=1}^{n} \| Y'_i - Y_i \|^2$
where n is the number of experimental samples, $Y_i$ is an input image and $Y'_i$ is a predicted image.
The total loss function loss = loss0 + loss1 is minimized with the Adam optimization method.
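A sketch of this combined objective, assuming the two feature maps have already been captured (for example with forward hooks); the helper below is illustrative, not the patent's code:

```python
mse = nn.MSELoss()

def guided_loss(student_out, hr_target, student_feat, teacher_feat):
    # student_feat: output of the student's 1st module, shape (B, n, H, W)
    # teacher_feat: output of the teacher's 4th module, shape (B, n, H, W)
    loss0 = mse(student_out, hr_target)   # reconstruction loss vs. target
    s1 = student_feat.mean(dim=1)         # average over the n feature maps -> S1
    t1 = teacher_feat.mean(dim=1)         # average over the n feature maps -> T1
    loss1 = mse(s1, t1)                   # feature-imitation loss
    return loss0 + loss1                  # total loss for experiment one
```

Experiments two and three below differ only in which modules the feature maps are taken from.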
Experiment two:
the output feature maps of the 2nd module of the student network are extracted and averaged, denoted $S_2$.
The output feature maps of the 7th module of the teacher network are extracted and averaged, denoted $T_2$.
Experiment three:
the output feature maps of the 3rd module of the student network are extracted and averaged, denoted $S_3$.
The output feature maps of the 10th module of the teacher network are extracted and averaged, denoted $T_3$.
The MSE function is used as the loss function between $T_2$ and $S_2$ and between $T_3$ and $S_3$, denoted loss2 and loss3 respectively.
The total loss functions of experiments two and three are loss0 + loss2 and loss0 + loss3 respectively, each minimized with the Adam optimization method.
(5) Testing
The real images of the test set are read with the imread() function of the opencv library; the image format is BGR data. The BGR images are converted to YCrCb space and channel-separated, and only the Y-channel data, whose gray values lie in [0, 255], are used for testing. The test-set gray images are 3× down-sampled with the Bicubic down-sampling method to obtain the corresponding low-resolution images, whose Y-channel data are normalized so that the gray values lie in [0, 1] and used as network input. Finally, the PSNR between the network output and the gray image of the real image is computed to measure the super-resolution reconstruction effect. Generally, the Peak Signal-to-Noise Ratio (PSNR) is used to evaluate the quality of the image reconstruction; the higher the PSNR value, the better the reconstruction effect.
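PSNR on [0, 1]-normalized Y-channel data can be computed with the standard formula, using a peak value of 1.0; this short helper is an illustration, not the patent's code:

```python
import numpy as np

def psnr(pred_y, gt_y, peak=1.0):
    # Mean squared error between reconstructed and real Y-channel images
    mse_val = np.mean((pred_y.astype(np.float64) - gt_y.astype(np.float64)) ** 2)
    if mse_val == 0:
        return float('inf')  # identical images
    # Higher PSNR means a better reconstruction
    return 10.0 * np.log10(peak ** 2 / mse_val)
```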
The results of steps (2), (3) and (4) are shown in Table 1.
TABLE 1 Teacher network to student network guidance effect
As can be seen from table 1, the PSNR of experiments one and two is slightly improved relative to the student network.
(6) Further guidance: different matrix relations among the output feature maps are considered to further guide the student network. Let the output tensor of an activation layer of the convolutional neural network be $A \in \mathbb{R}^{C \times H \times W}$, where C is the number of feature maps and H and W are their height and width. The function M takes the tensor A as input and outputs a two-dimensional matrix, namely:
$M: \mathbb{R}^{C \times H \times W} \to \mathbb{R}^{H \times W}$
The invention considers the following relations among the feature maps:
maximum of the feature maps: $M_{\max}(A) = \max_{i=1,\dots,C} A_i$;
minimum of the feature maps: $M_{\min}(A) = \min_{i=1,\dots,C} A_i$;
power mean of the feature maps: $M_{mean}^{p}(A) = \frac{1}{C}\sum_{i=1}^{C} A_i^{p}$.
In these formulas, M is the mapping function, p the power coefficient, A the output tensor of an activation layer of the convolutional neural network, C the number of feature maps, and i the feature map index.
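These mapping functions are easy to state in code; the sketch below assumes an activation tensor of shape (C, H, W) and includes the power mean referenced as $M_{mean}^{2}(A)$ later in the text:

```python
import torch

def m_max(A):        # M_max(A): channel-wise maximum, shape (H, W)
    return A.max(dim=0).values

def m_min(A):        # M_min(A): channel-wise minimum, shape (H, W)
    return A.min(dim=0).values

def m_mean(A, p=2):  # M_mean^p(A): mean of p-th powers over the C maps
    return (A ** p).mean(dim=0)
```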
On the basis of steps (4) and (5), the total loss function loss = 2 × loss0 + loss1 + loss2 is minimized with the Adam optimization method. The results of the teacher network further guiding the student network are shown in Table 2.
TABLE 2 further guidance of teacher network to student network
The effect of the $M_{mean}^{2}(A)$ method compared with bicubic interpolation and the plain student network is shown in fig. 4.
With the super-resolution method of the invention, the PSNR of the student network guided by the teacher network is significantly improved without changing the student network structure, obtaining a better reconstruction effect. The innovations of the knowledge distillation-based super-resolution image enhancement method mainly comprise the following three aspects:
First, the invention uses the idea of knowledge distillation to transfer the performance of the teacher network to the student network, greatly improving the image super-resolution reconstruction effect of the student network without changing the student network model structure.
Secondly, to determine an effective way of transferring information from the teacher network to the student network model, the invention compares 7 different feature extraction and transfer methods and finally determines the optimal feature extraction mode.
Third, the teacher network consumes significant computing resources, while the student network model requires only a small amount of computation. The student network model provided by the invention can run efficiently on mobile and embedded devices with low-power-consumption constraints.
The technical solutions of the present invention have been described in detail, but the embodiments of the present invention are not limited to this description. It will be apparent to those skilled in the art that various changes may be made without departing from the spirit of the invention, and all changes that are equivalent or similar to the invention are intended to fall within its scope.
References
[1] Chao Dong, Chen Change Loy, Kaiming He, Xiaoou Tang. Image Super-Resolution Using Deep Convolutional Networks [J]. IEEE Transactions on Pattern Analysis & Machine Intelligence, 2016, 38(2): 295-307.
[2] Jiwon Kim, Jung Kwon Lee, Kyoung Mu Lee. Accurate Image Super-Resolution Using Very Deep Convolutional Networks [C]. IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2016: 1646-1654.
[3] Karen Simonyan, Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition [J]. Computer Science, 2014.
[4] Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, Kyoung Mu Lee. Enhanced Deep Residual Networks for Single Image Super-Resolution [C]. Computer Vision and Pattern Recognition Workshops. IEEE, 2017: 1132-1140.
[5] Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, Wenzhe Shi. Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network [C]. IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2017: 105-114.
[6] Geoffrey Hinton, Oriol Vinyals, Jeff Dean. Distilling the Knowledge in a Neural Network [J]. Computer Science, 2015, 14(7): 38-39.
Claims (4)
1. A knowledge distillation-based image super-resolution enhancement method, characterized by comprising the following steps:
1) acquiring training data and testing data;
1-1) selecting DIV2K and Flickr2K as the training sets, which together contain 3450 real images; the test sets are the international public data sets Set5, Set14, BSDS100 and Urban100;
1-2) performing 3× down-sampling on the real images of the training set with the Bicubic down-sampling method to obtain a group of corresponding low-resolution images;
1-3) reading the real images and the low-resolution images with the imread() function of the opencv library, the images being in BGR format, where B, G and R represent the blue, green and red portions of the color space respectively;
1-4) converting the images from BGR space to YCrCb space, where Y represents brightness (the gray-level value), Cr the difference between the red portion and the luminance of the signal, and Cb the difference between the blue portion and the luminance of the signal;
1-5) performing channel separation on the YCrCb images, selecting only the Y-channel data for training, and normalizing the Y-channel data;
1-6) cropping the Y-channel images, taking the cropped real image blocks as the training targets and the cropped low-resolution image blocks as the network input during training, each iteration requiring 32 pairs of training data;
2) training a teacher network, the teacher network being a neural network model with more convolutional layers;
2-1) the first layer of the teacher network is a feature extraction and representation layer composed of a convolutional layer and a nonlinear activation layer; the nonlinear activation layer uses ReLU as activation function, and the operation of the first layer is expressed by the formula $F_1(X) = \max(0, W_1 * X + b_1)$, where $W_1$ and $b_1$ are the weights and biases of the first convolutional layer, "*" denotes the convolution operation, and the ReLU function is defined as max(0, x);
2-2) the middle of the teacher network consists of 10 residual blocks; each residual block has two convolutional layers, and each convolutional layer is followed by a nonlinear activation layer with ReLU activation; a skip connection adds the input of the first convolutional layer to the output of the second convolutional layer, so residual learning is performed only on the input of the first convolutional layer; each residual block is expressed by the formula
$F_{2o+1}(X) = \max(0, W_{2o+1} * F_o(X) + b_{2o+1}) + F_{2o-1}(X) \quad (1 \le o \le 10)$
where o is the residual block index, $F_o(X)$ is the output of the first convolutional layer and its nonlinear activation layer in the residual block, $W_{2o+1}$ and $b_{2o+1}$ are the weights and biases of the second convolutional layer in the residual block, and $F_{2o-1}(X)$ is the input of the residual block;
2-3) the reconstruction layer of the teacher network is a deconvolution layer, which up-samples the output of the preceding network layers so that the output super-resolution image is equal in size to the training target;
2-4) for training the teacher network, the learning rate is set to 0.0001 and the MSE function is used as the loss function between the training target and the network output: $L = \frac{1}{n}\sum_{i=1}^{n} \| Y'_i - Y_i \|^2$, where n is the number of training samples, $Y_i$ is an input image and $Y'_i$ is a predicted image;
2-5) minimizing the loss function with the Adam optimization method;
3) training a student network;
3-1) the first layer of the student network is a feature extraction and representation layer whose parameter settings are the same as those of the first layer of the teacher network;
3-2) the middle of the student network consists of 3 depthwise separable convolution modules, each composed of a 3 × 3 depthwise convolution layer and a 1 × 1 convolutional layer; the depthwise convolution layer and the convolutional layer are each followed by a nonlinear activation layer with ReLU activation, and the depthwise convolution is expressed by the formula
$G_{k,l,m} = \sum_{i,j} K_{i,j,m} \cdot F_{k+i-1,\,l+j-1,\,m}$
where K is a $D_k \times D_k \times M$ depthwise convolution kernel; the m-th filter in K is applied to the m-th feature map of F to produce the m-th feature map of the filtered output feature map G;
3-3) the parameter settings of the student network reconstruction layer are the same as those of the teacher network reconstruction layer;
3-4) the learning rate, loss function and optimization method of the student network are the same as those of the teacher network, i.e. the learning rate is set to 0.0001 and the MSE function $L = \frac{1}{n}\sum_{i=1}^{n} \| Y'_i - Y_i \|^2$ is used as the loss function between the training target and the network output, where n is the number of training samples;
4) the teacher network guiding the student network to learn, the feature maps of the teacher network being absorbed by the student network through three groups of guiding experiments;
5) testing and evaluating the image reconstruction effect;
6) further guiding the student network according to different matrix relations among the output feature maps;
letting the output tensor of an activation layer of the convolutional neural network be $A \in \mathbb{R}^{C \times H \times W}$, where C is the number of feature maps and H and W are their height and width,
the function M taking the tensor A as input and outputting a two-dimensional matrix, i.e. $M: \mathbb{R}^{C \times H \times W} \to \mathbb{R}^{H \times W}$, the output feature maps satisfy the following relations:
maximum of the feature maps: $M_{\max}(A) = \max_{i=1,\dots,C} A_i$;
minimum of the feature maps: $M_{\min}(A) = \min_{i=1,\dots,C} A_i$;
power mean of the feature maps: $M_{mean}^{p}(A) = \frac{1}{C}\sum_{i=1}^{C} A_i^{p}$;
where M is the mapping function, p the power coefficient, A the output tensor of an activation layer of the convolutional neural network, C the number of feature maps, and i the feature map index.
2. The knowledge distillation-based image super-resolution enhancement method according to claim 1, characterized in that the conversion formula from BGR space to YCrCb space in step 1-4) is as follows:
Y=0.097906×B+0.504129×G+0.256789×R+16.0
Cr=-0.071246×B-0.367789×G+0.439215×R+128.0
Cb=0.439215×B-0.290992×G-0.148223×R+128.0。
3. The knowledge distillation-based image super-resolution enhancement method according to claim 1, characterized in that in the guiding experiments of step 4) the MSE function is used as the loss function between the training target and the network output, denoted loss0:
$\text{loss0} = \frac{1}{n}\sum_{i=1}^{n} \| Y'_i - Y_i \|^2$
where n is the number of experimental samples, $Y_i$ is an input image and $Y'_i$ is a predicted image;
guiding experiment one: the output feature maps of the 1st depthwise separable convolution module of the student network are extracted and averaged, denoted $S_1$: $S_1 = \frac{1}{n_1}\sum_{i=1}^{n_1} s_i$, where $n_1$ is the number of feature maps and $s_i$ is the i-th feature map output by the 1st depthwise separable convolution module of the student network;
the output feature maps of the 4th residual block of the teacher network are extracted and averaged, denoted $T_1$: $T_1 = \frac{1}{n_1}\sum_{i=1}^{n_1} t_i$, where $n_1$ is the number of feature maps and $t_i$ is the i-th feature map output by the 4th residual block of the teacher network;
the MSE function is used as the loss function between $T_1$ and $S_1$ so that the student network learns the content of the teacher network's feature maps, denoted loss1: $\text{loss1} = \| T_1 - S_1 \|^2$, where $T_1$ is the mean of the output feature maps of the teacher network's 4th residual block and $S_1$ is the mean of the output feature maps of the student network's 1st depthwise separable convolution module;
the total loss function of guiding experiment one is loss0 + loss1, minimized with the Adam optimization method;
guiding experiment two: the output feature maps of the 2nd depthwise separable convolution module of the student network are extracted and averaged, denoted $S_2$: $S_2 = \frac{1}{n_2}\sum_{i=1}^{n_2} s_{2i}$, where $n_2$ is the number of feature maps and $s_{2i}$ is the i-th feature map output by the 2nd depthwise separable convolution module of the student network;
the output feature maps of the 7th residual block of the teacher network are extracted and averaged, denoted $T_2$: $T_2 = \frac{1}{n_2}\sum_{i=1}^{n_2} t_{2i}$, where $n_2$ is the number of feature maps and $t_{2i}$ is the i-th feature map output by the 7th residual block of the teacher network;
the MSE function is used as the loss function between $T_2$ and $S_2$, denoted loss2: $\text{loss2} = \| T_2 - S_2 \|^2$;
the total loss function of guiding experiment two is loss0 + loss2, minimized with the Adam optimization method;
guiding experiment three: the output feature maps of the 3rd depthwise separable convolution module of the student network are extracted and averaged, denoted $S_3$: $S_3 = \frac{1}{n_3}\sum_{i=1}^{n_3} s_{3i}$, where $n_3$ is the number of feature maps and $s_{3i}$ is the i-th feature map output by the 3rd depthwise separable convolution module of the student network;
the output feature maps of the 10th residual block of the teacher network are extracted and averaged, denoted $T_3$: $T_3 = \frac{1}{n_3}\sum_{i=1}^{n_3} t_{3i}$, where $n_3$ is the number of feature maps and $t_{3i}$ is the i-th feature map output by the 10th residual block of the teacher network;
the MSE function is used as the loss function between $T_3$ and $S_3$, denoted loss3: $\text{loss3} = \| T_3 - S_3 \|^2$;
the total loss function of guiding experiment three is loss0 + loss3, minimized with the Adam optimization method.
4. The knowledge distillation-based image super-resolution enhancement method according to claim 1, characterized in that step 5) comprises the following specific steps:
5-1) reading the real images of the test set with the imread() function of the opencv library, the images being in BGR data format;
5-2) converting the BGR images to YCrCb space and performing channel separation on them, selecting only the Y-channel data, whose gray values lie in [0, 255], for testing;
5-3) performing 3× down-sampling on the test-set gray images with the Bicubic down-sampling method to obtain the corresponding low-resolution images, and normalizing their Y-channel data so that the gray values lie in [0, 1], to be used as network input;
5-4) computing the PSNR between the network output and the gray image of the real image to measure the super-resolution reconstruction effect.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
| --- | --- | --- | --- |
| CN201810603516.3A (CN108830813B) | 2018-06-12 | 2018-06-12 | Knowledge distillation-based image super-resolution enhancement method |

Applications Claiming Priority (1)

| Application Number | Priority Date | Filing Date | Title |
| --- | --- | --- | --- |
| CN201810603516.3A (CN108830813B) | 2018-06-12 | 2018-06-12 | Knowledge distillation-based image super-resolution enhancement method |
Publications (2)

| Publication Number | Publication Date |
| --- | --- |
| CN108830813A | 2018-11-16 |
| CN108830813B | 2021-11-09 |
Family

ID=64143896

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
| --- | --- | --- | --- |
| CN201810603516.3A (CN108830813B, Active) | Knowledge distillation-based image super-resolution enhancement method | 2018-06-12 | 2018-06-12 |

Country Status (1)

| Country | Link |
| --- | --- |
| CN | CN108830813B (en) |
Families Citing this family (53)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3785222B1 (en) | 2018-05-30 | 2024-04-17 | Shanghai United Imaging Healthcare Co., Ltd. | Systems and methods for image processing |
CN109658354B (en) * | 2018-12-20 | 2022-02-08 | 上海联影医疗科技股份有限公司 | Image enhancement method and system |
CN109816636B (en) * | 2018-12-28 | 2020-11-27 | 汕头大学 | Crack detection method based on intelligent terminal |
CN110309842B (en) * | 2018-12-28 | 2023-01-06 | 中国科学院微电子研究所 | Object detection method and device based on convolutional neural network |
CN109637546B (en) * | 2018-12-29 | 2021-02-12 | 苏州思必驰信息科技有限公司 | Knowledge distillation method and apparatus |
CN111414987B (en) * | 2019-01-08 | 2023-08-29 | 南京人工智能高等研究院有限公司 | Training method and training device of neural network and electronic equipment |
CN110458765B (en) * | 2019-01-25 | 2022-12-02 | 西安电子科技大学 | Image quality enhancement method based on perception preserving convolution network |
CN109978763A (en) * | 2019-03-01 | 2019-07-05 | 昆明理工大学 | A kind of image super-resolution rebuilding algorithm based on jump connection residual error network |
CN111814816A (en) * | 2019-04-12 | 2020-10-23 | 北京京东尚科信息技术有限公司 | Target detection method, device and storage medium thereof |
CN110111256B (en) * | 2019-04-28 | 2023-03-31 | 西安电子科技大学 | Image super-resolution reconstruction method based on residual distillation network |
CN110110634B (en) * | 2019-04-28 | 2023-04-07 | 南通大学 | Pathological image multi-staining separation method based on deep learning |
CN110111257B (en) * | 2019-05-08 | 2023-01-03 | 哈尔滨工程大学 | Super-resolution image reconstruction method based on characteristic channel adaptive weighting |
CN110245754B (en) * | 2019-06-14 | 2021-04-06 | 西安邮电大学 | Knowledge distillation guiding method based on position sensitive graph |
CN112116526B (en) * | 2019-06-19 | 2024-06-11 | 中国石油化工股份有限公司 | Super-resolution method of torch smoke image based on depth convolution neural network |
CN110598727B (en) * | 2019-07-19 | 2023-07-28 | 深圳力维智联技术有限公司 | Model construction method based on transfer learning, image recognition method and device thereof |
CN110796619B (en) * | 2019-10-28 | 2022-08-30 | 腾讯科技(深圳)有限公司 | Image processing model training method and device, electronic equipment and storage medium |
CN111209832B (en) * | 2019-12-31 | 2023-07-25 | 华瑞新智科技(北京)有限公司 | Auxiliary obstacle avoidance training method, equipment and medium for substation inspection robot |
CN111160533B (en) * | 2019-12-31 | 2023-04-18 | 中山大学 | Neural network acceleration method based on cross-resolution knowledge distillation |
CN111275646B (en) * | 2020-01-20 | 2022-04-26 | 南开大学 | Edge-preserving image smoothing method based on deep learning knowledge distillation technology |
US11900260B2 (en) | 2020-03-05 | 2024-02-13 | Huawei Technologies Co., Ltd. | Methods, devices and media providing an integrated teacher-student system |
CN113365107B (en) * | 2020-03-05 | 2024-05-10 | 阿里巴巴集团控股有限公司 | Video processing method, film and television video processing method and device |
CN111402311B (en) * | 2020-03-09 | 2023-04-14 | 福建帝视信息科技有限公司 | Knowledge distillation-based lightweight stereo parallax estimation method |
CN111428191B (en) * | 2020-03-12 | 2023-06-16 | 五邑大学 | Antenna downtilt angle calculation method and device based on knowledge distillation and storage medium |
CN111639744B (en) * | 2020-04-15 | 2023-09-22 | 北京迈格威科技有限公司 | Training method and device for student model and electronic equipment |
CN111626330B (en) * | 2020-04-23 | 2022-07-26 | 南京邮电大学 | Target detection method and system based on multi-scale characteristic diagram reconstruction and knowledge distillation |
CN111598793A (en) * | 2020-04-24 | 2020-08-28 | 云南电网有限责任公司电力科学研究院 | Method and system for defogging image of power transmission line and storage medium |
CN111582101B (en) * | 2020-04-28 | 2021-10-01 | 中国科学院空天信息创新研究院 | Remote sensing image target detection method and system based on lightweight distillation network |
CN111681178B (en) * | 2020-05-22 | 2022-04-26 | 厦门大学 | Knowledge distillation-based image defogging method |
CN111681298A (en) * | 2020-06-08 | 2020-09-18 | 南开大学 | Compressed sensing image reconstruction method based on multi-feature residual error network |
CN111724306B (en) * | 2020-06-19 | 2022-07-08 | 福州大学 | Image reduction method and system based on convolutional neural network |
CN111881920B (en) * | 2020-07-16 | 2024-04-09 | 深圳力维智联技术有限公司 | Network adaptation method of large-resolution image and neural network training device |
CN112037139B (en) * | 2020-08-03 | 2022-05-03 | 哈尔滨工业大学(威海) | Image defogging method based on RBW-cycleGAN network |
CN111967597A (en) * | 2020-08-18 | 2020-11-20 | 上海商汤临港智能科技有限公司 | Neural network training and image classification method, device, storage medium and equipment |
CN112200062B (en) * | 2020-09-30 | 2021-09-28 | 广州云从人工智能技术有限公司 | Target detection method and device based on neural network, machine readable medium and equipment |
CN112200722A (en) * | 2020-10-16 | 2021-01-08 | 鹏城实验室 | Generation method and reconstruction method of image super-resolution reconstruction model and electronic equipment |
CN112348167B (en) * | 2020-10-20 | 2022-10-11 | 华东交通大学 | Knowledge distillation-based ore sorting method and computer-readable storage medium |
CN112734645B (en) * | 2021-01-19 | 2023-11-03 | 青岛大学 | Lightweight image super-resolution reconstruction method based on feature distillation multiplexing |
CN112884650B (en) * | 2021-02-08 | 2022-07-19 | 武汉大学 | Image mixing super-resolution method based on self-adaptive texture distillation |
CN113065635A (en) * | 2021-02-27 | 2021-07-02 | 华为技术有限公司 | Model training method, image enhancement method and device |
CN113096013B (en) * | 2021-03-31 | 2021-11-26 | 南京理工大学 | Blind image super-resolution reconstruction method and system based on imaging modeling and knowledge distillation |
CN113240580B (en) * | 2021-04-09 | 2022-12-27 | 暨南大学 | Lightweight image super-resolution reconstruction method based on multi-dimensional knowledge distillation |
CN113177888A (en) * | 2021-04-27 | 2021-07-27 | 北京有竹居网络技术有限公司 | Super-resolution restoration network model generation method, image super-resolution restoration method and device |
CN113411425B (en) * | 2021-06-21 | 2023-11-07 | 深圳思谋信息科技有限公司 | Video super-resolution model construction and processing method, device, computer equipment and medium |
CN113469977B (en) * | 2021-07-06 | 2024-01-12 | 浙江霖研精密科技有限公司 | Flaw detection device, method and storage medium based on distillation learning mechanism |
CN113592742A (en) * | 2021-08-09 | 2021-11-02 | 天津大学 | Method for removing image moire |
CN113724261A (en) * | 2021-08-11 | 2021-11-30 | 电子科技大学 | Fast image composition method based on convolutional neural network |
CN113807214B (en) * | 2021-08-31 | 2024-01-05 | 中国科学院上海微系统与信息技术研究所 | Small target face recognition method based on deit affiliated network knowledge distillation |
CN113793265A (en) * | 2021-09-14 | 2021-12-14 | 南京理工大学 | Image super-resolution method and system based on depth feature relevance |
CN113837308B (en) * | 2021-09-29 | 2022-08-05 | 北京百度网讯科技有限公司 | Knowledge distillation-based model training method and device and electronic equipment |
CN114359053B (en) * | 2022-01-07 | 2023-06-20 | 中国电信股份有限公司 | Image processing method, device, equipment and storage medium |
CN114898165B (en) * | 2022-06-20 | 2024-08-02 | 哈尔滨工业大学 | Deep learning knowledge distillation method based on model channel cutting |
CN117237190B (en) * | 2023-09-15 | 2024-03-15 | 中国矿业大学 | Lightweight image super-resolution reconstruction system and method for edge mobile equipment |
CN117952830B (en) * | 2024-01-24 | 2024-07-26 | 天津大学 | Three-dimensional image super-resolution reconstruction method based on iterative interaction guidance |
Application Events

- 2018-06-12: Application CN201810603516.3A filed in China; granted as patent CN108830813B (legal status: Active)
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3319039A1 (en) * | 2016-11-07 | 2018-05-09 | UMBO CV Inc. | A method and system for providing high resolution image through super-resolution reconstruction |
CN107247989A (en) * | 2017-06-15 | 2017-10-13 | 北京图森未来科技有限公司 | Neural network training method and device |
CN107358293A (en) * | 2017-06-15 | 2017-11-17 | 北京图森未来科技有限公司 | Neural network training method and device |
CN107784628A (en) * | 2017-10-18 | 2018-03-09 | 南京大学 | Super-resolution implementation method based on reconstruction optimization and deep neural networks |
CN107945146A (en) * | 2017-11-23 | 2018-04-20 | 南京信息工程大学 | Spatio-temporal satellite image fusion method based on deep convolutional neural networks |
Non-Patent Citations (2)
Title |
---|
"Convolutional neural network-based transfer learning and knowledge distillation using multi-subject data in motor imagery BCI";Siavash Sakhavi 等;《2017 8th International IEEE/EMBS Conference on Neural Engineering (NER)》;20170815;588-591 * |
"基于增强监督知识蒸馏的交通标识分类";赵胜伟 等;《中国科技论文》;20171031;第12卷(第20期);2355-2360 * |
Also Published As
Publication number | Publication date |
---|---|
CN108830813A (en) | 2018-11-16 |
Similar Documents
Publication | Title |
---|---|
CN108830813B (en) | Knowledge distillation-based image super-resolution enhancement method | |
CN112927202B (en) | Method and system for detecting Deepfake video with combination of multiple time domains and multiple characteristics | |
CN109509152B (en) | Image super-resolution reconstruction method for generating countermeasure network based on feature fusion | |
CN113240580A (en) | Lightweight image super-resolution reconstruction method based on multi-dimensional knowledge distillation | |
CN109685716B (en) | Image super-resolution reconstruction method for generating countermeasure network based on Gaussian coding feedback | |
CN111898701A (en) | Model training, frame image generation, frame interpolation method, device, equipment and medium | |
CN112734646A (en) | Image super-resolution reconstruction method based on characteristic channel division | |
CN111951164B (en) | Image super-resolution reconstruction network structure and image reconstruction effect analysis method | |
CN114898284B (en) | Crowd counting method based on feature pyramid local difference attention mechanism | |
CN114612714B (en) | Curriculum learning-based reference-free image quality evaluation method | |
CN111833261A (en) | Image super-resolution restoration method for generating countermeasure network based on attention | |
CN109872305A (en) | No-reference stereo image quality evaluation method based on a quality-map generation network | |
CN113379606B (en) | Face super-resolution method based on a pre-trained generative model | |
CN113449691A (en) | Human shape recognition system and method based on non-local attention mechanism | |
CN113628152A (en) | Dim light image enhancement method based on multi-scale feature selective fusion | |
CN117635428A (en) | Super-resolution reconstruction method for lung CT image | |
CN114596233A (en) | Attention-guiding and multi-scale feature fusion-based low-illumination image enhancement method | |
CN112017116A (en) | Image super-resolution reconstruction network based on asymmetric convolution and construction method thereof | |
CN115713462A (en) | Super-resolution model training method, image recognition method, device and equipment | |
CN115115514A (en) | Image super-resolution reconstruction method based on high-frequency information feature fusion | |
CN116403063A (en) | No-reference screen content image quality assessment method based on multi-region feature fusion | |
CN117351542A (en) | Facial expression recognition method and system | |
CN113850721A (en) | Single image super-resolution reconstruction method, device and equipment and readable storage medium | |
CN108596831B (en) | Super-resolution reconstruction method based on AdaBoost example regression | |
CN111401453A (en) | Mosaic image classification and identification method and system |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |