CN113393377B - Single-frame image super-resolution method based on video coding - Google Patents

Single-frame image super-resolution method based on video coding

Info

Publication number
CN113393377B
Authority
CN
China
Prior art keywords
network
sub-blocks
super-resolution
Prior art date
Legal status
Active
Application number
CN202110541900.7A
Other languages
Chinese (zh)
Other versions
CN113393377A (en)
Inventor
吴庆波
李鹏飞
李宏亮
孟凡满
许林峰
潘力立
Current Assignee
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN202110541900.7A
Publication of CN113393377A
Application granted
Publication of CN113393377B
Legal status: Active
Anticipated expiration

Classifications

    • G06T3/4053 Super resolution, i.e. output image resolution higher than sensor resolution
    • G06N3/045 Combinations of networks
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G06T7/40 Analysis of texture
    • G06T9/002 Image coding using neural networks
    • G06T2207/10016 Video; Image sequence
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]

Abstract

The invention discloses a single-frame image super-resolution method based on video coding. Prior information obtained directly from video coding is used to process the sub-blocks in different parts of an image in a targeted manner: sub-blocks with more complex textures are processed by a more complex network, and an adaptive convolution module is designed to process sub-blocks with different coding modes, so that the network is more targeted, recovers different detail information for different textures, and improves the accuracy of the super-resolution result. The invention shares the parameters of the few-channel network into the many-channel network, i.e. the super-resolution of the whole picture is realized by different layers of one backbone network; the relatively simple, shallow, few-channel network processes the relatively large sub-blocks with smoother textures, which reduces the time required by the super-resolution process.

Description

Single-frame image super-resolution method based on video coding
Technical Field
The invention relates to the technical field of image processing, in particular to a single-frame image super-resolution method based on video coding.
Background
Image super-resolution is the process of converting an input low-resolution visual image into a high-resolution visual image. One important concern of recent super-resolution work has been to propose networks that accelerate the inference process. One branch uses fewer parameters to realize efficient super-resolution at higher speed. For example, the early FSRCNN performs feature extraction directly on the input image and then passes the feature map through an up-sampling network to construct the super-resolution image. Similarly, the recent work CARN designs a residual network with group convolution to process input pictures quickly. Another branch increases the complexity of the network model and the number of model branches and trains them separately for different kinds of input, such as ClassSR.
ClassSR trains and infers on low-resolution input regions of different complexity with neural networks of different complexity. Because most areas of an image only need to pass through a network with a relatively small amount of computation, this improves the running speed of the inference stage to a certain extent. Specifically, the method divides a picture into small blocks of 32 × 32 pixels and, with a pre-trained classification network, sorts the small blocks into three classes according to their texture complexity: simple, medium and difficult. The different classes of blocks are fed to backbone networks with different numbers of channels.
In a traditional super-resolution network, the feature map is extracted directly from the whole picture, so the network cannot learn the distinct characteristics of each region well; the same convolution kernels are applied to all regions, and the texture details of the recovered image therefore deviate from those of the real image. Moreover, because the texture complexity differs across image regions, applying complex processing to low-detail regions only adds unnecessary computation. On the other hand, the classify-then-process design proposed by ClassSR uses three networks whose parameters are not shared, which costs considerable training time and computing power and increases the complexity of the network. Besides these disadvantages, most current super-resolution methods ignore the help that the prior information originally carried with the image can provide to the super-resolution process. A super-resolution method is therefore needed that has a small amount of network computation and improves the consistency between the recovered texture details and the real image.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a single-frame image super-resolution method based on video coding, which addresses the problems mentioned in the background art.
In order to achieve the above object, the invention provides the following technical solution: a single-frame image super-resolution method based on video coding, comprising the following steps:
S1, using the prior information of each frame of the video, the low-resolution image I^LR in the video is divided into sub-blocks of 4×4, 8×8, 16×16 and 32×32 pixels according to the H.265 video coding information; for the 4×4 and 8×8 pixel sub-blocks I^LR_{4,8}, the corresponding coding prediction mode M_pre can be obtained, and a corresponding Gaussian distribution model G_m is generated according to the coding mode (an illustrative sub-block grouping sketch is given after step S8);
S2, the 16×16 and 32×32 pixel sub-blocks I^LR_{16,32} are used to train a channel-adaptive backbone network CAB; each convolution block in the CAB is divided into two channel groups, conv1 and conv2, and in each iteration forward and backward propagation uses only the parameters of conv1 while the parameters of conv2 are not used; the perceptual loss L_per and the MSE loss L_mse are minimized to obtain the final super-resolution output I^SR_{16,32};
S3, the 4×4 and 8×8 pixel sub-blocks I^LR_{4,8} are used to train the channel-adaptive backbone network CAB, and forward propagation uses the parameters of both conv1 and conv2; since conv1 has already learned a feature extraction mode for smooth information during the training on I^LR_{16,32}, the parameters of conv1 are fixed during back propagation and only the parameters of conv2 are updated; the perceptual loss L_per and the MSE loss L_mse are minimized to obtain the final super-resolution output I^SR_{4,8};
S4, on the basis of S2 and S3, the whole network is trained; the parameters of the channel-adaptive backbone network CAB are fixed during training, the perceptual loss L_per and the MSE loss L_mse are minimized and the remaining network parameters are updated; the feature extraction module of the branch corresponding to I^LR_{4,8} is trained first and preliminarily extracts the features of I^LR_{4,8};
S5, for the 4×4 and 8×8 pixel sub-blocks I^LR_{4,8}, when the corresponding branch network is trained, the sub-blocks are input to the network in the relative order of their index i (i = 0, 1, 2, …, 15); each sub-block is denoted I^LR_i, and the adjacent sub-blocks of the same size that form a group of four with it are marked by their indices, where i denotes the sub-block number;
S6, taking (0,0) as the centre, the Gaussian model generated in step S1 is sampled over the height and width of the convolution kernel to obtain a matrix W_m with the same width and height as the kernel; W_m is point-multiplied (element-wise) with the convolution layer Conv in the adaptive convolution module ACB to weight the kernel, i.e. Conv' = Conv ⊙ W_m; the weighted kernel Conv' is then used in an ordinary convolution with the input sub-blocks I^LR_{4,8}, and after the ACB module a feature map that focuses more on the image texture features is obtained;
S7, after every four adjacent sub-blocks pass through the adaptive texture processing module, they are spliced according to their positions in the original image and passed to the backbone network, giving a feature map whose width and height are twice those of a single sub-block, i.e. the four sub-block feature maps are arranged as a 2×2 block matrix according to their positions in the original image;
S8, the network is further fine-tuned by minimizing L_total, completing the super-resolution process of the picture.
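As a non-limiting illustration of the grouping performed in step S1, the following Python sketch assumes that an H.265 decoder has already exported, for each coding unit, its position, size and intra prediction mode; this export format (the cu_info tuples and field names) is an assumption made for illustration and is not part of a standard decoder API.

```python
# Illustrative sketch only: cu_info is a hypothetical per-coding-unit export
# of (x, y, size, mode) from an H.265 decoder.
import numpy as np

def group_sub_blocks(lr_image: np.ndarray, cu_info):
    """Split I_LR into the two branch inputs used in steps S1-S3.

    cu_info: iterable of (x, y, size, mode), where size is 4, 8, 16 or 32 and
    mode is the intra prediction mode of that coding unit.
    """
    small, large = [], []                 # 4x4/8x8 blocks vs. 16x16/32x32 blocks
    for x, y, size, mode in cu_info:
        block = lr_image[y:y + size, x:x + size]
        if size in (4, 8):
            small.append((block, mode))   # keep M_pre so G_m can be built per sub-block
        else:
            large.append((block, None))   # smooth blocks: the mode is not needed
    return small, large
```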
Preferably, the coding prediction mode M_pre of step S1 includes a DC prediction mode, a planar prediction mode and angular prediction modes.
Preferably, the coding prediction mode M_pre controls the covariance matrix C of G_m:
G_m = Gauss(C, θ | M_pre)
The covariance matrix is adjusted so that the maximum of the generated Gaussian model coincides with the texture angle of the mode, adaptively focusing on the image texture features. When M_pre is the DC mode or the planar mode, a Gaussian model with a unit covariance matrix is set; for a sub-block whose M_pre is an angular mode with angle θ, an initial covariance matrix C is set and rotated by the angle θ, which is expressed as:
G_m = A(θ) C A(θ)^T
where A(θ) is the two-dimensional rotation matrix
A(θ) = [[cos θ, −sin θ], [sin θ, cos θ]]
and A(θ)^T denotes the transpose of the matrix A(θ).
Preferably, the fine-tuning in step S8 is specifically as follows:
The MSE loss L_mse is used to minimize the difference between the super-resolution result of the input low-resolution image and the true high-resolution image:
L_mse = (1/N) Σ ||I^SR − I^HR||²
where N denotes the number of pixels and I^SR denotes the outputs of the different branches, each computed against the true high-resolution image I^HR of the corresponding branch. A perceptual loss term is added to the loss function so that the L2 distance between the feature values of the generated picture and of the target picture after passing through a CNN network is as small as possible, making the generated picture semantically more similar to the target picture:
L_per = ||f(I^SR) − f(I^HR)||_2
where f denotes the CNN network, specifically a VGG-16 network.
A larger loss weight ω2 is used for the 4×4 and 8×8 sub-blocks, and a smaller weight ω1 is used for the larger, smoother 16×16 and 32×32 sub-blocks.
The loss function L_total is expressed as:
L_total = ω1 (L_mse^{16,32} + L_per^{16,32}) + ω2 (L_mse^{4,8} + L_per^{4,8})
where ω1 is 0.5 and ω2 is 1.
The invention has the beneficial effects that:
1) The invention uses prior information that can be obtained directly from video coding to process different sub-blocks of an image in a targeted manner; sub-blocks with more complex textures are processed by a more complex network, and an adaptive convolution module is designed to process sub-blocks with different coding modes in a targeted manner, so that the network is more targeted, recovers different detail information for different textures, and improves the accuracy of the super-resolution result.
2) The invention shares the parameters of the few-channel network into the many-channel network, i.e. the super-resolution of the whole picture is realized by different layers of one backbone network; the relatively simple, shallow, few-channel network processes the relatively large sub-blocks with smoother textures, which reduces the time required by the super-resolution process.
Drawings
FIG. 1 is a schematic diagram of a network structure according to an embodiment of the present invention;
FIG. 2 is a block diagram of a network adaptive texture processing module according to an embodiment of the present invention;
fig. 3 is a schematic diagram of an input sequence for training 4 × 4 and 8 × 8 pixel sub-blocks according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to figs. 1-3, the present invention provides the following technical solution: a single-frame image super-resolution method based on video coding, whose network structure is shown in fig. 1, comprising the following steps:
S1, using the prior information of each frame of the video, the low-resolution image I^LR in the video is divided into sub-blocks of 4×4, 8×8, 16×16 and 32×32 pixels according to the H.265 video coding information; for the 4×4 and 8×8 pixel sub-blocks I^LR_{4,8}, the corresponding coding prediction mode M_pre can be obtained; the coding prediction mode M_pre includes a DC prediction mode, a planar prediction mode and angular prediction modes, and a corresponding Gaussian distribution model G_m is generated according to the coding mode.
The coding prediction mode M_pre controls the covariance matrix C of G_m:
G_m = Gauss(C, θ | M_pre)
The covariance matrix is adjusted so that the maximum of the generated Gaussian model coincides with the texture angle of the mode, so that the model adaptively focuses on the image texture features. When M_pre is the DC mode or the planar mode, a Gaussian model with a unit covariance matrix is set; for a sub-block whose M_pre is an angular mode with angle θ, an initial covariance matrix C is set and rotated by the angle θ, which is expressed as:
G_m = A(θ) C A(θ)^T
where A(θ) is the two-dimensional rotation matrix
A(θ) = [[cos θ, −sin θ], [sin θ, cos θ]]
and A(θ)^T denotes the transpose of the matrix A(θ).
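A minimal numerical sketch of this Gaussian model follows, assuming a diagonal initial covariance for the angular modes; the σ values, function names and the 3×3 kernel size are illustrative assumptions rather than values specified by this embodiment.

```python
# Sketch of G_m and of its sampling on a kernel-sized grid (used later in step S6).
import numpy as np

def gaussian_model(mode: str, theta: float = 0.0, sigmas=(2.0, 0.5)) -> np.ndarray:
    """Return the 2x2 covariance of G_m for a coding prediction mode M_pre."""
    if mode in ("DC", "planar"):
        return np.eye(2)                      # unit covariance matrix
    # angular mode: rotate an anisotropic covariance by theta
    c, s = np.cos(theta), np.sin(theta)
    A = np.array([[c, -s], [s, c]])           # two-dimensional rotation matrix A(theta)
    C = np.diag(np.square(sigmas))            # initial covariance matrix C (assumed diagonal)
    return A @ C @ A.T                        # G_m = A(theta) C A(theta)^T

def sample_gaussian(cov: np.ndarray, k: int = 3) -> np.ndarray:
    """Sample the Gaussian density on a k x k grid centred at (0, 0)."""
    coords = np.arange(k) - (k - 1) / 2.0     # e.g. [-1, 0, 1] for k = 3
    xx, yy = np.meshgrid(coords, coords)
    pts = np.stack([xx, yy], axis=-1)         # (k, k, 2) grid of sample positions
    inv = np.linalg.inv(cov)
    expo = -0.5 * np.einsum("...i,ij,...j->...", pts, inv, pts)
    w = np.exp(expo)
    return w / w.max()                        # peak-normalised weighting matrix W_m
```

The matrix returned by sample_gaussian plays the role of the kernel-sized weighting matrix used by the adaptive convolution module in step S6.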
S2, the 16×16 and 32×32 pixel sub-blocks I^LR_{16,32} are used to train the channel-adaptive backbone network (CAB) in fig. 1. In order to process different kinds of input efficiently, each convolution block in the CAB is divided into two channel groups, conv1 and conv2; in each iteration, forward and backward propagation uses only the parameters of conv1 and the parameters of conv2 are not used; the perceptual loss L_per and the MSE loss L_mse are minimized to obtain the final super-resolution output I^SR_{16,32}.
S3, the 4×4 and 8×8 pixel sub-blocks I^LR_{4,8} are used to train the channel-adaptive backbone network CAB in fig. 1. Since complex texture information has to be processed with more network parameters, the parameters of both conv1 and conv2 are used for forward propagation; conv1 has already learned a feature extraction mode for smooth information during the training on I^LR_{16,32}, so the parameters of conv1 are fixed during back propagation and only the parameters of conv2 are updated; the perceptual loss L_per and the MSE loss L_mse are again minimized to obtain the final super-resolution output I^SR_{4,8}.
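The following PyTorch sketch shows one possible realisation of a single channel-adaptive convolution block, assuming the two channel groups are implemented as parallel convolutions whose outputs are concatenated; the channel counts and module names are illustrative assumptions, and the subsequent layers of the CAB would have to be channel-adaptive in the same way.

```python
# Sketch of one channel-adaptive convolution block (conv1 / conv2 split).
import torch
import torch.nn as nn

class ChannelAdaptiveConv(nn.Module):
    def __init__(self, in_ch: int, ch1: int = 16, ch2: int = 48):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, ch1, 3, padding=1)   # always-used "shallow" channels
        self.conv2 = nn.Conv2d(in_ch, ch2, 3, padding=1)    # extra channels for 4x4/8x8 blocks
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor, use_conv2: bool) -> torch.Tensor:
        feats = [self.conv1(x)]
        if use_conv2:                        # detailed (4x4/8x8) sub-blocks
            feats.append(self.conv2(x))
        return self.act(torch.cat(feats, dim=1))

# Step S2: smooth blocks -> forward/backward through conv1 only (use_conv2=False).
# Step S3: detailed blocks -> use_conv2=True, with conv1 frozen, e.g.:
#   for p in block.conv1.parameters():
#       p.requires_grad = False
```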
S4, on the basis of S2 and S3, the whole network is trained. The parameters of the channel-adaptive backbone network CAB are fixed during training, only the perceptual loss L_per and the MSE loss L_mse are minimized, and the remaining network parameters are updated; the feature extraction module of the branch corresponding to I^LR_{4,8} is trained first and preliminarily extracts the features of I^LR_{4,8}.
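A minimal sketch of this training schedule, assuming `cab` is the backbone trained in S2/S3 and `branch_fe` is the feature-extraction module of the 4×4/8×8 branch; both names and the learning rate are illustrative assumptions.

```python
# Step S4: freeze the CAB, optimise only the remaining (branch) parameters.
import torch

def build_s4_optimizer(cab: torch.nn.Module, branch_fe: torch.nn.Module, lr: float = 1e-4):
    for p in cab.parameters():          # CAB parameters stay fixed in S4
        p.requires_grad = False
    # only the branch feature-extraction parameters are updated with L_per + L_mse
    return torch.optim.Adam(branch_fe.parameters(), lr=lr)
```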
S5, for the 4×4 and 8×8 pixel sub-blocks I^LR_{4,8}, when the corresponding branch network is trained, the sub-blocks are input to the network in the relative order of their index i (i = 0, 1, 2, …, 15); each sub-block is denoted I^LR_i, and the adjacent sub-blocks of the same size that form a group of four with it are marked by their indices, where i denotes the sub-block number (as shown in fig. 3, when the sub-block with i = 5 is input, its adjacent sub-blocks are those with i = 6, 7 and 8).
S6, taking (0,0) as the centre, the Gaussian model generated in step S1 is sampled over the height and width of the convolution kernel to obtain a matrix W_m with the same width and height as the kernel. W_m is point-multiplied (element-wise) with the convolution layer Conv in the adaptive convolution module ACB (shown in fig. 2) to weight the kernel, i.e. Conv' = Conv ⊙ W_m. The weighted kernel Conv' is then convolved with the input sub-blocks I^LR_{4,8} in an ordinary convolution, and after the ACB module a feature map that focuses more on the image texture features is obtained.
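A minimal PyTorch sketch of such an adaptive convolution follows; the parameter names, kernel size and weight initialisation are illustrative assumptions.

```python
# Sketch of the adaptive convolution module ACB: the sampled Gaussian matrix is
# broadcast over the kernel weights before an ordinary convolution is applied.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ACB(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.1)
        self.bias = nn.Parameter(torch.zeros(out_ch))

    def forward(self, x: torch.Tensor, gauss: torch.Tensor) -> torch.Tensor:
        # gauss: (k, k) matrix sampled from G_m; the point (element-wise) multiplication
        # weights the kernel towards the texture direction implied by the coding mode.
        w = self.weight * gauss.view(1, 1, *gauss.shape)
        return F.conv2d(x, w, self.bias, padding=self.weight.shape[-1] // 2)
```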
S7, after every four adjacent sub-blocks pass through the adaptive texture processing module, they are spliced according to their positions in the original image and passed to the backbone network, giving a feature map whose width and height are twice those of a single sub-block, i.e. the four sub-block feature maps are arranged as a 2×2 block matrix according to their positions in the original image.
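A minimal sketch of this splice, assuming the four feature maps are laid out as top-left, top-right, bottom-left and bottom-right in the original image; the argument names are illustrative assumptions.

```python
# Tile four sub-block feature maps into one map with twice the height and width.
import torch

def splice_2x2(f_tl: torch.Tensor, f_tr: torch.Tensor,
               f_bl: torch.Tensor, f_br: torch.Tensor) -> torch.Tensor:
    top = torch.cat([f_tl, f_tr], dim=-1)      # concatenate along width
    bottom = torch.cat([f_bl, f_br], dim=-1)
    return torch.cat([top, bottom], dim=-2)    # concatenate along height
```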
S8, in order to focus more on the detail information, the network is further fine-tuned by minimizing L_total, completing the super-resolution process of the picture.
In the training process described above, the MSE loss L_mse is used to minimize the difference between the super-resolution result of the input low-resolution image and the true high-resolution image:
L_mse = (1/N) Σ ||I^SR − I^HR||²
where N denotes the number of pixels and I^SR denotes the outputs of the different branches, each computed against the true high-resolution image I^HR of the corresponding branch. Because the pixel-by-pixel MSE loss differs from real visual perception, a perceptual loss term is added to the loss function so that the distance between the feature values of the generated picture and of the target picture after passing through a CNN network is as small as possible, which makes the generated picture semantically more similar to the target picture (relative to a pixel-level loss function):
L_per = ||f(I^SR) − f(I^HR)||_2
Here the CNN network denoted by f is chosen to be VGG-16.
Furthermore, since the super-resolution quality of an image is most apparent in its details, more attention is paid to the reconstruction of the texture-complex parts, i.e. the 4×4 and 8×8 sub-blocks; these sub-blocks are therefore given the larger loss weight ω2, while the larger, smoother 16×16 and 32×32 sub-blocks use the smaller weight ω1.
The loss function L_total is therefore expressed as:
L_total = ω1 (L_mse^{16,32} + L_per^{16,32}) + ω2 (L_mse^{4,8} + L_per^{4,8})
where ω1 is 0.5 and ω2 is 1.
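The following sketch computes this loss with PyTorch, assuming torchvision's pretrained VGG-16 features (cut at relu3_3) for f and 3-channel inputs; the layer cut-off and the exact feature distance are illustrative assumptions, since the embodiment only specifies that f is a VGG-16 network.

```python
# Sketch of L_total = w1 * (L_mse + L_per) for smooth blocks + w2 * (...) for detailed blocks.
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

_vgg = vgg16(pretrained=True).features[:16].eval()   # up to relu3_3, weights frozen
for p in _vgg.parameters():
    p.requires_grad = False

def branch_loss(sr: torch.Tensor, hr: torch.Tensor) -> torch.Tensor:
    l_mse = F.mse_loss(sr, hr)                  # pixel-wise term
    l_per = F.mse_loss(_vgg(sr), _vgg(hr))      # squared L2 distance of VGG-16 features
    return l_mse + l_per

def total_loss(sr_large, hr_large, sr_small, hr_small,
               w1: float = 0.5, w2: float = 1.0) -> torch.Tensor:
    # w1 weights the smooth 16x16/32x32 branch, w2 the detailed 4x4/8x8 branch
    return w1 * branch_loss(sr_large, hr_large) + w2 * branch_loss(sr_small, hr_small)
```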
The invention uses prior information that can be obtained directly from video coding to process different sub-blocks of an image in a targeted manner: sub-blocks with more complex textures are processed by a more complex network, and an adaptive convolution module is designed to process sub-blocks with different coding modes in a targeted manner, so that the network is more targeted, recovers different detail information for different textures, and improves the accuracy of the super-resolution result. The invention shares the parameters of the few-channel network into the many-channel network, i.e. the super-resolution of the whole picture is realized by different layers of one backbone network; the relatively simple, shallow, few-channel network processes the relatively large sub-blocks with smoother textures, which reduces the time required by the super-resolution process.
Although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that the described embodiments may still be modified, or some of their features may be replaced by equivalents, without departing from the spirit and scope of the invention.

Claims (4)

1. A single-frame image super-resolution method based on video coding, characterized by comprising the following steps:
S1, using the prior information of each frame of the video, dividing the low-resolution image I^LR in the video into sub-blocks of 4×4, 8×8, 16×16 and 32×32 pixels according to the H.265 video coding information; for the 4×4 and 8×8 pixel sub-blocks I^LR_{4,8}, obtaining the corresponding coding prediction mode M_pre and generating a corresponding Gaussian distribution model G_m according to the coding mode;
S2, using the 16×16 and 32×32 pixel sub-blocks I^LR_{16,32} to train a channel-adaptive backbone network CAB, each convolution block in the CAB being divided into two channel groups conv1 and conv2; in each iteration, performing forward and backward propagation using only the parameters of conv1 and not the parameters of conv2, and minimizing the perceptual loss L_per and the MSE loss L_mse to obtain the final super-resolution output I^SR_{16,32};
S3, using the 4×4 and 8×8 pixel sub-blocks I^LR_{4,8} to train the channel-adaptive backbone network CAB, performing forward propagation using the parameters of conv1 and conv2, wherein conv1 has already learned a feature extraction mode for smooth information during the training on I^LR_{16,32}; fixing the parameters of conv1 during back propagation and updating only the parameters of conv2, and minimizing the perceptual loss L_per and the MSE loss L_mse to obtain the final super-resolution output I^SR_{4,8};
S4, on the basis of S2 and S3, training the whole network, fixing the parameters of the channel-adaptive backbone network CAB during training, minimizing the perceptual loss L_per and the MSE loss L_mse to update the remaining network parameters, and first training the feature extraction module of the branch corresponding to I^LR_{4,8}, which preliminarily extracts the features of I^LR_{4,8};
S5, for the 4×4 and 8×8 pixel sub-blocks I^LR_{4,8}, when the corresponding branch network is trained, inputting the sub-blocks into the network in the relative order of their index i, i = 0, 1, 2, …, 15, each sub-block being denoted I^LR_i and the adjacent sub-blocks of the same size that form a group of four with it being marked by their indices, wherein i denotes the sub-block number;
S6, taking (0,0) as the centre, sampling the Gaussian model generated in step S1 over the height and width of the convolution kernel to obtain a matrix with the same width and height as the kernel, performing a point multiplication with the convolution layer Conv in the adaptive convolution module ACB to weight the kernel, then performing an ordinary convolution of the weighted convolution kernel with the input sub-blocks I^LR_{4,8}, and obtaining, after the ACB module, a feature map that focuses more on the image texture features;
S7, after every four adjacent sub-blocks pass through the adaptive texture processing module, splicing them according to their positions in the original image and passing them to the backbone network to obtain a feature map whose width and height are twice those of a single sub-block, the four sub-block feature maps being arranged as a 2×2 block matrix;
S8, further fine-tuning the network by minimizing L_total to complete the super-resolution process of the picture.
2. The single-frame image super-resolution method based on video coding according to claim 1, wherein the coding prediction mode M_pre of step S1 includes a DC prediction mode, a planar prediction mode and angular prediction modes.
3. The single-frame image super-resolution method based on video coding according to claim 1, wherein the coding prediction mode M_pre controls the covariance matrix C of G_m:
G_m = Gauss(C, θ | M_pre)
the covariance matrix is adjusted so that the maximum of the generated Gaussian model coincides with the texture angle of the mode, adaptively focusing on the image texture features; when M_pre is the DC mode or the planar mode, a Gaussian model with a unit covariance matrix is set; for a sub-block whose M_pre is an angular mode with angle θ, an initial covariance matrix C is set and rotated by the angle θ, which is expressed as:
G_m = A(θ) C A(θ)^T
wherein A(θ) is the two-dimensional rotation matrix
A(θ) = [[cos θ, −sin θ], [sin θ, cos θ]]
and A(θ)^T denotes the transpose of the matrix A(θ).
4. The single-frame image super-resolution method based on video coding according to claim 1, wherein the fine-tuning in step S8 is specifically as follows:
the MSE loss L_mse is used to minimize the difference between the super-resolution result of the input low-resolution image and the true high-resolution image:
L_mse = (1/N) Σ ||I^SR − I^HR||²
wherein N denotes the number of pixels and I^SR denotes the outputs of the different branches, each computed against the true high-resolution image I^HR of the corresponding branch; a perceptual loss term is added to the loss function so that the L2 distance between the feature values of the generated picture and of the target picture after passing through a CNN network is as small as possible, making the generated picture semantically more similar to the target picture:
L_per = ||f(I^SR) − f(I^HR)||_2
wherein f denotes the CNN network, the CNN network being a VGG-16 network;
a larger loss weight ω2 is used for the 4×4 and 8×8 sub-blocks, and a smaller weight ω1 is used for the larger, smoother 16×16 and 32×32 sub-blocks;
the loss function L_total is expressed as:
L_total = ω1 (L_mse^{16,32} + L_per^{16,32}) + ω2 (L_mse^{4,8} + L_per^{4,8})
wherein ω1 is 0.5 and ω2 is 1.
CN202110541900.7A (filed 2021-05-18, priority 2021-05-18): Single-frame image super-resolution method based on video coding; Active; granted as CN113393377B (en)

Priority Applications (1)

Application Number: CN202110541900.7A; Priority Date: 2021-05-18; Filing Date: 2021-05-18; Title: Single-frame image super-resolution method based on video coding

Applications Claiming Priority (1)

Application Number: CN202110541900.7A; Priority Date: 2021-05-18; Filing Date: 2021-05-18; Title: Single-frame image super-resolution method based on video coding

Publications (2)

Publication Number Publication Date
CN113393377A CN113393377A (en) 2021-09-14
CN113393377B (en) 2022-02-01

Family

ID=77617993

Family Applications (1)

Application Number: CN202110541900.7A (Active, granted as CN113393377B); Title: Single-frame image super-resolution method based on video coding

Country Status (1)

Country Link
CN (1) CN113393377B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115115512B (en) * 2022-06-13 2023-10-03 荣耀终端有限公司 Training method and device for image superdivision network

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102835105A (en) * 2010-02-19 2012-12-19 斯凯普公司 Data compression for video
CN110956671A (en) * 2019-12-12 2020-04-03 电子科技大学 Image compression method based on multi-scale feature coding
CN112449140A (en) * 2019-08-29 2021-03-05 华为技术有限公司 Video super-resolution processing method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110969577B (en) * 2019-11-29 2022-03-11 北京交通大学 Video super-resolution reconstruction method based on deep double attention network

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102835105A (en) * 2010-02-19 2012-12-19 斯凯普公司 Data compression for video
CN112449140A (en) * 2019-08-29 2021-03-05 华为技术有限公司 Video super-resolution processing method and device
CN110956671A (en) * 2019-12-12 2020-04-03 电子科技大学 Image compression method based on multi-scale feature coding

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
A2RMNet: Adaptively Aspect Ratio Multi-Scale Network for Object Detection in Remote Sensing Images; Heqian Qiu et al.; Remote Sensing; 2019-07-04; pp. 2-23 *
Region Adaptive Two-Shot Network for Single Image Dehazing; Hui Li et al.; IEEE Xplore; 2020-06-19; pp. 1-6 *
Sparse representation moving target tracking based on compressed features; Zhang Hongmei et al.; Journal of Zhengzhou University (Engineering Science); 2016-06-03, No. 03; pp. 24-29 *
Fast coding unit partition algorithm for intra prediction in high efficiency video coding; Qi Meibin et al.; Journal of Electronics & Information Technology; 2014-07-31; Vol. 36, No. 7; pp. 1699-1704 *

Also Published As

Publication number Publication date
CN113393377A (en) 2021-09-14

Similar Documents

Publication Publication Date Title
CN110363716B (en) High-quality reconstruction method for generating confrontation network composite degraded image based on conditions
CN112435282B (en) Real-time binocular stereo matching method based on self-adaptive candidate parallax prediction network
CN110610526B (en) Method for segmenting monocular image and rendering depth of field based on WNET
CN109949221B (en) Image processing method and electronic equipment
CN110569851A (en) real-time semantic segmentation method for gated multi-layer fusion
CN113989129A (en) Image restoration method based on gating and context attention mechanism
CN116309648A (en) Medical image segmentation model construction method based on multi-attention fusion
CN113255813A (en) Multi-style image generation method based on feature fusion
CN112184582B (en) Attention mechanism-based image completion method and device
CN112288630A (en) Super-resolution image reconstruction method and system based on improved wide-depth neural network
CN116958534A (en) Image processing method, training method of image processing model and related device
CN113344869A (en) Driving environment real-time stereo matching method and device based on candidate parallax
CN113393377B (en) Single-frame image super-resolution method based on video coding
CN116580184A (en) YOLOv 7-based lightweight model
CN114841859A (en) Single-image super-resolution reconstruction method based on lightweight neural network and Transformer
Tang et al. AutoEnhancer: Transformer on U-Net architecture search for underwater image enhancement
CN110110775A (en) A kind of matching cost calculation method based on hyper linking network
CN112200752B (en) Multi-frame image deblurring system and method based on ER network
CN113554032A (en) Remote sensing image segmentation method based on multi-path parallel network of high perception
CN117237190A (en) Lightweight image super-resolution reconstruction system and method for edge mobile equipment
CN116596822A (en) Pixel-level real-time multispectral image fusion method based on self-adaptive weight and target perception
CN116485892A (en) Six-degree-of-freedom pose estimation method for weak texture object
CN114627293A (en) Image matting method based on multi-task learning
CN114638870A (en) Indoor scene monocular image depth estimation method based on deep learning
CN113436094A (en) Gray level image automatic coloring method based on multi-view attention mechanism

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant