CN112019861A

CN112019861A - Video compression method and device based on keyframe guidance super-resolution

Info

Publication number: CN112019861A
Application number: CN202010698136.XA
Authority: CN
Inventors: 鲁继文; 周杰; 马程; 饶永铭
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2020-07-20
Filing date: 2020-07-20
Publication date: 2020-12-01
Anticipated expiration: 2040-07-20
Also published as: CN112019861B

Abstract

The invention discloses a video compression method and a device for guiding super-resolution based on a key frame, wherein the method comprises the following steps: inputting an input video into a key frame selection network in a frame sequence form to obtain a high-resolution key frame of the input video; down-sampling an input video by a frame sequence to obtain a low-resolution frame sequence of the input video; and inputting the high-resolution key frame and the low-resolution frame sequence into a generator to generate the super-resolution video. The method compresses a high-definition input video after downsampling, and selects a key frame from the compressed video to be used as a video super-resolution guide after the video is decompressed, so that high-level video compression is achieved, and a high-quality video can be restored from the compressed video.

Description

Video compression method and device based on keyframe guidance super-resolution

Technical Field

The invention relates to the technical field of image processing, in particular to a video compression method and device based on keyframe guidance super-resolution.

Background

In recent years, video processing techniques have been developed with great success, such as two fundamental problems in computer vision, video compression techniques and video super-resolution techniques. The video compression can improve the storage efficiency in a personal computer and enable online video browsing to be possible, and the video super-resolution technology has important value in applications such as satellite images, monitoring and high-definition televisions.

Recently many industry standards for video compression have been recognized worldwide, such as MPEG-4, H.264/AVC, and HEVC, among others. But these methods all require a trade-off between reconstruction loss and compression rate and at large compression rates the video quality is greatly degraded. It is known that downsampling a video before encoding and then upsampling the video after decoding improves the performance of the compression method at large compression rates, but does not achieve good decompressed video without a good upsampling method. Therefore, for the video compression problem, it is a difficult problem to maintain the quality of the video while reducing the storage space on a large scale. Meanwhile, with the development of deep neural networks, single-picture super-resolution [4,5,6] has been a good breakthrough in the past years, and based on this, many video super-resolution methods [7,8,9] have been proposed in recent years, wherein the reconstruction of each frame can be regarded as a combination of multiple single-picture super-resolution problems. In addition, the motion information between frames is also explored by many methods to find the temporal relationship between pixel pairs or block pairs between frames. However, both problems, whether single-picture super-resolution or video super-resolution, are highly ill-conditioned, since the super-resolution problem requires the recovery of multiple pixels from one pixel as the inverse of the down-sampling process, providing unknown detailed information for the input low-resolution picture, which is an unsolvable problem. The performance of the super-resolution method is therefore heavily dependent on the data distribution, which is also an important bottleneck for the super-resolution problem.

Disclosure of Invention

The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.

Therefore, an object of the present invention is to provide a method for compressing a super-resolution video based on keyframe guidance, which can achieve high video compression and recover a high-quality video therefrom.

Another object of the present invention is to provide a video compression apparatus for guiding super-resolution based on key frames.

In order to achieve the above object, an embodiment of the present invention provides a method for compressing a video based on a keyframe-guided super-resolution, including:

inputting an input video into a key frame selection network in a frame sequence form to obtain a high-resolution key frame of the input video;

down-sampling the input video by a frame sequence to obtain a low-resolution frame sequence of the input video;

and inputting the high-resolution key frame and the low-resolution frame sequence into a generator to generate the super-resolution video.

In order to achieve the above object, another embodiment of the present invention provides a video compression apparatus for guiding super-resolution based on key frames, including:

the selection module is used for inputting an input video to a key frame selection network in a frame sequence form to obtain a high-resolution key frame of the input video;

the compression module is used for down-sampling the input video by a frame sequence to obtain a low-resolution frame sequence of the input video;

and the decompression module is used for inputting the high-resolution key frame and the low-resolution frame sequence into a generator to generate the super-resolution video.

The video compression method and device based on the keyframe-guided super-resolution of the embodiment of the invention have the following advantages: the high-definition input video is compressed after being downsampled, and meanwhile, the key frame is selected from the compressed video and used as a video super-resolution guide after the compressed video is decompressed, so that high-level video compression is achieved, and the high-quality video can be restored from the compressed video.

Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.

Drawings

The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:

FIG. 1 is a flowchart of a method for keyframe-based guided super-resolution video compression according to one embodiment of the present invention;

FIG. 2 is a flowchart illustrating a method for video compression based on keyframe-guided super-resolution according to an embodiment of the present invention;

FIG. 3 is a flow diagram and framework diagram of a key frame selector according to one embodiment of the invention;

FIG. 4 is a network framework diagram of a super-resolution video generator utilizing key-frame guidance according to one embodiment of the present invention;

FIG. 5 is a network structure diagram of a mutual attention layer according to one embodiment of the present invention;

fig. 6 is a schematic structural diagram of a video compression apparatus for guiding super-resolution based on key frames according to an embodiment of the present invention.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.

The following describes a method and an apparatus for video compression based on keyframe-guided super-resolution according to an embodiment of the present invention with reference to the accompanying drawings.

A video compression method based on keyframe-guided super-resolution proposed according to an embodiment of the present invention will first be described with reference to the accompanying drawings.

Fig. 1 is a flowchart of a method for video compression based on keyframe-guided super-resolution according to an embodiment of the present invention.

Fig. 2 is a flowchart illustrating a method for video compression based on keyframe-guided super-resolution according to an embodiment of the present invention.

As shown in fig. 1 and 2, the method for video compression based on keyframe-guided super resolution includes the following steps:

step S1, the input video is input to the key frame selection network in the form of a frame sequence, and a high resolution key frame of the input video is obtained.

Further, the method comprises the steps of inputting an input video into a key frame selection network in a frame sequence mode to obtain a high-resolution key frame of the input video, and further comprises the following steps:

extracting a characteristic vector of each frame of an input video, and acquiring a time sequence relation between video frames through a long-term and short-term memory artificial neural network;

converting the characteristic vector into a confidence score through a full connection layer and a softmax layer;

and taking the video frame corresponding to the highest number in the confidence scores as the high-resolution key frame.

And further, according to the extracted feature vector of each frame of the input video, calculating a training score by using Gumbel distributed sampling and a softmax layer, wherein the training score is given to each frame of picture by a weight, further selecting a substitute frame, and training a key frame by using the substitute frame to select a network.

In the process of selecting the key frames, a deep neural network with bidirectional LSTM is designed to extract the relation features between frame sequences, and the most representative frames are selected as the key frames in a classification mode.

Specifically, as shown in fig. 3, the input of the key frame selection network is a frame sequence of high definition images, and one or more frames in the frame sequence of the input video can be selected as key frames through the key frame selection network. It should be noted that a preset selection rule may be set, for example, every 10 seconds of input video selects a key frame, and then a 60 seconds of input video selects 6 key frames from the key frame, which may be set according to actual requirements, and is not limited herein.

One of the multiple pictures is selected as a key frame, so that the method can be regarded as a classification problem, similar to [10,11], a feature vector is extracted by using a pre-trained ResNet18 for each input frame, and then a Long Short-Term Memory (LSTM) model is used for acquiring the time sequence relation between the frames. The features can be converted into a confidence score of real value through the full link layer and the softmax layer, and the confidence score is used for representing the probability that the frame is selected as the key frame. The frame with the largest score is selected as the key frame and the other frames are down sampled and taken as input for the next generator network G.

But the operation of taking the maximum is discontinuous, so the step has no optimization method, and the gradient is provided for the key frame selection network to complete the optimization; on the other hand, to avoid the case of pattern collapse, the score is generated with Gumbel-Softmax instead of Softmax. By v_tAnd G_tThe samples representing the output of the previous layer and the Gumbel distribution, respectively, the training score can be calculated by the following equation:

in the process of forward transmission, the operation of taking the maximum is used for deciding which frame is selected as the key frame, and in the process of backward transmission, the score calculated by the expression is given to each high-definition picture input as the weight of the high-definition picture, so that a substitute frame I in the process of backward transmission is formed^Sub：

Although this is not a true image in the input video, it is combined from the input images according to the computed scores, so the parameters in the key frame selector S can also be updated according to the different scores of the input frames, so we perform the backpropagation of the backpropagation gradient with the following relaxation:

before using the key frame network, selecting a substitute frame by a back-transmission method to train the key frame selection network.

In step S2, the input video is down-sampled by a frame sequence to obtain a low resolution frame sequence of the input video.

Specifically, downsampling is performed on a high-definition input video to obtain a compressed low-resolution video.

In the process of video compression, except that the high-resolution video is down-sampled to a low-resolution video, a representative high-resolution frame is selected as a key frame of the whole high-resolution video. Both the key frames and the low-resolution video can be compressed by the existing method, and the number of the key frames is far less than that of the original frame rate, so that the storage space is greatly reduced.

Step S3, the high resolution key frame and the low resolution frame sequence are input to a generator, and a super-resolution video is generated.

Further, step S3 further includes:

s31, increasing the resolution of the low resolution frame sequence by interpolation;

s32, extracting feature maps of the low-resolution frame sequence and the high-resolution key frame through the convolutional layer to obtain a low-definition feature map and a high-definition feature map;

s33, performing super-resolution on the low-definition feature map to obtain a first intermediate feature map, and fusing the information of the low-definition feature map and the high-definition feature map through a mutual attention layer to obtain a second feature map;

and S34, splicing the first feature map and the second feature map, and inputting all spliced feature maps into a recovery module to generate the super-resolution video.

Fusing the information of the low-definition feature map and the high-definition feature map through the mutual attention layer to obtain a second feature map, wherein the second feature map comprises:

respectively extracting corresponding feature maps from the low-definition feature map and the high-definition feature map by using a preset stride;

respectively converting the corresponding characteristic diagrams into two-dimensional matrixes, and obtaining coefficient matrixes through point multiplication of characteristic vectors between the two-dimensional matrixes;

obtaining a reconstruction coefficient by using softmax according to the coefficient matrix, wherein the reconstruction coefficient is used as the weight of each block of the high-resolution key frame in the reconstruction process;

and cutting the high-resolution key frame into a corresponding number of blocks, and weighting according to the weight of each block to obtain a reconstructed second feature map.

In the decompression process, the high-quality reconstruction of the low-resolution video can be completed according to the guidance of the key frames by utilizing the time sequence consistency between adjacent frames, and because the lost detail information can be acquired from the key frames which are not downsampled, the decompression process is more solvable than the traditional ill-conditioned super-resolution problem.

As shown in fig. 4, the key frame selected by the key frame selection network S and the low resolution video frame are simultaneously input into the generator G, which has two branches, one branch is the super-resolution feature of the low resolution image itself, and the other branch is the guided recovery of the low resolution frame and the high resolution key frame by the attention mechanism.

Specifically, in the generator, the input is simply considered as one high definition key frame and one low definition video frame, rather than a series of video frame sequences. In order to restore a high-quality super-resolution image, the relation between two frames is fully utilized to mine detail information contained in a high-definition key frame. Firstly, recovering a high-definition frame from a low-definition picture in an interpolation mode, then extracting a feature map for two frames by using a convolution layer, wherein the length and the width of the feature map are reduced, but the number of channels is increased, so that the original information is kept to be input, and preparation for optimizing calculation is made for the next step. Two branches can generate two intermediate feature maps respectively, one is obtained by super-resolution of the low-definition map, and the other is generated by fusing the information in the two feature maps together by using the mutual attention layer. After the two feature maps are generated, the two intermediate feature maps are spliced together and then input into a recovery module to generate a final super-resolution picture. The overall architecture of the network is shown in fig. 4.

Layer of mutual attention: the direction of the low-definition to-be-recovered frame by the high-definition key frame is accomplished with a mutual attention layer, as shown in fig. 5. And firstly, extracting the other two feature maps by using the step s for the two feature maps, thereby further reducing the computational complexity. Then, the number of channels is taken as the length of the features, and then the length is multiplied by the width to be taken as the number of the features, so that the feature map of each picture is changed into a two-dimensional matrix from three dimensions. The mutual relation of the low-definition images and the key frames in the length dimension and the width dimension can be obtained through point multiplication of the feature vectors between the two-dimensional matrixes, and the guided reconstruction of the key frames on the low-definition images can be completed according to the relation. Firstly, two-dimensional matrixes are multiplied by a matrix to obtain a coefficient matrix, so that a parameter relation between two pictures is established, and secondly, softmax is used for obtaining a score which is used as the weight of each block of a key frame in the reconstruction process. After the weight scores are obtained, the original high-definition feature map is cut into blocks with the same size, and the reconstructed feature map is obtained according to the calculated score weights, so that super-resolution reconstruction with the key frame as a guide in the mutual attention branch is completed.

It can be understood that after the super-resolution video is obtained through decompression, in order to make the generated super-resolution image closer to the original high-definition image, a small neural network can be used as a discriminator to form a confrontation generation model with the discriminator, so that the network can better utilize information in the key frames and can also generate an image closer to the original high-definition video frame.

Specifically, a better super-resolution performance is obtained and the stability of the training process is improved, a discriminator D is built by using a convolutional neural network, and an original real high-resolution frame and a super-resolution frame are distinguished by using a wasserstein gan loss function with a gradient penalty, as follows:

with the above introduction, the ultimate goal is to generate super-resolution frames close to the original high-definition frames, so that the reconstruction loss is an important part, and for each pair of high-definition frame and super-resolution frame, the following is minimized:

when optimizing discriminator D, the penalty function is learned with countermeasures as follows:

the countermeasure loss and reconstruction loss also guide the generator during the training process, and unlike the discriminator which tries to distinguish between the high definition original frame and the generated super-resolution frame, the generator fools the discriminator by making the generated picture as close as possible to the original picture. The loss function of the generator is therefore as follows:

after back-propagation through the generator, we obtain the gradient for the key-frame, when we use I to optimize the parameters of the key-frame selector^SubInstead, the loss function of the key frame selector and the loss function of the generator are therefore identical, as:

equations (5) and (7) are used to optimize the generator, equation (6) is used to optimize the discriminator, and equation (8) is used to optimize the key frame selection network.

It will be appreciated that in the training, the key frame selection network, the generator and the discriminator are trained end-to-end, and in the testing, only the key frame selection network and the generator network are used.

According to the video compression method based on the key frame guidance super-resolution provided by the embodiment of the invention, the high-resolution key frame of the input video is obtained by inputting the input video into a key frame selection network in a frame sequence form; down-sampling an input video by a frame sequence to obtain a low-resolution frame sequence of the input video; and inputting the high-resolution key frame and the low-resolution frame sequence into a generator to generate the super-resolution video. Therefore, the high-definition input video is compressed after being downsampled, and meanwhile, the key frame is selected from the compressed high-definition input video and used as a video super-resolution guide after the video is decompressed, so that high-level video compression is achieved, and the high-quality video can be restored from the compressed high-definition input video.

Next, a video compression apparatus for guiding super-resolution based on key frames according to an embodiment of the present invention will be described with reference to the accompanying drawings.

As shown in fig. 6, the video compression apparatus for guiding super-resolution based on key frames includes: the device comprises a selection module 100, a compression module 200 and a decompression module 300.

The selecting module 100 is configured to input an input video to a key frame selecting network in a frame sequence form, so as to obtain a high-resolution key frame of the input video.

The compression module 200 is configured to down-sample the input video by a frame sequence to obtain a low resolution frame sequence of the input video.

And a decompression module 300, configured to input the high resolution key frame and the low resolution frame sequence into the generator, so as to generate the super-resolution video.

Further, in an embodiment of the present invention, the method further includes: a reinforcement module;

and the reinforcement module is used for building a discriminator through a convolution network, and performing countermeasure generation by taking the super-resolution video and the input video as the input of the discriminator at the same time.

Further, in an embodiment of the present invention, the method further includes: an optimization module;

and the optimization module is used for optimizing the key frame selection network, the generator and the discriminator through the loss function.

It should be noted that the foregoing explanation on the embodiment of the video compression method for guiding super-resolution based on the key frame is also applicable to the apparatus of this embodiment, and is not repeated here.

According to the video compression device based on the key frame guidance super-resolution, which is provided by the embodiment of the invention, the high-resolution key frame of the input video is obtained by inputting the input video to a key frame selection network in a frame sequence form; down-sampling an input video by a frame sequence to obtain a low-resolution frame sequence of the input video; and inputting the high-resolution key frame and the low-resolution frame sequence into a generator to generate the super-resolution video. Therefore, the high-definition input video is compressed after being downsampled, and meanwhile, the key frame is selected from the compressed high-definition input video and used as a video super-resolution guide after the video is decompressed, so that high-level video compression is achieved, and the high-quality video can be restored from the compressed high-definition input video.

Furthermore, the terms "first", "second" and "first" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.

In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.

Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims

1. A video compression method based on super-resolution guided by key frames is characterized by comprising the following steps:

2. The method for video compression based on keyframe-guided super resolution as claimed in claim 1, further comprising:

and constructing a discriminator through a convolutional network, and performing countermeasure generation by taking the super-resolution video and the input video as the input of the discriminator at the same time.

3. The method for video compression based on keyframe-guided super resolution of claim 1, wherein the inputting of the input video into the keyframe selection network in the form of a sequence of frames to obtain the high resolution keyframes of the input video further comprises:

extracting a characteristic vector of each frame of the input video, and acquiring a time sequence relation between video frames through a long-term and short-term memory artificial neural network;

converting the feature vector into a confidence score through a full connection layer and a softmax layer;

and taking the video frame corresponding to the highest number in the confidence scores as a high-resolution key frame.

4. The method for video compression with super-resolution guided based on key frames according to claim 3, wherein a training score is calculated by using Gumbel distributed sampling and softmax layer according to the extracted feature vector of each frame of the input video, the training score is given to each frame picture with a weight, and then a substitute frame is selected, and the key frame selection network is trained by using the substitute frame.

5. The method of claim 1, wherein the inputting the sequence of high resolution key frames and the sequence of low resolution frames into a generator generates super-resolution video, comprising:

increasing the resolution of the low-resolution frame sequence by means of interpolation;

extracting feature maps of the low-resolution frame sequence and the high-resolution key frame through a convolutional layer to obtain a low-definition feature map and a high-definition feature map;

performing super-resolution on the low-definition feature map to obtain a first intermediate feature map, and fusing information of the low-definition feature map and the high-definition feature map through a mutual attention layer to obtain a second feature map;

and splicing the first feature map and the second feature map, and inputting all spliced feature maps into a recovery module to generate a super-resolution video.

6. The method for video compression based on keyframe-guided super resolution as claimed in claim 5, wherein said fusing the information of the low-definition feature map and the high-definition feature map through a mutual attention layer to obtain a second feature map comprises:

and cutting the high-resolution key frame into a corresponding number of blocks, and weighting according to the weight of each block to obtain the reconstructed second feature map.

7. The method for video compression based on keyframe-guided super resolution as claimed in claims 1-6, further comprising: and optimizing the key frame selection network, the generator and the discriminator through a loss function.

8. A video compression apparatus for guiding super-resolution based on key frames, comprising:

9. The apparatus for video compression based on keyframe-guided super resolution of claim 8, further comprising: a reinforcement module;

and the enhancement module is used for building a discriminator through a convolution network, and performing countermeasure generation by taking the super-resolution video and the input video as the input of the discriminator at the same time.

10. The apparatus for video compression based on keyframe-guided super resolution as claimed in claims 8-9, further comprising: an optimization module;

and the optimization module is used for optimizing the key frame selection network, the generator and the discriminator through a loss function.