WO2023000179A1 - Video super-resolution network and video super-resolution, encoding and decoding processing methods and devices - Google Patents

Video super-resolution network and video super-resolution, encoding and decoding processing methods and devices

Info

Publication number: WO2023000179A1
Authority: WO (WIPO PCT)
Prior art keywords: video, resolution, frame sequence, video frame, super
Application number: PCT/CN2021/107449
Other languages: English (en), French (fr)
Inventors: 元辉, 付丛睿, 刘瑶, 杨烨, 李明
Original Assignee: Guangdong Oppo Mobile Telecommunications Corp., Ltd. (Oppo广东移动通信有限公司)
Application filed by Guangdong Oppo Mobile Telecommunications Corp., Ltd.
Priority to CN202180100597.5A (CN117730338A)
Priority to PCT/CN2021/107449 (WO2023000179A1)
Priority to EP21950444.6A (EP4365820A1)
Publication of WO2023000179A1


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/132 Sampling, masking or truncation of coding units, e.g. adaptive resampling, frame skipping, frame interpolation or high-frequency transform coefficient masking

Definitions

  • the general video compression process is shown in Figure 1.
  • at the encoding end, it includes processes such as video acquisition, video preprocessing, and video encoding.
  • at the decoding end, it includes processes such as video decoding, video post-processing, and display playback.
  • during video preprocessing, the frame rate of the video is sometimes reduced due to restrictions such as bandwidth and bit rate, and image quality is also reduced during video compression encoding.
  • the video post-processing performed after video decoding is therefore an important link for improving video quality, but its improvement effect needs to be enhanced.
  • An embodiment of the present disclosure provides a video super-resolution network, including a generation network, wherein the generation network includes a first feature extraction part, a second feature extraction part, and a reconstruction part connected in sequence, wherein:
  • the first feature extraction part is configured to receive a first video frame sequence, extract a first feature from the first video frame sequence based on 3D convolution and output it;
  • the second feature extraction part is configured to receive the first feature, extract temporal and/or spatial second features from the first feature based on the 3D residual attention mechanism, and output them;
  • the reconstruction part is configured to receive the second features, realize feature fusion and spatio-temporal feature super-resolution based on 3D convolution and 3D upsampling, and reconstruct a video frame sequence based on 3D convolution to generate a second video frame sequence, where the resolution of the second video frame sequence is greater than that of the first video frame sequence.
  • An embodiment of the present disclosure also provides a video super-resolution processing method, including:
  • extracting a first feature from a first video frame sequence based on 3D convolution; extracting temporal and/or spatial second features from the first feature based on the 3D residual attention mechanism; and, based on 3D convolution and 3D upsampling, realizing feature fusion of the second features and spatio-temporal feature super-resolution, and reconstructing a video frame sequence based on 3D convolution to generate a second video frame sequence, where the resolution of the second video frame sequence is greater than the resolution of the first video frame sequence.
  • An embodiment of the present disclosure also provides a video decoding processing method, including:
  • decoding a code stream to obtain a first video frame sequence; and, when a set super-resolution condition is met, outputting the first video frame sequence to the video super-resolution network for video super-resolution processing to obtain a second video frame sequence, the resolution of which is greater than the resolution of the first video frame sequence.
  • An embodiment of the present disclosure also provides a video coding processing method, including: when performing video preprocessing, determining whether to downsample the video frame sequence from the data source, and inputting either the original or the downsampled video frame sequence into a video encoder for video encoding to generate a code stream.
  • An embodiment of the present disclosure also provides a video super-resolution processing device, including a processor and a memory storing a computer program that can run on the processor, wherein, when the processor executes the computer program, the video super-resolution processing method described in any embodiment of the present disclosure is implemented.
  • An embodiment of the present disclosure also provides a video decoding processing device, including a processor and a memory storing a computer program that can run on the processor, wherein, when the processor executes the computer program, the video decoding processing method described in any embodiment of the present disclosure is implemented.
  • An embodiment of the present disclosure also provides a video decoding processing device, including:
  • the super-resolution judging device is configured to judge whether the first video frame sequence satisfies the set super-resolution condition; if the set super-resolution condition is met, it outputs the first video frame sequence to the video super-resolution network for video super-resolution processing; if the set super-resolution condition is not met, it determines to skip the video super-resolution processing of the first video frame sequence;
  • the video super-resolution network is configured to perform video super-resolution processing on the first video frame sequence to obtain a second video frame sequence with a resolution greater than that of the first video frame sequence.
  • An embodiment of the present disclosure also provides a video encoding processing device, including a processor and a memory storing a computer program that can run on the processor, wherein, when the processor executes the computer program, the video coding processing method described in any embodiment of the present disclosure is implemented.
  • An embodiment of the present disclosure also provides a video encoding processing device, including:
  • the down-sampling decision module is configured to determine, when performing video preprocessing, whether to downsample the video frame sequence from the data source; when it is determined to perform downsampling, it outputs the video frame sequence from the data source to the downsampling device, and when it is determined not to perform downsampling, it outputs the video frame sequence from the data source directly to a video encoder for encoding;
  • a downsampling device is configured to downsample the input video frame sequence, and output the downsampled video frame sequence to a video encoder for encoding;
  • a video encoder configured to perform video encoding on the sequence of video frames from a data source or the sequence of downsampled video frames.
  • An embodiment of the present disclosure further provides a video encoding and decoding system, including the video encoding processing device as described in the embodiment of the present disclosure and the video decoding processing device as described in the embodiment of the present disclosure.
  • An embodiment of the present disclosure further provides a code stream, wherein the code stream is generated according to the video coding processing method described in the embodiment of the present disclosure, and the code stream includes the downsampling flag.
  • An embodiment of the present disclosure also provides a non-transitory computer-readable storage medium storing a computer program, wherein, when the computer program is executed by a processor, the method described in any embodiment of the present disclosure is implemented.
  • Fig. 1 is a schematic diagram of the video compression process;
  • Figure 2 is an architecture diagram of a generative adversarial network;
  • Fig. 3 is a structural diagram of a generation network according to an embodiment of the present disclosure;
  • FIG. 4 is a schematic structural diagram of a 3D residual attention mechanism model according to an embodiment of the present disclosure
  • FIG. 5 is a structural diagram of a discrimination network according to an embodiment of the present disclosure.
  • FIG. 6 is a flowchart of a video super-resolution processing method according to an embodiment of the present disclosure
  • FIG. 7A is a schematic diagram of super-resolution of a sequence of decoded video frames according to an embodiment of the present disclosure
  • FIG. 7B is a structural diagram of a video decoder according to an embodiment of the present disclosure.
  • FIG. 8A is a structural diagram of a video encoder according to an embodiment of the present disclosure.
  • Fig. 8B is a schematic diagram of a scalable video coding architecture according to an embodiment of the present disclosure, only showing the parts closely related to upsampling and downsampling;
  • Fig. 8C is a schematic diagram of a scalable video decoding architecture according to an embodiment of the present disclosure, only showing the part closely related to upsampling;
  • FIG. 9 is a flowchart of a video encoding processing method according to an embodiment of the present disclosure.
  • FIG. 10 is a flowchart of a video decoding processing method according to an embodiment of the present disclosure corresponding to the video encoding processing method shown in FIG. 9;
  • FIG. 11 is an architecture diagram of a video encoding and decoding system according to an embodiment of the present disclosure.
  • Fig. 12 is a schematic structural diagram of a video encoding processing device according to an embodiment of the present disclosure.
  • words such as “exemplary” or “for example” are used to mean serving as an example, instance or illustration. Any embodiment described in this disclosure as “exemplary” or “for example” should not be construed as preferred or advantageous over other embodiments.
  • “and/or” herein describes the relationship between associated objects and indicates that three relationships are possible; for example, A and/or B can mean: A exists alone, A and B exist simultaneously, or B exists alone.
  • “A plurality” means two or more than two.
  • words such as “first” and “second” are used to distinguish identical or similar items with substantially the same function and effect. Those skilled in the art will understand that words such as “first” and “second” do not limit quantity or execution order, nor do they necessarily imply a difference.
  • Video post-processing mainly addresses the quality loss incurred during video pre-processing, video encoding, and video decoding, so as to enhance video image quality and increase the number of video frames.
  • some methods use filters to filter the compressed image.
  • these methods mainly improve the visual effect of the image by smoothing the distortion introduced by video compression, rather than restoring the pixel value of the image itself.
  • a non-block matching based frame rate increasing algorithm and a block matching based frame rate increasing algorithm may be used.
  • the non-block matching-based frame rate improvement algorithm does not consider the motion of objects in the image, but only uses adjacent video frames to perform linear interpolation to generate new video frames.
  • the frame rate improvement algorithm based on block matching improves the frame rate by estimating the motion vector of the object and interpolating along the motion trajectory of the object; the quality of the interpolated video frames is better, but the complexity is higher.
  • Super-resolution (SR for short) refers to improving the resolution of an original image by means of hardware or software.
  • HR: High Resolution; LR: Low Resolution.
  • Super-resolution technology can reconstruct low-resolution video into high-resolution video images through deep learning methods, bringing users a good video experience.
  • A generative adversarial network (GAN) includes a generative network (generator) G that can capture the data distribution, and a discriminative network (discriminator) D that can estimate the probability that data originates from real samples.
  • the generation network and the discrimination network are trained at the same time, and the two networks compete against each other to achieve the best generation effect.
  • the input to generative network training is a low-resolution image, and the output is a reconstructed super-resolved image.
  • the inputs for discriminative network training are super-resolution images and real images, and the output is the probability that the input image comes from a real image.
  • Low-resolution images can be obtained by downsampling the real image.
  • the training process of the generation network is to maximize the probability of the discriminator making mistakes, so that the discriminator mistakenly believes that the data is a real image (true sample) rather than a super-resolution image (false sample) generated by the generator.
  • the training goal of the discriminative network is to maximize the separation of real samples and fake samples. Therefore, this framework corresponds to a minimax game between two players.
  • a unique equilibrium solution can be obtained, so that after the fake samples generated by the generation network enter the discriminant network, the result given by the discriminant network is a value close to 0.5.
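  • For reference, this minimax game is the standard GAN objective (a well-known general formulation, not specific to this disclosure):

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

  • at the unique equilibrium, the optimal discriminator outputs D(x) = 1/2 everywhere, which is the value close to 0.5 mentioned above.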
  • An embodiment of the present disclosure proposes a method for realizing super-resolution based on a generative adversarial network; the corresponding network is a super-resolution generative adversarial network (SRGAN: Super-Resolution Generative Adversarial Networks). In the SRGAN network framework:
  • the core of the generation network is multiple residual blocks with the same layout, using batch normalization (BN) layers and the rectified linear unit (ReLU) as the activation function, and using two trained sub-pixel convolutional layers to increase the resolution of the input image.
  • the discrimination network consists of 8 incremental convolutional layers whose number of kernels grows by a factor of 2 from 64 to 512; the resulting 512 feature maps are followed by 2 fully connected layers (dense layers) and a final sigmoid activation function to obtain the probability of the sample category.
  • SRGAN cannot achieve temporal and spatial super-resolution at the same time to fully extract useful features of different dimensions, so its improvement of video quality is limited.
  • its discrimination network has a single-branch structure, does not use optical flow information, and its discriminative ability is limited; therefore, the quality of high-resolution images reconstructed by this network still needs to be improved.
  • An embodiment of the present disclosure provides a video super-resolution network, including a generation network for realizing a video spatio-temporal super-resolution function.
  • the generation network uses 3D convolution to realize spatio-temporal super-resolution: it first performs shallow feature extraction based on 3D convolution, and then uses a series of residual attention blocks (RAB: Residual Attention Block) for deep feature extraction.
  • Each RAB block itself uses a residual learning method and a 3D attention mechanism to further improve the quality of spatio-temporal super-resolution.
  • the generation network includes a first feature extraction part, a second feature extraction part and a reconstruction part connected in sequence, wherein:
  • the first feature extraction part 10 is configured to receive a first video frame sequence, extract a first feature from the first video frame sequence based on 3D convolution and output it;
  • the second feature extraction part 20 is configured to receive the first feature, extract the second feature in time and/or space from the first feature based on the 3D residual attention mechanism and output it;
  • the reconstruction part 30 is configured to receive the second feature, realize feature fusion and spatio-temporal feature super-resolution based on 3D convolution and 3D upsampling, and reconstruct a video frame sequence based on 3D convolution to generate a second video frame sequence, where the resolution of the second video frame sequence is greater than the resolution of the first video frame sequence.
  • the above-mentioned second features are extracted from the first features, and the first features may also be called shallow features, and the second features may be called deep features.
  • the above-mentioned first video frame sequence may also be called a low-resolution video frame sequence, and the second video frame sequence may be called a high-resolution video frame sequence or a super-resolution video frame sequence.
  • the aforementioned image resolution and video frame rate may be collectively referred to as resolution, where the image resolution may also be referred to as spatial resolution, and the video frame rate may also be referred to as temporal resolution.
  • the first feature extraction part includes a sequentially connected 3D convolutional layer and an activation layer, such as Conv3d and PReLU in Fig. 3; the input of the 3D convolutional layer is the first video frame sequence, and the output of the activation layer is the first feature.
  • the second feature extraction part includes a plurality of residual attention blocks (RABs) connected in sequence, as shown in Fig. 3. The input of the first RAB is the first feature, the input of each other RAB is the output of the previous RAB, and the output of the last RAB is the second feature. Each RAB includes a sequentially connected 3D convolutional layer, activation layer, and 3D attention mechanism model unit, such as Conv3d, PReLU and 3D-CBAM in Fig. 3.
  • the input of the RAB is sent to the 3D convolutional layer and is also added, via a skip connection, to the output of the 3D attention mechanism model unit; the resulting sum is used as the output of the RAB.
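  • As a minimal PyTorch sketch of one RAB under this description: the channel count and kernel size below are assumptions, and the 3D-CBAM unit (sketched after the following bullets as CBAM3D) is stubbed with nn.Identity here so the block stays self-contained.

```python
import torch
import torch.nn as nn

class RAB(nn.Module):
    """Residual attention block: Conv3d -> PReLU -> 3D attention, with a skip
    connection from the block input to the attention output."""
    def __init__(self, channels=64, attention=None):
        super().__init__()
        self.conv = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.PReLU()
        # Pass in the CBAM3D unit sketched below; Identity is a placeholder.
        self.attention = attention if attention is not None else nn.Identity()

    def forward(self, x):
        # x: (N, C, T, H, W)
        out = self.attention(self.act(self.conv(x)))
        return x + out  # skip connection: block input added to attention output

# Example: y = RAB(64)(torch.randn(1, 64, 5, 32, 32))
```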
  • the 3D attention mechanism model unit adopts a 3D convolutional block attention model (3D-CBAM: 3D Convolutional Block Attention Module). As shown in Fig. 4, the 3D convolutional block attention model includes a 3D channel attention module 60 and a 3D spatial attention module 70. The input of the 3D attention mechanism model unit is sent into the 3D channel attention module; the first product, obtained by multiplying the input and output of the 3D channel attention module, is used as the input of the 3D spatial attention module; and the second product, obtained by multiplying the output of the 3D spatial attention module with the first product, is used as the output of the 3D attention mechanism model unit.
  • in the original CBAM, the attention mechanism is designed in two-dimensional space; the 3D-CBAM of the embodiment of the present disclosure expands on this two-dimensional basis by adding a depth dimension.
  • 3D-CBAM sequentially infers channel attention feature maps and spatial attention feature maps.
  • in the 3D channel attention module, the input feature map is subjected to maximum pooling and mean pooling over width, height and depth, the pooled results are fed into a shared multi-layer perceptron and summed element-wise, and the sum is then activated by the sigmoid function to generate the initial channel feature map; the initial channel feature map is multiplied by the input feature map to generate the final channel feature map.
  • the above final channel feature map is used as the input feature map of the 3D spatial attention module, where channel-based maximum pooling and mean pooling are performed on it; the extracted features are then concatenated along the channel dimension, reduced to one channel through a convolution operation (such as a 7×7 convolution or a 3×3 convolution), and activated by the sigmoid function to generate a spatial attention feature map. The generated spatial attention feature map is multiplied by the input final channel feature map to obtain the feature map output by the 3D-CBAM.
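  • A minimal PyTorch sketch of the 3D-CBAM just described; the class names, the reduction ratio of the shared MLP, and the cubic 7×7×7 spatial kernel are assumptions of the sketch, consistent with the examples above.

```python
import torch
import torch.nn as nn

class ChannelAttention3D(nn.Module):
    """3D channel attention: max/mean pooling over (T, H, W), a shared MLP,
    element-wise sum, then sigmoid."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(  # shared multi-layer perceptron
            nn.Conv3d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels // reduction, channels, 1, bias=False),
        )

    def forward(self, x):
        avg = self.mlp(torch.mean(x, dim=(2, 3, 4), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3, 4), keepdim=True))
        return torch.sigmoid(avg + mx)

class SpatialAttention3D(nn.Module):
    """3D spatial attention: channel-wise max/mean pooling, concatenation,
    one 3D convolution down to a single channel, then sigmoid."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv3d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        avg = torch.mean(x, dim=1, keepdim=True)
        mx = torch.amax(x, dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAM3D(nn.Module):
    """3D-CBAM of Fig. 4: channel attention first, then spatial attention,
    each applied multiplicatively."""
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        self.ca = ChannelAttention3D(channels, reduction)
        self.sa = SpatialAttention3D(kernel_size)

    def forward(self, x):
        x = x * self.ca(x)      # first product: input scaled by channel map
        return x * self.sa(x)   # second product: scaled by spatial map
```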
  • the 3D attention mechanism considers both spatial and temporal changes when extracting features, which can be more suitable for the purpose of the video super-resolution network in the embodiment of the present disclosure, and better adaptive learning.
  • the 3D channel attention module pays more attention to which channels play a role in the final super-resolution, and selects the features that play a decisive role in the prediction.
  • the 3D spatial attention module focuses on which pixel positions play a more important role in the network's prediction. The joint use of these two attention mechanism modules maximizes the learning ability of the network and obtains better spatio-temporal super-resolution results.
  • the reconstruction part 30 includes the following units connected in sequence:
  • a 3D convolution unit for fusing features, including a sequentially connected 3D convolution layer and activation layer (such as Conv3D and PReLU in Fig. 3); the input of this unit is the second feature. PReLU is a parametric rectified linear unit (Parametric Rectified Linear Unit).
  • a 3D transposed convolution unit for realizing spatio-temporal feature super-resolution, including a sequentially connected 3D transposed convolution layer and activation layer (ConvTrans-3D and PReLU in Fig. 3); the input of this unit is the output of the 3D convolution unit used to fuse features, and the 3D transposed convolution realizes the upsampling function;
  • a 3D convolutional layer for generating a video frame sequence (such as Conv3D in Fig. 3); its input is the output of the 3D transposed convolution unit, and its output is the second video frame sequence.
  • the activation function used by the activation layers here is PReLU; many kinds of activation functions exist, and other activation functions may also be used.
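  • A minimal PyTorch sketch of the reconstruction part; the channel counts and the uniform 2x spatio-temporal upsampling factor are assumptions (the experiments described below use a 4x spatial and 2x temporal multiple).

```python
import torch
import torch.nn as nn

class Reconstruction(nn.Module):
    """Reconstruction part: a fusion unit (Conv3D + PReLU), a 3D transposed
    convolution unit for spatio-temporal upsampling (ConvTrans-3D + PReLU),
    and a final Conv3D that generates the output frames."""
    def __init__(self, channels=64, out_channels=3):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.PReLU(),
        )
        # kernel 4 / stride 2 / padding 1 exactly doubles T, H and W
        self.up = nn.Sequential(
            nn.ConvTranspose3d(channels, channels, kernel_size=4, stride=2, padding=1),
            nn.PReLU(),
        )
        self.out = nn.Conv3d(channels, out_channels, kernel_size=3, padding=1)

    def forward(self, feats):  # feats: (N, C, T, H, W) second features
        return self.out(self.up(self.fuse(feats)))

# Example: (1, 64, 5, 32, 32) -> (1, 3, 10, 64, 64)
```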
  • the characteristics of the above generation network in the embodiment of the present disclosure include: using 3D convolution, the temporal and spatial features of the video can be extracted at the same time, so feature extraction is more accurate, and compared with extracting temporal and spatial information separately, the consumption of computing resources is further reduced; the generation network adopts a 3D attention mechanism, which can concentrate the features extracted by the network well, so as to obtain better reconstruction results; and the generation network can use a variable number of RAB blocks, making the network structure more flexible, so the number of RABs can be freely selected according to computing resources to meet the needs of different scenarios.
  • the generated network can be used independently as a video super-resolution network to complete the video super-resolution function.
  • the video super-resolution network further includes a discriminant network, and the overall architecture of the video super-resolution network composed of the generation network and the discriminant network is shown in FIG. 2 .
  • the input during the training of the discriminant network is a sequence of real video frames and a sequence of second video frames generated by the generating network, which are respectively used as real samples and fake samples input to the discriminant network.
  • the output of the judgment network is the probability that the input video frame sequence is a real video frame.
  • the first video frame sequence used as input data during network training is obtained by degrading the real video frame sequence.
  • the first video frame sequence used as the training set may be obtained by performing one or more of downsampling, blurring, adding noise, and compression codec on the real video frame sequence.
  • there are many methods of downsampling, including linear methods, such as nearest neighbor sampling, bilinear sampling, bicubic downsampling, and mean downsampling, and non-linear methods, such as neural network downsampling.
  • a variety of downsampling multiples can be set to obtain the first video frame sequence of different resolutions, so as to train multiple sets of network parameters.
  • the network parameters of the video super-resolution network can be flexibly set as needed to obtain different super-resolution effects.
  • the discrimination network includes a first branch, a second branch, an information fusion unit connected to the first branch and the second branch, and a weight calculation unit connected to the information fusion unit, wherein:
  • the first branch 40 is configured to extract detail features of the video frames themselves from the input video frame sequence, and perform authenticity judgment based on the detail features;
  • the second branch 50 is configured to extract motion information features between video frames from the input video frame sequence based on the optical flow network, and perform authenticity judgment based on the motion information features;
  • the weight calculation unit is configured to perform weight calculation on the fused information output by the information fusion unit to obtain the probability that the input video frame sequence is a real video frame sequence.
  • the information fusion unit is implemented using a fully connected layer (such as dense(1) in Fig. 5); the weight calculation unit is implemented using an S-shaped function (such as the sigmoid function in Fig. 5).
  • the first branch 40 includes the following units connected in sequence:
  • a 2D convolution unit including a sequentially connected 2D convolution layer and activation layer, such as Conv_1 and LeakyReLU in Fig. 5;
  • a plurality of 2D convolution-plus-normalization units, each including a sequentially connected 2D convolution layer, BN layer and activation layer; the Conv_2 layer, BN layer and LeakyReLU in Fig. 5 form one such unit, and the other 2D convolution-plus-normalization units are represented by CBL_2 to CBL_8 in the figure. Seven CBLs are used in the example of Fig. 5, but the present disclosure is not limited to this number; the BN layer is used to speed up network convergence.
  • the second branch 50 includes the following units connected in sequence:
  • M 2D deconvolution units, each including a 2D deconvolution layer and an activation layer, with M ≥ 2; four 2D deconvolution units are shown in Fig. 5, with the 2D deconvolution layers denoted as DeConv5, DeConv4, DeConv3 and DeConv2, and the activation layer being LeakyReLU;
  • the fully connected unit includes a sequentially connected fully connected layer and an activation layer, such as Dense (1024) and LeakyReLU in the second branch 50 in FIG. 5 .
  • the connection relationship is shown in Figure 5. This kind of network structure can realize the extraction of motion information features between video frames and the authenticity judgment.
  • the activation function used by the activation layers in the discrimination network is LeakyReLU (leaky rectified linear unit); many kinds of activation functions exist, and other activation functions can also be used here.
  • in the layer parameters, K represents the size of the convolution kernel, s represents the stride, and n represents the number of convolution kernels; for example, K3 means the convolution kernel size is 3, s1 means the stride is 1, and n64 means the number of convolution kernels is 64, and so on. The unit of convolution kernel size and stride can be pixels.
  • the convolutional layer parameters in the second branch 50 are set in the same notation, as shown in Fig. 5.
  • the discriminant network in the embodiment of the present disclosure adopts two discriminant criteria, one is the feature of the video frame itself, and the other is the motion information between the video frames.
  • the discrimination network includes two branches and is a U-shaped network structure overall: one branch is used to extract and judge the detail features of the input video frame sequence, and the other branch uses the optical flow network to obtain and judge the motion information features of the input video frame sequence.
  • in this way, the authenticity probability of the input video frames can be identified more accurately, that is, whether the input is a real video frame sequence or a super-resolution video frame sequence (i.e., the second video frame sequence).
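  • The following is a minimal PyTorch skeleton of such a dual-branch discriminator. The optical-flow feature extractor (stubbed here), the layer widths, the LeakyReLU slope, the pooling used to reach the dense layers, and the per-frame/per-pair averaging are all assumptions of the sketch; the patent's exact parameters are those of Fig. 5.

```python
import torch
import torch.nn as nn

class DualBranchDiscriminator(nn.Module):
    """Skeleton of the two-branch discrimination network: the first branch
    judges frame detail features with a 2D conv stack (Conv_1 + seven CBL
    units); the second branch judges motion features derived from an
    optical-flow network followed by 2D deconvolution units and Dense(1024);
    a dense(1) layer fuses both, and a sigmoid yields P(real)."""
    def __init__(self, in_ch=3):
        super().__init__()
        layers = [nn.Conv2d(in_ch, 64, 3, 1, 1), nn.LeakyReLU(0.2)]  # Conv_1
        ch = 64
        for _ in range(7):  # seven CBL units: Conv2d + BN + LeakyReLU
            out = min(ch * 2, 512)
            layers += [nn.Conv2d(ch, out, 3, 2, 1), nn.BatchNorm2d(out), nn.LeakyReLU(0.2)]
            ch = out
        self.branch1 = nn.Sequential(*layers, nn.AdaptiveAvgPool2d(1), nn.Flatten())

        # Placeholder for the optical-flow network on adjacent frame pairs.
        self.flow_stub = nn.Conv2d(2 * in_ch, 64, 3, 1, 1)
        deconvs = []
        for _ in range(4):  # DeConv5 .. DeConv2, each with LeakyReLU
            deconvs += [nn.ConvTranspose2d(64, 64, 4, 2, 1), nn.LeakyReLU(0.2)]
        self.branch2 = nn.Sequential(*deconvs, nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                     nn.Linear(64, 1024), nn.LeakyReLU(0.2))

        self.fuse = nn.Linear(ch + 1024, 1)  # dense(1) information fusion

    def forward(self, frames):  # frames: (N, T, C, H, W)
        n, t, c, h, w = frames.shape
        f1 = self.branch1(frames.reshape(n * t, c, h, w)).reshape(n, t, -1).mean(1)
        pairs = torch.cat([frames[:, :-1], frames[:, 1:]], dim=2)  # adjacent frames
        m = self.flow_stub(pairs.reshape(n * (t - 1), 2 * c, h, w))
        f2 = self.branch2(m).reshape(n, t - 1, -1).mean(1)
        return torch.sigmoid(self.fuse(torch.cat([f1, f2], dim=1)))  # P(real)
```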
  • the use of 3D residual attention mechanism can better extract useful features in different dimensions and improve video quality.
  • the embodiments of the present disclosure are based on a generative-adversarial video spatio-temporal super-resolution network, which can simultaneously improve the spatial resolution and temporal resolution of the video, that is, super-resolve in both space and time using multi-dimensional feature information. It can significantly enhance the image quality and frame rate of low-resolution video frame sequences, achieving both video frame image super-resolution and frame rate improvement with a single network.
  • the video spatio-temporal super-resolution network of the embodiment of the present disclosure places the use of motion information in the discrimination network. Compared with using optical flow information for motion estimation in the generation network, this can further exploit real video information to improve the performance of the entire network and the quality of video super-resolution.
  • the network structure of the present disclosure may be changed on the basis of the foregoing embodiments.
  • the number of RABs included in the generated network can be appropriately reduced or increased to meet the requirements of different computing capabilities in different scenarios.
  • An embodiment of the disclosure also provides a method for training the video super-resolution network of the embodiment of the disclosure, including the following process:
  • each HR sequence has 7 frames, and each HR video frame has a height of sH and a width of sW, where s is the spatial super-resolution multiple.
  • the HR sequences can be down-sampled in time and space simultaneously to obtain blocky low-resolution video frame sequences (LR sequences for short) of size 5 × H × W. Setting smaller H and W values during training can reduce training time and increase the complexity of the data set. All training data are normalized so that their pixel values lie in the (0, 1) interval, which is better for network training. Through the above processing, a sufficient number of LR sequences and HR sequences are obtained.
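  • A minimal PyTorch sketch of this degradation step; which 5 of the 7 frames are kept, the bicubic kernel, and the default scale factor s=4 are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def make_lr_sequence(hr, s=4, t_keep=(0, 2, 3, 4, 6)):
    """Degrade one HR training sequence into its LR counterpart: temporal
    subsampling, then spatial bicubic downsampling, with pixel values
    normalized to (0, 1). hr: (7, 3, s*H, s*W) uint8 tensor."""
    hr = hr.float() / 255.0                      # normalize to (0, 1)
    lr = hr[list(t_keep)]                        # temporal downsampling: 7 -> 5 frames
    lr = F.interpolate(lr, scale_factor=1 / s,   # spatial downsampling: sH x sW -> H x W
                       mode="bicubic", align_corners=False)
    return lr.clamp(0, 1), hr
```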
  • the LR sequence is used as the input data of the video super-resolution network, and the HR sequence is used as the target data of the video super-resolution network to train the generation network.
  • the output of the generator network is a super-resolution video frame sequence (referred to as SR sequence) of the same size as the HR sequence.
  • the SR sequence (i.e., the fake samples) and the HR sequence (i.e., the real samples) are sent to the discrimination network as its training input data, with the HR sequences and SR sequences each accounting for 50%; the discrimination network outputs the judgment result, that is, the probability that the input data is real, which can also be described as the probability that the input data is an HR sequence.
  • the judgment results of the discrimination network on the SR sequences and HR sequences are used to calculate the loss of the discrimination network and the adversarial loss of the generation network, and the mean square error (MSE: Mean Square Error) between the SR sequence output by the generation network and the HR sequence can be used as a loss function.
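  • One way to form these losses in PyTorch, as a sketch; the patent does not fix the loss weighting, and the 1e-3 adversarial weight below is an assumption.

```python
import torch
import torch.nn.functional as F

def gan_losses(d_real, d_fake, sr, hr):
    """d_real / d_fake: discriminator outputs in (0, 1) for HR and SR inputs;
    sr / hr: the generated and target video frame sequences."""
    eps = 1e-8
    # discriminator loss: push HR judgments toward 1 and SR judgments toward 0
    d_loss = -(torch.log(d_real + eps) + torch.log(1 - d_fake + eps)).mean()
    # generator loss: adversarial term plus MSE between SR and HR sequences
    adv_loss = -torch.log(d_fake + eps).mean()
    g_loss = F.mse_loss(sr, hr) + 1e-3 * adv_loss  # weight is an assumption
    return d_loss, g_loss
```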
  • the video super-resolution network is implemented on an Nvidia GTX 1080Ti GPU using the PyTorch platform (an open-source Python machine learning library). Both the training set and the test set of the experiment use Vimeo-90K. 4x super-resolution is achieved on the video frame images, and the frame rate is increased by 2x.
  • the video encoding end may not be able to provide high-resolution video due to various objective limitations, for example, insufficient camera resolution, insufficient network bandwidth, or insufficient source resources. Video super-resolution based on deep learning can better restore image details; therefore, video super-resolution processing can be used to enhance video quality, present high-quality video to users, and improve the subjective visual effect of images.
  • An embodiment of the present disclosure provides a video super-resolution processing method, as shown in FIG. 6 , including:
  • Step 110: extracting a first feature from the first video frame sequence based on 3D convolution;
  • Step 120: extracting temporal and/or spatial second features from the first feature based on the 3D residual attention mechanism;
  • Step 130: based on 3D convolution and 3D upsampling, realizing feature fusion of the second features and spatio-temporal feature super-resolution, and reconstructing a video frame sequence based on 3D convolution to generate a second video frame sequence, where the resolution of the second video frame sequence is greater than the resolution of the first video frame sequence.
  • the video super-resolution processing method is implemented based on the video super-resolution network described in any embodiment of the present disclosure
  • the image resolution of the second video frame sequence is greater than the image resolution of the first video frame sequence, and/or the video frame rate of the second video frame sequence is greater than the video frame rate of the first video frame sequence.
  • Video super-resolution can be used in various aspects of the video compression process, such as video post-processing at the decoding end, video pre-processing at the encoding end, and video encoding and decoding. Below are a few examples to illustrate.
  • One way to deal with it is to use the conventional video encoding method but increase the intensity of compression, such as increasing the quantization step size, to encode a video frame sequence with a lower bit rate, and then improve the video quality through video super-resolution at the decoding end. That is to say, video super-resolution is applied to the post-processing process in video decoding. For example, the resolution of the reconstructed video frame can be increased by performing super-resolution processing on the decoded video frame sequence output by the decoder in the video playback device.
  • in this scenario, the first video frame sequence in the video super-resolution processing method shown in FIG. 6 is the decoded video frame sequence output by decoding the code stream;
  • the video super-resolution processing is used to increase the resolution of the decoded video frame sequence.
  • video super-resolution processing can be used to replace the original post-filtering, or the original post-processing filtering can be retained and video super-resolution processing can be added.
  • FIG. 7A is a structural block diagram of the video decoding end in this application scenario, which, as shown, includes:
  • the video decoder 101 is configured to decode an encoded video stream (code stream for short) to obtain a first video frame sequence;
  • the video super-resolution network 103 is configured to perform video super-resolution processing on the first video frame sequence to obtain a second video frame sequence whose resolution is greater than that of the first video frame sequence;
  • the display 105 is configured to display and play the second video frame sequence.
  • the video decoder 101 in this embodiment may adopt the video decoder shown in FIG. 7B .
  • the structure of the video decoder can be used for video decoding of H.264/AVC, H.265/HEVC, VVC/H.266 and other similar standards.
  • the video decoder 101 may also use other types of video decoders, such as neural network-based video decoders in the end-to-end video coding and decoding technology.
  • the video decoder 101 includes an entropy decoding unit 150, a prediction processing unit 152, an inverse quantization unit 154, an inverse transformation processing unit 156, a reconstruction unit 158 (indicated by a circle with a plus sign in the figure), a filter unit 159, and picture buffer 160.
  • in other examples, the video decoder 101 may contain more, fewer or different functional components.
  • the entropy decoding unit 150 may perform entropy decoding on the received code stream to extract information such as syntax elements, quantized coefficient blocks, and PU motion information.
  • the prediction processing unit 152 , the inverse quantization unit 154 , the inverse transform processing unit 156 , the reconstruction unit 158 and the filter unit 159 can all perform corresponding operations based on the syntax elements extracted from the code stream.
  • the inverse quantization unit 154 may inverse quantize the quantized TU-associated coefficient blocks.
  • Inverse transform processing unit 156 may apply one or more inverse transforms to the inverse quantized coefficient block in order to generate the reconstructed residual block of the TU.
  • the prediction processing unit 152 includes an inter prediction processing unit 162 and an intra prediction processing unit 164. If a PU is encoded using intra-frame prediction, the intra-frame prediction processing unit 164 can determine the intra-frame prediction mode of the PU based on the syntax elements parsed from the code stream, and perform intra prediction according to the determined intra-frame prediction mode and the reconstructed reference information of adjacent PUs obtained from the picture buffer 160, resulting in the prediction block of the PU. If the PU is encoded using inter-prediction, the inter-prediction processing unit 162 may determine one or more reference blocks for the PU based on the motion information of the PU and the corresponding syntax elements to generate the prediction block of the PU.
  • the reconstruction unit 158 may obtain the reconstruction block of the CU based on the reconstruction residual block associated with the TU and the prediction block of the PU generated by the prediction processing unit 152 (ie intra prediction data or inter prediction data).
  • the above display 105 may be, for example, a liquid crystal display, a plasma display, an organic light emitting diode display or other types of display devices.
  • the decoding end may not include the display 105, but may include other devices that can apply the decoded data.
  • the embodiments of the present disclosure can be used to solve problems such as image quality loss and video frame rate drop generated in the video compression process.
  • the video super-resolution network By applying the video super-resolution network to the post-processing of the decoding end, the temporal-spatial super-resolution of the decoded output video frame sequence can improve the quality of the video image.
  • the frame rate In order to meet the frame rate requirements of the decoding end, the frame rate can also be increased during post-processing, so as to present users with high-quality video with higher resolution and higher frame rate.
  • the video super-resolution network is used to enhance the quality of the decoded video frame sequence, and the encoding end is not required to down-sample the video frame sequence during video preprocessing.
  • in some embodiments, the first video frame sequence is a decoded video frame sequence output by decoding the code stream; the video super-resolution processing method further includes: parsing, from the code stream, network parameter information of the video super-resolution network, and setting the network parameters of the video super-resolution network according to the network parameter information.
  • different network parameters such as the number of RABs in the generation network can be configured for the video super-resolution network to achieve a better super-resolution effect.
  • Appropriate network parameters can be generated by the encoding end and written into the code stream, and the decoding end can parse the network parameters from the code stream and configure them to achieve better quality enhancement effects.
  • video super-resolution is applied to a video preprocessing process.
  • in this scenario, the acquired original video frame sequence is input into the video super-resolution network of the embodiment of the present disclosure for processing, and an output video with higher resolution and higher frame rate is obtained; the output video is then used as the input video of the video encoder for encoding.
  • the first video frame sequence in the video super-resolution processing method shown in FIG. 6 is the original video frame sequence collected by a video acquisition device, and the video super-resolution Processing can increase the resolution of a sequence of raw video frames.
  • Adaptive Resolution Change allows video frames of different resolutions to be transmitted according to the network status: low-resolution video frames are transmitted when the network bandwidth is low, and original-resolution video frames are transmitted when the bandwidth is high.
  • if the encoder wants to change the resolution during video transmission, it needs to insert an instantaneous decoding refresh (IDR: Instantaneous Decoding Refresh) frame, or a similar frame, that matches the new resolution.
  • the transmission of IDR frames requires a relatively high bit rate, and delays will be introduced for video conferencing applications. If the IDR frame is not inserted, the different resolutions of the current frame and the reference frame will cause problems during inter-frame prediction.
  • VP9 is an open video compression standard developed by Google
  • reference picture resampling (RPR) has been introduced in Versatile Video Coding (VVC). Image-based RPR puts the reference pictures both before and after resampling into the decoded picture buffer (DPB: Decoded Picture Buffer), and during prediction, a reference picture of the corresponding resolution is found in the DPB.
  • video super-resolution is applied to the processing of RPR in the video coding process.
  • the reference image that needs to be up-sampled is obtained from the DPB of the device (it can be one or more frames of reference images, and only the image resolution is increased);
  • the video super-resolution processing can realize the up-sampling of the reference image, obtaining a reference image with a larger image resolution for selection during inter-frame prediction.
  • the video encoder 1000 shown in FIG. 8A can be used to implement RPR, and it includes an image resolution adjustment unit 1115, in which the super-resolution network of the embodiment of the present disclosure can be used to realize up-sampling of reference images.
  • the video encoder 1000 includes a prediction processing unit 1100, a division unit 1101, a residual generation unit 1102, a transformation processing unit 1104, a quantization unit 1106, an inverse quantization unit 1108, an inverse transformation processing unit 1110, a reconstruction unit 1112, a filter unit 1113, a decoded picture buffer 1114, an image resolution adjustment unit 1115, and an entropy encoding unit 1116.
  • the prediction processing unit 1100 includes an inter prediction processing unit 1121 and an intra prediction processing unit 1126.
  • in other examples, the video encoder may contain more, fewer or different functional components than in this example.
  • the division unit 1101 cooperates with the prediction processing unit 1100 to divide the received video data into slices (Slices), CTUs or other larger units.
  • the video data received by the dividing unit 1101 may be a video sequence including video frames such as I frames, P frames, or B frames.
  • the prediction processing unit 1100 may divide a CTU into CUs, and perform intra-frame predictive coding or inter-frame predictive coding on the CUs.
  • the CU can be divided into one or more prediction units (PU: prediction unit).
  • the inter prediction processing unit 1121 may perform inter prediction on the PU to generate prediction data of the PU, the prediction data including the prediction block of the PU, motion information of the PU and various syntax elements.
  • the intra prediction processing unit 1126 may perform intra prediction on the PU to generate prediction data for the PU.
  • the prediction data for a PU may include the prediction block and various syntax elements for the PU.
  • the residual generation unit 1102 may generate the residual block of the CU by subtracting the prediction blocks of the PUs into which the CU is divided from the original block of the CU.
  • the transform processing unit 1104 may divide the CU into one or more transform units (TU: Transform Unit), and the residual block associated with the TU is a sub-block obtained by dividing the residual block of the CU.
  • a TU-associated coefficient block is generated by applying one or more transforms to the TU-associated residual block.
  • the quantization unit 1106 can quantize the coefficients in the coefficient block based on the selected quantization parameter, and the degree of quantization of the coefficient block can be adjusted by adjusting the QP value.
  • the inverse quantization unit 1108 and the inverse transformation unit 1110 may respectively apply inverse quantization and inverse transformation to the coefficient blocks to obtain TU-associated reconstruction residual blocks.
  • the reconstruction unit 1112 may add the reconstruction residual block and the prediction block generated by the prediction processing unit 1100 to generate a reconstruction block of the CU.
  • after the filter unit 1113 performs loop filtering on the reconstructed block, the block is stored in the decoded picture buffer 1114 as a reference image.
  • the intra prediction processing unit 1126 may extract reference images of blocks adjacent to the PU from the decoded picture buffer 1114 to perform intra prediction.
  • the inter prediction processing unit 1121 may perform inter prediction on the PU of the current frame image using the reference image of the previous frame buffered in the decoded picture buffer 1114 .
  • the image resolution adjustment unit 1115 resamples the reference images stored in the decoded picture buffer 1114 , which may include upsampling and/or downsampling, and obtains reference images of various resolutions and stores them in the decoded picture buffer 1114 .
  • the entropy encoding unit 1116 may perform an entropy encoding operation on the received data (such as syntax elements, quantized coefficient blocks, motion information, etc.).
  • Scalable video coding introduces concepts such as base layer (BL: Base Layer) and enhancement layer (EL: Enhance Layer), and transmits important information (bits) for decoding images in a guaranteed channel. This collection of important information is called the base layer.
  • the secondary information (bits) is transmitted in an unguaranteed channel, and the collection of these data information is called an enhancement layer.
  • even if some or all of the enhancement layer information is lost, the decoder can still recover acceptable image quality from the base layer information.
  • there are many types of scalable video coding, such as spatial scalable coding, temporal scalable coding, frequency-domain scalable coding, and quality scalable coding.
  • spatial scalable coding generates multiple images with different spatial resolutions for each video frame; decoding the base layer code stream yields the low-resolution image, and if the enhancement layer code stream is also fed to the decoder, a high-resolution image is obtained.
  • an exemplary scalable video coding framework is shown in Fig. 8B; the coding framework includes a base layer, a first enhancement sublayer (the L1 layer), and a second enhancement sublayer (the L2 layer). Only the parts of the encoding architecture closely related to upsampling and downsampling are shown in the figure.
  • the input video frame sequence is sent to the basic encoder 805 for encoding after being down-sampled twice by the first down-sampling unit 801 and the second down-sampling unit 803, and the coded base layer code stream is output.
  • the reconstructed video frame is up-sampled in the first up-sampling unit 807 to obtain the reconstructed video frame of the L1 layer.
  • the first subtractor 806 subtracts the reconstructed video frame of the L1 layer from the original video frame of the L1 layer output by the first downsampling unit 801 to obtain a residual of the L1 layer.
  • the reconstructed video frame of the L1 layer and the reconstruction residual of the L1 layer are added together in the adder 808 and then up-sampled by the second upsampling unit 809 to obtain the reconstructed video frame of the L2 layer.
  • the second subtractor 810 subtracts the reconstructed video frame of the L2 layer from the input video frame sequence to obtain a residual of the L2 layer.
  • the scalable video coding framework may also include 3 or more enhancement sub-layers.
  • in some application scenarios, video super-resolution is applied to a video coding architecture including a base layer and an enhancement layer, such as an encoder for Low Complexity Enhancement Video Coding (LCEVC), to generate enhancement layer data at the encoding side.
  • the video super-resolution network of the embodiments of the present disclosure can be used to implement the up-sampling unit in the scalable video coding architecture.
  • in this scenario, the first video frame sequence in the video super-resolution processing method shown in FIG. 6 is the reconstruction video frame sequence of the base layer, or the reconstruction video frame sequence of an enhancement sublayer (such as the L1 layer), generated in a scalable video coding architecture including a base layer and an enhancement layer; the video super-resolution processing realizes the upsampling of the reconstruction video frame sequence for generating the corresponding enhancement sublayer residuals.
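  • A minimal sketch of this encoding-side data flow, with the video super-resolution network standing in for the upsampling units of Fig. 8B. base_codec (base encode plus reconstruction) and sr_upsample (the SR network) are assumed callables, the 2x downsampling factors are illustrative, and the L1 residual is used directly rather than its coded reconstruction.

```python
import torch.nn.functional as F

def enhancement_residuals(x_l2, base_codec, sr_upsample):
    """x_l2: input video frames at full (L2) resolution, shape (N, C, H, W)."""
    x_l1 = F.avg_pool2d(x_l2, 2)           # first downsampling unit 801 (L2 -> L1)
    x_l0 = F.avg_pool2d(x_l1, 2)           # second downsampling unit 803 (L1 -> base)
    rec_l0 = base_codec(x_l0)              # basic encoder 805 + reconstruction
    rec_l1 = sr_upsample(rec_l0)           # first upsampling unit 807
    res_l1 = x_l1 - rec_l1                 # first subtractor 806: L1-layer residual
    rec_l2 = sr_upsample(rec_l1 + res_l1)  # adder 808 + second upsampling unit 809
    res_l2 = x_l2 - rec_l2                 # second subtractor 810: L2-layer residual
    return res_l1, res_l2
```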
  • video super-resolution is applied to a scalable video decoding architecture including a base layer and an enhancement layer.
  • An exemplary scalable video decoding architecture is shown in FIG. 8C.
  • the decoding architecture includes a base layer, a first enhancement sublayer (the L1 layer), and a second enhancement sublayer (the L2 layer), but may also include one enhancement sublayer, or three or more enhancement sublayers. Only the parts of the decoding architecture closely related to upsampling are shown in the figure.
  • the decoded video frame sequence of the base layer output by the base decoder 901 is up-sampled by the first up-sampling unit 903 to obtain an initial intermediate picture (Preliminary Intermediate Picture).
  • the initial intermediate image and the decoded data of the L1 layer are added in the first adder 904 to obtain a combined intermediate image (Combined Intermediate Picture) of the L1 layer.
  • the combined intermediate image is up-sampled by the second up-sampling unit 905 to obtain an initial output image (Preliminary Output Picture).
  • the initial output image and the decoded data of the L2 layer are added in the second adder 906 to obtain an output video frame sequence.
  • in this scenario, the first video frame sequence is the decoded video frame sequence of the base layer, or the combined intermediate image (which can be one or more images) of an enhancement sublayer; the video super-resolution processing realizes the upsampling of the decoded video frame sequence to generate the initial intermediate image, or the upsampling of the combined intermediate image to generate the initial output image.
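  • A matching sketch of the Fig. 8C decoding flow, again with the SR network serving as both upsampling units; base_decoder and sr_upsample are assumed callables, and l1_data / l2_data stand for the decoded enhancement-layer data.

```python
def scalable_decode(base_bits, l1_data, l2_data, base_decoder, sr_upsample):
    base = base_decoder(base_bits)      # base decoder 901: base-layer frames
    prelim = sr_upsample(base)          # first upsampling unit 903 -> preliminary intermediate picture
    combined = prelim + l1_data         # first adder 904 -> combined intermediate picture
    prelim_out = sr_upsample(combined)  # second upsampling unit 905 -> preliminary output picture
    return prelim_out + l2_data         # second adder 906 -> output video frame sequence
```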
  • in another application scenario, before encoding the video, the video encoding end first determines whether to perform downsampling according to the current situation; for example, when resources such as bandwidth are insufficient, downsampling is used to reduce the amount of encoded data, so that the code stream traffic is greatly reduced. After the video decoding end completes the decoding of the code stream, it judges whether to perform super-resolution on the decoded video frame sequence.
  • when the network bandwidth is small, only the basic video stream encoded after downsampling is transmitted; when the network bandwidth is large, no downsampling is performed, which is equivalent to transmitting enhanced video information. This provides adaptability, ensuring that most network-connected terminals can use an appropriate code stream to transmit multimedia information.
  • this scheme is superior to a scheme in which the encoding end directly encodes the video frames at the same bit rate and the decoding end uses a super-resolution network to enhance the quality of the decoded images.
  • An embodiment of the present disclosure provides a video coding processing method, as shown in FIG. 9 , including:
  • Step 210: when performing video preprocessing, determine whether to downsample the video frame sequence from the data source; if not, perform step 220; if yes, perform step 230;
  • Step 220: when it is determined not to perform downsampling, directly input the video frame sequence from the data source into the video encoder for video encoding, generate a code stream, and end;
  • Step 230: when downsampling is determined, downsample the video frame sequence from the data source, input the downsampled video frame sequence into the video encoder for video encoding, and generate a code stream.
  • the video encoding process referred to herein includes video preprocessing and video encoding.
  • the video preprocessing may include processing such as downsampling.
  • the video decoding processing referred to herein includes video decoding and video post-processing, and the video post-processing may include the video super-resolution processing in the embodiments of the present disclosure.
  • downsampling the video frame sequence from the data source includes: downsampling the image resolution and/or video frame rate of the video frame sequence from the data source.
  • an appropriate down-sampling multiple can be selected according to bandwidth and other factors, so that the encoded code rate can adapt to the bandwidth.
  • in some embodiments, the video encoding processing method further includes: when performing video encoding, writing a downsampling flag into the code stream, where the downsampling flag is used to indicate whether the preprocessing performed by the encoding end on the video frame sequence from the data source includes downsampling.
  • if the encoding end performs downsampling when preprocessing the video frame sequence from the data source, and the video super-resolution network is trained based on real video frame sequences and first video frame sequences obtained by downsampling the real video frame sequences, then the encoding end downsamples the video frames from the data source and performs compression encoding to generate a code stream, the decoding end decodes the code stream to reconstruct the first video frame sequence, and the decoding end uses the video super-resolution network to perform video super-resolution processing on the reconstructed first video frame sequence, which significantly improves video quality.
  • in this case, the application scenario of the video super-resolution network is similar to its training scenario: both are used to restore the resolution of downsampled video frames. If, however, the video frames are not downsampled at the encoding end, then even if the decoded video quality does not meet requirements, using a video super-resolution network trained in the above manner at the decoding end to enhance the quality of the decoded video frame sequence will have a limited effect, or no effect, on video quality.
  • The encoding end therefore generates the above downsampling flag and writes it into the code stream, so that the decoding end can determine whether to perform video super-resolution processing according to the downsampling flag, either alone or together with other conditions, which helps the decoding end make a reasonable decision on whether to perform video super-resolution processing.
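  • The exact syntax of the flag is not fixed here; a minimal sketch, assuming a hypothetical one-bit field in a sequence-level header and simple `bit_writer`/`bit_reader` helpers, might look as follows:

```python
def write_downsampling_flag(bit_writer, did_downsample: bool) -> None:
    # Hypothetical syntax element: one bit indicating whether preprocessing
    # of this video frame sequence included downsampling.
    bit_writer.write_bit(1 if did_downsample else 0)

def read_downsampling_flag(bit_reader) -> bool:
    return bit_reader.read_bit() == 1
```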
  • Determining whether to downsample the video frame sequence from the data source includes: determining to downsample the video frame sequence from the data source when any of the following conditions is met:
  • the bandwidth available for transmitting the video code stream is less than the bandwidth required to transmit the video code stream without downsampling;
  • the resources at the encoding end do not support direct video encoding of the video frame sequence from the data source;
  • the video frame sequence from the data source belongs to a specified video frame sequence that needs to be downsampled.
  • The video encoding processing method further includes: when performing video encoding, acquiring the network parameters of the video super-resolution network corresponding to the video frame sequence from the data source, and writing the network parameters into the code stream.
  • For example, for a given video resource, the encoding end can prepare training samples from the video resource in advance and train the video super-resolution network, thereby obtaining the network parameters of the video super-resolution network corresponding to that video resource; the network parameters are then saved together with the video resource, and when the video resource is encoded, the network parameters are read and encoded into the code stream. In this way, the decoding end can parse out the network parameters, use them to configure the video super-resolution network, and obtain the expected quality enhancement effect.
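  • One way to realize this, sketched below under the assumption that the network is a PyTorch model (the description mentions a PyTorch implementation) and that the payload bytes are carried in or alongside the code stream:

```python
import io
import torch

def pack_sr_network_params(sr_network: torch.nn.Module) -> bytes:
    """Encoder side: serialize the trained network parameters for the stream."""
    buf = io.BytesIO()
    torch.save(sr_network.state_dict(), buf)
    return buf.getvalue()

def configure_sr_network(sr_network: torch.nn.Module, payload: bytes) -> None:
    """Decoder side: restore the parameters parsed from the code stream."""
    state = torch.load(io.BytesIO(payload), map_location="cpu")
    sr_network.load_state_dict(state)
```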
  • The video encoding processing method of this embodiment can determine, according to bandwidth and other conditions, whether to downsample when preprocessing the video frames, so that the encoding end can adaptively select an appropriate encoding processing method to adapt to changes in the network environment, encoding resources, and the like.
  • An embodiment of the present disclosure also provides a video decoding processing method, as shown in FIG. 10, including:
  • Step 310: decode the code stream to obtain a first video frame sequence;
  • Step 320: judge whether the first video frame sequence satisfies a set super-resolution condition; if so, execute step 330; if not, execute step 340;
  • Step 330: when the set super-resolution condition is met, output the first video frame sequence to the video super-resolution network for video super-resolution processing to obtain a second video frame sequence, the resolution of the second video frame sequence being greater than that of the first video frame sequence;
  • Step 340: when the set super-resolution condition is not met, skip the video super-resolution processing of the first video frame sequence.
  • After the video super-resolution processing is skipped, or after it is performed to obtain the second video frame sequence, subsequent post-decoding processing, or video display and playback, may be carried out.
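  • A minimal sketch of this decoding flow, with `decoder`, `sr_network`, and the condition test as assumed callables:

```python
def decode_and_maybe_superresolve(bitstream, decoder, sr_network, sr_condition):
    """Sketch of steps 310-340 on the decoding side."""
    frames = decoder.decode(bitstream)    # step 310: decode the code stream
    if sr_condition(frames, bitstream):   # step 320: set super-resolution condition
        frames = sr_network(frames)       # step 330: spatio-temporal super-resolution
    return frames                         # step 340 otherwise: super-resolution skipped
```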
  • In one example of this embodiment, the video super-resolution network adopts the video super-resolution network described in any embodiment of the present disclosure; in other examples, other video super-resolution networks may also be used to perform the video super-resolution processing of this embodiment.
  • The video super-resolution network includes a generation network; when the generation network is trained, a first video frame sequence serving as a sample is used as the input data and a real video frame sequence is used as the target data, where the resolution of the real video frame sequence is the same as that of the second video frame sequence, and the sample first video frame sequence is obtained by downsampling the real video frame sequence.
  • That is, the input during training of the generation network is the sample first video frame sequence, while the input of the trained generation network can be the decoded first video frame sequence (or a first video frame sequence from a data source, etc.); the sample first video frame sequence and the decoded first video frame sequence have the same resolution, although their content may differ.
  • A video super-resolution network trained according to this example is suited to restoring a low-resolution video frame sequence that has been downsampled and then compression-encoded and decoded to a high-resolution video frame sequence.
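  • A sketch of building one such training pair, reusing the `downsample_sequence` helper assumed above; the optional compression round trip mimics the downsample-then-codec degradation described here, with `codec` as an assumed helper object:

```python
def make_training_pair(hr_sequence, codec=None, spatial_factor=2, temporal_factor=2):
    """Return (input, target): the target is the real (HR) sequence, the input
    is its downsampled and optionally compressed/decoded copy."""
    lr = downsample_sequence(hr_sequence, spatial_factor, temporal_factor)
    if codec is not None:
        lr = codec.decode(codec.encode(lr))  # add compression artifacts
    return lr, hr_sequence
```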
  • Decoding the code stream further yields a downsampling flag, where the downsampling flag is used to indicate whether the encoding end's preprocessing of the first video frame sequence includes downsampling.
  • The set super-resolution condition at least includes: the downsampling flag indicates that the encoding end's preprocessing of the first video frame sequence includes downsampling.
  • When the downsampling flag indicates that the encoding end's preprocessing of the first video frame sequence does not include downsampling, it may be determined to skip the video super-resolution processing of the first video frame sequence.
  • The downsampling flag as such indicates whether the encoding end's preprocessing of video frames includes downsampling; saying that the flag here indicates whether the preprocessing of the first video frame sequence includes downsampling means that the flag is associated with the first video frame sequence, for example, they belong to the same coding unit.
  • Where the video super-resolution network is trained on samples obtained by downsampling, the downsampling flag can help the decoding end determine whether the encoding end downsampled during video preprocessing, and thus better judge whether to perform video super-resolution processing. Deciding purely on the decoded video quality, that is, performing video super-resolution when the quality falls below a fixed threshold and skipping it when the quality reaches the threshold, without regard to the expected effect of super-resolution, is rather mechanical and has limitations. If the encoding end performed downsampling and the decoded video quality just reaches the threshold, video super-resolution can still be performed to improve the video quality; conversely, if the encoding end did not perform downsampling and the decoded video quality falls below the threshold due to other factors, such as poor camera resolution or heavy noise on the transmission path, video super-resolution may be omitted.
  • The set super-resolution condition includes one or any combination of the following conditions:
  • the image quality of the first video frame sequence is lower than a set quality requirement;
  • the encoding end's preprocessing of the first video frame sequence includes downsampling;
  • the video super-resolution function of the decoding end is available.
  • The super-resolution conditions listed above can be used in combination: for example, super-resolution processing of the first video frame sequence may be performed only when the image quality of the first video frame sequence is lower than the set quality requirement, the encoding end downsampled the first video frame sequence, and the video super-resolution function of the decoding end is available. The conditions here are not exhaustive, and other conditions are possible.
  • The above quality requirement can be expressed by set evaluation indicators such as Peak Signal-to-Noise Ratio (PSNR), Structural Similarity (SSIM), and Mean Square Error (MSE).
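  • For reference, MSE and PSNR between a decoded frame and a reference frame can be computed as in the following sketch (8-bit pixels assumed):

```python
import numpy as np

def mse(ref: np.ndarray, test: np.ndarray) -> float:
    return float(np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2))

def psnr(ref: np.ndarray, test: np.ndarray, peak: float = 255.0) -> float:
    m = mse(ref, test)
    return float("inf") if m == 0 else 10.0 * np.log10(peak ** 2 / m)
```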
  • This embodiment applies the video super-resolution network to the video processing flow. Before compression encoding, the video is downsampled in space and time, which greatly reduces the amount of video data to be encoded; after decoding, the trained video super-resolution network performs the corresponding upsampling to restore the original video. Overall, the bit rate is significantly reduced, the coding efficiency is greatly improved, and the transmitted code stream is smaller.
  • An embodiment of the present disclosure also provides a video encoding and decoding system, as shown in FIG. 11, including an encoding end device and a decoding end device.
  • The encoding end device includes a data source 201 and a video encoding processing device 200. The data source 201 may be a video capture device (for example, a camera), an archive containing previously captured data, a feed interface for receiving data from a content provider, a computer graphics system that generates the data, or a combination of these sources.
  • The video encoding processing device 200 can be implemented using any one of, or any combination of, the following circuits: one or more microprocessors, digital signal processors, application-specific integrated circuits, field-programmable gate arrays, discrete logic, or hardware. If the present disclosure is implemented partially in software, the instructions for the software may be stored in a suitable non-transitory computer-readable storage medium and executed in hardware using one or more processors to implement the disclosed method. Based on the above circuits, the video encoding processing device 200 may implement the video encoding processing method described in any embodiment of the present disclosure.
  • As shown in FIG. 11, the video encoding processing device 200 includes:
  • a downsampling decision device 203, configured to determine, when performing video preprocessing, whether to downsample the video frame sequence from the data source; when downsampling is determined, output the video frame sequence from the data source to the downsampling device, and when no downsampling is determined, output the video frame sequence from the data source directly to the video encoder for encoding;
  • a downsampling device 205, configured to downsample the input video frame sequence and output the downsampled video frame sequence to the video encoder for encoding;
  • a video encoder 207, configured to perform video encoding on the video frame sequence from the data source or on the downsampled video frame sequence.
  • The downsampling decision device 203 determines whether to downsample the video frame sequence from the data source by determining to downsample when any of the following conditions is met:
  • the bandwidth available for transmitting the video code stream is less than the bandwidth required to transmit the video code stream without downsampling;
  • the resources at the encoding end do not support direct video encoding of the video frame sequence from the data source;
  • the video frame sequence from the data source belongs to a specified video frame sequence that needs to be downsampled.
  • The downsampling device 205 downsamples the video frame sequence from the data source by downsampling the image resolution and/or the video frame rate of the video frame sequence from the data source.
  • The downsampling decision device 203 is further configured to generate a downsampling flag and output it to the video encoder 207, where the downsampling flag is used to indicate whether the encoding end's preprocessing of the video frame sequence from the data source includes downsampling; the video encoder 207 is further configured to write the downsampling flag into the code stream when performing video encoding.
  • The downsampling flag here can indicate whether the encoding end's preprocessing of the video frame sequence from the data source includes downsampling, meaning that the flag is associated with the video frame sequence from the data source, for example, they belong to the same coding unit.
  • The decoding end device includes a video decoding processing device 300 and a display 307; the display 307 may be a liquid crystal display, a plasma display, an organic light-emitting diode display, or another type of display device.
  • The video decoding processing device 300 can be implemented using any one of, or any combination of, the following circuits: one or more microprocessors, digital signal processors, application-specific integrated circuits, field-programmable gate arrays, discrete logic, or hardware. If the present disclosure is implemented partially in software, the instructions for the software may be stored in a suitable non-transitory computer-readable storage medium and executed in hardware using one or more processors to implement the disclosed method.
  • Based on the above circuits, the video decoding processing device 300 may implement the video decoding processing method described in any embodiment of the present disclosure.
  • The video decoding processing device 300 in turn includes:
  • a video decoder 301, configured to decode the code stream to obtain a first video frame sequence;
  • a super-resolution decision device 303, configured to judge whether the first video frame sequence satisfies a set super-resolution condition; when the set super-resolution condition is met, output the first video frame sequence to the video super-resolution network for video super-resolution processing, and when the set super-resolution condition is not met, determine to skip the video super-resolution processing of the first video frame sequence;
  • a video super-resolution network 305, configured to perform video super-resolution processing on the first video frame sequence to obtain a second video frame sequence whose resolution is greater than that of the first video frame sequence.
  • The video super-resolution network adopts the video super-resolution network described in any embodiment of the present disclosure.
  • When the generation network in the video super-resolution network is trained, a first video frame sequence serving as a sample is used as the input data and a real video frame sequence is used as the target data, the sample first video frame sequence being obtained by downsampling the real video frame sequence. A video super-resolution network trained in this way is suited to restoring a low-resolution video frame sequence that has been downsampled, compression-encoded, and decoded to a high-resolution video frame sequence, and has a good quality enhancement effect.
  • The video decoder decodes the code stream and further extracts a downsampling flag from it, the downsampling flag being used to indicate whether the encoding end's preprocessing of the first video frame sequence includes downsampling.
  • The super-resolution condition used by the super-resolution decision device at least includes: the downsampling flag indicates that the encoding end's preprocessing of the first video frame sequence includes downsampling. In an example, the super-resolution decision device may determine not to perform super-resolution processing on the first video frame sequence when the downsampling flag indicates that the encoding end's preprocessing of the first video frame sequence does not include downsampling.
  • The set super-resolution condition used by the super-resolution decision device includes one or any combination of the following conditions:
  • the image quality of the first video frame sequence is lower than a set quality requirement;
  • the encoding end's preprocessing of the first video frame sequence includes downsampling;
  • the video super-resolution function of the decoding end is available.
  • The super-resolution decision device may determine to skip the video super-resolution processing of the first video frame sequence when the first video frame sequence does not satisfy the set super-resolution condition.
  • Based on the video encoding and decoding system of this embodiment, the encoding end judges, according to factors such as the currently detected bandwidth environment, whether the video frame sequence needs to be downsampled; if so (for example, when the bandwidth is insufficient), it selects a corresponding downsampling multiple, downsamples the spatial resolution and/or temporal resolution of the video frame sequence, and then encodes it into a code stream for transmission. At the decoding end, a corresponding decoder performs decoding; since the quality of the decoded video frames is not high, they can be sent to the video super-resolution network for quality improvement, yielding a video with the required spatial and temporal resolution.
  • When the bandwidth becomes large, the encoding end can directly encode the video frame sequence from the data source into a code stream for transmission, and the decoding end can directly decode it to obtain high-quality video; in this case video super-resolution is not performed. Whether or not the encoding end downsamples, the same video encoder can be used, so the encoding operation is relatively simple and the resource occupation is small.
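  • Tying the two ends together, a minimal end-to-end sketch (reusing the assumed `preprocess_and_encode` helper from above) shows how the decoder only invokes super-resolution when the flag says the encoder downsampled:

```python
def transmit(frames, encoder, downsampler, decoder, sr_network,
             available_bw, required_bw):
    """End-to-end sketch: one encoder either way; SR only after downsampling."""
    stream = preprocess_and_encode(frames, encoder, downsampler,
                                   available_bw, required_bw)
    decoded = decoder.decode(stream)
    if stream.downsampling_flag:          # flag parsed from the code stream
        decoded = sr_network(decoded)     # restore spatial/temporal resolution
    return decoded
```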
  • An embodiment of the present disclosure also provides a video encoding processing device, as shown in FIG. 12, including a processor 5 and a memory 6 storing a computer program that can run on the processor 5, wherein the processor 5, when executing the computer program, implements the video encoding processing method described in any embodiment of the present disclosure.
  • An embodiment of the present disclosure also provides a video decoding processing device, as shown in FIG. 12, including a processor and a memory storing a computer program that can run on the processor, wherein the processor, when executing the computer program, implements the video decoding processing method described in any embodiment of the present disclosure.
  • An embodiment of the present disclosure further provides a video encoding and decoding system, including the video encoding processing device described in any embodiment of the present disclosure and the video decoding processing device described in any embodiment of the present disclosure.
  • An embodiment of the present disclosure further provides a code stream, wherein the code stream is generated according to the video encoding processing method described in an embodiment of the present disclosure, and the code stream contains the downsampling flag.
  • An embodiment of the present disclosure also provides a non-transitory computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the video encoding processing method or the video decoding processing method described in any embodiment of the present disclosure.
  • In one or more exemplary embodiments, the described functions may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on, or transmitted over, a computer-readable medium as one or more instructions or code and executed by a hardware-based processing unit.
  • Computer-readable media may include computer-readable storage media, which correspond to tangible media such as data storage media, or communication media, which include any medium that facilitates transfer of a computer program from one place to another, e.g., according to a communication protocol. In this manner, a computer-readable medium may generally correspond to a non-transitory tangible computer-readable storage medium or to a communication medium such as a signal or carrier wave.
  • Data storage media may be any available media that can be accessed by one or more computers or one or more processors to retrieve instructions, code, and/or data structures for implementing the techniques described in this disclosure. A computer program product may comprise a computer-readable medium.
  • By way of example and not limitation, such computer-readable storage media may include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, flash memory, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer.
  • Any connection may also properly be termed a computer-readable medium. For example, if instructions are transmitted from a website, server, or other remote source using coaxial cable, fiber-optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber-optic cable, twisted pair, DSL, or the wireless technologies such as infrared, radio, and microwave are included in the definition of medium. It should be understood, however, that computer-readable storage media and data storage media do not include connections, carrier waves, signals, or other transient media, but are instead directed to non-transient, tangible storage media.
  • The technical solutions of the embodiments of the present disclosure may be implemented in a wide variety of devices or apparatuses, including a wireless handset, an integrated circuit (IC), or a set of ICs (e.g., a chipset).
  • Various components, modules, or units are described in the disclosed embodiments to emphasize functional aspects of devices configured to perform the described techniques, but they do not necessarily require realization by different hardware units. Rather, as described above, the various units may be combined in a codec hardware unit or provided by a collection of interoperable hardware units (comprising one or more processors as described above) in conjunction with suitable software and/or firmware.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

A video super-resolution network and video super-resolution, encoding, and decoding processing methods and devices. Based on 3D convolution, the video super-resolution network performs shallow feature extraction and deep feature extraction on a low-resolution video frame sequence and reconstructs a higher-resolution video frame sequence. The discriminator network of the video super-resolution network uses two branches to judge detail features and motion information features, respectively. Performing super-resolution on video frames in the video compression process can enhance image quality. Super-resolution processing after decoding can also be combined with downsampling processing before encoding to achieve transmission and restoration of low-bit-rate images.

Description

视频超分辨网络及视频超分辨、编解码处理方法、装置 技术领域
本公开实施例涉及但不限于图像处理技术,更具体地,涉及一种视频超分辨网络及视频超分辨、编解码处理方法、装置。
背景技术
一般的视频压缩过程如图1所示,在编码端,包括视频采集、视频预处理、视频编码等过程。在解码端,包括视频解码、视频后处理和显示播放等过程。在视频预处理时,有时为了带宽、码率等限制会降低视频的帧率,同时视频压缩编码时也会带来图像质量的降低。视频解码后的视频后处理过程是提升视频质量的一个重要环节,但提升效果还有待增强。
发明概述
以下是对本文详细描述的主题的概述。本概述并非是为了限制权利要求的保护范围。
本公开一实施例提供了一种视频超分辨网络,包括生成网络,其中,所述生成网络包括依次连接的第一特征提取部分、第二特征提取部分和重建部分,其中:
所述第一特征提取部分设置为接收第一视频帧序列,基于3D卷积从所述第一视频帧序列中提取第一特征并输出;
所述第二特征提取部分,设置为接收所述第一特征,基于3D残差注意力机制从所述第一特征中提取时间和/或空间上的第二特征并输出;
所述重建部分设置为接收所述第二特征,基于3D卷积和3D上采样实现特征融合和特征的时空超分辨,及基于3D卷积重建视频帧序列,生成第二视频帧序列,所述第二视频帧序列的分辨率大于所述第一视频帧序列。
本公开一实施例还提供了一种视频超分辨处理方法,包括:
基于3D卷积从所述第一视频帧序列中提取第一特征;
基于3D残差注意力机制从所述第一特征中提取时间和/或空间上的第二特征;
基于3D卷积和3D上采样实现所述第二特征的特征融合和特征的时空超分辨,及基于3D卷积重建视频帧序列,生成第二视频帧序列,所述第二视频帧序列的分辨率大于所述第一视频帧序列的分辨率。
本公开一实施例还提供了一种视频解码处理方法,包括:
对码流进行解码,得到第一视频帧序列;
判断所述第一视频帧序列是否满足设定的超分辨条件;
在满足设定的超分辨条件的情况下,将所述第一视频帧序列输出到视频超分辨网络进行视频超分辨处理,得到第二视频帧序列,所述第二视频帧序列的分辨率大于所述第一视频帧序列的分辨率。
本公开一实施例还提供了一种视频编码处理方法,包括:
进行视频预处理时确定是否对来自数据源的视频帧序列进行下采样;
在确定不进行下采样的情况下,将来自数据源的所述视频帧序列直接输入视频编码器进行视频编码;
在确定进行下采样的情况下,对来自数据源的所述视频帧序列进行下采样,将下采样后的视频帧序列输入视频编码器进行视频编码。
本公开一实施例还提供了一种视频超分辨处理装置,包括处理器以及存储有可在所述处理器上运行的计算机程序的存储器,其中,所述处理器执行所述计算机程序时实现如本公开任一实施例所述的视频超分辨处理方法。
本公开一实施例还提供了一种视频解码处理装置,包括处理器以及存储有可在所述处理器上运行的计算机程序的存储器,其中,所述处理器执行所述计算机程序时实现如本公开任一实施例所述的视频解码处理方法。
本公开一实施例还提供了一种视频解码处理装置,包括:
视频解码器,设置为对码流进行解码,得到第一视频帧序列;
超分辨判决装置,设置为判断所述第一视频帧序列是否满足设定的超分辨条件,在满足设定的超分辨条件的情况下,将所述第一视频帧序列输出到视频超分辨网络进行视频超分辨处理;在不满足设定的超分辨条件的情况下,确定跳过对所述第一视频帧序列的视频超分辨处理;
视频超分辨网络,设置为对所述第一视频帧序列进行视频超分辨处理,得到分辨率大于所述第一视频帧序列的第二视频帧序列。
本公开一实施例还提供了一种视频编码处理装置,包括处理器以及存储有可在所述处理器上运行的计算机程序的存储器,其中,所述处理器执行所述计算机程序时实现如本公开任一实施例所述的视频编码处理方法。
本公开一实施例还提供了一种视频编码处理装置,其中,包括:
下采样判决模块,设置为进行视频预处理时确定是否对来自数据源的视频帧序列进行下采样,在确定进行下采样的情况下,将来自数据源的所述视频帧序列输出到下采样装置,在确定不进行下采样的情况下,将来自数据源的所述视频帧序列直接输出到视频编码器进行编码;
下采样装置,设置为对输入的视频帧序列进行下采样,将下采样后的视频帧序列输出到视频编码器进行编码;
视频编码器,设置为对来自数据源的所述视频帧序列或者下采样后的所述视频帧序列进行视频编码。
本公开一实施例还提供了一种视频编解码系统,包括如本公开实施例所述的视频编码处理装置和如本公开实施例所述的视频解码处理装置。
本公开一实施例还提供了一种码流,其中,所述码流包括根据如本公开实施例所述的视频编码处理方法生成,所述码流中包含所述下采样标志。
本公开一实施例还提供了一种非瞬态计算机可读存储介质,所述计算机可读存储介质存储有计算机程序,其中,所述计算机程序被处理器执行时实现如本公开任一实施例所述的方法。
在阅读并理解了附图和详细描述后,可以明白其他方面。
附图概述
附图用来提供对本公开实施例的理解,并且构成说明书的一部分,与本公开实施例一起用于解释本公开的技术方案,并不构成对本公开技术方案的限制。
图1是视频压缩过程的示意图;
图2是一种生成对抗网络的架构图;
图3是本公开一实施例生成网络的结构图;
图4是本公开一实施例3D残差注意力机制模型的结构示意图;
图5是本公开一实施例判别网络的结构图;
图6是本公开一实施例视频超分辨处理方法的流程图;
图7A是本公开一实施例对已解码视频帧序列进行超分辨的示意图;
图7B是本公开一实施例视频解码器的架构图;
图8A是本公开一实施例视频编码器的架构图;
图8B是本公开一实施例可分级视频编码架构的示意图,仅示出了与上采样和下采样密切相关的部分;
图8C是本公开一实施例可分级视频解码架构的示意图,仅示出与上采样密切相关的部分;
图9是本公开一实施例视频编码处理方法的流程图;
图10是与图9所示视频编码处理方法对应的本公开一实施例视频解码处理方法的流程图;
图11是本公开一实施例视频编解码系统的架构图;
图12是本公开一实施例视频编码处理装置的结构示意图。
详述
本公开描述了多个实施例,但是该描述是示例性的,而不是限制性的,并且对于本领域的普通技术人员来说显而易见的是,在本公开所描述的实施例包含的范围内可以有更多的实施例和实现方案。
本公开的描述中,“示例性的”或者“例如”等词用于表示作例子、例证或说明。本公开中被描述为“示例性的”或者“例如”的任何实施例不应被解释为比其他实施例更优选或更具优势。本文中的“和/或”是对关联对象的关联关系的一种描述,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况。“多个”是指两个或多于两个。另外,为了便于清楚描述本公开实施例的技术方案,采用了“第一”、“第二”等字样对功能和作用基本相同的相同项或相似项进行区分。本领域技术人员可以理解“第一”、“第二”等字样并不对数量和执行次序进行限定,并且“第一”、“第二”等字样也并不限定一定不同。
在描述具有代表性的示例性实施例时,说明书可能已经将方法和/或过程呈现为特定的步骤序列。然而,在该方法或过程不依赖于本文所述步骤的特定顺序的程度上,该方法或过程不应限于所述的特定顺序的步骤。如本领域普通技术人员将理解的,其它的步骤顺序也是可能的。因此,说明书中阐述的步骤的特定顺序不应被解释为对权利要求的限制。此外,针对该方法和/或过程的权利要求不应限于按照所写顺序执行它们的步骤,本领域技术人员可以容易地理解,这些顺序可以变化,并且仍然保持在本公开实施例的精神和范围内。
请参照图1,视频后处理主要是针对视频预处理、视频编码和视频解码过程中的质量损失进行的,以增强视频的图像质量和提升视频的帧数。为了提升视频质量,一些方法是用滤波器对压缩后的图像进行滤波处理。但这些方法主要是通过平滑视频压缩引入的失真以达到提升图像的视觉效果,而不是恢复图像本身的像素值。为了提升视频帧的帧率,可以采用非基于块匹配的帧率提升算法和基于块匹配的帧率提升算法。非基于块匹配的帧率提升算法不考虑图像中的物体运动,只是利用相邻视频帧进行线性插值产生新的视频帧,这种算法运算复杂度低,但视频帧存在抖动、模糊现象。而基于块匹配的帧率提升算法通过估计物体的运动向量,在物体的运动轨迹上插值来提高视频帧的帧率。插值得到的视频帧的质量会有所提升,但复杂度较高。
超分辨率(SR:Super-Resolution)(文中简称为超分辨)是指通过硬件或软件方法提高原有图像的分辨率。通过一幅或者多幅低分辨率(LR:Low Resolution)图像来得到一幅高分辨率(HR:High Resolution)图像的过程就是超分辨的过程。超分辨技术可以通过深度学习的方法将低分辨率视频重建成高分辨率视频图像、为用户带来良好的视频体验。
本公开一实施例提供了一种生成对抗网络(GAN:Generative Adversarial Networks)。如图2所示,该网络包括能捕获数据分布的生成网络(generator)G,也可称为生成器。以及是能估计数据来源于真实样本概率的判别网络(discriminator)D,也可称为判别器。在这种框架下,同时训练生成网络和判别网络,通过这两个网络互相对抗来达到最好的生成效果。生成网络训练时的输入是低分辨率图像,输出为重建的超分辨图像。而判别网络训练时的输入是超分辨图像和真实图像,输出是输入图像来源于真实图像的概率,低分辨率图像可以通过对真实图像进行下采样而得到。生成网络的训练过程就是最大化判别器犯错误的概率,使得判别器误以为数据是真实图像(真样本)而不是生成器生成的超分辨图像(假样本)。而判别网络的训练目标是能将真样本和假样本最大化分开。因此,这一框架就对应于两个参与者的极小极大博弈(minimax game)。在所有可能的网络参数中,可以求出唯一均衡解,使得生成网络生成的假样本进去了判别网络以后,判别网络给出的结果是一个接近0.5的值。
本公开一实施例提出了基于生成对抗网络实现超分辨的方法,相应的网络为超分辨生成对抗网络(SRGAN:Super-Resolution Generative Adversarial Networks),在SRGAN的网络框架中。生成网络的核心是多个相同布局的残差块,使用批量归一化(BN:batch-normalization layers)层和修正线性单元(ReLU:Rectified Linear Unit)作为激活函数,用2个训练好的子像素(trained sub-pixel)卷积层增加输入图像的分辨率。判别网络包含8个递增的卷积层,按从2到64到512个核函数增长,作为结果的512个特征图后是2个全连接层(dense layers,也可称为密集层)和一个最终的S形(sigmoid)激活函数,以得到样本类别的概率。但SRGAN不能同时实现时间和空间上的超分辨以充分提取不同维度的有用特征,对视频质量的提升有限。其判别网络结构单一,没有利用光流信息,判别能力受到限制。因此经该网络重建的高分辨率图像的质量仍有待提高。
本公开一实施例提供了一种视频超分辨网络,包括用于实现视频时空超分辨功能的生成网络。所述生成网络使用3D卷积来实现时空超分的功能,先基于3D卷积进行浅层特征提取,之后用一系列残差注意力块(RAB:residual attention Block)进行深层特征提取。每个RAB块自身使用残差学习的方式,使用3D注意力机制,来进一步提升时空超分辨的质量。
如图3所示,所述生成网络包括依次连接的第一特征提取部分、第二特征提取部分和重建部分,其中:
第一特征提取部分10,设置为接收第一视频帧序列,基于3D卷积从所述第一视频帧序列中提取第一特征并输出;
第二特征提取部分20,设置为接收所述第一特征,基于3D残差注意力机制从所述第一特征中提取时间和/或空间上的第二特征并输出;
重建部分30,设置为接收所述第二特征,基于3D卷积和3D上采样实现特征融合和特征的时空超分辨,及基于3D卷积重建视频帧序列,生成第二视频帧序列,所述第二视频帧序列的分辨率大于所述第一视频帧序列的分辨率。
上述第二特征是从第一特征中提取的,也可以将第一特征称为浅层特征,将第二特征称为深层特征。
上述第一视频帧序列也可以称为低分辨率视频帧序列,第二视频帧序列可以称为高分辨率视频帧序列或者超分辨率视频帧序列。
上述图像分辨率和视频帧率可以统称为分辨率,其中图像分辨率也可以称为空间分辨率,视频帧率也可以称为时间分辨率。
上述第二视频帧序列的分辨率大于所述第一视频帧序列的分辨率,可以是:所述第二视频帧序列的图像分辨率大于所述第一视频帧序列的图像分辨率,和/或,所述第二视频帧序列的视频帧率大于所述第一视频帧序列的视频帧率。
在本公开一示例性的实施例中,所述第一特征提取部分包括依次连接的3D卷积层和激活层,如图3中的Conv3d和PReLU,所述3D卷积层的输入为所述第一视频帧序列,所述激活层的输出为所述第一特征。
在本公开一示例性的实施例中,所述第二特征提取部分包括依次连接的多个残差注意力块(RAB),如图3所示,第一个RAB的输入为所述第一特征,除第一个RAB之外的其他RAB的输入为前一RAB的输出,最后一个RAB的输出为所述第二特征;每一所述RAB包括依次连接的3D卷积层、激活层以及3D注意力机制模型单元,如图5中的Conv3d、PReLU和3D-CBAM。所述RAB的输入送入所述3D卷积层,还跳跃连接(skip connection)与所述3D注意力机制模型单元的输出相加,得到的和作为所述RAB的输出。
在本公开一示例性的实施例中,所述3D注意力机制模型单元采用3D卷积块注意力模型(3D-CBAM:3D Convolutional Block Attention Module),如图4所示,该3D卷积块注意力模型包括3D通道注意力模块60和3D空间注意力模块70,所述3D注意力机制模型单元的输入送入所述3D通道注意力模块;所述3D通道注意力模块的输入与输出相乘得到的第一积作为所述3D空间注意力模块的输入,所述3D空间注意力模块的输出与所述第一积相乘得到的第二积,作为所述3D注意力机制模型单元的输出。
在一些技术中,注意力机制是在二维空间上设计的,本公开实施例的3D-CBAM在二维的基础上进行扩展,增加了一个深度维度,在每一次提取通道特征和空间特征时,考虑深度参数的变化。对于输入的3D特征图,3D-CBAM按照顺序推理通道注意力特征图和空间注意力特征图。
3D通道注意力模块实现时,可以将输入的特征图分别经过基于宽度、高度和深度上的最大值池化和均值池化后,输入共享多层感知机,然后分别进行基于对应元素的加和操作,再用sigmoid函数激活,生成初始的通道特征图,将初始的通道特征图与输入的特征图相乘,生成最终的通道特征图。
3D空间注意力模块实现时,可以将上述最终的通道特征图作为3D空间注意力模块的输入特征图,对其做基于通道的最大池化操作和均值池化操作,提取出的特征再进行基于通道的合并操作,然后通过卷积操作(如7×7卷积,3×3卷积等)将其降维成一个通道,再经过sigmoid函数激活生成空间注意力特征图。最后,将生成的空间注意力特征图和输入的所述最终的通道特征图相乘,得到3D-CBAM输出的特征图。
3D注意力机制在对特征提取时同时考虑空间和时间的变化,能够更加适应本公开实施例视频超分辨网络要达到的目的,更好的自适应学习。3D通道注意力模块更加关注哪些通道对最终的超分辨起到作用,选择出对预测起到决定性作用的特征。3D空间注意力模块则关注哪些像素位置会对网络的预测起到更重要的最用,联合使用这两种注意力机制模块可以最大限度的提高网络的学习能力,得到更好的时空超分辨结果。
在本公开一示例性的实施例中,请参见图3,所述重建部分30包括依次连接的以下单元:
用于融合特征的3D卷积单元,包括依次连接的3D卷积层和激活层(如图3中的Conv3D和PReLu),所述用于融合特征的3D卷积单元的输入为所述第二特征;PReLu是带参数修正线性单元(Parametric Rectifier Linear Unit)。
用于实现特征的时空超分辨的3D转置卷积单元,包括依次连接的3D转置卷积层和激活层(如图5中的ConvTrans-3D和PReLu),所述3D转置卷积单元的输入为所述用于融合特征的3D卷积单元的输出,3D转置卷积可以实现上采样的功能;及
用于生成视频帧序列的3D卷积层(如图5中的Conv3D),输入为所述3D转置卷积单元的输出,输出为所述第二视频帧序列。
在本公开一示例性的实施例中,所述激活层使用的激活函数为PReLu,激活函数有很多种,此处也可以采用其他的激活函数。
本公开实施例上述生成网络的特点包括:采用3D卷积,可以同时提取到视频的时间和空间特征,特征提取更加准确,同时相对于分别对时间和空间信息提取的方法,也进一步减小了计算资源的消耗;该生成网络采用3D注意力机制,可以很好的将网络提取的特征集中,从而得到更好的重建结果;该生成网络可以采用数量可变的RAB块,网络结构更加灵活,可以根据计算资源自由选择数量,来适合不同场景的需求。
生成网络可以独立使用,作为视频超分辨网络来完成视频超分辨功能。而在本公开一示例性的实施例中,所述视频超分辨网络还包括判别网络,生成网络和判别网络组成的视频超分辨网络的总体架构如图2所示。所述判别网络训练时的输入为真实视频帧序列和生成网络生成的第二视频帧序列,分别作为输入判别网络的真样本和假样本。判断网络的输出为输入的视频帧序列为真实视频帧的概率。生成网络训练时作为输入数据的第一视频帧序列是对真实视频帧序列进行劣化得到。例如,可以通过对真实视频帧序列进行下采样、模糊处理,加入噪声、压缩编解码中的一种或多种处理得到作为训练集的第一视频帧序列。下采样的方式有多种,有线性方式,例如最近邻采样,双线性采样,Bicubic下采样,均值下采样等;也可以非线性方式,如神经网络下采样。可以设置多种下采样倍数以得到不同分辨率的第一视频帧序列,从而训练出多组网络参数,在使用时就可以根据需要灵活设置视频超分辨网络的网络参数以取得不同的超分辨效果。
本公开一实施例中,所述视频超分辨网络还包括:判别网络,设置为在训练时,以真实视频帧序列和所述生成网络训练时生成的所述第二视频帧序列为输入,从输入的视频帧序列中提取细节特征以及视频帧之间的运动信息特征,基于所述细节特征和运动信息特征确定输入的所述视频帧序列为真实视频帧序列的概率,其中,所述真实视频帧序列的分辨率与所述第二视频帧序的分辨率相同,所述生成网络训练时接收的所述第一视频帧序列通过对所述真实视频帧序列进行下采样而得到。
在本公开一示例性的实施例中,如图5所示,所述判别网络包括第一分支、第二分支、与所述第一分支和第二分支连接的信息融合单元,以及与所述信息融合单元连接的权重计算单元,其中:
第一分支40,设置为基于特征提取网络从输入的视频帧序列中提取细节特征,基于所述细节特征进行真伪判断;
第二分支50,设置为基于光流网络从输入的视频帧序列中提取视频帧之间的运动信息特征,基于所述运动信息特征进行真伪判断;
信息融合单元,设置为对所述第一分支和第二分支输出的真伪判断的结果进行融合;
权重计算单元,设置为根据所述信息融合单元输出的融合后的信息进行权重计算,得到输入的视频帧序列为真实视频帧序列的概率。
在本公开一示例性的实施例中,所述信息融合单元采用全连接层(如图5中的dense(1))实现;所述权重计算单元采用S形函数(如图5中的sigmod函数)实现。
在本公开一示例性的实施例中,所述第一分支40包括依次连接的以下单元:
2D卷积单元,包括依次连接的2D卷积层和激活层。如图5中的Conv_1和LeakyReLU;
多个2D卷积加归一化单元,所述2D卷积加归一化单元包括依次连接的2D卷积层、BN层和激活层,图5中的Conv_2层、BN层和LeakyReLU组成一个2D卷积加归一化单元,其他的2D卷积加归一化单元在图中分别用CBL_2至CBL_8表示。图5的示例中使用了7个CBL,但本公开不局限于此数量;BN层用于加快网络收敛速率。
全连接单元,包括依次连接的全连接层和激活层,如图5中的第一分支40中的Dense(1024)和LeakyReLU。
在本公开一示例性的实施例中,所述第二分支50包括依次连接的以下单元:
N个2D卷积加归一化单元,包括依次连接的2D卷积层、BN层和激活层,N≥2,如图5的第二分支示出了9个2D卷积加归一化单元,Conv1层、BN层和LeakyReLU组成第一个2D卷积加归一化单元,其他2D卷积加归一化单元表示为CBL2、CBL3、CBL3-1、CBL4、CBL4-1、CBL5、CBL5-1和CBL6;
M个2D反卷积单元,包括2D反卷积层和激活层,M≥2,图5中示出了4个2D反卷积单元,其中的2D反卷积层分别表示为DeConv5、DeConv4、DeConv3和DeConv2,激活层均为LeakyReLU;
全连接单元,包括依次连接的全连接层和激活层,如图5中的第二分支50中的Dense(1024)和LeakyReLU。
在本公开一示例性的实施例中,第2i个2D卷积加归一化单元的输出还连接到第M-i+1个2D反卷积单元的输入,1≤i≤M,N=2M+1。在图5所示的示例中,N=9,M=4。其连接关系具体如图5所示,此种网络结构可实现对视频帧之间的运动信息特征的提取和真伪判断。
在本公开一示例性的实施例中,所述判别网络中的激活层使用的激活函数为带泄露修正线性单元LeakReLu。激活函数有很多种,此处也可以采用其他的激活函数。
在本公开一示例性的实施例中,图5所示的判别网络中,第一支路40中卷积层参数设置如下表所示:
Conv_1 Conv_2 Conv_3 Conv_4
K3 s1 n64 K3 s2 n64 K3 s1 n128 K3 s2 n128
Conv_5 Conv_6 Conv_7 Conv_8
K3 s1 n256 K3 s2 n256 K3 s1 n512 K3 s1 n512
其中,K表示卷积核(kernel)大小,s表示步长(stride),n表示卷积核数量(number)。K3表示卷积核大小为3,s1表示步长为1,n64表示卷积核数量为64,依此类推,卷积核大小和步长的单位可以为像素。
第二支路50中卷积层参数设置如下:
Conv1 Conv2 Conv3 Conv3-1 Conv4
K7 s2 n64 K5 s2 n128 K3 s3 n256 K3 s1 n256 K3 s2 n512
Conv4-1 Conv5 Conv5-1 Conv6  
K3 s1 n512 K3 s2 n512 K3 s1 n512 K3 s2 n1024  
DeConv5 DeConv4 DeConv3 DeConv2  
K4 s2 n512 p1 K4 s2 n256 p1 K4 s2 n128 p1 K4 s2 n64 p1  
其中,K、s、n的含义同上表,p表示填充(padding)。
本公开实施例的判别网络采用两个判别准则,一是视频帧本身的特征,二是视频帧之间的运动信息。相应地,判别网络包括两个分支,整体为U型网络结构,其中一条分支用于提取输入视频帧序列的细节特征和判断,另一条分支用光流网络来获取输入视频帧序列的运动信息特征和判断。可以更准确地的识别出输入视频帧的真伪概率,即是真实视频帧序列还是超分辨率视频帧序列(即第二视频帧序列)。而相较于使用2D注意力机制,通过使用3D残差注意力机制能够更好的提取到不同维度上的有用特征,提高视频质量。
本公开实施例的视频超分辨网络可使用以下电路中的任意一种或者以下电路的任意组合来实现:一个或多个微处理器、数字信号处理器、专用集成电路、现场可编程门阵列、离散逻辑、硬件。如果部分地以软件来实施本公开,那么可将用于软件的指令存储在合适的非易失性计算机可读存储媒体中,且可使用一个或多个处理器在硬件中执行所述指令从而实施本公开方法。
本公开实施例基于生成对抗的视频时空超分辨网络,能够同时对视频的空间分辨率和时间分辨率进行 提升,即在空间和时间上超分辨,包含多维特征信息。可以显著增强低分辨率视频帧序列的图像质量和帧率,使用一个网络同时实现了视频帧图像超分辨和帧率提升两种效果。此外,本公开实施例的视频时空超分辨网络将运动信息的利用放在了判别网络上,相较于在生成网络部分利用光流信息进行运动估计,能够进一步的利用真实视频的信息,来进一步提升整个网络的性能,提高视频超分的质量。
本公开的网络结构可以在上述实施例的基础上有所变化。例如,生成网络包含的RAB的个数可以进行适当的删减或增加,来满足不同场景不同计算能力的需求。
本公开一实施例还提供了一种对本公开实施例的视频超分辨网络的训练方法,包括以下过程:
数据预处理过程:
对作为样本的连续视频帧序列,选取该连续视频序列中的7帧,将整个视频裁剪成7×sH×sW大小的块状视频帧序列,作为训练集的高分辨率的真实视频帧序列(简称HR序列),每个HR序列有7帧,每一HR视频帧的高度为sH,宽度为sW。可以对HR序列在时间和空间同时做下采样,得到块状的低分辨率视频帧序列(简称LR序列)5×H×W。训练时设置较小的H,W值,可以减小训练时间,增加数据集的复杂度。所有的训练数据进行归一化处理,使其像素值在(0,1)区间内,更好的用于网络的训练结果。通过上述处理,得到足够数量的LR序列和HR序列。
训练过程:
采用Kaiming初始化方法对网络中的各个参数初始化,学习速率r=1e-4,使用Adam优化器优化网络参数。
将LR序列作为视频超分辨网络的输入数据,HR序列作为视频超分辨网络的目标数据,对生成网络进行训练。生成网络的输出为与HR序列大小相同的超分辨率视频帧序列(简称SR序列)。将SR序列(即假样本)与HR序列(即真样本)作为判别网络训练的输入数据送入判别网络,其中HR序列与SR序列各占50%,判别网络输出判定结果,即输入数据的真伪概率,也可以说是输入数据为HR序列的概率。
判别网络对SR序列和HR序列的判定结果用于计算判别网络的损失和生成网络的对抗损失,生成网络输出的SR序列与HR序列的均方误差(MSE:Mean Square Error)可以作为生成网络的损失函数。经过多次反复迭代,直至网络误差达到预先设定的允许误差,训练结束,保存网络模型结构参数,得到训练好的基于神经网络的视频超分辨网络模型。
在本公开一示例中,视频超分辨网络使用PyTorch平台(PyTorch平台是一个开源的Python机器学习库)在Nvidia GTX 1080Ti GPU上实现。实验的训练集和测试集均使用Vimeo-90K。在视频帧图像上实现了4倍超分辨,帧率提升了2倍。实验显示,残差注意力块(RAB:residual attention block)的块数对实验结果存在影响,RAB数量分别取3,7,9和12时,RAB数量为12时生成的SR视频帧的质量最好。而使用生成对抗网络相比于只使用生成网络,超分辩视频帧的质量更好。
视频编码端可能因为种种客观限制,无法提供高分辨率的视频。比如摄像机分辨率不够,网络带宽不足,源端资源不足等。基于深度学习的视频超分辨能较好的恢复图像细节。因而可以借助视频超分辨处理,对于视频质量做增强,呈现给用户高质量的视频,提升图像的主观视觉效果。
本公开一实施例提供了一种视频超分辨处理方法,如图6所示,包括:
步骤110,基于3D卷积从所述第一视频帧序列中提取第一特征;
步骤120,基于3D残差注意力机制从所述第一特征中提取时间和/或空间上的第二特征;
步骤130,基于3D卷积和3D上采样实现所述第二特征的特征融合和特征的时空超分辨,及基于3D卷积重建视频帧序列,生成第二视频帧序列,所述第二视频帧序列的分辨率大于所述第一视频帧序列的分辨率。
在本公开一示例性的实施例中,所述视频超分辨处理方法基于如本公开任一实施例所述的视频超分辨网络实现
在本公开一示例性的实施例中,所述第二视频帧序列的图像分辨率大于所述第一视频帧序列的图像分辨率,和/或,所述第二视频帧序列的视频帧率大于所述第一视频帧序列的视频帧率。
视频超分辨可以用于视频压缩过程的各个环节,例如,用于解码端的视频后处理过程,用于编码端的视频预处理过程,也可以用于视频编码过程和视频解码过程。下面用几个示例分别加以说明。
第一种应用场景
一种处理方式是采用常规的视频编码方法但是加大压缩的力度如增大量化步长,以编码得到较低码率的视频帧序列,在解码端再通过视频超分辨提升视频质量。即将视频超分辨应用于视频解码中的后处理过程。如可以在视频播放设备中通过对解码器输出的已解码视频帧序列进行超分辨处理,提高重建视频帧的分辨率。
在本公开一示例性的实施例中,将视频超分辨应用于视频后处理时,图6所示的视频超分辨处理方法中的所述第一视频帧序列是对码流解码输出的已解码视频帧序列,所述视频超分辨处理用于提升所述已解码视频帧序列的分辨率。将视频超分辨应用于视频后处理时,可以用视频超分辨处理代替原有的后处理滤波器(post-filtering),也可以保留原有的后处理滤波,增加视频超分辨处理。
图7A是该应用场景下视频解码端的结构框图。如图所示,包括:
视频解码器101,设置为对已编码视频码流(简称码流)进行解码,得到第一视频帧序列;
视频超分辨网络103,设置为对所述第一视频帧序列进行视频超分辨处理,得到分辨率大于所述第一视频帧序列的第二视频帧序列;
显示器105,设置为对所述第二视频帧序列进行显示和播放。
本实施例的视频解码器101可以采用如图7B所示的视频解码器。该视频解码器的结构可以用于H.264/AVC、H.265/HEVC、VVC/H.266及其他类似标准的视频解码。在其他实施例中,视频解码器101也可以采用其他类型的视频解码器,如端到端视频编解码技术中基于神经网络的视频解码器。
如图7B所示,视频解码器101包含熵解码单元150、预测处理单元152、反量化单元154、反变换处理单元156、重建单元158(图中用带加号的圆圈表示)、滤波器单元159,以及图片缓冲器160。在其它实施例中,视频解码器30可以包含更多、更少或不同的功能组件。
熵解码单元150可对接收的码流进行熵解码,提取语法元素、量化后的系数块和PU的运动信息等信息。预测处理单元152、反量化单元154、反变换处理单元156、重建单元158以及滤波器单元159均可基于从码流提取的语法元素来执行相应的操作。
作为执行重建操作的功能组件,反量化单元154可对量化后的TU关联的系数块进行反量化。反变换处理单元156可将一种或多种反变换应用于反量化后的系数块以便产生TU的重建残差块。
预测处理单元152包含帧间预测处理单元162和帧内预测处理单元164。如果PU使用帧内预测编码,帧内预测处理单元164可基于从码流解析出的语法元素确定PU的帧内预测模式,根据确定的帧内预测模式和从图片缓冲器件60获取的PU邻近的已重建参考信息执行帧内预测,产生PU的预测块。如果PU使用帧间预测编码,帧间预测处理单元162可基于PU的运动信息和相应的语法元素来确定PU的一个或多个参考块,基于所述参考块来产生PU的预测块。
重建单元158可基于TU关联的重建残差块和预测处理单元152产生的PU的预测块(即帧内预测数据或帧间预测数据),得到CU的重建块。
滤波器单元159可对CU的重建块执行环路滤波,得到重建的图片。重建的图片存储在图片缓冲器160中。图片缓冲器160可提供参考图片以用于后续运动补偿、帧内预测、帧间预测等,也可将重建的视频数据作为已解码视频数据输出,在显示装置上的呈现。
上述显示器105例如可以是液晶显示器、等离子显示器、有机发光二极管显示器或其它类型的显示装置。在其他示例中,解码端也可以不包含显示器105,而是包含可应用解码后数据的其他装置。
本公开实施例可以用于解决视频压缩过程中产生的图像质量损失和视频帧率下降等问题。通过将视频超分辨网络应用于解码端的后处理,对解码输出的视频帧序列进行时空超分辨,可以提升视频图像的质量。为了符合解码端对帧率的要求,在后处理时也可以同时对帧率做提升,从而为用户呈现更高分辨率和更高帧率的高质量视频。
本实施例采用视频超分辨网络对解码后的视频帧序列进行质量增强,并不要求编码端在视频预处理时对视频帧序列下采样。
在本公开一示例性的实施例中,所述第一视频帧序列是对码流解码输出的已解码视频帧序列;所述视频超分辨处理方法还包括:从码流中解析出编码端发送的视频超分辨网络的网络参数信息,及,根据所述 网络参数信息设置所述视频超分辨网络的网络参数。对于不同的已解码视频帧序列,可以为视频超分辨网络配置不同的网络参数如生成网络中RAB的个数等,以达到更好的超分辨效果。而合适的网络参数可以由编码端来产生并生并写入码流,解码端从码流中解析出所述网络参数并进行配置,可以取得更好的质量增强的效果。
第二种应用场景
在本公开一示例性的实施例中,将视频超分辨应用于视频预处理过程。如在视频采集设备的视频预处理模块,将采集到的原始视频帧序列输入本公开实施例的视频超分辨网络进行处理,得到分辨率更高、帧率更高的输出视频,再将该输出视频作为视频编码器的输入视频进行编码处理。
本实施例将视频超分辨应用于视频预处理时,图6所示的视频超分辨处理方法中的所述第一视频帧序列是视频采集设备采集到的原始视频帧序列,所述视频超分辨处理可以提升原始视频帧序列的分辨率。
第三种应用场景
自适应分辨率改变(ARC:Adaptive Resolution Change)允许视频帧序列根据网络状态传输不同分辨率的视频帧,当网络带宽低时传输低分辨率视频帧,带宽高时传输原始分辨率视频帧。在H.265和H.264中,当编码器在传输视频过程中想改变分辨率时需要插入一个满足新分辨率的即时解码刷新(IDR:Instantaneous Decoding Refresh)帧或类似的帧。但是传输IDR帧需要比较多的码率,对视频会议类的应用会引入延迟。如果不插入IDR帧则在帧间预测时当前帧和参考帧分辨率不同会带来问题。VP9(VP9是Google开发的开放的视频压缩标准)通过参考图像重采样(RPR:reference picture resampling)来解决这个问题,使得不同分辨率直接的图像可以进行帧间预测。RPR也已经写入多功能视频编码(VVC:Versatile Video Coding)标准中。根据RPR,当视频序列的分辨率改变时,为了能够进行运动补偿,需要对参考图像进行重采样,其中,基于图像的RPR将重采样前和重采样后的参考图像都放入已解码图片缓冲器(DBP:Decoded Picture Buffer)内,当运动补偿时在DPB中找对应分辨率的参考图像进行预测。
在本公开一示例性的实施例中,将视频超分辨应用于视频编码过程的RPR的处理,此时图6所示的视频超分辨处理方法中的所述第一视频帧序列是从视频编码器的DBP中获取的需要进行上采样的参考图像(可以是一帧或多帧参考图像,可以只提升图像分辨率)。所述视频超分辨处理可以实现对参考图像的上采样,得到图像分辨率更大的参考图像供帧间预测时选择。
图8A所示的视频编码器1000可以用于实现RPR,其包括图像分辨率调整单元1115,本公开实施例的超分辨网络可用于图像分辨率调整单元1115中,实现参考图像的上采样。
如图8A所示,视频编码器207包含预测处理单元1100、划分单元1101、残差产生单元1102、变换处理单元1104、量化单元1106、反量化单元1108、反变换处理单元1110、重建单元1112、滤波器单元1113、已解码图片缓冲器1114、图像分辨率调整单元1115,以及熵编码单元1116。预测处理单元1100包含帧间预测处理单元121和帧内预测处理单元1126。在其他实施例中,视频编码器20可以包含比该示例更多、更少或不同功能组件。
划分单元1101与预测处理单元1100配合将接收的视频数据划分为切片(Slice)、CTU或其它较大的单元。划分单元1101接收的视频数据可以是包括I帧、P帧或B帧等视频帧的视频序列。
预测处理单元1100可以将CTU划分为CU,对CU执行帧内预测编码或帧间预测编码。对CU做帧内预测和帧间预测时,可以将CU划分为一个或多个预测单元(PU:prediction unit)。
帧间预测处理单元1121可对PU执行帧间预测,产生PU的预测数据,所述预测数据包括PU的预测块、PU的运动信息和各种语法元素。
帧内预测处理单元1126可对PU执行帧内预测,产生PU的预测数据。PU的预测数据可包含PU的预测块和各种语法元素。
残差产生单元1102可基于CU的原始块减去CU划分的PU的预测块,产生CU的残差块。
变换处理单元1104可将CU划分为一个或多个变换单元(TU:Transform Unit),TU关联的残差块是CU的残差块划分得到的子块。通过将一种或多种变换应用于TU关联的残差块来产生TU关联的系数块。
量化单元1106可基于选定的量化参数对系数块中的系数进行量化,通过调整QP值可以调整对系数块的量化程度。
反量化单元1108和反变换单元1110可分别将反量化和反变换应用于系数块,得到TU关联的重建残差块。
重建单元1112可将所述重建残差块和预测处理单元1100产生的预测块相加,产生CU的重建块。
滤波器单元1113对所述重建块执行环路滤波后,将其存储在已解码图片缓冲器1114中作为参考图像。帧内预测处理单元1126可以从已解码图片缓冲器1114中提取PU邻近的块的参考图像以执行帧内预测。帧间预测处理单元1121可使用已解码图片缓冲器1114缓存的上一帧的参考图像对当前帧图像的PU执行帧间预测。
图像分辨率调整单元1115对已解码图片缓冲器1114中存储的参考图像进行重采样,可以包括上采样和/或下采样,得到多种分辨率的参考图像保存在已解码图片缓冲器1114中。
熵编码单元1116可以对接收的数据(如语法元素、量化后的系统块、运动信息等)执行熵编码操作。
第四种应用场景
在视频编码端受到网络带宽不足,源端资源不足等因素影响时,还有一种处理方式是采用可分级视频编码的方式。可分级视频编码引入了基本层(BL:Base Layer)、增强层(EL:Enhance Layer)等概念,把对解码图像重要的信息(比特)放在有保障的信道中传输。这些重要信息的集合称为基本层。而把次要信息(比特)放在没有保障的信道中传输,这些数据信息的集合称为增强层。在解码端,增强层信息部分甚至全部丢失,解码器仍能从基本层的信息中恢复出可接受的图像质量。
可分级视频编码有多种类型,如空域可分级编码、时域可分级编码、频域可分级编码和质量可分级编码等。以空域可分级编码为例,空域可分级编码对视频中的每帧图像产生多个不同空间分辨率的图像,解码基本层码流得到的低分辨率的图像,如果同时加入增强层码流到解码器,得到的是高分辨率的图像。
一个示例性的可分级视频编码框架如图8B所示,该编码框架包括基本层、第一增强子层即L1层和第二增强子层即L2层。图中仅示出了编码架构中与上采样和下采样密切相关的部分。输入视频帧序列经第一下采样单元801和第二下采样单元803进行两次下采样后送入基本编码器805进行编码,输出已编码基本层码流,基本编码器805输出的基本层的重建视频帧在第一上采样单元807进行上采样,得到L1层的重建视频帧。第一减法器806用第一下采样单元801输出的L1层的原始视频帧减去该L1层的重建视频帧,得到L1层残差。L1层的重建视频帧和L1层的重建残差在加法器808相加后,再在第二上采样单元809上采样,得到L2层的重建视频帧。第二减法器810用输入视频帧序列减去该L2层的重建视频帧,得到L2层残差。可分级视频编码框架中也可以包括3个或更多的增强子层。
在本公开一示例性的实施例中,将视频超分辨应用于包括基本层和增强层的视频编码架构中,例如低复杂度增强视频编码(LCEVC:Low Complexity Enhancement Video Coding)的编码器中,用于编码侧的增强层数据的生成。具体地,可以使用本公开实施例的视频超分辨网络来实现可分级视频编码架构中的上采样单元。
本实施例将视频超分辨应用于可分级视频编码架构时,图6所示的视频超分辨处理方法中的所述第一视频帧序列是包括基本层和增强层的可分级视频编码架构中产生的基本层的重建(Reconstruction)视频帧序列或增强子层(如L1层)的重建视频帧序列,所述视频超分辨处理可以实现对所述重建视频帧序列的上采样,用于生成相应增强子层的残差,
第五种应用场景
在本公开一示例性的实施例中,将视频超分辨应用于包括基本层和增强层的可分级视频解码架构中。一个示例性的可分级视频解码架构如图8C所示,该解码架构包括基本层、第一增强子层即L1层和第二增强子层即L2层,但也可以包括一个增强子层或3个以上的增强子层。图中仅示出了解码架构中与上采样密切相关的部分。如图所示,基本解码器901输出的基本层的已解码视频帧序列经第一上采样单元903上采样得到初始中间图像(Preliminary Intermediate Picture)。初始中间图像和L1层的已解码数据在第一加法器904相加,得到L1层的组合中间图像(Combined Intermediate Picture)。组合中间图像在第二上采样单元905上采样后得到初始输出图像(Preliminary Output Picture)。初始输出图像和L2层的已解码数据在第二加法器906相加,得到输出视频帧序列。
本实施例将视频超分辨应用于包括基本层和增强层的视频解码架构中,例如LCEVC的解码器中,用于解码侧的增强层数据的生成。具体地,可以使用本公开实施例的视频超分辨网络来实现可分级视频解码架构中的上采样单元。本实施例将视频超分辨应用于可分级视频解码架构时,图6所示的视频超分辨处理方法中的所述第一视频帧序列是包括基本层和增强层的可分级视频解码架构中产生的基本层的已解码视频帧序列或增强子层的组合中间图像(可以是一幅或多幅图像),所述视频超分辨处理可以实现对已解码视 频帧序列的上采样,生成初始中间图像;或实现对组合中间图像的上采样,生成初始输出图像。
本公开一实施例中,视频编码端对视频编码之前,先根据当前情况确定是否进行下采样,如在带宽等资源不足时通过下采样减少编码的数据量,使码流量大大减少。视频解码端完成对码流的解码之后,再判断是否对解码后的视频帧序列进行超分辨。这些方式也可实现分级编码的类似效果,例如,当网络带宽较小的时候,只有下采样后编码得到的基本视频码流被传输,而在网络带宽较大时,不进行下采样,相当于传输增强的视频信息,以此得到自适应性,保证拥有网络连接的大部分终端都可以用适当的码流来传输多媒体信息。且这种方案优于编码端将视频帧直接编码成相同码率的图像、解码端再用超分辨网络对解码后的图像进行质量增强的方案。
本公开一实施例提供了一种视频编码处理方法,如图9所示,包括:
步骤210,进行视频预处理时确定是否对来自数据源的视频帧序列进行下采样,如果是,执行步骤220,如果否,执行步骤230;
步骤220,在确定不进行下采样的情况下,将来自数据源的视频帧序列直接输入视频编码器进行视频编码,生成码流,结束;
步骤230,在确定进行下采样的情况下,对来自数据源的视频帧序列进行下采样,将经下采样后的视频帧序列输入视频编码器进行视频编码,生成码流。
本文中所称的视频编码处理包括视频预处理和视频编码。所述视频预处理可以包括下采样等处理。本文所称的视频解码处理包括视频解码和视频后处理,视频后处理可以包括本公开实施例的视频超分辨处理。
在本公开一示例性的实施例中,所述对来自数据源的视频帧序列进行下采样,包括:对来自数据源的视频帧序列的图像分辨率和/或视频帧率进行下采样。下采样时,可以根据带宽等因素选取合适的下采样倍数,使得编码后的码率与带宽相适应。
在本公开一示例性的实施例中,所述视频编码处理方法还包括:进行视频编码时,将一下采样标志写入码流,所述下采样标志用于指示编码端对来自数据源的所述视频帧序列的预处理是否包括下采样。
如果编码端对来自数据源的所述视频帧序列做预处理时进行过下采样,而视频超分辨网络是基于真实视频帧序列和对真实视频帧序列下采样得到的第一视频帧序列训练得到的,那么编码端对来自数据源的视频帧进行下采样再进行压缩编码生成码流,解码端对码流解码重建第一视频帧序列后,解码端使用所述视频超分辨网络对重建的第一视频帧序列进行视频超分辨处理,其对视频质量的提升是显著的。因为此时视频超分辨网络的应用场景与训练场景是相似的,都是用于恢复下采样后的视频帧的分辨率。而如果编码端不对视频帧进行下采样,那么即使解码后的视频质量达不到要求,在解码端使用按上述方式训练的视频超分辨网络对已解码视频帧序列进行质量增强,其对视频质量的提升效果是有限的或没有效果。因此编码端生成上述下采样标志并写入码流,使得解码端可以根据该下采样标志确定是否进行视频超分辨处理或者根据该下采样标志和其他条件共同确定是否进行视频超分辨处理,有利于解码端合理地做出是否进行视频超分辨处理的判决。
在本公开一示例性的实施例中,所述确定是否对来自数据源的视频帧序列进行下采样,包括:在满足以下条件中的任一种时,确定对来自数据源的视频帧序列进行下采样:
可用于传输视频码流的带宽小于不进行下采样时传输视频码流所需的带宽:
编码端的资源不支持对来自数据源的视频帧序列直接进行视频编码;
所述来自数据源的视频帧序列属于指定的需要进行下采样的视频帧序列。
虽然此处列出了需要对来自数据源的视频帧序列进行下采样的几种情况,但这仅仅是示意性的,完全可能存在其他需要对视频帧序列进行下采样的情况。本公开对此不加以局限。
在本公开一示例性的实施例中,所述视频编码处理方法还包括:进行视频编码时,获取来自数据源的所述视频帧序列对应的视频超分辨网络的网络参数,将所述网络参数写入码流。例如,对某一视频资源,可以由编码端预先根据所述视频资源制作训练用的样本,对视频超分辨网络进行训练,从而得到该视频资源对应的视频超分辨网络的网络参数,然后可以将所述网络参数与所述视频资源保存在一起,在对所述视频资源进行视频编码时,读取所述网络参数并编码写入码流。这样解码端可以解析出所述网络参数,使用所述网络参数配置视频超分辨网络,取得预期的质量增强效果。
本实施例视频编码处理方法可以根据带宽等情况,确定对视频帧进行预处理时是否进行下采样,使得 编码端可以自适应地选择一种合适的编码处理方法来适应网络环境、编码资源等变化。
本公开一实施例还提供了一种视频解码处理方法,如图10所示,包括:
步骤310,对码流进行解码,得到第一视频帧序列;
步骤320,判断所述第一视频帧序列是否满足设定的超分辨条件,如果是,执行步骤330,如果否,执行步骤340;
步骤330,在满足设定的超分辨条件的情况下,将所述第一视频帧序列输出到视频超分辨网络进行视频超分辨处理,得到第二视频帧序列,所述第二视频帧序列的分辨率大于所述第一视频帧序列的分辨率;
步骤340,在不满足设定的超分辨条件的情况下,跳过对所述第一视频帧序列的视频超分辨处理。
在跳过视频超分辨处理,或者进行视频超分辨处理得到第二视频帧序列之后,可以进行后续的解码后处理过程,或者进行视频的显示、播放。
在本实施例的一个示例中,所述视频超分辨网络采用本公开任一实施例所述的视频超分辨网络。但在本实施例的其他示例中,也可以采用其他的视频超分辨网络来进行本实施例的视频超分辨处理。
在本实施例的一个示例中,所述视频超分辨网络包括生成网络,所述生成网络训练时,以作为样本的第一视频帧序列为输入数据,以真实视频帧序列为目标数据,其中,所述真实视频帧序列的分辨率和所述第二视频帧序列的分辨率相同,作为样本的所述第一视频帧序列通过对所述真实视频帧序列进行下采样而得到的。本文中,生成网络训练时的输入是作为样本的第一视频帧序列,生成网络训练好之后使用时的输入可以是解码得到的第一视频帧序列(还可以是来自数据源的第一视频帧序列等),作为样本的第一视频帧序列和解码得到的第一视频帧序列的分辨率相同,内容可以不同。对真实视频帧序列进行下采样得到第一视频帧序列的过程中,也可以进行除下采样之外的其他处理,如在下采样之后还进行压缩编码和解码等处理。按照本示例训练的视频超分辨网络,适合于将经过下采样后再进行压缩编码和解码的低分辨率视频帧序列恢复为高分辨率视频帧序列。
在本公开一示例性的实施例中,所述对码流进行解码,还得到一下采样标志,所述下采样标志用于指示编码端对所述第一视频帧序列的预处理是否包括下采样;所述设定的超分辨条件至少包括:所述下采样标志指示编码端对所述第一视频帧序列的预处理包括下采样。在一个示例中,在所述下采样标志指示编码端对所述第一视频帧序列的预处理不包括下采样的情况下,可以确定跳过对所述第一视频帧序列的视频超分辨处理。下采样标志本身用于指示编码端对视频帧的预处理是否包括下采样,此处的下采样标志可用于指示编码端对所述第一视频帧序列的预处理包括下采样,意味着该下采样标志与第一视频帧序列相关,例如属于同一编码单元。
如上文所述,在视频超分辨网络使用经下采样得到的训练样本的情况下,下采样标志可以帮助解码端确定编码端在视频预处理时是否进行过下采样,从而更好地判断是否要进行视频超分辨处理,单纯地根据解码后视频质量,在该视频质量达不到某个固定的阈值时就进行视频超分辨,在该视频质量达到该阈值时就不进行视频超分辨,不考虑视频超分辨的预期效果,是比较机械的,有局限性。如果编码端进行过下采样而解码后的视频质量刚好达到阈值,此时也是可以进行视频超分辨以提高视频质量。如果编码端没有进行过下采样,由于其他因素如摄像机本身的分辨率差,传输路径上的噪声大等,使得解码后的视频质量达不到阈值,此时也可以不进行视频超分辨。
在本公开一示例性的实施例中,所述设定的超分辨条件包括以下条件中的一种或任意组合:
所述第一视频帧序列的图像质量低于设定的质量要求;
编码端对所述第一视频帧序列的预处理包括下采样;
解码端的视频超分辨功能处于可用状态;
在所述第一视频帧序列不满足设定的超分辨条件的情况下,跳过对所述第一视频帧序列的超分辨处理。
上述列举的超分辩条件可以组合使用,例如,在第一视频帧序列的图像质量低于设定的质量要求、编码端对所述第一视频帧序列序列进行下采样,且解码端的视频超分辨功能处于可用状态时,再判定执行对第一视频帧序列的超分辨处理。但这里的条件并不是穷举,可能还存在其他条件。上述质量要求可以用设定的峰值信噪比(PSNR Peak Signal to Noise Ratio)、结构相似性(SSIM:Structural Similarity),均方误差(MSE:Mean Square Error)等评价指标来表示。
本实施例将视频超分辨网络应用于视频处理的流程中。在压缩编码前将视频在空间和时间上下采样,大大降低了需要编码的视频数据量;在解码后用所训练的视频超分辨网络进行相对应的上采样,恢复出原 有视频。总体上明显降低码率,大大提高编码效率,减少传输码流。
本公开一实施例还提供了一种视频编解码系统,如图11所示,包括编码端设备和解码端设备。
编码端设备包括数据源201和视频编码处理装置200,数据源201可以是视频捕获装置(例如,摄像机)、含有先前捕获的数据的存档、用以从内容提供者接收数据的馈入接口,用于产生数据的计算机图形系统,或这些来源的组合。视频编码处理装置200可使用以下电路中的任意一种或者以下电路的任意组合来实现:一个或多个微处理器、数字信号处理器、专用集成电路、现场可编程门阵列、离散逻辑、硬件。如果部分地以软件来实施本公开,那么可将用于软件的指令存储在合适的非易失性计算机可读存储媒体中,且可使用一个或多个处理器在硬件中执行所述指令从而实施本公开方法。视频编码处理装置200可以基于上述电路实现本公开任一实施例所述的视频编码处理方法。
如图11所示,视频编码处理装置200包括:
下采样判决装置203,设置为进行视频预处理时确定是否对来自数据源的视频帧序列进行下采样,在确定进行下采样的情况下,将来自数据源的所述视频帧序列输出到下采样装置,在确定不进行下采样的情况下,将来自数据源的所述视频帧序列直接输出到视频编码器进行视频编码;
下采样装置205,设置为对输入的视频帧序列进行下采样,将下采样后的视频帧序列输出到视频编码器进行编码;
视频编码器207,设置为对来自数据源的所述视频帧序列或者下采样后的所述视频帧序列进行视频编码。
在本公开一示例性的实施例中,所述下采样判决装置203确定是否对来自数据源的视频帧序列进行下采样,包括:在满足以下条件中的任一种时,确定对来自数据源的视频帧序列进行下采样:
可用于传输视频码流的带宽小于不进行下采样时传输视频码流所需的带宽:
编码端的资源不支持对来自数据源的视频帧序列直接进行视频编码;
所述来自数据源的视频帧序列属于指定的需要进行下采样的视频帧序列。
在本公开一示例性的实施例中,所述下采样装置205对来自数据源的视频帧序列进行下采样,包括:对来自数据源的视频帧序列的图像分辨率和/或视频帧率进行下采样。
在本公开一示例性的实施例中,所述下采样判决装置203还设置为生成下采样标志并输出到所述视频编码器207,所述下采样标志用于指示编码端对来自数据源的所述视频帧序列的预处理是否包括下采样;所述视频编码器207还设置为在进行视频编码时,将所述下采样标志写入码流。此处的下采样标志可用于指示编码端对来自数据源的所述视频帧序列的预处理包括下采样,表示此处的下采样标志与来自数据源的所述视频帧序列相关,如属于同一编码单元。
如图11所示,解码端设备包括视频解码处理装置300和显示器307,显示器307可以是液晶显示器、等离子显示器、有机发光二极管显示器或其它类型的显示装置。视频解码处理装置300可使用以下电路中的任意一种或者以下电路的任意组合来实现:一个或多个微处理器、数字信号处理器、专用集成电路、现场可编程门阵列、离散逻辑、硬件。如果部分地以软件来实施本公开,可将用于软件的指令存储在合适的非易失性计算机可读存储媒体中,且可使用一个或多个处理器在硬件中执行所述指令从而实施本公开方法。视频解码处理装置300可以基于上述电路实现本公开任一实施例所述的视频解码处理方法。
视频解码处理装置300又包括:
视频解码器301,设置为对码流进行解码,得到第一视频帧序列;
超分辨判决装置303,设置为判断所述第一视频帧序列是否满足设定的超分辨条件,在满足设定的超分辨条件的情况下,将所述第一视频帧序列输出到视频超分辨网络进行视频超分辨处理;在不满足设定的超分辨条件的情况下,确定跳过对所述第一视频帧序列的视频超分辨处理;
视频超分辨网络305,设置为对所述第一视频帧序列进行视频超分辨处理,得到分辨率大于所述第一视频帧序列的第二视频帧序列;
在本公开一示例性实施例中,所述视频超分辨网络采用如本公开任一实施例所述的视频超分辨网络。
在本公开一示例性实施例中,所述视频超分辨网络中的生成网络训练时以作为样本的第一视频帧序列 为输入数据,以真实视频帧序列为目标数据,作为样本的所述第一视频帧序列是对所述真实视频帧序列进行下采样而得到的。如此训练的视频超分辨网络,适合于将经过下采样、压缩编码和解码后的低分辨率视频帧序列恢复为高分辨率视频帧序列,具有良好的质量增强效果。
在本公开一示例性实施例中,所述视频解码器对码流进行解码,还从码流中提取一下采样标志,所述下采样标志用于指示编码端对所述第一视频帧序列的预处理是否包括下采样;所述超分辨判决装置使用的所述超分辨条件至少包括:所述下采样标志指示编码端对所述第一视频帧序列的预处理包括下采样;在一个示例中,所述超分辨判决装置在所述下采样标志指示编码端对所述第一视频帧序列的预处理不包括下采样的情况下,可以确定不对所述第一视频帧序列进行超分辨处理。
在本公开一示例性实施例中,所述超分辨判决装置使用的所述设定的超分辨条件包括以下条件中的一种或任意组合:
所述第一视频帧序列的图像质量低于设定的质量要求;
编码端对所述第一视频帧序列的预处理包括下采样;
解码端的视频超分辨功能处于可用状态;
所述超分辨判决装置在所述第一视频帧序列不满足设定的超分辨条件的情况下,可以确定跳过对所述第一视频帧序列的视频超分辨处理。
基于本公开实施例的视频编解码系统,编码端在视频预处理阶段,根据目前所检测到的带宽环境等因素,判断是否需要对视频帧序列下采样,如需要(例如带宽不足时),则选择相应的下采样倍数,对视频帧序列的空间分辨率和/或时间分辨率进行下采样,再编码成码流传输;而在解码端用对应的解码器解码,解码后的视频帧质量不高,可送入视频超分辨网络进行质量的提升,得到具有所需的空间分辨率和时间分辨率的视频。当带宽变大时,编码端可以直接将来自数据源的视频帧序列编码成码流传输,解码端可以直接解码获得高质量视频,此时不进行视频超分辨。无论编码端是否进行下采样,均可使用相同的视频编码器进行编码,编码运算相对简单,资源占用少。
本公开一实施例还提供了一种视频编码处理装置,如图12所示,包括处理器5以及存储有可在所述处理器5上运行的计算机程序的存储器6,其中,所述处理器5执行所述计算机程序时实现如本公开任一实施例所述的视频编码处理方法。
本公开一实施例还提供了一种视频解码处理装置,可参见图12,包括处理器以及存储有可在所述处理器上运行的计算机程序的存储器,其中,所述处理器执行所述计算机程序时实现如本公开任一实施例所述的视频解码处理方法。
本公开一实施例还提供了一种视频超分辨处理装置,包括处理器以及存储有可在所述处理器上运行的计算机程序的存储器,其中,所述处理器执行所述计算机程序时实现本公开任一实施例所述的视频超分辨处理方法。
本公开一实施例还提供了一种视频编解码系统,包括如本公开任一实施例所述的视频编码处理装置和本公开任一实施例所述的视频解码处理装置。
本公开一实施例还提供了一种码流,其中,所述码流包括根据本公开实施例所述的视频编码处理方法生成,所述码流中包含下采样标志。
本公开一实施例还提供了一种非瞬态计算机可读存储介质,所述计算机可读存储介质存储有计算机程序,其中,所述计算机程序被处理器执行时实现如本公开任一实施例所述的视频编码处理方法或视频解码处理方法。
在一个或多个示例性实施例中,所描述的功能可以硬件、软件、固件或其任一组合来实施。如果以软件实施,那么功能可作为一个或多个指令或代码存储在计算机可读介质上或经由计算机可读介质传输,且由基于硬件的处理单元执行。计算机可读介质可包含对应于例如数据存储介质等有形介质的计算机可读存储介质,或包含促进计算机程序例如根据通信协议从一处传送到另一处的任何介质的通信介质。以此方式,计算机可读介质通常可对应于非暂时性的有形计算机可读存储介质或例如信号或载波等通信介质。数据存储介质可为可由一个或多个计算机或者一个或多个处理器存取以检索用于实施本公开中描述的技术的指令、代码和/或数据结构的任何可用介质。计算机程序产品可包含计算机可读介质。
举例来说且并非限制,此类计算机可读存储介质可包括RAM、ROM、EEPROM、CD-ROM或其它光盘存储装置、磁盘存储装置或其它磁性存储装置、快闪存储器或可用来以指令或数据结构的形式存储所要程序代码且可由计算机存取的任何其它介质。而且,还可以将任何连接称作计算机可读介质举例来说,如果使用同轴电缆、光纤电缆、双绞线、数字订户线(DSL)或例如红外线、无线电及微波等无线技术从网站、服务器或其它远程源传输指令,则同轴电缆、光纤电缆、双纹线、DSL或例如红外线、无线电及微波等无线技术包含于介质的定义中。然而应了解,计算机可读存储介质和数据存储介质不包含连接、载波、信号或其它瞬时(瞬态)介质,而是针对非瞬时有形存储介质。如本文中所使用,磁盘及光盘包含压缩光盘(CD)、激光光盘、光学光盘、数字多功能光盘(DVD)、软磁盘或蓝光光盘等,其中磁盘通常以磁性方式再生数据,而光盘使用激光以光学方式再生数据。上文的组合也应包含在计算机可读介质的范围内。
可由例如一个或多个数字信号理器(DSP)、通用微处理器、专用集成电路(ASIC)现场可编程逻辑阵列(FPGA)或其它等效集成或离散逻辑电路等一个或多个处理器来执行指令。因此,如本文中所使用的术语“处理器”可指上述结构或适合于实施本文中所描述的技术的任一其它结构中的任一者。另外,在一些方面中,本文描述的功能性可提供于经配置以用于编码和解码的专用硬件和/或软件模块内,或并入在组合式编解码器中。并且,可将所述技术完全实施于一个或多个电路或逻辑元件中。
本公开实施例的技术方案可在广泛多种装置或设备中实施,包含无线手机、集成电路(IC)或一组IC(例如,芯片组)。本公开实施例中描各种组件、模块或单元以强调经配置以执行所描述的技术的装置的功能方面,但不一定需要通过不同硬件单元来实现。而是,如上所述,各种单元可在编解码器硬件单元中组合或由互操作硬件单元(包含如上所述的一个或多个处理器)的集合结合合适软件和/或固件来提供。

Claims (36)

  1. 一种视频超分辨网络,包括生成网络,其中,所述生成网络包括依次连接的第一特征提取部分、第二特征提取部分和重建部分,其中:
    所述第一特征提取部分设置为接收第一视频帧序列,基于3D卷积从所述第一视频帧序列中提取第一特征并输出;
    所述第二特征提取部分,设置为接收所述第一特征,基于3D残差注意力机制从所述第一特征中提取时间和/或空间上的第二特征并输出;
    所述重建部分设置为接收所述第二特征,基于3D卷积和3D上采样实现特征融合和特征的时空超分辨,及基于3D卷积重建视频帧序列,生成第二视频帧序列,所述第二视频帧序列的分辨率大于所述第一视频帧序列的分辨率。
  2. 根据权利要求1所述的视频超分辨网络,其中:
    所述第一特征提取部分包括依次连接的3D卷积层和激活层,所述3D卷积层的输入为所述第一视频帧序列,所述激活层的输出为所述第一特征。
  3. 根据权利要求1所述的视频超分辨网络,其中:
    所述第二特征提取部分包括依次连接的多个残差注意力块RAB,第一个RAB的输入为所述第一特征,除第一个RAB之外的其他RAB的输入为前一RAB的输出,最后一个RAB的输出为所述第二特征;所述RAB包括依次连接的3D卷积层、激活层和3D注意力机制模型单元,所述RAB的输入送入所述3D卷积层,还跳跃连接与3D注意力机制模型单元的输出相加,得到的和作为所述RAB的输出。
  4. 根据权利要求3所述的视频超分辨网络,其中:
    所述3D注意力机制模型单元为3D卷积块注意力模型,所述3D卷积块注意力模型包括依次连接的3D通道注意力模块和3D空间注意力模块,所述3D注意力机制模型单元的输入送入所述3D通道注意力模块,所述3D通道注意力模块的输入与输出相乘得到的第一乘积作为所述3D空间注意力模块的输入,所述3D空间注意力模块的输出与所述第一乘积相乘得到的第二乘积,作为所述3D注意力机制模型单元的输出。
  5. 根据权利要求1所述的视频超分辨网络,其中:
    所述重建部分包括依次连接的以下单元:
    用于融合特征的3D卷积单元,包括依次连接的3D卷积层和激活层,所述用于融合特征的3D卷积单元的输入为所述第二特征;
    用于实现特征的时空超分辨的3D转置卷积单元,包括依次连接的3D转置卷积层和激活层,所述3D转置卷积单元的输入为所述用于融合特征的3D卷积单元的输出;及
    用于生成视频帧序列的3D卷积层,输入为所述3D转置卷积单元的输出,输出为所述第二视频帧序列;
    其中,所述第二视频帧序列的图像分辨率大于所述第一视频帧序列的图像分辨率,和/或,所述第二视频帧序列的视频帧率大于所述第一视频帧序列的视频帧率。
  6. 根据权利要求2至5中任一项的视频超分辨网络,其中:
    所述激活层使用的激活函数为带参数修正线性单元PReLu。
  7. 根据权利要求1所述的视频超分辨网络,其中:
    所述视频超分辨网络还包括:判别网络,设置为在训练时,以真实视频帧序列和所述生成网络训练时生成的所述第二视频帧序列为输入,从输入的视频帧序列中提取细节特征以及视频帧之间的运动信息特征,基于所述细节特征和运动信息特征确定输入的所述视频帧序列为真实视频帧序列的概率,其中,所述真实视频帧序列的分辨率与所述第二视频帧序的分辨率相同,所述生成网络训练时接收的所述第一视频帧序列通过对所述真实视频帧序列进行下采样而得到。
  8. 根据权利要求7所述的视频超分辨网络,其中:
    所述判别网络包括第一分支、第二分支、与所述第一分支和第二分支连接的信息融合单元,以及与所述信息融合单元连接的权重计算单元,其中:
    所述第一分支设置为基于特征提取网络从输入的视频帧序列中提取细节特征,基于所述细节特征进行 真伪判断;
    所述第二分支设置为基于光流网络从输入的视频帧序列中提取视频帧之间的运动信息特征,基于所述运动信息特征进行真伪判断;
    所述信息融合单元设置为对所述第一分支和第二分支输出的真伪判断的结果进行融合;
    所述权重计算单元设置为根据所述信息融合单元输出的融合后的信息进行权重计算,得到输入的视频帧序列为真实视频帧序列的概率。
  9. 根据权利要求8所述的视频超分辨网络,其中:
    所述信息融合单元采用全连接层实现;
    所述权重计算单元采用S形函数实现。
  10. 根据权利要求8所述的视频超分辨网络,其中:
    所述第一分支包括依次连接的以下单元:
    2D卷积单元,包括依次连接的2D卷积层和激活层;
    多个2D卷积加归一化单元,所述2D卷积加归一化单元包括依次连接的2D卷积层、批量归一化BN层和激活层;
    全连接单元,包括依次连接的全连接层和激活层。
  11. 根据权利要求8所述的视频超分辨网络,其中:
    所述第二分支包括依次连接的以下单元:
    N个2D卷积加归一化单元,包括依次连接的2D卷积层、BN层和激活层,N≥2;
    M个2D反卷积单元,包括2D反卷积层和激活层,M≥2;
    全连接单元,包括依次连接的全连接层和激活层。
  12. 根据权利要求11所述的视频超分辨网络,其中:
    第2i个2D卷积加归一化单元的输出还连接到第M-i+1个2D反卷积单元的输入,1≤i≤M,N=2M+1。
  13. 根据权利要求10或11所述的视频超分辨网络,其中:
    所述激活层使用的激活函数为带泄露修正线性单元LeakReLu。
  14. 一种视频超分辨处理方法,包括:
    基于3D卷积从所述第一视频帧序列中提取第一特征;
    基于3D残差注意力机制从所述第一特征中提取时间和/或空间上的第二特征;
    基于3D卷积和3D上采样实现所述第二特征的特征融合和特征的时空超分辨;及,基于3D卷积重建视频帧序列,生成第二视频帧序列,所述第二视频帧序列的分辨率大于所述第一视频帧序列的分辨率。
  15. 根据权利要求14所述的视频超分辨处理方法,其中:
    所述视频超分辨处理方法基于如权利要求1至13中任一项所述的视频超分辨网络实现。
  16. 根据权利要求14所述的视频超分辨处理方法,其中:
    所述第一视频帧序列是对码流解码输出的已解码视频帧序列;或者
    所述第一视频帧序列是视频采集设备采集到的原始视频帧序列;或者
    所述第一视频帧序列是从视频编码器的已解码图片缓冲器中获取的需要进行上采样的参考图像;或者
    所述第一视频帧序列是包括基本层和增强层的可分级视频编码架构中产生的基本层的重建视频帧序列或增强子层的重建视频帧序列;或者
    所述第一视频帧序列是包括基本层和增强层的可分级视频解码架构中产生的基本层的已解码视频帧序列或增强子层的组合中间图像。
  17. 根据权利要求14或15所述的视频超分辨处理方法,其中:
    所述第一视频帧序列是对码流解码输出的已解码视频帧序列;
    所述视频超分辨处理方法还包括:从码流中解析出编码端发送的视频超分辨网络的网络参数信息,及, 根据所述网络参数信息设置所述视频超分辨网络的网络参数。
  18. 一种视频解码方法,包括:
    对码流进行解码,得到第一视频帧序列;
    判断所述第一视频帧序列是否满足设定的超分辨条件;
    在满足设定的超分辨条件的情况下,将所述第一视频帧序列输出到视频超分辨网络进行视频超分辨处理,得到第二视频帧序列,所述第二视频帧序列的分辨率大于所述第一视频帧序列的分辨率。
  19. 根据权利要求18所述的视频解码处理方法,其中:
    所述视频超分辨网络采用如权利要求1至13中任一项所述的视频超分辨网络。
  20. 根据权利要求18所述的视频解码处理方法,其中:
    所述视频超分辨网络包括生成网络,所述生成网络训练时,以作为样本的第一视频帧序列为输入数据,以真实视频帧序列为目标数据,其中,所述真实视频帧序列的分辨率和所述第二视频帧序列的分辨率相同,作为样本的所述第一视频帧序列通过对所述真实视频帧序列进行下采样而得到的。
  21. 根据权利要求18或19或20所述的视频解码处理方法,其中:
    所述对码流进行解码,还得到一下采样标志,所述下采样标志用于指示编码端对所述第一视频帧序列的预处理是否包括下采样;
    所述设定的超分辨条件至少包括:所述下采样标志指示编码端对所述第一视频帧序列的预处理包括下采样。
  22. 根据权利要求18所述的视频解码处理方法,其中:
    所述设定的超分辨条件包括以下条件中的一种或任意组合:所述第一视频帧序列的图像质量低于设定的质量要求;编码端对所述第一视频帧序列的预处理包括下采样;及,解码端的视频超分辨功能处于可用状态;
    在所述第一视频帧序列不满足设定的超分辨条件的情况下,跳过对所述第一视频帧序列的超分辨处理。
  23. 一种视频编码处理方法,包括:
    进行视频预处理时确定是否对来自数据源的视频帧序列进行下采样;
    在确定不进行下采样的情况下,将来自数据源的所述视频帧序列直接输入视频编码器进行视频编码;
    在确定进行下采样的情况下,对来自数据源的所述视频帧序列进行下采样,将下采样后的视频帧序列输入视频编码器进行视频编码。
  24. 根据权利要求23所述的视频编码处理方法,其中:
    所述视频编码处理方法还包括:
    进行视频编码时,将一下采样标志写入码流,所述下采样标志用于指示编码端对来自数据源的所述视频帧序列的预处理是否包括下采样。
  25. 根据权利要求23所述的视频编码处理方法,其中:
    所述确定是否对来自数据源的视频帧序列进行下采样,包括:在满足以下条件中的任一种时,确定对来自数据源的视频帧序列进行下采样:
    可用于传输视频码流的带宽小于不进行下采样时传输视频码流所需的带宽:
    编码端的资源不支持对来自数据源的视频帧序列直接进行视频编码;
    所述来自数据源的视频帧序列属于指定的需要进行下采样的视频帧序列。
  26. 根据权利要求23或24或25所述的视频编码处理方法,其中:
    所述视频编码处理方法还包括:
    进行视频编码时,获取来自数据源的所述视频帧序列对应的视频超分辨网络的网络参数,将所述网络参数写入码流。
  27. 一种视频超分辨处理装置,包括处理器以及存储有可在所述处理器上运行的计算机程序的存储器,其中,所述处理器执行所述计算机程序时实现如权利要求14至17中任一所述的视频超分辨处理方法。
  28. 一种视频解码处理装置,包括处理器以及存储有可在所述处理器上运行的计算机程序的存储器,其中,所述处理器执行所述计算机程序时实现如权利要求18至22中任一所述的视频解码处理方法。
  29. 一种视频解码处理装置,包括:
    视频解码器,设置为对码流进行解码,得到第一视频帧序列;
    超分辨判决装置,设置为判断所述第一视频帧序列是否满足设定的超分辨条件,在满足设定的超分辨条件的情况下,将所述第一视频帧序列输出到视频超分辨网络进行视频超分辨处理;在不满足设定的超分辨条件的情况下,确定跳过对所述第一视频帧序列的视频超分辨处理;
    视频超分辨网络,设置为对所述第一视频帧序列进行视频超分辨处理,得到分辨率大于所述第一视频帧序列的第二视频帧序列。
  30. 根据权利要求29所述的视频解码处理装置,其中:
    所述视频解码器对码流进行解码时,还从码流中提取一下采样标志,所述下采样标志用于指示编码端对所述第一视频帧序列的预处理是否包括下采样;
    所述超分辨判决装置使用的所述超分辨条件至少包括:所述下采样标志指示编码端对所述第一视频帧序列的预处理包括下采样。
  31. 一种视频编码处理装置,包括处理器以及存储有可在所述处理器上运行的计算机程序的存储器,其中,所述处理器执行所述计算机程序时实现如权利要求23至26中任一所述的视频编码处理方法。
  32. 一种视频编码处理装置,其中,包括:
    下采样判决模块,设置为进行视频预处理时确定是否对来自数据源的视频帧序列进行下采样,在确定进行下采样的情况下,将来自数据源的所述视频帧序列输出到下采样装置,在确定不进行下采样的情况下,将来自数据源的所述视频帧序列直接输出到视频编码器进行编码;
    下采样装置,设置为对输入的视频帧序列进行下采样,将下采样后的视频帧序列输出到视频编码器进行编码;
    视频编码器,设置为对来自数据源的所述视频帧序列或者下采样后的所述视频帧序列进行视频编码。
  33. 根据权利要求32所述的视频编码处理装置,其中:
    所述下采样判决装置还设置为生成下采样标志并输出到所述视频编码器,所述下采样标志用于指示编码端对来自数据源的所述视频帧序列的预处理是否包括下采样;
    所述视频编码器还设置为在进行视频编码时,将所述下采样标志写入码流。
  34. 一种视频编解码系统,包括如权利要求31至33中任一所述的视频编码处理装置和如权利要求28至30中任一所述的视频解码处理装置。
  35. 一种码流,其中,所述码流包括根据如权利要求24所述的视频编码处理方法生成,所述码流中包含所述下采样标志;或者,所述码流包括根据如权利要求26所述的视频编码处理方法生成,所述码流中包含所述网络参数。
  36. 一种非瞬态计算机可读存储介质,所述计算机可读存储介质存储有计算机程序,其中,所述计算机程序被处理器执行时实现如权利要求14至26中任一所述的方法。
PCT/CN2021/107449 2021-07-20 2021-07-20 视频超分辨网络及视频超分辨、编解码处理方法、装置 WO2023000179A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN202180100597.5A CN117730338A (zh) 2021-07-20 2021-07-20 视频超分辨网络及视频超分辨、编解码处理方法、装置
PCT/CN2021/107449 WO2023000179A1 (zh) 2021-07-20 2021-07-20 视频超分辨网络及视频超分辨、编解码处理方法、装置
EP21950444.6A EP4365820A1 (en) 2021-07-20 2021-07-20 Video super-resolution network, and video super-resolution, encoding and decoding processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/107449 WO2023000179A1 (zh) 2021-07-20 2021-07-20 视频超分辨网络及视频超分辨、编解码处理方法、装置

Publications (1)

Publication Number Publication Date
WO2023000179A1 (zh)

Family

ID=84979815

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/107449 WO2023000179A1 (zh) 2021-07-20 2021-07-20 视频超分辨网络及视频超分辨、编解码处理方法、装置

Country Status (3)

Country Link
EP (1) EP4365820A1 (zh)
CN (1) CN117730338A (zh)
WO (1) WO2023000179A1 (zh)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108769682A (zh) * 2018-06-20 2018-11-06 腾讯科技(深圳)有限公司 视频编码、解码方法、装置、计算机设备和存储介质
CN112543347A (zh) * 2019-09-23 2021-03-23 腾讯美国有限责任公司 基于机器视觉编解码的视频超分辨率方法和系统
CN112801877A (zh) * 2021-02-08 2021-05-14 南京邮电大学 一种视频帧的超分辨率重构方法
CN112950471A (zh) * 2021-02-26 2021-06-11 杭州朗和科技有限公司 视频超分处理方法、装置、超分辨率重建模型、介质
CN113052764A (zh) * 2021-04-19 2021-06-29 东南大学 一种基于残差连接的视频序列超分重建方法

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230254592A1 (en) * 2022-02-07 2023-08-10 Robert Bosch Gmbh System and method for reducing transmission bandwidth in edge cloud systems
CN116091317A (zh) * 2023-02-02 2023-05-09 苏州大学 扫描电镜二次电子图像超分辨方法和系统
CN116634209A (zh) * 2023-07-24 2023-08-22 武汉能钠智能装备技术股份有限公司 一种基于热插拔的断点视频恢复系统及方法
CN116634209B (zh) * 2023-07-24 2023-11-17 武汉能钠智能装备技术股份有限公司 一种基于热插拔的断点视频恢复系统及方法
CN117041669A (zh) * 2023-09-27 2023-11-10 湖南快乐阳光互动娱乐传媒有限公司 视频流的超分控制方法、装置及电子设备
CN117041669B (zh) * 2023-09-27 2023-12-08 湖南快乐阳光互动娱乐传媒有限公司 视频流的超分控制方法、装置及电子设备

Also Published As

Publication number Publication date
EP4365820A1 (en) 2024-05-08
CN117730338A (zh) 2024-03-19

Legal Events

121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 21950444; Country of ref document: EP; Kind code of ref document: A1)
WWE Wipo information: entry into national phase (Ref document number: 202180100597.5; Country of ref document: CN)
WWE Wipo information: entry into national phase (Ref document number: 2021950444; Country of ref document: EP)
ENP Entry into the national phase (Ref document number: 2021950444; Country of ref document: EP; Effective date: 20240202)
NENP Non-entry into the national phase (Ref country code: DE)