WO2023000179A1 - Video super-resolution network and video super-resolution and codec processing method and device - Google Patents
Video super-resolution network and video super-resolution and codec processing method and device
- Publication number
- WO2023000179A1 (PCT/CN2021/107449)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- video
- resolution
- frame sequence
- video frame
- super
- Prior art date
Links
- 238000003672 processing method Methods 0.000 title claims abstract description 58
- 238000012545 processing Methods 0.000 claims abstract description 105
- 239000000284 extract Substances 0.000 claims abstract description 14
- 238000000034 method Methods 0.000 claims description 46
- 238000005070 sampling Methods 0.000 claims description 46
- 238000007781 pre-processing Methods 0.000 claims description 41
- 230000004913 activation Effects 0.000 claims description 37
- 230000006870 function Effects 0.000 claims description 28
- 238000012549 training Methods 0.000 claims description 28
- 238000004590 computer program Methods 0.000 claims description 26
- 230000007246 mechanism Effects 0.000 claims description 24
- 238000000605 extraction Methods 0.000 claims description 22
- 230000004927 fusion Effects 0.000 claims description 20
- 238000010606 normalization Methods 0.000 claims description 16
- 230000002123 temporal effect Effects 0.000 claims description 10
- 238000004364 calculation method Methods 0.000 claims description 8
- 230000003287 optical effect Effects 0.000 claims description 7
- 238000007906 compression Methods 0.000 abstract description 14
- 230000008569 process Effects 0.000 description 19
- 238000010586 diagram Methods 0.000 description 13
- 230000000694 effects Effects 0.000 description 12
- 238000012805 post-processing Methods 0.000 description 12
- 238000013139 quantization Methods 0.000 description 11
- 230000006835 compression Effects 0.000 description 9
- 230000005540 biological transmission Effects 0.000 description 6
- 230000006872 improvement Effects 0.000 description 6
- 238000004422 calculation algorithm Methods 0.000 description 5
- 238000005516 engineering process Methods 0.000 description 5
- 230000009466 transformation Effects 0.000 description 5
- 238000012952 Resampling Methods 0.000 description 4
- 238000003491 array Methods 0.000 description 4
- 238000001914 filtration Methods 0.000 description 4
- 238000011176 pooling Methods 0.000 description 4
- 230000003044 adaptive effect Effects 0.000 description 3
- 238000013528 artificial neural network Methods 0.000 description 3
- 230000008859 change Effects 0.000 description 3
- 238000004891 communication Methods 0.000 description 3
- 238000013500 data storage Methods 0.000 description 3
- 238000013135 deep learning Methods 0.000 description 2
- 238000002474 experimental method Methods 0.000 description 2
- 239000004973 liquid crystal related substance Substances 0.000 description 2
- 238000000926 separation method Methods 0.000 description 2
- 230000000007 visual effect Effects 0.000 description 2
- 230000009286 beneficial effect Effects 0.000 description 1
- 239000012141 concentrate Substances 0.000 description 1
- 230000000593 degrading effect Effects 0.000 description 1
- 230000001934 delay Effects 0.000 description 1
- 230000001419 dependent effect Effects 0.000 description 1
- 230000009977 dual effect Effects 0.000 description 1
- 238000011156 evaluation Methods 0.000 description 1
- 230000003631 expected effect Effects 0.000 description 1
- 239000000835 fiber Substances 0.000 description 1
- 238000009499 grossing Methods 0.000 description 1
- 238000011423 initialization method Methods 0.000 description 1
- 238000010801 machine learning Methods 0.000 description 1
- 230000000717 retained effect Effects 0.000 description 1
- 230000009897 systematic effect Effects 0.000 description 1
- 238000012360 testing method Methods 0.000 description 1
- 238000012546 transfer Methods 0.000 description 1
Images
Classifications
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/102—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
- H04N19/132—Sampling, masking or truncation of coding units, e.g. adaptive resampling, frame skipping, frame interpolation or high-frequency transform coefficient masking
Definitions
- the general video compression process is shown in Figure 1.
- at the encoding end, it includes processes such as video acquisition, video preprocessing, and video encoding.
- at the decoding end, it includes processes such as video decoding, video post-processing, and display playback.
- during video preprocessing, the frame rate of the video is sometimes reduced due to restrictions such as bandwidth and bit rate, and image quality is further reduced during video compression encoding.
- the video post-processing stage after video decoding is therefore an important step for improving video quality, but its improvement effect needs to be enhanced.
- An embodiment of the present disclosure provides a video super-resolution network, including a generation network, wherein the generation network includes a first feature extraction part, a second feature extraction part, and a reconstruction part connected in sequence, wherein:
- the first feature extraction part is configured to receive a first video frame sequence, extract a first feature from the first video frame sequence based on 3D convolution and output it;
- the second feature extraction part is configured to receive the first feature, extract temporal and/or spatial second features from the first feature based on the 3D residual attention mechanism, and output them;
- the reconstruction part is configured to receive the second feature, realize feature fusion and feature spatio-temporal super-resolution based on 3D convolution and 3D upsampling, and reconstruct a video frame sequence based on 3D convolution to generate a second video frame sequence, where the resolution of the second video frame sequence is greater than that of the first video frame sequence.
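The three-stage pipeline above can be traced as a shape calculation. This is an illustrative sketch only: the x2 temporal and x2 spatial upsampling factors are assumptions for the example, not values specified by the patent, and the `(frames, height, width)` convention is chosen here for readability.

```python
# Hypothetical shape trace through the generation network described above
# (first feature extraction -> residual attention blocks -> reconstruction).
# The t_scale/s_scale factors are illustrative assumptions.

def generation_network_shapes(frames, height, width, t_scale=2, s_scale=2):
    """Trace tensor shapes (frames, height, width) through the three stages."""
    # Stage 1: 3D convolution extracts shallow features, size unchanged.
    first_feature = (frames, height, width)
    # Stage 2: the chain of residual attention blocks also keeps the size.
    second_feature = first_feature
    # Stage 3: 3D transposed convolution upsamples both time and space.
    reconstructed = (frames * t_scale, height * s_scale, width * s_scale)
    return first_feature, second_feature, reconstructed

shallow, deep, out = generation_network_shapes(8, 180, 320)
print(out)  # (16, 360, 640): higher frame rate and higher spatial resolution
```

With these assumed factors, an 8-frame 180x320 input yields a 16-frame 360x640 output, i.e. both temporal and spatial super-resolution from one network.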
- An embodiment of the present disclosure also provides a video super-resolution processing method, including:
- feature fusion of the second feature and spatio-temporal super-resolution of the feature are realized, and a video frame sequence is reconstructed based on 3D convolution to generate a second video frame sequence, where the resolution of the second video frame sequence is greater than the resolution of the first video frame sequence.
- An embodiment of the present disclosure also provides a video decoding processing method, including:
- the first video frame sequence is output to the video super-resolution network for video super-resolution processing to obtain a second video frame sequence, whose resolution is greater than the resolution of the first video frame sequence.
- An embodiment of the present disclosure also provides a video coding processing method, including:
- An embodiment of the present disclosure also provides a video super-resolution processing device, including a processor and a memory storing a computer program that can run on the processor, wherein, when the processor executes the computer program, the video super-resolution processing method described in any embodiment of the present disclosure is implemented.
- An embodiment of the present disclosure also provides a video decoding processing device, including a processor and a memory storing a computer program that can run on the processor, wherein, when the processor executes the computer program, the video decoding processing method described in any embodiment of the present disclosure is implemented.
- An embodiment of the present disclosure also provides a video decoding processing device, including:
- the super-resolution judging device is configured to judge whether the first video frame sequence satisfies a set super-resolution condition; if the set super-resolution condition is met, output the first video frame sequence to the video super-resolution network for video super-resolution processing; if the set super-resolution condition is not met, determine to skip video super-resolution processing of the first video frame sequence;
- the video super-resolution network is configured to perform video super-resolution processing on the first video frame sequence to obtain a second video frame sequence with a resolution greater than that of the first video frame sequence.
- An embodiment of the present disclosure also provides a video encoding processing device, including a processor and a memory storing a computer program that can run on the processor, wherein, when the processor executes the computer program, the video encoding processing method described in any embodiment of the present disclosure is implemented.
- An embodiment of the present disclosure also provides a video encoding processing device, including:
- the down-sampling decision module is configured to determine, when performing video preprocessing, whether to down-sample the video frame sequence from the data source; when it is determined to perform down-sampling, output the video frame sequence from the data source to the down-sampling device, and when it is determined not to perform down-sampling, directly output the video frame sequence from the data source to a video encoder for encoding;
- a downsampling device is configured to downsample the input video frame sequence, and output the downsampled video frame sequence to a video encoder for encoding;
- a video encoder configured to perform video encoding on the sequence of video frames from a data source or the sequence of downsampled video frames.
- An embodiment of the present disclosure further provides a video encoding and decoding system, including the video encoding processing device as described in the embodiment of the present disclosure and the video decoding processing device as described in the embodiment of the present disclosure.
- An embodiment of the present disclosure further provides a code stream, wherein the code stream is generated according to the video coding processing method described in the embodiment of the present disclosure, and the code stream includes the downsampling flag.
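The code stream above carries a downsampling flag, but the patent text in this excerpt does not specify its binary layout. The following is a purely hypothetical sketch of packing such a flag into a one-byte header and reading it back; the bit position and header format are assumptions for illustration only.

```python
# Hypothetical layout: lowest bit of a one-byte header carries the
# downsampling flag (layout is NOT specified by the patent text).

def write_header(downsampled: bool) -> bytes:
    """Pack the assumed downsampling flag into a header byte."""
    flags = 0b00000001 if downsampled else 0b00000000
    return bytes([flags])

def read_downsample_flag(header: bytes) -> bool:
    """Recover the assumed downsampling flag from the header byte."""
    return bool(header[0] & 0b00000001)

header = write_header(True)
print(read_downsample_flag(header))  # True
```

A decoder reading such a flag would then know whether to route the decoded frames through the video super-resolution network.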
- An embodiment of the present disclosure also provides a non-transitory computer-readable storage medium storing a computer program, wherein, when the computer program is executed by a processor, the method described in any embodiment of the present disclosure is implemented.
- Fig. 1 is the schematic diagram of video compression process
- Figure 2 is an architecture diagram of a generative adversarial network;
- Fig. 3 is a structural diagram of generating a network according to an embodiment of the present disclosure.
- FIG. 4 is a schematic structural diagram of a 3D residual attention mechanism model according to an embodiment of the present disclosure
- FIG. 5 is a structural diagram of a discrimination network according to an embodiment of the present disclosure.
- FIG. 6 is a flowchart of a video super-resolution processing method according to an embodiment of the present disclosure
- FIG. 7A is a schematic diagram of super-resolution of a sequence of decoded video frames according to an embodiment of the present disclosure
- FIG. 7B is a structural diagram of a video decoder according to an embodiment of the present disclosure.
- FIG. 8A is a structural diagram of a video encoder according to an embodiment of the present disclosure.
- Fig. 8B is a schematic diagram of a scalable video coding architecture according to an embodiment of the present disclosure, only showing the parts closely related to upsampling and downsampling;
- Fig. 8C is a schematic diagram of a scalable video decoding architecture according to an embodiment of the present disclosure, only showing the part closely related to upsampling;
- FIG. 9 is a flowchart of a video encoding processing method according to an embodiment of the present disclosure.
- FIG. 10 is a flowchart of a video decoding processing method according to an embodiment of the present disclosure corresponding to the video encoding processing method shown in FIG. 9;
- FIG. 11 is an architecture diagram of a video encoding and decoding system according to an embodiment of the present disclosure.
- Fig. 12 is a schematic structural diagram of a video encoding processing device according to an embodiment of the present disclosure.
- words such as "exemplary" or "for example" are used to mean an example, instance, or illustration. Any embodiment described in this disclosure as "exemplary" or "for example" should not be construed as preferred or advantageous over other embodiments.
- "And/or" in this document describes an association relationship between associated objects, indicating that three relationships may exist; for example, A and/or B can mean: A exists alone, A and B exist simultaneously, or B exists alone.
- “A plurality” means two or more than two.
- words such as "first" and "second" are used to distinguish identical or similar items with substantially the same function and effect. Those skilled in the art will understand that words such as "first" and "second" do not limit the quantity or execution order, and do not require the items to be different.
- Video post-processing mainly addresses the quality loss incurred during video pre-processing, video encoding, and video decoding, so as to enhance video image quality and increase the number of video frames.
- some methods use filters to filter the compressed image.
- these methods mainly improve the visual effect of the image by smoothing the distortion introduced by video compression, rather than restoring the pixel value of the image itself.
- a non-block matching based frame rate increasing algorithm and a block matching based frame rate increasing algorithm may be used.
- the frame rate improvement algorithm not based on block matching does not consider the motion of objects in the image, and only uses adjacent video frames to perform linear interpolation to generate new video frames.
- the frame rate improvement algorithm based on block matching improves the frame rate by estimating the motion vector of the object and interpolating along the object's motion trajectory; the quality of the interpolated video frames is better, but the complexity is higher.
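The non-block-matching variant above can be sketched in a few lines: the inserted frame is a pixel-wise linear blend of its two neighbours, with no motion estimation (which is why it produces artefacts on moving objects, as the text notes). Frames are represented here as nested lists of grayscale values for simplicity.

```python
# Minimal sketch of non-block-matching frame rate up-conversion:
# linear interpolation between two adjacent frames, no motion model.

def interpolate_frame(prev_frame, next_frame, t=0.5):
    """Blend two frames (nested lists of pixel values) at position t in [0, 1]."""
    return [[(1 - t) * p + t * n for p, n in zip(prow, nrow)]
            for prow, nrow in zip(prev_frame, next_frame)]

f0 = [[0, 0], [0, 0]]
f1 = [[100, 100], [100, 100]]
mid = interpolate_frame(f0, f1)
print(mid)  # [[50.0, 50.0], [50.0, 50.0]]
```

The block-matching variant would instead shift pixels along estimated motion vectors before blending, trading this simplicity for quality.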
- Super-resolution refers to improving the resolution of an original image by means of hardware or software.
- HR: High Resolution
- LR: Low Resolution
- Super-resolution technology can reconstruct low-resolution video into high-resolution video images through deep learning methods, bringing users a good video experience.
- GAN: Generative Adversarial Network
- the network includes a generation network G that can capture the data distribution, also called a generator;
- and a discrimination network D that can estimate the probability that data originates from real samples, also called a discriminator.
- the generation network and the discrimination network are trained at the same time, and the two networks fight against each other to achieve the best generation effect.
- the input to generative network training is a low-resolution image, and the output is a reconstructed super-resolved image.
- the inputs of discriminative network training are super-resolution images and real images, and the output is the probability that the input image comes from a real image.
- Low-resolution images can be obtained by downsampling the real image.
- the training process of the generation network is to maximize the probability of the discriminator making mistakes, so that the discriminator mistakenly believes that the data is a real image (true sample) rather than a super-resolution image (false sample) generated by the generator.
- the training goal of the discriminative network is to maximize the separation of real samples and fake samples. Therefore, this framework corresponds to a minimax game between two players.
- a unique equilibrium solution can be obtained, so that after the fake samples generated by the generation network enter the discriminant network, the result given by the discriminant network is a value close to 0.5.
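The minimax game and the 0.5 equilibrium described above correspond to the standard GAN objective (given here in its usual textbook form, not quoted from the patent text):

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\left[\log D(x)\right] +
  \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right]
```

At the unique equilibrium the generator's distribution matches the data distribution, so the optimal discriminator outputs \(D^{*}(x) = 1/2\) for every input, which is the "value close to 0.5" mentioned above.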
- One approach realizes super-resolution based on a generative adversarial network; the corresponding network is a super-resolution generative adversarial network (SRGAN: Super-Resolution Generative Adversarial Networks). In the SRGAN network framework:
- the core of the generation network is multiple residual blocks with the same layout, using batch normalization (BN) layers and the rectified linear unit (ReLU) as the activation function, and using two trained sub-pixel convolutional layers to increase the resolution of the input image.
- the discrimination network consists of 8 convolutional layers whose number of kernels increases by a factor of 2 from 64 to 512.
- the resulting 512 feature maps are followed by two fully connected layers (also known as dense layers) and a final sigmoid activation function to obtain the probability of the sample category.
- However, SRGAN cannot achieve temporal and spatial super-resolution at the same time to fully extract useful features of different dimensions, so its improvement of video quality is limited.
- Its discriminative network has a single structure, does not use optical flow information, and its discriminative ability is limited. Therefore, the quality of high-resolution images reconstructed by this network still needs to be improved.
- An embodiment of the present disclosure provides a video super-resolution network, including a generation network for realizing a video spatio-temporal super-resolution function.
- the generation network uses 3D convolution to realize the spatio-temporal super-resolution function: it first performs shallow feature extraction based on 3D convolution, and then uses a series of residual attention blocks (RAB: Residual Attention Block) for deep feature extraction.
- Each RAB block itself uses a residual learning method and a 3D attention mechanism to further improve the quality of spatio-temporal super-resolution.
- the generation network includes a first feature extraction part, a second feature extraction part and a reconstruction part connected in sequence, wherein:
- the first feature extraction part 10 is configured to receive a first video frame sequence, extract a first feature from the first video frame sequence based on 3D convolution and output it;
- the second feature extraction part 20 is configured to receive the first feature, extract the second feature in time and/or space from the first feature based on the 3D residual attention mechanism and output it;
- the reconstruction part 30 is configured to receive the second feature, realize feature fusion and feature spatio-temporal super-resolution based on 3D convolution and 3D upsampling, and reconstruct a video frame sequence based on 3D convolution to generate a second video frame sequence, where the resolution of the second video frame sequence is greater than the resolution of the first video frame sequence.
- the above-mentioned second features are extracted from the first features, and the first features may also be called shallow features, and the second features may be called deep features.
- the above-mentioned first video frame sequence may also be called a low-resolution video frame sequence, and the second video frame sequence may be called a high-resolution video frame sequence or a super-resolution video frame sequence.
- the aforementioned image resolution and video frame rate may be collectively referred to as resolution, where the image resolution may also be referred to as spatial resolution, and the video frame rate may also be referred to as temporal resolution.
- the first feature extraction part includes a sequentially connected 3D convolutional layer and an activation layer, such as Conv3d and PReLU in FIG. 3; the input of the 3D convolutional layer is the first video frame sequence, and the output of the activation layer is the first feature.
- the second feature extraction part includes a plurality of residual attention blocks (RABs) connected in sequence, as shown in FIG. 3: the input of the first RAB is the first feature, the input of each RAB other than the first is the output of the previous RAB, and the output of the last RAB is the second feature; each RAB includes sequentially connected 3D convolutional layers, activation layers, and a 3D attention mechanism model unit, such as Conv3d, PReLU and 3D-CBAM in FIG. 3.
- the input of the RAB is sent to the 3D convolutional layer and is also added, via a skip connection, to the output of the 3D attention mechanism model unit; the resulting sum is used as the output of the RAB.
- the 3D attention mechanism model unit adopts a 3D convolutional block attention module (3D-CBAM: 3D Convolutional Block Attention Module), as shown in FIG. 4. The 3D convolutional block attention module includes a 3D channel attention module 60 and a 3D spatial attention module 70. The input of the 3D attention mechanism model unit is sent into the 3D channel attention module; the first product, obtained by multiplying the input and output of the 3D channel attention module, is used as the input of the 3D spatial attention module; and the second product, obtained by multiplying the output of the 3D spatial attention module with the first product, is used as the output of the 3D attention mechanism model unit.
- In related schemes, the attention mechanism is designed in two-dimensional space.
- the 3D-CBAM of the embodiment of the present disclosure expands on the two-dimensional basis, adding a depth dimension.
- 3D-CBAM sequentially infers channel attention feature maps and spatial attention feature maps.
- the input feature map is subjected to maximum pooling and mean pooling over width, height and depth, fed into a shared multi-layer perceptron, and the two results are then summed element-wise.
- the summed result is activated by the sigmoid function to generate the initial channel feature map, and the initial channel feature map is multiplied by the input feature map to generate the final channel feature map.
- the above final channel feature map is used as the input feature map of the 3D spatial attention module; channel-based maximum pooling and mean pooling operations are performed on it, the extracted features are merged along the channel dimension and then reduced to one channel through a convolution operation (such as a 7×7 or 3×3 convolution), and the result is activated by the sigmoid function to generate the spatial attention feature map.
- the generated spatial attention feature map is multiplied by the input final channel feature map to obtain the feature map output by the 3D-CBAM.
- the 3D attention mechanism considers both spatial and temporal changes when extracting features, which is better suited to the purpose of the video super-resolution network in the embodiment of the present disclosure and enables better adaptive learning.
- the 3D channel attention module pays more attention to which channels play a role in the final super-resolution, and selects the features that play a decisive role in the prediction.
- the 3D spatial attention module focuses on which pixel positions play a more important role in the prediction of the network. The joint use of these two attention mechanism modules maximizes the learning ability of the network and obtains better spatio-temporal super-resolution results.
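The channel-then-spatial flow described above can be sketched with numpy. This is a simplified stand-in, not the patented module: the shared multi-layer perceptron is replaced by an identity (the pooled vectors are just summed), and the 7×7 convolution by a plain average, so only the pooling → sigmoid → multiply structure mirrors the text.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cbam_3d(x):
    """Simplified 3D-CBAM sketch; x has shape (channels, depth, height, width)."""
    c = x.shape[0]
    # -- 3D channel attention: max- and mean-pool over (depth, height, width) --
    max_pool = x.max(axis=(1, 2, 3))              # (c,)
    mean_pool = x.mean(axis=(1, 2, 3))            # (c,)
    channel_map = sigmoid(max_pool + mean_pool)   # shared-MLP stand-in
    x = x * channel_map.reshape(c, 1, 1, 1)       # first product
    # -- 3D spatial attention: max- and mean-pool over the channel axis --
    max_sp = x.max(axis=0)                        # (depth, height, width)
    mean_sp = x.mean(axis=0)
    spatial_map = sigmoid((max_sp + mean_sp) / 2.0)  # 7x7-conv stand-in
    return x * spatial_map                        # second product

out = cbam_3d(np.ones((4, 2, 3, 3)))
print(out.shape)  # (4, 2, 3, 3): attention rescales but preserves the shape
```

The key property, visible even in this toy version, is that both attention maps only reweight the feature map; its shape is unchanged, so the module drops into the RAB without affecting the surrounding convolutions.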
- the reconstruction part 30 includes the following units connected in sequence:
- a 3D convolution unit for fusing features, including sequentially connected 3D convolution layers and activation layers (such as Conv3D and PReLU in Figure 3); the input of this 3D convolution unit is the second feature;
- PReLU is a Parametric Rectified Linear Unit.
- a 3D transposed convolution unit for realizing the spatio-temporal super-resolution of features, including a sequentially connected 3D transposed convolution layer and activation layer (ConvTrans-3D and PReLU in FIG. 3); its input is the output of the 3D convolution unit used to fuse features, and the 3D transposed convolution realizes the upsampling function;
- a 3D convolutional layer (such as Conv3D in FIG. 3) for generating a video frame sequence; its input is the output of the 3D transposed convolution unit, and its output is the second video frame sequence.
- the activation function used by the activation layers here is PReLU; there are many kinds of activation functions, and other activation functions may also be used.
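For reference, the PReLU activation used throughout the generation network is a one-line formula, sketched here in plain Python. LeakyReLU, used later in the discrimination network, is the same formula with the negative slope fixed as a constant instead of learned.

```python
# PReLU: identity for non-negative inputs, a learnable slope `a` otherwise.
# (LeakyReLU is identical in form, but `a` is a fixed hyperparameter.)

def prelu(x, a=0.25):
    """f(x) = x if x >= 0 else a * x; `a` is a learned parameter in PReLU."""
    return x if x >= 0 else a * x

print(prelu(3.0))   # 3.0
print(prelu(-2.0))  # -0.5
```

Unlike plain ReLU, the non-zero negative slope keeps gradients flowing for negative activations, which is one common reason such variants are chosen for GAN training.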
- the characteristics of the above generation network in the embodiment of the present disclosure include: using 3D convolution, the temporal and spatial features of the video can be extracted at the same time, making feature extraction more accurate and, compared with methods that extract temporal and spatial information separately, further reducing the consumption of computing resources; the generation network adopts a 3D attention mechanism, which can effectively concentrate the features extracted by the network so as to obtain better reconstruction results; and the generation network can use a variable number of RAB blocks, making the network structure more flexible, since the number of RABs can be freely selected according to computing resources to meet the needs of different scenarios.
- the generated network can be used independently as a video super-resolution network to complete the video super-resolution function.
- the video super-resolution network further includes a discriminant network, and the overall architecture of the video super-resolution network composed of the generation network and the discriminant network is shown in FIG. 2 .
- the input during the training of the discriminant network is a sequence of real video frames and a sequence of second video frames generated by the generating network, which are respectively used as real samples and fake samples input to the discriminant network.
- the output of the discrimination network is the probability that the input video frame sequence is a real video frame sequence.
- the first video frame sequence used as input data during network training is obtained by degrading the real video frame sequence.
- the first video frame sequence used as the training set may be obtained by performing one or more of downsampling, blurring, adding noise, and compression codec on the real video frame sequence.
- There are many methods of downsampling, including linear methods, such as nearest neighbor sampling, bilinear sampling, bicubic downsampling, and mean downsampling; and non-linear methods, such as neural network downsampling.
- a variety of downsampling multiples can be set to obtain the first video frame sequence of different resolutions, so as to train multiple sets of network parameters.
- the network parameters of the video super-resolution network can be flexibly set as needed to obtain different super-resolution effects.
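Two of the linear downsampling methods listed above can be sketched for a fixed 2x factor on a grayscale frame (represented as nested lists): nearest-neighbour keeps one pixel per 2x2 block, while mean downsampling averages the block. These are generic textbook implementations, not the patent's specific procedure.

```python
# 2x downsampling sketches used to degrade real frames into training inputs.

def nearest_downsample_2x(frame):
    """Keep every second pixel in every second row."""
    return [row[::2] for row in frame[::2]]

def mean_downsample_2x(frame):
    """Average each non-overlapping 2x2 block (assumes even dimensions)."""
    h, w = len(frame), len(frame[0])
    return [[(frame[i][j] + frame[i][j + 1] +
              frame[i + 1][j] + frame[i + 1][j + 1]) / 4.0
             for j in range(0, w, 2)]
            for i in range(0, h, 2)]

frame = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
print(nearest_downsample_2x(frame))  # [[1, 3], [9, 11]]
print(mean_downsample_2x(frame))     # [[3.5, 5.5], [11.5, 13.5]]
```

Applying such functions at several factors to a real frame sequence would produce the multiple low-resolution training sets mentioned above.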
- the discrimination network includes a first branch, a second branch, an information fusion unit connected to the first branch and the second branch, and a weight calculation unit connected to the information fusion unit, wherein:
- the second branch 50 is configured to extract motion information features between video frames from the input video frame sequence based on the optical flow network, and perform authenticity judgment based on the motion information features;
- the weight calculation unit is configured to perform weight calculation according to the fused information output by the information fusion unit to obtain the probability that the input video frame sequence is a real video frame sequence.
- the information fusion unit is implemented using a fully connected layer (such as dense(1) in Figure 5); the weight calculation unit is implemented using an S-shaped function (such as the sigmoid function in Figure 5).
- the first branch 40 includes the following units connected in sequence:
- a 2D convolution unit including a sequentially connected 2D convolution layer and activation layer, such as Conv_1 and LeakyReLU in Figure 5;
- a plurality of 2D convolution plus normalization units, each including sequentially connected 2D convolution layers, BN layers and activation layers; the Conv_2 layer, BN layer and LeakyReLU in Figure 5 form one 2D convolution plus normalization unit, and the other 2D convolution plus normalization units are represented by CBL_2 to CBL_8 in the figure. 7 CBLs are used in the example of FIG. 5, but the present disclosure is not limited to this number; the BN layer is used to speed up the network convergence rate.
- the second branch 50 includes the following units connected in sequence:
- M 2D deconvolution units, each including a 2D deconvolution layer and an activation layer, where M ≥ 2; four 2D deconvolution units are shown in Figure 5, with the 2D deconvolution layers denoted DeConv5, DeConv4, DeConv3 and DeConv2, and the activation layer being LeakyReLU;
- the fully connected unit includes a sequentially connected fully connected layer and an activation layer, such as Dense (1024) and LeakyReLU in the second branch 50 in FIG. 5 .
- the connection relationship is shown in Figure 5. This kind of network structure can realize the extraction of motion information features between video frames and the authenticity judgment.
- the activation function used by the activation layers in the discrimination network is LeakyReLU (Leaky Rectified Linear Unit).
- There are many kinds of activation functions, and other activation functions can also be used here.
- K represents the size of the convolution kernel (kernel)
- s represents the step size (stride)
- n represents the number of convolution kernels (number).
- K3 means that the convolution kernel size is 3
- s1 means that the step size is 1
- n64 means that the number of convolution kernels is 64, and so on; the unit of the convolution kernel size and step size can be pixels.
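- With this K/s/n notation, the output spatial size of a convolution layer follows the standard formula; a small helper (hypothetical, for illustration) makes the arithmetic concrete:

```python
def conv2d_output_size(in_size: int, kernel: int, stride: int,
                       padding: int = 0) -> int:
    """Standard 2D convolution output-size formula, per spatial dimension."""
    return (in_size + 2 * padding - kernel) // stride + 1

# Example with K3 s1 (padding 1 preserves the spatial size); n64 only sets
# the number of output channels and does not affect the spatial size.
```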
- the convolutional layer parameters in the second branch 50 are set as follows:
- the discriminant network in the embodiment of the present disclosure adopts two discriminant criteria, one is the feature of the video frame itself, and the other is the motion information between the video frames.
- the discriminant network includes two branches and is a U-shaped network structure as a whole: one branch extracts detailed features of the input video frame sequence for judgment, and the other branch uses the optical flow network to obtain motion information features of the input video frame sequence for judgment.
- In this way, the authenticity probability of the input video frames can be identified more accurately, that is, whether the input is a real video frame sequence or a super-resolution video frame sequence (i.e., the second video frame sequence).
- the use of 3D residual attention mechanism can better extract useful features in different dimensions and improve video quality.
- the embodiments of the present disclosure are based on a generative adversarial video spatio-temporal super-resolution network, which can simultaneously improve the spatial resolution and temporal resolution of the video, that is, perform super-resolution in both space and time using multi-dimensional feature information. It can significantly enhance the image quality and frame rate of low-resolution video frame sequences, and a single network achieves both video frame image super-resolution and frame rate improvement.
- the video spatio-temporal super-resolution network of the embodiment of the present disclosure places the use of motion information in the discriminant network. Compared with using optical flow information for motion estimation in the generation network, this can exploit real video information to further improve the performance of the entire network and the quality of video super-resolution.
- the network structure of the present disclosure may be changed on the basis of the foregoing embodiments.
- the number of RABs included in the generated network can be appropriately reduced or increased to meet the requirements of different computing capabilities in different scenarios.
- An embodiment of the disclosure also provides a method for training the video super-resolution network of the embodiment of the disclosure, including the following process:
- each HR sequence has 7 frames, the height of each HR video frame is sH, and the width is sW.
- the HR sequence can be down-sampled in time and space simultaneously to obtain a low-resolution video frame sequence (LR sequence for short) of size 5×H×W. Setting smaller H and W values during training can reduce training time and increase the complexity of the data set. All training data are normalized so that their pixel values lie in the (0, 1) interval, which benefits network training. Through the above processing, a sufficient number of LR sequences and HR sequences are obtained.
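- The spatial down-sampling and (0, 1) normalization described above can be sketched in plain Python (average pooling is one possible down-sampling filter and an assumption here; the exact filter and the 7-to-5 temporal sampling pattern are not specified in the text):

```python
def downsample_spatial(frame, factor=2):
    """Average-pool a 2D frame (a list of pixel rows) by `factor` per axis."""
    h, w = len(frame), len(frame[0])
    out = []
    for i in range(0, h - h % factor, factor):
        row = []
        for j in range(0, w - w % factor, factor):
            block = [frame[i + di][j + dj]
                     for di in range(factor) for dj in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out

def normalize(frame, max_value=255.0):
    """Scale pixel values into the (0, 1) interval for training."""
    return [[p / max_value for p in row] for row in frame]
```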
- the LR sequence is used as the input data of the video super-resolution network, and the HR sequence is used as the target data of the video super-resolution network to train the generation network.
- the output of the generator network is a super-resolution video frame sequence (referred to as SR sequence) of the same size as the HR sequence.
- the SR sequence (i.e., the fake samples) and the HR sequence (i.e., the real samples) are sent to the discriminant network as the input data of discriminant network training, in which the HR sequence and the SR sequence each account for 50%. The discriminant network outputs the judgment result, that is, the probability that the input data is real or fake, which can also be said to be the probability that the input data is an HR sequence.
- the judgment results of the discriminant network on the SR sequence and HR sequence are used to calculate the loss of the discriminant network and the confrontation loss of the generating network, and the mean square error (MSE: Mean Square Error) of the SR sequence output by the generating network and the HR sequence can be used as the loss function.
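- As a minimal sketch of the loss described above (the function name is hypothetical, and real training would operate on whole tensors rather than flat lists), the MSE between corresponding SR and HR pixels is:

```python
def mse(sr_pixels, hr_pixels):
    """Mean square error between flattened SR and HR pixel sequences."""
    diffs = [(s - h) ** 2 for s, h in zip(sr_pixels, hr_pixels)]
    return sum(diffs) / len(diffs)
```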
- the video super-resolution network is implemented on an Nvidia GTX 1080Ti GPU using the PyTorch platform (an open-source Python machine learning library). Both the training set and the test set of the experiment use Vimeo-90K. 4× super-resolution is achieved on the video frame images, and the frame rate is increased by 2×.
- the video encoding end may not be able to provide high-resolution video due to various objective limitations, for example, insufficient camera resolution, insufficient network bandwidth, or insufficient source resources. Video super-resolution based on deep learning can better restore image details. Therefore, video super-resolution processing can be used to enhance video quality, present high-quality video to users, and improve the subjective visual effect of the images.
- An embodiment of the present disclosure provides a video super-resolution processing method, as shown in FIG. 6 , including:
- Step 110 extracting a first feature from the first video frame sequence based on 3D convolution
- Step 120 extracting temporal and/or spatial second features from the first features based on the 3D residual attention mechanism
- Step 130: based on 3D convolution and 3D upsampling, perform feature fusion of the second feature and spatio-temporal super-resolution of the features, and reconstruct the video frame sequence based on 3D convolution to generate a second video frame sequence, where the resolution of the second video frame sequence is greater than the resolution of the first video frame sequence.
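- As a shape-bookkeeping sketch of steps 110 to 130 (the 4× spatial and 2× temporal factors follow the experiment reported above and are not fixed by the network; the function is hypothetical):

```python
def super_resolved_shape(frames: int, height: int, width: int,
                         spatial_scale: int = 4, temporal_scale: int = 2):
    """Shape of the second video frame sequence produced by the network.

    Feature extraction (steps 110-120) preserves the input sequence shape;
    the 3D upsampling of step 130 scales the time and space dimensions.
    """
    return (frames * temporal_scale,
            height * spatial_scale,
            width * spatial_scale)
```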
- the video super-resolution processing method is implemented based on the video super-resolution network described in any embodiment of the present disclosure
- the image resolution of the second video frame sequence is greater than the image resolution of the first video frame sequence, and/or, the video frame rate of the second video frame sequence is greater than the video frame rate of the first video frame sequence.
- Video super-resolution can be used in various aspects of the video compression process, such as video post-processing at the decoding end, video pre-processing at the encoding end, and video encoding and decoding. Below are a few examples to illustrate.
- One way to deal with it is to use the conventional video encoding method but increase the intensity of compression, such as increasing the quantization step size, to encode a video frame sequence with a lower bit rate, and then improve the video quality through video super-resolution at the decoding end. That is to say, video super-resolution is applied to the post-processing process in video decoding. For example, the resolution of the reconstructed video frame can be increased by performing super-resolution processing on the decoded video frame sequence output by the decoder in the video playback device.
- the first video frame sequence in the video super-resolution processing method shown in FIG. 6 is a decoded video frame sequence output by decoding the code stream, and the video super-resolution processing is used to increase the resolution of the decoded video frame sequence.
- video super-resolution processing can be used to replace the original post-filtering, or the original post-processing filtering can be retained and video super-resolution processing can be added.
- FIG. 7A is a structural block diagram of a video decoding end in this application scenario; as shown in the figure, it includes:
- the video decoder 101 is configured to decode an encoded video stream (code stream for short) to obtain a first video frame sequence;
- the video super-resolution network 103 is configured to perform video super-resolution processing on the first video frame sequence to obtain a second video frame sequence whose resolution is greater than that of the first video frame sequence;
- the display 105 is configured to display and play the second video frame sequence.
- the video decoder 101 in this embodiment may adopt the video decoder shown in FIG. 7B .
- the structure of the video decoder can be used for video decoding of H.264/AVC, H.265/HEVC, VVC/H.266 and other similar standards.
- the video decoder 101 may also use other types of video decoders, such as neural network-based video decoders in the end-to-end video coding and decoding technology.
- the video decoder 101 includes an entropy decoding unit 150, a prediction processing unit 152, an inverse quantization unit 154, an inverse transformation processing unit 156, a reconstruction unit 158 (indicated by a circle with a plus sign in the figure), a filter unit 159, and picture buffer 160.
- video decoder 30 may contain more, fewer or different functional components.
- the entropy decoding unit 150 may perform entropy decoding on the received code stream to extract information such as syntax elements, quantized coefficient blocks, and PU motion information.
- the prediction processing unit 152 , the inverse quantization unit 154 , the inverse transform processing unit 156 , the reconstruction unit 158 and the filter unit 159 can all perform corresponding operations based on the syntax elements extracted from the code stream.
- the inverse quantization unit 154 may inverse quantize the quantized TU-associated coefficient blocks.
- Inverse transform processing unit 156 may apply one or more inverse transforms to the inverse quantized coefficient block in order to generate the reconstructed residual block of the TU.
- Prediction processing unit 152 includes inter prediction processing unit 162 and intra prediction processing unit 164. If the PU is encoded using intra-frame prediction, the intra-frame prediction processing unit 164 can determine the intra-frame prediction mode of the PU based on the syntax elements parsed from the code stream, and perform intra prediction according to the determined intra-frame prediction mode and the reconstructed reference information of adjacent PUs obtained from the picture buffer 160, resulting in a prediction block of the PU. If the PU is encoded using inter-prediction, inter-prediction processing unit 162 may determine one or more reference blocks for the PU based on the motion information of the PU and corresponding syntax elements to generate a predictive block for the PU.
- the reconstruction unit 158 may obtain the reconstruction block of the CU based on the reconstruction residual block associated with the TU and the prediction block of the PU generated by the prediction processing unit 152 (ie intra prediction data or inter prediction data).
- the above display 105 may be, for example, a liquid crystal display, a plasma display, an organic light emitting diode display or other types of display devices.
- the decoding end may not include the display 105, but may include other devices that can apply the decoded data.
- the embodiments of the present disclosure can be used to solve problems such as image quality loss and video frame rate drop generated in the video compression process.
- By applying the video super-resolution network to post-processing at the decoding end, temporal-spatial super-resolution of the decoded output video frame sequence can improve the quality of the video images.
- In order to meet the frame rate requirements of the decoding end, the frame rate can also be increased during post-processing, so as to present users with high-quality video with higher resolution and higher frame rate.
- the video super-resolution network is used to enhance the quality of the decoded video frame sequence, and the encoding end is not required to down-sample the video frame sequence during video preprocessing.
- the first video frame sequence is a decoded video frame sequence output by decoding the code stream; the video super-resolution processing method further includes: parsing network parameter information of the video super-resolution network from the code stream, and setting the network parameters of the video super-resolution network according to the network parameter information.
- different network parameters such as the number of RABs in the generation network can be configured for the video super-resolution network to achieve a better super-resolution effect.
- Appropriate network parameters can be generated by the encoding end and written into the code stream, and the decoding end can parse the network parameters from the code stream and configure them to achieve better quality enhancement effects.
- video super-resolution is applied to a video preprocessing process.
- the acquired original video frame sequence is input into the video super-resolution network of the embodiment of the present disclosure for processing to obtain an output video with higher resolution and higher frame rate, and the output video is then input to the video encoder for encoding.
- the first video frame sequence in the video super-resolution processing method shown in FIG. 6 is the original video frame sequence collected by a video acquisition device, and the video super-resolution Processing can increase the resolution of a sequence of raw video frames.
- Adaptive Resolution Change (ARC) allows a video frame sequence to carry video frames of different resolutions according to the network status: low-resolution video frames are transmitted when the network bandwidth is low, and original-resolution video frames are transmitted when the bandwidth is high.
- If the encoder wants to change the resolution during video transmission, it needs to insert an instantaneous decoding refresh (IDR: Instantaneous Decoding Refresh) frame, or a similar frame, that matches the new resolution.
- the transmission of IDR frames requires a relatively high bit rate, and delays will be introduced for video conferencing applications. If the IDR frame is not inserted, the different resolutions of the current frame and the reference frame will cause problems during inter-frame prediction.
- VP9, an open video compression standard developed by Google, supports adaptive resolution change. Reference picture resampling (RPR) is adopted in Versatile Video Coding (VVC). The picture-based RPR puts both the reference picture before resampling and the reference picture after resampling into the decoded picture buffer (DPB), and the reference picture of the corresponding resolution is found in the DPB for prediction.
- video super-resolution is applied to the processing of RPR in the video coding process.
- the reference image that needs to be up-sampled is obtained from the DPB of the device (it can be one or more frames of reference images, and only the image resolution may be increased).
- the video super-resolution processing can realize the up-sampling of the reference image, and obtain a reference image with a larger image resolution for selection during inter-frame prediction.
- the video encoder 1000 shown in FIG. 8A can be used to implement RPR, and it includes an image resolution adjustment unit 1115, in which the super-resolution network of the embodiment of the present disclosure can be used to realize up-sampling of reference images.
- the video encoder 207 includes a prediction processing unit 1100, a division unit 1101, a residual generation unit 1102, a transformation processing unit 1104, a quantization unit 1106, an inverse quantization unit 1108, an inverse transformation processing unit 1110, a reconstruction unit 1112, A filter unit 1113 , a decoded picture buffer 1114 , an image resolution adjustment unit 1115 , and an entropy encoding unit 1116 .
- the prediction processing unit 1100 includes an inter prediction processing unit 1121 and an intra prediction processing unit 1126.
- video encoder 20 may contain more, fewer or different functional components than this example.
- the division unit 1101 cooperates with the prediction processing unit 1100 to divide the received video data into slices (Slices), CTUs or other larger units.
- the video data received by the dividing unit 1101 may be a video sequence including video frames such as I frames, P frames, or B frames.
- the prediction processing unit 1100 may divide a CTU into CUs, and perform intra-frame predictive coding or inter-frame predictive coding on the CUs.
- the CU can be divided into one or more prediction units (PU: prediction unit).
- the inter prediction processing unit 1121 may perform inter prediction on the PU to generate prediction data of the PU, the prediction data including the prediction block of the PU, motion information of the PU and various syntax elements.
- the intra prediction processing unit 1126 may perform intra prediction on the PU to generate prediction data for the PU.
- the prediction data for a PU may include the prediction block and various syntax elements for the PU.
- the residual generation unit 1102 may generate the residual block of the CU by subtracting, from the original block of the CU, the prediction blocks of the PUs into which the CU is divided.
- the transform processing unit 1104 may divide the CU into one or more transform units (TU: Transform Unit), and the residual block associated with the TU is a sub-block obtained by dividing the residual block of the CU.
- a TU-associated coefficient block is generated by applying one or more transforms to the TU-associated residual block.
- the quantization unit 1106 can quantize the coefficients in the coefficient block based on the selected quantization parameter, and the degree of quantization of the coefficient block can be adjusted by adjusting the QP value.
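- A minimal sketch of the quantization step described above (scalar stand-ins; the real unit quantizes whole coefficient blocks, and the step size is derived from the QP value rather than passed directly):

```python
def quantize(coeff: float, step: float) -> int:
    """Uniform scalar quantization: a larger step gives coarser levels
    and therefore stronger compression with more quality loss."""
    return round(coeff / step)

def dequantize(level: int, step: float) -> float:
    """Inverse quantization reconstructs an approximation of the coefficient."""
    return level * step
```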
- the inverse quantization unit 1108 and the inverse transformation unit 1110 may respectively apply inverse quantization and inverse transformation to the coefficient blocks to obtain TU-associated reconstruction residual blocks.
- the reconstruction unit 1112 may add the reconstruction residual block and the prediction block generated by the prediction processing unit 1100 to generate a reconstruction block of the CU.
- After the filter unit 1113 performs loop filtering on the reconstructed block, it stores the block in the decoded picture buffer 1114 as a reference image.
- the intra prediction processing unit 1126 may extract reference images of blocks adjacent to the PU from the decoded picture buffer 1114 to perform intra prediction.
- the inter prediction processing unit 1121 may perform inter prediction on the PU of the current frame image using the reference image of the previous frame buffered in the decoded picture buffer 1114 .
- the image resolution adjustment unit 1115 resamples the reference images stored in the decoded picture buffer 1114 , which may include upsampling and/or downsampling, and obtains reference images of various resolutions and stores them in the decoded picture buffer 1114 .
- the entropy encoding unit 1116 may perform an entropy encoding operation on received data (such as syntax elements, quantized coefficient blocks, motion information, etc.).
- Scalable video coding introduces concepts such as base layer (BL: Base Layer) and enhancement layer (EL: Enhance Layer), and transmits important information (bits) for decoding images in a guaranteed channel. This collection of important information is called the base layer.
- the secondary information (bits) is transmitted in an unguaranteed channel, and the collection of these data information is called an enhancement layer.
- If some or all of the enhancement layer information is lost, the decoder can still recover acceptable image quality from the base layer information.
- There are many types of scalable video coding, such as spatial scalable coding, temporal scalable coding, frequency-domain scalable coding, and quality scalable coding.
- Spatial scalable coding generates multiple images with different spatial resolutions for each video frame; decoding the base layer code stream yields a low-resolution image, and if the enhancement layer code stream is also fed to the decoder, a high-resolution image is obtained.
- An exemplary scalable video coding framework is shown in FIG. 8B. The coding framework includes a base layer, a first enhancement sublayer (the L1 layer), and a second enhancement sublayer (the L2 layer). Only the parts of the encoding architecture closely related to upsampling and downsampling are shown in the figure.
- the input video frame sequence is sent to the basic encoder 805 for encoding after being down-sampled twice by the first down-sampling unit 801 and the second down-sampling unit 803, and the coded base layer code stream is output.
- the reconstructed video frame is up-sampled in the first up-sampling unit 807 to obtain the reconstructed video frame of the L1 layer.
- the first subtractor 806 subtracts the reconstructed video frame of the L1 layer from the original video frame of the L1 layer output by the first downsampling unit 801 to obtain a residual of the L1 layer.
- the reconstructed video frame of the L1 layer and the reconstruction residual of the L1 layer are added together in the adder 808 and then up-sampled by the second upsampling unit 809 to obtain the reconstructed video frame of the L2 layer.
- the second subtractor 810 subtracts the reconstructed video frame of the L2 layer from the input video frame sequence to obtain a residual of the L2 layer.
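- The two-sublayer encoding data flow of FIG. 8B described above can be sketched with scalar stand-ins for frames (the callables are placeholders for the actual coding units, not their implementations):

```python
def encode_layers(original_l2, downsample, upsample, base_codec):
    """Sketch of the FIG. 8B data flow; unit numbers refer to the figure."""
    original_l1 = downsample(original_l2)        # first down-sampling (801)
    base_input = downsample(original_l1)         # second down-sampling (803)
    base_recon = base_codec(base_input)          # base encoder reconstruction (805)
    l1_recon = upsample(base_recon)              # first up-sampling (807)
    l1_residual = original_l1 - l1_recon         # first subtractor (806)
    l2_recon = upsample(l1_recon + l1_residual)  # adder (808) + up-sampling (809)
    l2_residual = original_l2 - l2_recon         # second subtractor (810)
    return l1_residual, l2_residual
```

With an ideal (lossless) base codec and exactly inverse resampling, both residuals vanish; real residuals carry the information the lower layers lost.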
- the scalable video coding framework may also include 3 or more enhancement sub-layers.
- video super-resolution is applied to a video coding architecture including a base layer and an enhancement layer, such as an encoder of Low Complexity Enhancement Video Coding (LCEVC: Low Complexity Enhancement Video Coding), for the generation of enhancement layer data at the encoding side.
- the video super-resolution network of the embodiments of the present disclosure can be used to implement the up-sampling unit in the scalable video coding architecture.
- the first video frame sequence in the video super-resolution processing method shown in FIG. 6 is the reconstructed (Reconstruction) video frame sequence of the base layer, or of an enhancement sublayer (such as the L1 layer), generated in a scalable video coding architecture including a base layer and an enhancement layer; the video super-resolution processing can realize the up-sampling of the reconstructed video frame sequence for generating the corresponding enhancement sublayer residuals.
- video super-resolution is applied to a scalable video decoding architecture including a base layer and an enhancement layer.
- An exemplary scalable video decoding architecture is shown in FIG. 8C.
- the decoding architecture includes a base layer, a first enhancement sublayer (L1 layer), and a second enhancement sublayer (L2 layer), but may also include a single enhancement sublayer, or three or more enhancement sublayers. Only the parts of the decoding architecture closely related to upsampling are shown in the figure.
- the decoded video frame sequence of the base layer output by the base decoder 901 is up-sampled by the first up-sampling unit 903 to obtain an initial intermediate picture (Preliminary Intermediate Picture).
- the initial intermediate image and the decoded data of the L1 layer are added in the first adder 904 to obtain a combined intermediate image (Combined Intermediate Picture) of the L1 layer.
- the combined intermediate image is up-sampled by the second up-sampling unit 905 to obtain an initial output image (Preliminary Output Picture).
- the initial output image and the decoded data of the L2 layer are added in the second adder 906 to obtain an output video frame sequence.
- the first video frame sequence may be the decoded video frame sequence of the base layer or the combined intermediate image (one or more images) of an enhancement sublayer; the video super-resolution processing can realize the up-sampling of the decoded video frame sequence to generate the initial intermediate image, or the up-sampling of the combined intermediate image to generate the initial output image.
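- The decoding data flow of FIG. 8C described above can likewise be sketched with scalar stand-ins (the callables are placeholders for the base decoder and the up-sampling units, either of which could be the video super-resolution network):

```python
def decode_layers(base_stream, l1_data, l2_data, base_decode, upsample):
    """Sketch of the FIG. 8C data flow; unit numbers refer to the figure."""
    base_frames = base_decode(base_stream)   # base decoder (901)
    preliminary = upsample(base_frames)      # first up-sampling (903)
    combined = preliminary + l1_data         # first adder (904): L1 intermediate
    preliminary_out = upsample(combined)     # second up-sampling (905)
    return preliminary_out + l2_data         # second adder (906): output
```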
- Before encoding the video, the video encoding end first determines whether to perform downsampling according to the current situation. For example, when resources such as bandwidth are insufficient, downsampling is used to reduce the amount of encoded data, so that the code traffic is greatly reduced. After the video decoding end completes decoding of the code stream, it judges whether to perform super-resolution on the decoded video frame sequence.
- When the network bandwidth is small, only the basic video stream encoded after downsampling is transmitted; when the network bandwidth is large, no downsampling is performed, which is equivalent to transmitting enhanced video information. This adaptivity ensures that most network-connected terminals can use an appropriate code stream to transmit multimedia information.
- this scheme is superior to a scheme in which the encoding end directly encodes the video frames at the same bit rate without downsampling and the decoding end uses a super-resolution network to enhance the quality of the decoded images.
- An embodiment of the present disclosure provides a video coding processing method, as shown in FIG. 9 , including:
- Step 210: when performing video preprocessing, determine whether to downsample the video frame sequence from the data source; if not, perform step 220; if yes, perform step 230;
- Step 220 when it is determined not to perform down-sampling, directly input the video frame sequence from the data source into the video encoder for video encoding, generate a code stream, and end;
- Step 230 if downsampling is determined, downsampling the video frame sequence from the data source, inputting the downsampled video frame sequence into a video encoder for video encoding, and generating a code stream.
- the video encoding process referred to herein includes video preprocessing and video encoding.
- the video preprocessing may include processing such as downsampling.
- the video decoding processing referred to herein includes video decoding and video post-processing, and the video post-processing may include the video super-resolution processing in the embodiments of the present disclosure.
- the down-sampling the video frame sequence from the data source includes: down-sampling the image resolution and/or video frame rate of the video frame sequence from the data source.
- an appropriate down-sampling multiple can be selected according to bandwidth and other factors, so that the encoded code rate can adapt to the bandwidth.
- the video encoding processing method further includes: when performing video encoding, writing a downsampling flag into the code stream, the downsampling flag being used to indicate whether the preprocessing of the video frame sequence from the data source by the encoding end includes downsampling.
- If the encoding end performs downsampling when preprocessing the video frame sequence from the data source, and the video super-resolution network is trained based on the real video frame sequence and the first video frame sequence obtained by downsampling it, then the encoding end down-samples the video frames from the data source before compression encoding to generate the code stream, the decoding end decodes the code stream to reconstruct the first video frame sequence, and the decoding end uses the video super-resolution network to perform video super-resolution processing on the reconstructed first video frame sequence, which significantly improves video quality.
- The application scenario of the video super-resolution network is then similar to its training scenario: both restore the resolution of downsampled video frames. If, however, the video frames were not down-sampled at the encoding end, then even when the decoded video quality does not meet requirements, using the video super-resolution network trained in the above manner to enhance the decoded video frame sequence at the decoding end will have a limited or no effect on video quality.
- The encoding end generates the above down-sampling flag and writes it into the code stream, so that the decoding end can determine whether to perform video super-resolution processing according to the down-sampling flag, alone or together with other conditions, which helps the decoding end make a reasonable judgment on whether to perform video super-resolution processing.
- the determining whether to downsample the sequence of video frames from the data source includes: determining to downsample the sequence of video frames from the data source when any of the following conditions is met:
- the bandwidth available to transmit the video stream is less than the bandwidth required to transmit the video stream without downsampling;
- the resources at the encoding end do not support direct video encoding of the video frame sequence from the data source.
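- The two conditions above can be sketched as a simple decision function (the names are hypothetical; either condition alone is sufficient):

```python
def should_downsample(available_bandwidth: float,
                      required_bandwidth: float,
                      encoder_supports_direct_encoding: bool) -> bool:
    """Decide whether to downsample per the two listed conditions."""
    return (available_bandwidth < required_bandwidth
            or not encoder_supports_direct_encoding)
```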
- the video encoding processing method further includes: when performing video encoding, acquiring the network parameters of the video super-resolution network corresponding to the video frame sequence from the data source, and converting the network parameters to Write code stream.
- the encoder can make training samples based on the video resource in advance, and train the video super-resolution network, so as to obtain the network parameters of the video super-resolution network corresponding to the video resource, and then the The network parameters are saved together with the video resource, and when video encoding is performed on the video resource, the network parameters are read and encoded into a code stream. In this way, the decoding end can parse out the network parameters, use the network parameters to configure the video super-resolution network, and obtain the expected quality enhancement effect.
- the video encoding processing method in this embodiment can determine, according to bandwidth and other conditions, whether to perform downsampling when preprocessing the video frames, so that the encoding end can adaptively select an appropriate encoding processing method to adapt to changes in the network environment, encoding resources, and the like.
- An embodiment of the present disclosure also provides a video decoding processing method, as shown in FIG. 10 , including:
- Step 310 decoding the code stream to obtain the first video frame sequence
- Step 320 judging whether the first video frame sequence satisfies the set super-resolution condition, if yes, execute step 330, if not, execute step 340;
- Step 330 when the set super-resolution condition is met, output the first video frame sequence to the video super-resolution network for video super-resolution processing to obtain a second video frame sequence, the second video frame sequence a resolution greater than that of the first sequence of video frames;
- Step 340 if the set super-resolution condition is not met, skip the video super-resolution processing on the first video frame sequence.
- whether the video super-resolution processing is skipped or performed to obtain the second video frame sequence, subsequent post-decoding processing, or video display and playback, may then be carried out.
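The decision flow of steps 310 to 340 can be sketched as follows. This is a minimal illustration; `conditional_super_resolution` and the callables passed in are names assumed here, not part of the disclosure:

```python
def conditional_super_resolution(first_sequence, sr_net, meets_sr_condition):
    """Apply video super-resolution only when the set condition is met.

    `first_sequence` stands for the decoded first video frame sequence,
    i.e. step 310 is assumed to have produced it already.
    """
    # Step 320: judge the set super-resolution condition
    if meets_sr_condition(first_sequence):
        # Step 330: obtain the second, higher-resolution sequence
        return sr_net(first_sequence)
    # Step 340: skip video super-resolution processing
    return first_sequence
```

Either branch yields a sequence that can then go on to post-decoding processing or display.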
- the video super-resolution network adopts the video super-resolution network described in any embodiment of the present disclosure.
- other video super-resolution networks may also be used to perform the video super-resolution processing of this embodiment.
- the video super-resolution network includes a generation network; when the generation network is trained, the first video frame sequence serving as a sample is used as input data and a real video frame sequence is used as target data, wherein the resolution of the real video frame sequence is the same as that of the second video frame sequence, and the first video frame sequence serving as a sample is obtained by downsampling the real video frame sequence.
- the input of the generation network during training is the first video frame sequence serving as a sample;
- the input of the generation network after training can be the decoded first video frame sequence (or a first video frame sequence from a data source, etc.);
- the first video frame sequence serving as a sample and the decoded first video frame sequence have the same resolution, although their content may differ.
- the video super-resolution network trained according to this example is suitable for restoring a low-resolution video frame sequence that has been down-sampled and then compressed, encoded and decoded to a high-resolution video frame sequence.
- decoding the code stream further yields a downsampling flag, the downsampling flag being used to indicate whether the preprocessing of the first video frame sequence by the encoding end includes downsampling;
- the set super-resolution condition at least includes: the downsampling flag indicates that the preprocessing of the first video frame sequence by the encoding end includes downsampling.
- when the downsampling flag indicates that the preprocessing of the first video frame sequence by the encoding end does not include downsampling, it may be determined to skip the video super-resolution processing of the first video frame sequence.
- the downsampling flag itself indicates whether the preprocessing of video frames by the encoding end includes downsampling; that the flag here can indicate whether the preprocessing of the first video frame sequence includes downsampling means that the downsampling flag is related to the first video frame sequence, for example, they belong to the same coding unit.
- the down-sampling flag helps the decoding end determine whether the encoding end performed down-sampling during video preprocessing, so that it can better judge whether to carry out video super-resolution processing. Deciding simply based on the quality of the decoded video, performing video super-resolution when the quality falls below a fixed threshold and skipping it when the quality reaches the threshold, disregards the expected effect of video super-resolution; it is relatively mechanical and has limitations. If the encoding end performed down-sampling and the decoded video quality just reaches the threshold, video super-resolution can still be performed to further improve the video quality.
- if the encoding end did not perform downsampling, and the quality of the decoded video fails to reach the threshold due to other factors, such as poor resolution of the camera itself or large noise on the transmission path, video super-resolution may be skipped.
- the set super-resolution conditions include one or any combination of the following conditions:
- the image quality of the first video frame sequence is lower than a set quality requirement
- the preprocessing of the first video frame sequence by the encoding end includes downsampling
- the video super-resolution function of the decoding end is available
- the super-resolution conditions listed above can be combined. For example, it is determined to perform super-resolution processing on the first video frame sequence when the image quality of the first video frame sequence is lower than the set quality requirement, the encoding end has down-sampled the first video frame sequence, and the video super-resolution function of the decoding end is available. The conditions here are not exhaustive; there may be others.
- the above quality requirement can be expressed by set evaluation indicators such as Peak Signal-to-Noise Ratio (PSNR), Structural Similarity (SSIM), and Mean Square Error (MSE).
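As a concrete reference for two of these indicators, the snippet below computes MSE and PSNR for flat pixel sequences. It is a minimal sketch of the standard definitions, not the evaluation procedure of the disclosure:

```python
import math

def mse(ref, test):
    """Mean squared error between two equal-length pixel sequences."""
    assert len(ref) == len(test)
    return sum((a - b) ** 2 for a, b in zip(ref, test)) / len(ref)

def psnr(ref, test, peak=255.0):
    """Peak signal-to-noise ratio in dB; higher means closer to the reference.

    `peak` is the maximum possible pixel value (255 for 8-bit video).
    """
    err = mse(ref, test)
    if err == 0:
        return float("inf")  # identical signals
    return 10.0 * math.log10(peak ** 2 / err)
```

A quality requirement can then be phrased as, e.g., "PSNR of the decoded frames against a reference must exceed a set number of dB".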
- the video super-resolution network is applied to the video processing process.
- the video is down-sampled in space and time before encoding, which greatly reduces the amount of video data that needs to be encoded; after decoding, the trained video super-resolution network performs the corresponding up-sampling to restore the original video.
- the code rate is significantly reduced, the coding efficiency is greatly improved, and the transmission code stream is reduced.
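For intuition about the data reduction described above: down-sampling by an assumed factor of 2 in each spatial dimension and 2 in time leaves only 1/8 of the raw samples to encode. The helper below is purely illustrative; the disclosure does not fix specific factors:

```python
def raw_sample_ratio(spatial_factor: int, temporal_factor: int) -> float:
    """Fraction of raw samples remaining after down-sampling width and
    height each by `spatial_factor` and the frame rate by `temporal_factor`."""
    return 1.0 / (spatial_factor * spatial_factor * temporal_factor)
```

For example, 2x spatial plus 2x temporal down-sampling gives a ratio of 0.125, an 8x reduction in samples before the encoder even runs.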
- An embodiment of the present disclosure also provides a video encoding and decoding system, as shown in FIG. 11 , including an encoding end device and a decoding end device.
- the encoding end device includes a data source 201 and a video encoding processing device 200.
- the data source 201 may be a video capture device (for example, a video camera), an archive containing previously captured data, a feed-in interface for receiving data from a content provider, a computer graphics system that generates the data, or a combination of these sources.
- the video encoding processing device 200 can be implemented with any one or any combination of the following circuits: one or more microprocessors, digital signal processors, application-specific integrated circuits, field-programmable gate arrays, discrete logic, or hardware. If the present disclosure is implemented partially in software, instructions for the software may be stored in a suitable non-transitory computer-readable storage medium and executed in hardware using one or more processors, thereby implementing the disclosed method.
- the video encoding processing apparatus 200 may implement the video encoding processing method described in any embodiment of the present disclosure based on the above circuit.
- the video encoding processing device 200 includes:
- the down-sampling device 205 is configured to down-sample the input video frame sequence, and output the down-sampled video frame sequence to a video encoder for encoding;
- the video encoder 207 is configured to perform video encoding on the sequence of video frames from a data source or the sequence of downsampled video frames.
- the downsampling judging device 203 determines whether to downsample the video frame sequence from the data source as follows: when any one of the following conditions is met, it determines to downsample the video frame sequence from the data source:
- the bandwidth available for transmitting the video code stream is less than the bandwidth required to transmit the video code stream without downsampling;
- the resources of the encoding end do not support direct video encoding of the video frame sequence from the data source;
- the video frame sequence from the data source belongs to designated video frame sequences that need to be down-sampled.
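The three triggers above can be expressed as a single predicate. The function and parameter names below are assumptions for illustration, not identifiers from the disclosure:

```python
def should_downsample(available_bandwidth_kbps: float,
                      required_bandwidth_kbps: float,
                      encoder_supports_direct_coding: bool,
                      sequence_is_designated: bool) -> bool:
    """Return True when any one of the listed conditions is met:
    insufficient bandwidth, insufficient encoder resources, or a
    sequence designated for down-sampling."""
    return (available_bandwidth_kbps < required_bandwidth_kbps
            or not encoder_supports_direct_coding
            or sequence_is_designated)
```

Because the conditions are joined by "any one of", a single True trigger is enough to select the down-sampling path.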
- the downsampling device 205 down-samples the video frame sequence from the data source by down-sampling its image resolution and/or video frame rate.
- the downsampling judging device 203 is further configured to generate a downsampling flag and output it to the video encoder 207, the downsampling flag being used to indicate whether the preprocessing of the video frame sequence includes downsampling; the video encoder 207 is further configured to write the downsampling flag into the code stream when performing video encoding.
- the downsampling flag here can indicate whether the preprocessing of the video frame sequence from the data source by the encoding end includes downsampling, which means the flag is related to the video frame sequence from the data source; for example, they belong to the same coding unit.
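One way to picture carrying the flag alongside the coded data: the toy container below prepends a one-byte flag to the payload. This is purely illustrative; the disclosure does not specify a bitstream syntax for the flag, and real codecs would signal it in a header or parameter set:

```python
def write_downsampling_flag(payload: bytes, downsampled: bool) -> bytes:
    """Prepend a one-byte downsampling flag to the coded payload."""
    return bytes([1 if downsampled else 0]) + payload

def read_downsampling_flag(stream: bytes):
    """Split the stream back into (flag, payload)."""
    return bool(stream[0]), stream[1:]
```

The decoding end reads the flag first and uses it, together with the other set conditions, to decide whether to route the decoded frames through the super-resolution network.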
- the decoding end device includes a video decoding processing device 300 and a display 307 , and the display 307 may be a liquid crystal display, a plasma display, an organic light emitting diode display or other types of display devices.
- the video decoding processing device 300 can be implemented with any one or any combination of the following circuits: one or more microprocessors, digital signal processors, application-specific integrated circuits, field-programmable gate arrays, discrete logic, or hardware. If the present disclosure is implemented partially in software, instructions for the software may be stored in a suitable non-transitory computer-readable storage medium and executed in hardware using one or more processors to implement the disclosed method.
- the video decoding processing apparatus 300 may implement the video decoding processing method described in any embodiment of the present disclosure based on the above circuit.
- the video decoding processing device 300 includes:
- the video decoder 301 is configured to decode the code stream to obtain the first video frame sequence
- the super-resolution judging device 303 is configured to judge whether the first video frame sequence satisfies the set super-resolution condition; if the set super-resolution condition is met, it outputs the first video frame sequence to the video super-resolution network for video super-resolution processing; if the set super-resolution condition is not met, it determines to skip the video super-resolution processing of the first video frame sequence;
- the video super-resolution network 305 is configured to perform video super-resolution processing on the first video frame sequence to obtain a second video frame sequence whose resolution is greater than that of the first video frame sequence;
- the video super-resolution network adopts the video super-resolution network described in any embodiment of the present disclosure.
- the first video frame sequence as a sample is used as input data
- the real video frame sequence is used as target data
- the first video frame sequence serving as a sample is obtained by down-sampling the real video frame sequence.
- the video super-resolution network trained in this way is suitable for restoring a low-resolution video frame sequence that has been down-sampled, compression-encoded and decoded to a high-resolution video frame sequence, with a good quality enhancement effect.
- when the video decoder decodes the code stream, it further extracts a down-sampling flag from the code stream, the down-sampling flag being used to indicate whether the preprocessing of the first video frame sequence by the encoding end includes down-sampling;
- the super-resolution condition used by the super-resolution judging device at least includes: the downsampling flag indicates that the preprocessing of the first video frame sequence by the encoding end includes downsampling. In an example, the super-resolution judging device may determine not to perform super-resolution processing on the first video frame sequence when the downsampling flag indicates that the preprocessing of the first video frame sequence by the encoding end does not include downsampling.
- the set super-resolution condition used by the super-resolution decision device includes one or any combination of the following conditions:
- the image quality of the first video frame sequence is lower than a set quality requirement
- the preprocessing of the first video frame sequence by the encoding end includes downsampling
- the video super-resolution function of the decoding end is available
- the super-resolution judging device may determine to skip video super-resolution processing on the first video frame sequence when the first video frame sequence does not meet the set super-resolution condition.
- the encoding end judges, according to factors such as the currently detected bandwidth environment, whether the video frame sequence needs to be downsampled. If so (for example, when bandwidth is insufficient), it selects the corresponding down-sampling multiple, down-samples the spatial and/or temporal resolution of the video frame sequence, and then encodes it into a code stream for transmission. The decoding end decodes with the corresponding decoder; since the quality of the decoded video frames is not high, they can be sent to the video super-resolution network for quality improvement, yielding a video with the required spatial and temporal resolution.
- otherwise, the encoding end can directly encode the video frame sequence from the data source into a code stream for transmission, and the decoding end can directly decode it to obtain high-quality video; in this case, video super-resolution is not performed. Regardless of whether the encoding end performs down-sampling, the same video encoder can be used for encoding.
- the encoding operation is relatively simple and the resource occupation is small.
- An embodiment of the present disclosure also provides a video encoding processing device, as shown in FIG. 5.
- An embodiment of the present disclosure also provides a video decoding processing device, as shown in FIG. 12, including a processor and a memory storing a computer program that can run on the processor, wherein the processor, when executing the computer program, implements the video decoding processing method described in any embodiment of the present disclosure.
- An embodiment of the present disclosure further provides a video encoding and decoding system, including the video encoding processing device described in any embodiment of the present disclosure and the video decoding processing device described in any embodiment of the present disclosure.
- An embodiment of the present disclosure further provides a code stream, wherein the code stream is generated according to the video encoding processing method described in the embodiment of the present disclosure, and the code stream includes a downsampling flag.
- An embodiment of the present disclosure also provides a non-transitory computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the method described in any embodiment of the present disclosure.
- the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted over, as one or more instructions or code, a computer-readable medium and executed by a hardware-based processing unit.
- Computer-readable media may include computer-readable storage media, which correspond to tangible media such as data storage media, or communication media including any medium that facilitates transfer of a computer program from one place to another, e.g., according to a communication protocol. In this manner, a computer-readable medium may generally correspond to a non-transitory tangible computer-readable storage medium or a communication medium such as a signal or carrier wave.
- Data storage media may be any available media that can be accessed by one or more computers or one or more processors to retrieve instructions, code and/or data structures for implementation of the techniques described in this disclosure.
- a computer program product may comprise a computer readable medium.
- such computer-readable storage media may include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk or other magnetic storage, flash memory, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer.
- any connection may also be termed a computer-readable medium. For example, if instructions are transmitted from a website, server, or other remote source using coaxial cable, fiber-optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave, then those cables and technologies are included in the definition of medium.
- the technical solutions of the embodiments of the present disclosure may be implemented in a wide variety of devices or apparatuses, including a wireless handset, an integrated circuit (IC), or a set of ICs (e.g., a chipset).
- Various components, modules, or units are described in the disclosed embodiments to emphasize functional aspects of devices configured to perform the described techniques, but do not necessarily require realization by different hardware units. Rather, as described above, the various units may be combined in a codec hardware unit or provided by a collection of interoperable hardware units (comprising one or more processors as described above) in combination with suitable software and/or firmware.
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Compression Or Coding Systems Of Tv Signals (AREA)
Description
Conv_1 | Conv_2 | Conv_3 | Conv_4 |
---|---|---|---|
K3 s1 n64 | K3 s2 n64 | K3 s1 n128 | K3 s2 n128 |

Conv_5 | Conv_6 | Conv_7 | Conv_8 |
---|---|---|---|
K3 s1 n256 | K3 s2 n256 | K3 s1 n512 | K3 s1 n512 |

Conv1 | Conv2 | Conv3 | Conv3-1 | Conv4 |
---|---|---|---|---|
K7 s2 n64 | K5 s2 n128 | K3 s3 n256 | K3 s1 n256 | K3 s2 n512 |

Conv4-1 | Conv5 | Conv5-1 | Conv6 |
---|---|---|---|
K3 s1 n512 | K3 s2 n512 | K3 s1 n512 | K3 s2 n1024 |

DeConv5 | DeConv4 | DeConv3 | DeConv2 |
---|---|---|---|
K4 s2 n512 p1 | K4 s2 n256 p1 | K4 s2 n128 p1 | K4 s2 n64 p1 |
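The tables above use a compact "K&lt;kernel&gt; s&lt;stride&gt; n&lt;channels&gt; [p&lt;padding&gt;]" shorthand for each convolution or deconvolution layer. A small parser for that notation; the function name and dictionary keys are our own, not from the disclosure:

```python
import re

def parse_layer_spec(spec: str) -> dict:
    """Parse 'K3 s2 n64 p1'-style shorthand: kernel size, stride,
    output channel count, and optional padding."""
    m = re.fullmatch(r"K(\d+)\s+s(\d+)\s+n(\d+)(?:\s+p(\d+))?", spec.strip())
    if m is None:
        raise ValueError(f"unrecognized layer spec: {spec!r}")
    kernel, stride, channels, padding = m.groups()
    parsed = {"kernel": int(kernel), "stride": int(stride),
              "channels": int(channels)}
    if padding is not None:
        parsed["padding"] = int(padding)
    return parsed
```

For example, "K3 s2 n128" reads as a 3x3 convolution with stride 2 producing 128 channels, and the "p1" entries on the DeConv rows add a padding of 1.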
Claims (36)
- A video super-resolution network, comprising a generation network, wherein the generation network comprises a first feature extraction part, a second feature extraction part and a reconstruction part connected in sequence, wherein: the first feature extraction part is configured to receive a first video frame sequence, extract a first feature from the first video frame sequence based on 3D convolution, and output the first feature; the second feature extraction part is configured to receive the first feature, extract a temporal and/or spatial second feature from the first feature based on a 3D residual attention mechanism, and output the second feature; the reconstruction part is configured to receive the second feature, perform feature fusion and spatio-temporal super-resolution of features based on 3D convolution and 3D upsampling, and reconstruct a video frame sequence based on 3D convolution to generate a second video frame sequence, the resolution of the second video frame sequence being greater than the resolution of the first video frame sequence.
- The video super-resolution network according to claim 1, wherein: the first feature extraction part comprises a 3D convolution layer and an activation layer connected in sequence, the input of the 3D convolution layer is the first video frame sequence, and the output of the activation layer is the first feature.
- The video super-resolution network according to claim 1, wherein: the second feature extraction part comprises a plurality of residual attention blocks (RABs) connected in sequence, the input of the first RAB is the first feature, the input of each RAB other than the first RAB is the output of the preceding RAB, and the output of the last RAB is the second feature; each RAB comprises a 3D convolution layer, an activation layer and a 3D attention mechanism model unit connected in sequence, the input of the RAB is fed into the 3D convolution layer and is also added, through a skip connection, to the output of the 3D attention mechanism model unit, and the resulting sum serves as the output of the RAB.
- The video super-resolution network according to claim 3, wherein: the 3D attention mechanism model unit is a 3D convolutional block attention model comprising a 3D channel attention module and a 3D spatial attention module connected in sequence; the input of the 3D attention mechanism model unit is fed into the 3D channel attention module; a first product, obtained by multiplying the input of the 3D channel attention module by its output, serves as the input of the 3D spatial attention module; and a second product, obtained by multiplying the output of the 3D spatial attention module by the first product, serves as the output of the 3D attention mechanism model unit.
- The video super-resolution network according to claim 1, wherein: the reconstruction part comprises the following units connected in sequence: a 3D convolution unit for feature fusion, comprising a 3D convolution layer and an activation layer connected in sequence, the input of the 3D convolution unit for feature fusion being the second feature; a 3D transposed convolution unit for spatio-temporal super-resolution of features, comprising a 3D transposed convolution layer and an activation layer connected in sequence, the input of the 3D transposed convolution unit being the output of the 3D convolution unit for feature fusion; and a 3D convolution layer for generating a video frame sequence, whose input is the output of the 3D transposed convolution unit and whose output is the second video frame sequence; wherein the image resolution of the second video frame sequence is greater than the image resolution of the first video frame sequence, and/or the video frame rate of the second video frame sequence is greater than the video frame rate of the first video frame sequence.
- The video super-resolution network according to any one of claims 2 to 5, wherein: the activation function used by the activation layers is a parametric rectified linear unit (PReLU).
- The video super-resolution network according to claim 1, wherein: the video super-resolution network further comprises a discriminator network configured, during training, to take a real video frame sequence and the second video frame sequence generated during training of the generation network as input, extract detail features and inter-frame motion information features from the input video frame sequence, and determine, based on the detail features and the motion information features, the probability that the input video frame sequence is a real video frame sequence; wherein the resolution of the real video frame sequence is the same as the resolution of the second video frame sequence, and the first video frame sequence received by the generation network during training is obtained by downsampling the real video frame sequence.
- The video super-resolution network according to claim 7, wherein: the discriminator network comprises a first branch, a second branch, an information fusion unit connected to the first branch and the second branch, and a weight calculation unit connected to the information fusion unit, wherein: the first branch is configured to extract detail features from the input video frame sequence based on a feature extraction network and to judge authenticity based on the detail features; the second branch is configured to extract inter-frame motion information features from the input video frame sequence based on an optical flow network and to judge authenticity based on the motion information features; the information fusion unit is configured to fuse the authenticity judgment results output by the first branch and the second branch; and the weight calculation unit is configured to perform weight calculation on the fused information output by the information fusion unit to obtain the probability that the input video frame sequence is a real video frame sequence.
- The video super-resolution network according to claim 8, wherein: the information fusion unit is implemented with a fully connected layer, and the weight calculation unit is implemented with a sigmoid function.
- The video super-resolution network according to claim 8, wherein: the first branch comprises the following units connected in sequence: a 2D convolution unit, comprising a 2D convolution layer and an activation layer connected in sequence; a plurality of 2D convolution-plus-normalization units, each comprising a 2D convolution layer, a batch normalization (BN) layer and an activation layer connected in sequence; and a fully connected unit, comprising a fully connected layer and an activation layer connected in sequence.
- The video super-resolution network according to claim 8, wherein: the second branch comprises the following units connected in sequence: N 2D convolution-plus-normalization units, each comprising a 2D convolution layer, a BN layer and an activation layer connected in sequence, N ≥ 2; M 2D deconvolution units, each comprising a 2D deconvolution layer and an activation layer, M ≥ 2; and a fully connected unit, comprising a fully connected layer and an activation layer connected in sequence.
- The video super-resolution network according to claim 11, wherein: the output of the 2i-th 2D convolution-plus-normalization unit is further connected to the input of the (M-i+1)-th 2D deconvolution unit, 1 ≤ i ≤ M, N = 2M+1.
- The video super-resolution network according to claim 10 or 11, wherein: the activation function used by the activation layers is a leaky rectified linear unit (LeakyReLU).
- A video super-resolution processing method, comprising: extracting a first feature from a first video frame sequence based on 3D convolution; extracting a temporal and/or spatial second feature from the first feature based on a 3D residual attention mechanism; performing feature fusion of the second feature and spatio-temporal super-resolution of features based on 3D convolution and 3D upsampling; and reconstructing a video frame sequence based on 3D convolution to generate a second video frame sequence, the resolution of the second video frame sequence being greater than the resolution of the first video frame sequence.
- The video super-resolution processing method according to claim 14, wherein: the video super-resolution processing method is implemented based on the video super-resolution network according to any one of claims 1 to 13.
- The video super-resolution processing method according to claim 14, wherein: the first video frame sequence is a decoded video frame sequence output by decoding a code stream; or the first video frame sequence is an original video frame sequence captured by a video capture device; or the first video frame sequence is a reference picture requiring upsampling, obtained from a decoded picture buffer of a video encoder; or the first video frame sequence is a reconstructed video frame sequence of a base layer, or a reconstructed video frame sequence of an enhancement sub-layer, produced in a scalable video coding architecture comprising a base layer and an enhancement layer; or the first video frame sequence is a decoded video frame sequence of a base layer, or a combined intermediate image of an enhancement sub-layer, produced in a scalable video decoding architecture comprising a base layer and an enhancement layer.
- The video super-resolution processing method according to claim 14 or 15, wherein: the first video frame sequence is a decoded video frame sequence output by decoding a code stream; and the video super-resolution processing method further comprises: parsing, from the code stream, network parameter information of the video super-resolution network sent by the encoding end, and setting the network parameters of the video super-resolution network according to the network parameter information.
- A video decoding method, comprising: decoding a code stream to obtain a first video frame sequence; judging whether the first video frame sequence satisfies a set super-resolution condition; and, when the set super-resolution condition is satisfied, outputting the first video frame sequence to a video super-resolution network for video super-resolution processing to obtain a second video frame sequence, the resolution of the second video frame sequence being greater than the resolution of the first video frame sequence.
- The video decoding processing method according to claim 18, wherein: the video super-resolution network is the video super-resolution network according to any one of claims 1 to 13.
- The video decoding processing method according to claim 18, wherein: the video super-resolution network comprises a generation network; when the generation network is trained, a first video frame sequence serving as a sample is used as input data and a real video frame sequence is used as target data, wherein the resolution of the real video frame sequence is the same as the resolution of the second video frame sequence, and the first video frame sequence serving as a sample is obtained by downsampling the real video frame sequence.
- The video decoding processing method according to claim 18, 19 or 20, wherein: decoding the code stream further yields a downsampling flag, the downsampling flag being used to indicate whether the preprocessing of the first video frame sequence by the encoding end includes downsampling; and the set super-resolution condition at least includes: the downsampling flag indicates that the preprocessing of the first video frame sequence by the encoding end includes downsampling.
- The video decoding processing method according to claim 18, wherein: the set super-resolution condition includes one or any combination of the following conditions: the image quality of the first video frame sequence is lower than a set quality requirement; the preprocessing of the first video frame sequence by the encoding end includes downsampling; and the video super-resolution function of the decoding end is available; and, when the first video frame sequence does not satisfy the set super-resolution condition, the super-resolution processing of the first video frame sequence is skipped.
- A video encoding processing method, comprising: determining, when performing video preprocessing, whether to downsample a video frame sequence from a data source; when it is determined not to downsample, inputting the video frame sequence from the data source directly into a video encoder for video encoding; and, when it is determined to downsample, downsampling the video frame sequence from the data source and inputting the downsampled video frame sequence into the video encoder for video encoding.
- The video encoding processing method according to claim 23, wherein: the video encoding processing method further comprises: when performing video encoding, writing a downsampling flag into the code stream, the downsampling flag being used to indicate whether the preprocessing of the video frame sequence from the data source by the encoding end includes downsampling.
- The video encoding processing method according to claim 23, wherein: determining whether to downsample the video frame sequence from the data source comprises: determining to downsample the video frame sequence from the data source when any one of the following conditions is met: the bandwidth available for transmitting the video code stream is less than the bandwidth required to transmit the video code stream without downsampling; the resources of the encoding end do not support direct video encoding of the video frame sequence from the data source; or the video frame sequence from the data source belongs to designated video frame sequences that need to be downsampled.
- The video encoding processing method according to claim 23, 24 or 25, wherein: the video encoding processing method further comprises: when performing video encoding, acquiring the network parameters of the video super-resolution network corresponding to the video frame sequence from the data source, and writing the network parameters into the code stream.
- A video super-resolution processing apparatus, comprising a processor and a memory storing a computer program executable on the processor, wherein the processor, when executing the computer program, implements the video super-resolution processing method according to any one of claims 14 to 17.
- A video decoding processing apparatus, comprising a processor and a memory storing a computer program executable on the processor, wherein the processor, when executing the computer program, implements the video decoding processing method according to any one of claims 18 to 22.
- A video decoding processing apparatus, comprising: a video decoder configured to decode a code stream to obtain a first video frame sequence; a super-resolution decision apparatus configured to judge whether the first video frame sequence satisfies a set super-resolution condition, to output the first video frame sequence to a video super-resolution network for video super-resolution processing when the set super-resolution condition is satisfied, and to determine to skip the video super-resolution processing of the first video frame sequence when the set super-resolution condition is not satisfied; and a video super-resolution network configured to perform video super-resolution processing on the first video frame sequence to obtain a second video frame sequence whose resolution is greater than the resolution of the first video frame sequence.
- The video decoding processing apparatus according to claim 29, wherein: when decoding the code stream, the video decoder further extracts a downsampling flag from the code stream, the downsampling flag being used to indicate whether the preprocessing of the first video frame sequence by the encoding end includes downsampling; and the super-resolution condition used by the super-resolution decision apparatus at least includes: the downsampling flag indicates that the preprocessing of the first video frame sequence by the encoding end includes downsampling.
- A video encoding processing apparatus, comprising a processor and a memory storing a computer program executable on the processor, wherein the processor, when executing the computer program, implements the video encoding processing method according to any one of claims 23 to 26.
- A video encoding processing apparatus, comprising: a downsampling decision module configured to determine, when performing video preprocessing, whether to downsample a video frame sequence from a data source, to output the video frame sequence from the data source to a downsampling apparatus when it is determined to downsample, and to output the video frame sequence from the data source directly to a video encoder for encoding when it is determined not to downsample; a downsampling apparatus configured to downsample the input video frame sequence and output the downsampled video frame sequence to the video encoder for encoding; and a video encoder configured to perform video encoding on the video frame sequence from the data source or on the downsampled video frame sequence.
- The video encoding processing apparatus according to claim 32, wherein: the downsampling decision apparatus is further configured to generate a downsampling flag and output it to the video encoder, the downsampling flag being used to indicate whether the preprocessing of the video frame sequence from the data source by the encoding end includes downsampling; and the video encoder is further configured to write the downsampling flag into the code stream when performing video encoding.
- A video encoding and decoding system, comprising the video encoding processing apparatus according to any one of claims 31 to 33 and the video decoding processing apparatus according to any one of claims 28 to 30.
- A code stream, wherein: the code stream is generated according to the video encoding processing method of claim 24, and the code stream contains the downsampling flag; or the code stream is generated according to the video encoding processing method of claim 26, and the code stream contains the network parameters.
- A non-transitory computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the method according to any one of claims 14 to 26.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202180100597.5A CN117730338A (zh) | 2021-07-20 | 2021-07-20 | 视频超分辨网络及视频超分辨、编解码处理方法、装置 |
PCT/CN2021/107449 WO2023000179A1 (zh) | 2021-07-20 | 2021-07-20 | 视频超分辨网络及视频超分辨、编解码处理方法、装置 |
EP21950444.6A EP4365820A1 (en) | 2021-07-20 | 2021-07-20 | Video super-resolution network, and video super-resolution, encoding and decoding processing method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2021/107449 WO2023000179A1 (zh) | 2021-07-20 | 2021-07-20 | 视频超分辨网络及视频超分辨、编解码处理方法、装置 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023000179A1 true WO2023000179A1 (zh) | 2023-01-26 |
Family
ID=84979815
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2021/107449 WO2023000179A1 (zh) | 2021-07-20 | 2021-07-20 | 视频超分辨网络及视频超分辨、编解码处理方法、装置 |
Country Status (3)
Country | Link |
---|---|
EP (1) | EP4365820A1 (zh) |
CN (1) | CN117730338A (zh) |
WO (1) | WO2023000179A1 (zh) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116091317A (zh) * | 2023-02-02 | 2023-05-09 | 苏州大学 | 扫描电镜二次电子图像超分辨方法和系统 |
US20230254592A1 (en) * | 2022-02-07 | 2023-08-10 | Robert Bosch Gmbh | System and method for reducing transmission bandwidth in edge cloud systems |
CN116634209A (zh) * | 2023-07-24 | 2023-08-22 | 武汉能钠智能装备技术股份有限公司 | 一种基于热插拔的断点视频恢复系统及方法 |
CN117041669A (zh) * | 2023-09-27 | 2023-11-10 | 湖南快乐阳光互动娱乐传媒有限公司 | 视频流的超分控制方法、装置及电子设备 |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108769682A (zh) * | 2018-06-20 | 2018-11-06 | 腾讯科技(深圳)有限公司 | 视频编码、解码方法、装置、计算机设备和存储介质 |
CN112543347A (zh) * | 2019-09-23 | 2021-03-23 | 腾讯美国有限责任公司 | 基于机器视觉编解码的视频超分辨率方法和系统 |
CN112801877A (zh) * | 2021-02-08 | 2021-05-14 | 南京邮电大学 | 一种视频帧的超分辨率重构方法 |
CN112950471A (zh) * | 2021-02-26 | 2021-06-11 | 杭州朗和科技有限公司 | 视频超分处理方法、装置、超分辨率重建模型、介质 |
CN113052764A (zh) * | 2021-04-19 | 2021-06-29 | 东南大学 | 一种基于残差连接的视频序列超分重建方法 |
- 2021-07-20: CN application CN202180100597.5A (publication CN117730338A), active, Pending
- 2021-07-20: EP application EP21950444.6A (publication EP4365820A1), not active, Withdrawn
- 2021-07-20: WO application PCT/CN2021/107449 (publication WO2023000179A1), active, Application Filing
Also Published As
Publication number | Publication date |
---|---|
EP4365820A1 (en) | 2024-05-08 |
CN117730338A (zh) | 2024-03-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2023000179A1 (zh) | 视频超分辨网络及视频超分辨、编解码处理方法、装置 | |
US10701394B1 (en) | Real-time video super-resolution with spatio-temporal networks and motion compensation | |
JP6245888B2 (ja) | エンコーダおよび符号化方法 | |
KR20200114436A (ko) | 스케일러블 영상 부호화를 수행하는 장치 및 방법 | |
KR20160021417A (ko) | 공간적으로 확장 가능한 비디오 코딩을 위한 적응적 보간 | |
CN108737823B (zh) | 基于超分辨技术的图像编码方法和装置、解码方法和装置 | |
WO2022068682A1 (zh) | 图像处理方法及装置 | |
JP6042899B2 (ja) | 映像符号化方法および装置、映像復号方法および装置、それらのプログラム及び記録媒体 | |
CN115606179A (zh) | 用于使用学习的下采样特征进行图像和视频编码的基于学习的下采样的cnn滤波器 | |
TWI805085B (zh) | 基於機器學習的圖像解碼中色度子採樣格式的處理方法 | |
TWI672941B (zh) | 影像處理方法、設備及系統 | |
CN112218072A (zh) | 一种基于解构压缩和融合的视频编码方法 | |
CN115552905A (zh) | 用于图像和视频编码的基于全局跳过连接的cnn滤波器 | |
WO2022011571A1 (zh) | 视频处理方法、装置、设备、解码器、系统及存储介质 | |
WO2023279961A1 (zh) | 视频图像的编解码方法及装置 | |
CN116582685A (zh) | 一种基于ai的分级残差编码方法、装置、设备和存储介质 | |
TW202239209A (zh) | 用於經學習視頻壓縮的多尺度光流 | |
Guleryuz et al. | Sandwiched Image Compression: Increasing the resolution and dynamic range of standard codecs | |
CN112601095A (zh) | 一种视频亮度和色度分数插值模型的创建方法及系统 | |
CN113747242B (zh) | 图像处理方法、装置、电子设备及存储介质 | |
JP2024511587A (ja) | ニューラルネットワークベースのピクチャ処理における補助情報の独立した配置 | |
JP2024513693A (ja) | ピクチャデータ処理ニューラルネットワークに入力される補助情報の構成可能な位置 | |
Hu et al. | Efficient image compression method using image super-resolution residual learning network | |
WO2023279968A1 (zh) | 视频图像的编解码方法及装置 | |
WO2022246809A1 (zh) | 编解码方法、码流、编码器、解码器以及存储介质 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 21950444 Country of ref document: EP Kind code of ref document: A1 |
|
WWE | Wipo information: entry into national phase |
Ref document number: 202180100597.5 Country of ref document: CN |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2021950444 Country of ref document: EP |
|
ENP | Entry into the national phase |
Ref document number: 2021950444 Country of ref document: EP Effective date: 20240202 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |