CN117099372A - Video encoding method and apparatus using deep learning based loop filter for inter prediction


Info

Publication number
CN117099372A
Authority
CN
China
Prior art keywords: frame, current frame, block, VCARN, quantization parameter
Legal status: Pending
Application number
CN202280022083.7A
Other languages
Chinese (zh)
Inventor
姜制远
金娜莹
李订炅
朴胜煜
Current Assignee
Hyundai Motor Co
Industry Collaboration Foundation of Ewha University
Kia Corp
Original Assignee
Hyundai Motor Co
Industry Collaboration Foundation of Ewha University
Kia Corp
Priority claimed from KR1020220036249A (KR20220136163A)
Application filed by Hyundai Motor Co, Industry Collaboration Foundation of Ewha University, and Kia Corp
Priority claimed from PCT/KR2022/004171 (WO2022211375A1)
Publication of CN117099372A

Abstract

Video encoding methods and apparatuses using a deep learning-based loop filter for inter prediction are disclosed. The methods and apparatuses apply the deep learning-based loop filter to inter prediction of P-frames and B-frames in order to mitigate the varying levels of image distortion caused by the quantization parameter (QP) values present in the P-frames and B-frames.

Description

Video encoding method and apparatus using deep learning based loop filter for inter prediction
Technical Field
The present disclosure relates to video encoding methods and apparatus using a deep learning based loop filter for inter prediction.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Since video data is much larger than audio or still image data, it requires a large amount of hardware resources, including memory, to store or transmit without compression.
Accordingly, an encoder is commonly used to compress video data for storage or transmission, and a decoder receives the compressed video data, decompresses it, and plays it back. Video compression techniques include H.264/AVC, High Efficiency Video Coding (HEVC), and Versatile Video Coding (VVC), which improves coding efficiency over HEVC by about 30% or more.
However, as image size, resolution, and frame rate gradually increase, the amount of data to be encoded also increases. Accordingly, new compression techniques providing higher coding efficiency and better image quality than existing compression techniques are needed.
Recently, deep learning-based image processing techniques have been applied to existing coding tools. Coding efficiency may be improved by applying deep learning-based image processing to existing compression tools such as inter prediction, intra prediction, loop filters, and transforms. Representative application examples include inter prediction based on a virtual reference frame generated by a deep learning model and a loop filter based on a denoising model.
In particular, since predicted frames (P-frames) and bi-directionally predicted frames (B-frames) exhibit different levels of image distortion according to the different Quantization Parameter (QP) values used between frames, even within a single video sequence, a loop filter adapted to this situation is required. Accordingly, to improve coding efficiency, it is desirable to provide video encoding/decoding with a deep learning-based loop filter applied to inter prediction.
[Prior Art]
[Non-patent Literature]
(Non-patent Document 1) Learning Deformable Kernels for Image and Video Denoising, arXiv:1904.06903.
(Non-patent Document 2) Ren Yang, Mai Xu, Zulin Wang, Tianyi Li, Multi-Frame Quality Enhancement for Compressed Video, CVPR 2018, arXiv:1803.04680.
(Non-patent Document 3) Fuzhi Yang, Huan Yang, Jianlong Fu, Hongtao Lu, Baining Guo, Learning Texture Transformer Network for Image Super-Resolution, CVPR 2020, arXiv:2006.04139.
Disclosure of Invention
[Problem]
The present disclosure is directed to a video encoding method and apparatus using a deep learning-based loop filter for inter-frame prediction of P-frames and B-frames in order to mitigate various levels of image distortion according to Quantization Parameter (QP) values present in the P-frames and the B-frames.
[Technical Solution]
At least one aspect of the present disclosure provides an apparatus for video quality enhancement. The apparatus comprises an input unit configured to obtain a reconstructed current frame and a decoded quantization parameter. The apparatus further comprises a quantization parameter preprocessor configured to calculate an embedding vector from the quantization parameter by using a deep learning-based embedding function, or alternatively to estimate image distortion due to the quantization parameter by using a deep learning-based estimation model. The apparatus also comprises a denoiser configured to generate an enhanced frame by removing quantization noise from the current frame using a deep learning-based denoising model. The denoising model generates the enhanced frame by utilizing the calculated embedding vector or the estimated image distortion.
Another aspect of the present disclosure provides a method performed by a computing device for enhancing the image quality of a current frame. The method includes obtaining a reconstructed current frame and a decoded quantization parameter. The method further includes calculating an embedding vector from the quantization parameter by using a deep learning-based embedding function, or alternatively estimating image distortion due to the quantization parameter by using a deep learning-based estimation model. The method also includes generating an enhanced frame by removing quantization noise from the current frame using a deep learning-based denoising model. Generating the enhanced frame includes causing the denoising model to utilize the calculated embedding vector or the estimated image distortion.
[Beneficial Effects]
As described above, the present disclosure provides a video encoding method and apparatus using a deep learning based loop filter for inter-prediction of P-frames and B-frames. Therefore, by reducing different levels of image distortion according to QP values present in P and B frames, coding efficiency is improved.
Drawings
Fig. 1 is a block diagram illustrating a video encoding device that may implement the techniques of this disclosure.
Fig. 2 illustrates a method for partitioning a block using a quadtree plus binary tree ternary tree (QTBTTT) structure.
Fig. 3a and 3b illustrate a plurality of intra prediction modes including a wide-angle intra prediction mode.
Fig. 4 is a block diagram illustrating the neighboring blocks of the current block.
Fig. 5 is a block diagram illustrating a video decoding device that may implement the techniques of this disclosure.
Fig. 6 is a diagram illustrating a hierarchical coding structure according to a Random Access (RA) mode.
Fig. 7 is a diagram illustrating a video quality enhancement device in accordance with at least one embodiment of the present disclosure.
Fig. 8a and 8b are diagrams illustrating a single video compression artifact removal network (S-VCARN) using a single network.
Fig. 9a and 9b are diagrams illustrating an S-VCARN utilizing an embedding function in accordance with at least one embodiment of the present disclosure.
Fig. 10a and 10b are diagrams illustrating S-VCARN using quantized noise estimation according to another embodiment of the present disclosure.
Fig. 11 is a diagram illustrating an S-VCARN using a mask map according to another embodiment of the present disclosure.
Fig. 12a and 12b are diagrams illustrating offsets in a reference frame in accordance with at least one embodiment of the present disclosure.
Fig. 13a and 13b are diagrams illustrating a reference video compression artifact removal network (R-VCARN) according to at least one embodiment of the present disclosure.
Fig. 14 is a diagram illustrating an R-VCARN utilizing an embedding function according to another embodiment of the present disclosure.
Fig. 15 is a schematic diagram showing a combined VCARN obtained by combining the S-VCARN and the R-VCARN.
Fig. 16 is a flowchart illustrating a video quality enhancement method using S-VCARN in accordance with at least one embodiment of the present disclosure.
Fig. 17 is a flowchart illustrating a video quality enhancement method using R-VCARN in accordance with at least one embodiment of the present disclosure.
Detailed Description
Hereinafter, some embodiments of the present disclosure are described in detail with reference to the accompanying drawings. In the following description, like reference numerals denote like elements although the elements are shown in different drawings. Furthermore, in the following description of some embodiments, a detailed description of related known components and functions that are believed to obscure the subject matter of the present disclosure has been omitted for the sake of clarity and conciseness.
Fig. 1 is a block diagram of a video encoding device in which the techniques of this disclosure may be implemented. Hereinafter, a video encoding apparatus and components of the apparatus are described with reference to the diagram of fig. 1.
The encoding apparatus may include a picture divider (110), a predictor (120), a subtractor (130), a transformer (140), a quantizer (145), a rearrangement unit (150), an entropy encoder (155), a dequantizer (160), an inverse transformer (165), an adder (170), a loop filter unit (180), and a memory (190).
Each component of the encoding apparatus may be implemented as hardware or software or as a combination of hardware and software. Further, the function of each component may be implemented as software, and a microprocessor may also be implemented to execute the function of the software corresponding to each component.
A video is made up of one or more sequences including a plurality of pictures. Each picture is divided into a plurality of regions, and encoding is performed on each region. For example, a picture is partitioned into one or more tiles and/or slices. Here, the one or more tiles may be defined as a tile group. Each tile or slice is partitioned into one or more Coding Tree Units (CTUs), and each CTU is partitioned into one or more Coding Units (CUs) by a tree structure. Information applied to each CU is encoded as the syntax of the CU, and information commonly applied to the CUs included in one CTU is encoded as the syntax of the CTU. Further, information commonly applied to all blocks in one slice is encoded as the syntax of a slice header, and information applied to all blocks constituting one or more pictures is encoded in a Picture Parameter Set (PPS) or a picture header. Furthermore, information commonly referred to by a plurality of pictures is encoded in a Sequence Parameter Set (SPS), and information commonly referenced by one or more SPSs is encoded in a Video Parameter Set (VPS). Further, information commonly applied to one tile or tile group may also be encoded as the syntax of a tile or tile group header. The syntaxes included in the SPS, PPS, slice header, and tile or tile group header may be referred to as high-level syntax.
The picture divider (110) determines the size of a Coding Tree Unit (CTU). Information about the size of the CTU (CTU size) is encoded as a syntax of the SPS or PPS and transmitted to the video decoding apparatus.
A picture divider (110) divides each picture constituting a video into a plurality of Coding Tree Units (CTUs) having a predetermined size, and then recursively divides the CTUs by using a tree structure. Leaf nodes in the tree structure become Coding Units (CUs), which are the basic units of coding.
The tree structure may be a QuadTree (QT), in which a higher node (or parent node) is split into four lower nodes (or child nodes) of the same size. The tree structure may also be a Binary Tree (BT), in which a higher node is split into two lower nodes, or a Ternary Tree (TT), in which a higher node is split into three lower nodes at a ratio of 1:2:1. The tree structure may also be a structure in which two or more of the QT, BT, and TT structures are mixed. For example, a QuadTree plus Binary Tree (QTBT) structure may be used, or a QuadTree plus Binary Tree Ternary Tree (QTBTTT) structure may be used. Here, BT and TT together may be referred to as a Multi-Type Tree (MTT).
Fig. 2 is a diagram for describing a method of dividing a block by using the QTBTTT structure.
As shown in fig. 2, CTUs may be first divided into QT structures. Quadtree partitioning may be recursive until the size of the partitioned block reaches a minimum block size (MinQTSize) of leaf nodes allowed in QT. A first flag (qt_split_flag) indicating whether each node of the QT structure is split into four nodes of a lower layer is encoded by an entropy encoder (155) and signaled to a video decoding apparatus. When the leaf node of QT is not greater than the maximum block size (MaxBTSize) of the root node allowed in BT, the leaf node may be further divided into at least one of BT structure or TT structure. There may be multiple directions of segmentation in the BT structure and/or the TT structure. For example, there may be two directions, i.e., a direction in which the block of the corresponding node is divided horizontally and a direction in which the block of the corresponding node is divided vertically. As shown in fig. 2, when the MTT segmentation starts, a second flag (MTT _split_flag) indicating whether the node is segmented and a flag additionally indicating a segmentation direction (vertical or horizontal) and/or a flag indicating a segmentation type (binary or ternary) in case the node is segmented are encoded by an entropy encoder (155) and signaled to the video decoding device.
Alternatively, a CU split flag (split_cu_flag) indicating whether a node is split may also be encoded before encoding a first flag (qt_split_flag) indicating whether each node is split into four nodes of the lower layer. When the value of the CU partition flag (split_cu_flag) indicates that each node is not partitioned, the block of the corresponding node becomes a leaf node in the partition tree structure and becomes a CU as a basic unit of encoding. When the value of the CU partition flag (split_cu_flag) indicates that each node is partitioned, the video encoding apparatus first starts encoding the first flag through the above scheme.
When QTBT is used as another example of the tree structure, there may be two split types: a type in which the block of the corresponding node is horizontally split into two blocks of the same size (i.e., symmetric horizontal splitting) and a type in which the block of the corresponding node is vertically split into two blocks of the same size (i.e., symmetric vertical splitting). A split flag (split_flag) indicating whether each node of the BT structure is split into lower-layer blocks and split type information indicating the split type are encoded by the entropy encoder (155) and transmitted to the video decoding apparatus. Meanwhile, there may additionally be a type in which the block of the corresponding node is split into two blocks of mutually asymmetric form. The asymmetric form may include a form in which the block of the corresponding node is split at a size ratio of 1:3, or a form in which the block of the corresponding node is split in a diagonal direction.
A CU may have various sizes according to QTBT or QTBTTT splitting from the CTU. Hereinafter, a block corresponding to a CU to be encoded or decoded (i.e., a leaf node of the QTBTTT) is referred to as a "current block". Since QTBTTT splitting is employed, the shape of the current block may be rectangular as well as square.
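As an illustration of the tree structures described above, the sketch below recursively applies a quadtree split down to a hypothetical MinQTSize and then shows one symmetric BT split and one 1:2:1 TT split of a leaf. It is a toy model of the partitioning concept only, not the codec's actual split-decision logic; all block sizes and thresholds are example values.

```python
# Illustrative sketch of QT/BT/TT partitioning (not the codec's actual logic).
def split_qt(x, y, w, h, min_qt_size, leaves):
    """Quadtree stage: split into four equal sub-blocks until MinQTSize."""
    if w <= min_qt_size or h <= min_qt_size:
        leaves.append((x, y, w, h))          # QT leaf; MTT splitting could follow here
        return
    hw, hh = w // 2, h // 2
    for dx, dy in ((0, 0), (hw, 0), (0, hh), (hw, hh)):
        split_qt(x + dx, y + dy, hw, hh, min_qt_size, leaves)

def split_bt_horizontal(x, y, w, h):
    """Binary split into two equal halves (symmetric horizontal split)."""
    return [(x, y, w, h // 2), (x, y + h // 2, w, h // 2)]

def split_tt_vertical(x, y, w, h):
    """Ternary split at a 1:2:1 width ratio (vertical TT split)."""
    q = w // 4
    return [(x, y, q, h), (x + q, y, 2 * q, h), (x + 3 * q, y, q, h)]

leaves = []
split_qt(0, 0, 128, 128, min_qt_size=64, leaves=leaves)  # 128x128 CTU, example MinQTSize
print(leaves)                          # four 64x64 QT leaves
print(split_bt_horizontal(*leaves[0])) # one leaf split by symmetric BT
print(split_tt_vertical(*leaves[1]))   # one leaf split by TT into 16/32/16 widths
```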
A predictor (120) predicts a current block to generate a predicted block. The predictors (120) include an intra predictor (122) and an inter predictor (124).
In general, each of the current blocks in a picture may be predictively encoded. In general, prediction of a current block may be performed by using an intra prediction technique (using data from a picture including the current block) or an inter prediction technique (using data from a picture encoded before the picture including the current block). Inter prediction includes both unidirectional prediction and bi-directional prediction.
An intra predictor (122) predicts pixels in the current block by using pixels (reference pixels) located around the current block in the current picture containing the current block. There are multiple intra prediction modes according to the prediction direction. For example, as shown in Fig. 3a, the plurality of intra prediction modes may include 2 non-directional modes, comprising a planar mode and a DC mode, and 65 directional modes. The neighboring pixels and the arithmetic equation to be used are defined differently for each prediction mode.
For efficient directional prediction of a rectangular current block, directional modes (intra prediction modes #67 to #80 and #-1 to #-14) indicated by the dashed arrows in Fig. 3b may additionally be used. These may be referred to as "wide-angle intra prediction modes". In Fig. 3b, the arrows indicate the corresponding reference samples used for prediction, not the prediction directions; the prediction direction is opposite to the direction indicated by the arrow. A wide-angle intra prediction mode is a mode in which prediction is performed in the direction opposite to a specific directional mode without additional bit transmission when the current block is rectangular. In this case, among the wide-angle intra prediction modes, the modes available for the current block may be determined by the ratio of the width to the height of the rectangular current block. For example, when the current block has a rectangular shape whose height is smaller than its width, wide-angle intra prediction modes with angles smaller than 45 degrees (intra prediction modes #67 to #80) are available. When the current block has a rectangular shape whose width is smaller than its height, wide-angle intra prediction modes with angles greater than -135 degrees (intra prediction modes #-1 to #-14) are available.
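The availability rule described above can be sketched as follows. This is a simplified illustration assuming only the block aspect ratio decides which wide-angle mode range (#67 to #80 or #-1 to #-14) becomes available; it does not reproduce the exact mode-remapping rule of any standard.

```python
# Simplified sketch: which wide-angle intra modes a rectangular block may use.
def wide_angle_modes(width: int, height: int):
    if width == height:
        return []                       # square block: no wide-angle modes
    if width > height:                  # wide block: angles below 45 degrees
        return list(range(67, 81))      # modes #67..#80
    return list(range(-1, -15, -1))     # tall block: modes #-1..#-14 (> -135 degrees)

print(wide_angle_modes(32, 8))   # wide rectangular block
print(wide_angle_modes(8, 32))   # tall rectangular block
print(wide_angle_modes(16, 16))  # square block
```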
The intra predictor (122) may determine the intra prediction mode to be used for encoding the current block. In some examples, the intra predictor (122) may encode the current block by using several intra prediction modes and select an appropriate intra prediction mode from the tested modes. For example, the intra predictor (122) may calculate rate-distortion values by using rate-distortion analysis of the several tested intra prediction modes and select the intra prediction mode having the best rate-distortion characteristics among the tested modes.
An intra predictor (122) selects one intra prediction mode among a plurality of intra prediction modes, and predicts a current block by using neighboring pixels (reference pixels) and an arithmetic equation determined according to the selected intra prediction mode. Information about the selected intra prediction mode is encoded by an entropy encoder (155) and transmitted to a video decoding device.
An inter predictor (124) generates a prediction block for a current block by using a motion compensation process. An inter predictor (124) searches for a block most similar to a current block in a reference picture encoded and decoded earlier than the current picture, and generates a prediction block for the current block by using the searched block. In addition, a Motion Vector (MV) is generated, which corresponds to a shift between a current block in the current picture and a prediction block in the reference picture. In general, motion estimation is performed on a luminance component, and a motion vector calculated based on the luminance component is used for both the luminance component and the chrominance component. Motion information including information on a reference picture and information on a motion vector for predicting a current block is encoded by an entropy encoder (155) and transmitted to a video decoding apparatus.
The inter predictor (124) may also perform interpolation on the reference picture or reference block in order to increase the accuracy of the prediction. In other words, sub-samples between two consecutive integer samples are interpolated by applying filter coefficients to a plurality of consecutive integer samples including the two integer samples. When the process of searching for the block most similar to the current block is performed on the interpolated reference picture, the motion vector can be expressed with fractional-sample precision rather than integer-sample precision. The precision or resolution of the motion vector may be set differently for each target region to be encoded (e.g., a unit such as a slice, tile, CTU, or CU). When such Adaptive Motion Vector Resolution (AMVR) is applied, information on the motion vector resolution to be applied to each target region should be signaled for each target region. For example, when the target region is a CU, information on the motion vector resolution applied to each CU is signaled. The information on the motion vector resolution may be information representing the precision of the motion vector difference, which is described below.
On the other hand, the inter predictor (124) may perform inter prediction by using bi-directional prediction. In the case of bi-prediction, two reference pictures and two motion vectors representing block positions most similar to the current block in each reference picture are used. An inter predictor (124) selects a first reference picture and a second reference picture from reference picture list0 (RefPicList 0) and reference picture list1 (RefPicList 1), respectively. The inter predictor (124) also searches for a block most similar to the current block among the respective reference pictures to generate a first reference block and a second reference block. In addition, a prediction block for the current block is generated by averaging or weighted-averaging the first reference block and the second reference block. In addition, motion information including information on two reference pictures for predicting a current block and information on two motion vectors is transmitted to an entropy encoder (155). Here, the reference picture list0 may be composed of pictures preceding the current picture in display order among the pre-restored pictures, and the reference picture list1 may be composed of pictures following the current picture in display order among the pre-restored pictures. However, although not particularly limited thereto, a pre-restored picture following the current picture in display order may be additionally included in the reference picture list 0. Conversely, a picture preceding the current picture may be additionally included in the reference picture list 1.
In order to minimize the amount of bits consumed for encoding motion information, various methods may be used.
For example, when a reference picture and a motion vector of a current block are identical to those of a neighboring block, information capable of identifying the neighboring block is encoded to transmit motion information of the current block to a video decoding apparatus. This approach is called merge mode.
In the merge mode, the inter predictor (124) selects a predetermined number of merge candidate blocks (hereinafter, referred to as "merge candidates") from neighboring blocks of the current block.
As shown in fig. 4, all or some of a left block (A0), a lower left block (A1), an upper block (B0), an upper right block (B1), and an upper left block (B2) adjacent to the current block in the current picture may be used as neighboring blocks for deriving a merge candidate. Further, blocks located within a reference picture (which may be the same as or different from a reference picture used to predict the current block) other than the current picture in which the current block is located may also be used as merging candidates. For example, a block co-located with the current block within the reference picture or a block adjacent to the co-located block may be additionally used as a merge candidate. If the number of merging candidates selected by the method described above is less than the preset number, a zero vector is added to the merging candidates.
An inter predictor (124) configures a merge list including a predetermined number of merge candidates by using neighboring blocks. A merge candidate to be used as motion information of the current block is selected from among the merge candidates included in the merge list, and merge index information for identifying the selected candidate is generated. The generated merging index information is encoded by an entropy encoder (155) and transmitted to a video decoding apparatus.
The merge skip mode is a special case of the merge mode. After quantization, when all transform coefficients for entropy encoding are close to zero, only the neighboring block selection information is transmitted without transmitting a residual signal. By using the merge skip mode, relatively high coding efficiency can be achieved for images with slight motion, still images, screen content images, and the like.
Hereinafter, the merge mode and the merge skip mode are collectively referred to as a merge/skip mode.
Another method for encoding motion information is Advanced Motion Vector Prediction (AMVP) mode.
In the AMVP mode, the inter predictor (124) obtains motion vector predictor candidates for the motion vector of the current block by using neighboring blocks of the current block. As the neighboring blocks used to obtain the motion vector predictor candidates, all or some of a left block (A0), a lower-left block (A1), an upper block (B0), an upper-right block (B1), and an upper-left block (B2) adjacent to the current block in the current picture shown in Fig. 4 may be used. In addition, a block located within a reference picture (which may be the same as or different from the reference picture used to predict the current block) other than the current picture containing the current block may also be used as a neighboring block for obtaining the motion vector predictor candidates. For example, the block co-located with the current block within the reference picture or blocks adjacent to the co-located block may be used. If the number of motion vector candidates selected by the above method is less than a preset number, a zero vector is added to the motion vector candidates.
The inter predictor (124) obtains the motion vector predictor candidates by using the motion vectors of the neighboring blocks and determines a motion vector predictor for the motion vector of the current block by using the motion vector predictor candidates. A motion vector difference is then calculated by subtracting the motion vector predictor from the motion vector of the current block.
The motion vector predictor may be obtained by applying a predefined function (e.g., a median or average operation) to the motion vector predictor candidates. In this case, the video decoding apparatus is also aware of the predefined function. In addition, since the neighboring blocks used to obtain the motion vector predictor candidates are blocks for which encoding and decoding have already been completed, the video decoding apparatus also already knows their motion vectors. Therefore, the video encoding apparatus does not need to encode information for identifying the motion vector predictor candidates. In this case, information on the motion vector difference and information on the reference picture used to predict the current block are encoded.
On the other hand, the motion vector predictor may also be determined by a scheme of selecting any one of the motion vector predictor candidates. In this case, information for identifying the selected motion vector predictor candidate is additionally encoded, together with information on the motion vector difference and information on the reference picture used to predict the current block.
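The AMVP signaling described above can be illustrated with the following sketch: build a motion vector predictor candidate list from neighboring blocks, pad with zero vectors, choose a predictor, and encode only the candidate index and the motion vector difference. The candidate values, list size, and selection criterion are hypothetical.

```python
# Illustrative AMVP sketch (hypothetical candidate values and selection rule).
def build_mvp_candidates(neighbor_mvs, max_candidates=2):
    candidates = [mv for mv in neighbor_mvs if mv is not None]
    while len(candidates) < max_candidates:
        candidates.append((0, 0))                  # pad with zero vector
    return candidates[:max_candidates]

def encode_amvp(current_mv, neighbor_mvs):
    candidates = build_mvp_candidates(neighbor_mvs)
    # choose the predictor giving the smallest difference (one possible criterion)
    idx, mvp = min(enumerate(candidates),
                   key=lambda c: abs(current_mv[0] - c[1][0]) + abs(current_mv[1] - c[1][1]))
    mvd = (current_mv[0] - mvp[0], current_mv[1] - mvp[1])
    return idx, mvd                                # signaled: candidate index + MVD

idx, mvd = encode_amvp(current_mv=(5, -3), neighbor_mvs=[(4, -2), None, (0, 0)])
print(idx, mvd)   # e.g. predictor index 0, MVD (1, -1)
```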
A subtractor (130) generates a residual block by subtracting a prediction block generated by the intra predictor (122) or the inter predictor (124) from the current block.
A transformer (140) transforms the residual signal in a residual block having pixel values in the spatial domain into transform coefficients in the frequency domain. The transformer (140) may transform the residual signal in the residual block by using the entire size of the residual block as a transform unit, or may divide the residual block into a plurality of sub-blocks and perform the transform by using a sub-block as a transform unit. Alternatively, the residual block may be divided into two sub-blocks, a transform region and a non-transform region, and the residual signal may be transformed by using only the transform-region sub-block as a transform unit. Here, the transform-region sub-block may be one of two rectangular blocks having a size ratio of 1:1 with respect to the horizontal axis (or vertical axis). In this case, a flag (cu_sbt_flag) indicating that only the sub-block is transformed, direction (vertical/horizontal) information (cu_sbt_horizontal_flag), and/or position information (cu_sbt_pos_flag) are encoded by the entropy encoder (155) and signaled to the video decoding apparatus. Furthermore, the transform-region sub-block may have a size ratio of 1:3 with respect to the horizontal axis (or vertical axis). In this case, a flag (cu_sbt_quad_flag) distinguishing the corresponding partition is additionally encoded by the entropy encoder (155) and signaled to the video decoding apparatus.
Meanwhile, the transformer (140) may perform transformation on the residual block separately in the horizontal direction and the vertical direction. For the transformation, different types of transformation functions or transformation matrices may be used. For example, a pair of transform functions for horizontal transforms and vertical transforms may be defined as a Multiple Transform Set (MTS). The transformer (140) may select one transform function pair having the highest transform efficiency in the MTS and may transform the residual block in each of the horizontal and vertical directions. Information (mts_idx) of the transform function pairs in the MTS is encoded by an entropy encoder (155) and signaled to the video decoding device.
The quantizer (145) quantizes the transform coefficients output from the transformer (140) using the quantization parameter and outputs the quantized transform coefficients to the entropy encoder (155). For some blocks or frames, the quantizer (145) may also directly quantize the residual block without transform. The quantizer (145) may also apply different quantization coefficients (scaling values) according to the positions of the transform coefficients in the transform block. A quantization matrix applied to the two-dimensionally arranged quantized transform coefficients may be encoded and signaled to the video decoding apparatus.
The rearrangement unit (150) may perform rearrangement of coefficient values for the quantized residual values.
The rearrangement unit (150) may change the 2D coefficient array into a 1D coefficient sequence by using coefficient scanning. For example, the rearrangement unit (150) may output a 1D coefficient sequence by scanning from the DC coefficient toward coefficients in the high-frequency region using a zig-zag scan or a diagonal scan. Instead of the zig-zag scan, a vertical scan that scans the 2D coefficient array in the column direction or a horizontal scan that scans the 2D block-type coefficients in the row direction may also be used, depending on the size of the transform unit and the intra prediction mode. In other words, the scan method to be used may be determined among the zig-zag scan, diagonal scan, vertical scan, and horizontal scan according to the size of the transform unit and the intra prediction mode.
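The coefficient reordering can be illustrated with a simplified diagonal scan that walks anti-diagonals from the DC coefficient outward. This is a toy example of 2D-to-1D scanning, not the exact scan order defined by any particular standard.

```python
# Simplified diagonal scan: 2D quantized coefficients -> 1D sequence (DC first).
def diagonal_scan(block):
    n = len(block)
    order = []
    for s in range(2 * n - 1):                 # s = i + j, from DC outward
        for i in range(n):
            j = s - i
            if 0 <= j < n:
                order.append(block[i][j])
    return order

block = [
    [9, 3, 1, 0],
    [4, 2, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(diagonal_scan(block))   # DC coefficient first, high-frequency coefficients last
```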
An entropy encoder (155) generates a bitstream by encoding a sequence of 1D quantized transform coefficients output from a rearrangement unit (150) using various encoding schemes including context-based adaptive binary arithmetic coding (CABAC), exponential golomb, and the like.
Further, the entropy encoder (155) encodes information related to block division (such as the CTU size, CTU division flag, QT division flag, MTT division type, and MTT division direction) so that the video decoding apparatus can divide the block in the same manner as the video encoding apparatus. Further, the entropy encoder (155) encodes information on the prediction type indicating whether the current block is encoded by intra prediction or by inter prediction. The entropy encoder (155) encodes intra prediction information (i.e., information on the intra prediction mode) or inter prediction information (a merge index in the case of the merge mode; information on the reference picture index and the motion vector difference in the case of the AMVP mode) according to the prediction type. Furthermore, the entropy encoder (155) encodes information related to quantization, i.e., information on the quantization parameter and information on the quantization matrix.
The dequantizer (160) dequantizes the quantized transform coefficients output from the quantizer (145) to generate transform coefficients. An inverse transformer (165) transforms the transform coefficients output from the dequantizer (160) from the frequency domain to the spatial domain to recover a residual block.
An adder (170) adds the restored residual block to the prediction block generated by the predictor (120) to restore the current block. When intra-predicting the next ordered block, the pixels in the restored current block may be used as reference pixels.
A loop filter unit (180) performs filtering on the restored pixels in order to reduce blocking artifacts, ringing artifacts, blurring artifacts, and the like caused by block-based prediction and transform/quantization. The loop filter unit (180), as a loop filter, may include all or some of a deblocking filter (182), a Sample Adaptive Offset (SAO) filter (184), and an Adaptive Loop Filter (ALF) (186).
The deblocking filter (182) filters the boundaries between restored blocks to remove blocking artifacts caused by block-wise encoding/decoding, and the SAO filter (184) and ALF (186) perform additional filtering on the deblock-filtered video. The SAO filter (184) and ALF (186) are filters used to compensate for differences between the restored pixels and the original pixels caused by lossy coding. The SAO filter (184) applies offsets on a CTU basis to enhance subjective image quality and coding efficiency. On the other hand, the ALF (186) performs filtering on a block basis and compensates for distortion by applying different filters according to the edges of each block and the degree of change. Information on the filter coefficients to be used in the ALF may be encoded and signaled to the video decoding apparatus.
The restored blocks filtered by the deblocking filter (182), the SAO filter (184), and the ALF (186) are stored in a memory (190). When all blocks in one picture are restored, the restored picture may be used as a reference picture for inter-predicting blocks within a picture to be encoded later.
Fig. 5 is a functional block diagram of a video decoding device in which the techniques of this disclosure may be implemented. Hereinafter, with reference to fig. 5, a video decoding apparatus and components of the apparatus are described.
The video decoding apparatus may include an entropy decoder (510), a rearrangement unit (515), a dequantizer (520), an inverse transformer (530), a predictor (540), an adder (550), a loop filter unit (560), and a memory (570).
Similar to the video encoding apparatus of fig. 1, each component of the video decoding apparatus may be implemented as hardware or software or as a combination of hardware and software. Further, the function of each component may be implemented as software, and a microprocessor may also be implemented to execute the function of the software corresponding to each component.
An entropy decoder (510) extracts information related to block segmentation by decoding a bitstream generated by a video encoding apparatus to determine a current block to be decoded, and extracts prediction information required to restore the current block and information on a residual signal.
An entropy decoder (510) determines the size of a CTU by extracting information on the CTU size from a Sequence Parameter Set (SPS) or a Picture Parameter Set (PPS), and partitions a picture into CTUs having the determined size. In addition, the CTU is determined to be the highest layer of the tree structure, i.e., the root node, and the division information of the CTU may be extracted to divide the CTU using the tree structure.
For example, when dividing CTUs using the QTBTTT structure, first a first flag (qt_split_flag) related to the division of QT is extracted to divide each node into four nodes of the lower layer. Further, for a node corresponding to a leaf node of QT, a second flag (MTT _split_flag), a split direction (vertical/horizontal), and/or a split type (binary/ternary) related to the split of the MTT are extracted to split the corresponding leaf node into an MTT structure. As a result, each node below the leaf node of QT is recursively partitioned into BT or TT structures.
As another example, when a CTU is divided by using the QTBTTT structure, a CU division flag (split_cu_flag) indicating whether the CU is divided is extracted. The first flag (qt_split_flag) may also be extracted when the corresponding block is partitioned. During the segmentation process, 0 or more recursive MTT segmentations may occur after 0 or more recursive QT segmentations for each node. For example, for CTUs, MTT partitioning may occur directly, or conversely, QT partitioning may occur only multiple times.
For another example, when the CTU is divided using the QTBT structure, a first flag (qt_split_flag) related to the division of QT is extracted to divide each node into four nodes of the lower layer. Further, a split flag (split_flag) indicating whether a node corresponding to a leaf node of QT is further split into BT and split direction information are extracted.
Meanwhile, when the entropy decoder (510) determines a current block to be decoded by using the partition of the tree structure, the entropy decoder (510) extracts information on a prediction type indicating whether the current block is intra-predicted or inter-predicted. When the prediction type information indicates intra prediction, the entropy decoder (510) extracts syntax elements for intra prediction information (intra prediction mode) of the current block. When the prediction type information indicates inter prediction, the entropy decoder (510) extracts information representing syntax elements (i.e., motion vectors and reference pictures to which the motion vectors refer) for the inter prediction information.
Further, the entropy decoder (510) extracts quantization related information and extracts information on quantized transform coefficients of the current block as information on a residual signal.
The rearrangement unit (515) may change the sequence of the 1D quantized transform coefficients entropy-decoded by the entropy decoder (510) into a 2D coefficient array (i.e., block) again in an order opposite to the coefficient scanning order performed by the video encoding apparatus.
The dequantizer (520) dequantizes the quantized transform coefficients by using the quantization parameter. The dequantizer (520) may also apply different quantization coefficients (scaling values) to the quantized transform coefficients arranged in 2D. The dequantizer (520) may perform dequantization by applying a matrix of quantization coefficients (scaling values) from the video encoding apparatus to the 2D array of quantized transform coefficients.
An inverse transformer (530) restores a residual signal by inversely transforming the dequantized transform coefficients from the frequency domain to the spatial domain, thereby generating a residual block for the current block.
Further, when the inverse transformer (530) inversely transforms only a partial region (sub-block) of the transform block, the inverse transformer (530) extracts a flag (cu_sbt_flag) indicating that only a sub-block of the transform block has been transformed, direction (vertical/horizontal) information (cu_sbt_horizontal_flag) of the sub-block, and/or position information (cu_sbt_pos_flag) of the sub-block. The inverse transformer (530) then inversely transforms the transform coefficients of the corresponding sub-block from the frequency domain to the spatial domain to restore the residual signal and fills the region that is not inversely transformed with the value "0" as the residual signal, thereby generating the final residual block for the current block.
Further, when applying MTS, the inverse transformer (530) determines a transform index or a transform matrix applied in each of the horizontal direction and the vertical direction by using MTS information (mts_idx) signaled from the video encoding apparatus. The inverse transformer (530) also performs inverse transformation on the transform coefficients in the transform block in the horizontal direction and the vertical direction by using the determined transform function.
The predictors (540) may include an intra predictor (542) and an inter predictor (544). The intra predictor (542) is activated when the prediction type of the current block is intra prediction, and the inter predictor (544) is activated when the prediction type of the current block is inter prediction.
The intra predictor (542) determines an intra prediction mode of the current block among a plurality of intra prediction modes according to a syntax element for the intra prediction mode extracted from the entropy decoder (510). The intra predictor (542) also predicts the current block by using neighboring reference pixels of the current block according to an intra prediction mode.
The inter predictor (544) determines a motion vector of the current block and a reference picture to which the motion vector refers by using syntax elements for an inter prediction mode extracted from the entropy decoder (510).
The adder (550) restores the current block by adding the residual block output from the inverse transformer (530) to the predicted block output from the inter predictor (544) or the intra predictor (542). In intra prediction of a block to be decoded later, pixels within the restored current block are used as reference pixels.
The loop filter unit (560) as a loop filter may include a deblocking filter (562), an SAO filter (564), and an ALF (566). A deblocking filter (562) performs deblocking filtering on boundaries between restored blocks to remove block artifacts occurring due to block unit decoding. The SAO filter (564) and ALF (566) perform additional filtering on the restored block after deblocking filtering to compensate for differences between restored pixels and original pixels due to lossy encoding. The filter coefficients of the ALF are determined by using information on the filter coefficients decoded from the bitstream.
The restored blocks filtered by the deblocking filter (562), the SAO filter (564), and the ALF (566) are stored in a memory (570). When all blocks in one picture are restored, the restored picture may be used as a reference picture for inter-predicting blocks within a picture to be decoded later.
In some implementations, the present disclosure relates to encoding and decoding video images as described above. More particularly, the present disclosure provides a video encoding method and apparatus using a deep learning based loop filter for inter-prediction of P-frames and B-frames in order to mitigate various levels of image distortion according to Quantization Parameter (QP) values present in the P-frames and the B-frames.
The following embodiments may be commonly applied to a loop filter unit (180) in a video encoding apparatus and a loop filter unit (560) in a video decoding apparatus at a portion where a deep learning technique is utilized.
I. Hierarchical coding structure
Fig. 6 is a diagram illustrating a hierarchical coding structure according to a Random Access (RA) mode.
In the random access mode, the video encoding apparatus refers to pictures encoded and decoded at times earlier and later than the current frame. In the hierarchical coding structure in the random access mode shown in Fig. 6, the size of the Group Of Pictures (GOP) is 8. If the GOP size is set to 16 or 32, the hierarchical coding structure changes accordingly, and the reference frames of the current frame to be encoded may also vary.
In the example shown in Fig. 6, the numbers in the squares represent the coding order. The video encoding apparatus first encodes the I-frame (intra frame), whose coding order is 0. The video encoding apparatus then encodes the P-frame (predicted frame) with reference to the I-frame and then encodes the B-frame (bi-predictive frame) located between the I-frame and the P-frame. These three frames are encoded by using quantization parameters QP = i, QP = i+1, and QP = i+2, respectively. These frames form the lowest depth of the hierarchy and are denoted by temporal layer ID = 0.
The video encoding apparatus then encodes the frames located between the frames of temporal layer ID = 0. For example, the frame in the middle between the 0th-coded frame and the 2nd-coded frame is encoded next, followed by the frame in the middle between the 1st-coded frame and the 2nd-coded frame. These frames are assigned temporal layer ID = 1 and are encoded by using the quantization parameter QP = i+3. Similarly, the frames assigned temporal layer ID = 2 are encoded by using the quantization parameter QP = i+4.
Thus, as the temporal layer increases, the quantization parameter also increases. That is, frames in lower temporal layers are compressed with lower quantization parameters to have a higher peak signal-to-noise ratio (PSNR) and higher video quality. On the other hand, a frame that is inter-predicted with reference to a frame in a lower temporal layer may be compressed with a relatively high quantization parameter and thus have a lower PSNR.
Meanwhile, the POC (Picture Order Count) is an index assigned within the GOP according to temporal order. That is, the 0th to 8th frames are assigned POC = 0 to POC = 8 in order.
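A minimal sketch following the description above, assuming a GOP of 8 in random-access mode: the I-frame, P-frame, and middle B-frame form temporal layer 0 with QP = i, i+1, i+2, the quarter-position frames form layer 1 (QP = i+3), and the remaining frames form layer 2 (QP = i+4). The mapping from POC to layer and the base QP value are illustrative assumptions.

```python
# Illustrative QP assignment per temporal layer for a GOP of 8 (example values).
I = 22  # example base QP "i"

def layer_and_qp(poc, gop_size=8):
    if poc % gop_size == 0:
        # I frame at POC 0, P frame at POC 8 (anchor frames of the GOP)
        return 0, I if poc == 0 else I + 1
    if poc % (gop_size // 2) == 0:
        return 0, I + 2                # middle B frame (e.g., POC 4)
    if poc % (gop_size // 4) == 0:
        return 1, I + 3                # e.g., POC 2 and POC 6
    return 2, I + 4                    # e.g., POC 1, 3, 5, 7

for poc in range(9):
    print(poc, layer_and_qp(poc))      # (temporal layer ID, QP) per POC
```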
II. Video compression artifact removal network (VCARN)
A VCARN is a deep learning-based denoising model that removes noise or artifacts caused by quantization during video compression. A VCARN may be implemented based on a Convolutional Neural Network (CNN), and its operation of removing noise from video is similar to video denoising. Unlike image denoising, video denoising can utilize previously encoded frames to further improve denoising performance.
Fig. 7 is a diagram illustrating a video quality enhancement device in accordance with at least one embodiment of the present disclosure.
The video quality enhancement device according to the present embodiment may include all or part of an input unit (702), a Quantization Parameter (QP) preprocessor (704), and a denoiser (706). Such a video quality enhancement device may be used as one of the loop filters within the loop filter unit (180) in the video encoding apparatus or the loop filter unit (560) in the video decoding apparatus. When the device is included in the loop filter unit (180) of the video encoding apparatus, the components included according to the present embodiment are not necessarily limited to the illustrated components. For example, the video encoding apparatus may be further equipped with a training unit (not shown) for training the deep learning model included in the video quality enhancement device, or may be implemented in a configuration that interacts with an external training unit.
An input unit (702) obtains the current frame and decoded QP. Here, the current frame may be a P frame or a B frame reconstructed according to inter prediction. In addition, the input unit (702) may select a reference frame from a reference list, which will be described in more detail below.
The QP preprocessor (704) calculates an embedding vector for the QP by using a deep learning-based embedding function, or estimates the image distortion due to the QP by using a deep learning-based estimation model. The QP preprocessor (704) then transmits the embedding vector or the estimated image distortion to the denoiser (706).
The denoiser (706) uses a deep learning-based denoising model to generate a quality-enhanced frame from the current frame (i.e., the P/B frame). The denoiser (706) may utilize the embedding vector or the estimated image distortion. When using the embedding vector, the denoiser (706) may employ a conventional VCARN as the denoising model. When using the estimated image distortion, the denoiser (706) may use a normalization module in the denoising model to generate the enhanced image from the current frame.
As another example, the denoiser (706) may generate a similar frame from the reference frame using a VCARN and may then generate the enhanced frame using the current frame and the similar frame. The denoiser (706) may utilize the embedding vector in generating the enhanced frame.
Even within a single video sequence, P/B frames may contain varying levels of image distortion based on varying QP values. This embodiment provides an example of a VCARN that is suitable for use in this environment, such that the enhanced signal can more closely approximate the original signal.
In addition, VCARNs may be classified into an S-VCARN, which removes quantization noise using only the single current frame, and an R-VCARN, which additionally uses a reference frame.
The S-VCARN improves the current frame x_t and can be expressed as shown in Equation 1. The S-VCARN may be designed to operate adaptively on different levels of quantization noise according to QP_x.
[Equation 1]
x̂_{t,s} = f(x_t, QP_x)
Further, the R-VCARN improves the current frame x_t by using a reference frame x_r and can be expressed as shown in Equation 2. A method of selecting the reference frame x_r and a method of generating from the reference frame a similar frame that approximates the current frame are described below.
[Equation 2]
x̂_{t,r} = g(x_t, x_r)
The R-VCARN may also be designed to operate on different quantization noise.
Alternatively, the two models described above may be combined to produce a combined VCARN. The combined VCARN may also be designed to operate on different quantization noise.
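A minimal interface sketch of Equations 1 and 2 is shown below; f and g stand for the S-VCARN and R-VCARN models, the toy lambdas are placeholders only, and the simple averaging in combined_vcarn is an illustrative assumption rather than the combination shown in Fig. 15.

```python
# Interface sketch of Equations 1 and 2 (placeholders, not the actual models).
def s_vcarn(f, x_t, qp_x):
    """Equation 1: x_hat_{t,s} = f(x_t, QP_x)."""
    return f(x_t, qp_x)

def r_vcarn(g, x_t, x_r):
    """Equation 2: x_hat_{t,r} = g(x_t, x_r)."""
    return g(x_t, x_r)

def combined_vcarn(f, g, x_t, x_r, qp_x):
    # hypothetical fusion by averaging, for illustration only
    return 0.5 * (s_vcarn(f, x_t, qp_x) + r_vcarn(g, x_t, x_r))

# toy stand-ins for f and g (identity-like behavior, illustration only)
f = lambda x_t, qp_x: x_t
g = lambda x_t, x_r: 0.5 * (x_t + x_r)
print(combined_vcarn(f, g, x_t=1.0, x_r=0.8, qp_x=32))
```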
In addition to the loop filter of the inter prediction signal described above, the S-VCARN, R-VCARN, and combined VCARN according to the present embodiment may be applied to improve inter prediction signals, may be applied to post-process compressed video signals to enhance quality, and may be applied to improve performance of the VCARN itself.
On the other hand, the conventional VCARN has the following problems when used as a loop filter for an inter prediction signal.
A VCARN using only the current frame may face the domain shift problem, which is common to all deep learning-based techniques. The domain shift problem is a phenomenon in which the performance of the resulting VCARN is degraded if the probability distributions of the training samples and the test samples differ, or if the training samples are not sufficiently generalized. For example, VVC (Versatile Video Coding), whose QP ranges from 0 to 63, would require VCARNs trained for each of these environments, making it impractical to use a separate network trained per QP. Thus, a VCARN needs to use one or a small number of networks to process video or video frames distorted by a wide variety of QPs. This diversity of QPs may be determined at the video sequence level or may occur according to the temporal layer within a group of pictures (GOP).
Hereinafter, the term "current frame" and the term "input video" may be used interchangeably.
Fig. 8a and 8b are diagrams illustrating an S-VCARN using a single network.
The S-VCARN may use, as the single network f, a CLB-network, which is a continuous stack of convolutional layer blocks, or a DefC-network, which has a deformable convolution structure (see Non-patent Document 1).
As shown in Fig. 8a, the CLB-network may output an enhanced image x̂_{t,s} of the current frame by using a cascade structure of Residual Blocks (RBs) and convolution layers. Here, an RB is a convolution block having a skip path between its input and output, which allows the convolution block to output the residual between its input and output.
As shown in Fig. 8b, the DefC-network generates kernel offsets Δi and Δj from the input image by using an embedded deep learning model, the U-network. A sampler in the DefC-network samples the input image using the generated offsets. A convolution layer in the DefC-network generates calibrated kernels, i.e., weights, from the input image, the output feature map of the U-network, and the sampled input image. Finally, an output convolution layer in the DefC-network may output a quality-enhanced input image as the image x̂_{t,s} by applying convolution to the sampled input image using the calibrated kernels. The diagram of Fig. 8b includes the portion that generates the kernel offsets through the U-network but does not include the sampler that samples the input image, the convolution layer that generates the calibrated kernels, or the output convolution layer that generates the enhanced image.
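A minimal PyTorch sketch of a CLB-style network as described for Fig. 8a is given below: a cascade of residual blocks (convolution blocks with a skip path) followed by an output convolution, with a global skip so the network predicts a residual. The channel count, depth, and single-channel input are illustrative assumptions.

```python
# Sketch of a CLB-style S-VCARN (illustrative sizes, not the disclosed architecture).
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)          # skip path: output = input + residual

class CLBNetwork(nn.Module):
    def __init__(self, ch=64, num_blocks=8):
        super().__init__()
        self.head = nn.Conv2d(1, ch, 3, padding=1)       # luma input, for example
        self.blocks = nn.Sequential(*[ResidualBlock(ch) for _ in range(num_blocks)])
        self.tail = nn.Conv2d(ch, 1, 3, padding=1)

    def forward(self, x_t):
        feat = self.head(x_t)
        return x_t + self.tail(self.blocks(feat))        # enhanced frame x_hat_{t,s}

x_t = torch.rand(1, 1, 64, 64)           # reconstructed current frame (toy size)
print(CLBNetwork()(x_t).shape)
```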
To train such an S-VCARN with the training unit, a loss function as shown in Equation 3 may be used.
[Equation 3]
MSE = Σ(y_t − x̂_{t,s})²
Here, y_t is the target video for training, i.e., the ground truth (GT).
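A minimal training-step sketch for the loss of Equation 3 is shown below, with a small stand-in network in place of the S-VCARN f; the tensors, sizes, and optimizer settings are illustrative assumptions.

```python
# One training step with the MSE loss of Equation 3 (stand-in network, toy data).
import torch
import torch.nn as nn

model = nn.Sequential(                       # stand-in for the S-VCARN f
    nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 1, 3, padding=1),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

x_t = torch.rand(4, 1, 64, 64)               # reconstructed frames with quantization noise
y_t = torch.rand(4, 1, 64, 64)               # corresponding ground-truth frames

x_hat = model(x_t)                           # x_hat_{t,s}
loss = torch.mean((y_t - x_hat) ** 2)        # MSE of Equation 3
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(loss.item())
```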
When the difference between the reference frame and the current frame is large, VCARN using the reference frame may suffer from performance degradation. Typical factors that cause this difference between the reference frame and the current frame are the temporal distance between the two frames and the different QPs of the two frames.
In the following description, the conventional S-VCARN is denoted by the deep learning model f, and the conventional R-VCARN is denoted by the deep learning model g, as described above. The video quality enhancement devices according to some embodiments of the present disclosure present enhancements of the deep learning models f and g. In the description of Fig. 7 above, the VCARN was described in terms of the input unit (702), the QP preprocessor (704), and the denoiser (706), respectively, although it is not limited thereto. In the following description, an enhanced VCARN may be described as including all or part of the input unit (702) and the QP preprocessor (704).
III Structure and operation of S-VCARN according to the present disclosure
Fig. 9a and 9b are diagrams illustrating an S-VCARN utilizing an embedding function in accordance with at least one embodiment of the present disclosure.
In at least one embodiment, the video quality enhancement device may operate adaptively for different QPs by converting the QP value into an embedding vector and applying the embedding vector to one of the convolutional layers in the S-VCARN. In other words, the QP preprocessor (704) can convert the quantization parameter QP x corresponding to the current frame x t into an embedding vector, as shown in equations 4 and 5, and the embedding vector may be applied to the kth convolutional layer in the S-VCARN, where k is a natural number.
[ equation 4]
λ QP = e(QP x )
[ equation 5]
k t = Conv(|λ QP |·k t )
Here, e denotes an embedding function, which is learnable and can be implemented as an embedding layer and a plurality of fully connected layers. The embedding layer is an input layer that converts the quantization parameter into vector form, and the embedding function e finally generates an embedding vector corresponding to the quantization parameter. In addition, Conv() is a network containing a plurality of convolution layers, and k t represents the features of the kth convolutional layer. As shown in equation 5, the S-VCARN may update the features of the kth convolutional layer by multiplying k t by the absolute value of the embedding vector λ QP generated according to equation 4. The updated features may then be input to the (k+1)th convolution layer.
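A minimal PyTorch sketch of this QP conditioning is shown below, assuming a VVC-style QP range of 0 to 63, an embedding dimension equal to the channel count of the kth layer, and a two-layer fully connected head; these sizes are illustrative assumptions.

import torch
import torch.nn as nn

class QPEmbedding(nn.Module):
    """Embedding function e: QP -> embedding vector λ_QP (equation 4)."""
    def __init__(self, num_qp: int = 64, dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(num_qp, dim)           # input layer: QP index -> vector
        self.fc = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, qp: torch.Tensor) -> torch.Tensor:
        return self.fc(self.embed(qp))                   # λ_QP

def modulate_features(k_t: torch.Tensor, lam_qp: torch.Tensor, conv: nn.Module) -> torch.Tensor:
    """Equation 5: scale the kth-layer features channel-wise by |λ_QP|, then convolve."""
    scale = lam_qp.abs().unsqueeze(-1).unsqueeze(-1)     # (B, C, 1, 1)
    return conv(scale * k_t)

# Example usage:
# qp = torch.tensor([32])                                # decoded QP of the current frame
# lam = QPEmbedding()(qp)                                # (1, 64)
# conv = nn.Conv2d(64, 64, 3, padding=1)
# k_t = torch.rand(1, 64, 32, 32)                        # features of the kth conv layer
# k_t_updated = modulate_features(k_t, lam, conv)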
Based on equations 4 and 5, the S-VCARN including the above-described CLB-network may operate as shown in fig. 9a, so that the value of k t can be adaptively changed based on the QP value.
As another example, the S-VCARN may vary based on how the embedding vector is utilized. For the CLB-network, the S-VCARN may be changed by multiplying all features of the convolutional layers by the embedding vector λ QP . Alternatively, for the CLB-network, the S-VCARN may be modified by multiplying the features of the last layer of the last RB by the embedding vector λ QP .
As another example, for the DefC-network, the S-VCARN may be modified by multiplying all or some of the convolution layers that generate the calibrated kernels (i.e., the weights) by the embedding vector λ QP , as shown in fig. 9b. Alternatively, for the DefC-network, the S-VCARN may be modified to apply equation 5 to the calibrated kernels such that the output calibrated kernels are generated differently depending on the QP.
According to equation 4, the QP preprocessor (704) uses the QP as the input to the embedding function in order to compensate for the quantization noise level, but the input is not necessarily limited thereto. As another example, the QP preprocessor (704) may utilize one or a combination of the QP, the Lagrangian factor used to calculate rate-distortion, the temporal layer within the GOP, and the type of frame (P-frame or B-frame) as inputs to the embedding function.
Fig. 10a and 10b are diagrams illustrating S-VCARN using quantized noise estimation according to other embodiments of the present disclosure.
In another embodiment, the video quality enhancement device may adapt the features of the convolutional layers to the distortion of the input image by using Conditional Instance Normalization (CIN), a normalization module for correcting the degree of distortion, in combination with a backbone network belonging to a deep learning-based estimation model. In other words, instead of the QP, the denoiser (706) may use the estimated distortion of the input image directly for distortion correction.
As shown in figs. 10a and 10b, the QP preprocessor (704) estimates the distortion of the input image by using the estimation model. In addition, the denoiser (706) normalizes the features of the convolutional layer by using the normalization module CIN. The normalization operation of CIN can be represented by equation 6.
[ equation 6]
CIN(x) = γ · ( (x − μ(x)) / σ(x) ) + β
In equation 6, μ(x) represents the mean of x and σ(x) represents the standard deviation of x. In addition, γ and β represent learnable affine matrices. In equation 6, x may be a feature of a convolution layer, which in this embodiment may be the input image x t .
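A compact PyTorch sketch of the CIN operation of equation 6 is given below; the per-channel instance statistics and the shapes assumed for γ and β follow standard conditional instance normalization and are assumptions here.

import torch

def conditional_instance_norm(x: torch.Tensor, gamma: torch.Tensor, beta: torch.Tensor,
                              eps: float = 1e-5) -> torch.Tensor:
    """Equation 6 (sketch): CIN(x) = γ · (x − μ(x)) / σ(x) + β, per-channel statistics.

    x:     (B, C, H, W) features, or the input image itself
    gamma: (B, C, 1, 1) scale predicted from the noise map ω(x_t)
    beta:  (B, C, 1, 1) shift predicted from the noise map ω(x_t)
    """
    mu = x.mean(dim=(2, 3), keepdim=True)
    sigma = x.std(dim=(2, 3), keepdim=True)
    return gamma * (x - mu) / (sigma + eps) + beta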
The QP preprocessor (704) may generate the normalization parameters γ and β reflecting the image distortion as follows. First, the QP preprocessor (704) extracts a noise map ω(x t ) from the input image x t by using the backbone network h. Here, the backbone network may be a neural network based on a U-network. To estimate the noise map ω(x t ), a classifier is added that classifies the quantization parameter QP x of the input image, i.e., the degree of distortion caused by the QP. The backbone network performs prediction of the quantization parameter value by connecting the features in the U-network structure that have not yet undergone up-convolution to the classifier. Based on the predicted QP x , the backbone network can extract features such that an appropriate noise map ω(x t ) is generated for the input image x t . In the diagrams of figs. 10a and 10b, the classifier f C is represented by fully connected layers.
Meanwhile, as shown in equation 7, a loss function for predicting quantization parameter values may be defined by using cross entropy.
[ equation 7]
CE = −∑ c=1..C 1[c = QP x ] · log p c , where p = softmax( f C ( UNet_down(x t ) ) )
In equation 7, C is the number of classifiable quantization parameters, and UNet_down(x t ) represents the features before undergoing up-convolution in the U-network structure.
The QP preprocessor (704) extracts the normalization parameters γ and β from ω(x t ) by using additional convolution layers. The estimation model thus includes the backbone network, the classifier, and the convolution layers that generate γ and β.
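The following PyTorch sketch shows one way such an estimation model could be wired: a small encoder standing in for the U-network downsampling path, a fully connected classifier f_C over the pooled downsampled features, and convolution heads that turn the noise map into γ and β; all layer sizes and the pooling step are illustrative assumptions.

import torch
import torch.nn as nn

class QPEstimationModel(nn.Module):
    """Sketch of the estimation model: noise map ω(x_t), QP classifier f_C, and γ/β heads."""
    def __init__(self, feat: int = 32, num_qp_classes: int = 64):
        super().__init__()
        self.down = nn.Sequential(                      # stand-in for the U-network down path
            nn.Conv2d(1, feat, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat, 3, stride=2, padding=1), nn.ReLU())
        self.up = nn.Sequential(                        # stand-in for the up-convolution path
            nn.ConvTranspose2d(feat, feat, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(feat, feat, 4, stride=2, padding=1), nn.ReLU())
        self.classifier = nn.Linear(feat, num_qp_classes)        # f_C on pooled pre-up features
        self.gamma_head = nn.Conv2d(feat, 1, 3, padding=1)
        self.beta_head = nn.Conv2d(feat, 1, 3, padding=1)

    def forward(self, x_t):
        d = self.down(x_t)                              # features before up-convolution
        qp_logits = self.classifier(d.mean(dim=(2, 3))) # prediction of the quantization parameter
        noise_map = self.up(d)                          # ω(x_t)
        gamma = self.gamma_head(noise_map).mean(dim=(2, 3), keepdim=True)
        beta = self.beta_head(noise_map).mean(dim=(2, 3), keepdim=True)
        return gamma, beta, qp_logits

The predicted γ and β would then be plugged into the conditional_instance_norm sketch above, while qp_logits would feed the cross-entropy term of equation 7.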
Finally, using equation 6 with these components, the denoiser (706) may apply CIN to the input image x t , as shown in figs. 10a and 10b. The denoising model within the denoiser (706) includes the CIN and output convolution layers. For example, in the example of fig. 10a, the denoising model applies the normalization module CIN to a convolutional network that includes a skip path. That is, CIN is applied to the residual between the input image and the enhanced image. The denoising model then applies convolution and activation functions to the normalized residual to generate the enhanced image.
On the other hand, in the example of fig. 10b, the denoising model applies CIN directly to the input image and then applies convolution and activation functions to the normalized image to generate the enhanced image. In both figs. 10a and 10b, the activation functions, represented by rectified linear units (ReLU), are connected to the outputs of the convolution layers.
Meanwhile, the entire network, including the denoising model and the estimation model, is trained end-to-end, and the loss function may be expressed as shown in equation 8.
[ equation 8]
Loss = MSE + α·CE
In equation 8, MSE is the loss associated with image enhancement by the estimation model and the denoising model, and CE is the loss associated with the classifier's prediction of the quantization parameter, as shown in equation 7. Furthermore, α is a hyper-parameter that controls the weighting ratio between MSE and CE.
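A minimal sketch of this end-to-end training objective, reusing the hypothetical modules above and PyTorch's built-in losses, might look as follows (the value of α is an assumption).

import torch.nn.functional as F

def total_loss(enhanced, target, qp_logits, qp_label, alpha: float = 0.1):
    """Equation 8 (sketch): Loss = MSE + α·CE."""
    mse = F.mse_loss(enhanced, target)                  # image-enhancement loss
    ce = F.cross_entropy(qp_logits, qp_label)           # QP-classification loss (equation 7)
    return mse + alpha * ce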
As described above, when an enhanced frame is generated from a single current frame by using the conventional S-VCARN, the prediction result may differ depending on the QP due to the domain shift problem. To mitigate this problem, the input frame and the enhanced frame may be blended adaptively to the QP.
Fig. 11 is a diagram illustrating an S-VCARN using a mask map according to still another embodiment of the present disclosure.
In yet another embodiment, as shown in fig. 11, the S-VCARN may include a plurality of convolution layers that take x t and x hat,t,s as inputs. Here, x t is the reconstructed frame that has been decoded, and x hat,t,s is the frame generated by the conventional S-VCARN f.
As shown in equation 9, by using the embedding vector λ QP of the QP and a residual network CNN(·), the S-VCARN can generate an enhanced residual signal and add it to x hat,t,s to generate the final enhanced frame x hat,t .
[ equation 9]
x hat,t = |λ QP |·CNN([x t , x hat,t,s ]) + x hat,t,s
Alternatively, as shown in equation 10, the S-VCARN may calculate a mask map m t , and the mask map may then be used to select the regions to reflect from each of the input image and the enhanced frame.
[ equation 10]
m t = |λ QP |·CNN([x t , x hat,t,s ])
x hat,t = m t ·x t + (1 − m t )·x hat,t,s
The reason for using the mask map is that, as described above, even if the input image passes through the conventional S-VCARN f, the result is not necessarily the best enhancement signal. In the example of fig. 11, the network represented by the convolution layers may perform the processing shown in equations 9 and 10. In this case, the lower the QP, the more the input image x t is reflected in the final enhanced signal x hat,t , and the higher the QP, the more the frame x hat,t,s enhanced by the conventional S-VCARN f is reflected.
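A short PyTorch sketch of the QP-adaptive blending of equations 9 and 10 follows; the small CNN, the sigmoid and scalar summary of |λ_QP| used to keep the mask in [0, 1], and the channel sizes are illustrative assumptions.

import torch
import torch.nn as nn

class MaskBlend(nn.Module):
    """Equation 10 (sketch): blend x_t and x_hat,t,s with a QP-adaptive mask map m_t."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(2, dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(dim, 1, 3, padding=1), nn.Sigmoid())   # sigmoid keeps the raw mask in [0, 1] (assumption)

    def forward(self, x_t, x_hat_t_s, lam_qp):
        # |λ_QP| is summarized to a scalar here as a simplification of the elementwise product.
        m_t = lam_qp.abs().mean() * self.cnn(torch.cat([x_t, x_hat_t_s], dim=1))
        m_t = m_t.clamp(0.0, 1.0)                            # guard the blend weights
        return m_t * x_t + (1.0 - m_t) * x_hat_t_s           # final enhanced frame x_hat,t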
Structure and operation of R-VCARN according to the present disclosure
A method of improving the performance of the conventional R-VCARN g according to the present disclosure is described below. First, the input unit (702) may select the reference frame x r used by the R-VCARN as follows (see the sketch after this list).
The frame with the lowest temporal layer in the reference list may be selected as reference frame x r
Alternatively, the frame with the lowest QP in the reference list may be selected as reference frame x r
Alternatively, the frame having the minimum picture order count (POC) difference from the current frame in the reference list may be selected as the reference frame x r .
Alternatively, the reference frame x r may be selected by using an algorithm for selecting a Peak Quality Frame (PQF) (see non-patent document 2).
If there is more than one reference frame satisfying the aforementioned condition, a frame earlier in display order may be selected as reference frame x r
Alternatively, if there is more than one reference frame satisfying the aforementioned conditions, all frames satisfying these conditions may be selected as reference frame x r
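As an illustration of these selection rules, the following Python sketch picks a reference frame from a reference list of hypothetical frame records; the temporal_layer, qp, and poc fields and the tie-break by display order are assumptions about how the metadata might be exposed, not an interface defined by this disclosure.

from typing import List, Dict

def select_reference_frame(ref_list: List[Dict], current_poc: int, rule: str = "lowest_qp") -> Dict:
    """Pick x_r from the reference list according to one of the rules above (sketch)."""
    if rule == "lowest_temporal_layer":
        key = lambda f: (f["temporal_layer"], f["poc"])          # tie-break: earlier display order
    elif rule == "lowest_qp":
        key = lambda f: (f["qp"], f["poc"])
    elif rule == "closest_poc":
        key = lambda f: (abs(f["poc"] - current_poc), f["poc"])
    else:
        raise ValueError(f"unknown rule: {rule}")
    return min(ref_list, key=key)

# Example usage with hypothetical metadata:
# refs = [{"poc": 8, "qp": 30, "temporal_layer": 1},
#         {"poc": 12, "qp": 35, "temporal_layer": 2}]
# x_r = select_reference_frame(refs, current_poc=13, rule="closest_poc")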
Fig. 12a and 12b are diagrams illustrating offsets in a reference frame in accordance with at least one embodiment of the present disclosure.
The R-VCARN then shifts the selected reference frame to be similar to the current frame. In one example, the similar frame may be generated by shifting the reference frame in the pixel domain, as shown in equation 11.
[ equation 11]
x hat,r→t =warp(x r ;x t )
Here, warping may be performed using an optical flow calculated based on the reference frame, or warping may be performed using a DefC-network as described above.
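A small PyTorch sketch of pixel-domain warping with a dense optical flow field is given below; the flow is assumed to be given in pixel units and oriented from the current frame toward the reference frame, which is one common convention but an assumption here.

import torch
import torch.nn.functional as F

def warp(x_r: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Equation 11 (sketch): warp the reference frame x_r toward the current frame.

    x_r:  (B, C, H, W) reference frame
    flow: (B, 2, H, W) dense flow in pixels (dx, dy), assumed current -> reference
    """
    b, _, h, w = x_r.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=x_r.device),
                            torch.arange(w, device=x_r.device), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).float().unsqueeze(0)      # (1, H, W, 2) identity grid
    grid = grid + flow.permute(0, 2, 3, 1)                         # displaced sampling locations
    grid[..., 0] = 2 * grid[..., 0] / (w - 1) - 1                  # normalize to [-1, 1]
    grid[..., 1] = 2 * grid[..., 1] / (h - 1) - 1
    return F.grid_sample(x_r, grid, align_corners=True)            # x_hat,r->t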
As another example, as shown in equation 12, the similar frame may be generated by shifting the reference frame in the feature domain.
[ equation 12]
x hat,r→t =warp(ConvNet(x r );ConvNet(x t ))
Here, convNet () is a network for extracting features from a reference frame or a current frame. As shown in fig. 12a, by calculating and then using the optical flow on each channel of the feature, warping can be performed on each channel of the extracted feature. Alternatively, as illustrated in fig. 12b, warping may be performed by selecting and shifting the vector most similar to the spatial portion. That is, the vector may be selected and moved based on coordinates in space. On the other hand, warping based on displacement in the spatial portion may be performed using a texture transformer (see non-patent document 3).
By combining the above-described reference frame selection method and reference frame shift method, an enhanced R-VCARN can be generated.
Fig. 13a and 13b are diagrams illustrating an R-VCARN in accordance with at least one embodiment of the present disclosure.
In the example of fig. 13a, the R-VCARN selects the reference frame x r and predicts the optical flow such that x r becomes similar to x t . The R-VCARN can use the optical flow to shift x r in the pixel domain to generate the similar frame x hat,r→t , and can then use x hat,r→t and x t as inputs to generate the enhanced frame x hat,t,r . At this time, the R-VCARN may combine the frame x t and the frame x hat,t,r using a convolution layer. Meanwhile, the R-VCARN may be trained by the training unit so that x hat,t,r becomes similar to the ground truth (GT) y t .
On the other hand, in the example of fig. 13b, the R-VCARN selects and shifts the reference frame x r in the feature domain. In other words, after extracting the features corresponding to x t and x r , the R-VCARN calculates the relationship between the two sets of features and recombines the features of x r to be similar to the features of x t .
To recombine the features of x r , the R-VCARN may be implemented as a texture transformer, as shown in the example of fig. 13b. The texture transformer can recombine the features of x r by using an attention function. The attention function takes Q, K, and V as inputs, which represent a query matrix, a key matrix, and a value matrix, respectively. By inputting the features of the current frame x t into Q and the features of the reference frame x r into K and V, and then calculating the attention function, the R-VCARN can compute the relationship between the features of x t and the features of x r and thereby recombine the features of x r .
The R-VCARN can then combine the recombined features with the features of x t to generate the enhanced frame x hat,t,r . In this process, the R-VCARN may combine the frame x t and the frame x hat,t,r . Meanwhile, the training unit can train the R-VCARN so that x hat,t,r becomes similar to the GT y t .
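The attention step can be sketched in PyTorch as below, with Q taken from the current-frame features and K, V from the reference-frame features; the single-head scaled dot-product form and the flattening of spatial positions into tokens are assumptions for illustration, and full attention over all positions is shown only for clarity.

import torch

def reassemble_reference_features(feat_t: torch.Tensor, feat_r: torch.Tensor) -> torch.Tensor:
    """Attention-based recombination (sketch): Q from x_t features, K and V from x_r features.

    feat_t, feat_r: (B, C, H, W) feature maps of the current and reference frames.
    Returns reference features rearranged to resemble the current-frame layout.
    """
    b, c, h, w = feat_t.shape
    q = feat_t.flatten(2).transpose(1, 2)                  # (B, H*W, C) queries
    k = feat_r.flatten(2).transpose(1, 2)                  # (B, H*W, C) keys
    v = feat_r.flatten(2).transpose(1, 2)                  # (B, H*W, C) values
    attn = torch.softmax(q @ k.transpose(1, 2) / c ** 0.5, dim=-1)   # (B, H*W, H*W)
    out = attn @ v                                         # recombined reference features
    return out.transpose(1, 2).reshape(b, c, h, w)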
In another embodiment, R-VCARN may be adaptively trained to reflect QP values. In this case, in order to reflect distortion caused by QP in R-VCARN, a method applied to S-VCARN, such as using an embedding function according to equation 4, using CIN according to equation 6, and using a mask map according to equation 10, may be utilized.
FIG. 14 is a diagram illustrating an R-VCARN utilizing an embedded function according to another embodiment of the present disclosure.
For example, as shown in fig. 14, R-VCARN that utilizes shifting in the pixel domain may reflect QP values by using an embedding function according to equation 5. In other words, the R-VCARN may input the embedded vector generated from the QP value into any convolutional layer in the network that performs loop filtering.
In another embodiment, the combined VCARN may be implemented by combining S-VCARN and R-VCARN.
FIG. 15 is a schematic diagram of a combined VCARN combining S-VCARN and R-VCARN.
As shown in fig. 15, the combined VCARN may combine the frame x hat,t,s predicted using the S-VCARN and the frame x hat,t,r predicted using the R-VCARN to generate the final frame x hat,t . To combine the frame x hat,t,s and the frame x hat,t,r , the combined VCARN may use several convolution layers or a mask.
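A minimal fusion sketch follows, combining the two predictions with a few convolution layers that output a blending mask; the mask-based form and the layer sizes are assumptions, since fig. 15 allows either plain convolution layers or a mask.

import torch
import torch.nn as nn

class CombinedVCARN(nn.Module):
    """Sketch: fuse the S-VCARN prediction and the R-VCARN prediction into x_hat,t."""
    def __init__(self, dim: int = 32):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(2, dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(dim, 1, 3, padding=1), nn.Sigmoid())   # per-pixel blending mask

    def forward(self, x_hat_t_s, x_hat_t_r):
        m = self.fuse(torch.cat([x_hat_t_s, x_hat_t_r], dim=1))
        return m * x_hat_t_s + (1.0 - m) * x_hat_t_r         # final frame x_hat,t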
The following is a description of a video quality enhancement method performed by the video quality enhancement apparatus, with reference to fig. 16 and 17.
As described above, the video quality enhancement method can be performed by the loop filter unit (180) in the video encoding apparatus and the loop filter unit (560) in the video decoding apparatus.
Fig. 16 is a flow chart of a video quality enhancement method utilizing S-VCARN in accordance with at least one embodiment of the present disclosure.
The video quality enhancement device obtains the reconstructed current frame and the decoded quantization parameter (S1600). Here, the current frame may be a P frame or a B frame reconstructed based on inter prediction of the video encoding apparatus.
The video quality enhancement apparatus calculates an embedding vector from quantization parameters by using an embedding function based on deep learning, or estimates image distortion based on quantization parameters by using an estimation model based on deep learning (S1602).
The embedding function includes an embedding layer and a plurality of fully connected layers. The embedding layer is an input layer for converting the quantization parameter into a vector form, and the embedding function finally generates an embedding vector corresponding to the quantization parameter.
Further, the embedding function may take as input one or a combination of quantization parameters, lagrangian factors for calculating rate distortion, temporal layers within the GOP, and the type of frame (P-frame or B-frame).
The estimation model may include a U-network for extracting a noise map from the current frame, a classifier for predicting the quantization parameter from the features before undergoing up-convolution in the U-network structure, and a convolution layer for extracting normalization parameters representing the image distortion from the noise map.
The video quality enhancement apparatus generates an enhanced frame by removing quantization noise from the current frame using a deep learning-based denoising model (S1604).
In one example, the denoising model (which is S-VCARN) is a CLB-network that includes a concatenation of RBs and convolutional layers, and the concatenation may be used to generate enhancement frames. Each RB is a convolutional block with a skip path between its input and output. This denoising model may be changed by multiplying the features generated by the current one of the convolutional layers by the absolute value of the embedded vector to generate an enhancement frame. Alternatively, the denoising model may be changed by multiplying each feature of the convolutional layer by a common absolute value of the embedded vector. Alternatively, the denoising model may be changed by multiplying the last layer of the last RB by the absolute value of the embedded vector.
As another example, the denoising model may be a DefC-network that includes a convolution layer for generating the calibration kernel. Such a denoising model may be changed by multiplying the features generated by a preset one of the convolution layers by the absolute value of the embedded vector. Alternatively, the denoising model may be changed by multiplying the calibration kernel by the absolute value of the embedded vector.
When an estimation model is used, the denoising model may include: a normalization module for normalizing the current frame by using the normalization parameter; and an output convolution layer for generating an enhancement frame from the normalized current frame. The estimation model and the denoising model may be trained end-to-end. This end-to-end trained loss function can be expressed as the sum of (1) the loss of the estimation model and the denoising model used to estimate the enhancement frame and (2) the loss of the classifier used to predict the quantization parameter, as shown in equation 8.
As another example, the denoising model may further include a convolution layer, and the current frame may be mixed with the enhanced frame adaptively to the quantization parameter by using the convolution layer. For example, the absolute value of the embedding vector and the convolution layer may be used to generate a residual signal between the current frame and the enhanced frame, and the denoising model may then add the residual signal to the enhanced frame to generate the final enhanced frame.
Alternatively, after calculating the mask map of the current frame and the enhanced frame using the absolute value of the embedded vector and the convolution layer, the denoising model may use the mask map to combine the current frame and the enhanced frame.
Fig. 17 is a flow chart of a video quality enhancement method utilizing R-VCARN in accordance with at least one embodiment of the present disclosure.
The video quality enhancement device obtains the current frame and the decoded quantization parameter (S1700). Here, the current frame may be a P frame or a B frame that has been reconstructed according to inter prediction of the video encoding apparatus.
The video quality enhancement device selects a reference frame from the reference list (S1702). The video quality enhancement device may select the frame with the lowest temporal layer in the reference list as the reference frame or may select the frame with the lowest quantization parameter in the reference list as the reference frame.
The video quality enhancement apparatus calculates an embedding vector of quantization parameters by using an embedding function based on deep learning (S1704).
As described above, the embedding function includes an embedding layer and a plurality of fully connected layers. The embedding layer is an input layer for converting the quantization parameter into a vector form, and the embedding function finally generates an embedding vector corresponding to the quantization parameter.
The video quality enhancement device generates an enhancement frame by generating a similar frame from the reference frame using the deep learning-based denoising model and then using the current frame and the similar frame (S1706).
In one example, the denoising model shifts the reference frame in the pixel domain. The denoising model may predict optical flow from a reference frame, and may use the optical flow to generate a similar frame from the reference frame.
As another example, the denoising model may shift the reference frame in the feature domain. The denoising model may extract features of the current frame and the reference frame, respectively, and may use the features of the current frame and the features of the reference frame to recombine the features of the reference frame in a feature domain. The denoising model may combine the recombined features of the reference frame with features of the current frame to generate a similar frame.
The video quality enhancement device may utilize the embedded vectors in generating the enhancement frames. For example, the denoising model may be modified by multiplying features generated by a preset convolution layer of convolution layers in the denoising model by the embedded vector.
Although the steps in the various flowcharts are described as being performed sequentially, these steps merely exemplify the technical concepts of some embodiments of the present disclosure. Accordingly, one of ordinary skill in the art to which the present disclosure pertains may perform the steps by changing the order depicted in the various figures or by performing more than two steps in parallel. Therefore, the steps in the respective flowcharts are not limited to the time series order shown.
It should be understood that the above description presents illustrative embodiments that may be implemented in various other ways. The functionality described in some embodiments may be implemented by hardware, software, firmware, and/or combinations thereof. It should also be appreciated that the functional components described in this specification are labeled with a "… unit" to strongly emphasize their independent implementation possibilities.
Meanwhile, various methods or functions described in some embodiments may be implemented as instructions stored in a non-transitory recording medium that can be read and executed by one or more processors. For example, the non-transitory recording medium may include various types of recording devices in which data is stored in a form readable by a computer system. For example, the non-transitory recording medium may include a storage medium such as an erasable programmable read-only memory (EPROM), a flash memory drive, an optical disk drive, a magnetic hard disk drive, a Solid State Drive (SSD), and the like.
Although embodiments of the present disclosure have been described for illustrative purposes, those skilled in the art to which the present disclosure pertains will appreciate that various modifications, additions and substitutions are possible, without departing from the scope and spirit of the present disclosure. Accordingly, embodiments of the present disclosure have been described for brevity and clarity. The scope of the technical idea of the embodiments of the present disclosure is not limited by the drawings. Thus, it will be understood by those of ordinary skill in the art to which this disclosure pertains that the scope of this disclosure should not be limited by the embodiments explicitly described above, but rather by the claims and their equivalents.
(reference numerals)
180. Loop filter unit
560. Loop filter unit
702. Input unit
704 QP preprocessor
706. Denoising device
Cross Reference to Related Applications
The present application claims priority from korean patent application No. 10-2021-0042090, filed on 3 months of 2021, 31, and korean patent application No. 10-2022-0036249, filed on 3 months of 2022, 23, the respective disclosures of which are incorporated herein by reference in their entireties.

Claims (17)

1. An apparatus for video quality enhancement, comprising:
an input unit configured to obtain a current frame that has been reconstructed and quantization parameters that have been decoded;
a quantization parameter pre-processor configured to calculate an embedding vector from the quantization parameter by using a depth-learning-based embedding function, or to estimate image distortion due to the quantization parameter by using a depth-learning-based estimation model; and
a denoising configured to generate an enhanced frame by removing quantization noise from the current frame using a deep learning-based denoising model,
wherein the denoising model utilizes the calculated embedded vector or the estimated image distortion to generate the enhancement frame.
2. The device of claim 1, wherein the current frame is a P-frame (predicted frame) or a B-frame (bi-predicted frame) reconstructed by a video encoding device from inter-frame prediction.
3. The device of claim 1, wherein the embedding function comprises:
an embedded layer and a plurality of fully connected layers.
4. The device of claim 1, wherein the embedding function takes as input all or part of the quantization parameter, a lagrangian factor for calculating rate distortion, a temporal layer of the current frame, a type of the current frame, or any combination thereof.
5. The apparatus of claim 1, wherein,
the denoising model includes a cascade structure of RBs (residual blocks) and convolutional layers and generates the enhanced frame using the cascade structure, and
each RB is a convolutional block with a skip path between input and output.
6. The apparatus of claim 5, wherein the denoiser is configured to multiply a feature generated by a preset one of the convolutional layers by an absolute value of the embedded vector.
7. The apparatus of claim 1, wherein the denoising model comprises:
a U-network, the U-network being a deep learning model configured to generate an offset of a kernel from the current frame;
a sampler configured to sample the current frame by using the offset;
A convolution layer configured to generate a calibration kernel from an input image, an output feature map of the U-network, and a sampled current frame; and
an output convolution layer configured to apply a convolution to a current frame of the samples using the calibration kernel to generate the enhancement frame.
8. The apparatus of claim 7, wherein the denoiser is configured to multiply the calibration kernel by an absolute value of the embedded vector.
9. The apparatus of claim 1, wherein the estimation model comprises:
a U-network configured to extract a noise figure from the current frame;
a classifier configured to predict the quantization parameter from a feature prior to deconvolution in the U-network structure; and
a convolution layer configured to extract from the noise figure a normalized parameter representative of the image distortion.
10. The apparatus of claim 9, wherein the denoising model comprises:
a normalization module configured to provide normalization of the current frame with the normalization parameters; and
an output convolution layer configured to generate the enhancement frame from the normalized current frame.
11. The apparatus of claim 9, wherein,
the estimation model and the denoising model undergo end-to-end training, and
The end-to-end trained loss function is expressed as a sum of the loss of the estimation model and the denoising model for estimating the enhancement frame and the loss of the classifier for estimating the quantization parameter.
12. The apparatus of claim 1, wherein,
the denoising model further includes a combined convolution layer,
the denoising model generates a residual signal between the current frame and the enhanced frame by using the absolute value of the embedded vector and the combined convolution layer, and
the denoising model sums the residual signal with the enhancement frame.
13. A method performed by a computing device to enhance image quality of a current frame, the method comprising:
obtaining the reconstructed current frame and the decoded quantization parameter;
calculating an embedding vector from the quantization parameter by using a deep learning-based embedding function, or estimating image distortion due to the quantization parameter by using a deep learning-based estimation model; and
an enhanced frame is generated by removing quantization noise from the current frame using a deep learning based denoising model,
wherein generating the enhancement frame includes causing the denoising model to utilize the calculated embedded vector or the estimated image distortion.
14. The method of claim 13, wherein obtaining the current frame and the quantization parameter comprises:
a P frame (predicted frame) or a B frame (bi-directionally predicted frame) reconstructed from inter-frame prediction of a video encoding apparatus is obtained as the current frame.
15. The method of claim 13, wherein the embedding function comprises:
an embedded layer and a plurality of fully connected layers.
16. The method of claim 13, wherein,
the denoising model includes a cascade structure of RBs (residual blocks) and convolutional layers and generates the enhanced frame using the cascade structure, and
each RB is a convolutional block with a skip path between input and output.
17. The method of claim 16, wherein generating the enhancement frame comprises:
multiplying the features generated by a preset convolution layer in the convolution layers by the absolute value of the embedded vector.
CN202280022083.7A 2021-03-31 2022-03-24 Video encoding method and apparatus using deep learning based loop filter for inter prediction Pending CN117099372A (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
KR10-2021-0042090 2021-03-31
KR1020220036249A KR20220136163A (en) 2021-03-31 2022-03-23 Video Coding Method and Apparatus Using Deep Learning Based In-loop Filter for Inter Prediction
KR10-2022-0036249 2022-03-23
PCT/KR2022/004171 WO2022211375A1 (en) 2021-03-31 2022-03-24 Video coding method and device using deep learning-based in-loop filter for inter-prediction

Publications (1)

Publication Number Publication Date
CN117099372A true CN117099372A (en) 2023-11-21

Family

ID=88770301

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202280022083.7A Pending CN117099372A (en) 2021-03-31 2022-03-24 Video encoding method and apparatus using deep learning based loop filter for inter prediction

Country Status (1)

Country Link
CN (1) CN117099372A (en)

Similar Documents

Publication Publication Date Title
CN117221573A (en) Image decoding method, image encoding method, and method of transmitting bit stream
US20240283918A1 (en) Video encoding and decoding using intra block copy
US20240015308A1 (en) Apparatus and method for video encoding and decoding
US20240031580A1 (en) Method and apparatus for video coding using deep learning based in-loop filter for inter prediction
CN116472709A (en) Apparatus and method for video encoding and decoding
US20230396795A1 (en) Inter prediction-based video encoding and decoding
US20230239462A1 (en) Inter prediction method based on variable coefficient deep learning
US20230291932A1 (en) Method and apparatus for intra prediction using block copy based on geometric transform
US20230269399A1 (en) Video encoding and decoding using deep learning based in-loop filter
KR20220136163A (en) Video Coding Method and Apparatus Using Deep Learning Based In-loop Filter for Inter Prediction
CN116941241A (en) Video encoding and decoding method and apparatus using matrix-based cross component prediction
CN116636211A (en) Method and apparatus for encoding video using block merging
CN117099372A (en) Video encoding method and apparatus using deep learning based loop filter for inter prediction
US20240179324A1 (en) Method and apparatus for video coding using an improved in-loop filter
US20230291926A1 (en) Video encoding and decoding using deep learning based inter prediction
US20230283768A1 (en) Method for predicting quantization parameter used in a video encoding/decoding apparatus
US20230291914A1 (en) Method and apparatus for generating residual signals using reference between components
US20230412811A1 (en) Method and apparatus for video coding using spiral scan order
US20230421752A1 (en) Method and apparatus for video coding using matrix based cross-component prediction
US20230421753A1 (en) Method and apparatus for video coding based on mapping
US20230308662A1 (en) Method and apparatus for video coding using block merging
US20230300325A1 (en) Video coding method and apparatus using intra prediction
US20240114131A1 (en) Video encoding/decoding method and apparatus
US20240179303A1 (en) Video encoding/decoding method and apparatus
US20230388494A1 (en) Method for generating prediction block by using weighted-sum of intra prediction signal and inter prediction signal, and device using same

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination