CN117044211A - Video encoding method and apparatus using pre-processing and post-processing


Info

Publication number
CN117044211A
Authority
CN
China
Prior art keywords
video
perceptual
model
restored
quality
Prior art date
Legal status
Pending
Application number
CN202280023124.4A
Other languages
Chinese (zh)
Inventor
安镕照
李钟石
朴胜煜
Current Assignee
Hyundai Motor Co
Kia Corp
DigitalInsights Inc
Original Assignee
Hyundai Motor Co
Kia Corp
DigitalInsights Inc
Priority date
Filing date
Publication date
Priority claimed from KR1020220038959A external-priority patent/KR20220137552A/en
Application filed by Hyundai Motor Co, Kia Corp, DigitalInsights Inc filed Critical Hyundai Motor Co
Priority claimed from PCT/KR2022/004508 external-priority patent/WO2022211490A1/en
Publication of CN117044211A publication Critical patent/CN117044211A/en


Abstract

The present disclosure relates to a video encoding method and apparatus using preprocessing and post-processing. The present embodiment provides a video encoding method and apparatus that model a noise model or a subjective image quality model of a current video, perform preprocessing and post-processing on an image based on the respective model, and signal information of the respective model.

Description

Video encoding method and apparatus using pre-processing and post-processing
Technical Field
The present disclosure relates to video encoding methods and apparatus using pre-processing and post-processing.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Because video data is much larger than audio data or still-image data, storing or transmitting video data without compression requires a large amount of hardware resources, including memory.
Thus, encoders are commonly used to compress video data before it is stored or transmitted. The decoder receives the compressed video data, decompresses it, and plays the decompressed video data. Video compression techniques include H.264/AVC, High Efficiency Video Coding (HEVC), and Versatile Video Coding (VVC), which improves coding efficiency by about 30% or more over HEVC.
However, as picture size, resolution, and frame rate gradually increase, the amount of data to be encoded also increases. Thus, new compression techniques are needed that provide higher coding efficiency and better picture quality than existing compression techniques.
For a video encoding method and apparatus, a series of processes applied to an input video before the input video is encoded by the encoder is defined as preprocessing. Likewise, a series of processes applied to a restored video before the restored video is stored or displayed by the decoder is defined as post-processing. Meanwhile, when the input video contains noise or has an excessively high resolution, coding efficiency may be reduced or video quality may deteriorate from the standpoint of the encoder and decoder. Therefore, efficient preprocessing and post-processing need to be considered in order to improve coding efficiency and video quality.
Disclosure of Invention
Technical problem
The present disclosure seeks to provide a video coding method and apparatus that model a noise model or a subjective video quality model of a current video, pre-process and post-process the video based on the model, and signal information of the model.
Technical proposal
At least one aspect of the present disclosure provides a video decoding method performed by a video decoding apparatus. The video decoding method includes generating a restored video by decoding a bitstream and decoding parameters of a perceptual model from the bitstream. The perceptual model is a model that reflects perceptual visual characteristics in terms of perceptual quality. The video decoding method further includes generating an enhanced video by post-processing the restored video using the parameters of the perceptual model. The video decoding method further includes generating a final restored video using the enhanced video and the restored video.
Another aspect of the present invention provides a video decoding apparatus. The video decoding device includes a decoder configured to decode the bitstream to generate a restored video. The decoder is configured to decode parameters of the perceptual model-based video quality enhancement method and the perceptual model from the bitstream. The perceptual model is a model reflecting perceptual visual properties in terms of perceptual quality. The video decoding device further comprises a perceptual quality enhancer configured to post-process the restored video using a perceptual model-based video quality enhancement method to generate an enhanced video. The video decoding apparatus further includes an adder configured to generate a final restored video using the enhanced video and the restored video.
Yet another aspect of the present invention provides a video encoding method performed by a video encoding apparatus. The video encoding method includes determining, by analyzing an input video in terms of perceptual quality, elements that are removable according to a perceptual model. The perceptual model is a model that reflects perceptual visual characteristics in terms of perceptual quality. The video encoding method further includes estimating parameters of the perceptual model. The video encoding method further includes preprocessing the input video by removing the removable elements from the input video using the parameters of the perceptual model. The video encoding method further includes generating a bitstream by encoding the preprocessed input video. The video encoding method further includes encoding the parameters of the perceptual model and combining the encoded parameters with the bitstream.
Advantageous effects
As described above, the present disclosure provides a video encoding method and apparatus for modeling a noise model or a subjective video quality model of a current video. The video encoding method and apparatus may pre-process and post-process the video based on the model and may signal information of the model to improve coding efficiency and enhance video quality.
Drawings
Fig. 1 is a block diagram of a video encoding apparatus that may implement the techniques of this disclosure.
Fig. 2 illustrates a method for partitioning a block using a quadtree plus binary tree plus ternary tree (QTBTTT) structure.
Fig. 3a and 3b illustrate a plurality of intra prediction modes including a wide-angle intra prediction mode.
Fig. 4 shows neighboring blocks of the current block.
Fig. 5 is a block diagram of a video decoding apparatus that may implement the techniques of this disclosure.
Fig. 6 is an explanatory diagram showing a video encoding apparatus and a video decoding apparatus including preprocessing and post-processing according to an embodiment of the present disclosure.
Fig. 7 is a schematic diagram illustrating a bilateral filter for preprocessing and post-processing according to an embodiment of the present disclosure.
Fig. 8 is a schematic diagram illustrating a video encoding apparatus and a video decoding apparatus including preprocessing and post-processing according to another embodiment of the present disclosure.
Fig. 9 is a schematic diagram illustrating a video encoding apparatus and a video decoding apparatus including preprocessing and post-processing according to another embodiment of the present disclosure.
Fig. 10 is a schematic diagram illustrating a video encoding method including preprocessing according to an embodiment of the present disclosure.
Fig. 11 is an explanatory diagram showing a video decoding method including post-processing according to an embodiment of the present disclosure.
Fig. 12 is a schematic diagram illustrating a video encoding method including preprocessing according to another embodiment of the present disclosure.
Fig. 13 is an explanatory diagram showing a video decoding method including post-processing according to another embodiment of the present disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure are described in detail with reference to the accompanying drawings. In the following description, like reference numerals denote like elements, although these elements may be shown in different drawings. Furthermore, in the following description of some embodiments, detailed descriptions of related known components and functions may be omitted for clarity and conciseness when it may be considered to obscure the subject matter of the present disclosure.
Fig. 1 is a block diagram of a video encoding device in which the techniques of this disclosure may be implemented. Hereinafter, a video encoding apparatus and components of the apparatus are described with reference to the diagram of fig. 1.
The encoding apparatus may include a picture divider 110, a predictor 120, a subtractor 130, a transformer 140, a quantizer 145, a rearrangement unit 150, an entropy encoder 155, an inverse quantizer 160, an inverse transformer 165, an adder 170, a loop filter unit 180, and a memory 190.
Each component of the encoding apparatus may be implemented as hardware or software or as a combination of hardware and software. Further, the function of each component may be implemented as software, and a microprocessor may also be implemented to execute the function of the software corresponding to each component.
A video is made up of one or more sequences comprising a plurality of pictures. Each picture is divided into a plurality of regions, and encoding is performed on each region. For example, a picture is partitioned into one or more tiles and/or slices. Here, one or more tiles may be defined as a tile group. Each tile and/or slice is partitioned into one or more Coding Tree Units (CTUs). In addition, each CTU is partitioned into one or more Coding Units (CUs) according to a tree structure. Information applied to each CU is encoded as a syntax of the CU, and information commonly applied to CUs included in one CTU is encoded as a syntax of the CTU. Further, information commonly applied to all blocks in one slice is encoded as a syntax of a slice header, and information applied to all blocks constituting one or more pictures is encoded in a Picture Parameter Set (PPS) or a picture header. Furthermore, information commonly referred to by a plurality of pictures is encoded in a Sequence Parameter Set (SPS). In addition, information commonly referenced by one or more SPSs is encoded in a Video Parameter Set (VPS). Further, information commonly applied to one tile or tile group may also be encoded as a syntax of a tile group header or tile header. The syntax included in the SPS, the PPS, the slice header, the tile header, or the tile group header may be referred to as high-level syntax.
The picture divider 110 determines the size of a Coding Tree Unit (CTU). Information about the size of the CTU (CTU size) is encoded as a syntax of the SPS or PPS and delivered to the video decoding device.
The picture divider 110 divides each picture constituting the video into a plurality of Coding Tree Units (CTUs) having a predetermined size and then recursively divides the CTUs using a tree structure. Leaf nodes in the tree structure become Coding Units (CUs), which are the basic units of encoding.
The tree structure may be a Quadtree (QT) in which a higher node (or parent node) is partitioned into four lower nodes (or child nodes) of the same size. The tree structure may also be a Binary Tree (BT) in which a higher node is split into two lower nodes. The tree structure may also be a Ternary Tree (TT) in which a higher node is split into three lower nodes at a ratio of 1:2:1. The tree structure may also be a structure in which two or more of the QT, BT, and TT structures are mixed. For example, a quadtree plus binary tree (QTBT) structure may be used, or a quadtree plus binary tree plus ternary tree (QTBTTT) structure may be used. Here, the combination of BT and TT may be referred to as a multi-type tree (MTT).
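The three split types can be pictured with a small sketch. The following Python snippet is an illustrative sketch only, not part of the patent; the function and variable names are hypothetical. It shows how a parent block is divided into child blocks under QT, BT, and TT splitting, with the TT split using the 1:2:1 ratio described above.

```python
# Illustrative sketch of QT/BT/TT block splitting (hypothetical helper, not codec code).
# A block is represented as (x, y, width, height).

def split_block(block, split_type, direction=None):
    x, y, w, h = block
    if split_type == "QT":      # four equal quadrants
        hw, hh = w // 2, h // 2
        return [(x, y, hw, hh), (x + hw, y, hw, hh),
                (x, y + hh, hw, hh), (x + hw, y + hh, hw, hh)]
    if split_type == "BT":      # two equal halves, vertical or horizontal
        if direction == "vertical":
            return [(x, y, w // 2, h), (x + w // 2, y, w // 2, h)]
        return [(x, y, w, h // 2), (x, y + h // 2, w, h // 2)]
    if split_type == "TT":      # three parts at a 1:2:1 ratio
        if direction == "vertical":
            q = w // 4
            return [(x, y, q, h), (x + q, y, 2 * q, h), (x + 3 * q, y, q, h)]
        q = h // 4
        return [(x, y, w, q), (x, y + q, w, 2 * q), (x, y + 3 * q, w, q)]
    return [block]              # no split: the block is a leaf CU

# Example: a 128x128 CTU split by QT, then one quadrant split by a vertical TT.
ctu = (0, 0, 128, 128)
quadrants = split_block(ctu, "QT")
print(split_block(quadrants[0], "TT", "vertical"))
# [(0, 0, 16, 64), (16, 0, 32, 64), (48, 0, 16, 64)]
```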
Fig. 2 is a diagram for describing a method of dividing a block by using the QTBTTT structure.
As shown in fig. 2, CTUs may be first partitioned into QT structures. Quadtree partitioning may be recursive until the size of the partitioned block reaches the minimum block size (MinQTSize) of leaf nodes allowed in QT. A first flag (qt_split_flag) indicating whether each node of the QT structure is partitioned into four nodes of a lower layer is encoded by the entropy encoder 155 and signaled to the video decoding apparatus. When the leaf node of QT is not greater than the maximum block size (MaxBTSize) of the root node allowed in BT, the leaf node may be further divided into at least one of BT structure or TT structure. There may be multiple directions of segmentation in the BT structure and/or the TT structure. For example, there may be two directions, i.e., a direction in which the block of the corresponding node is divided horizontally and a direction in which the block of the corresponding node is divided vertically. As shown in fig. 2, when the MTT division starts, a second flag (MTT _split_flag) indicating whether a node is divided and a flag additionally indicating a division direction (vertical or horizontal) and/or a flag indicating a division type (binary or ternary) if a node is divided are encoded by the entropy encoder 155 and signaled to the video decoding apparatus.
Alternatively, a CU partition flag (split_cu_flag) indicating whether or not each node is partitioned into four nodes of a lower layer may be encoded before encoding a first flag (qt_split_flag) indicating whether or not the node is partitioned. When the value of the CU partition flag (split_cu_flag) indicates that each node is not partitioned, the block of the corresponding node becomes a leaf node in the partition tree structure and becomes a CU as a basic unit of encoding. When the value of the CU partition flag (split_cu_flag) indicates that each node is partitioned, the video encoding apparatus first starts encoding the first flag through the above scheme.
When QTBT is used as another embodiment of the tree structure, there may be two split types, i.e., a type in which the block of the corresponding node is horizontally divided into two blocks of the same size (i.e., symmetric horizontal splitting) and a type in which the block of the corresponding node is vertically divided into two blocks of the same size (i.e., symmetric vertical splitting). A split flag (split_flag) indicating whether each node of the BT structure is split into blocks of a lower layer and split type information indicating the split type are encoded by the entropy encoder 155 and delivered to the video decoding apparatus. Meanwhile, there may additionally be a type in which the block of the corresponding node is divided into two blocks in an asymmetric form. The asymmetric form may include a form in which the block of the corresponding node is divided into two blocks having a size ratio of 1:3, or may also include a form in which the block of the corresponding node is divided in a diagonal direction.
A CU may have various sizes according to the QTBT or QTBTTT partitioning of the CTU. Hereinafter, a block corresponding to a CU to be encoded or decoded (i.e., a leaf node of the QTBTTT) is referred to as a "current block". Since QTBTTT partitioning is employed, the current block may have a rectangular shape as well as a square shape.
The predictor 120 predicts the current block to generate a prediction value. Predictor 120 includes an intra predictor 122 and an inter predictor 124.
In general, each of the current blocks in a picture may be predictively encoded. In general, prediction of a current block may be performed by using an intra prediction technique (using data from a picture including the current block) or an inter prediction technique (using data from a picture encoded before the picture including the current block). Inter prediction includes both unidirectional prediction and bi-directional prediction.
The intra predictor 122 predicts pixels in the current block by using pixels (reference pixels) located on neighbors of the current block in the current picture including the current block. Depending on the prediction direction, there are multiple intra prediction modes. For example, as shown in fig. 3a, the plurality of intra prediction modes may include 2 non-directional modes including a planar mode and a DC mode, and may include 65 directional modes. The adjacent pixels and the arithmetic equation to be used are differently defined according to each prediction mode.
For efficient directional prediction of a current block having a rectangular shape, directional modes (intra prediction modes #67 to #80 and #-1 to #-14), indicated by dotted arrows in fig. 3b, may additionally be used. These directional modes may be referred to as "wide-angle intra prediction modes". In fig. 3b, the arrows indicate the corresponding reference samples used for prediction and do not represent the prediction directions. The prediction direction is opposite to the direction indicated by the arrow. A wide-angle intra prediction mode is a mode in which prediction is performed in the direction opposite to a specific directional mode without additional bit transmission when the current block has a rectangular shape. In this case, among the wide-angle intra prediction modes, some wide-angle intra prediction modes available for the current block may be determined by the ratio of the width to the height of the rectangular current block. For example, when the current block has a rectangular shape whose height is smaller than its width, wide-angle intra prediction modes with angles smaller than 45 degrees (intra prediction modes #67 to #80) are available. When the current block has a rectangular shape whose height is greater than its width, wide-angle intra prediction modes with angles greater than -135 degrees may be used.
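As a rough illustration of how the block aspect ratio can govern which wide-angle modes become available, the sketch below is hypothetical and simplified: it only adds the extra mode ranges depending on block shape, whereas in an actual codec specification an equal number of regular directional modes are remapped rather than simply added. Mode numbers follow the description above.

```python
# Hypothetical sketch: choosing available directional intra modes from the block shape.
# Regular directional modes #2..#66; extra wide-angle modes #67..#80 and #-1..#-14.

def available_directional_modes(width, height):
    modes = list(range(2, 67))            # 65 regular directional modes
    if width > height:                    # wider than tall: modes beyond #66 open up
        modes += list(range(67, 81))
    elif height > width:                  # taller than wide: negative-index modes open up
        modes += list(range(-14, 0))
    return modes                          # square blocks use only the regular modes

print(len(available_directional_modes(32, 8)))   # 79 candidate modes for a wide block
print(len(available_directional_modes(16, 16)))  # 65 candidate modes for a square block
```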
The intra predictor 122 may determine an intra prediction mode to be used for encoding the current block. In some embodiments, the intra predictor 122 may encode the current block by using multiple intra prediction modes and select an appropriate intra prediction mode to use from among the tested modes. For example, the intra predictor 122 may calculate rate-distortion values using rate-distortion analysis for the multiple tested intra prediction modes and select the intra prediction mode having the best rate-distortion characteristics among the tested modes.
The intra predictor 122 selects one intra prediction mode among a plurality of intra prediction modes, and predicts the current block by using neighboring pixels (reference pixels) and an arithmetic equation determined according to the selected intra prediction mode. Information about the selected intra prediction mode is encoded by the entropy encoder 155 and delivered to a video decoding device.
The inter predictor 124 generates a prediction value for the current block by using a motion compensation process. The inter predictor 124 searches for a block most similar to the current block in a reference picture encoded and decoded earlier than the current picture, and generates a prediction value for the current block by using the searched block. In addition, a Motion Vector (MV) is generated, which corresponds to a displacement between a current block in a current picture and a predicted value in a reference picture. In general, motion estimation is performed on a luminance component, and a motion vector calculated based on the luminance component is used for both the luminance component and the chrominance component. Motion information including information on a reference picture and information on a motion vector for predicting a current block is encoded by the entropy encoder 155 and delivered to a video decoding apparatus.
The inter predictor 124 may also perform interpolation on the reference picture or reference block in order to increase prediction accuracy. In other words, sub-samples between two consecutive integer samples are interpolated by applying filter coefficients to a plurality of consecutive integer samples including the two integer samples. When the process of searching for the block most similar to the current block is performed on the interpolated reference picture, the motion vector can be expressed with fractional-sample precision rather than integer-sample precision. The precision or resolution of the motion vector may be set differently for each target region to be encoded (e.g., a unit such as a slice, tile, CTU, or CU). When such Adaptive Motion Vector Resolution (AMVR) is applied, information on the motion vector resolution to be applied to each target region should be signaled for each target region. For example, when the target region is a CU, information on the motion vector resolution applied to each CU is signaled. The information on the motion vector resolution may be information representing the precision of the motion vector difference, which is described below.
Meanwhile, the inter predictor 124 may perform inter prediction by using bi-directional prediction. In the case of bi-prediction, two reference pictures and two motion vectors representing block positions most similar to the current block in each reference picture are used. The inter predictor 124 selects a first reference picture and a second reference picture from the reference picture list0 (RefPicList 0) and the reference picture list1 (RefPicList 1), respectively. The inter predictor 124 also searches for a block most similar to the current block in each reference picture to generate a first reference block and a second reference block. In addition, a prediction value for the current block is generated by averaging or weighted-averaging the first reference block and the second reference block. In addition, motion information including information on two reference pictures for predicting the current block and information on two motion vectors is delivered to the entropy encoder 155. Here, the reference picture list0 may be composed of pictures preceding the current picture in display order among the pre-restored pictures, and the reference picture list1 may be composed of pictures following the current picture in display order among the pre-restored pictures. However, although not particularly limited thereto, a pre-restored picture following the current picture in display order may be additionally included in the reference picture list 0. Conversely, a pre-restored picture preceding the current picture may be additionally included in the reference picture list 1.
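The final bi-prediction step described above amounts to a simple, possibly weighted, average of the two reference blocks. Below is a minimal numpy sketch of that averaging; the function name and the weight values are illustrative only.

```python
import numpy as np

def bi_predict(ref_block0, ref_block1, w0=0.5, w1=0.5):
    """Weighted average of two motion-compensated reference blocks.

    With w0 = w1 = 0.5 this is the plain average; other weights model a
    weighted bi-prediction. The weights here are illustrative values."""
    pred = w0 * ref_block0.astype(np.float64) + w1 * ref_block1.astype(np.float64)
    return np.clip(np.rint(pred), 0, 255).astype(np.uint8)

# Example with two random 8x8 reference blocks.
rng = np.random.default_rng(0)
b0 = rng.integers(0, 256, (8, 8), dtype=np.uint8)
b1 = rng.integers(0, 256, (8, 8), dtype=np.uint8)
print(bi_predict(b0, b1).shape)  # (8, 8)
```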
In order to minimize the number of bits consumed for encoding motion information, various methods may be used.
For example, when a reference picture and a motion vector of a current block are identical to those of a neighboring block, information capable of identifying the neighboring block is encoded to deliver motion information of the current block to a video decoding apparatus. This approach is called merge mode.
In the merge mode, the inter predictor 124 selects a predetermined number of merge candidate blocks (hereinafter, referred to as "merge candidates") from neighboring blocks of the current block.
As shown in fig. 4, all or some of a left block A0, a lower left block A1, an upper block B0, an upper right block B1, and an upper left block B2 adjacent to the current block in the current picture may be used as neighboring blocks for deriving a merge candidate. Further, blocks positioned within reference pictures other than the current picture (at which the current block is positioned), which may be the same as or different from the reference picture used to predict the current block, may also be used as merging candidates. For example, a block co-located with the current block within the reference picture or a block adjacent to the co-located block may be additionally used as a merge candidate. If the number of merging candidates selected by the method described above is smaller than the preset number, a zero vector is added to the merging candidates.
The inter predictor 124 configures a merge list including a predetermined number of merge candidates by using the neighboring blocks. A merge candidate to be used as the motion information of the current block is selected from among the merge candidates included in the merge list, and merge index information for identifying the selected candidate is generated. The generated merge index information is encoded by the entropy encoder 155 and delivered to the video decoding apparatus.
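The merge-list construction described above can be pictured with the following sketch. It is illustrative only: the candidate positions, list size, and pruning are simplified, and the data structures are hypothetical. Spatial and temporal candidates are collected, zero vectors pad the list to a preset size, and only the index of the chosen candidate needs to be signaled.

```python
# Simplified sketch of merge-list construction (illustrative only; real codecs add
# detailed availability checks and pruning rules).

def build_merge_list(spatial_candidates, temporal_candidates, max_candidates=6):
    merge_list = []
    for mv in spatial_candidates + temporal_candidates:
        if mv is not None and mv not in merge_list:
            merge_list.append(mv)
        if len(merge_list) == max_candidates:
            return merge_list
    while len(merge_list) < max_candidates:   # pad with zero vectors
        merge_list.append((0, 0))
    return merge_list

# A0, A1, B0, B1, B2 neighbours plus one co-located candidate; None = unavailable.
spatial = [(3, -1), None, (3, -1), (0, 2), (-1, 0)]
temporal = [(2, 2)]
merge_list = build_merge_list(spatial, temporal)
print(merge_list)           # [(3, -1), (0, 2), (-1, 0), (2, 2), (0, 0), (0, 0)]
merge_index = 1             # encoder signals this index; decoder picks merge_list[1]
print(merge_list[merge_index])
```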
The merge skip mode is a special case of the merge mode. After quantization, when all transform coefficients used for entropy encoding are close to zero, only neighboring block selection information is transmitted without transmitting a residual signal. By using the merge skip mode, relatively high encoding efficiency can be achieved for images with slight motion, still images, screen content images, and the like.
Hereinafter, the merge mode and the merge skip mode are collectively referred to as a merge/skip mode.
Another method for encoding motion information is Advanced Motion Vector Prediction (AMVP) mode.
In the AMVP mode, the inter predictor 124 derives a motion vector predictor candidate for a motion vector of a current block by using neighboring blocks of the current block. As neighboring blocks used to derive the motion vector predictor candidates, all or some of a left block A0, a lower left block A1, an upper block B0, an upper right block B1, and an upper left block B2 adjacent to the current block in the current picture shown in fig. 4 may be used. Furthermore, blocks located within reference pictures (which may be the same as or different from the reference picture used to predict the current block) other than the current picture (where the current block is located) may also be used as neighboring blocks for deriving motion vector predictor candidates. For example, a block co-located with the current block within the reference picture or a block adjacent to the co-located block may be used. If the number of motion vector candidates selected by the above method is less than a preset number, a zero vector is added to the motion vector candidates.
The inter predictor 124 derives a motion vector predictor candidate by using the motion vector of the neighboring block and determines a motion vector predictor for the motion vector of the current block by using the motion vector predictor candidate. In addition, a motion vector difference is calculated by subtracting a motion vector predictor from a motion vector of the current block.
The motion vector predictor may be obtained by applying a predefined function (e.g., median or average calculation) to the motion vector predictor candidates. In this case, the video decoding apparatus is also aware of the predefined function. In addition, since the neighboring blocks used to derive the motion vector predictor candidates are blocks for which encoding and decoding have already been completed, the video decoding apparatus also already knows the motion vectors of the neighboring blocks. Therefore, the video encoding apparatus does not need to encode information for identifying the motion vector predictor candidates. Thus, in this case, information on the motion vector difference and information on the reference picture used for predicting the current block are encoded.
Meanwhile, the motion vector predictor may be determined according to a scheme of selecting any one of the motion vector predictor candidates. In this case, information for identifying the selected motion vector predictor candidate is additionally encoded together with information on a motion vector difference and information on a reference picture for predicting a current block.
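In AMVP, what ultimately gets coded is the motion vector difference (and, when the candidate is selected explicitly, an index). A small sketch of that arithmetic follows; the helper names are hypothetical.

```python
# Sketch of the AMVP difference computation (illustrative helpers, not codec code).

def amvp_encode(mv, mvp_candidates, mvp_index=0):
    """Return the data the encoder signals: candidate index and MV difference."""
    mvp = mvp_candidates[mvp_index]
    mvd = (mv[0] - mvp[0], mv[1] - mvp[1])
    return mvp_index, mvd

def amvp_decode(mvp_candidates, mvp_index, mvd):
    """Reconstruct the motion vector on the decoder side."""
    mvp = mvp_candidates[mvp_index]
    return (mvp[0] + mvd[0], mvp[1] + mvd[1])

candidates = [(4, -2), (5, 0)]               # predictors derived from neighbouring blocks
index, mvd = amvp_encode((7, -3), candidates, mvp_index=0)
print(index, mvd)                            # 0 (3, -1)
print(amvp_decode(candidates, index, mvd))   # (7, -3)
```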
The subtractor 130 generates a residual block by subtracting the prediction block generated by the intra predictor 122 or the inter predictor 124 from the current block.
The transformer 140 converts the residual signal in a residual block having pixel values in the spatial domain into transform coefficients in the frequency domain. The transformer 140 may transform the residual signal in the residual block by using the total size of the residual block as a transform unit, or may divide the residual block into a plurality of sub-blocks and perform the transform by using each sub-block as a transform unit. Alternatively, the residual block is divided into two sub-blocks, a transform region and a non-transform region, and the residual signal is transformed by using only the transform region sub-block as a transform unit. Here, the transform region sub-block may be one of two rectangular blocks having a size ratio of 1:1 along the horizontal axis (or vertical axis). In this case, a flag (cu_sbt_flag) indicating that only the sub-block is transformed, direction (vertical/horizontal) information (cu_sbt_horizontal_flag), and/or position information (cu_sbt_pos_flag) are encoded by the entropy encoder 155 and signaled to the video decoding apparatus. Furthermore, the transform region sub-block may have a size ratio of 1:3 along the horizontal axis (or vertical axis). In this case, a flag (cu_sbt_quad_flag) distinguishing the corresponding division is additionally encoded by the entropy encoder 155 and signaled to the video decoding apparatus.
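One way to read the sub-block transform flags above is sketched below: the quad flag selects a 1:3 versus 1:1 split, the horizontal flag selects the split direction, and the position flag selects which sub-block carries the transformed residual. The flag semantics in this sketch are an interpretation for illustration, not normative syntax.

```python
# Hypothetical sketch of deriving the transformed sub-block region from SBT-style flags.
# A block is (x, y, width, height); returns the region whose residual is transformed.

def sbt_transform_region(block, cu_sbt_quad_flag, cu_sbt_horizontal_flag, cu_sbt_pos_flag):
    x, y, w, h = block
    frac = 4 if cu_sbt_quad_flag else 2        # 1:3 split keeps a quarter, 1:1 keeps a half
    if cu_sbt_horizontal_flag:                 # split along the horizontal axis
        part = h // frac
        return (x, y, w, part) if cu_sbt_pos_flag == 0 else (x, y + h - part, w, part)
    part = w // frac                           # split along the vertical axis
    return (x, y, part, h) if cu_sbt_pos_flag == 0 else (x + w - part, y, part, h)

# 32x16 block, 1:3 vertical split, transform the right-hand quarter.
print(sbt_transform_region((0, 0, 32, 16), cu_sbt_quad_flag=1,
                           cu_sbt_horizontal_flag=0, cu_sbt_pos_flag=1))
# (24, 0, 8, 16)
```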
Meanwhile, the transformer 140 may perform transformation on the residual block separately in the horizontal direction and the vertical direction. For the transformation, different types of transformation functions or transformation matrices may be used. For example, a pair of transform functions for horizontal transforms and vertical transforms may be defined as a Multiple Transform Set (MTS). The transformer 140 may select one transform function pair having the highest transform efficiency in the MTS and may transform the residual block in each of the horizontal and vertical directions. Information about the transform function pairs in the MTS (mts_idx) is encoded by the entropy encoder 155 and signaled to the video decoding device.
The quantizer 145 quantizes the transform coefficients output from the transformer 140 using quantization parameters, and outputs the quantized transform coefficients to the entropy encoder 155. The quantizer 145 may also immediately quantize the relevant residual block without a transform for any block or frame. The quantizer 145 may also apply different quantization coefficients (scaling values) according to the positions of the transform coefficients in the transform block. A quantization matrix applied to transform coefficients quantized in a two-dimensional arrangement may be encoded and transmitted to a video decoding apparatus.
The rearrangement unit 150 may perform rearrangement of coefficient values for quantized residual values.
The rearrangement unit 150 may change the 2D coefficient array into a 1D coefficient sequence by using coefficient scanning. For example, the rearrangement unit 150 may output a 1D coefficient sequence by scanning from the DC coefficient toward the coefficients in the high-frequency region using a zig-zag scan or a diagonal scan. Instead of the zig-zag scan, a vertical scan that scans the 2D coefficient array in the column direction or a horizontal scan that scans the 2D block-type coefficients in the row direction may also be used, depending on the size of the transform unit and the intra prediction mode. In other words, the scan method to be used may be determined among the zig-zag scan, the diagonal scan, the vertical scan, and the horizontal scan according to the size of the transform unit and the intra prediction mode.
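A scan simply maps the 2D coefficient array into a 1D sequence in a fixed visiting order. The following minimal sketch shows one possible diagonal (anti-diagonal) scan from the DC position toward the high-frequency corner; the exact visiting order within each diagonal is an illustrative choice, not a normative one.

```python
import numpy as np

def diagonal_scan(coeffs_2d):
    """Scan a 2D coefficient block into a 1D sequence along anti-diagonals,
    starting from the DC coefficient at (0, 0)."""
    h, w = coeffs_2d.shape
    order = sorted(((r, c) for r in range(h) for c in range(w)),
                   key=lambda rc: (rc[0] + rc[1], rc[0]))
    return np.array([coeffs_2d[r, c] for r, c in order])

block = np.arange(16).reshape(4, 4)   # toy 4x4 "quantized coefficient" block
print(diagonal_scan(block))
# [ 0  1  4  2  5  8  3  6  9 12  7 10 13 11 14 15]
```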
The entropy encoder 155 generates a bitstream by encoding a sequence of 1D quantized transform coefficients output from the rearrangement unit 150 using various encoding schemes including context-based adaptive binary arithmetic coding (CABAC), exponential golomb, and the like.
Further, the entropy encoder 155 encodes information related to block division, such as a CTU size, a CTU division flag, a QT division flag, an MTT division type, an MTT division direction, and the like, to allow the video decoding apparatus to equally divide blocks with the video encoding apparatus. Further, the entropy encoder 155 encodes information on a prediction type indicating whether the current block is encoded by intra prediction or inter prediction. The entropy encoder 155 encodes intra prediction information (i.e., information about an intra prediction mode) or inter prediction information (in the case of a merge mode, a merge index, and in the case of an AMVP mode, information about a reference picture index and a motion vector difference) according to a prediction type. Further, the entropy encoder 155 encodes information related to quantization (i.e., information about quantization parameters and information about quantization matrices).
The inverse quantizer 160 dequantizes the quantized transform coefficients output from the quantizer 145 to generate transform coefficients. The inverse transformer 165 transforms the transform coefficients output from the inverse quantizer 160 from the frequency domain to the spatial domain to restore a residual block.
The adder 170 adds the restored residual block to the prediction block generated by the predictor 120 to restore the current block. When intra prediction is performed on the next-order block, pixels in the restored current block may be used as reference pixels.
The loop filter unit 180 performs filtering on the restored pixels in order to reduce a blocking effect, a ringing effect, a blurring effect, etc., which occur due to block-based prediction and transform/quantization. The loop filter unit 180 as a loop filter may include all or some of a deblocking filter 182, a Sample Adaptive Offset (SAO) filter 184, and an Adaptive Loop Filter (ALF) 186.
The deblocking filter 182 filters the boundaries between restored blocks in order to remove blocking artifacts occurring due to block-unit encoding/decoding, and the SAO filter 184 and the ALF 186 perform additional filtering on the deblocking-filtered video. The SAO filter 184 and the ALF 186 are filters used for compensating for differences between restored pixels and original pixels that occur due to lossy encoding. The SAO filter 184 applies an offset in CTU units to enhance subjective image quality and coding efficiency. On the other hand, the ALF 186 performs block-unit filtering and compensates for distortion by applying different filters according to the boundary of each block and the degree of change. Information on the filter coefficients to be used for the ALF may be encoded and signaled to the video decoding apparatus.
The restored blocks filtered by the deblocking filter 182, the SAO filter 184, and the ALF 186 are stored in the memory 190. When all blocks in one picture are restored, the restored picture may be used as a reference picture for inter-predicting blocks within a picture to be encoded later.
Fig. 5 is a functional block diagram of a video decoding device in which the techniques of this disclosure may be implemented. Hereinafter, with reference to fig. 5, a video decoding apparatus and components of the apparatus are described.
The video decoding apparatus may include an entropy decoder 510, a rearrangement unit 515, an inverse quantizer 520, an inverse transformer 530, a predictor 540, an adder 550, a loop filter unit 560, and a memory 570.
Similar to the video encoding apparatus of fig. 1, each component of the video decoding apparatus may be implemented as hardware or software or as a combination of hardware and software. Further, the function of each component may be implemented as software, and a microprocessor may also be implemented to execute the function of the software corresponding to each component.
The entropy decoder 510 extracts information related to block segmentation by decoding a bitstream generated by a video encoding apparatus to determine a current block to be decoded, and extracts prediction information required to restore the current block and information about a residual signal.
The entropy decoder 510 determines the size of a CTU by extracting information about the CTU size from a Sequence Parameter Set (SPS) or a Picture Parameter Set (PPS), and partitions a picture into CTUs having the determined size. In addition, the CTU is determined to be the highest layer of the tree structure, i.e., the root node, and the partition information for the CTU may be extracted to partition the CTU by using the tree structure.
For example, when dividing CTUs using the QTBTTT structure, first a first flag (qt_split_flag) related to the division of QT is extracted to divide each node into four nodes of the lower layer. Further, for a node corresponding to a leaf node of QT, a second flag (MTT _split_flag), a split direction (vertical/horizontal), and/or a split type (binary/ternary) related to the split of the MTT are extracted to split the corresponding leaf node into an MTT structure. As a result, each node below the leaf node of QT is recursively partitioned into BT or TT structures.
As another embodiment, when the CTU is divided by using the QTBTTT structure, a CU division flag (split_cu_flag) indicating whether the CU is divided is extracted. The first flag (qt_split_flag) may also be extracted when the corresponding block is partitioned. During the segmentation process, 0 or more recursive MTT segmentations may occur after 0 or more recursive QT segmentations for each node. For example, for CTUs, MTT partitioning may occur immediately, or conversely, QT partitioning may occur only multiple times.
For another example, when the CTU is divided by using the QTBT structure, a first flag (qt_split_flag) related to the division of QT is extracted to divide each node into four nodes of the lower layer. Further, a split flag (split_flag) indicating whether a node corresponding to a leaf node of QT is further split into BT and split direction information are extracted.
Meanwhile, when the entropy decoder 510 determines a current block to be decoded by using the partition of the tree structure, the entropy decoder 510 extracts information on a prediction type indicating whether the current block is intra-predicted or inter-predicted. When the prediction type information indicates intra prediction, the entropy decoder 510 extracts syntax elements for intra prediction information (intra prediction mode) of the current block. When the prediction type information indicates inter prediction, the entropy decoder 510 extracts information representing syntax elements (i.e., motion vectors and reference pictures to which the motion vectors refer) for the inter prediction information.
Further, the entropy decoder 510 extracts quantization related information and extracts information on quantized transform coefficients of the current block as information on a residual signal.
The rearrangement unit 515 may change the sequence of the 1D quantized transform coefficients entropy-decoded by the entropy decoder 510 into a 2D coefficient array (i.e., block) again in the reverse order of the coefficient scan order performed by the video encoding apparatus.
The inverse quantizer 520 dequantizes the quantized transform coefficients by using the quantization parameter. The inverse quantizer 520 may also apply different quantization coefficients (scaling values) to the quantized transform coefficients arranged in 2D. The inverse quantizer 520 may perform dequantization by applying a matrix of quantization coefficients (scaling values) from the video encoding apparatus to the 2D array of quantized transform coefficients.
The inverse transformer 530 generates a residual block for the current block by restoring a residual signal through inverse transforming the dequantized transform coefficients from the frequency domain to the spatial domain.
Further, when the inverse transformer 530 inversely transforms only a partial region (sub-block) of the transform block, the inverse transformer 530 extracts a flag (cu_sbt_flag) indicating that only a sub-block of the transform block has been transformed, direction (vertical/horizontal) information (cu_sbt_horizontal_flag) of the sub-block, and/or position information (cu_sbt_pos_flag) of the sub-block. The inverse transformer 530 also inversely transforms the transform coefficients of the corresponding sub-block from the frequency domain to the spatial domain to restore the residual signal and fills the region that is not inversely transformed with a value of "0" as the residual signal to generate the final residual block for the current block.
In addition, when applying MTS, the inverse transformer 530 determines a transform index or a transform matrix applied in each of the horizontal direction and the vertical direction by using MTS information (mts_idx) signaled from the video encoding apparatus. The inverse transformer 530 also performs inverse transformation on the transform coefficients in the transform block in the horizontal direction and the vertical direction by using the determined transform function.
The predictor 540 may include an intra predictor 542 and an inter predictor 544. The intra predictor 542 is activated when the prediction type of the current block is intra prediction, and the inter predictor 544 is activated when the prediction type of the current block is inter prediction.
The intra predictor 542 determines an intra prediction mode of the current block among a plurality of intra prediction modes according to a syntax element of the intra prediction mode extracted from the entropy decoder 510. The intra predictor 542 also predicts the current block by using neighboring reference pixels of the current block according to an intra prediction mode.
The inter predictor 544 determines a motion vector of the current block and a reference picture to which the motion vector refers by using syntax elements for the inter prediction mode extracted from the entropy decoder 510.
The adder 550 restores the current block by adding the residual block output from the inverse transformer 530 to the prediction block output from the inter predictor 544 or the intra predictor 542. In intra prediction of a block to be decoded later, pixels within the restored current block are used as reference pixels.
The loop filter unit 560, which is a loop filter, may include a deblocking filter 562, an SAO filter 564, and an ALF 566. Deblocking filter 562 performs deblocking filtering on boundaries between restored blocks to remove blocking artifacts occurring due to block unit decoding. The SAO filter 564 and ALF 566 perform additional filtering on the restored block after deblocking filtering to compensate for differences between restored pixels and original pixels that occur due to lossy encoding. The filter coefficients of the ALF are determined by using information on the filter coefficients decoded from the bitstream.
The restored blocks filtered by the deblocking filter 562, the SAO filter 564, and the ALF 566 are stored in the memory 570. When all blocks in one picture are restored, the restored picture may be used as a reference picture for inter-predicting blocks within a picture to be encoded later.
In some embodiments, the present invention relates to encoding and decoding video images as described above. More particularly, the present disclosure provides a video encoding method and apparatus for modeling a noise model or a subjective video quality model of a current video, for preprocessing and post-processing the video based on the model, and signaling information of the model.
In the following description, for convenience, the encoder of fig. 1 and the decoder of fig. 5 are used to represent a video encoding apparatus and a video decoding apparatus that include preprocessing and post-processing. Thus, the video encoding apparatus may include a component that performs preprocessing, and the video encoding apparatus may include the encoder. The video decoding apparatus may include a component that performs post-processing, and the video decoding apparatus may include the decoder.
<Example 1> Preprocessing/post-processing method using a noise model
Fig. 6 is an explanatory diagram showing a video encoding apparatus and a video decoding apparatus including preprocessing and post-processing according to an embodiment of the present disclosure.
The video encoding apparatus pre-processes noise of an input video, encodes the input video whose noise has been pre-processed to generate a bitstream, and then transmits the bitstream to the video decoding apparatus. The video encoding apparatus includes all or some of an encoder 600, a noise reducer 602, a noise analyzer 604, a noise estimator 606, and a differencer 608. The video decoding apparatus may generate a restored video from the bitstream and may apply post-processed noise to the restored video. The video decoding apparatus includes all or some of a decoder 610, a noise generator 612, and an adder 614.
Hereinafter, an operation of the video encoding apparatus as shown in fig. 6 will be described.
The noise reducer 602 removes noise added to the input video before the video is input to the encoder 600 to generate a noise-reduced video. The noise reducer 602 may use a noise reduction method corresponding to a characteristic of noise added to the input video based on analysis of the input video.
Alternatively, the noise reducer 602 may use a method of removing noise using predefined pixel operations without analyzing the type of noise added to the input video. In this case, the predefined pixel operations may represent different types of filtering methods, such as low-pass filtering, bilateral filtering, and bilinear filtering. The noise reducer 602 may selectively use one of these different types of filtering methods. Specifically, unlike a conventional FIR filter, which uses weights that depend only on the spatial distance between pixels, bilateral filtering uses weights that additionally depend on the difference between pixel values. Bilateral filtering is a representative embodiment of a filtering method that preserves meaningful boundaries even when filtering is performed at the boundaries of objects in a video. Hereinafter, the present embodiment proposes a method and apparatus for effectively removing noise by performing bilateral filtering on the input video before the input video is input to the encoder 600.
The noise reducer 602 provides the noise reduced video as an input to the encoder 600. Also, the noise reducer 602 may provide the noise reduced video to the noise analyzer 604.
The noise analyzer 604 may take as input two channels comprising the original input video and the noise-reduced video. Alternatively, the noise analyzer 604 may take as input a difference video between the original input video and the noise-reduced video. In this case, the difference video may be generated by the differencer 608.
The noise analyzer 604 analyzes two channels or difference videos to analyze characteristics of noise removed from the original input video. Here, the characteristics of the noise may be gaussian distribution, uniform distribution, or the like.
The noise estimator 606 reflects characteristics of the noise analyzed by the noise analyzer 604 to generate parameters of the noise removed from the original input video. Further, the noise parameter may include a noise generation method corresponding to the noise characteristic. Here, the noise generation method may be used later in the video decoding apparatus. As another embodiment, a noise generation method may be agreed in advance between the video encoding apparatus and the video decoding apparatus according to noise characteristics.
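For instance, if the noise analyzer decides that the removed noise is well modeled as zero-mean Gaussian noise, the estimator essentially needs a standard deviation, which can be read from the difference video. Below is a small numpy sketch under that assumption; the parameter names are illustrative only.

```python
import numpy as np

def estimate_gaussian_noise_params(input_video, denoised_video):
    """Estimate parameters of (near) zero-mean Gaussian noise from the difference video."""
    diff = input_video.astype(np.float64) - denoised_video.astype(np.float64)
    return {"model": "gaussian", "mean": float(diff.mean()), "sigma": float(diff.std())}

# Toy example: a flat frame plus synthetic Gaussian noise with sigma = 3.
rng = np.random.default_rng(0)
clean = np.full((64, 64), 128.0)
noisy = clean + rng.normal(0.0, 3.0, clean.shape)
params = estimate_gaussian_noise_params(noisy, clean)
print(params["sigma"])   # close to 3
```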
Meanwhile, the noise parameters are supplied to the noise reducer 602, and the noise reducer 602 may use a noise reduction method corresponding to the characteristics of noise.
The encoder 600 encodes the noise reduced video to generate a bitstream. In this case, the bitstream may include a result of encoding the noise parameter. The video encoding device may transmit the bitstream to a video decoding device.
Hereinafter, an operation of the video decoding apparatus as shown in fig. 6 will be described.
The decoder 610 generates restored video from the bitstream. As described above, the bitstream may include noise parameters generated by the video encoding device.
Meanwhile, the noise parameters may be transmitted from the video encoding apparatus to the video decoding apparatus by being included in a separate bitstream, such as Supplemental Enhancement Information (SEI) or Video Usability Information (VUI).
The decoded noise parameters are provided to a noise generator 612. As described above, the noise parameters may include a noise generation method. Alternatively, a noise generation method agreed in advance between the video encoding apparatus and the video decoding apparatus may be used.
The noise generator 612 generates noise based on the noise parameters using a noise generation method.
The video decoding apparatus may use the generated noise and the restored video to generate a final restored video. In this case, as shown in fig. 6, the noise may be added to the restored video in the form of an offset by the adder 614. Alternatively, a predefined filter may be applied to the restored video so that the noise is added, thereby generating the final restored video.
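A matching decoder-side sketch is shown below: noise is generated from the signaled parameters and added to the restored video as an offset. This is a minimal sketch assuming the Gaussian model and the illustrative parameter names used in the encoder-side example above.

```python
import numpy as np

def generate_and_add_noise(restored_video, noise_params, seed=0):
    """Generate noise from the signaled parameters and add it as an offset."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(noise_params["mean"], noise_params["sigma"], restored_video.shape)
    out = restored_video.astype(np.float64) + noise
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)

restored = np.full((64, 64), 128, dtype=np.uint8)
final = generate_and_add_noise(restored, {"model": "gaussian", "mean": 0.0, "sigma": 3.0})
print(final.dtype, final.shape)   # uint8 (64, 64)
```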
Fig. 7 is a schematic diagram illustrating a bilateral filter for preprocessing and post-processing according to an embodiment of the present disclosure.
As an example, the bilateral filter may be a 3×3 filter, as shown in fig. 7. Alternatively, a 5×5 filter, a 7×7 filter, or the like may be used in addition to the 3×3 filter.
Meanwhile, for a bilateral filter having a size of 3×3, the filter coefficients may be composed of weights calculated at each position. As an embodiment, the weights of the bilateral filter may be expressed as equation 1.
[Equation 1]

w(i, j, k, m) = exp( -((i - k)^2 + (j - m)^2) / (2σ_d^2) - (I(i, j) - I(k, m))^2 / (2σ_r^2) )

where I(x, y) denotes the pixel (luminance) value at position (x, y).
Here, the first term of the exponential function is determined by the spatial distance between pixels, and the second term is determined by the luminance value of each pixel (i.e., the difference between the pixel values at the two positions).
In equation 1, the weight w (i, j, k, m) of the bilateral filter represents the weight to be applied to the pixel at the position (k, m) in order to apply the filter to the pixel at the position (i, j). Therefore, as shown in fig. 7, the weight value is maximum at the center of the bilateral filter corresponding to i=k and j=m.
Furthermore, σ_d in the first term of the exponential function may be determined according to the size of the current block and may take a different value depending on the coding scheme (i.e., intra prediction or inter prediction) applied to the current block. σ_r in the second term of the exponential function may be determined by the quantization parameter of the current block.
These σ values may be calculated directly in the video encoding apparatus and the video decoding apparatus for use in equation 1. Alternatively, the values may be encoded in the bitstream as specific syntax information and signaled from the video encoding apparatus to the video decoding apparatus. Further, the size of the filter to be applied to the pixels in a block according to the bilateral filter weights may be selected from among the 3×3, 5×5, and 7×7 sizes described above.
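The weight formula of equation 1 translates almost directly into code. The sketch below is illustrative only: the σ values are chosen arbitrarily rather than derived from the block size or quantization parameter, and the function name is hypothetical. It computes the 3×3 bilateral weights around one pixel and applies them as a normalized weighted average.

```python
import numpy as np

def bilateral_filter_pixel(img, i, j, sigma_d=1.0, sigma_r=10.0, radius=1):
    """Apply a (2*radius+1)^2 bilateral filter to the pixel at (i, j), following
    w(i,j,k,m) = exp(-((i-k)^2+(j-m)^2)/(2*sigma_d^2) - (I(i,j)-I(k,m))^2/(2*sigma_r^2))."""
    h, w = img.shape
    num, den = 0.0, 0.0
    for k in range(max(0, i - radius), min(h, i + radius + 1)):
        for m in range(max(0, j - radius), min(w, j + radius + 1)):
            spatial = ((i - k) ** 2 + (j - m) ** 2) / (2 * sigma_d ** 2)
            range_term = (float(img[i, j]) - float(img[k, m])) ** 2 / (2 * sigma_r ** 2)
            wt = np.exp(-(spatial + range_term))
            num += wt * img[k, m]
            den += wt
    return num / den

img = np.array([[100, 102, 101],
                [ 99, 180, 103],
                [101, 100, 102]], dtype=np.float64)
# Neighbours that differ strongly from the centre get near-zero weight, so the
# centre value dominates: this is the boundary-preserving behaviour described above.
print(round(bilateral_filter_pixel(img, 1, 1), 2))
```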
<Example 2> Perceptual preprocessing/post-processing method
Fig. 8 is a schematic diagram illustrating a video encoding apparatus and a video decoding apparatus including preprocessing and post-processing according to another embodiment of the present disclosure.
The video encoding apparatus analyzes the input video, removes perceptually removable elements from the video, encodes the input video from which the elements have been removed, and generates a bitstream. The video encoding apparatus transmits the bitstream to the video decoding apparatus. The video encoding apparatus includes all or some of an encoder 600, a perceptual quality pre-processor 802, a perceptual model analyzer 804, a perceptual model estimator 806, and a differencer 608. The video decoding apparatus generates a restored video from the bitstream and then post-processes the restored video to improve subjective video quality. The video decoding apparatus includes all or some of a decoder 610, a perceptual quality enhancer 812, and an adder 614.
Hereinafter, an operation of the video encoding apparatus as shown in fig. 8 will be described.
Before the video is input to the encoder 600, the perceptual quality pre-processor 802 removes perceptually removable elements from the input video to generate a video from which the removable elements have been removed. The perceptual quality pre-processor 802 analyzes the input video and removes the removable elements based on various subjective video quality measurement models. Further, the perceptual quality pre-processor 802 may remove elements whose removal is determined to improve coding efficiency.
In this case, a perceptually removable element is an element whose change is not recognizable by the human visual system when human visual characteristics are considered. Examples of such perceptual visual characteristics include the Contrast Sensitivity Function (CSF) effect, the Contrast Masking (CM) or Texture Masking (TM) effect, and the Luminance Adaptation (LA) effect.
The CSF effect means that the human perceptual visual system behaves like a band-pass filter along the frequency axis. CSF effects can be classified into a spatial contrast sensitivity function effect and a temporal contrast sensitivity function effect according to the type of frequency axis.
The CM or TM effect represents the characteristic that the visibility of video distortion changes according to masking. For example, the human visual system exhibits the characteristic that the visibility of distortion is reduced in highly textured regions, whereas distortion is easily recognized in flat regions or near boundaries.
The LA effect refers to the characteristic that distortion is less easily recognized in dark or bright luminance regions than in regions of medium luminance.
Meanwhile, there are various perceptual models that reflect perceptual visual characteristics in terms of perceptual quality and reflect those characteristics in the video. The Just Noticeable Difference (JND) model is a representative perceptual model. Here, JND represents the minimum distortion at which a human begins to perceive a visual difference. The present embodiment proposes a method and apparatus for removing error elements that are imperceptible in terms of subjective video quality using the perceptual quality pre-processor 802. As an example, the perceptual quality pre-processor 802 may remove the imperceptible error elements based on the JND model described above.
The perceptual quality pre-processor 802 provides the video from which the imperceptible error elements have been removed as input to the encoder 600. Further, the perceptual quality pre-processor 802 may provide the video from which the imperceptible error elements have been removed to the perceptual model analyzer 804.
The perceptual model analyzer 804 may take as input two channels comprising the original input video and the video from which the imperceptible error elements have been removed. Alternatively, the perceptual model analyzer 804 may take as input a difference video between the original input video and the video from which the imperceptible error elements have been removed. In this case, the difference video may be generated by the differencer 608.
The perceptual model analyzer 804 predicts whether perceptually removable elements are present in the original input video and predicts the improvement in coding efficiency obtained when those elements are removed. Further, the perceptual model analyzer 804 analyzes the two channels or the difference video to identify the characteristics of the imperceptible error elements removed from the original input video.
Based on the characteristics of the imperceptible error elements, the perceptual model analyzer 804 may determine the dominant characteristic among the above-described perceptual visual characteristics and may select, as the perceptual model, a model suited to that dominant characteristic. As described above, the JND model may be selected as the perceptual model.
The perceptual model estimator 806 generates the parameters of the perceptual model selected by the perceptual model analyzer 804. For example, when the perceptual model is the JND model, the perceptual model estimator 806 may obtain a threshold for the pixel values of the input video in terms of perceptual quality. Here, the threshold represents the maximum value by which the pixel values of the input video can be changed in terms of perceptual quality. The perceptual model parameters may include such thresholds.
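As an illustrative sketch (not the specific estimation procedure of the present disclosure), the following Python code shows how an estimator such as the perceptual model estimator 806 might derive per-pixel JND thresholds using only a luminance-adaptation term in the spirit of Chou and Li; the function name, filter size, and constants are assumptions made for illustration.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def estimate_jnd_threshold(luma: np.ndarray) -> np.ndarray:
    """Per-pixel threshold up to which luma may change without being noticed."""
    bg = uniform_filter(luma.astype(np.float64), size=5)  # local background luminance
    dark = 17.0 * (1.0 - np.sqrt(bg / 127.0)) + 3.0       # dark regions tolerate larger changes
    bright = 3.0 / 128.0 * (bg - 127.0) + 3.0             # tolerance grows again in bright regions
    return np.where(bg <= 127.0, dark, bright)
```

In practice, additional terms (e.g., texture masking) would typically be combined with this luminance term, and the resulting thresholds would be carried as perceptual model parameters.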
Further, the perceptual model estimator 806 may select a video quality compensation method corresponding to the perceptual model and may include the video quality compensation method among the perceptual model parameters. Here, the video quality compensation method may be used later by the video decoding apparatus. As another embodiment, the video quality compensation method may be agreed in advance between the video encoding apparatus and the video decoding apparatus according to the perceptual visual characteristics.
Meanwhile, the perceptual model parameters may be provided to the perceptual quality pre-processor 802. The perceptual quality pre-processor 802 may generate the video from which the perceptually removable elements have been removed from the input video, using a method corresponding to the perceptual model. For example, when the perceptual model is the JND model, the perceptual quality pre-processor 802 may apply an operation to the input video within the limit of the above-described threshold. Here, the operation may be filtering, a convolution operation, or a change of pixel values using an offset.
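A minimal sketch of such a JND-limited pre-processing operation is given below, assuming the per-pixel thresholds computed above; the smoothing filter and its strength are illustrative assumptions rather than the method prescribed by the present disclosure.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def preprocess_within_jnd(luma: np.ndarray, jnd: np.ndarray) -> np.ndarray:
    """Smooth the input, but clip the change so it stays below the JND threshold."""
    smoothed = gaussian_filter(luma.astype(np.float64), sigma=1.0)
    delta = np.clip(smoothed - luma, -jnd, jnd)  # keep the modification imperceptible
    return np.clip(luma + delta, 0, 255).astype(luma.dtype)
```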
The encoder 600 encodes the preprocessed input video to generate a bitstream. In this case, the bitstream may include parameters of the perceptual model. The video encoding device may send the bitstream to a video decoding device.
Hereinafter, the operation of the video decoding apparatus as shown in fig. 8 is described.
The decoder 610 generates restored video from the bitstream. As described above, the bitstream may include parameters of a perceptual model generated by the video encoding device.
Meanwhile, the perceptual model parameters may be included in an independent bitstream structure (such as SEI or VUI) and transmitted from the video encoding apparatus to the video decoding apparatus.
The decoded parameters of the perceptual model are provided to the perceptual quality enhancer 812. As described above, the parameters of the perceptual model may include a video quality enhancement method. Alternatively, a video quality enhancement method agreed in advance between the video encoding apparatus and the video decoding apparatus may be used. When the perceptual model is the JND model, the perceptual model parameters may include a threshold, i.e., the maximum value by which the pixel values of the restored video can be changed in terms of perceptual quality.
The perceptual quality enhancer 812 post-processes the restored video based on the perceptual model parameters to generate an enhanced video that improves the subjective video quality of the restored video. When the perceptual model is the JND model, the perceptual quality enhancer 812 may apply an operation to the restored video according to the video quality compensation method within the limit of the above-described threshold. Here, the operation may be filtering, a convolution operation, or a change of pixel values using an offset.
The perceptual quality enhancer 812 may evaluate perceptually degraded portions and then enhance the video quality to generate the enhanced video. In this case, the perceptual quality enhancer 812 may divide the restored video into N×N square blocks and post-process the restored video in units of N×N square blocks. In other words, the perceptual quality enhancer 812 may apply a single perceptual model to the entire restored video to evaluate the perceptual degradation and enhance the video quality, or may divide the restored video into a plurality of blocks and perform the post-processing in units of blocks.
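The following sketch illustrates block-wise post-processing of this kind, assuming the JND thresholds are available at the decoder; the degradation test (block variance) and the enhancement operation (unsharp masking) are illustrative assumptions, not the specific method of the perceptual quality enhancer 812.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def enhance_blockwise(restored: np.ndarray, jnd: np.ndarray, n: int = 16) -> np.ndarray:
    """Post-process the restored frame in N x N blocks, bounded by the JND map."""
    out = restored.astype(np.float64)
    h, w = restored.shape[:2]
    for y in range(0, h, n):
        for x in range(0, w, n):
            blk = out[y:y + n, x:x + n]
            if blk.var() < 25.0:  # crude test: treat flat blocks as perceptually degraded
                sharpened = blk + (blk - gaussian_filter(blk, sigma=1.0))  # unsharp masking
                delta = np.clip(sharpened - blk,
                                -jnd[y:y + n, x:x + n], jnd[y:y + n, x:x + n])
                out[y:y + n, x:x + n] = blk + delta
    return np.clip(out, 0, 255).astype(restored.dtype)
```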
The video decoding apparatus may generate a final restored video using the generated enhanced video and the restored video. In this case, as shown in fig. 8, the enhanced video may be added to the restored video in the form of an offset by the adder 614. Alternatively, the enhanced video generated by the perceptual quality enhancer 812 may itself be the final restored video.
<Example 3> Pre-processing/post-processing method using downsampling and upsampling
Fig. 9 is a schematic diagram illustrating a video encoding apparatus and a video decoding apparatus including preprocessing and post-processing according to another embodiment of the present disclosure.
The video encoding apparatus reduces the resolution of the input video, encodes the reduced-resolution video to generate a bitstream, and then transmits the bitstream to the video decoding apparatus. The video encoding apparatus includes all or some of the encoder 600, the downsampler 902, and the neighbor information encoder 904. The video decoding apparatus may generate a restored video from the bitstream and then may upsample the restored video. The video decoding apparatus includes all or some of the decoder 610, the neighbor information decoder 912, and the upsampler 914.
Hereinafter, an operation of the video encoding apparatus as shown in fig. 9 will be described.
The downsampler 902 reduces the resolution of the input video before the video is input to the encoder 600. That is, the downsampler 902 performs a resolution reduction operation using a downsampling filter to convert the input video into a video whose width and height are 1/2 or 1/4 of the original. For example, unlike the related-art method of directly encoding 4K video having a resolution of 3840×2160, the present embodiment downsamples the 3840×2160 video into a 1920×1080 video and uses the downsampled video as the encoding input. When the encoder 600 operates on such downsampled video, encoding efficiency can be improved.
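A minimal sketch of a 2:1 downsampler using a simple 2×2 box-averaging filter is given below; the actual downsampling filter is a design choice that is signalled as a downsampling parameter, so this filter is only an illustrative assumption.

```python
import numpy as np

def downsample_2x(frame: np.ndarray) -> np.ndarray:
    """Halve width and height with a 2x2 box-averaging filter."""
    h, w = frame.shape[0] - frame.shape[0] % 2, frame.shape[1] - frame.shape[1] % 2
    f = frame[:h, :w].astype(np.float64)
    out = (f[0::2, 0::2] + f[1::2, 0::2] + f[0::2, 1::2] + f[1::2, 1::2]) / 4.0
    return np.clip(out, 0, 255).astype(frame.dtype)
```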
The neighbor information encoder 904 encodes the type of downsampling operation used by the downsampler 902 and the downsampling information (i.e., the downsampling parameters) to generate a bitstream. Here, the downsampling operation indicates the downsampling filter used when downsampling is performed, and the downsampling information indicates the video ratio between the original video and the downsampled video. The neighbor information encoder 904 encodes the downsampling parameters into syntax elements. Further, when cropping is applied to the original video, the neighbor information encoder 904 may additionally encode information about the cropping operation performed before downsampling into syntax elements.
Hereinafter, the operation of the video decoding apparatus as shown in fig. 9 is described.
The decoder 610 generates restored video from the bitstream. As described above, the bitstream may include downsampling parameters generated by the video encoding device.
Meanwhile, the downsampling parameters may be included in a separate bitstream structure (such as SEI or VUI) and transmitted from the video encoding apparatus to the video decoding apparatus. The bitstream including the downsampling parameters is provided to the neighbor information decoder 912.
The neighbor information decoder 912 decodes the downsampling parameters from the bitstream. Further, the neighbor information decoder 912 obtains upsampling information for upsampling the restored video to the resolution of the original video based on the downsampling parameters. Here, the upsampling information may include the upsampling filter to be used when upsampling is performed and the video ratio between the downsampled restored video and the original video.
The upsampler 914 generates a final restored video having the resolution of the original video from the restored video based on the upsampling information.
As another embodiment, the upsampler 914 may perform super-resolution (SR) to upsample the restored video to the resolution of the original video. To this end, the upsampler 914 may perform SR using a deep-learning-based neural network. In this case, the neural network for SR may be composed of a combination of a plurality of convolution layers, pooling layers, activation function layers, and the like.
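The following sketch shows one possible form of such an SR network, using a few convolution and activation layers followed by a pixel-shuffle upscaling layer (a common design, not the specific network of the present disclosure); the layer count, channel width, and scale factor are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TinySR(nn.Module):
    """A few conv + ReLU layers followed by PixelShuffle upscaling."""
    def __init__(self, scale: int = 2, channels: int = 32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(1, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, scale * scale, 3, padding=1),
        )
        self.upscale = nn.PixelShuffle(scale)  # rearranges channels into spatial samples

    def forward(self, x):                      # x: (N, 1, H, W) luma, values in [0, 1]
        return self.upscale(self.body(x))      # -> (N, 1, scale*H, scale*W)
```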
Hereinafter, a video encoding method and a video decoding method according to Example 1 are described using the illustrations of figs. 10 and 11.
Fig. 10 is a schematic diagram illustrating a video encoding method including preprocessing according to an embodiment of the present disclosure.
The video encoding apparatus analyzes the input video to determine the characteristics of the noise (S1000). Here, the noise characteristic may be a Gaussian distribution or a uniform distribution.
The video encoding apparatus estimates the noise parameters (S1002). The video encoding apparatus reflects the analyzed noise characteristics to generate the parameters of the noise removed from the original input video. Further, the noise parameters may include a noise generation method corresponding to the noise characteristics. Here, the noise generation method may be used later by the video decoding apparatus. As another embodiment, the noise generation method may be agreed in advance between the video encoding apparatus and the video decoding apparatus according to the noise characteristics.
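As an illustrative sketch, assuming additive zero-mean Gaussian noise, the noise standard deviation could be estimated from the median absolute deviation of a high-pass residual, a common robust estimator; the estimator and its constants are assumptions, not the estimation method prescribed by the present disclosure.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def estimate_gaussian_noise_std(frame: np.ndarray) -> float:
    """Robust noise level estimate from the high-pass residual (MAD-based)."""
    f = frame.astype(np.float64)
    residual = f - gaussian_filter(f, sigma=1.5)          # approximate high-pass signal
    mad = np.median(np.abs(residual - np.median(residual)))
    return float(1.4826 * mad)                            # MAD -> std for Gaussian noise
```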
The video encoding apparatus pre-processes the input video by removing noise from it using the noise parameters (S1004). Based on the analysis of the input video, the video encoding apparatus may use a noise reduction method corresponding to the characteristics of the noise contained in the input video.
Alternatively, the noise contained in the input video may be removed using a predefined pixel operation without analyzing the noise type. In this case, the predefined pixel operation may represent various types of filtering, such as low-pass filtering, bilateral filtering, and bilinear filtering. The video encoding apparatus may selectively use one of these filtering methods.
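A minimal sketch of such a predefined pixel operation, here simple Gaussian low-pass filtering, is shown below; the filter type and strength are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def denoise_lowpass(frame: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """Predefined pixel operation: simple Gaussian low-pass filtering."""
    out = gaussian_filter(frame.astype(np.float64), sigma=sigma)
    return np.clip(out, 0, 255).astype(frame.dtype)
```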
The video encoding apparatus encodes the preprocessed input video to generate a bitstream (S1006).
The video encoding apparatus encodes the parameters of noise and combines the encoded parameters with a bitstream (S1008). The video encoding device may send the bitstream to a video decoding device.
Fig. 11 is an explanatory diagram showing a video decoding method including post-processing according to an embodiment of the present disclosure.
The video decoding apparatus decodes the bitstream to generate restored video (S1100).
The video decoding apparatus decodes parameters of noise from the bitstream (S1102). As described above, the noise parameters may include a noise generation method. Alternatively, a noise generation method agreed in advance between the video encoding apparatus and the video decoding apparatus may be used.
Meanwhile, the noise parameters may be included in a separate bitstream structure (such as SEI or VUI) and transmitted from the video encoding apparatus to the video decoding apparatus.
The video decoding apparatus post-processes the restored video using the noise parameters to generate noise (S1104).
The video decoding apparatus generates a final restored video using the noise and the restored video (S1106). For example, the noise may be added to the restored video in the form of an offset. Alternatively, the noise may be added by applying predefined filtering to the restored video to generate the final restored video.
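A minimal sketch of the decoder-side steps S1104 and S1106 is given below, assuming the signalled noise parameters describe zero-mean Gaussian noise; the random seed and the offset-style addition are illustrative assumptions.

```python
import numpy as np

def add_synthetic_noise(restored: np.ndarray, noise_std: float, seed: int = 0) -> np.ndarray:
    """Regenerate zero-mean Gaussian noise and add it to the restored video as an offset."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, noise_std, size=restored.shape)
    return np.clip(restored.astype(np.float64) + noise, 0, 255).astype(restored.dtype)
```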
Hereinafter, a video encoding method and a video decoding method according to Example 2 are described using the illustrations of figs. 12 and 13.
Fig. 12 is a schematic diagram illustrating a video encoding method including preprocessing according to another embodiment of the present disclosure.
The video encoding apparatus analyzes the input video in terms of perceptual quality and determines removable elements according to a perceptual model (S1200). Here, a perceptually removable element denotes a change that the human visual system cannot identify when human visual characteristics are taken into account. Examples of such perceptual visual characteristics include the CSF effect, the CM or TM effect, and the LA effect.
Meanwhile, the perceptual model is a model that reflects the perceptual visual characteristics in terms of perceptual quality. The JND model is a representative perceptual model. Here, the JND represents the minimum distortion at which human vision begins to perceive a difference.
The video encoding apparatus estimates the parameters of the perceptual model (S1202). For example, when the perceptual model is the JND model, the video encoding apparatus may obtain a threshold for the pixel values of the input video in terms of perceptual quality. Here, the threshold represents the maximum value by which the pixel values of the input video can be changed in terms of perceptual quality. The perceptual model parameters may include such thresholds.
In addition, the video encoding apparatus may select a video quality compensation method corresponding to the perceptual model and may include the video quality compensation method among the perceptual model parameters. Here, the video quality compensation method may be used later by the video decoding apparatus. As another embodiment, the video quality compensation method may be agreed in advance between the video encoding apparatus and the video decoding apparatus according to the perceptual visual characteristics.
The video encoding apparatus removes removable elements from the input video using parameters of the perceptual model to pre-process the input video (S1204).
The video encoding apparatus may generate the video from which the perceptually removable elements have been removed from the input video, using a method corresponding to the perceptual model. For example, when the perceptual model is the JND model, the video encoding apparatus may apply an operation to the input video within the limit of the above-described threshold. Here, the operation may be filtering, a convolution operation, or a change of pixel values using an offset.
The video encoding apparatus encodes the preprocessed input video to generate a bitstream (S1206).
The video encoding apparatus encodes the parameters of the perceptual model and combines the encoded parameters of the perceptual model with the bitstream (S1208). The video encoding device may transmit the bitstream to a video decoding device.
Fig. 13 is an explanatory diagram showing a video decoding method including post-processing according to another embodiment of the present disclosure.
The video decoding apparatus decodes the bitstream to generate restored video (S1300).
The video decoding apparatus decodes the parameters of the perceptual model from the bitstream (S1302). Here, the perceptual model is a model that reflects the perceptual visual characteristics in terms of perceptual quality.
Meanwhile, the perceptual model parameters may be included in a separate bitstream structure (such as SEI or VUI) and transmitted from the video encoding apparatus to the video decoding apparatus.
Meanwhile, the perceptual model parameters may include a video quality enhancement method. Alternatively, a video quality enhancement method agreed in advance between the video encoding apparatus and the video decoding apparatus may be used. When the perceptual model is the JND model, the perceptual model parameters may include a threshold, i.e., the maximum value by which the pixel values of the restored video can be changed in terms of perceptual quality.
The video decoding apparatus post-processes the restored video using the parameters of the perceptual model to generate an enhanced video (S1304).
When the perceptual model is the JND model, the video decoding apparatus may apply an operation to the restored video according to the video quality compensation method within the limit of the above-described threshold. Here, the operation may be filtering, a convolution operation, or a change of pixel values using an offset.
Alternatively, the video decoding apparatus may evaluate perceptually degraded portions and then enhance the video quality to generate the enhanced video. In this case, the video decoding apparatus may divide the restored video into N×N square blocks and post-process the restored video in units of N×N square blocks. In other words, the video decoding apparatus may apply a single perceptual model to the entire restored video to evaluate the perceptual degradation and enhance the video quality, or may divide the restored video into a plurality of blocks and perform the post-processing in units of blocks.
The video decoding apparatus generates a final restored video using the enhanced video and the restored video (S1306). The video decoding apparatus may output the enhanced video as a final restored video, or add the enhanced video to the restored video to output the resultant video as the final restored video.
Although the steps in the various flowcharts are described as being performed sequentially, these steps merely exemplify the technical concepts of some embodiments of the present disclosure. Accordingly, one of ordinary skill in the art to which the present disclosure pertains may perform the steps by changing the order depicted in the various figures or by performing more than two steps in parallel. Therefore, the steps in the respective flowcharts are not limited to the time series order shown.
It should be understood that the above description presents illustrative embodiments that may be implemented in various other ways. The functionality described in some embodiments may be implemented by hardware, software, firmware, and/or combinations thereof. It should also be understood that the functional components described in this specification are labeled as "... units" to strongly emphasize the possibility of their independent implementation.
Meanwhile, various methods or functions described in some embodiments may be implemented as instructions stored in a non-transitory recording medium that can be read and executed by one or more processors. For example, the non-transitory recording medium may include various types of recording devices in which data is stored in a form readable by a computer system. For example, the non-transitory recording medium may include a storage medium such as an erasable programmable read-only memory (EPROM), a flash memory drive, an optical disk drive, a magnetic hard disk drive, a Solid State Drive (SSD), and the like.
Although embodiments of the present disclosure have been described for illustrative purposes, those skilled in the art to which the present disclosure pertains will appreciate that various modifications, additions and substitutions are possible, without departing from the scope and spirit of the present disclosure. Accordingly, embodiments of the present disclosure have been described for brevity and clarity. The scope of the technical idea of the embodiments of the present disclosure is not limited by the drawings. Accordingly, it will be understood by those of ordinary skill in the art to which this disclosure pertains that the scope of this disclosure is not limited to the embodiments explicitly described above, but is limited by the claims and their equivalents.
(Reference numerals)
600. Encoder
610. Decoder
802. Perceptual quality pre-processor
804. Perceptual model analyzer
806. Perceptual model estimator
812. Perceptual quality enhancer
Cross Reference to Related Applications
The present application claims priority from Korean Patent Application No. 10-2021-0043648, filed on April 2, 2021, and Korean Patent Application No. 10-2022-0038959, filed on March 29, 2022, the respective disclosures of which are incorporated herein by reference in their entireties.

Claims (17)

1. A video decoding method performed by a video decoding apparatus, the video decoding method comprising:
generating a restored video by decoding a bitstream;
decoding parameters of a perceptual model from the bitstream, wherein the perceptual model is a model reflecting perceptual visual characteristics in terms of perceptual quality;
generating an enhanced video by post-processing the restored video using parameters of the perceptual model; and
generating a final restored video using the enhanced video and the restored video.
2. The video decoding method of claim 1, wherein the perceptual visual characteristics comprise some or all of: a contrast sensitivity function (CSF) effect, a contrast masking (CM) effect, a texture masking (TM) effect, and a luminance adaptation (LA) effect.
3. The video decoding method of claim 1, wherein the perceptual model is a model selected as being suited to a dominant characteristic among the perceptual visual characteristics.
4. The video decoding method of claim 1, wherein the parameters of the perceptual model comprise a video quality compensation method corresponding to the perceptual model.
5. The video decoding method of claim 4, wherein the perceptual model is a just noticeable difference (JND) model, wherein the JND is a minimum error at which human vision begins to perceive a difference.
6. The video decoding method of claim 5, wherein, when the perceptual model is the JND model, the parameters of the perceptual model further comprise a threshold of pixel values of the restored video in terms of perceptual quality, and
wherein the threshold indicates a maximum value to which the pixel values of the restored video can be changed in terms of perceptual quality.
7. The video decoding method of claim 6, wherein generating the enhanced video comprises:
applying an operation to the restored video within the limit of the threshold according to the video quality compensation method,
wherein the operation is filtering, a convolution operation, or a change of pixel values using an offset.
8. The video decoding method of claim 1, wherein generating the enhanced video comprises:
dividing the restored video into N×N square blocks; and
post-processing the restored video in units of N×N square blocks.
9. The video decoding method of claim 1, wherein generating the final restored video comprises:
outputting the enhanced video as the final restored video, or outputting video generated by adding the enhanced video to the restored video as the final restored video.
10. A video decoding apparatus comprising:
a decoder configured to decode a bitstream to generate a restored video and to decode, from the bitstream, a video quality enhancement method based on a perceptual model and parameters of the perceptual model, wherein the perceptual model is a model reflecting perceptual visual characteristics in terms of perceptual quality;
a perceptual quality enhancer configured to post-process the restored video using the perceptual model-based video quality enhancement method to generate an enhanced video; and
an adder configured to generate a final restored video using the enhanced video and the restored video.
11. A video encoding method performed by a video encoding apparatus, the video encoding method comprising:
determining a removable element from a perceptual model by analyzing an input video in terms of perceptual quality, wherein the perceptual model is a model reflecting perceptual visual characteristics in terms of perceptual quality;
estimating parameters of the perception model;
preprocessing the input video by removing the removable element from the input video using parameters of the perceptual model;
generating a bitstream by encoding the preprocessed input video; and
encoding parameters of the perceptual model, and combining the encoded parameters with the bitstream.
12. The video encoding method of claim 11, wherein the perceptual visual characteristics comprise some or all of: a contrast sensitivity function (CSF) effect, a contrast masking (CM) effect, a texture masking (TM) effect, and a luminance adaptation (LA) effect.
13. The video encoding method of claim 11, wherein determining the removable element comprises:
analyzing the input video to determine a dominant characteristic among the perceptual visual characteristics; and
selecting, as the perceptual model, a model suited to the dominant characteristic.
14. The video encoding method of claim 11, wherein the perceptual model is a just noticeable difference (JND) model, and
wherein the JND is a minimum error at which human vision begins to perceive a difference.
15. The video encoding method of claim 14, wherein estimating the parameters comprises:
selecting a video quality compensation method corresponding to the perceptual model; and
deriving, when the perceptual model is the JND model, a threshold of pixel values of the input video in terms of perceptual quality,
wherein the threshold is a maximum value to which the pixel values of the input video can be changed in terms of perceptual quality.
16. The video encoding method of claim 14, wherein preprocessing the input video comprises:
applying an operation to the input video within the limit of the threshold,
wherein the operation is filtering, a convolution operation, or a change of pixel values using an offset.
17. The video encoding method of claim 15, wherein the parameters of the perceptual model comprise the threshold and the video quality compensation method.