CN116918323A

CN116918323A - Video encoding and decoding method and apparatus for improving prediction signal of intra prediction

Info

Publication number: CN116918323A
Application number: CN202280019115.8A
Authority: CN
Inventors: 姜制远; 李订炅; 金娜莹; 朴胜煜
Original assignee: Hyundai Motor Co; Industry Collaboration Foundation of Ewha University; Kia Corp
Current assignee: Hyundai Motor Co; Industry Collaboration Foundation of Ewha University; Kia Corp
Priority date: 2021-03-04
Filing date: 2022-03-03
Publication date: 2023-10-20

Abstract

A video encoding and decoding method and apparatus are provided for improving a prediction signal in intra prediction. To reduce the amount of data in the residual signal to be encoded, the present embodiment uses a deep learning model based on variable and fixed coefficients to generate, from the intra-predicted prediction signal, an improved prediction signal that is close to the original video signal.

Description

Video encoding and decoding method and apparatus for improving prediction signal of intra prediction

Technical Field

The present invention relates to a video encoding and decoding method and apparatus for improving a predicted signal in intra prediction.

Background

The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.

Since video data has a larger data amount than audio data or still image data, the video data requires a large amount of hardware resources (including a memory) to store or transmit the video data that is not subjected to compression processing.

Accordingly, an encoder is typically used to compress video data before it is stored or transmitted. A decoder receives the compressed video data, decompresses it, and plays the decompressed video data. Video compression techniques include H.264/AVC, High Efficiency Video Coding (HEVC), and Versatile Video Coding (VVC), which improves coding efficiency over HEVC by about 30% or more.

However, as the image size, resolution, and frame rate gradually increase, the amount of data to be encoded is also increasing. Accordingly, a new compression technique providing higher codec efficiency and improved image enhancement effect compared to the existing compression technique is needed.

In recent years, image processing techniques based on deep learning have been applied to existing elementary coding techniques. Applying deep learning-based image processing to existing coding techniques (in particular, compression techniques such as inter prediction, in-loop filtering, and transform) can improve coding efficiency. Representative applications include inter prediction based on virtual reference frames generated by a deep learning model, and in-loop filtering based on a denoising model. Therefore, deep learning-based image processing needs to be employed further to improve the coding efficiency of image encoding and decoding.

Disclosure of Invention

Technical problem

In some embodiments, the present invention seeks to provide a video coding method and apparatus for generating an improved prediction signal, which approximates the original video signal, from the intra-predicted prediction signal using a variable and fixed coefficient based deep learning model, to reduce the amount of data of the residual signal to be encoded.

Technical solution

At least one aspect of the present invention provides a video decoding apparatus. The apparatus includes an entropy decoder configured to decode an intra prediction mode and residual values of a current block from a bitstream, and to decode or determine an improvement flag. The improvement flag indicates whether a deep learning-based improvement model is applied in intra prediction of the current block. The apparatus also includes an intra predictor configured to generate a prediction block of the current block using the intra prediction mode. The apparatus further includes a signal improvement unit configured to generate an improved prediction block from the prediction block using the improvement model when the improvement flag is 1. The apparatus further includes an adder configured to generate a restored block of the current block by adding the residual values to the improved prediction block when the improvement flag is 1, or by adding the residual values to the prediction block when the improvement flag is 0.

Another aspect of the present invention provides a video decoding method for intra prediction of a current block, performed by a video decoding apparatus. The method includes decoding an intra prediction mode and residual values of the current block from a bitstream, and decoding or determining an improvement flag. The improvement flag indicates whether a deep learning-based improvement model is applied in intra prediction of the current block. The method further includes generating a prediction block of the current block using the intra prediction mode. The method further includes generating a restored block of the current block based on the improvement flag. When the improvement flag is 1, generating the restored block includes generating an improved prediction block from the prediction block using the improvement model, and generating the restored block by adding the residual values to the improved prediction block. When the improvement flag is 0, generating the restored block includes generating the restored block by adding the residual values to the prediction block.

Yet another aspect of the present invention provides a video encoding method for intra prediction of a current block, performed by a video encoding apparatus. The method includes obtaining an intra prediction mode of the current block, and obtaining or determining an improvement flag. The improvement flag indicates whether a deep learning-based improvement model is applied in intra prediction of the current block. The method further includes generating a prediction block of the current block using the intra prediction mode. The method further includes generating a residual block of the current block based on the improvement flag. When the improvement flag is 1, generating the residual block includes generating an improved prediction block from the prediction block using the improvement model, and generating the residual block by subtracting the improved prediction block from the current block. When the improvement flag is 0, generating the residual block includes generating the residual block by subtracting the prediction block from the current block.
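The flag-dependent behavior of the encoding and decoding methods above can be sketched as follows. This is a minimal illustration only: `toy_model` is a hypothetical stand-in for the deep learning-based improvement model, the function names are assumptions, and transform and quantization are omitted so that the round trip is lossless.

```python
from typing import Callable, List

Block = List[List[int]]

def toy_model(pred: Block) -> Block:
    """Hypothetical stand-in for the improvement model: adds 1 to each sample."""
    return [[p + 1 for p in row] for row in pred]

def select_prediction(pred: Block, improve_flag: int,
                      model: Callable[[Block], Block]) -> Block:
    """Use the improved prediction block only when the improvement flag is 1."""
    return model(pred) if improve_flag == 1 else pred

def encode_residual(cur: Block, pred: Block, improve_flag: int,
                    model: Callable[[Block], Block]) -> Block:
    """Encoder side: residual = current block - (improved) prediction block."""
    ref = select_prediction(pred, improve_flag, model)
    return [[c - r for c, r in zip(c_row, r_row)] for c_row, r_row in zip(cur, ref)]

def decode_block(residual: Block, pred: Block, improve_flag: int,
                 model: Callable[[Block], Block]) -> Block:
    """Decoder side: restored block = residual + (improved) prediction block."""
    ref = select_prediction(pred, improve_flag, model)
    return [[e + r for e, r in zip(e_row, r_row)] for e_row, r_row in zip(residual, ref)]
```

In this toy setting, a prediction block that is uniformly one below the current block yields an all-zero residual when the flag is 1, which illustrates how a better prediction reduces the residual data to be encoded.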

Advantageous effects

As described above, the present invention provides a video encoding and decoding method and apparatus for generating an improved prediction signal, which approximates an original video signal, from an intra-predicted prediction signal using a variable and fixed coefficient-based deep learning model, to reduce the data amount of a residual signal to be encoded and to improve encoding and decoding efficiency.

Drawings

Fig. 1 is a block diagram of a video encoding device in which the techniques of the present invention may be implemented.

Fig. 2 illustrates a method of partitioning a block using a quadtree plus binary tree ternary tree (QTBTTT) structure.

Fig. 3a and 3b illustrate a plurality of intra prediction modes including a wide-angle intra prediction mode.

Fig. 4 shows neighboring blocks of the current block.

Fig. 5 is a block diagram of a video decoding apparatus in which the techniques of the present invention may be implemented.

Fig. 6 shows the transmission of the coding mode of intra prediction.

Fig. 7 is a block diagram of a video encoding apparatus that uses improvement of an intra prediction signal according to one embodiment of the present invention.

Fig. 8 is a block diagram of a video decoding apparatus that uses improvement of an intra prediction signal according to one embodiment of the present invention.

Fig. 9 shows an improvement model including a fixed coefficient network according to one embodiment of the present invention.

Fig. 10 shows an improvement model including a fixed coefficient network according to another embodiment of the present invention.

Fig. 11 illustrates the operation of the improvement model according to one embodiment of the present invention.

Fig. 12 shows an improvement model including a variable coefficient network according to one embodiment of the present invention.

Fig. 13 shows an improvement model including a variable coefficient network according to another embodiment of the present invention.

Fig. 14 is a flowchart illustrating a video encoding method that uses improvement of an intra prediction signal according to one embodiment of the present invention.

Fig. 15 is a flowchart illustrating a video decoding method that uses improvement of an intra prediction signal according to one embodiment of the present invention.

Detailed Description

Hereinafter, some embodiments of the present invention will be described in detail with reference to the accompanying illustrative drawings. In the following description, like reference numerals denote like elements, even when the elements are shown in different drawings. Furthermore, in the following description of some embodiments, detailed descriptions of related known components and functions are omitted for clarity and conciseness where they would obscure the subject matter of the present invention.

Fig. 1 is a block diagram of a video encoding device in which the techniques of the present invention may be implemented. Hereinafter, a video encoding apparatus and components of the apparatus are described with reference to the diagram of fig. 1.

The encoding apparatus may include: an image divider 110, a predictor 120, a subtractor 130, a transformer 140, a quantizer 145, a reordering unit 150, an entropy encoder 155, an inverse quantizer 160, an inverse transformer 165, an adder 170, a loop filtering unit 180, and a memory 190.

Each component of the encoding apparatus may be implemented as hardware or software, or as a combination of hardware and software. In addition, the function of each component may be implemented as software, and the microprocessor may also be implemented to execute the function of the software corresponding to each component.

A video is made up of one or more sequences, each comprising a plurality of images. Each image is divided into a plurality of regions, and encoding is performed on each region. For example, an image is partitioned into one or more tiles and/or slices. Here, one or more tiles may be defined as a tile group. Each tile and/or slice is partitioned into one or more Coding Tree Units (CTUs). In addition, each CTU is partitioned into one or more Coding Units (CUs) by a tree structure. Information applied to each CU is encoded as syntax of the CU, and information commonly applied to the CUs included in one CTU is encoded as syntax of the CTU. In addition, information commonly applied to all blocks in one slice is encoded as syntax of a slice header, and information applied to all blocks constituting one or more images is encoded in a Picture Parameter Set (PPS) or a picture header. Furthermore, information commonly referenced by a plurality of images is encoded in a Sequence Parameter Set (SPS). In addition, information commonly referenced by one or more SPSs is encoded in a Video Parameter Set (VPS). Furthermore, information commonly applied to one tile or tile group may also be encoded as syntax of a tile header or tile group header. The syntax included in the SPS, PPS, slice header, tile header, or tile group header may be referred to as high-level syntax.

The image divider 110 determines the size of a Coding Tree Unit (CTU). Information about the size of the CTU (CTU size) is encoded as a syntax of the SPS or PPS and transmitted to the video decoding apparatus.

The image divider 110 divides each image constituting a video into a plurality of Coding Tree Units (CTUs) having a predetermined size, and then recursively divides the CTUs by using a tree structure. Leaf nodes in the tree structure become Coding Units (CUs), which are the basic units of coding.

The tree structure may be a Quadtree (QT), in which a higher node (or parent node) is split into four lower nodes (or child nodes) of the same size. The tree structure may also be a Binary Tree (BT), in which a higher node is split into two lower nodes. The tree structure may also be a Ternary Tree (TT), in which a higher node is split into three lower nodes at a ratio of 1:2:1. The tree structure may also be a structure in which two or more of the QT, BT, and TT structures are mixed. For example, a quadtree plus binary tree (QTBT) structure may be used, or a quadtree plus binary tree ternary tree (QTBTTT) structure may be used. Here, BT and TT together may be referred to as a multiple-type tree (MTT).
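The child-block geometry of the QT, BT, and TT splits described above can be sketched as follows. The mode names (`"QT"`, `"BT_VER"`, and so on) and the `(x, y, width, height)` tuple layout are assumptions made for this illustration; "VER" here means the split line is vertical, so the children sit side by side.

```python
from typing import List, Tuple

Rect = Tuple[int, int, int, int]  # (x, y, width, height)

def split_block(x: int, y: int, w: int, h: int, mode: str) -> List[Rect]:
    """Return the child blocks produced by one split of the block at (x, y)."""
    if mode == "QT":  # four equally sized children
        hw, hh = w // 2, h // 2
        return [(x, y, hw, hh), (x + hw, y, hw, hh),
                (x, y + hh, hw, hh), (x + hw, y + hh, hw, hh)]
    if mode == "BT_VER":  # two children, left and right
        return [(x, y, w // 2, h), (x + w // 2, y, w // 2, h)]
    if mode == "BT_HOR":  # two children, top and bottom
        return [(x, y, w, h // 2), (x, y + h // 2, w, h // 2)]
    if mode == "TT_VER":  # three children with widths in ratio 1:2:1
        q = w // 4
        return [(x, y, q, h), (x + q, y, 2 * q, h), (x + 3 * q, y, q, h)]
    if mode == "TT_HOR":  # three children with heights in ratio 1:2:1
        q = h // 4
        return [(x, y, w, q), (x, y + q, w, 2 * q), (x, y + 3 * q, w, q)]
    raise ValueError(f"unknown split mode: {mode}")
```

For example, a vertical ternary split of a 32x32 block yields children of widths 8, 16, and 8.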

Fig. 2 is a schematic diagram for describing a method of dividing a block by using the QTBTTT structure.

As shown in Fig. 2, the CTU may first be partitioned in the QT structure. Quadtree partitioning may be repeated recursively until the size of the partitioned block reaches the minimum block size (MinQTSize) of leaf nodes allowed in QT. A first flag (qt_split_flag) indicating whether each node of the QT structure is split into four lower-layer nodes is encoded by the entropy encoder 155 and signaled to the video decoding apparatus. When a leaf node of QT is not greater than the maximum block size (MaxBTSize) of the root node allowed in BT, the leaf node may be further split in at least one of the BT structure or the TT structure. There may be multiple split directions in the BT structure and/or the TT structure. For example, there may be two directions, i.e., a direction in which the block of the corresponding node is split horizontally and a direction in which it is split vertically. As shown in Fig. 2, when MTT splitting starts, a second flag (mtt_split_flag) indicating whether a node is split and, if the node is split, a flag indicating the split direction (vertical or horizontal) and/or a flag indicating the split type (binary or ternary) are encoded by the entropy encoder 155 and signaled to the video decoding apparatus.

Alternatively, a CU partition flag (split_cu_flag) indicating whether a node is split may be encoded before the first flag (qt_split_flag) indicating whether the node is split into four lower-layer nodes. When the value of the CU partition flag (split_cu_flag) indicates that the node is not split, the block of the corresponding node becomes a leaf node in the partition tree structure and becomes a CU, which is the basic unit of encoding. When the value of the CU partition flag (split_cu_flag) indicates that the node is split, the video encoding apparatus starts encoding from the first flag in the manner described above.

When QTBT is used as another example of the tree structure, there may be two types, i.e., a type of horizontally dividing a block of a corresponding node into two blocks having the same size (i.e., symmetrical horizontal division) and a type of vertically dividing a block of a corresponding node into two blocks having the same size (i.e., symmetrical vertical division). A partition flag (split_flag) indicating whether each node of the BT structure is partitioned into lower-layer blocks and partition type information indicating a partition type are encoded by the entropy encoder 155 and transmitted to the video decoding apparatus. On the other hand, there may additionally be a type in which a block of a corresponding node is divided into two blocks in an asymmetric form to each other. The asymmetric form may include a form in which a block of a corresponding node is divided into two rectangular blocks having a size ratio of 1:3, or may also include a form in which a block of a corresponding node is divided in a diagonal direction.

A CU may have various sizes depending on the QTBT or QTBTTT partitioning of the CTU. Hereinafter, a block corresponding to a CU to be encoded or decoded (i.e., a leaf node of QTBTTT) is referred to as a "current block". When QTBTTT partitioning is employed, the current block may be rectangular as well as square.

The predictor 120 predicts the current block to generate a predicted block. Predictor 120 includes an intra predictor 122 and an inter predictor 124.

In general, each of the current blocks in an image may be predictively encoded. Prediction of a current block may be performed by using an intra prediction technique, which uses data from the image including the current block, or an inter prediction technique, which uses data from an image encoded before the image including the current block. Inter prediction includes both unidirectional prediction and bi-directional prediction.

The intra predictor 122 predicts pixels in the current block by using pixels (reference pixels) located adjacent to the current block in the current image including the current block. Depending on the prediction direction, there are multiple intra prediction modes. For example, as shown in fig. 3a, the plurality of intra prediction modes may include two non-directional modes including a Planar (Planar) mode and a DC mode, and may include 65 directional modes. The neighboring pixels and algorithm equations to be used are defined differently according to each prediction mode.
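As a rough illustration of how reference pixels produce a prediction block, the sketch below implements two simple cases: a DC-style predictor and a vertical directional predictor. The DC averaging rule used here (mean of all top and left reference pixels with rounding) is a simplifying assumption for illustration and is not taken from this document or from any particular standard.

```python
from typing import List

def dc_predict(top: List[int], left: List[int]) -> List[List[int]]:
    """Fill the block with the rounded mean of the top and left reference pixels."""
    refs = list(top) + list(left)
    dc = (sum(refs) + len(refs) // 2) // len(refs)
    return [[dc] * len(top) for _ in range(len(left))]

def vertical_predict(top: List[int], height: int) -> List[List[int]]:
    """Copy the top reference row downward (a purely vertical directional mode)."""
    return [list(top) for _ in range(height)]
```

Here `top` holds the reference pixels directly above the block (one per column) and `left` holds those directly to its left (one per row).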

For efficient directional prediction of a current block having a rectangular shape, the directional modes indicated by dotted arrows in Fig. 3b (intra prediction modes #67 to #80 and #-1 to #-14) may be additionally used. These directional modes may be referred to as "wide-angle intra prediction modes". In Fig. 3b, the arrows indicate the reference samples used for prediction rather than the prediction direction; the prediction direction is opposite to the direction indicated by the arrow. When the current block has a rectangular shape, a wide-angle intra prediction mode is a mode in which prediction is performed in the direction opposite to a specific directional mode without additional bit transmission. In this case, which of the wide-angle intra prediction modes are available for the current block may be determined by the ratio of the width to the height of the rectangular current block. For example, when the current block has a rectangular shape whose height is smaller than its width, the wide-angle intra prediction modes having angles smaller than 45 degrees (intra prediction modes #67 to #80) are available. When the current block has a rectangular shape whose height is greater than its width, the wide-angle intra prediction modes having angles greater than -135 degrees (intra prediction modes #-1 to #-14) are available.

The intra predictor 122 may determine the intra prediction mode to be used for encoding the current block. In some examples, the intra predictor 122 may encode the current block by using several intra prediction modes and select an appropriate intra prediction mode from among the tested modes. For example, the intra predictor 122 may calculate rate-distortion values by rate-distortion analysis of the several tested intra prediction modes and select the intra prediction mode having the best rate-distortion characteristics among the tested modes.
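One common way to realize such a rate-distortion comparison is a Lagrangian cost J = D + lambda * R. The sketch below assumes that form, uses the sum of squared differences as the distortion D, and takes the bit cost R of each mode as a caller-supplied estimate; none of these choices is specified in this document.

```python
from typing import Dict, List, Tuple

Block = List[List[int]]

def ssd(a: Block, b: Block) -> int:
    """Sum of squared differences between two equally sized blocks."""
    return sum((p - q) ** 2 for ra, rb in zip(a, b) for p, q in zip(ra, rb))

def choose_mode(cur: Block, candidates: Dict[str, Tuple[Block, int]], lam: float) -> str:
    """Return the mode with the lowest cost J = D + lam * R.

    candidates maps a mode name to (prediction block, estimated bits).
    """
    return min(candidates,
               key=lambda m: ssd(cur, candidates[m][0]) + lam * candidates[m][1])
```

A larger `lam` favors modes that are cheap to signal, while a smaller `lam` favors modes with lower distortion.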

The intra predictor 122 selects one intra prediction mode among a plurality of intra prediction modes, and predicts the current block by using neighboring pixels (reference pixels) determined according to the selected intra prediction mode and an algorithm equation. Information about the selected intra prediction mode is encoded by the entropy encoder 155 and transmitted to a video decoding device.

The inter predictor 124 generates a prediction block of the current block by using a motion compensation process. The inter predictor 124 searches for a block most similar to the current block in a reference picture that has been encoded and decoded earlier than the current picture, and generates a predicted block of the current block by using the searched block. In addition, a Motion Vector (MV) is generated, which corresponds to a displacement (displacement) between a current block in the current image and a prediction block in the reference image. In general, motion estimation is performed on a luminance (luma) component, and a motion vector calculated based on the luminance component is used for both the luminance component and the chrominance component. Motion information including information of the reference picture and information on a motion vector for predicting the current block is encoded by the entropy encoder 155 and transmitted to a video decoding device.
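The search for the most similar block can be sketched as an exhaustive (full) search over a small window. The sum of absolute differences (SAD) as the similarity measure and the function names are assumptions for this illustration; practical encoders typically use faster search strategies.

```python
from typing import List, Tuple

Frame = List[List[int]]

def sad(cur: Frame, ref: Frame, rx: int, ry: int) -> int:
    """SAD between the current block and the reference block whose top-left is (rx, ry)."""
    return sum(abs(cur[j][i] - ref[ry + j][rx + i])
               for j in range(len(cur)) for i in range(len(cur[0])))

def motion_search(cur: Frame, ref: Frame, bx: int, by: int,
                  search_range: int) -> Tuple[Tuple[int, int], int]:
    """Return ((dx, dy), cost) of the best match within +/- search_range samples.

    (bx, by) is the top-left position of the current block in its own picture.
    """
    h, w = len(cur), len(cur[0])
    best_cost, best_mv = None, (0, 0)
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = bx + dx, by + dy
            # Skip candidates that fall outside the reference picture.
            if rx < 0 or ry < 0 or rx + w > len(ref[0]) or ry + h > len(ref):
                continue
            cost = sad(cur, ref, rx, ry)
            if best_cost is None or cost < best_cost:
                best_cost, best_mv = cost, (dx, dy)
    return best_mv, best_cost
```

The returned displacement plays the role of the motion vector, and the matched reference block serves as the prediction block.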

The inter predictor 124 may also interpolate reference pictures or reference blocks to increase the accuracy of prediction. In other words, sub-samples between two consecutive integer samples are interpolated by applying filter coefficients to a plurality of consecutive integer samples that include the two integer samples. When the search for the block most similar to the current block is performed on the interpolated reference picture, the motion vector may be expressed with fractional-sample precision rather than integer-sample precision. The precision or resolution of the motion vector may be set differently for each target region to be encoded, e.g., a unit such as a slice, tile, CTU, or CU. When such adaptive motion vector resolution (AMVR) is applied, information on the motion vector resolution to be applied to each target region should be signaled for each target region. For example, when the target region is a CU, information on the motion vector resolution applied to each CU is signaled. The information on the motion vector resolution may be information representing the precision of a motion vector difference, which is described below.
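A one-dimensional half-sample interpolation can be sketched as follows. The 8-tap coefficients below are the half-sample luma filter used in HEVC, chosen here only as a concrete example; the edge clamping and the 8-bit output range are assumptions of this sketch, and this document does not specify a particular filter.

```python
from typing import List

# 8-tap half-sample filter; the taps sum to 64, so the result is shifted by 6 bits.
HALF_PEL_TAPS = (-1, 4, -11, 40, 40, -11, 4, -1)

def half_sample(row: List[int], i: int) -> int:
    """Interpolate the sample halfway between row[i] and row[i + 1].

    Positions outside the row are clamped to the nearest edge sample.
    """
    acc = 0
    for k, tap in enumerate(HALF_PEL_TAPS):
        pos = min(max(i - 3 + k, 0), len(row) - 1)
        acc += tap * row[pos]
    # Round, normalize by 64, and clip to the 8-bit sample range.
    return min(max((acc + 32) >> 6, 0), 255)
```

On a linear ramp the filter returns the midpoint of the two neighboring integer samples, and on a flat signal it returns the signal value unchanged.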

On the other hand, the inter predictor 124 may perform inter prediction by using bi-directional prediction. In the case of bi-directional prediction, two reference pictures and two motion vectors, each representing the position of the block most similar to the current block in the respective reference picture, are used. The inter predictor 124 selects a first reference picture and a second reference picture from reference picture list 0 (RefPicList0) and reference picture list 1 (RefPicList1), respectively. The inter predictor 124 also searches for the block most similar to the current block in each reference picture to generate a first reference block and a second reference block. A prediction block of the current block is then generated by averaging or weighted-averaging the first reference block and the second reference block. Motion information including information on the two reference pictures used to predict the current block and information on the two motion vectors is transmitted to the entropy encoder 155. Here, reference picture list 0 may be composed of pictures that precede the current picture in display order among the previously restored pictures, and reference picture list 1 may be composed of pictures that follow the current picture in display order among the previously restored pictures. However, the lists are not limited thereto: previously restored pictures that follow the current picture in display order may additionally be included in reference picture list 0, and conversely, previously restored pictures that precede the current picture may additionally be included in reference picture list 1.
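The averaging or weighted averaging of the two reference blocks can be sketched as follows. The integer weights and the rounding rule are assumptions made for this illustration.

```python
from typing import List

Block = List[List[int]]

def bi_predict(ref0: Block, ref1: Block, w0: int = 1, w1: int = 1) -> Block:
    """Combine two reference blocks sample by sample with rounding.

    With the default weights this is a plain average; other positive integer
    weights give a weighted average.
    """
    total = w0 + w1
    return [[(w0 * a + w1 * b + total // 2) // total for a, b in zip(r0, r1)]
            for r0, r1 in zip(ref0, ref1)]
```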

In order to minimize the amount of bits consumed for encoding motion information, various methods may be used.

For example, when a reference image and a motion vector of a current block are identical to those of a neighboring block, information capable of identifying the neighboring block is encoded to transmit motion information of the current block to a video decoding apparatus. This method is called merge mode (merge mode).

In the merge mode, the inter predictor 124 selects a predetermined number of merge candidate blocks (hereinafter, referred to as "merge candidates") from neighboring blocks of the current block.

As the neighboring blocks used to derive the merge candidates, all or some of the left block A0, the lower left block A1, the upper block B0, the upper right block B1, and the upper left block B2 adjacent to the current block in the current image may be used, as shown in Fig. 4. Besides the current picture in which the current block is located, a block located within a reference picture (which may be the same as or different from the reference picture used to predict the current block) may also be used as a merge candidate. For example, a co-located block of the current block within a reference picture, or a block adjacent to the co-located block, may additionally be used as a merge candidate. If the number of merge candidates selected by the above method is less than a preset number, zero vectors are added to the merge candidates.
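The candidate collection and zero-vector padding described above can be sketched as follows. For simplicity, each candidate is reduced to a motion vector tuple (reference picture information is omitted), unavailable neighbors are represented as `None`, and duplicate candidates are skipped; the duplicate pruning is an assumption of this sketch rather than something stated above.

```python
from typing import List, Optional, Tuple

MV = Tuple[int, int]

def build_merge_list(neighbors: List[Optional[MV]], max_candidates: int) -> List[MV]:
    """Build a merge list of exactly max_candidates entries.

    neighbors holds the motion vectors of the spatial/temporal neighboring
    blocks in checking order, or None where a neighbor is unavailable.
    """
    merge_list: List[MV] = []
    for mv in neighbors:
        if mv is None or mv in merge_list:  # skip unavailable or duplicate candidates
            continue
        merge_list.append(mv)
        if len(merge_list) == max_candidates:
            return merge_list
    # Pad with zero vectors when too few candidates were found.
    while len(merge_list) < max_candidates:
        merge_list.append((0, 0))
    return merge_list
```

The encoder would then signal only the index of the chosen entry in this list, since the decoder can rebuild the same list from already decoded neighbors.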

The inter predictor 124 configures a merge list including a predetermined number of merge candidates by using neighboring blocks. A merge candidate to be used as motion information of the current block is selected from among the merge candidates included in the merge list, and merge index information for identifying the selected candidate is generated. The generated merging index information is encoded by the entropy encoder 155 and transmitted to a video decoding apparatus.

The merge skip mode is a special case of the merge mode. After quantization, when all transform coefficients used for entropy coding are near zero, only neighboring block selection information is transmitted without transmitting a residual signal. By using the merge skip mode, relatively high encoding efficiency can be achieved for images with slight motion, still images, screen content images, and the like.

Hereinafter, the merge mode and the merge skip mode are collectively referred to as a merge/skip mode.

Another method for encoding motion information is advanced motion vector prediction (advanced motion vector prediction, AMVP) mode.

In the AMVP mode, the inter predictor 124 derives motion vector prediction candidates for the motion vector of the current block by using neighboring blocks of the current block. As the neighboring blocks used to derive the motion vector prediction candidates, all or some of the left block A0, the lower left block A1, the upper block B0, the upper right block B1, and the upper left block B2 adjacent to the current block in the current image shown in Fig. 4 may be used. Besides the current picture in which the current block is located, a block located within a reference picture (which may be the same as or different from the reference picture used to predict the current block) may also be used as a neighboring block for deriving the motion vector prediction candidates. For example, a co-located block of the current block within the reference picture, or a block adjacent to the co-located block, may be used. If the number of motion vector candidates selected by the above method is less than a preset number, zero vectors are added to the motion vector candidates.

The inter predictor 124 derives a motion vector prediction candidate by using the motion vector of the neighboring block, and determines a motion vector prediction of the motion vector of the current block by using the motion vector prediction candidate. In addition, a motion vector difference is calculated by subtracting a motion vector prediction from a motion vector of the current block.

Motion vector prediction may be obtained by applying a predefined function (e.g., median and average calculations, etc.) to the motion vector prediction candidates. In this case, the video decoding device is also aware of the predefined function. Further, since the neighboring block used to derive the motion vector prediction candidates is a block for which encoding and decoding have been completed, the video decoding apparatus may also already know the motion vector of the neighboring block. Therefore, the video encoding device does not need to encode information for identifying motion vector prediction candidates. Accordingly, in this case, information on a motion vector difference and information on a reference image for predicting a current block are encoded.
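The predefined-function variant described above can be sketched with a component-wise median as the predictor. The choice of the median and the function names are assumptions for this illustration; the point is that the encoder and decoder apply the same function to the same candidates, so only the difference needs to be transmitted.

```python
from statistics import median_low
from typing import List, Tuple

MV = Tuple[int, int]

def predict_mv(candidates: List[MV]) -> MV:
    """Component-wise median of the candidate motion vectors."""
    return (median_low([c[0] for c in candidates]),
            median_low([c[1] for c in candidates]))

def encode_mvd(mv: MV, candidates: List[MV]) -> MV:
    """Encoder side: motion vector difference = motion vector - prediction."""
    px, py = predict_mv(candidates)
    return (mv[0] - px, mv[1] - py)

def decode_mv(mvd: MV, candidates: List[MV]) -> MV:
    """Decoder side: motion vector = prediction + decoded difference."""
    px, py = predict_mv(candidates)
    return (mvd[0] + px, mvd[1] + py)
```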

On the other hand, motion vector prediction may also be determined by selecting a scheme of any one of the motion vector prediction candidates. In this case, the information for identifying the selected motion vector prediction candidates is additionally encoded together with the information about the motion vector difference and the information about the reference picture for predicting the current block.

The subtractor 130 generates a residual block by subtracting the prediction block generated by the intra predictor 122 or the inter predictor 124 from the current block.

The transformer 140 transforms a residual signal in a residual block having pixel values of the spatial domain into transform coefficients of the frequency domain. The transformer 140 may transform the residual signal in the residual block by using the entire residual block as a transform unit, or may divide the residual block into a plurality of sub-blocks and perform the transform by using the sub-blocks as transform units. Alternatively, the residual block may be divided into two sub-blocks, i.e., a transform region and a non-transform region, and the residual signal transformed by using only the transform region sub-block as a transform unit. Here, the transform region sub-block may be one of two rectangular blocks having a size ratio of 1:1 along the horizontal axis (or the vertical axis). In this case, a flag (cu_sbt_flag) indicating that only a sub-block is transformed, together with direction (vertical/horizontal) information (cu_sbt_horizontal_flag) and/or position information (cu_sbt_pos_flag), is encoded by the entropy encoder 155 and signaled to the video decoding apparatus. In addition, the transform region sub-block may have a size ratio of 1:3 along the horizontal axis (or the vertical axis). In this case, a flag (cu_sbt_quad_flag) that distinguishes this type of division is additionally encoded by the entropy encoder 155 and signaled to the video decoding apparatus.

On the other hand, the transformer 140 may transform the residual block separately in the horizontal direction and the vertical direction. Various types of transform functions or transform matrices may be used for this transform. For example, pairs of transform functions for the horizontal and vertical transforms may be defined as a multiple transform set (MTS). The transformer 140 may select the transform function pair having the highest transform efficiency in the MTS and transform the residual block in each of the horizontal and vertical directions. Information (mts_idx) about the selected transform function pair in the MTS is encoded by the entropy encoder 155 and signaled to the video decoding apparatus.
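A separable transform of this kind can be sketched by applying one 1D transform to every row and another to every column. The floating-point orthonormal DCT-II below is used only as an example kernel; actual codecs use integer approximations, and the function names are assumptions of this sketch.

```python
import math
from typing import Callable, List

Vector = List[float]

def dct_1d(v: Vector) -> Vector:
    """Orthonormal 1D DCT-II of a vector."""
    n = len(v)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * sum(v[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                               for i in range(n)))
    return out

def transform_2d(block: List[Vector],
                 row_tx: Callable[[Vector], Vector] = dct_1d,
                 col_tx: Callable[[Vector], Vector] = dct_1d) -> List[Vector]:
    """Apply row_tx horizontally to each row, then col_tx vertically to each column."""
    rows = [row_tx(list(r)) for r in block]
    cols = [col_tx(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]
```

Passing different functions as `row_tx` and `col_tx` corresponds to choosing a different transform function pair for the horizontal and vertical directions.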

The quantizer 145 quantizes the transform coefficients output from the transformer 140 using a quantization parameter, and outputs the quantized transform coefficients to the entropy encoder 155. For some blocks or frames, the quantizer 145 may also quantize the residual block directly without a transform. The quantizer 145 may also apply different quantization coefficients (scaling values) according to the positions of the transform coefficients in the transform block. A quantization matrix applied to the two-dimensionally arranged quantized transform coefficients may be encoded and signaled to the video decoding apparatus.
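Scalar quantization driven by a quantization parameter can be sketched as follows. The mapping from the quantization parameter to a step size, Qstep = 2^((QP - 4) / 6), follows the convention used in H.264 and HEVC and is an assumption here; this document does not state the mapping, and position-dependent scaling by a quantization matrix is omitted.

```python
import math
from typing import List

def q_step(qp: int) -> float:
    """Quantization step size; doubles every 6 QP values (assumed convention)."""
    return 2.0 ** ((qp - 4) / 6.0)

def quantize(coeffs: List[List[float]], qp: int) -> List[List[int]]:
    """Divide each coefficient by the step size and round half away from zero."""
    step = q_step(qp)
    return [[int(math.copysign(math.floor(abs(c) / step + 0.5), c)) for c in row]
            for row in coeffs]

def dequantize(levels: List[List[int]], qp: int) -> List[List[float]]:
    """Reconstruct approximate coefficients from the quantized levels."""
    step = q_step(qp)
    return [[lv * step for lv in row] for row in levels]
```

The difference between the original and dequantized coefficients is the quantization error, which is the lossy part of the coding chain.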

The rearrangement unit 150 may rearrange the coefficient values of the quantized residual values.

The rearrangement unit 150 may change the 2D coefficient array to a 1D coefficient sequence by using coefficient scanning. For example, the rearrangement unit 150 may scan the DC coefficients to the coefficients of the high frequency region using zigzag scanning (zig-zag scan) or diagonal scanning (diagonal scan) to output a 1D coefficient sequence. Instead of the zig-zag scan, a vertical scan that scans the 2D coefficient array in the column direction and a horizontal scan that scans the 2D block type coefficients in the row direction may also be utilized, depending on the size of the transform unit and the intra prediction mode. In other words, the scanning method to be used may be determined in zigzag scanning, diagonal scanning, vertical scanning, and horizontal scanning according to the size of the transform unit and the intra prediction mode.
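The coefficient scans named above can be illustrated with a minimal Python sketch. The function names are hypothetical, and the sketch only shows how a 2D coefficient array may be flattened into a 1D sequence starting from the DC coefficient; it does not reproduce the scan tables of any particular codec.

```python
def zigzag_scan(block):
    """Flatten a 2D coefficient block into a 1D list in zig-zag order, DC first."""
    rows, cols = len(block), len(block[0])
    out = []
    for s in range(rows + cols - 1):
        # All positions (r, c) on anti-diagonal s satisfy r + c == s.
        diag = [(r, s - r) for r in range(rows) if 0 <= s - r < cols]
        if s % 2 == 0:
            diag.reverse()  # even diagonals run from bottom-left to top-right
        out.extend(block[r][c] for r, c in diag)
    return out


def horizontal_scan(block):
    """Scan the block row by row."""
    return [v for row in block for v in row]


def vertical_scan(block):
    """Scan the block column by column."""
    return [block[r][c] for c in range(len(block[0])) for r in range(len(block))]


if __name__ == "__main__":
    coeffs = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
    print(zigzag_scan(coeffs))
    print(vertical_scan(coeffs))
```

A decoder-side rearrangement would apply the same position order in reverse to rebuild the 2D array from the 1D sequence.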

The entropy encoder 155 encodes the sequence of the 1D quantized transform coefficients output from the rearrangement unit 150 by using various encoding schemes including Context-based adaptive binary arithmetic coding (Context-based Adaptive Binary Arithmetic Code, CABAC), exponential golomb (Exponential Golomb), and the like to generate a bitstream.

Further, the entropy encoder 155 encodes information related to block division (e.g., CTU size, CTU division flag, QT division flag, MTT division type, MTT division direction, etc.) so that the video decoding apparatus can divide blocks in the same way as the video encoding apparatus. Further, the entropy encoder 155 encodes information on a prediction type indicating whether the current block is encoded by intra prediction or inter prediction. The entropy encoder 155 encodes intra prediction information (i.e., information about an intra prediction mode) or inter prediction information (a merge index in the case of a merge mode, and information about a reference picture index and a motion vector difference in the case of an AMVP mode) according to the prediction type. Further, the entropy encoder 155 encodes information related to quantization (i.e., information about quantization parameters and information about quantization matrices).

The inverse quantizer 160 inversely quantizes the quantized transform coefficient output from the quantizer 145 to generate a transform coefficient. The inverse transformer 165 transforms the transform coefficients output from the inverse quantizer 160 from the frequency domain to the spatial domain to restore a residual block.

The adder 170 adds the restored residual block and the prediction block generated by the predictor 120 to restore the current block. The pixels in the restored current block may be used as reference pixels when intra-predicting the next block.

The loop filtering unit 180 performs filtering on the restored pixels to reduce block artifacts (blocking artifacts), ringing artifacts (ringing artifacts), blurring artifacts (blurring artifacts), etc., which occur due to block-based prediction and transform/quantization. The loop filtering unit 180 as an in-loop filter may include all or some of a deblocking filter 182, a sample adaptive offset (sample adaptive offset, SAO) filter 184, and an adaptive loop filter (adaptive loop filter, ALF) 186.

The deblocking filter 182 filters boundaries between restored blocks to remove block artifacts that occur due to block unit encoding/decoding, and the SAO filter 184 and the ALF 186 additionally filter the deblock-filtered video. The SAO filter 184 and the ALF 186 are filters for compensating for differences between restored pixels and original pixels that occur due to lossy coding. The SAO filter 184 applies an offset on a per-CTU basis to enhance subjective image quality and coding efficiency. On the other hand, the ALF 186 performs filtering on a per-block basis, and compensates for distortion by applying different filters according to the edges of each block and the degree of variation. Information about filter coefficients to be used for the ALF may be encoded and signaled to the video decoding apparatus.

The restored blocks filtered by the deblocking filter 182, the SAO filter 184, and the ALF 186 are stored in the memory 190. When all blocks in one image are restored, the restored image may be used as a reference image for inter-predicting blocks within a picture to be subsequently encoded.

Fig. 5 is a functional block diagram of a video decoding apparatus in which the techniques of the present invention may be implemented. Hereinafter, with reference to fig. 5, a video decoding apparatus and components of the apparatus are described.

The video decoding apparatus may include an entropy decoder 510, a reordering unit 515, an inverse quantizer 520, an inverse transformer 530, a predictor 540, an adder 550, a loop filtering unit 560, and a memory 570.

Similar to the video encoding apparatus of fig. 1, each component of the video decoding apparatus may be implemented as hardware or software, or as a combination of hardware and software. In addition, the function of each component may be implemented as software, and the microprocessor may also be implemented to execute the function of the software corresponding to each component.

The entropy decoder 510 extracts information related to block segmentation by decoding a bitstream generated by a video encoding apparatus to determine a current block to be decoded, and extracts prediction information required to restore the current block and information on a residual signal.

The entropy decoder 510 determines the size of CTUs by extracting information about the CTU size from a Sequence Parameter Set (SPS) or a Picture Parameter Set (PPS), and partitions a picture into CTUs having the determined size. Further, the CTU is determined as the highest layer (i.e., root node) of the tree structure, and the division information of the CTU may be extracted to divide the CTU by using the tree structure.

For example, when dividing a CTU by using the QTBTTT structure, a first flag (qt_split_flag) related to the division of QT is first extracted to divide each node into four nodes of the lower layer. In addition, a second flag (mtt_split_flag), a split direction (vertical/horizontal), and/or a split type (binary/ternary) related to the split of the MTT are extracted with respect to a node corresponding to a leaf node of the QT to split the corresponding leaf node into the MTT structure. As a result, each node below the leaf node of the QT is recursively partitioned into a BT or TT structure.

As another example, when a CTU is divided by using the QTBTTT structure, a CU division flag (split_cu_flag) indicating whether to divide the CU is extracted. When the corresponding block is partitioned, a first flag (qt_split_flag) may also be extracted. During the segmentation process, for each node, recursive MTT segmentation may occur zero or more times after recursive QT segmentation occurs zero or more times. For example, for a CTU, MTT partitioning may occur immediately, or conversely, only QT partitioning may occur multiple times.

As another example, when dividing the CTU by using the QTBT structure, a first flag (qt_split_flag) related to the division of QT is extracted to divide each node into four nodes of the lower layer. In addition, a split flag (split_flag) indicating whether or not a node corresponding to a leaf node of QT is further split into BT and split direction information are extracted.

On the other hand, when the entropy decoder 510 determines the current block to be decoded by using the partition of the tree structure, the entropy decoder 510 extracts information on a prediction type indicating whether the current block is intra-predicted or inter-predicted. When the prediction type information indicates intra prediction, the entropy decoder 510 extracts syntax elements for the intra prediction information (intra prediction mode) of the current block. When the prediction type information indicates inter prediction, the entropy decoder 510 extracts syntax elements for the inter prediction information, i.e., information representing a motion vector and a reference picture to which the motion vector refers.

Further, the entropy decoder 510 extracts quantization-related information and extracts information on transform coefficients of the quantized current block as information on a residual signal.

The reordering unit 515 may change the sequence of the 1D quantized transform coefficients entropy-decoded by the entropy decoder 510 into a 2D coefficient array (i.e., block) again in the reverse order of the coefficient scan order performed by the video encoding device.

The inverse quantizer 520 inversely quantizes the quantized transform coefficients by using a quantization parameter. The inverse quantizer 520 may also apply different quantization coefficients (scaling values) to the quantized transform coefficients arranged in 2D. The inverse quantizer 520 may perform inverse quantization by applying a matrix of quantization coefficients (scaling values) from the video encoding apparatus to a 2D array of quantized transform coefficients.

The inverse transformer 530 restores a residual signal by inversely transforming the inversely quantized transform coefficients from the frequency domain to the spatial domain to generate a residual block of the current block.

Further, when the inverse transformer 530 inversely transforms a partial region (sub-block) of the transform block, the inverse transformer 530 extracts a flag (cu_sbt_flag) transforming only the sub-block of the transform block, direction (vertical/horizontal) information (cu_sbt_horizontal_flag) of the sub-block, and/or position information (cu_sbt_pos_flag) of the sub-block. The inverse transformer 530 also inversely transforms transform coefficients of the corresponding sub-block from the frequency domain to the spatial domain to restore a residual signal, and fills the region that is not inversely transformed with a value of "0" as the residual signal to generate a final residual block of the current block.
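The zero-filling step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function and parameter names (`fill_sbt_residual`, `horizontal_split`, `second_part`) are hypothetical stand-ins for the decoded direction and position information, and the sub-block is assumed to have already been inversely transformed.

```python
def fill_sbt_residual(sub_block, height, width, horizontal_split, second_part):
    """Place an inversely transformed sub-block into a zero-filled residual block.

    horizontal_split: True if the block is divided into top/bottom parts,
                      False if it is divided into left/right parts (assumed meaning).
    second_part:      True if the transformed sub-block is the bottom/right part.
    """
    full = [[0] * width for _ in range(height)]  # non-transformed region stays 0
    sub_h, sub_w = len(sub_block), len(sub_block[0])
    row0 = height - sub_h if (horizontal_split and second_part) else 0
    col0 = width - sub_w if (not horizontal_split and second_part) else 0
    for r in range(sub_h):
        for c in range(sub_w):
            full[row0 + r][col0 + c] = sub_block[r][c]
    return full


if __name__ == "__main__":
    # Right half of a 2x4 block carries the residual; the left half is zero.
    print(fill_sbt_residual([[1, 2], [3, 4]], 2, 4, False, True))
```

Because the placement depends only on the sub-block dimensions, the same sketch covers both the 1:1 and the 1:3 size ratios.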

Further, when applying MTS, the inverse transformer 530 determines a transform index or a transform matrix to be applied in each of the horizontal direction and the vertical direction by using MTS information (mts_idx) signaled from the video encoding apparatus. The inverse transformer 530 also performs inverse transformation on the transform coefficients in the transform block in the horizontal direction and the vertical direction by using the determined transform function.

The predictor 540 may include an intra predictor 542 and an inter predictor 544. The intra predictor 542 is activated when the prediction type of the current block is intra prediction, and the inter predictor 544 is activated when the prediction type of the current block is inter prediction.

The intra predictor 542 determines an intra prediction mode of the current block among the plurality of intra prediction modes according to syntax elements of the intra prediction mode extracted from the entropy decoder 510. The intra predictor 542 also predicts the current block by using neighboring reference pixels of the current block according to an intra prediction mode.

The inter predictor 544 determines a motion vector of the current block and a reference picture to which the motion vector refers by using syntax elements of the inter prediction mode extracted from the entropy decoder 510.

The adder 550 restores the current block by adding the residual block output from the inverse transformer 530 to the prediction block output from the inter predictor 544 or the intra predictor 542. In intra prediction of a block to be decoded later, pixels within the restored current block are used as reference pixels.

The loop filtering unit 560, which is an in-loop filter, may include a deblocking filter 562, an SAO filter 564, and an ALF 566. Deblocking filter 562 performs deblocking filtering on boundaries between restored blocks to remove block artifacts occurring due to block unit decoding. The SAO filter 564 and ALF 566 perform additional filtering on the restored block after deblocking filtering to compensate for differences between restored pixels and original pixels that occur due to lossy encoding. The filter coefficients of the ALF are determined by using information on the filter coefficients decoded from the bitstream.

The restored blocks filtered by the deblocking filter 562, the SAO filter 564, and the ALF 566 are stored in the memory 570. When all blocks in one image are restored, the restored image may be used as a reference image for inter-predicting blocks within a picture to be subsequently decoded.

In some embodiments, the invention relates to encoding and decoding video imagery as described above. More particularly, the present invention provides a video encoding and decoding method and apparatus for generating an improved prediction signal, which approximates an original video signal, from an intra-predicted prediction signal using a variable and fixed coefficient-based deep learning model.

The following embodiments may be generally applied to cases involving deep learning techniques for video encoding and decoding devices.

In the following description, the term "target block" to be encoded/decoded may be used interchangeably with the current block or Coding Unit (CU) as described above, or the term "target block" to be encoded/decoded may refer to some region of the coding unit.

I. Coding mode for intra prediction

As described above, intra prediction is a method of predicting the current block by referring to samples located around the current target block to be encoded. As shown in figs. 3a and 3b, Versatile Video Coding (VVC) may utilize the non-directional prediction modes (DC and planar), 65 directional prediction modes, and wide-angle intra prediction modes. In addition, intra prediction may utilize prediction techniques such as multiple reference line intra prediction (MRLP), cross-component linear model (CCLM), position dependent intra prediction combination (PDPC), intra sub-partitions (ISP), and matrix-weighted intra prediction (MIP).

In the intra prediction process using MRLP, a video encoding/decoding apparatus may use multiple reference lines (MRL) to employ more reference lines. When MRL is applied, the video encoding/decoding apparatus may perform intra prediction on the current block using samples of two lines added at the top and left of the current block in addition to the original reference line. When MRL is applied, an index (mrl_idx) indicating the reference line may be signaled to the video decoding apparatus to select the reference line.

CCLM prediction is an intra prediction method using a linear model representing the similarity between a luminance signal and a chrominance signal. To activate the CCLM mode, the encoding device may signal a flag for activating the CCLM mode to the video decoding device.

CCLM prediction first derives a linear transformation function between the chroma reference samples neighboring the current chroma block and the luma reference samples co-located with those chroma reference samples. At this time, the linear transformation function may be derived from two points: the minimum value of the neighboring luminance signal together with its co-located chrominance value, and the maximum value of the neighboring luminance signal together with its co-located chrominance value. Next, prediction of the chroma samples is performed by applying the linear transformation function to the luma samples co-located with the chroma block.
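The two-point derivation of the CCLM linear model can be illustrated with a minimal Python sketch. It assumes floating-point arithmetic and luma samples that have already been brought to the chroma resolution; the function names are hypothetical, and an actual codec would use integer arithmetic.

```python
def derive_cclm_model(neigh_luma, neigh_chroma):
    """Fit chroma = alpha * luma + beta through the min-luma and max-luma points."""
    idx = range(len(neigh_luma))
    i_min = min(idx, key=neigh_luma.__getitem__)
    i_max = max(idx, key=neigh_luma.__getitem__)
    l_min, l_max = neigh_luma[i_min], neigh_luma[i_max]
    c_min, c_max = neigh_chroma[i_min], neigh_chroma[i_max]
    if l_max == l_min:
        # Flat luma neighborhood: fall back to a constant chroma prediction.
        return 0.0, float(c_min)
    alpha = (c_max - c_min) / (l_max - l_min)
    beta = c_min - alpha * l_min
    return alpha, beta


def predict_chroma(luma_block, alpha, beta):
    """Apply the linear model to the co-located luma samples."""
    return [[alpha * l + beta for l in row] for row in luma_block]


if __name__ == "__main__":
    a, b = derive_cclm_model([10, 20, 30], [5, 10, 15])
    print(a, b, predict_chroma([[10, 40]], a, b))
```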

One of the rule-based prediction methods for intra prediction is position dependent intra prediction combination (PDPC). In PDPC, a predictor may be generated based on a predefined operation using encoding information of the target block on which intra prediction is performed and neighboring pixels spatially adjacent to the target block.

The PDPC modifies the predicted samples generated according to the particular intra prediction mode to generate an intra predictor for the current block. Here, similar to the prediction mode shown in fig. 3a, the specific intra prediction modes include a planar mode, a DC mode, a horizontal mode (prediction mode 18), a vertical mode (prediction mode 50), a lower left diagonal direction mode (prediction mode 2), and 15 direction modes close to the lower left diagonal direction mode, and an upper right diagonal direction mode (prediction mode 66) and 15 direction modes close to the upper right diagonal direction mode.

For prediction samples of the current block generated according to a particular intra prediction mode, the PDPC technique may adjust each pixel value with predefined weights and location information of neighboring pixels to generate the prediction samples.

As described above, the intra prediction modes of the luminance block include subdivided directional modes (i.e., -14 to 80) in addition to the non-directional modes (i.e., planar and DC), as shown in figs. 3a and 3b. After sub-partitioning the current block into smaller blocks of the same size, the ISP technique shares the intra prediction mode of the current block across all sub-blocks; however, a different transform may be applied to each sub-block. In performing the sub-partitioning, the block may be partitioned in the horizontal or vertical direction.

The predictor may be generated using pixels adjacent to the current block performing intra prediction and encoding information of the current block based on a predefined matrix operation. The above-described rule-based prediction method is referred to as matrix weighted intra prediction (Matrix weighted Intra Prediction, MIP).

MIP generates all or part of the intra-predictor using predefined matrix operations. When generating a portion of the predictor, the MIP may generate samples of the final intra prediction equal to the size of the current block by additionally performing interpolation (interpolation) for upsampling or enlarging using a portion of the predictor.

On the other hand, MIP may select a part of the pixels spatially adjacent to the current block and use the selected pixels as the neighboring pixels of the current block. As another embodiment, MIP may use, for the matrix operation, values derived by an operation such as sub-sampling or downscaling.

According to a method for transmitting an encoding mode of intra prediction, the prediction mode of intra prediction and whether a prediction technique is applied may be signaled from the video encoding apparatus to the video decoding apparatus, as shown in fig. 6. For example, when encoding the current block with intra prediction, the video encoding apparatus may signal pred_mode_flag as 0 and then may signal whether to apply the MIP technique using intra_mip_flag.

In the example of fig. 6, when intra prediction of a current block is performed, the most probable mode (most probable mode, MPM) technique uses intra prediction modes of neighboring blocks. The video encoding apparatus may improve the coding efficiency of the intra prediction mode by transmitting an index of the MPM list instead of an index of the prediction mode. On the other hand, the example of fig. 6 does not provide a detailed description of a method for signaling a coding mode to which MIP is not applied.

On the other hand, when intra prediction is performed, the intra block copy (IBC) technique uses a reference block within the same picture as the prediction block of the current block, instead of using reference samples. At this time, a block vector represents the displacement to the reference block and is signaled from the video encoding apparatus to the video decoding apparatus.

II. Block improvement for intra prediction

Fig. 7 shows a block diagram of a video encoding apparatus that uses improvement of an intra prediction signal according to one embodiment of the present invention.

The video encoding apparatus according to the present embodiment additionally includes a signal improvement unit 710 after the intra predictor 122, which is one of the basic constituent elements. Here, constituent elements included in the video encoding apparatus according to the present embodiment are not necessarily limited to specific examples. For example, the video encoding apparatus may additionally include a training unit (not shown) for training the deep learning-based improvement model included in the signal improvement unit 710, or may be implemented to operate in conjunction with an external training unit.

The intra predictor 122 generates a prediction block including a predicted signal of a current target block to be encoded from neighboring reference samples using a prediction mode.

The video encoding apparatus may use an improvement flag refinement_flag to indicate whether the improvement model is applied to the prediction block. The video encoding apparatus may transmit the improvement flag to the video decoding apparatus on a per-block basis, or may transmit the improvement flag on a per-image or per-slice basis after including the improvement flag in the SPS.

When the improvement flag is 1, the signal improvement unit 710 generates an improved prediction block from the prediction block using the improvement model. On the other hand, the training unit may train the improved model such that the improved model learns the signal generating method to generate an improved signal approximating the original signal of the current block.

In the following description, the term "improved prediction block" may be used interchangeably with the term "improved block".

The video encoding apparatus generates a residual block by subtracting the improved block from the current block when the improvement flag is 1, and generates a residual block by subtracting the prediction block from the current block when the improvement flag is 0. The video encoding apparatus may perform the above-described encoding process by inputting the residual values of the residual block to the transformer 140.

On the other hand, as an example, the video encoding apparatus may set the value of the improvement flag as follows. After determining the number N of intra prediction modes based on the size of the current block, where N is a natural number, the video encoding apparatus determines N candidate prediction modes by performing a rough mode decision (RMD) on the current block. The video encoding apparatus generates a prediction block for each of the N candidate prediction modes and the prediction modes included in the MPM list, and then calculates a rate-distortion cost (RD-cost) for each prediction block. Furthermore, the video encoding apparatus generates improved prediction blocks by applying the improvement model to each prediction block, and then calculates a rate-distortion cost for each improved prediction block. The video encoding apparatus may compare the rate-distortion costs of the improved blocks and the prediction blocks and, when the cost of using an improved block is the minimum, determine the corresponding candidate prediction mode as the intra prediction mode of the current block and set the improvement flag refinement_flag to 1.

The video decoding apparatus according to the present embodiment additionally includes a signal improvement unit 810 after the intra predictor 542, which is one of the basic constituent elements.

The entropy decoder 510 decodes, from a bitstream, an intra prediction mode, an improvement flag refinement_flag, and a residual block including the residual values of the current block to be decoded.

The intra predictor 542 generates a prediction block including a predicted signal of the current block to be decoded from neighboring reference samples using an intra prediction mode.

When the improvement flag is 1, the signal improvement unit 810 generates an improved block from the prediction block using the improvement model. The video decoding apparatus generates a restored block for the current block by adding the residual values to the improved block when the improvement flag is 1, and generates a restored block for the current block by adding the residual values to the prediction block when the improvement flag is 0.
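The decoder-side use of the improvement flag can be sketched as follows. This is a minimal illustration, not the disclosed implementation: `reconstruct_block` and the `refine` callable are hypothetical names, `refine` stands in for the deep learning-based improvement model, and clipping to the valid sample range is omitted.

```python
def reconstruct_block(pred_block, residual, refinement_flag, refine):
    """Add the residual to the improved block (flag 1) or to the prediction block (flag 0)."""
    base = refine(pred_block) if refinement_flag == 1 else pred_block
    return [
        [p + r for p, r in zip(base_row, res_row)]
        for base_row, res_row in zip(base, residual)
    ]


if __name__ == "__main__":
    # Stand-in for the improvement model: adds 1 to every sample (illustration only).
    toy_refine = lambda block: [[v + 1 for v in row] for row in block]
    pred = [[10, 20]]
    res = [[1, -2]]
    print(reconstruct_block(pred, res, 1, toy_refine))
    print(reconstruct_block(pred, res, 0, toy_refine))
```

The encoder mirrors this choice when forming the residual, so both sides must apply the same model to the same prediction block.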

In the following description, an improved model within a video encoding apparatus is described. The following description may be equally applied to an improved model used in a video decoding apparatus.

In one embodiment, the improved model may be a deep learning model implemented with a convolutional layer including only fixed coefficients (in the following description, "fixed coefficient based network"). A fixed coefficient based network may be used to refine the input block to approximate the original block. Since the training unit trains the deep learning model having a relatively large number of parameters in advance using various input data, an improved model including a fixed coefficient network can be realized.

An improved model including a fixed coefficient based network may be implemented based on a noise prediction method. For example, as shown in fig. 9, the improved model may generate an improved signal by estimating noise in the prediction block and removing the estimated noise from the prediction block.

Alternatively, the improvement model may use a method of generating an improved signal using neighboring pixel values. For example, when IBC is applied, the improvement model may use the block indicated by a block vector as the prediction block of the current block, and may generate an improved signal from the prediction block. Further, as shown in fig. 10, the improvement model may use the prediction block found by the block vector as an input to generate additional vectors on a per-sub-block or per-pixel basis and generate an improved signal using these vectors.

Alternatively, the refinement model may fill in neighboring reference samples for intra prediction of the current block to be used as input, and generate a refined signal according to the input.

On the other hand, as shown in fig. 11, the input of the improvement model is a prediction block obtained according to a specific mode, and the output of the improvement model corresponds to the improved prediction block. Here, the specific mode may be any prediction mode or combination of prediction modes for intra prediction.

Alternatively, the input may consist of a weighted sum of a prediction block according to a prediction mode for intra prediction and a prediction block according to inter prediction. The improvement model may generate an improvement signal from the input.

The video encoding apparatus determines whether to improve a prediction block by comparing the rate-distortion cost of the prediction block according to an existing encoding mode with that of the improved block. Thereafter, as described above, the video encoding apparatus adds an improvement flag refinement_flag indicating whether or not to improve the prediction signal, and signals the improvement flag to the video decoding apparatus. The video encoding apparatus may set the improvement flag refinement_flag to 1 when the rate-distortion cost of the improved block is less than the rate-distortion cost of the prediction block.

The rate-distortion cost $J_{RD}$ can be calculated by Equation 1 below.

[Equation 1]

$J_{RD} = \mathrm{SSD}(S, C) + \lambda \cdot R$

In Equation 1, S represents the original block, and C represents the restored block, which may be based on a prediction block or an improved prediction block. The sum of squared differences (SSD) measures the difference between the original block and the restored block, so a smaller value indicates a closer match. R represents the estimated bit rate, and λ represents the Lagrangian multiplier.
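The cost comparison that sets the improvement flag can be illustrated with a minimal Python sketch of Equation 1. The function names, the rate values, and the value of λ used in the example are assumptions chosen for illustration; in an actual encoder the rates come from entropy coding estimates.

```python
def ssd(a, b):
    """Sum of squared differences between two equally sized 2D blocks."""
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def rd_cost(original, candidate, rate_bits, lam):
    """J_RD = SSD(S, C) + lambda * R (Equation 1)."""
    return ssd(original, candidate) + lam * rate_bits


def choose_refinement_flag(original, pred, refined, rate_pred, rate_refined, lam):
    """Return 1 only when the improved block has a strictly lower RD cost."""
    cost_pred = rd_cost(original, pred, rate_pred, lam)
    cost_refined = rd_cost(original, refined, rate_refined, lam)
    return 1 if cost_refined < cost_pred else 0


if __name__ == "__main__":
    S = [[10, 10]]
    print(choose_refinement_flag(S, [[8, 12]], [[10, 11]], 4, 5, 0.5))
```

When variable coefficients are transmitted, the same comparison applies with the extra rate added to the rate of the improved candidate.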

On the other hand, parameters of a network based on fixed coefficients are shared between a video encoding device and a video decoding device.

In another embodiment, the improvement model may be a deep learning model implemented with a fixed coefficient-based network and a convolutional layer including only variable coefficients (in the following description, "variable coefficient-based network"). Because its network parameters have to be transmitted, the variable coefficient-based network may be implemented with a relatively small number of parameters. The improvement model including the variable coefficient-based network can be realized by having the training unit train the deep learning model during the encoding of the original block while keeping the pre-trained fixed coefficient-based network fixed.

An improvement model including a variable coefficient-based network may use a mask map. The mask map is a vector or variable used to generate the improved block x_refined by performing an appropriate operation on the output signal x_fixed of the fixed coefficient-based network. For example, as shown in fig. 12, the fixed coefficient-based network of the improvement model generates the output signal x_fixed, and the variable coefficient-based network of the improvement model generates the mask map m. The improvement model uses the mask map to compute a weighted sum of the output signal x_fixed and the prediction block x, thereby generating the improved block x_refined.
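The mask-based weighted sum can be sketched as follows. The text does not give the exact weighting, so this sketch assumes complementary per-sample weights m and 1 - m with mask values in [0, 1]; the function name `apply_mask` is hypothetical.

```python
def apply_mask(x_pred, x_fixed, mask):
    """x_refined = m * x_fixed + (1 - m) * x, evaluated per sample (assumed form)."""
    return [
        [m * f + (1 - m) * p for p, f, m in zip(p_row, f_row, m_row)]
        for p_row, f_row, m_row in zip(x_pred, x_fixed, mask)
    ]


if __name__ == "__main__":
    x = [[100, 100]]        # prediction block
    x_fixed = [[110, 90]]   # output of the fixed coefficient-based network
    m = [[0.5, 0.0]]        # mask map from the variable coefficient-based network
    print(apply_mask(x, x_fixed, m))
```

Under this assumed form, a mask value of 0 leaves the prediction sample unchanged and a mask value of 1 takes the sample entirely from the fixed coefficient-based network.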

Alternatively, the improvement model may use a variable coefficient-based network consisting of multiple convolutional layers. For example, as shown in fig. 13, the fixed coefficient-based network of the improvement model may generate the output signal x_fixed, and the variable coefficient-based network of the improvement model may generate the improved block x_refined based on the output signal x_fixed.

The video encoding apparatus may calculate the rate-distortion cost $J_{RD}$ by additionally considering the bit rate $R'$ associated with the parameters of the variable coefficient-based network, as shown in Equation 2.

[Equation 2]

$J_{RD} = \mathrm{SSD}(S, C) + \lambda \cdot (R + R')$

For example, a variable coefficient-based network consisting of a single convolutional layer with one 3×3 kernel requires a total of 10 parameters: nine kernel weights plus one bias parameter. Thus, when the parameters are transmitted using a 16-bit floating point type (float16), a total of 160 bits are additionally required.
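The arithmetic in this example can be checked with a short Python sketch. The helper names are hypothetical, and the single-input-channel, single-output-channel layer is an assumption that matches the stated total of 10 parameters.

```python
def conv_param_count(kernel_h, kernel_w, in_ch=1, out_ch=1, bias=True):
    """Number of weights in one convolutional layer, plus one bias per output channel."""
    weights = kernel_h * kernel_w * in_ch * out_ch
    return weights + (out_ch if bias else 0)


def param_bits(n_params, bits_per_param=16):
    """Bits needed to transmit the parameters at a fixed precision (float16 by default)."""
    return n_params * bits_per_param


if __name__ == "__main__":
    n = conv_param_count(3, 3)
    print(n, param_bits(n))
```

The resulting bit count is one way to estimate the additional rate term for the variable coefficients in the rate-distortion cost.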

On the other hand, the parameter θ of the variable coefficient-based network needs to be transmitted from the video encoding apparatus to the video decoding apparatus. The transmission period may be determined by considering the size of the parameter. For example, the parameter θ may be transmitted at each intra frame (I-frame), at which the parameter is refreshed.

On the other hand, as described above, the video encoding apparatus may explicitly transmit the improvement flag refinement_flag in the bitstream. When refinement_flag=1, the video decoding apparatus may generate an improved block by applying the improvement model to the prediction block. On the other hand, when refinement_flag=0, the video decoding apparatus may skip the application of the improvement model and perform conventional intra prediction. As described above, the video encoding apparatus may transmit the improvement flag on a per-block basis, or on a per-video-sequence or per-slice basis. In the case of per-block transmission, the video encoding apparatus may additionally use a higher-level flag indicating the presence of the per-block improvement flag.

As another example, the improvement flag refinement_flag may be implicitly determined. When it is determined that refinement_flag=1, the video decoding apparatus may generate an improved block by applying the improvement model to the prediction block. On the other hand, when it is determined that refinement_flag=0, the video decoding apparatus may skip the application of the improvement model and perform conventional intra prediction.

The method for implicitly determining the improvement flag may employ one of the following methods.

In one approach, the determination depends on a particular mode being selected for intra prediction. For example, when the intra prediction mode is Planar, the improvement flag may be implicitly determined to be 1.

When the intra prediction adopts the MIP mode or the PDPC mode instead of the conventional intra prediction mode, the improvement flag may be implicitly determined to be 1.

When MRLP-based intra prediction is used and the reference samples in the directly adjacent row or column are not used, the improvement flag may be implicitly determined to be 1.

When the reference samples are not available in a directly adjacent row or column for intra prediction, the improvement flag may be implicitly determined to be 1.

When the ISP mode is applied to the intra prediction, the improvement flag may be implicitly determined to be 1.
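The implicit rules listed above can be sketched as a single decision function. This is only an illustration: the `IntraInfo` container, its field names, and the choice to combine all rules with a logical OR are assumptions (the text states that one of the methods may be employed), and the `trigger_modes` default simply mirrors the Planar example.

```python
from dataclasses import dataclass


@dataclass
class IntraInfo:
    """Hypothetical summary of how the current block was intra predicted."""
    mode: str = "DC"                      # e.g. "Planar", "DC", "Angular"
    uses_mip: bool = False
    uses_pdpc: bool = False
    uses_isp: bool = False
    ref_line_index: int = 0               # 0 = directly adjacent row/column
    adjacent_refs_available: bool = True


def implicit_refinement_flag(info, trigger_modes=("Planar",)):
    """Return 1 if any of the implicit conditions holds, otherwise 0."""
    conditions = (
        info.mode in trigger_modes,          # a particular mode was selected
        info.uses_mip or info.uses_pdpc,     # MIP or PDPC mode
        info.ref_line_index > 0,             # MRLP without the adjacent line
        not info.adjacent_refs_available,    # adjacent samples unavailable
        info.uses_isp,                       # ISP mode
    )
    return 1 if any(conditions) else 0


print(implicit_refinement_flag(IntraInfo(mode="Planar")))  # 1
print(implicit_refinement_flag(IntraInfo(mode="DC")))      # 0
```

Because the function uses only information available to both sides, an encoder and a decoder that evaluate it identically would reach the same flag value without any signaling.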

The improvement model described above receives only the prediction signal as input and improves it. In another embodiment, to achieve the same object of improving the prediction signal, the improvement model may take as input one or a combination of the following: the residual signal obtained after prediction, signals of neighboring blocks, or signals obtained by applying two or more different prediction modes.

On the other hand, as described above, the present embodiment can improve a prediction block generated by a conventional intra prediction mode. Alternatively, a prediction signal improvement mode may be added as a new prediction mode for video coding.

First, improvement of a prediction block according to a conventional intra prediction mode may be performed as follows.

The video encoding apparatus may generate an improved prediction block by applying the improved model according to the present embodiment to a prediction block obtained by first performing all or part of the prediction modes available for encoding. For example, the video encoding apparatus may generate an improved prediction block by applying an improved model to a prediction block obtained by performing a DC/Planar non-directional prediction mode, 65 directional prediction modes, an ISP mode, a MIP mode, and the like.

Alternatively, the video encoding apparatus may perform prediction block improvement on one or more selected prediction modes at the time of encoding. For example, the video encoding apparatus may perform prediction block improvement on prediction blocks obtained from the 65 directional prediction modes.

Alternatively, the video encoding device may perform prediction block refinement on blocks having PU sizes greater than or equal to, or less than or equal to, a predetermined threshold.

The improvement of the prediction block according to the above-described conventional intra prediction mode may be similarly performed in the video decoding apparatus.

Next, the video encoding apparatus may add a prediction signal improvement mode as a new mode.

When the prediction signal improvement mode is added as a new encoding mode, the video encoding apparatus may select it according to the result of comparing its rate-distortion cost with those of the existing modes. For example, the prediction signal improvement mode may be added alongside existing intra prediction coding modes such as the non-directional prediction modes, the directional prediction modes, ISP, and MIP. When the prediction signal improvement mode is applied, the video encoding apparatus may selectively use reference samples in one direction or any direction of the directional prediction modes to generate a prediction block. The video encoding apparatus may then generate an improved prediction block by applying the improvement model.

The above prediction signal improvement mode may be similarly applied to a video decoding apparatus.

In the following description, a video encoding method and a video decoding method that improve the intra prediction signal are described with reference to FIGS. 14 and 15.

The video encoding apparatus obtains an intra prediction mode of the current block and obtains or determines an improvement flag (S1400). Here, the refinement flag refinement_flag indicates whether to apply a refinement model based on deep learning when performing intra prediction of the current block. The video encoding apparatus may transmit the improvement flag to the video decoding apparatus on a per block basis, or may transmit the improvement flag on a per image or slice basis after including the improvement flag in the SPS.

In one example, the video encoding apparatus may set the value of the improvement flag as follows. The video encoding apparatus compares the rate-distortion costs of the prediction blocks and the improved blocks for a plurality of candidate intra prediction modes. When the minimum cost is obtained with an improved block, the video encoding apparatus determines the corresponding candidate prediction mode as the intra prediction mode of the current block and sets the improvement flag to 1.
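The rate-distortion comparison in this example can be sketched as a search over candidate modes, each evaluated with and without improvement. The function name `choose_mode_and_flag`, the mode labels, and the cost values are hypothetical; the costs are assumed to have been computed beforehand, for instance with a cost of the form given in Equation 2.

```python
def choose_mode_and_flag(candidates):
    """Pick the (mode, flag) pair with the lowest rate-distortion cost.

    candidates maps a mode name to a pair of costs:
    (cost with the plain prediction block, cost with the improved block).
    The mapping must contain at least one mode.
    """
    best = None
    for mode, (cost_pred, cost_improved) in candidates.items():
        for flag, cost in ((0, cost_pred), (1, cost_improved)):
            if best is None or cost < best[0]:
                best = (cost, mode, flag)
    cost, mode, flag = best
    return mode, flag, cost


# Hypothetical costs for three candidate intra prediction modes.
costs = {
    "Planar": (120.0, 101.5),
    "DC": (118.0, 119.0),
    "Angular34": (110.0, 104.0),
}
print(choose_mode_and_flag(costs))  # ('Planar', 1, 101.5)
```

In this illustration the improved Planar block gives the minimum cost, so Planar becomes the intra prediction mode and the flag is set to 1; if a plain prediction block had won, the flag would be 0.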

In another example, the video encoding apparatus may implicitly determine the value of the improvement flag as follows. The video encoding apparatus may determine the improvement flag to be 1 when intra prediction of the current block adopts a predetermined prediction mode (e.g., the Planar mode). The video encoding apparatus may determine the improvement flag to be 1 when the intra prediction of the current block adopts the MIP or PDPC mode. Furthermore, when intra prediction of the current block employs a plurality of reference lines but does not use reference samples of a row or column directly adjacent to the current block, the video encoding apparatus may determine the improvement flag to be 1. The video encoding apparatus may determine the improvement flag to be 1 when reference samples are not available in an adjacent row or column. In addition, the video encoding apparatus may determine the improvement flag to be 1 when the intra prediction of the current block adopts the ISP mode.

The video encoding apparatus generates a prediction block of the current block using the intra prediction mode (S1402).

The video encoding apparatus checks the value of the improvement flag (S1404).

When the improvement flag is 1 (yes in S1404), the video encoding apparatus generates an improved prediction block from the prediction block using the improvement model (S1406), and then generates a residual block by subtracting the improved prediction block from the current block (S1408).

The input to the improvement model is a prediction block obtained according to a particular mode, and the output of the improvement model corresponds to the improved prediction block. Here, the specific mode may be any prediction mode or combination of prediction modes for intra prediction.

In one embodiment, the improvement model may be a deep learning model that includes only a fixed coefficient-based network. An improvement model including only the fixed coefficient-based network is pre-trained to generate an improved prediction block that approximates the original image of the current block. The parameters of the fixed coefficient-based network are shared between the video encoding apparatus and the video decoding apparatus.

In another example, the improvement model may be a deep learning model that includes a fixed coefficient-based network and a variable coefficient-based network. In this case, the variable coefficient-based network is trained while the original image of the current block is being encoded, whereas the fixed coefficient-based network remains fixed. The video encoding apparatus may encode the parameters of the variable coefficient-based network and transmit them to the video decoding apparatus. The transmission period may be determined by considering the size of the parameters. For example, the parameters may be transmitted at each intra frame (I-frame), at which the parameters are refreshed.

On the other hand, when the improvement flag is 0 (no in S1404), the video encoding apparatus generates a residual block by subtracting the prediction block from the current block (S1410).

Thereafter, the video encoding apparatus may perform a process of encoding a residual value of the residual block.
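The encoder-side branch of steps S1404 to S1410 can be sketched as follows. The improvement model is replaced by a trivial placeholder (`improve_stub`, which adds 1 to every sample) purely so that the flow is runnable; the actual model is a deep learning network, and all names and sample values here are hypothetical.

```python
def improve_stub(pred):
    """Placeholder for the deep learning improvement model (hypothetical):
    a fixed +1 offset per sample, used only to make the flow runnable."""
    return [[p + 1 for p in row] for row in pred]


def subtract(a, b):
    """Element-wise difference of two equally sized 2-D blocks."""
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def encode_residual(current, pred, refinement_flag, improve=improve_stub):
    """Steps S1404 to S1410: choose the predictor, then form the residual."""
    if refinement_flag == 1:                  # S1404: yes
        improved = improve(pred)              # S1406
        return subtract(current, improved)    # S1408
    return subtract(current, pred)            # S1410


current = [[5, 7], [9, 11]]
pred = [[4, 6], [8, 10]]
print(encode_residual(current, pred, refinement_flag=1))  # [[0, 0], [0, 0]]
print(encode_residual(current, pred, refinement_flag=0))  # [[1, 1], [1, 1]]
```

In this toy case the improved prediction block matches the current block exactly, so the residual to be encoded is zero, which illustrates the intended reduction in residual data.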

The video decoding apparatus decodes the intra prediction mode and the residual value of the current block from the bitstream, and decodes or determines the improvement flag (S1500). Here, the refinement flag refinement_flag indicates whether to apply a refinement model based on deep learning when performing intra prediction of the current block.

As described above, the improvement flag may be transmitted on a per block basis from the video encoding device or on a per video sequence or slice basis.

As another example, the video decoding apparatus may implicitly determine the value of the improvement flag similarly to the operation of the video encoding apparatus.

The video decoding apparatus generates a prediction block of the current block using the intra prediction mode (S1502).

The video decoding apparatus checks the value of the improvement flag (S1504).

When the improvement flag is 1 (yes in S1504), the video decoding apparatus generates an improved prediction block from the prediction block using the improvement model (S1506), and then generates a restored block of the current block by adding a residual value to the improved prediction block (S1508).

The improved model may be a deep learning model comprising only a fixed coefficient based network. As described above, parameters of the fixed coefficient-based network are shared between the video encoding device and the video decoding device.

In another example, the improvement model may be a deep learning model that includes a fixed coefficient-based network and a variable coefficient-based network. In the case of a deep learning model including a variable coefficient-based network, the video decoding apparatus decodes the parameters of the variable coefficient-based network from the bitstream.

On the other hand, when the improvement flag is 0 (no in S1504), the video decoding apparatus generates a restored block of the current block by adding the residual value to the prediction block (S1510).
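The decoder-side branch of steps S1504 to S1510 can be sketched in the same way. As before, `improve_stub` is a hypothetical stand-in for the deep learning improvement model (it adds 1 to every sample), and the block values are illustrative only.

```python
def improve_stub(pred):
    """Placeholder for the deep learning improvement model (hypothetical):
    a fixed +1 offset per sample, used only to make the flow runnable."""
    return [[p + 1 for p in row] for row in pred]


def add(a, b):
    """Element-wise sum of two equally sized 2-D blocks."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def reconstruct(residual, pred, refinement_flag, improve=improve_stub):
    """Steps S1504 to S1510: choose the predictor, then add the residual."""
    if refinement_flag == 1:               # S1504: yes
        improved = improve(pred)           # S1506
        return add(improved, residual)     # S1508
    return add(pred, residual)             # S1510


pred = [[4, 6], [8, 10]]
print(reconstruct([[0, 0], [0, 0]], pred, refinement_flag=1))  # [[5, 7], [9, 11]]
print(reconstruct([[1, 1], [1, 1]], pred, refinement_flag=0))  # [[5, 7], [9, 11]]
```

Both paths restore the same block here; the difference is that the path with improvement needed only a zero residual, provided the decoder applies the same model and flag value as the encoder.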

Although steps in the respective flowcharts are described as sequentially performed, these steps merely exemplify the technical ideas of some embodiments of the present invention. Accordingly, one of ordinary skill in the art to which the invention pertains may perform the steps by changing the order depicted in the various figures or by performing two or more steps in parallel. Accordingly, the steps in the various flowcharts are not limited to the order in which they occur as shown.

It should be understood that the foregoing description presents illustrative embodiments that may be implemented in various other ways. The functions described in some embodiments may be implemented by hardware, software, firmware, and/or combinations thereof. It should also be understood that the functional components described in this specification are labeled "… … units" to highlight the possibility of their independent implementation.

On the other hand, the various methods or functions described in some embodiments may be implemented as instructions stored in a non-volatile recording medium that can be read and executed by one or more processors. The non-volatile recording medium may include, for example, various types of recording devices that store data in a form readable by a computer system, such as an erasable programmable read-only memory (EPROM), a flash memory drive, an optical disk drive, a magnetic hard disk drive, or a solid state drive (SSD).

Although exemplary embodiments of the present invention have been described for illustrative purposes, those skilled in the art to which the present invention pertains will appreciate that various modifications, additions and substitutions are possible, without departing from the scope and spirit of the invention. Accordingly, embodiments of the present invention have been described for brevity and clarity. The scope of the technical idea of the embodiment of the invention is not limited by the illustration. Accordingly, it will be understood by those of ordinary skill in the art that the scope of the present invention should not be limited by the embodiments explicitly described above, but by the claims and their equivalents.

(reference numerals)

122: intra-frame predictor

510: entropy decoder

542: intra-frame predictor

710: signal improvement unit

810: signal improvement unit

Cross Reference to Related Applications

The present application claims priority from Korean Patent Application No. 10-2021-0028794, filed on March 4, 2021, and Korean Patent Application No. 10-2022-0026005, filed on 28 of 2022, which are incorporated herein by reference in their entireties.

Claims

1. A video decoding device, comprising:

an entropy decoder configured to decode an intra prediction mode and a residual value of a current block from a bitstream, and to decode or determine an improvement flag, wherein the improvement flag indicates whether to apply a deep learning-based improvement model at the time of intra prediction of the current block;

an intra predictor configured to generate a prediction block of the current block using the intra prediction mode;

a signal improvement unit configured to generate an improved prediction block from the prediction block using the improvement model when the improvement flag is 1; and

an adder configured to generate a restored block of the current block by adding the residual value to the improved prediction block when the improvement flag is 1, or by adding the residual value to the prediction block when the improvement flag is 0.

2. The apparatus of claim 1, wherein the improvement model is implemented using a deep learning model comprising a fixed coefficient-based network, and is pre-trained to generate an improved prediction block that approximates an original image of the current block.

3. The apparatus of claim 1, wherein the improvement model is implemented using a deep learning model including a fixed coefficient-based network and a variable coefficient-based network, and wherein the variable coefficient-based network is trained while an original image of the current block is encoded, with the fixed coefficient-based network held fixed.

4. The apparatus of claim 3, wherein the entropy decoder is configured to decode parameters of a variable coefficient based network from a bitstream.

5. The apparatus of claim 1, wherein the improvement flag is transmitted from the video encoding apparatus on a per block basis or on a per image or slice basis.

6. The apparatus of claim 1, wherein the improvement flag is determined to be 1 when a predetermined prediction mode is used for intra prediction of the current block.

7. The apparatus of claim 1, wherein the improvement flag is determined to be 1 when matrix weighted intra prediction is used for intra prediction of the current block.

8. The apparatus of claim 1, wherein the improvement flag is determined to be 1 when a plurality of reference lines are used for intra prediction of the current block but reference samples of a row or column directly adjacent to the current block are not used.

9. The apparatus of claim 1, wherein the improvement flag is determined to be 1 when intra prediction reference samples for the current block are not available in an adjacent row or column.

10. The apparatus of claim 1, wherein the improvement flag is determined to be 1 when a sub-block partitioned from the current block is used for intra prediction of the current block.

11. A video decoding method performed by a video decoding device for intra prediction of a current block, the method comprising:

decoding an intra prediction mode and a residual value of a current block from a bitstream, and decoding or determining an improvement flag, wherein the improvement flag indicates whether to apply a deep learning-based improvement model at the time of intra prediction of the current block;

generating a prediction block of the current block using the intra prediction mode; and

generating a restored block of the current block based on the improvement flag,

wherein, when the improvement flag is 1, generating the restored block includes:

generating an improved prediction block from the prediction block using the improvement model; and

generating the restored block by adding the residual value to the improved prediction block, and

wherein, when the improvement flag is 0, generating the restored block includes:

generating the restored block by adding the residual value to the prediction block.

12. The method of claim 11, wherein the improvement flag is transmitted from the video encoding device on a per block basis or on a per video sequence or slice basis.

13. The method of claim 11, wherein determining an improvement flag comprises:

determining the improvement flag to be 1 when a predetermined prediction mode is used for intra prediction of the current block.

14. A video encoding method performed by a video encoding device for intra prediction of a current block, the method comprising:

obtaining an intra prediction mode of the current block, and obtaining or determining an improvement flag, wherein the improvement flag indicates whether to apply a deep learning-based improvement model at the time of intra prediction of the current block;

generating a residual block of the current block based on the improvement flag,

wherein, when the improvement flag is 1, generating the residual block includes:

generating the residual block by subtracting the improved prediction block from the current block, and

wherein, when the improvement flag is 0, generating the residual block includes:

generating the residual block by subtracting the prediction block from the current block.

15. The method of claim 14, wherein the improvement model is implemented using a deep learning model comprising a fixed coefficient-based network, and is pre-trained to generate an improved prediction block that approximates the original image of the current block.

16. The method of claim 14, wherein the improvement model is implemented using a deep learning model comprising a fixed coefficient-based network and a variable coefficient-based network, and wherein the variable coefficient-based network is trained while the original image of the current block is encoded, with the fixed coefficient-based network held fixed.

17. The method of claim 16, further comprising:

encoding parameters of the variable coefficient-based network and transmitting the encoded parameters to a video decoding device.

18. The method of claim 14, further comprising:

transmitting the improvement flag to a video decoding apparatus on a per block basis or on a per image or slice basis.

19. The method of claim 14, wherein determining an improvement flag comprises:

determining the improvement flag to be 1 when a predetermined prediction mode is used for intra prediction of the current block.