WO2019194425A1

WO2019194425A1 - Apparatus and method for applying artificial neural network to image encoding or decoding

Info

Publication number: WO2019194425A1
Application number: PCT/KR2019/002654
Authority: WO
Inventors: 나태영; 이선영; 신재섭; 손세훈; 김효성; 임정연
Original assignee: 에스케이텔레콤 주식회사
Priority date: 2018-04-06
Filing date: 2019-03-07
Publication date: 2019-10-10

Abstract

The present disclosure relates to image encoding or decoding and, more specifically, to an apparatus and a method for applying an artificial neural network (ANN) to image encoding or decoding. The apparatus and the method of the present disclosure are characterized by applying a CNN-based filter coefficient to a first picture and at least one of a quantization parameter map and a block division map to output a second picture.

Description

Apparatus and method for applying artificial neural network to image encoding or decoding

The present disclosure relates to image encoding or decoding, and more particularly, to an apparatus and method for applying an artificial neural network (ANN) to image encoding or decoding.

The contents described in this section merely provide background information on the present embodiment and do not constitute prior art.

Since moving image data is much larger than audio data or still image data, storing or transmitting it without compression consumes a large amount of hardware resources, including memory. Therefore, moving image data is generally compressed by an encoder before being stored or transmitted, and the compressed moving image data is decompressed by a decoder before being reproduced.

Meanwhile, as the demand for video content such as high-capacity games and 360-degree video increases rapidly, image size, resolution, frame rate, and the like increase. As a result, the amount of data to be compressed is also rapidly increasing, and the demand for new high efficiency compression technology is increasing.

According to recent experimental results, replacing the in-loop filter of a conventional video encoding or decoding apparatus with a Convolutional Neural Network (CNN) filter, which is a kind of artificial neural network, has been found to achieve a Bjøntegaard-delta bit rate (BDBR) gain of about 3.57%. Accordingly, image encoding and decoding techniques using artificial neural networks have attracted attention as a solution to the above problem.

In this disclosure, various techniques for applying artificial neural network technology to an image encoding or decoding process are presented.

Some techniques of this disclosure relate to mitigating quantization errors and blocking artifacts using CNN-based filters.

According to an aspect of the present disclosure, there is provided a method of decoding an image using a CNN-based filter, the method comprising: inputting a first picture and at least one of a quantization parameter map and a block division map to the CNN-based filter; and outputting a second picture, wherein the quantization parameter map indicates information about coding units constituting the first picture, and the block division map indicates information about divided regions constituting the first picture.

According to an aspect of the present disclosure, there is provided an image decoding apparatus using a CNN-based filter, the apparatus comprising: an input unit configured to receive a first picture and at least one of a quantization parameter map and a block division map; a filter unit configured to apply coefficients of the CNN-based filter to the first picture and the at least one of the quantization parameter map and the block division map received by the input unit; and an output unit configured to output a second picture obtained by applying the coefficients of the CNN-based filter to the first picture and the at least one of the quantization parameter map and the block division map, wherein the quantization parameter map indicates information about coding units constituting the first picture, and the block division map indicates information about divided regions constituting the first picture.

According to such a method and apparatus, image quality may be improved and quantization errors and blocking artifacts may be mitigated using a filter trained through supervised learning.
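The filtering step described above can be sketched roughly as follows. This is a minimal illustration, not the disclosed network: it assumes the first picture, the quantization parameter map, and the block division map are equally sized 2-D arrays that are stacked as input channels, and it applies a single 3x3 convolution with zero padding. The function names `conv3x3` and `cnn_filter` and the single-layer structure are hypothetical.

```python
def conv3x3(channels, kernels, bias=0.0):
    """Apply one 3x3 kernel per input channel and sum the results (zero padding)."""
    h, w = len(channels[0]), len(channels[0][0])
    out = [[bias] * w for _ in range(h)]
    for plane, k in zip(channels, kernels):
        for y in range(h):
            for x in range(w):
                acc = 0.0
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        yy, xx = y + dy, x + dx
                        if 0 <= yy < h and 0 <= xx < w:
                            acc += plane[yy][xx] * k[dy + 1][dx + 1]
                out[y][x] += acc
    return out


def cnn_filter(first_picture, qp_map, block_map, kernels):
    """Stack the picture and the two side-information maps as channels, then filter.

    Assumption: one convolution layer stands in for the whole CNN.
    """
    return conv3x3([first_picture, qp_map, block_map], kernels)
```

With an identity kernel for the picture channel and all-zero kernels for the two maps, the second picture equals the first picture, which is a convenient sanity check before substituting trained coefficients.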

Some techniques of this disclosure relate to performing CNN-based intra prediction.

According to an aspect of the present disclosure, there is provided an image decoding method using a CNN-based intra prediction unit, the method comprising: decoding, from a bitstream, transform coefficients for a current block to be decoded; determining input data including a reference region decoded before the current block; generating prediction pixels of the current block by applying predetermined filter coefficients of a CNN to the input data; inversely transforming the transform coefficients to generate residual signals for the current block; and reconstructing the current block by using the prediction pixels and the residual signals.

According to an aspect of the present disclosure, there is provided an image decoding apparatus using a CNN-based intra prediction unit, the apparatus comprising: a decoder configured to decode, from a bitstream, transform coefficients of a current block to be decoded; a CNN setting unit configured to determine input data including a reference region decoded before the current block; a CNN execution unit configured to generate prediction pixels of the current block by applying predetermined filter coefficients of a CNN to the input data; an inverse transform unit configured to inversely transform the transform coefficients to generate residual signals for the current block; and an adder configured to reconstruct the current block by using the prediction pixels and the residual signals.

According to this method and apparatus, by performing CNN-based intra prediction, it is possible to maintain the complexity of the decoding process while improving the prediction accuracy.
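The final reconstruction step of the method above, which combines the prediction pixels with the residual signals, can be sketched as below. The clipping to the valid sample range and the `bit_depth` parameter are assumptions added for illustration; the text only states that the two are used together to reconstruct the block.

```python
def reconstruct(prediction, residual, bit_depth=8):
    """Add residual signals to prediction pixels and clip to the sample range.

    Assumption: samples are unsigned integers of `bit_depth` bits.
    """
    hi = (1 << bit_depth) - 1
    return [[min(max(p + r, 0), hi) for p, r in zip(p_row, r_row)]
            for p_row, r_row in zip(prediction, residual)]
```

The same step applies regardless of whether the prediction pixels come from the CNN or from a conventional prediction mode.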

Some techniques of this disclosure relate to performing CNN-based inter prediction.

According to an aspect of the present disclosure, there is provided an image decoding method comprising: setting input data including a search region in at least one reference picture; generating a motion vector of a current block or prediction pixels of the current block by applying predetermined filter coefficients of a CNN to the input data; inversely transforming transform coefficients extracted from a bitstream to generate residual signals for the current block; and reconstructing the current block by using the residual signals and the motion vector of the current block or the prediction pixels of the current block.

According to an aspect of the present disclosure, there is provided an image decoding apparatus comprising: a CNN setting unit configured to set input data including a search region in at least one reference picture; a CNN execution unit configured to generate a motion vector of a current block or prediction pixels of the current block by applying predetermined filter coefficients of a CNN to the input data; and an inverse transform unit configured to inversely transform transform coefficients extracted from a bitstream to reconstruct residual signals, wherein the current block is reconstructed using the residual signals and the motion vector of the current block or the prediction pixels of the current block.

According to an aspect of the present disclosure, there is provided an image decoding method comprising: generating a motion vector of a current block or prediction pixels of the current block by using a syntax element for an inter prediction mode extracted from a bitstream; setting input data including a search region in at least one reference picture and the motion vector of the current block or the prediction pixels of the current block; generating a refined motion vector of the current block or refined prediction pixels of the current block by applying predetermined filter coefficients of a CNN to the input data; generating residual signals by inversely transforming transform coefficients extracted from the bitstream; and reconstructing the current block by using the residual signals and the refined motion vector of the current block or the refined prediction pixels of the current block.

According to an aspect of the present disclosure, there is provided an image decoding apparatus comprising: a CNN setting unit configured to set, as input data, a motion vector of a current block or prediction pixels of the current block generated using a syntax element for an inter prediction mode extracted from a bitstream; a CNN execution unit configured to apply predetermined filter coefficients of a CNN to the input data to generate a refined motion vector of the current block or refined prediction pixels of the current block; and an inverse transform unit configured to inversely transform transform coefficients extracted from the bitstream to generate residual signals, wherein the current block is reconstructed using the residual signals and the refined motion vector of the current block or the refined prediction pixels of the current block.

According to these methods and apparatuses, by performing CNN-based inter prediction, it is possible to maintain the complexity of the decoding process while improving prediction accuracy.

Some techniques of this disclosure relate to filtering a reference region used for intra prediction using a CNN-based filter.

According to an aspect of the present disclosure, there is provided an image decoding method comprising: decoding, from a bitstream, transform coefficients for a current block to be decoded; setting input data of a CNN by using a first reference region decoded before the current block; generating a second reference region by filtering the first reference region through applying predetermined filter coefficients of the CNN to the input data; generating a prediction block of the current block by performing intra prediction using the second reference region; inversely transforming the transform coefficients to generate residual signals for the current block; and reconstructing the current block by using the prediction block and the residual signals.

According to an aspect of the present embodiment, there is provided an image decoding apparatus comprising: a decoding unit configured to decode, from a bitstream, transform coefficients for a current block to be decoded; an intra predictor configured to generate a prediction block of the current block by performing intra prediction using a second reference region obtained by filtering a first reference region selected from regions decoded before the current block; an inverse transform unit configured to inversely transform the transform coefficients to generate residual signals for the current block; and an adder configured to reconstruct the current block by using the prediction block and the residual signals, wherein the second reference region is generated by filtering the first reference region through applying preset filter coefficients of a CNN to input data set by using the first reference region.

According to such a method and apparatus, by using a CNN-based filter to filter the surrounding area used for intra prediction of the current block, prediction accuracy can be greatly improved without significantly changing the existing intra prediction structure.

FIG. 1 is an exemplary block diagram of an image encoding apparatus that may implement techniques of this disclosure.

FIG. 2 is a diagram for explaining a method of dividing a block using a QTBTTT structure.

FIG. 3 is an exemplary diagram of a plurality of intra prediction modes.

FIG. 4 is an exemplary block diagram of an image decoding apparatus that may implement techniques of this disclosure.

FIG. 5 is a diagram illustrating an exemplary structure of a CNN that may be used in the techniques of this disclosure.

FIG. 6 illustrates a CNN-based filter according to an embodiment of the present invention.

FIGS. 7A to 7C are diagrams illustrating structures of a CNN having different positions of a concatenated layer according to an embodiment of the present invention.

FIGS. 8A to 8C illustrate data to be input to an input layer of a CNN according to an embodiment of the present invention.

FIGS. 9A and 9B are diagrams illustrating an example of a block partitioning map according to an embodiment of the present invention.

FIGS. 10A to 10C are diagrams illustrating another example of a block partitioning map according to an embodiment of the present invention.

FIGS. 11A to 11C illustrate block division maps for adjusting the strength of deblocking according to an embodiment of the present invention.

FIG. 12 illustrates a flowchart of decoding an image using a CNN-based filter according to an embodiment of the present invention.

FIG. 13 is a diagram schematically illustrating a configuration of an apparatus for decoding an image according to an embodiment of the present invention.

FIG. 14 is a block diagram illustrating a configuration of a CNN prediction unit that may be included in an image encoding apparatus according to an embodiment of the present invention.

FIG. 15 is an exemplary diagram of a peripheral region that can be used as input data of a CNN.

FIG. 16 is a diagram illustrating an example of configuring an input layer of a CNN from a plurality of neighboring blocks.

FIG. 17 is an exemplary diagram for describing a prediction direction suitable for a current block in view of pixel value types of neighboring blocks.

FIG. 18 is an exemplary diagram of a layer configuration of a CNN including hint information.

FIG. 19 is a block diagram illustrating a configuration of a CNN prediction unit that may be included in an image decoding apparatus according to an embodiment of the present invention.

FIG. 20 is a flowchart illustrating an operation of a CNN prediction unit that may be included in the image encoding apparatus illustrated in FIG. 14.

FIG. 21 is a flowchart illustrating an operation of a CNN predictor that may be included in the image decoding apparatus illustrated in FIG. 19.

FIG. 22 is a block diagram illustrating a configuration of a CNN prediction unit that may be included in an image encoding apparatus according to an embodiment of the present invention.

FIG. 23 is an exemplary diagram of a layer configuration of a CNN.

FIG. 24 is an exemplary diagram of time-axis distance information between a current picture and a reference picture.

FIG. 25 is an exemplary diagram of a layer configuration of a CNN including hint information.

FIG. 26 is a block diagram illustrating a configuration of a CNN prediction unit that may be included in an image decoding apparatus according to an embodiment of the present invention.

FIGS. 27A and 27B are flowcharts illustrating a process of performing inter prediction by a CNN prediction unit included in an image encoding apparatus according to an embodiment of the present invention.

FIGS. 28A and 28B are flowcharts illustrating a process of performing inter prediction by a CNN prediction unit included in an image decoding apparatus according to an embodiment of the present invention.

FIG. 29 is a flowchart illustrating a method of calculating filter coefficients of a CNN according to an embodiment of the present invention.

FIG. 30 is an exemplary diagram of a peripheral region that can be used as input data of a CNN according to an embodiment of the present invention.

FIG. 31 is an exemplary diagram of a layer configuration of a CNN according to an embodiment of the present invention.

FIG. 32 is a block diagram illustrating a configuration of a CNN-based filter unit according to an embodiment of the present invention.

FIG. 33 is a flowchart illustrating a filtering process of a reference region according to an embodiment of the present invention.

FIG. 34 is a flowchart illustrating a filtering process of a reference region according to an embodiment of the present invention.

Hereinafter, some embodiments of the present invention will be described in detail with reference to the exemplary drawings. In assigning reference numerals to the components of each drawing, note that the same components are given the same reference numerals wherever possible, even when they appear in different drawings. In addition, in describing the present invention, detailed descriptions of related well-known configurations or functions are omitted when they may obscure the gist of the present invention.

In addition, in describing the components of the present invention, terms such as first, second, A, B, (a), and (b) may be used. These terms serve only to distinguish one component from another, and the nature, sequence, or order of the components is not limited by them. Throughout the specification, when a part is said to 'include' or 'comprise' a component, this means that the part may further include other components, rather than excluding them, unless stated otherwise. In addition, terms such as 'unit' and 'module' used in the specification denote a unit that processes at least one function or operation, which may be implemented by hardware, software, or a combination of hardware and software.

FIG. 1 is an exemplary block diagram of an image encoding apparatus that may implement techniques of this disclosure. Hereinafter, an image encoding apparatus and subcomponents thereof will be described with reference to FIG. 1.

As shown in FIG. 1, the image encoding apparatus may include a block splitter 110, a predictor 120, a subtractor 130, a transformer 140, a quantizer 145, an encoder 150, an inverse quantization unit 160, an inverse transform unit 165, an adder 170, a filter unit 180, and a memory 190.

Each component of the image encoding apparatus may be implemented in hardware, in software, or in a combination of hardware and software. In addition, the function of each component may be implemented in software, and a microprocessor may be configured to execute the software function corresponding to each component.

One image (video) is composed of a plurality of pictures. Each picture is divided into a plurality of regions, and encoding is performed for each region. For example, one picture is divided into one or more tiles. Here, one or more tiles may be defined as a tile group. Each tile is divided into one or more coding tree units (CTUs). Each CTU is divided into one or more coding units (CUs) by a tree structure.

Information applied to each CU is encoded as the syntax of the CU, and information commonly applied to the CUs included in one CTU is encoded as the syntax of the CTU. In addition, information commonly applied to all blocks in one tile is encoded as the syntax of the tile or as the syntax of the tile group to which the tile belongs. Information applied to all blocks constituting one picture is encoded in a picture parameter set (PPS) or a picture header.

Furthermore, information commonly referred to by a plurality of pictures is encoded in a sequence parameter set (SPS), and information commonly referred to by one or more SPSs is encoded in a video parameter set (VPS).

The block splitter 110 determines the size of a coding tree unit (CTU). Information on the size of the CTU (CTU size) is encoded as the syntax of the SPS or PPS and transmitted to the image decoding apparatus.

After dividing each picture constituting an image into a plurality of coding tree units (CTUs) having a predetermined size, the block splitter 110 recursively divides each CTU using a tree structure. A leaf node in the tree structure becomes a coding unit (CU), which is the basic unit of coding.

The tree structure may be a quad tree (QT), in which an upper node (or parent node) is divided into four lower nodes (or child nodes) of the same size; a binary tree (BT), in which an upper node is divided into two lower nodes; or a ternary tree (TT), in which an upper node is divided into three lower nodes at a 1:2:1 ratio.
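The three split types can be illustrated with a small geometry helper. This is a sketch under assumptions: blocks are represented as `(x, y, width, height)` tuples, BT splits are taken to be symmetric halves, and `vertical=True` is taken to mean that the split line is vertical (the width is divided). The function name `split_block` is hypothetical.

```python
def split_block(x, y, w, h, tree, vertical=True):
    """Return the child blocks (x, y, w, h) produced by a QT, BT, or TT split."""
    if tree == "QT":  # four equal quadrants
        hw, hh = w // 2, h // 2
        return [(x, y, hw, hh), (x + hw, y, hw, hh),
                (x, y + hh, hw, hh), (x + hw, y + hh, hw, hh)]
    if tree == "BT":  # two equal halves (symmetric split assumed)
        if vertical:
            return [(x, y, w // 2, h), (x + w // 2, y, w // 2, h)]
        return [(x, y, w, h // 2), (x, y + h // 2, w, h // 2)]
    if tree == "TT":  # three parts at a 1:2:1 ratio
        if vertical:
            q = w // 4
            return [(x, y, q, h), (x + q, y, 2 * q, h), (x + 3 * q, y, q, h)]
        q = h // 4
        return [(x, y, w, q), (x, y + q, w, 2 * q), (x, y + 3 * q, w, q)]
    raise ValueError(f"unknown tree type: {tree}")
```

For a 32x32 block, a vertical TT split yields widths of 8, 16, and 8, matching the 1:2:1 ratio.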

In addition, a structure in which two or more of a QT structure, a BT structure, and a TT structure are mixed may be included. For example, a Quad Tree plus Binary Tree (QTBT) structure may be used, and a Quad Tree plus Binary Tree Ternary Tree (QTBTTT) structure may be used.

FIG. 2 is a diagram for explaining a method of dividing a block using a QTBTTT structure. As shown in FIG. 2, the CTU may first be divided according to the QT structure. Quadtree splitting may be repeated until the size of the split block reaches the minimum block size (MinQTSize) of a leaf node allowed in the QT.

If a leaf node of the QT is not larger than the maximum block size (MaxBTSize) of a root node allowed in the BT, it may be further divided using one or more of the BT structure and the TT structure. In the BT structure and/or the TT structure, there may be a plurality of division directions.

For example, there may be two directions: splitting the block of a node horizontally (i.e., horizontal splitting) and splitting it vertically (i.e., vertical splitting).

As shown in FIG. 2, when BTTT splitting is performed, a flag indicating whether a node is split, a flag indicating the division direction (vertical or horizontal), and/or a flag indicating the partition type (binary or ternary) may be signaled to the image decoding apparatus.
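The way a decoder might rebuild the partition from the signaled flags can be sketched as a depth-first parser. This is an illustrative simplification, not the actual syntax: the flag order (split flag, then direction, then type), the `MIN_CU` lower bound, and the function names are assumptions, and `flags` is simply an iterator of 0/1 values.

```python
MIN_CU = 4  # hypothetical smallest block side; not taken from the text


def children(x, y, w, h, vertical, ternary):
    """Child blocks of a BT (two halves) or TT (1:2:1) split."""
    ratios = (1, 2, 1) if ternary else (1, 1)
    total, out, offset = sum(ratios), [], 0
    for r in ratios:
        if vertical:
            size = w * r // total
            out.append((x + offset, y, size, h))
        else:
            size = h * r // total
            out.append((x, y + offset, w, size))
        offset += size
    return out


def parse_cus(x, y, w, h, flags, min_qt, max_bt, qt_ok=True):
    """Read split flags depth-first from an iterator and return the leaf CUs."""
    if qt_ok and w > min_qt and next(flags):             # QT split flag
        s = w // 2
        quads = [(x, y), (x + s, y), (x, y + s), (x + s, y + s)]
        return [cu for qx, qy in quads
                for cu in parse_cus(qx, qy, s, s, flags, min_qt, max_bt)]
    if max(w, h) <= max_bt and min(w, h) > MIN_CU and next(flags):  # BTTT split flag
        vertical, ternary = next(flags), next(flags)     # direction, then type
        return [cu for c in children(x, y, w, h, vertical, ternary)
                for cu in parse_cus(*c, flags, min_qt, max_bt, qt_ok=False)]
    return [(x, y, w, h)]


# A 32x32 CTU: one QT split, then a vertical BT split of the second quadrant.
cus = parse_cus(0, 0, 32, 32, iter([1, 0, 1, 1, 0, 0, 0, 0, 0]), 16, 16)
```

In this sketch, no QT flag is read once a block has reached `min_qt`, and QT splitting is not revisited below a BT or TT split, which mirrors the order described above (QT first, then BT or TT on the QT leaf nodes).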

Meanwhile, there may additionally be a split type in which the block of a node is divided into two asymmetric blocks. The asymmetric types may include dividing the block of a node into two rectangular blocks with a size ratio of 1:3, dividing the block of a node in a diagonal direction, and the like.

When QTBT is used as another example of the tree structure, the CTU is first divided into the QT structure, and then the leaf nodes of the QT may be further divided into the BT structure.

The CU may have various sizes depending on the QTBT or QTBTTT splitting from the CTU. Hereinafter, a block corresponding to a CU (that is, a leaf node of QTBTTT) to be encoded or decoded is referred to as a 'current block'.

The predictor 120 predicts the current block and generates a prediction block. The predictor 120 may include an intra predictor 122 and an inter predictor 124. In general, each of the current blocks within a picture may be predictively coded. Prediction of the current block may be performed using an intra prediction technique, which uses data of the picture containing the current block, or an inter prediction technique, which uses data of a picture coded before the picture containing the current block.

The intra predictor 122 predicts pixels in the current block by using pixels (reference pixels) positioned around the current block in the current picture including the current block. There are a plurality of intra prediction modes according to the prediction direction. For example, as shown in FIG. 3, the plurality of intra prediction modes may include non-directional modes, including a planar mode and a DC mode, and 65 directional modes. The neighboring pixels to be used and the calculation formula are defined differently for each prediction mode.
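As one concrete example of a mode-specific formula, the DC mode is commonly understood to fill the block with the mean of the reference pixels. The sketch below assumes integer samples, a reference row above and a reference column to the left of the block, and rounding to nearest; the function name is hypothetical and real codecs differ in details such as how non-square blocks are handled.

```python
def dc_predict(top, left):
    """Fill the block with the rounded mean of the top and left reference pixels."""
    refs = list(top) + list(left)
    dc = (sum(refs) + len(refs) // 2) // len(refs)  # integer mean with rounding
    return [[dc] * len(top) for _ in range(len(left))]
```

Directional modes would instead copy or interpolate reference pixels along the selected angle, and the planar mode blends horizontal and vertical interpolations.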

In addition, as described below, the intra predictor 122 may predict pixels in the current block using reference pixels through a CNN-based learning and inference process. In this case, the intra predictor 122 may operate the CNN-based intra prediction mode (hereinafter also referred to as the "CNN mode") in parallel with the plurality of intra prediction modes illustrated in FIG. 3. Alternatively, the intra predictor 122 may operate only the CNN mode.

The intra predictor 122 may determine an intra prediction mode to use to encode the current block. In some examples, the intra predictor 122 may encode the current block using several intra prediction modes and select an appropriate intra prediction mode from the tested modes. For example, the intra predictor 122 may calculate rate-distortion values using rate-distortion analysis for the tested intra prediction modes and select the intra prediction mode having the best rate-distortion characteristics among them.
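A common way to express such a rate-distortion comparison is a Lagrangian cost J = D + λ·R, where D is the distortion, R is the number of bits, and λ is a weighting factor. The text does not specify the cost function, so the formula, the function name, and the sample numbers below are assumptions for illustration.

```python
def select_mode(candidates, lam):
    """Pick the mode minimizing J = D + lam * R.

    `candidates` is a list of (mode, distortion, bits) tuples (hypothetical layout).
    """
    return min(candidates, key=lambda c: c[1] + lam * c[2])[0]


tested = [("PLANAR", 100, 10), ("DC", 120, 4), ("CNN", 80, 20)]
best = select_mode(tested, lam=5)  # costs are 150, 140, and 180
```

A larger λ favors modes that cost fewer bits, while a smaller λ favors modes with lower distortion.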

The intra predictor 122 selects one intra prediction mode from among the plurality of intra prediction modes, and predicts the current block by using a neighboring pixel (reference pixel) and an operation formula determined according to the selected intra prediction mode. As described later, in the CNN mode, the intra predictor 122 predicts the current block by using input data and coefficient values of the convolution kernel.

Information about the selected intra prediction mode is encoded by the encoder 150 and transmitted to the image decoding apparatus. To efficiently encode the intra prediction mode information indicating which of the plurality of intra prediction modes is used as the intra prediction mode of the current block, the intra predictor 122 may determine some of the plurality of intra prediction modes that are most likely to be the intra prediction mode of the current block as most probable modes (MPMs).

The MPM list may include the intra prediction modes of neighboring blocks of the current block, the planar mode, and the DC mode. In addition, the MPM list may further include the CNN mode.

When the intra prediction mode of the current block is selected from the MPMs, first intra identification information indicating which of the MPMs is selected as the intra prediction mode of the current block is encoded by the encoder 150 and signaled to the image decoding apparatus.

On the other hand, when the intra prediction mode of the current block is not selected from the MPMs, second intra identification information indicating which of the modes other than the MPMs is selected as the intra prediction mode of the current block is encoded by the encoder 150 and signaled to the image decoding apparatus.
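The two signaling cases can be sketched as follows. The list order, the absence of a fixed list size, the mode labels, and the numbering of the directional modes as 2 to 66 are assumptions made for illustration only.

```python
def build_mpm_list(left_mode, above_mode, cnn_mode=None):
    """Collect unique candidates: neighbour modes first, then planar, DC, and CNN."""
    mpm = []
    for m in (left_mode, above_mode, "PLANAR", "DC", cnn_mode):
        if m is not None and m not in mpm:
            mpm.append(m)
    return mpm


def signal_intra_mode(mode, mpm, all_modes):
    """Return (is_mpm, index) for the selected intra prediction mode."""
    if mode in mpm:
        return True, mpm.index(mode)        # first intra identification information
    remaining = [m for m in all_modes if m not in mpm]
    return False, remaining.index(mode)     # second intra identification information


# Hypothetical mode set: planar, DC, 65 directional modes (2..66), and the CNN mode.
ALL_MODES = ["PLANAR", "DC"] + list(range(2, 67)) + ["CNN"]
mpm = build_mpm_list(left_mode=18, above_mode=50, cnn_mode="CNN")
```

Because the index of a non-MPM mode is taken over the remaining modes only, the second identification information needs to distinguish fewer values than the full mode set.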

The inter prediction unit 124 searches, through a motion estimation process, for the block most similar to the current block in a reference picture encoded and decoded before the current picture, and generates a prediction block for the current block from the found block through a motion compensation process. Inter prediction may generally be classified into uni-directional prediction and bi-directional prediction according to the prediction direction. Uni-directional prediction refers to a method of predicting the current block using only pictures displayed before the current picture or only pictures displayed after the current picture on the time axis. Bi-directional prediction refers to a method of predicting the current block by referring to both a picture displayed before and a picture displayed after the current picture on the time axis.

The inter prediction unit 124 generates a motion vector corresponding to a displacement between the current block in the current picture and the prediction block in the reference picture. In general, motion estimation is performed on a luma component, and a motion vector calculated based on the luma component is used for both the luma component and the chroma component.
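A minimal sketch of motion estimation by exhaustive block matching is given below. It assumes a single component (for example luma) stored as a 2-D list, integer-pixel displacements, and the sum of absolute differences (SAD) as the similarity measure; none of these choices is stated in the text, and the function names are hypothetical.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the current block and a displaced reference block."""
    return sum(abs(cur[by + j][bx + i] - ref[by + dy + j][bx + dx + i])
               for j in range(n) for i in range(n))


def motion_search(cur, ref, bx, by, n, search_range):
    """Full search: return the displacement (dx, dy) with the lowest SAD."""
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            # Skip displacements that would read outside the reference picture.
            if 0 <= bx + dx and bx + dx + n <= w and 0 <= by + dy and by + dy + n <= h:
                cost = sad(cur, ref, bx, by, dx, dy, n)
                if best is None or cost < best[0]:
                    best = (cost, (dx, dy))
    return best[1]
```

The returned displacement plays the role of the motion vector, and the reference block at that displacement serves as the prediction block in motion compensation.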

The motion information including the information about the reference picture and the motion vector used to predict the current block is encoded by the encoder 150 and transmitted to the image decoding apparatus.

Various methods may be used to reduce or minimize the number of bits required to encode motion information. Representative examples include the Skip mode, the Merge mode, and the advanced motion vector prediction (AMVP) mode.

In the Skip mode and the Merge mode, first, a predetermined number of candidate blocks are selected from neighboring blocks. When the candidate blocks are selected, the inter prediction unit 124 constructs a merge list including candidate blocks, and selects motion information to be used as motion information of the current block among motion information of candidate blocks included in the list. A merge index value for identifying the selected motion information (selected candidate block) is generated.

The index value of the selected motion information, that is, the merge index value, is encoded and signaled to the image decoding apparatus. In the HEVC standard, index values for Skip / Merge mode are expressed through the merge_idx syntax.
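The merge mechanism can be sketched as below. The candidate order, the duplicate pruning, the maximum of five candidates, and the representation of motion information as a `(motion_vector, reference_index)` tuple are assumptions for illustration and are not taken from the text or from the HEVC specification.

```python
def build_merge_list(neighbor_motions, max_candidates=5):
    """Keep available, non-duplicate neighbour motion information up to a fixed count."""
    merge_list = []
    for motion in neighbor_motions:
        if motion is not None and motion not in merge_list:
            merge_list.append(motion)
        if len(merge_list) == max_candidates:
            break
    return merge_list


def choose_merge_index(merge_list, cost_fn):
    """Encoder side: index of the candidate with the lowest cost."""
    return min(range(len(merge_list)), key=lambda i: cost_fn(merge_list[i]))


neighbors = [((1, 0), 0), None, ((1, 0), 0), ((3, -2), 1)]
merge_list = build_merge_list(neighbors)
# Hypothetical cost: distance of the candidate vector from a target vector (3, -2).
merge_idx = choose_merge_index(merge_list, lambda m: abs(m[0][0] - 3) + abs(m[0][1] + 2))
```

The decoder builds the same list from its own decoded neighbors, so only the index has to be transmitted, and `merge_list[merge_idx]` recovers the motion information.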

In the AMVP mode, motion vector predictor (MVP) candidates for the motion vector of the current block are first derived using neighboring blocks of the current block. When the predicted motion vector candidates are derived, the inter prediction unit 124 determines the predicted motion vector (mvp) for the motion vector of the current block and calculates the differential motion vector (motion vector difference, mvd) by subtracting the determined predicted motion vector from the motion vector of the current block. The calculated differential motion vector is encoded and signaled to the image decoding apparatus.

The process of determining the predicted motion vector from the predicted motion vector candidates may be implemented through a predefined function (e.g., a median operation, an average operation, etc.). In this case, the image decoding apparatus is configured to apply the same predefined function.

Since the neighboring blocks used to derive the predicted motion vector candidates are blocks that have already been encoded and decoded, the image decoding apparatus already knows the motion vectors of the neighboring blocks. Therefore, the information for identifying the predicted motion vector candidates does not need to be encoded, and the image encoding apparatus encodes only the information about the differential motion vector and the information about the reference picture used to predict the current block.
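The relationship between the motion vector, the predicted motion vector, and the differential motion vector can be sketched with a median predictor. The component-wise median and the function names are assumptions; for an even number of candidates this sketch simply takes the upper middle value.

```python
def median_mvp(candidates):
    """Component-wise median of the candidate motion vectors."""
    xs = sorted(mv[0] for mv in candidates)
    ys = sorted(mv[1] for mv in candidates)
    mid = len(candidates) // 2
    return xs[mid], ys[mid]


def encode_mvd(mv, candidates):
    """Encoder side: mvd = mv - mvp."""
    mvp = median_mvp(candidates)
    return mv[0] - mvp[0], mv[1] - mvp[1]


def decode_mv(mvd, candidates):
    """Decoder side: mv = mvd + mvp, using the same predefined function."""
    mvp = median_mvp(candidates)
    return mvd[0] + mvp[0], mvd[1] + mvp[1]


cands = [(4, 0), (2, 3), (3, -1)]   # motion vectors of neighbouring blocks
mvd = encode_mvd((5, 1), cands)     # the median predictor is (3, 0)
```

Because both sides derive the predictor from the same decoded neighbors with the same function, transmitting only `mvd` is enough for the decoder to recover the motion vector.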

The process of determining the prediction motion vector from the prediction motion vector candidates may be implemented by selecting any one of the prediction motion vector candidates. In such a case, the information for identifying the determined prediction motion vector is additionally coded together with the information about the differential motion vector and the reference picture used for predicting the current block.

In addition, as described below, the inter prediction unit 124 may predict the current block through a CNN-based inference process. To this end, the filter coefficients of the CNN, that is, the coefficients of the convolution kernel, may be set through the supervised learning process of the CNN.

In addition, the inter prediction unit 124 may first generate motion information or prediction pixels of the current block according to an existing inter prediction method (i.e., motion estimation (ME) and motion compensation (MC)), and then refine the generated motion information or prediction pixels through CNN-based learning and inference processes to predict the current block.

In addition, the inter prediction unit 124 may operate the CNN-based inter prediction method (hereinafter referred to as the "CNN prediction mode") in parallel with the existing inter prediction method. Alternatively, the inter prediction unit 124 may operate only the CNN prediction mode in place of the existing inter prediction method.

The subtractor 130 generates a residual block by subtracting the prediction block generated by the intra predictor 122 or the inter prediction unit 124 from the current block. The transform unit 140 transforms the residual signals in the residual block, which has pixel values in the spatial domain, into transform coefficients in the frequency domain.
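The subtraction and the spatial-to-frequency transform can be sketched as follows. The text does not name a specific transform; this sketch assumes a separable orthonormal DCT-II, which is a common choice in video coding, and uses floating-point arithmetic for readability. The function names are hypothetical.

```python
import math


def residual_block(current, prediction):
    """Residual = current block minus prediction block, sample by sample."""
    return [[c - p for c, p in zip(c_row, p_row)]
            for c_row, p_row in zip(current, prediction)]


def dct_1d(v):
    """Orthonormal DCT-II of a sequence."""
    n = len(v)
    out = []
    for k in range(n):
        scale = math.sqrt((1 if k == 0 else 2) / n)
        out.append(scale * sum(v[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                               for i in range(n)))
    return out


def dct_2d(block):
    """Separable 2-D transform: rows first, then columns."""
    rows = [dct_1d(r) for r in block]
    cols = [dct_1d(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]
```

For a flat residual, all of the energy ends up in the single lowest-frequency (DC) coefficient, which is why smooth residuals compress well after the transform.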

The transform unit 140 may transform the residual signals in the residual block using the size of the current block as the transform unit, or may divide the residual block into a plurality of smaller subblocks and transform the residual signals using the subblock size as the transform unit.

There may be various ways of dividing the residual block into smaller subblocks. For example, the residual block may be divided into subblocks of the same size, or may be divided by a quad tree (QT) method with the residual block as the root node.

The quantization unit 145 quantizes the transform coefficients output from the transform unit 140, and outputs the quantized transform coefficients to the encoder 150.

The encoder 150 generates a bitstream by encoding the quantized transform coefficients using an encoding method such as CABAC. The encoder 150 also encodes and signals information associated with block division, such as a CTU size, a QT split flag, a BTTT split flag, a split direction, and a split type, so that the image decoding apparatus can split the block in the same manner as the image encoding apparatus.

In addition, the encoder 150 encodes information on a prediction type indicating whether the current block is encoded by intra prediction or inter prediction and, according to the prediction type, encodes intra prediction information (that is, information on the intra prediction mode) or inter prediction information (information about a reference picture and a motion vector).

Furthermore, when both the existing inter prediction scheme and the CNN based inter prediction scheme are used, the encoder 150 encodes information (flag) indicating whether to use the CNN based inter prediction scheme. When the CNN-based inter prediction scheme is not used, since the conventional inter prediction scheme is used, the encoder 150 encodes the information about the reference picture and the motion vector as the inter prediction information. On the other hand, when the CNN based inter prediction method is used, the encoder 150 encodes information necessary for performing CNN based inter prediction as inter prediction information. As another example, when only the CNN based inter prediction method is used in place of the existing inter prediction method, the encoder 150 encodes information necessary for performing CNN based inter prediction. Information necessary for performing CNN-based inter prediction may include selection information on input data or filter coefficients of the CNN, which will be described later in detail with reference to FIG. 22.

The inverse quantizer 160 inversely quantizes the quantized transform coefficients output from the quantizer 145 to generate transform coefficients. The inverse transformer 165 restores the residual block by converting the transform coefficients output from the inverse quantizer 160 from the frequency domain to the spatial domain.
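As a rough illustration of what the quantization unit 145 and the inverse quantizer 160 do, the sketch below uses uniform scalar quantization with a single step size. The step size and function names are assumptions made for the example; the text does not specify the quantizer design.

```python
import numpy as np

def quantize(coeffs, step):
    """Map transform coefficients to integer levels (uniform quantizer, assumed)."""
    return np.round(coeffs / step).astype(np.int32)

def dequantize(levels, step):
    """Inverse quantization: scale the levels back to approximate coefficients."""
    return levels.astype(np.float64) * step

coeffs = np.array([[80.0, -13.0], [6.0, 1.0]])
levels = quantize(coeffs, step=8.0)
recon = dequantize(levels, step=8.0)
max_error = np.max(np.abs(recon - coeffs))   # bounded by half the step size
```

The reconstruction error introduced here is the lossy part of the chain, and it is what the in-loop filters described next try to compensate.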

The adder 170 reconstructs the current block by adding the reconstructed residual block and the prediction block generated by the predictor 120. The pixels in the reconstructed current block are used as reference pixels for intra prediction of the next order block.

The filter unit 180 performs filtering on the reconstructed pixels to reduce blocking artifacts, ringing artifacts, blurring artifacts, and the like that occur due to block-based prediction and transform/quantization. The filter unit 180 may include a deblocking filter 182 and an SAO filter 184.

The deblocking filter 182 filters the boundaries between the reconstructed blocks to remove blocking artifacts caused by block-by-block encoding/decoding, and the SAO filter 184 performs additional filtering on the deblocking-filtered image. The SAO filter 184 is a filter used to compensate for the difference between the reconstructed pixel and the original pixel caused by lossy coding.

The reconstructed blocks filtered through the deblocking filter 182 and the SAO filter 184 are stored in the memory 190. When all the blocks in a picture are reconstructed, the reconstructed picture is used as a reference picture for inter prediction of a block in a picture to be encoded later.

FIG. 4 is an exemplary block diagram of an image decoding apparatus that may implement the techniques of this disclosure. Hereinafter, an image decoding apparatus and its subcomponents will be described with reference to FIG. 4.

As shown in FIG. 4, the image decoding apparatus includes a decoder 410, an inverse quantizer 420, an inverse transformer 430, a predictor 440, an adder 450, a filter 460, and a memory 470.

In the image decoding apparatus, as in the image encoding apparatus illustrated in FIG. 1, each component may be implemented as a hardware chip, or the functions of each component may be implemented in software and a microprocessor may be configured to execute the software corresponding to each function.

The decoder 410 decodes the bitstream received from the image encoding apparatus, extracts information related to block division (partition information of the luma block and/or partition information of the chroma block), and uses it to determine the current block to be decoded. The decoder 410 also extracts the prediction information and the information about the residual signal necessary to reconstruct the current block.

The decoder 410 extracts information on the CTU size from a Sequence Parameter Set (SPS) or Picture Parameter Set (PPS) to determine the size of the CTU, and divides the picture into a CTU of the determined size. In addition, the decoder 410 determines the CTU as the highest layer of the tree structure, that is, the root node, extracts partition information from the bitstream, and divides or reconstructs the block using the extracted information.

In addition, the decoder 410 extracts information about whether a block is divided into BT and a split type (dividing direction) for a node corresponding to a leaf node of a QT split, and divides the corresponding leaf node into a BT structure.

As another example, when splitting or reconstructing a block using the QTBTTT structure, the decoder 410 extracts information (a flag) on whether to perform QT splitting and divides each node into four nodes of a lower layer. For a node corresponding to a leaf node of the QT splitting (a node where QT splitting no longer occurs), the decoder 410 extracts information on whether the node is further split into BT or TT, information on the split direction, and split type information that distinguishes the BT structure from the TT structure, and recursively splits the node into the BT or TT structure.

As another example, when splitting or reconstructing a block using the QTBTTT structure, the decoder 410 extracts information (e.g., a flag) on whether the block is split and, when the block is split, extracts split type information. When the split type is QT, the decoder 410 divides each node into four nodes of a lower layer. When the split type indicates that the node is a leaf node of the QT splitting (a node where QT splitting no longer occurs), that is, that the node is split into BT or TT, the decoder 410 additionally extracts information on the split direction and split type information for distinguishing the BT structure from the TT structure, and splits the node into the BT or TT structure.
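The first QTBTTT variant described above (a QT split flag, followed at QT leaf nodes by a BT/TT split flag, a direction, and a type) can be sketched as a recursive parser. The order and binary coding of the syntax values, the `read` callback, and the omission of minimum-size constraints are all assumptions made for illustration; they are not taken from the disclosure.

```python
def parse_qtbttt(read, x, y, w, h, qt_allowed=True, out=None):
    """Recursively split a block; `read()` returns the next decoded syntax value (hypothetical)."""
    out = [] if out is None else out
    if qt_allowed and read():                      # QT split flag
        hw, hh = w // 2, h // 2
        for dy in (0, hh):
            for dx in (0, hw):
                parse_qtbttt(read, x + dx, y + dy, hw, hh, True, out)
        return out
    if read():                                     # BT/TT split flag at a QT leaf node
        vertical = read()                          # split direction
        ternary = read()                           # split type: BT (0) or TT (1)
        parts = [1, 2, 1] if ternary else [2, 2]   # child sizes in quarters of the parent
        offset = 0
        for p in parts:
            if vertical:
                parse_qtbttt(read, x + offset, y, w * p // 4, h, False, out)
                offset += w * p // 4
            else:
                parse_qtbttt(read, x, y + offset, w, h * p // 4, False, out)
                offset += h * p // 4
        return out
    out.append((x, y, w, h))                       # leaf: a block to be decoded
    return out

# Hypothetical flag sequence for a 16x16 CTU.
flags = [1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0]
blocks = parse_qtbttt(iter(flags).__next__, 0, 0, 16, 16)
```

Once a node has left the QT stage, `qt_allowed` is passed as False so that its BT/TT children cannot be QT-split again, matching the description of QT leaf nodes above.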

As such, when the current block to be decoded is determined using the split information, the decoder 410 extracts information about a prediction type indicating whether the current block is intra predicted or inter predicted.

When the prediction type information indicates intra prediction, the decoder 410 extracts a syntax element for intra prediction information (intra prediction mode) of the current block. When the prediction type information indicates inter prediction, the decoder 410 extracts a syntax element of the inter prediction information, that is, a motion vector and information indicating the reference picture to which the motion vector refers (motion information of the current block).

Meanwhile, the decoder 410 extracts information about quantized transform coefficients of the current block as information on the residual signal.

The inverse quantizer 420 inversely quantizes the quantized transform coefficients, and the inverse transformer 430 inversely transforms the inversely quantized transform coefficients from the frequency domain to the spatial domain to generate a residual block for the current block.

The predictor 440 may include an intra predictor 442 and an inter predictor 444. The intra predictor 442 is activated when the prediction type of the current block is intra prediction, and the inter predictor 444 is activated when the prediction type of the current block is inter prediction.

The intra prediction unit 442 determines the intra prediction mode of the current block among the plurality of intra prediction modes by using the syntax element for the intra prediction mode extracted from the decoder 410, and predicts the current block by using the reference pixels surrounding the current block according to the determined intra prediction mode.

As described below, when the intra prediction mode for the current block is determined to be the CNN mode, the intra prediction unit 442 may predict the current block by performing the CNN inference process using the coefficients of the convolution kernel (i.e., the filter coefficients) determined by the image encoding apparatus.

The inter prediction unit 444 determines a motion vector of the current block and a reference picture to which the motion vector refers by using syntax elements of the inter prediction mode extracted from the decoder 410, and predicts the current block using the motion vector and the reference picture.

As described below, the inter prediction unit 444 may generate a motion vector of the current block or directly generate prediction pixels of the current block through a CNN-based inference process.

In addition, the inter prediction unit 444 may first generate motion vectors or prediction pixels of the current block according to an existing inter prediction scheme (i.e., motion compensation (MC)), and then refine the motion vectors or prediction pixels through a CNN-based inference process to finally generate the motion vector or the prediction pixels of the current block.

The inter prediction unit 444 may operate the CNN-based inter prediction scheme in parallel with the existing inter prediction scheme. In this case, the decoder 410 further decodes information (e.g., a flag) indicating whether CNN-based inter prediction is used. The inter prediction unit 444 selectively performs the existing inter prediction scheme or CNN-based inter prediction according to the information, decoded by the decoder 410, indicating whether CNN-based inter prediction is used. Alternatively, the inter prediction unit 444 may operate only the CNN-based inter prediction scheme. In this case, the inter prediction unit 444 performs CNN-based inter prediction when the prediction type information (information indicating intra prediction or inter prediction) decoded by the decoder 410 indicates inter prediction.

The adder 450 reconstructs the current block by adding the residual block output from the inverse transformer 430 and the prediction block output from the inter predictor 444 or the intra predictor 442. The pixels in the reconstructed current block are used as reference pixels for intra prediction of a block to be decoded later.

The filter unit 460 includes a deblocking filter 462 and a SAO filter 464. The deblocking filter 462 removes blocking artifacts caused by block-by-block decoding by deblocking filtering the boundary between the reconstructed blocks. The SAO filter 464 performs additional filtering on the reconstructed block after the deblocking filtering to compensate for the difference between the reconstructed pixel and the original pixel resulting from lossy coding.

The reconstructed block filtered through the deblocking filter 462 and the SAO filter 464 is stored in the memory 470. When all the blocks in one picture are reconstructed, the reconstructed picture is used as a reference picture for inter prediction of blocks in pictures to be decoded later.

The techniques of this disclosure generally relate to applying artificial neural network techniques to image encoding or decoding.

Some techniques of this disclosure relate to a CNN-based filter capable of performing the functions of a deblocking filter and a SAO filter in an image encoding apparatus and a decoding apparatus. Some other techniques of this disclosure relate to performing CNN based intra prediction. Some other techniques of this disclosure relate to performing CNN based inter prediction. Some other techniques of this disclosure relate to performing CNN based filtering on a reference region used for intra prediction of the current block.

Referring to FIG. 5, a convolutional neural network (CNN) is a multilayer neural network having a special connection structure designed for image processing, and may include an input layer 510, a hidden layer 530, and an output layer 550. The hidden layer 530 is positioned between the input layer 510 and the output layer 550 and may include a plurality of convolution layers 531 to 539. The hidden layer 530 may further include an upsampling layer or a pooling layer in order to adjust the resolution of a feature map that is a result of the convolution operation. In addition, the CNN may have a VDSR (Very Deep Super Resolution) structure or a structure (not shown) in which ResNet is combined with VDSR.

All layers constituting the CNN each include a plurality of nodes, and each node may be interconnected with nodes of other adjacent layers so as to transfer an output value to which a predetermined connection weight is applied as an input of other nodes.

The convolution layers 531 to 539 may generate a feature map by performing a convolution operation on the image data input to each layer using a convolution kernel (i.e., a filter) in the form of a 2D or 3D matrix. Here, the feature map refers to image data in which various features of the image data input to each layer are expressed. The number of convolution layers 531 to 539, the size of the convolution kernel, and the like may be preset before the learning process.
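A minimal sketch of the convolution operation that produces a feature map is given below. It assumes a single-channel input, no padding, a stride of 1, and the cross-correlation form commonly used in CNNs; the averaging kernel is a hypothetical choice, not a coefficient set from the disclosure.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide a 2D kernel over the image and return the resulting feature map."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for r in range(oh):
        for c in range(ow):
            # Weighted sum of the window currently covered by the kernel.
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

image = np.arange(16, dtype=np.float64).reshape(4, 4)
kernel = np.full((3, 3), 1.0 / 9.0)   # hypothetical 3x3 averaging kernel
feature_map = conv2d(image, kernel)
```

Without padding, a 3x3 kernel shrinks a 4x4 input to a 2x2 feature map, which is one reason a hidden layer may need upsampling or padding to keep the resolution.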

The output layer 550 may be composed of a fully connected layer. The nodes of the output layer 550 may output image data by combining various features expressed in the feature map.

The CNN algorithm can be divided into a learning process and an inference process. The learning process may be classified into supervised learning, unsupervised learning, and reinforcement learning according to the learning method. Among these, supervised learning refers to the process of calculating the coefficient values of the convolution kernel (i.e., the filter coefficients) using an output label, which is an explicit answer for the data input to the input layer 510 (hereinafter referred to as 'input data').

The filter coefficient of the CNN may be calculated through repeated supervised learning using an error backpropagation algorithm for predetermined input data. Specifically, according to the error backpropagation algorithm, the error between the output data of the CNN and the output label propagates in the reverse direction from the output layer 550 through the hidden layer 530 to the input layer 510. In the process of propagation of the corresponding error, the connection weights between the nodes are updated in the direction of reducing the corresponding error. In addition, by repeating the supervised learning process of the CNN until the corresponding error is less than a preset threshold, an optimal filter coefficient to be used in the inference process of the CNN may be calculated.
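The iterate-until-threshold training described above can be illustrated with a deliberately reduced case: a single 1D filter trained by gradient descent on a mean squared error. With only one layer, backpropagation reduces to one gradient step per iteration. The learning rate, threshold, and data are assumptions made for the example.

```python
import numpy as np

def train_filter(x, label, taps=3, lr=0.1, threshold=1e-6, max_iters=10000):
    """Fit 1D filter coefficients until the error against the label drops below a threshold."""
    w = np.zeros(taps)
    # Each row holds the input samples seen by the kernel at one output position.
    windows = np.stack([x[i:i + taps] for i in range(len(x) - taps + 1)])
    err = np.inf
    for _ in range(max_iters):
        out = windows @ w                  # inference with the current coefficients
        diff = out - label                 # error against the output label
        err = np.mean(diff ** 2)
        if err < threshold:
            break
        # Propagate the error back to the coefficients and step against the gradient.
        w -= lr * 2.0 * windows.T @ diff / len(diff)
    return w, err

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
true_w = np.array([0.25, 0.5, 0.25])       # hypothetical "ideal" filter
label = np.stack([x[i:i + 3] for i in range(62)]) @ true_w
w, err = train_filter(x, label)
```

In a multilayer CNN the same error signal is passed backwards through each layer in turn, but the stopping rule (repeat until the error falls below a preset threshold) is the one stated above.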

Now, the techniques of this disclosure using CNN as part of the video encoding or decoding process will be described in detail.

1. Image quality improvement using CNN-based filter

Hereinafter, the image quality improvement technique using the CNN-based filter will be described with reference to FIGS. 6 to 13. The CNN-based filter described below may be used in both the image encoding apparatus and the image decoding apparatus. For example, the CNN-based filter may be used in place of the deblocking filter 182 and the SAO filter 184 of the image encoding apparatus, and in place of the deblocking filter 462 and the SAO filter 464 of the image decoding apparatus. In the following description, the CNN-based filter is described using YUV as an example of the information constituting a picture, but the CNN-based filter may also be applied to RGB, YCbCr, and the like. That is, in the following description, 'the YUV whose image quality is to be improved' should be understood to also cover 'the RGB whose image quality is to be improved' or 'the YCbCr whose image quality is to be improved'.

When a YUV 601 whose image quality is to be improved and at least one of a quantization parameter (QP) map 603 and a block partition map 605 are input to the input layer, a YUV difference 621 is output from the output layer. Here, the YUV 601 whose image quality is to be improved may be a YUV reconstructed from the bitstream received from the encoder, that is, a YUV in which the original YUV has been degraded, whether artificially or not. In addition, a hint (not shown) may also be input to the input layer.

First, in the learning process, the coefficients of the CNN-based filter, that is, the coefficients of the convolution kernel, are trained such that the YUV difference 621 output from the output layer becomes the difference between the original YUV and the YUV whose image quality is to be improved. The convolution kernel may take a 2D or 3D form. The CNN-based filter 611 improves the image quality of the input YUV, and the final output of the CNN-based filter 611 is the YUV 631 with improved image quality.

Here, the YUV whose image quality is to be improved may be filtered channel by channel, or all channels may be filtered at once.
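The relationship between the training target and the final output can be sketched as follows. The sketch assumes 8-bit samples and assumes that the improved picture is obtained by adding the predicted difference back to the input picture, which is consistent with the description above but is not spelled out there; the function names are hypothetical.

```python
import numpy as np

def training_label(original, degraded):
    """Target for the output layer: original picture minus the picture to be improved."""
    return original.astype(np.int32) - degraded.astype(np.int32)

def improve(degraded, predicted_difference, max_value=255):
    """Add the predicted difference to the input picture and clip to the sample range."""
    improved = degraded.astype(np.int32) + np.rint(predicted_difference).astype(np.int32)
    return np.clip(improved, 0, max_value).astype(np.uint8)

original = np.array([[103, 255], [0, 50]], dtype=np.uint8)
degraded = np.array([[100, 254], [2, 50]], dtype=np.uint8)
label = training_label(original, degraded)
restored = improve(degraded, label)   # a perfectly trained filter would predict `label`
```

Predicting a difference rather than the full picture keeps the target values small and centered near zero.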

The QP map may be set to the same resolution as the input YUV to be filtered, and may be filled with the QP values used in the coding units of the YUV plane, for example, blocks or sub-blocks. If the YUV is filtered channel by channel, one map may be configured with the QP values of the channel to be filtered. If the YUV is filtered at once, the QP values of the three channels may be configured as three separate maps or as one map holding the average QP value.
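One way to build such a QP map is sketched below. The coding-unit layout and QP values are made up for the example, and the three planes are assumed to share one resolution so that an average map can be formed directly.

```python
import numpy as np

def build_qp_map(height, width, units):
    """Fill a plane-sized map with the QP of each coding unit (x, y, w, h, qp)."""
    qp_map = np.zeros((height, width), dtype=np.float32)
    for x, y, w, h, qp in units:
        qp_map[y:y + h, x:x + w] = qp
    return qp_map

# Hypothetical 8x8 planes: Y is coded as two 4x8 units, U and V as one unit each.
qp_y = build_qp_map(8, 8, [(0, 0, 4, 8, 30), (4, 0, 4, 8, 34)])
qp_u = build_qp_map(8, 8, [(0, 0, 8, 8, 32)])
qp_v = build_qp_map(8, 8, [(0, 0, 8, 8, 36)])

# When all channels are filtered at once: three separate maps, or one average map.
qp_avg = (qp_y + qp_u + qp_v) / 3.0
```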

As a method for improving the accuracy of the CNN technique, a block mode map useful for the learning process may be added to the input layer as hint information, in addition to the QP map, the block division map, and the image whose quality is to be improved. Here, the block mode map may be filled with the mode values used in coding units, for example, blocks or sub-blocks. For example, the mode value may be information distinguishing whether a block is encoded in an intra mode or an inter mode, and may be represented by a number. In this case, the convolution kernel coefficients resulting from the learning process reflect the hint as well as the data of the input layer. Basically, the input layer and output layer must be configured identically in the learning process and the inference process of the CNN technique.

Subsequently, in the inference process, the YUV with improved image quality is generated from the YUV whose image quality is to be improved, the quantization parameter map, and the block partitioning map by applying the coefficients of the CNN-based filter obtained in the learning process.

In detail, the YUV 701 whose image quality is to be improved, the quantization parameter map 703, and the block partition map 705 that are input to the input layer 710 may be concatenated through the concatenate layer 720 during the CNN process. However, the position of the concatenate layer 720 may vary, as illustrated in FIGS. 7A, 7B, and 7C.

FIG. 7A illustrates the structure of a CNN in which the concatenate layer 720 is positioned immediately after the input layer 710, so that the YUV 701 whose image quality is to be improved, the quantization parameter map 703, and the block partition map 705 are concatenated immediately after being input to the input layer 710. FIG. 7B illustrates the structure of a CNN in which the concatenate layer 720 is located between the convolution layers 730, and FIG. 7C illustrates the structure of a CNN in which the concatenate layer 720 is positioned immediately before the output layer 740.
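For the arrangement in which concatenation happens right after the input layer, the operation amounts to stacking the planes along a channel axis. The sketch below assumes a channel-first layout and a single Y plane; both are illustrative choices, and the function name is hypothetical.

```python
import numpy as np

def concat_inputs(y_plane, qp_map, partition_map):
    """Stack the picture plane and its side-information maps into one (C, H, W) tensor."""
    planes = [np.asarray(p, dtype=np.float32) for p in (y_plane, qp_map, partition_map)]
    if len({p.shape for p in planes}) != 1:
        raise ValueError("all inputs must share the same resolution")
    return np.stack(planes, axis=0)

y_plane = np.zeros((8, 8))
qp_map = np.full((8, 8), 32)
partition_map = np.zeros((8, 8))
net_input = concat_inputs(y_plane, qp_map, partition_map)
```

The resolution check reflects the requirement stated earlier that the maps be set to the same resolution as the plane to be filtered.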

FIGS. 8A to 8C illustrate data to be input to an input layer according to an embodiment of the present invention.

Specifically, FIG. 8A shows a Y plane (for example, a Y coding tree block (CTB)) whose image quality is to be improved, with the luma pixel values omitted from the drawing. FIG. 8B shows the QP map applied to the Y plane whose image quality is to be improved, and FIG. 8C shows the block division map of that Y plane.

Hereinafter, various structures of a block division map according to an embodiment of the present invention will be described. The block partitioning map indicates whether a block is divided or not, so that the processing of the partitioned boundary of the block and the inner region of the block may be differently performed during the CNN's learning process and inference process.

FIGS. 9A and 9B show an example of a block partitioning map according to an embodiment of the present invention.

The block division map may be set to the same resolution as the YUV plane to be filtered and may be filled with values indicating whether a block is partitioned. For example, when the YUV plane is composed of a coding tree block including a plurality of coding blocks (CBs), the block partitioning map may represent the division boundaries of the coding blocks within the coding tree block. FIG. 9A illustrates a coding tree block divided by a quadtree plus binary tree (QTBT) scheme, and FIG. 9B illustrates the block partitioning map for that coding tree block. Referring to FIG. 9B, the boundaries of the coding blocks are indicated by '1' and the insides of the coding blocks are indicated by '0' in the block division map.

FIGS. 10A and 10B show another example of a block partitioning map according to an embodiment of the present invention.
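A block division map of this kind can be sketched as below. The sketch marks each coding block's own outermost ring of pixels as boundary, which is only one possible reading of FIG. 9B; the block list is hypothetical. The `border` and `value` parameters are included as assumptions so that other boundary widths or values can be tried.

```python
import numpy as np

def build_partition_map(height, width, blocks, border=1, value=1):
    """Mark coding-block boundaries with `value` and block interiors with 0."""
    pmap = np.zeros((height, width), dtype=np.uint8)
    for x, y, w, h in blocks:
        pmap[y:y + border, x:x + w] = value             # top edge
        pmap[y + h - border:y + h, x:x + w] = value     # bottom edge
        pmap[y:y + h, x:x + border] = value             # left edge
        pmap[y:y + h, x + w - border:x + w] = value     # right edge
    return pmap

# Hypothetical 8x8 coding tree block split into two 4x8 coding blocks.
pmap = build_partition_map(8, 8, [(0, 0, 4, 8), (4, 0, 4, 8)])
```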

In FIGS. 10A and 10B, when the YUV plane consists of a coding tree block, blocking artifacts at the boundary of the coding tree block might not be handled, and thus an extra area α may be added to the YUV plane. FIGS. 10A and 10B show an example in which 2 pixels are set as the extra area α, but other values may be set. In addition, the boundary of the coding tree block may be indicated within the extra area, and whether blocks are split may also be indicated in the area outside the coding tree block. If filtering is performed including the extra area, an area overlapping with an adjacent coding tree block arises after the filtering, and the overlapping area may be processed as an average value. Specifically, referring to FIG. 10C, when the coding tree blocks 1001 and 1003 including an extra area are adjacent to each other, an overlapping area 1005 is formed. The overlapping region 1005 may be set to the average of the values of the adjacent coding tree blocks 1001 and 1003.

In the previous embodiments, the boundary of a coding block is marked with a width of 1 pixel: a pixel value of 0 indicates the inside of the coding block, and a pixel value of 1 indicates the boundary of the coding block.

In FIG. 11A, the boundaries of the coding blocks are expressed with a varying number of pixels (or pixel width, luma sample line, luma sample length, etc.) in order to control the strength of deblocking. The number of pixels may be determined by at least one of the size of the coding block, the value of the quantization parameter, and the encoding mode. For example, as shown in FIG. 11A, when the coding block is large, the number of pixels may be set to two, and when the coding block is small, the number of pixels may be set to one. In addition, if the value of the quantization parameter is large, the number of pixels may be increased, and if the value of the quantization parameter is small, the number of pixels may be decreased. As another example, when the encoding mode is intra, the number of pixels may be increased, and when the encoding mode is inter, the number of pixels may be decreased. All of these settings may also be reversed.
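The averaging of overlapping areas can be sketched as follows. Each filtered tile is a coding tree block enlarged by the extra area on every side; the tile sizes, positions, and sample values are assumptions for the example, and the function name is hypothetical.

```python
import numpy as np

def merge_with_overlap(tiles, height, width):
    """Place filtered tiles (x, y, samples) on the picture, averaging where they overlap."""
    acc = np.zeros((height, width))
    cnt = np.zeros((height, width))
    for x, y, tile in tiles:
        h, w = tile.shape
        # Clip the tile to the picture area; the extra area may reach outside it.
        x0, y0 = max(x, 0), max(y, 0)
        x1, y1 = min(x + w, width), min(y + h, height)
        acc[y0:y1, x0:x1] += tile[y0 - y:y1 - y, x0 - x:x1 - x]
        cnt[y0:y1, x0:x1] += 1
    return acc / np.maximum(cnt, 1)

alpha = 2                                   # extra area of 2 pixels, as in the example above
tile_a = np.full((4 + 2 * alpha, 4 + 2 * alpha), 10.0)   # filtered 4x4 block at x = 0
tile_b = np.full((4 + 2 * alpha, 4 + 2 * alpha), 20.0)   # filtered 4x4 block at x = 4
merged = merge_with_overlap([(-alpha, -alpha, tile_a), (4 - alpha, -alpha, tile_b)], 4, 8)
```

Keeping a per-pixel count rather than assuming exactly two contributors also covers corners where more than two extended blocks meet.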

The number of pixels may mean the number of pixels located at a block boundary to be updated by filtering. For example, when a 3 pixel value in one block located at the block boundary line is to be updated, the boundary of the block may be indicated by 3 pixels in the block division map. As another example, the number of pixels may mean the number of pixels located at a block boundary line to be referred to for filtering. For example, when filtering is performed by referring to 4 pixel values of a block located at a block boundary line, the block boundary may be indicated by 4 pixels in the block division map.

In FIG. 11B, the boundary value of the coding block is displayed differently in order to adjust the strength of deblocking. The boundary value of the coding block may be determined by at least one of the size of the coding block, the value of the quantization parameter, the encoding mode, the number of pixels to be updated, and the number of pixels to be referred to for filtering. As in the example of FIG. 11A, when the coding block is large, the value of the quantization parameter is large, or the encoding mode is intra, the boundary value of the coding block may be set large; conversely, when the coding block is small, the value of the quantization parameter is small, or the encoding mode is inter, the boundary value of the coding block may be set small. All of these settings may also be reversed.

FIG. 11C varies both the number of pixels at the boundary of the coding block and the boundary value of the coding block in order to control the strength of deblocking. A detailed description is omitted here, as it is the same as that given with reference to FIGS. 11A and 11B.

The block partitioning map configured as described above is used in the learning process to help the CNN filter operate as a strong deblocking filter.

A YUV whose image quality is to be improved and at least one of a quantization parameter map and a block partitioning map are input to the CNN-based filter (1201). The quantization parameter map may be set to the same resolution as the YUV whose image quality is to be improved. In the block partitioning map, the partitioned boundary of a block and the inner region of the block may be indicated differently. The number and value of pixels representing the partitioned boundary of the block in the block division map may be determined by at least one of the size of the coding block, the value of the quantization parameter, the encoding mode, the number of pixels to be updated, and the number of pixels to be referred to for filtering.

With the YUV whose image quality is to be improved, the quantization parameter map, and the block partitioning map as inputs, a YUV with improved image quality is output using the coefficients of the CNN-based filter trained with the original YUV as the final output (1203). When a hint such as a block mode map is additionally input to the CNN-based filter, the CNN-based filter is trained with the hint as an additional input.

FIG. 13 is a diagram schematically illustrating a configuration of an apparatus for decoding an image according to an embodiment of the present invention. The apparatus illustrated in FIG. 13 may be, for example, a component or a module corresponding to the filter unit 460 of FIG. 4.

The apparatus for decoding the image may include an input unit 1301, a filter unit 1303, and an output unit 1305. Other configurations may be included, but a description of components not directly related to the present disclosure will be omitted.

The input unit 1301 receives a YUV whose image quality is to be improved and at least one of a quantization parameter map and a block division map. The quantization parameter map may be set to the same resolution as the YUV whose image quality is to be improved, and in the block partitioning map the partitioned boundary of a block and the internal region of the block may be indicated differently. The number and value of pixels representing the partitioned boundary of the block in the block division map may be determined by at least one of the size of the coding block, the value of the quantization parameter, and the encoding mode.

The filter unit 1303 applies the coefficients of the trained CNN-based filter to the YUV whose image quality is to be improved and to at least one of the quantization parameter map and the block division map received by the input unit 1301.

The output unit 1305 outputs the YUV with improved image quality obtained by applying the coefficients of the trained CNN-based filter to the YUV whose image quality is to be improved and to at least one of the received quantization parameter map and block division map.

In the present example, the input unit 1301, the filter unit 1303, and the output unit 1305 have been described separately, but they may be integrated into a single component, or a single component may be divided into several components.

2. CNN-based intra prediction (1) - CNN-based prediction block generation

Some techniques of this disclosure relate to performing CNN based intra prediction. Hereinafter, a technique of performing CNN-based intra prediction will be described with reference to FIGS. 14 to 21.

FIG. 14 is a block diagram illustrating a configuration of a CNN prediction unit that may be included in an image encoding apparatus according to an embodiment of the present invention. The CNN predictor 1400 of FIG. 14 may be, for example, the intra predictor 122 of the image encoding apparatus illustrated in FIG. 1 or a module included in the intra predictor 122.

Referring to FIG. 14, the CNN prediction unit 1400 may generate a prediction block by performing CNN-based intra prediction based on the image to be encoded (that is, the original image) transferred from the block splitter (e.g., 110 of FIG. 1) and the reconstructed image transferred from the adder (e.g., 170 of FIG. 1). To this end, the CNN prediction unit 1400 may include a CNN setting unit 1410 and a CNN execution unit 1430.

The CNN setting unit 1410 may calculate filter coefficients, that is, coefficients of a convolution kernel, by performing supervised learning using a CNN composed of a plurality of layers. Here, the structure of the CNN is as described above with reference to FIG. 5, and the CNN may further include an upsampling layer or a pooling layer to adjust the size of a layer.

Image data input to the input layer (hereinafter, referred to as “input data”) may be configured as a reference region encoded before the current block.

The reference region includes a neighboring region adjacent to the current block, and a block of components (hereinafter, referred to as 'current blocks of other channels') that are encoded before the block of the component to be encoded among the luma blocks and chroma blocks constituting the current block. It may include at least one block (or area). Here, the peripheral area may be an area of the same channel as the current block or an area of another channel. In addition, the peripheral area may be configured in block units (ie, peripheral blocks) or in pixel units (ie, peripheral pixels or peripheral lines). The reference region may further include a new region (ie, an average block, an average pixel, or an average line) generated by averaging pixel values of the peripheral region.

15 is an exemplary diagram of a peripheral region that can be used as input data of a CNN. Specifically, FIG. 15A illustrates a peripheral area in block units, and FIG. 15B illustrates a peripheral area in pixel units.

Referring to FIG. 15A, the reference area, that is, the neighboring blocks in block units, includes the left block C, the upper block B, the right upper block D, and the lower left block E adjacent to the current block X. ), The upper left block (A). In this specification, original blocks (ie, uncoded blocks), prediction blocks, and reconstruction blocks of neighboring blocks are denoted differently. For example, for the upper left block A, the original block is denoted by 'Ao', the predictive block is denoted by 'Ap', and the reconstructed block is denoted by 'Ar'. In addition, an average block obtained by averaging pixel values of the neighboring blocks A, B, C, D, and E is denoted by 'F'.
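The average block F can be sketched as a per-pixel mean over the neighboring blocks. The sketch assumes 8-bit blocks of equal size and rounding to the nearest integer; the sample values are made up for the example.

```python
import numpy as np

def average_block(neighbors):
    """Average block F: per-pixel mean of equally sized neighboring blocks."""
    stacked = np.stack([n.astype(np.float64) for n in neighbors])
    return np.rint(stacked.mean(axis=0)).astype(np.uint8)

# Hypothetical reconstructed neighbors Ar, Br, Cr, Dr, Er of a 4x4 current block.
a_r, b_r, c_r, d_r, e_r = (np.full((4, 4), v, dtype=np.uint8) for v in (10, 20, 30, 40, 50))
f_block = average_block([a_r, b_r, c_r, d_r, e_r])
```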

Referring to FIG. 15B, the reference region in pixel units may include '1 × 1' pixels adjacent to the current block X and '1 × n' or 'n × 1' lines. For reference, since a reference region in block units offers a wider area to which the convolution kernel can be applied than a reference region in pixel units, it can improve the accuracy of the learning process and the inference process of the CNN. Hereinafter, for convenience of description, the present embodiment will be described on the premise that the reference region is in block units.

Meanwhile, in the YCbCr 4:2:0 or 4:2:2 format, chroma blocks may be used at their original size, or may be used after being up-scaled by an upsampling layer to the same size as the luma block.
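As a non-limiting illustration, up-scaling a chroma block to the luma block size may be sketched with nearest-neighbor replication. The choice of nearest-neighbor interpolation and the function name `upsample_nearest` are assumptions for this example; an actual upsampling layer may use a different or learned interpolation.

```python
def upsample_nearest(block, factor_y, factor_x):
    """Up-scale a 2-D block by repeating each sample factor_y x factor_x times."""
    height, width = len(block), len(block[0])
    return [[block[y // factor_y][x // factor_x]
             for x in range(width * factor_x)]
            for y in range(height * factor_y)]


chroma = [[1, 2],
          [3, 4]]

# 4:2:0: chroma has half the luma resolution horizontally and vertically.
chroma_420_upscaled = upsample_nearest(chroma, 2, 2)

# 4:2:2: chroma has half the luma resolution horizontally only.
chroma_422_upscaled = upsample_nearest(chroma, 1, 2)
```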

When neighboring blocks of a channel different from that of the current block are input to the input layer, one or more blocks (not shown) among the right block, the lower block, and the lower right block of the current block X may be further input to the input layer, in addition to the neighboring blocks Ar, Br, Cr, Dr, and Er described above. For example, when the current block is a chroma block, the accuracy of intra prediction may be improved by adding, as input data, one or more of the right block, the lower block, and the lower right block of the current block of the luma channel, which has already been encoded.

The input layer may be composed of a plurality of layers, one for each of the neighboring blocks Ar, Br, Cr, Dr, and Er, as shown in FIG. 16A, or a plurality of neighboring blocks, for example Ar and Br, may be integrated into one layer, as shown in FIG. 16B.

Referring back to FIG. 14, the image data output from the output layer (hereinafter, referred to as “output data”) may be a prediction block of the current block. In this case, the output label may be composed of original blocks (ie, unencoded blocks) of the current block for supervised learning through comparison with the output data.

Table 1 shows some exemplary configurations of the CNN layers. However, it should be noted that these are merely examples and do not limit the embodiments to which the technology of the present disclosure may be applied.

CNN layer example	Input layer (data)	Output layer (data)	Output layer (label)
Example 1	Neighboring blocks of the same channel as the current block	Prediction block of the current block	Original block of the current block
Example 2	Current block of a channel different from that of the current block	Prediction block of the current block	Original block of the current block
Example 3	Neighboring blocks of the same channel as the current block and their average block; current block of another channel	Prediction block of the current block	Original block of the current block
Example 4	Neighboring blocks of the same channel as the current block and their average block; current block of another channel; neighboring blocks of another channel and their average block	Prediction block of the current block	Original block of the current block

Referring to Table 1, in the exemplary configurations of the CNN layers, the data of the input layer may be configured in various combinations, the data of the output layer is the prediction block of the current block, and the label of the output layer is the original block of the current block. The input data and the output data should each be configured identically in the learning process and the inference process of the CNN.

Meanwhile, the CNN setting unit 1410 may set hint information in order to minimize the error between the output data and the output label and to improve the accuracy of intra prediction. Here, the hint information may include at least one of directional information of intra prediction, a quantization parameter (QP) of the current block or the reference region, and the sum of absolute values of the transform coefficients or residual signals of a neighboring block (i.e., the amount of residual). The hint information may be transmitted to the image decoding apparatus through the bitstream and used to decode the current block.

FIG. 17 is an exemplary diagram for describing a suitable prediction direction for a current block in view of the pixel value patterns of neighboring blocks.

In FIG. 17, the neighboring blocks of the current block X consist of an upper left block A, an upper block B, and a left block C.

Referring to (a) of FIG. 17, the pixel values of the neighboring blocks A, B, and C show that about half of the upper left block A is white and the left block C is mostly white, whereas the upper block B is mostly a color other than white. Considering that the current block X is mostly white, it can be seen that performing intra prediction in the horizontal direction maximizes the prediction accuracy.

Referring to (b) of FIG. 17, the pixel values of the neighboring blocks A, B, and C show that the upper left block A and the left block C are mostly white, whereas the upper block B is mostly a color other than white. Considering that the current block X is mostly a color other than white, it can be seen that performing intra prediction in the vertical direction maximizes the prediction accuracy.

Accordingly, the CNN prediction unit 1400 according to the present embodiment intends to improve the accuracy of intra prediction by using the directional information of the intra prediction as hint information of the learning process and the inference process of the CNN.

The directional information of intra prediction may be an intra prediction mode number indicating 65 directional modes and non-directional modes illustrated in FIG. 3. The hint information including one or more prediction directional information may be encoded by the encoder 150 of the image encoding apparatus of FIG. 1 and transmitted to the image decoding apparatus of FIG. 4.

In this case, various methods may be used to minimize the number of bits required to encode the hint information. For example, the CNN setting unit 1410 may select some of the 65 prediction directions (e.g., the horizontal direction, the vertical direction, the diagonal down-right direction, the diagonal up-right direction, etc.) as representative directions, and may set one of the selected representative directions as hint information for intra prediction of the current block. The CNN setting unit 1410 may transmit the hint information to the image decoding apparatus in a manner similar to that of the most probable mode (MPM).

The hint information may include a quantization parameter (QP) indicating the strength of the quantization. Here, the QP may be a QP value applied to the quantization process of the current block or reference region.

The hint information may include the amount of residual. Here, the amount of residual may be the sum of the absolute values of the transform coefficients or of the residual signals of the neighboring block.

The hint information may be composed of one or more maps and may be concatenated with a layer of the CNN. The map for the hint information may be concatenated at various locations between the input layer and the output layer. For example, the map for the hint information may be concatenated immediately after the input layer as shown in FIG. 18, or may be concatenated immediately before the output layer.
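As a non-limiting illustration, the hint information may be sketched as constant-valued maps that are concatenated channel-wise with the data of a layer. The helper names, the 2 × 2 map size, and the example values for the direction, the QP, and the residual block are assumptions made for this example only.

```python
def constant_map(value, height, width):
    """Build a height x width map filled with a single hint value."""
    return [[value] * width for _ in range(height)]


def residual_amount(residuals):
    """Amount of residual: sum of absolute values over a 2-D residual block."""
    return sum(abs(v) for row in residuals for v in row)


def concatenate(feature_maps, hint_maps):
    """Channel-wise concatenation of equally sized 2-D maps."""
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    for hint in hint_maps:
        if len(hint) != height or any(len(row) != width for row in hint):
            raise ValueError("hint maps must match the feature map size")
    return list(feature_maps) + list(hint_maps)


# Hypothetical example: two input channels plus three hint maps.
inputs = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
neighbor_residual = [[-3, 1], [0, 2]]
hints = [
    constant_map(18, 2, 2),  # directional information (assumed mode number)
    constant_map(32, 2, 2),  # quantization parameter (assumed value)
    constant_map(residual_amount(neighbor_residual), 2, 2),
]
stacked = concatenate(inputs, hints)
```

The same `concatenate` step could be applied to the feature maps of a later layer instead of the input layer, which corresponds to concatenating the hint maps immediately before the output layer.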

Meanwhile, the input data may be configured in various combinations according to the direction of intra prediction. For example, when the direction of the intra prediction is the horizontal direction, the input data may be composed of one or more blocks selected from the left neighboring blocks Ar, Cr, and Er of the current block X and their average block. Conversely, when the direction of the intra prediction is the vertical direction, the input data may be composed of one or more blocks selected from the upper neighboring blocks Ar, Br, and Dr of the current block X and their average block.
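As a non-limiting illustration, the direction-dependent choice of input blocks may be sketched as follows. The string labels for the directions, the function name `select_input_blocks`, and the fallback of using every neighboring block for other directions are assumptions made for this example.

```python
LEFT_NEIGHBORS = ("A", "C", "E")   # upper left, left, lower left
UPPER_NEIGHBORS = ("A", "B", "D")  # upper left, upper, upper right


def select_input_blocks(direction, reconstructed):
    """Pick reconstructed neighboring blocks according to the prediction direction."""
    if direction == "horizontal":
        names = LEFT_NEIGHBORS
    elif direction == "vertical":
        names = UPPER_NEIGHBORS
    else:
        # Assumption: other directions use every available neighboring block.
        names = tuple(sorted(reconstructed))
    return {name: reconstructed[name] for name in names}


# Hypothetical 1x1 reconstructed blocks keyed by their position label.
reconstructed = {"A": [[10]], "B": [[20]], "C": [[30]], "D": [[40]], "E": [[50]]}

horizontal_inputs = select_input_blocks("horizontal", reconstructed)
vertical_inputs = select_input_blocks("vertical", reconstructed)
```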

The CNN setting unit 1410 may calculate filter coefficients through an iterative learning process using an error backpropagation algorithm in order to minimize an error between the output data and the output label. In detail, the error between the output data and the output label may be propagated in the reverse direction from the output layer of the CNN to the input layer via the hidden layer. In the process of propagation of an error, the connection weights between nodes may be updated in a direction to reduce the corresponding error. The CNN setting unit 1410 may calculate the filter coefficients by repeating the learning process of the CNN using the error backpropagation algorithm until the corresponding error is less than a predetermined threshold.
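As a non-limiting illustration, the iterative learning process may be sketched for the simplest possible case of a single one-dimensional convolution kernel, where the backward pass reduces to one gradient step per iteration. In a multi-layer CNN the same error would be propagated back through the hidden layers. The mean squared error, the learning rate, the threshold, and the toy signal are assumptions made for this example.

```python
def apply_kernel(signal, kernel):
    """Slide the kernel over the signal (no padding) and return the outputs."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def train_kernel(signal, label, kernel_size, lr=0.5, threshold=1e-8, max_iters=1000):
    """Repeat gradient updates until the error falls below the threshold."""
    kernel = [0.0] * kernel_size
    error = float("inf")
    for _ in range(max_iters):
        output = apply_kernel(signal, kernel)
        diffs = [o - t for o, t in zip(output, label)]
        error = sum(d * d for d in diffs) / len(diffs)  # mean squared error
        if error < threshold:
            break
        for j in range(kernel_size):
            # Gradient of the mean squared error with respect to kernel[j].
            grad = 2.0 * sum(d * signal[i + j] for i, d in enumerate(diffs)) / len(diffs)
            kernel[j] -= lr * grad  # update in the direction that reduces the error
    return kernel, error


# Toy data: the label is produced by a known kernel so that training can recover it.
signal = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
true_kernel = [0.25, 0.5, 0.25]
label = apply_kernel(signal, true_kernel)

learned_kernel, final_error = train_kernel(signal, label, kernel_size=3)
```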

The above filter coefficient calculation process may be performed in a predetermined unit (eg, CU, CTU, slice, frame, or sequence (group of frames)). For example, the CNN setting unit 1410 may calculate filter coefficients for each current block or may calculate filter coefficients for each frame.

When the filter coefficient is calculated in units of frames, the filter coefficient may be commonly used for intra prediction of a plurality of current blocks included in the frame. In this case, the prediction direction information, which is one of the hint information, may also be plural. For example, when the intra prediction directional information is composed of one map, one map may include a plurality of directional values.

The calculated information about the filter coefficients may be transmitted to the image decoding apparatus through the bitstream and used for the image decoding process.

In addition, the CNN setting unit 1410 may configure a filter coefficient set by previously calculating a plurality of filter coefficients using predetermined sample images. In this case, the CNN setting unit 1410 may set one filter coefficient selected according to a predetermined criterion in the set as the filter coefficient for the current block. For example, the CNN setting unit 1410 may select one filter coefficient from the set based on the similarity of pixel values between the current block and the sample images. Alternatively, the CNN setting unit 1410 may select a filter coefficient closest to the filter coefficient calculated through one learning process from the set. The selection information of the filter coefficient, for example, the index information, may be transmitted to the image decoding apparatus through the bitstream and used for the image decoding process.
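As a non-limiting illustration, selecting from a precomputed set the filter coefficient closest to one obtained through a learning process, and signalling only its index, may be sketched as follows. The squared Euclidean distance as the closeness criterion, the function name `nearest_filter_index`, and the coefficient values are assumptions made for this example.

```python
def nearest_filter_index(filter_set, trained):
    """Return the index of the stored filter closest to the trained one."""
    def squared_distance(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    return min(range(len(filter_set)),
               key=lambda i: squared_distance(filter_set[i], trained))


# Hypothetical set shared by the encoding and decoding apparatuses.
filter_set = [
    [1.0, 0.0, 0.0],
    [0.25, 0.5, 0.25],
    [0.0, 0.0, 1.0],
]
trained = [0.2, 0.55, 0.25]

index = nearest_filter_index(filter_set, trained)  # value to be signalled
selected = filter_set[index]                       # looked up again at the decoder
```

Because both apparatuses hold the same set in this sketch, only the index needs to be carried in the bitstream rather than the coefficient values themselves.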

Meanwhile, although FIG. 14 illustrates that the CNN setting unit 1410 is included in the CNN prediction unit 1400, it should be noted that this is exemplary and the present embodiment is not limited thereto. That is, the CNN setting unit 1410 may be implemented as a separate unit from the CNN prediction unit 1400, or may be integrated with the CNN execution unit 1430 and implemented as one unit.

The CNN execution unit 1430 may perform a CNN-based inference process on the input data using the filter coefficients set by the CNN setting unit 1410, that is, the coefficient values of the convolution kernel, thereby generating the output data, that is, a prediction block for the current block. In this case, the generated prediction block may be transferred to a subtractor of the image encoding apparatus and used to generate a residual block from the current block.

FIG. 19 is a block diagram illustrating a configuration of a CNN prediction unit that may be included in an image decoding apparatus according to an embodiment of the present invention. The CNN predictor 1900 of FIG. 19 may be, for example, the intra predictor 442 of the image decoding apparatus illustrated in FIG. 4, or a module included in the intra predictor 442.

The CNN predictor 1900 of FIG. 19 differs from the CNN predictor 1400 of FIG. 14 only in the method of setting the input signals and the filter coefficients, that is, the coefficient values of the convolution kernel. Descriptions of the parts in common will therefore be omitted or given only briefly.

Referring to FIG. 19, the CNN prediction unit 1900 may generate a prediction block by performing CNN-based intra prediction based on a reconstructed image. To this end, the CNN prediction unit 1900 may include a CNN setting unit 1910 and a CNN execution unit 1930.

The structure of the CNN is as described above with reference to FIG. 5, and the CNN may further include an upsampling layer or a pooling layer to adjust the size of a layer.

The image data input to the input layer (hereinafter referred to as 'input data') may be configured as a reference region decoded before the current block.

The reference region may include at least one block (or region) among a neighboring region adjacent to the current block (i.e., a peripheral region) and a block of a component that is decoded before the block of the component to be decoded, among the luma block and chroma blocks constituting the current block (hereinafter, referred to as a 'current block of another channel'). Here, the peripheral region may be a region of the same channel as the current block or a region of another channel. In addition, the peripheral region may be configured in block units (i.e., peripheral blocks) or in pixel units (i.e., peripheral pixels or peripheral lines).

The reference region may further include a new region (i.e., an average block, an average pixel, or an average line) generated by averaging pixel values of the peripheral region. For example, the input data may be composed of neighboring blocks of the same channel as the current block, their average block, and the current block of a channel different from that of the current block.

Hereinafter, for convenience of description, the present embodiment will be described on the premise that the reference region is in block units.

As described above with reference to FIG. 16, the input layer may be composed of a plurality of layers for each of the neighboring blocks, or a plurality of neighboring blocks may be integrated into one layer.

The image data output from the output layer (hereinafter, referred to as 'output data') may be a prediction block of the current block.

Some configuration examples of the CNN layer are as described above with reference to Table 1. However, it should be noted that this is exemplary and does not limit the present embodiment.

The CNN setting unit 1910 may configure one or more maps using the hint information transmitted from the image encoding apparatus, and then concatenate the maps at various positions between the input layer and the output layer.

The hint information is information for improving the accuracy of intra prediction. The hint information may include at least one of prediction directional information, a quantization parameter (QP) of the current block or the reference region, and the sum of absolute values of the transform coefficients or residual signals of a neighboring block (i.e., the amount of residual).

The prediction directional information included in the hint information may be an intra prediction mode number indicating one of the 65 directional modes and the non-directional modes, or may be index information indicating one or more representative directions selected from the 65 directional modes.

Meanwhile, the input data may be configured in various combinations according to the direction of intra prediction. For example, when the direction of the intra prediction is the horizontal direction, the input data may be composed of one or more blocks selected from the left neighboring blocks of the current block and their average block. Conversely, when the direction of the intra prediction is the vertical direction, the input data may be composed of one or more blocks selected from the upper neighboring blocks of the current block and their average block.

The CNN setting unit 1910 may set the filter coefficients transmitted from the image encoding apparatus as filter coefficients for intra prediction of the current block. In this case, the filter coefficient may be a value calculated by a video encoding apparatus in a predetermined unit, for example, a CU unit or a frame unit.

When the filter coefficient is set in units of frames, the filter coefficient may be commonly used for intra prediction of a plurality of current blocks included in the frame. In this case, the prediction direction information, which is one of the hint information, may also be plural. For example, the directional information of intra prediction is composed of one map, but may include a plurality of directional values in one map.

When the image encoding apparatus and the image decoding apparatus operate the same set of filter coefficients, the CNN setting unit 1910 may set the filter coefficients for intra prediction of the current block based on index information of the filter coefficients transmitted from the image encoding apparatus.

Meanwhile, although FIG. 19 illustrates that the CNN setting unit 1910 is included in the CNN prediction unit 1900, it should be noted that this is exemplary and the present embodiment is not limited thereto. That is, the CNN setting unit 1910 may be implemented as a separate unit from the CNN prediction unit 1900. In addition, the CNN setting unit 1910 may be integrated with the CNN execution unit 1930 and implemented as one unit.

The CNN execution unit 1930 may perform a CNN-based inference process on the input data using the filter coefficients set by the CNN setting unit 1910, that is, the coefficient values of the convolution kernel, thereby generating the output data, that is, a prediction block for the current block.

The generated prediction block may be transferred to an adder and added to the residual block to be used to recover the current block.
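As a non-limiting illustration, the roles of the subtractor in the image encoding apparatus and the adder in the image decoding apparatus may be sketched as follows. Transform and quantization of the residual block are omitted, and the 8-bit clipping range and the sample values are assumptions made for this example.

```python
def residual_block(original, prediction):
    """Subtractor: residual = original block - prediction block."""
    return [[o - p for o, p in zip(o_row, p_row)]
            for o_row, p_row in zip(original, prediction)]


def reconstruct_block(prediction, residual, max_value=255):
    """Adder: reconstructed = prediction + residual, clipped to the pixel range."""
    return [[min(max(p + r, 0), max_value) for p, r in zip(p_row, r_row)]
            for p_row, r_row in zip(prediction, residual)]


original = [[100, 102], [98, 101]]
prediction = [[99, 100], [100, 100]]  # e.g., output data of the CNN

residual = residual_block(original, prediction)
reconstructed = reconstruct_block(prediction, residual)
```

With a lossless residual, as in this sketch, the reconstructed block equals the original block; in practice the quantized residual makes the reconstruction approximate.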

Hereinafter, an exemplary method of performing CNN based intra prediction according to the present embodiment will be described with reference to FIGS. 20 and 21.

Referring to FIG. 20, in step S2010, the CNN setting unit 1410 may set input data and an output label of the CNN.

The input data may be composed of a reference region encoded before the current block. For example, the input data may be composed of neighboring blocks of the same channel as the current block. Alternatively, the input data may be composed of neighboring blocks of the same channel as the current block, average blocks thereof, and current blocks of channels different from the current block.

The data of the output layer may be a prediction block of the current block, and the label of the output layer may be composed of the original block of the current block.

The CNN setting unit 1410 may set the directional information of the prediction as the hint information in order to improve the accuracy of the intra prediction. The set hint information may be transmitted to the image decoding apparatus through the bitstream and used to decode the current block. In this case, the input data may be configured in various combinations according to the direction of intra prediction.

In operation S2020, the CNN setting unit 1410 may calculate filter coefficients through a learning process. The CNN setting unit 1410 may repeat the learning process using an error backpropagation algorithm to improve the accuracy of intra prediction.

The filter coefficient calculation process may be performed in a predetermined unit, for example, a frame unit or a block unit. The CNN setting unit 1410 may configure a filter coefficient set by calculating a plurality of filter coefficients in advance by using predetermined sample images. In this case, the CNN setting unit 1410 may set one filter coefficient selected according to a predetermined criterion in the set as the filter coefficient for the current block.

In operation S2030, the CNN execution unit 1430 may perform a CNN-based inference process on the input data using the filter coefficients set by the CNN setting unit 1410, that is, the coefficient values of the convolution kernel, thereby generating the output data, that is, a prediction block for the current block. In this case, the generated prediction block may be transferred to a subtractor (e.g., 130 of FIG. 1) of the image encoding apparatus and used to generate a residual block from the current block.

Referring to FIG. 21, in operation S2110, the CNN setting unit 1910 may set filter coefficients for intra prediction of the current block based on information about filter coefficients transmitted from the image encoding apparatus.

The input data of the CNN may be composed of a reference region decoded before the current block, and the output data becomes a prediction block for the current block.

When hint information for intra prediction is transmitted from the image encoding apparatus, the CNN setting unit 1910 may configure the hint information extracted by the decoder (e.g., 410 of FIG. 4) as a map and concatenate the map with a layer of the CNN.

Meanwhile, the input data may be configured in various combinations according to the direction of intra prediction.

In operation S2120, the CNN execution unit 1930 may perform a CNN-based inference process on the input data using the filter coefficients set by the CNN setting unit 1910, that is, the coefficient values of the convolution kernel, thereby generating the output data, that is, a prediction block for the current block. In this case, the generated prediction block may be transferred to an adder (e.g., 450 of FIG. 4) and added to the residual block to be used to reconstruct the current block.

3. CNN-based inter prediction

Some techniques of this disclosure relate to performing CNN based inter prediction. Hereinafter, a technique of performing CNN-based inter prediction will be described with reference to FIGS. 22 to 28.

FIG. 22 is a block diagram illustrating a configuration of a CNN prediction unit that may be included in an image encoding apparatus according to the present embodiment. The CNN predictor 2200 of FIG. 22 may be, for example, the inter predictor 124 of the image encoding apparatus illustrated in FIG. 1, or a module included in the inter predictor 124. Referring to FIG. 22, the CNN prediction unit 2200 may include a CNN setting unit 2210 and a CNN execution unit 2230.

First embodiment

According to the first embodiment, the CNN predictor 2200 may set the image data to be input to the input layer of the CNN (that is, the input data) and the filter coefficients (that is, the coefficients of the convolution kernel), and may predict the current block by performing a CNN-based inference process using them.

Hereinafter, each configuration and operation of the CNN predictor 2200 according to the first embodiment will be described in detail.

The CNN setting unit 2210 may set input data. In detail, the CNN setting unit 2210 may select at least one reference picture and set a search region in the selected reference picture as input data. Here, the search area in the reference picture means a specific area in the reference picture having a size equal to or larger than the size of the current block.

The position of the search region in the reference picture may be determined based on the position of the current block. For example, the position may be the same position as the current block in the reference picture or a position shifted from the same position by a predefined motion vector. Here, the predefined motion vector may be a motion vector (MV) of a neighboring block adjacent to the current block, or an initial motion vector (initial MV) or a predicted motion vector (predicted MV) shared between the image encoding apparatus and the image decoding apparatus. Alternatively, a global motion vector (global MV) in units of the slice (or picture) to which the current block belongs may be used as the predefined motion vector.

Meanwhile, the size of the search region in the reference picture may be the same as the size of the current block or may be larger than the size of the current block. For example, the search region may be the region obtained by taking a block of the same size as the current block at the position of the search region and extending it to the left and right by the x coordinate of the above-described predefined motion vector and up and down by the y coordinate.
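As a non-limiting illustration, the position and size of the search region may be sketched as follows. Integer-pel motion vectors, the use of the absolute value of each motion vector coordinate as the extension amount, the clipping to the picture boundary, and the function name `search_region` are assumptions made for this example.

```python
def search_region(block_x, block_y, block_w, block_h, mv, pic_w, pic_h, shift=True):
    """Return (x, y, width, height) of the search region in the reference picture."""
    mv_x, mv_y = mv
    # Position: co-located with the current block, optionally shifted by the MV.
    base_x = block_x + (mv_x if shift else 0)
    base_y = block_y + (mv_y if shift else 0)
    # Size: extend left/right by |mv_x| and up/down by |mv_y|.
    left = base_x - abs(mv_x)
    top = base_y - abs(mv_y)
    right = base_x + block_w + abs(mv_x)
    bottom = base_y + block_h + abs(mv_y)
    # Assumption: keep the region inside the reference picture.
    left, top = max(left, 0), max(top, 0)
    right, bottom = min(right, pic_w), min(bottom, pic_h)
    return left, top, right - left, bottom - top


# Hypothetical 8x8 current block at (16, 16) in a 64x64 picture, predefined MV (2, -1).
region = search_region(16, 16, 8, 8, mv=(2, -1), pic_w=64, pic_h=64)
```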

The CNN setting unit 2210 may select a reference picture based on the inter prediction direction. For example, in the case of unidirectional prediction, a reference picture of a specific order in reference picture list 0 may be selected. In the case of bidirectional prediction, a reference picture of a specific order in reference picture list 0 and a reference picture of a specific order in reference picture list 1 may be selected. Here, the information on the reference picture to be used as the input data of the CNN may include selection information (e.g., a reference picture index value) for a reference picture of a specific order in a specific reference picture list. The information about the reference picture may be transmitted to the image decoding apparatus through the encoder (e.g., 190 of FIG. 1). In this case, the information about the reference picture may be encoded as a syntax element of the coding unit (CU), so that a different reference picture may be used as input data of the CNN for each coding unit. Alternatively, the information about the reference picture may be encoded as a syntax element of a unit higher than the coding unit, for example, a CTU, slice, PPS, or SPS, so that the same reference pictures are used as input data of the CNN for all coding units included in the higher unit.

In addition, the CNN setting unit 2210 may select a predefined reference picture so that the image encoding apparatus and the image decoding apparatus share the reference picture to be used as input data of the CNN. For example, in the case of unidirectional prediction, the first reference picture (e.g., the reference picture corresponding to reference picture index 0) and the second reference picture (e.g., the reference picture corresponding to reference picture index 1) of reference picture list 0 may be selected. In the case of bidirectional prediction, the first reference picture of reference picture list 0 (e.g., the reference picture corresponding to reference picture index 0) and the first reference picture of reference picture list 1 (e.g., the reference picture corresponding to reference picture index 0) may be selected. However, it should be noted that this is exemplary and the present embodiment is not limited thereto. That is, the CNN setting unit 2210 may set the input data of the CNN by selecting the reference picture in various ways.

Alternatively, the CNN setting unit 2210 may select, from among a plurality of reference pictures, at least one reference picture to be used as input data of the CNN, and may transmit information about the selected reference picture to the image decoding apparatus through the encoder (e.g., 190 of FIG. 1). In this case, the information about the reference picture to be used as an input of the CNN may be encoded as a syntax element of the coding unit (CU), so that a different reference picture may be used as input data of the CNN for each coding unit. Alternatively, the information about the reference picture to be used as the input of the CNN may be encoded as a syntax element of a unit higher than the coding unit, for example, a CTU, slice, PPS, or SPS, so that the same reference pictures are used as input data of the CNN for all coding units included in the higher unit. Here, the information on the reference picture to be used as the input of the CNN may be a picture order count (POC) value of the selected picture or a difference value between the POC value of the selected picture and the POC value of the current picture.
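As a non-limiting illustration, signalling the selected reference picture either by its POC value or by a POC difference may be sketched as follows. The sign convention (selected POC minus current POC), the function names, and the POC values are assumptions made for this example.

```python
def encode_reference_info(selected_poc, current_poc, as_difference=True):
    """Encoder side: value describing the selected reference picture."""
    return selected_poc - current_poc if as_difference else selected_poc


def decode_reference_info(value, current_poc, as_difference=True):
    """Decoder side: recover the POC of the selected reference picture."""
    return current_poc + value if as_difference else value


current_poc = 16
selected_poc = 12  # hypothetical picture chosen as input data of the CNN

signalled = encode_reference_info(selected_poc, current_poc)
recovered = decode_reference_info(signalled, current_poc)
```

The difference form tends to yield small magnitudes when the selected picture is temporally close to the current picture, which is one possible motivation for offering it as an alternative to the absolute POC value.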

In addition, the CNN setting unit 2210 may further set at least one of a reconstructed peripheral region adjacent to the current block in the current picture and a motion vector of the peripheral region as additional input data. Here, the peripheral area may be an area of the same component as the current block or may be an area of another component. In addition, the peripheral area may be configured in block units (ie, peripheral blocks) or in pixel units (ie, peripheral pixels or peripheral lines). The motion vectors of the plurality of peripheral regions may be set as input data individually or in combination with one or more motion vectors.

In this embodiment, all input data is assumed to be a luma component. However, it should be noted that this is for convenience of description only and does not limit the present embodiment. That is, the input data may consist of various combinations of three components (ie, Y, Cb, Cr).

The motion vector as input data may be represented using a color code. Here, the color code is a mapping of coordinate values of a motion vector to color values. The motion vector represented by the color code may be input to the CNN by configuring one or more maps. For example, the colors are mapped to two-dimensional planes of the x coordinate and the y coordinate, and a color value corresponding to the value of the motion vector (x, y) is used as the color code of the motion vector. In addition, the motion vector may be represented by one map composed of x coordinate values of the corresponding motion vector and the other map composed of y coordinate values of the corresponding motion vector.

The map for the motion vector may have the same resolution as the area corresponding to the motion vector. For example, the map of the motion vector of the reconstructed peripheral area adjacent to the current block may have the same resolution as the size of the corresponding peripheral area.

The map for the motion vector may be a map in which a color code representing the coordinate values of the motion vector is mapped to each pixel of the map. Alternatively, the map for the motion vector may be a map in which the same color code is mapped to each predetermined region of the map corresponding to the unit of the motion vector. For example, when the unit of a motion vector is a region obtained by dividing a peripheral region, that is, a sub-block unit, the same color code may be mapped to all pixels included in one sub-block of the map. In addition, in a map composed of the x coordinate values (or y coordinate values) of the motion vector, the x coordinate value (or y coordinate value) itself may be mapped to each pixel of the map (or to all pixels included in the unit of the motion vector).
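As a non-limiting illustration, representing motion vectors by one map of x coordinate values and another map of y coordinate values, with one motion vector per sub-block, may be sketched as follows. The function name `mv_component_maps`, the sub-block size, and the motion vector values are assumptions made for this example.

```python
def mv_component_maps(sub_block_mvs, sub_h, sub_w):
    """Expand a grid of per-sub-block (x, y) motion vectors into two pixel maps."""
    x_map, y_map = [], []
    for mv_row in sub_block_mvs:
        # Every pixel of a sub-block receives that sub-block's coordinate value.
        x_row = [mv[0] for mv in mv_row for _ in range(sub_w)]
        y_row = [mv[1] for mv in mv_row for _ in range(sub_w)]
        for _ in range(sub_h):
            x_map.append(list(x_row))
            y_map.append(list(y_row))
    return x_map, y_map


# Hypothetical peripheral region made of two 2x2 sub-blocks side by side.
sub_block_mvs = [[(1, -2), (3, 0)]]
x_map, y_map = mv_component_maps(sub_block_mvs, sub_h=2, sub_w=2)
```

The resulting maps have the same resolution as the region that the motion vectors cover, so they can be supplied to the input layer alongside the pixel data of that region.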

If the resolution of the search region in the reference picture and the resolution of the reconstructed peripheral region in the current picture, both of which are input to the input layer of the CNN, differ from each other, the CNN setting unit 2210 may use a pooling layer or an upsampling layer to adjust the resolutions so that they are the same.
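As a non-limiting illustration, reducing the resolution of a larger input so that it matches a smaller one may be sketched with average pooling. The use of average pooling (rather than, for example, max pooling), the function name `average_pool`, and the sample values are assumptions made for this example.

```python
def average_pool(block, k):
    """Down-scale a 2-D block by averaging non-overlapping k x k windows."""
    out_h, out_w = len(block) // k, len(block[0]) // k
    return [[sum(block[y * k + dy][x * k + dx]
                 for dy in range(k) for dx in range(k)) / (k * k)
             for x in range(out_w)]
            for y in range(out_h)]


# Hypothetical 4x4 search region pooled to match a 2x2 peripheral region.
search_area = [[1, 3, 5, 7],
               [1, 3, 5, 7],
               [2, 2, 6, 6],
               [2, 2, 6, 6]]
pooled = average_pool(search_area, 2)
```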

Meanwhile, image data (ie, output data) output through the output layer of the CNN corresponding to the input data may be motion vectors or prediction pixels of the current block.

FIG. 23 is an exemplary diagram of a layer configuration of a CNN.

Referring to FIG. 23, input data is set to a search region in two reference pictures and three peripheral regions in the current picture. Here, the peripheral areas are selected from the restored three blocks located on the upper side, the left side, and the upper left side with respect to the current block, and each is composed of a separate map and is input to the input layer. In this case, the output data becomes the prediction pixels of the current block.

Specific examples of the layer configuration of the CNN can be summarized as shown in Table 2. However, it should be noted that this is merely exemplary and does not limit the present embodiment.

CNN layer example	Input layer data	Output layer data	Output layer label
Example 1	- Search region in a plurality of reference pictures	Motion vector of the current block, or prediction pixels of the current block	Actual motion vector of the current block, or original block of the current block
Example 2	- Search region in a plurality of reference pictures - Motion vector of the peripheral region	Motion vector of the current block, or prediction pixels of the current block	Actual motion vector of the current block, or original block of the current block
Example 3	- Search region in a plurality of reference pictures - Peripheral region (reconstructed pixels)	Motion vector of the current block, or prediction pixels of the current block	Actual motion vector of the current block, or original block of the current block
Example 4	- Search region in a plurality of reference pictures - Peripheral region (reconstructed pixels) - Motion vector of the peripheral region	Motion vector of the current block, or prediction pixels of the current block	Actual motion vector of the current block, or original block of the current block

Referring to Table 2, the data of the input layer can be composed of various combinations, the data of the output layer is the motion vector or the prediction pixels of the current block, and the label of the output layer is the actual motion vector of the current block or the original block of the current block. Here, the actual motion vector of the current block may mean, for example, a motion vector calculated through motion estimation (ME) using a full search method, or a motion vector obtained by refining it. In this embodiment, the input data and the output data should basically be the same in the learning process and the inference process of the CNN.

Meanwhile, the CNN setting unit 2210 may further set hint information as additional input data in order to improve the accuracy of inter prediction. Here, the hint information may include time axis distance information between the current picture and the reference picture, for example, a difference value between a picture order count (POC) value of the current picture and a POC value of the reference picture. In this case, there are as many pieces of time axis distance information as there are reference pictures. In addition, the hint information may include a quantization parameter (QP). The quantization parameter used as the hint information may be selected from among the quantization parameter values of the current block, the peripheral region, or the search region in the reference picture, or may be a value (e.g., an average value) derived from at least some of them.

FIG. 24 is an exemplary diagram of time axis distance information between a current picture and a reference picture. Specifically, FIG. 24A shows a case of unidirectional prediction, and FIG. 24B shows a case of bidirectional prediction.

Referring to FIG. 24A, in unidirectional prediction, the time axis distance information is -3 and -1 from the left. By using the motion estimation result (solid line) from the search region 2411 in the t-3 picture to the search region 2413 in the t-1 picture, the motion vector from the search region 2411 in the t-3 picture to the current block 2415 in the t picture and the motion vector from the search region 2413 in the t-1 picture to the current block 2415 in the t picture can be inferred.

Referring to FIG. 24B, in bidirectional prediction, the time axis distance information is -1 and +2 from the left. By using the motion estimation result (solid line) from the search region 2431 in the t-1 picture to the search region 2435 in the t+2 picture, the motion vector from the search region 2431 in the t-1 picture to the current block 2433 in the t picture and the motion vector from the search region 2435 in the t+2 picture to the current block 2433 in the t picture can be inferred.
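
The time axis distance values of FIG. 24 can be reproduced with a short sketch. The function name `time_axis_distances`, the sign convention (reference POC minus current POC, chosen so that the results match the values quoted for FIG. 24), and the sample POC value 8 for the t picture are assumptions for illustration.

```python
def time_axis_distances(current_poc, reference_pocs):
    """Return one signed POC difference per reference picture.

    A negative value means the reference picture precedes the current
    picture in output order; a positive value means it follows it.
    """
    return [reference_poc - current_poc for reference_poc in reference_pocs]


# Unidirectional case: references at t-3 and t-1, with t = 8 (sample value).
unidirectional = time_axis_distances(8, [5, 7])
# Bidirectional case: references at t-1 and t+2.
bidirectional = time_axis_distances(8, [7, 10])
```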

The hint information may consist of one or more maps and may be concatenated with a layer of the CNN. The map for the hint information may be concatenated at various locations between the input layer and the output layer. The hint information may be transmitted to the image decoding apparatus through the bitstream and used to decode the current block.

FIG. 25 is an exemplary diagram of a layer configuration of a CNN including hint information. Referring to FIG. 25, the input data is set to search regions in two reference pictures and hint information composed of a map. In this case, the output data is the prediction pixels of the current block.

The CNN setting unit 2210 may calculate filter coefficients through an iterative learning process in order to minimize an error between the output data and the output label. To this end, the CNN setting unit 2210 may use an error backpropagation algorithm. In detail, the CNN setting unit 2210 may propagate an error between the output data and the output label in a reverse direction from the output layer to the input layer through the hidden layer in the learning process of the CNN. In the process of propagating the corresponding error, the CNN setting unit 2210 may update the connection weights between the nodes to reduce the corresponding error. The CNN setting unit 2210 may calculate filter coefficients by repeating a learning process using an error backpropagation algorithm until the corresponding error is less than a predetermined threshold.
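
The iterative learning described above can be sketched for the simplest possible case, a single one-dimensional convolution kernel trained by gradient descent on the mean squared error. For a single linear layer, error backpropagation reduces to this gradient step. The function names, the 3-tap kernel, the learning rate, the threshold, and the sample signal are all assumptions; an actual CNN would have multiple layers and non-linear activations.

```python
def convolve_valid(signal, kernel):
    """Apply the kernel at every position where it fits entirely."""
    taps = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(taps))
            for i in range(len(signal) - taps + 1)]


def train_filter(signal, label, taps=3, learning_rate=0.5,
                 threshold=1e-8, max_iterations=10000):
    """Repeat the update until the error falls below the threshold."""
    kernel = [0.0] * taps
    error = float("inf")
    for _ in range(max_iterations):
        output = convolve_valid(signal, kernel)
        diffs = [o - l for o, l in zip(output, label)]
        error = sum(d * d for d in diffs) / len(diffs)
        if error < threshold:
            break
        # Gradient of the mean squared error with respect to each tap.
        for j in range(taps):
            gradient = 2.0 * sum(d * signal[i + j]
                                 for i, d in enumerate(diffs)) / len(diffs)
            kernel[j] -= learning_rate * gradient
    return kernel, error


# The label is produced by a known kernel so the result can be checked.
signal = [0.1, 0.9, 0.3, 0.7, 0.2, 0.8, 0.5, 0.4, 0.6, 0.0]
true_kernel = [0.25, 0.5, 0.25]
label = convolve_valid(signal, true_kernel)
learned_kernel, final_error = train_filter(signal, label)
```

The loop stops either when the error drops below the threshold or when the iteration limit is reached, so the returned error should be checked by the caller.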

The above filter coefficient calculation process may be performed in a predetermined unit (eg, CU, CTU, slice, frame, or sequence (group of frames)). For example, the CNN setting unit 2210 may calculate filter coefficients for each current block or may calculate filter coefficients for each frame.

When filter coefficients are calculated in units of frames, the filter coefficients may be commonly used for inter prediction of a plurality of current blocks included in the frame.

In addition, the CNN setting unit 2210 may configure a filter coefficient set by previously calculating a plurality of filter coefficients using predetermined sample images. In this case, the CNN setting unit 2210 may set one filter coefficient selected according to a predetermined criterion in the set as the filter coefficient for the current block. For example, the CNN setting unit 2210 may select one filter coefficient from the set based on the similarity of pixel values between the current block and the sample images. Alternatively, the CNN setting unit 2210 may select a filter coefficient closest to the filter coefficient calculated through one learning process from the set. The selection information of the filter coefficient, for example, the index information, may be transmitted to the image decoding apparatus through the bitstream and used for the image decoding process.
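
Selecting the entry of a filter coefficient set that is closest to a learned coefficient can be sketched as follows. Measuring closeness by squared Euclidean distance, the name `select_filter_index`, and the sample coefficient values are assumptions; the text does not fix a particular distance measure.

```python
def select_filter_index(coefficient_set, learned):
    """Return the index of the set entry nearest to the learned coefficients."""
    def squared_distance(candidate):
        return sum((c - l) ** 2 for c, l in zip(candidate, learned))

    return min(range(len(coefficient_set)),
               key=lambda index: squared_distance(coefficient_set[index]))


# Sample set of three pre-computed 3-tap filters.
coefficient_set = [[0.25, 0.5, 0.25],
                   [0.0, 1.0, 0.0],
                   [0.33, 0.34, 0.33]]
index_a = select_filter_index(coefficient_set, [0.2, 0.55, 0.25])
index_b = select_filter_index(coefficient_set, [0.05, 0.9, 0.05])
```

Only the returned index needs to be written to the bitstream when the same set is stored in both the image encoding apparatus and the image decoding apparatus.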

Meanwhile, although FIG. 22 illustrates that the CNN setting unit 2210 is included in the CNN predicting unit 2200, it should be noted that this is exemplary and the present embodiment is not limited thereto. That is, the CNN setting unit 2210 may be implemented as a separate unit from the CNN predicting unit 2200, or may be integrated with the CNN executing unit 2230 and implemented as one unit.

The CNN execution unit 2230 may execute the CNN using the input data and the filter coefficients set by the CNN setting unit 2210 to generate a motion vector of the current block or directly generate prediction pixels of the current block. In this case, the generated prediction pixels may be transferred to a subtractor of the image encoding apparatus and used to generate a residual block from the current block.

Second embodiment

According to the second embodiment, the CNN prediction unit 2200 first generates motion vectors or prediction pixels of the current block according to existing inter prediction schemes (i.e., motion estimation (ME) and motion compensation (MC)). It may then finally generate the motion information or the prediction pixels of the current block by refining the corresponding motion vector or prediction pixels through a CNN-based inference process.

Hereinafter, each configuration and operation of the CNN predictor 2200 according to the second embodiment will be described in detail. However, description of the content overlapping with the first embodiment will be omitted or briefly described.

The CNN prediction unit 2200 determines a reference picture encoded and decoded before the current picture, and searches for reference pixels most similar to the current block in the determined reference picture. The CNN predictor 2200 may generate motion vectors or prediction pixels of the current block by using the found reference pixels. In this case, the generated motion vector or prediction pixels of the current block are set as input data of the CNN by the CNN setting unit 2210.
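
The conventional search step that precedes the CNN refinement can be sketched as a full-search block matching. Using the sum of absolute differences (SAD) as the similarity measure, the function names, and the sample data are assumptions; the returned offsets are relative to the top-left corner of the search region rather than to the position of the current block.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def full_search(current_block, search_region):
    """Test every candidate position and keep the lowest-cost one."""
    size = len(current_block)  # square block assumed
    best = None
    for dy in range(len(search_region) - size + 1):
        for dx in range(len(search_region[0]) - size + 1):
            candidate = [row[dx:dx + size]
                         for row in search_region[dy:dy + size]]
            cost = sad(current_block, candidate)
            if best is None or cost < best[0]:
                best = (cost, dx, dy, candidate)
    return best


# Sample 4x4 search region and a 2x2 current block that occurs inside it.
search_region = [[0, 1, 2, 3],
                 [4, 5, 6, 7],
                 [8, 9, 10, 11],
                 [12, 13, 14, 15]]
current_block = [[9, 10],
                 [13, 14]]
cost, dx, dy, prediction = full_search(current_block, search_region)
```

In the second embodiment, the offset or the prediction pixels found this way would serve as input data of the CNN, which then refines them.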

The CNN setting unit 2210 may set filter coefficients (that is, coefficient values of a convolution kernel) to be applied to the CNN. The filter coefficient may be calculated by performing a CNN learning process on input data including a search region in a reference picture. The input data may further include at least one of a peripheral region adjacent to the current block in the current picture and a motion vector of the peripheral region. In addition, the input data may further include at least one of hint information for improving the accuracy of inter prediction, for example, time axis distance information between a current picture and a reference picture and a quantization parameter (QP). In addition, the filter coefficient may be a value selected from a preset specific value or a set consisting of a plurality of preset specific values.

The CNN execution unit 2230 may execute the CNN using the input data and the filter coefficients set by the CNN setting unit 2210 to refine the motion vector or the prediction pixels of the current block, thereby finally generating the motion vector or the prediction pixels of the current block. In this case, the generated prediction pixels may be transferred to a subtractor of the image encoding apparatus and used to generate a residual block from the current block.

FIG. 26 is a block diagram illustrating a configuration of a CNN prediction unit that may be included in an image decoding apparatus according to an embodiment of the present invention. The CNN predictor 2600 of FIG. 26 may be, for example, an inter predictor 444 of the image decoding apparatus illustrated in FIG. 4, or one module included in the inter predictor 444. Referring to FIG. 26, the CNN predictor 2600 may include a CNN setting unit 2610 and a CNN execution unit 2630.

First embodiment

According to the first embodiment, the CNN prediction unit 2600 may determine a reference picture based on the reference picture selection information signaled from the image encoding apparatus, and may perform the inference process of the CNN using the determined reference picture to generate a motion vector or prediction pixels of the current block.

Hereinafter, each configuration and operation of the CNN predictor 2600 according to the first embodiment will be described in detail. However, a description of overlapping contents with the CNN prediction unit 2200 that may be included in the image encoding apparatus described above with reference to FIG. 22 will be omitted or briefly described.

The CNN setting unit 2610 may set input data. The CNN setting unit 2610 may select a reference picture based on the reference picture selection information signaled from the image encoding apparatus, and set a search region in the selected reference picture as input data. Here, the search area in the reference picture means a specific area in the reference picture having a size larger than or equal to the size of the current block, and the position and size of the search area are as described above with reference to FIG.

Meanwhile, when the reference picture is selected based on the inter prediction direction in the image encoding process, the CNN setting unit 2610 may select the reference picture based on the information on the inter prediction direction signaled from the image encoding apparatus. For example, in the case of unidirectional prediction, similarly to a video encoding apparatus, a reference picture of a specific order of reference picture list 0 may be selected as input data. In the case of bidirectional prediction, similarly to the image encoding apparatus, a reference picture of a particular order of reference picture list 0 and a reference picture of a particular order of reference picture list 1 may be selected. Here, the selection information of the reference picture may be a value indicating a reference picture of a specific order in the selected reference picture list, for example, a reference picture index value. In addition, the selection information of the reference picture may be a picture order count (POC) value of the selected picture or a difference value between the POC value of the selected picture and the POC value of the current picture.

The CNN setting unit 2610 may further set at least one of a reconstructed peripheral region adjacent to the current block in the current picture and a motion vector of the peripheral region as additional input data. Here, the motion vector may consist of one or more maps expressed in color code. For example, the motion vector may be represented by one map composed of color values corresponding to the values of the corresponding vector (x, y). In addition, the motion vector may be represented by one map composed of x coordinate values of the vector and the other map composed of y coordinate values of the vector.

Specific examples of the layer configuration of the CNN are as described above with reference to Table 2. However, it should be noted that this is merely exemplary and does not limit the present embodiment.

Meanwhile, the CNN setting unit 2610 may further set hint information as additional input data in order to improve the accuracy of inter prediction. Here, the hint information may include time axis distance information between the current picture and the reference picture, for example, a difference value between a picture order count (POC) value of the current picture and a POC value of the reference picture. In this case, there are as many pieces of time axis distance information as there are reference pictures. In addition, the hint information may include a quantization parameter (QP) value of the current block, the peripheral region, or the search region in the reference picture.

The hint information may be concatenated with the input layer or the convolutional layer of the CNN to form one concatenation layer. In this case, the hint information may be composed of one or more maps having the same resolution as the layer to be concatenated. The hint information may be transmitted to the image decoding apparatus through the bitstream and used as input data for CNN based inter prediction.
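
Concatenation of hint information with a layer can be sketched as appending one map per hint value to the maps of that layer. Filling each hint map with a single constant value, the function names, and the sample values are assumptions; the text only requires that the hint maps have the same resolution as the layer to be concatenated.

```python
def constant_map(value, height, width):
    """Create a map in which every pixel carries the same hint value."""
    return [[value] * width for _ in range(height)]


def concatenate_hints(layer_maps, hint_values):
    """Append one hint map per hint value to the maps of a layer."""
    height = len(layer_maps[0])
    width = len(layer_maps[0][0])
    hint_maps = [constant_map(value, height, width) for value in hint_values]
    return layer_maps + hint_maps


# Two 2x3 maps, two time axis distances and one QP value (sample values).
layer_maps = [[[1, 2, 3], [4, 5, 6]],
              [[7, 8, 9], [10, 11, 12]]]
concatenated = concatenate_hints(layer_maps, [-3, -1, 32])
```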

The CNN setting unit 2610 may set the filter coefficients signaled from the image encoding apparatus as filter coefficients for inter prediction of the current block.

In addition, when the filter coefficients are preset to a specific value and stored in both the image encoding apparatus and the image decoding apparatus, the CNN setting unit 2610 may set the filter coefficients previously stored in the image decoding apparatus as the filter coefficients for inter prediction of the current block.

In addition, when the filter coefficients are preset to a plurality of specific values that form a set stored in both the image encoding apparatus and the image decoding apparatus, the CNN setting unit 2610 may set, as the filter coefficient for inter prediction of the current block, the specific filter coefficient of the set that is selected according to the selection information of the filter coefficients signaled from the image encoding apparatus, for example, index information.

The CNN execution unit 2630 may generate the prediction pixels by inferring the motion information of the current block by executing the CNN using the input data and the filter coefficients set by the CNN setting unit 2610. In this case, the generated prediction pixels may be transferred to an adder and added to the residual block to be used to recover the current block.
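
The relationship between the subtractor on the encoding side and the adder on the decoding side can be sketched as below. Transform and quantization of the residual block are deliberately left out, so the reconstruction here is exact; the function names and sample pixel values are assumptions.

```python
def subtract_blocks(current_block, prediction):
    """Encoder side: residual block = current block - prediction pixels."""
    return [[c - p for c, p in zip(cur_row, pred_row)]
            for cur_row, pred_row in zip(current_block, prediction)]


def add_blocks(prediction, residual):
    """Decoder side: reconstructed block = prediction pixels + residual block."""
    return [[p + r for p, r in zip(pred_row, res_row)]
            for pred_row, res_row in zip(prediction, residual)]


current_block = [[52, 55], [61, 59]]
prediction = [[50, 56], [60, 60]]
residual = subtract_blocks(current_block, prediction)
reconstructed = add_blocks(prediction, residual)
```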

Second embodiment

According to the second embodiment, the CNN prediction unit 2600 first generates motion vectors or prediction pixels of the current block according to existing inter prediction schemes (i.e., motion estimation (ME) and motion compensation (MC)). It may then finally generate the motion information or the prediction pixels of the current block by refining the corresponding motion vector or prediction pixels through a CNN-based inference process.

Hereinafter, each configuration and operation of the CNN predictor 2600 according to the second embodiment will be described in detail. However, a description of overlapping contents with the CNN predictor 2600 of the first embodiment and the CNN predictor 2200 that may be included in the image encoding apparatus described above with reference to FIG. 22 will be omitted or briefly described.

The CNN predictor 2600 may determine a motion vector of the current block and a reference picture referenced by the motion vector by using a syntax element of the inter prediction mode extracted from the decoder. The CNN predictor 2600 may generate motion information or prediction pixels of the current block by predicting the current block by using the determined motion vector and the reference picture.

The CNN setting unit 2610 may set the generated motion information or prediction pixels of the current block as image data (that is, input data) to be input to the input layer of the CNN.

In addition, the CNN setting unit 2610 may set filter coefficients (that is, coefficient values of a convolution kernel) to be applied to the CNN. Here, the filter coefficient may be a value transmitted from the image encoding apparatus. In addition, the filter coefficient may be a value selected by the image encoding apparatus in a set consisting of a predetermined specific value or a plurality of predetermined specific values.

Meanwhile, although FIG. 26 illustrates that the CNN setting unit 2610 is included in the CNN predicting unit 2600, it should be noted that this is exemplary and the present embodiment is not limited thereto. That is, the CNN setting unit 2610 may be implemented as a separate unit from the CNN prediction unit 2600, or may be integrated with the CNN execution unit 2630 and implemented as a unit.

The CNN execution unit 2630 may execute the CNN using the input data and the filter coefficients set by the CNN setting unit 2610 to refine the motion vector or the prediction pixels of the current block, thereby finally generating the motion vector or the prediction pixels of the current block. In this case, the generated prediction pixels may be transferred to an adder and added to the residual block to be used to reconstruct the current block.

Hereinafter, an exemplary method of performing CNN based inter prediction according to the present embodiment will be described with reference to FIGS. 27 and 28.

FIGS. 27A and 27B are flowcharts illustrating a process of performing inter prediction by a CNN predictor included in the image encoding apparatus illustrated in FIG. 22. In detail, FIG. 27A illustrates a CNN based inter prediction process according to the first embodiment, and FIG. 27B illustrates a CNN based inter prediction process according to the second embodiment.

First embodiment

Referring to FIG. 27A, in operation S2711, the CNN setting unit 2210 may set image data (ie, input data) to be input to an input layer of the CNN in order to perform an inference process of the CNN. Here, the input data may include a search region in the reference picture. The input data may further include at least one of a reconstructed peripheral region adjacent to the current block in the current picture and a motion vector of the peripheral region. In each case, the input data may further include hint information, such as time axis distance information between the current picture and the reference picture, in order to improve the accuracy of inter prediction.

In addition, the CNN setting unit 2210 may set filter coefficients (that is, coefficient values of a convolution kernel) to be applied to the CNN. Here, the filter coefficient may be a value calculated by performing a learning process of the CNN on the input data set in step S2711. In addition, the filter coefficient may be a value selected from a preset specific value or a set consisting of a plurality of preset specific values.

In operation S2713, the CNN execution unit 2230 may execute the CNN using the input data and the filter coefficients set in operation S2711 to generate a motion vector of the current block or directly generate prediction pixels of the current block. In this case, the generated prediction pixels may be transferred to a subtractor and used to generate a residual block from the current block.

Second embodiment

Referring to FIG. 27B, in operation S2731, the CNN prediction unit 2200 may generate motion vectors or prediction pixels of the current block according to existing inter prediction schemes (i.e., motion estimation (ME) and motion compensation (MC)).

In detail, the CNN prediction unit 2200 determines a reference picture encoded and decoded before the current picture, and searches for reference pixels most similar to the current block in the determined reference picture. The CNN predictor 2200 may generate motion vectors or prediction pixels of the current block by using the found reference pixels.

In operation S2733, the CNN setting unit 2210 may set the motion vector or the prediction pixels of the current block generated in operation S2731 as image data (that is, input data) to be input to the input layer of the CNN.

In addition, the CNN setting unit 2210 may set filter coefficients (that is, coefficient values of a convolution kernel) to be applied to the CNN. The filter coefficient may be calculated by performing a learning process of the CNN on the set input data. The input data may include a predetermined search region in the reference picture. The input data may further include at least one of a reconstructed peripheral region adjacent to the current block in the current picture and a motion vector of the peripheral region. In each case, the input data may further include hint information for improving the accuracy of inter prediction, for example, at least one of time axis distance information between the current picture and the reference picture and a quantization parameter (QP). In addition, the filter coefficient may be a preset specific value or a value selected from a set consisting of a plurality of preset specific values.

In operation S2735, the CNN execution unit 2230 may execute the CNN using the input data and the filter coefficients set in operation S2733 to refine the motion vector or the prediction pixels of the current block generated in operation S2731, thereby finally generating the motion vector or the prediction pixels of the current block. In this case, the generated prediction pixels may be transferred to a subtractor of the image encoding apparatus and used to generate a residual block from the current block.

FIGS. 28A and 28B are flowcharts illustrating a process of performing inter prediction by a CNN predictor included in the image decoding apparatus illustrated in FIG. 26. In detail, FIG. 28A illustrates a CNN based inter prediction process according to the first embodiment, and FIG. 28B illustrates a CNN based inter prediction process according to the second embodiment.

First embodiment

Referring to FIG. 28A, in operation S2811, the CNN setting unit 2610 may set image data (ie, input data) to be input to an input layer of the CNN to perform an inference process of the CNN. Here, the input data may include a search region in the reference picture determined based on the reference picture selection information signaled from the image encoding apparatus. The input data may further include at least one of a reconstructed peripheral area of the current block in the current picture and a motion vector of the peripheral area. And in each case, the input data may further include hint information to improve the accuracy of inter prediction.

In addition, the CNN setting unit 2610 may set filter coefficients (that is, coefficient values of a convolution kernel) to be applied to the CNN. Here, the filter coefficient may be a value signaled from the image encoding apparatus. In addition, the filter coefficient may be a predetermined specific value or a value selected by the image encoding apparatus from a set consisting of a plurality of predetermined specific values.

In operation S2813, the CNN execution unit 2630 may generate a motion vector or prediction pixels of the current block by executing the CNN using the input data and the filter coefficients set in operation S2811. In this case, the generated prediction pixels may be transferred to an adder of the image decoding apparatus and added to the residual block to be used to reconstruct the current block.

Second embodiment

Referring to FIG. 28B, in operation S2831, the CNN prediction unit 2600 may generate a motion vector or prediction pixels of the current block according to existing inter prediction schemes (i.e., motion estimation (ME) and motion compensation (MC)).

In detail, the CNN predictor 2600 may determine a motion vector of the current block and a reference picture referenced by the motion vector by using a syntax element for the inter prediction mode extracted from the decoder. The CNN predictor 2600 may generate a motion vector or prediction pixels of the current block by predicting the current block by using the determined motion vector and the reference picture.

In operation S2833, the CNN setting unit 2610 may set motion vector or prediction pixels of the current block generated in operation S2831 as image data (that is, input data) to be input to the input layer of the CNN.

In operation S2835, the CNN execution unit 2630 may execute the CNN using the filter coefficient set in operation S2833 to refine the motion vector or the prediction pixels of the current block generated in operation S2831, thereby finally generating the motion vector or the prediction pixels of the current block. In this case, the generated prediction pixels may be transferred to an adder of the image decoding apparatus and added to the residual block to be used to reconstruct the current block.

4. CNN-based intra prediction (2) - CNN-based reference region filtering

Some techniques of this disclosure relate to performing CNN based filtering on a reference region in order to minimize the quantization error of the reference region used for intra prediction of the current block. In other words, instead of directly generating the prediction block of the current block based on the CNN, the peripheral region used for intra prediction of the current block is filtered, so that the prediction accuracy may be greatly improved without significantly changing the existing intra prediction structure.

First, a method of calculating a filter coefficient of the CNN for use in filtering the reference region will be described.

FIG. 29 is a flowchart illustrating a process of calculating a filter coefficient of a CNN according to an embodiment of the present invention.

Referring to FIG. 29, in operation S2910, the image encoding apparatus may set input data of a CNN. The input data may include a reference region encoded before the current block, which is a block to be encoded.

The reference region may include at least one of a peripheral region adjacent to the current block and, among the luma block and chroma blocks constituting the current block, a block of a component encoded before the block of the component to be encoded. That is, the peripheral region may be a region of the same component as the current block or may be a region of another component. In addition, the reference region may further include a new region (i.e., an average block, an average line, or an average pixel) generated by averaging pixel values of the peripheral region.

The peripheral region may be configured in pixel units (i.e., peripheral lines or peripheral pixels) or in block units (i.e., peripheral blocks). FIG. 30 is an exemplary diagram of a peripheral region that may be set as input data of a CNN. Specifically, FIG. 30A illustrates a peripheral region in pixel units, and FIG. 30B illustrates a peripheral region in block units.

Referring to FIG. 30A, the peripheral region in pixel units (i.e., the peripheral pixels or the peripheral lines) may include '1 × 1' pixels adjacent to the current block X, or lines of '1 × n' or 'n × 1'.

Referring to (b) of FIG. 30, the peripheral region in block units (that is, the neighboring blocks) may include a left block (C), an upper block (B), an upper right block (D), a lower left block (E), and an upper left block (A) adjacent to the current block (X). In this specification, the original blocks (i.e., uncoded blocks), prediction blocks, and reconstructed blocks of the neighboring blocks are denoted differently. For example, for the upper left block A, the original block is denoted by 'Ao', the prediction block by 'Ap', and the reconstructed block by 'Ar'. In addition, an average block obtained by averaging pixel values of the neighboring blocks A, B, C, D, and E is denoted by 'F'. Since a peripheral region in block units offers a wider application range for the convolution kernel than a peripheral region in pixel units, setting the peripheral region in block units as the input data of the CNN can improve the accuracy of the output data.
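
The average block F can be sketched as a pixel-wise mean over the neighboring blocks. The function name `average_block`, equally sized blocks, and the sample pixel values are assumptions for illustration.

```python
def average_block(blocks):
    """Average co-located pixel values over equally sized blocks."""
    count = len(blocks)
    height = len(blocks[0])
    width = len(blocks[0][0])
    return [[sum(block[row][col] for block in blocks) / count
             for col in range(width)]
            for row in range(height)]


# Five 2x2 neighboring blocks A, B, C, D, and E with constant sample values.
block_a = [[10, 10], [10, 10]]
block_b = [[20, 20], [20, 20]]
block_c = [[30, 30], [30, 30]]
block_d = [[40, 40], [40, 40]]
block_e = [[50, 50], [50, 50]]
block_f = average_block([block_a, block_b, block_c, block_d, block_e])
```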

When neighboring blocks of a component different from that of the current block are input to the CNN, one or more blocks (not shown) among the right block, the lower block, and the lower right block of the current block X may be further input to the CNN, in addition to the neighboring blocks Ar, Br, Cr, Dr, and Er shown in the drawing. For example, when the current block is a chroma block, the accuracy of the output data may be further improved by additionally inputting to the CNN one or more blocks among the right block, the lower block, and the lower right block of the current block of the luma component that have already been encoded. Hereinafter, for convenience of description, the present embodiment will be described on the premise that the reference region is composed of a peripheral region in block units, that is, one or more peripheral blocks.

The input data may be composed of at least one layer and input to the CNN. For example, as illustrated in (a) of FIG. 31, the neighboring blocks Ar, Br, Cr, Dr, and Er may each be configured as a separate layer and input to the CNN. In addition, as shown in (b) of FIG. 31, all or part of the neighboring blocks (Ar and Br) may be integrated by a multiplier or the like, configured as a single layer, and input to the CNN.

The input data may further include additional information to improve the output accuracy of the CNN. The additional information may include any encoding related information that can be referred to by the image encoding / decoding apparatus. For example, the additional information may include at least one of a quantization parameter (QP) value of the peripheral region, a quantization parameter (QP) value of the current block (in the case of an image decoding apparatus), and information about a residual of the peripheral region. Here, in the frequency domain, the information about the residual of the peripheral region may include an absolute value of each of the transform coefficients of the peripheral region or an absolute sum of all the transform coefficients. In the spatial domain, it may include an absolute value of each of the residual signals of the peripheral region or an absolute sum of all the residual signals.
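
The residual information in the spatial domain can be sketched as follows; the same computation would apply to transform coefficients in the frequency domain. The function name `residual_info` and the sample residual values are assumptions.

```python
def residual_info(residual_block):
    """Return the per-sample absolute values and their overall sum."""
    absolute_values = [[abs(value) for value in row]
                       for row in residual_block]
    absolute_sum = sum(sum(row) for row in absolute_values)
    return absolute_values, absolute_sum


# Sample 2x2 residual block of a peripheral region.
absolute_values, absolute_sum = residual_info([[-3, 2], [0, -1]])
```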

When the input data includes a reference area in units of lines and k additional information, a convolution kernel of 'n × 1 × k' or '1 × n × k' may be applied to the input layer of the CNN. Alternatively, when the input data includes a reference area in units of blocks and k additional information, a convolution kernel of 'n × m × k' may be applied to the input layer of the CNN.
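To make the kernel shapes concrete, the following sketch applies a '1 × n × k' kernel to a line-unit input and an 'n × m × k' kernel to a block-unit input. It assumes that the input planes are stacked along a third axis of depth k so that the kernel depth matches the number of planes; the function `conv_valid`, the sizes, and the all-ones values are illustrative only.

```python
import numpy as np

def conv_valid(inputs, kernel):
    """Valid (no padding) convolution of an (H, W, k) input with an (n, m, k) kernel.

    Like most CNN frameworks, this computes cross-correlation: the kernel is
    not flipped. The result is a single (H-n+1, W-m+1) feature map.
    """
    H, W, k = inputs.shape
    n, m, kk = kernel.shape
    assert k == kk, "kernel depth must match the number of input planes"
    out = np.zeros((H - n + 1, W - m + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(inputs[y:y + n, x:x + m, :] * kernel)
    return out

k = 3  # assumed number of stacked input planes
line_input = np.ones((1, 16, k))   # line-unit reference: one row of 16 samples
block_input = np.ones((8, 8, k))   # block-unit reference: 8x8 samples

line_map = conv_valid(line_input, np.ones((1, 3, k)))    # '1 x n x k' with n = 3
block_map = conv_valid(block_input, np.ones((3, 3, k)))  # 'n x m x k' with n = m = 3
```

The line-unit kernel slides along one dimension only, while the block-unit kernel covers a two-dimensional neighborhood in every plane at once.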

On the other hand, in the YCbCr 4:2:0 or 4:2:2 format, the chroma block may be used at its original size, or may be up-scaled to the same size as the luma block using an upsampling layer.
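As a rough illustration of bringing a chroma block to the luma block size, the sketch below uses nearest-neighbor repetition. The text only states that an upsampling layer is used; nearest-neighbor interpolation, the function name `upsample_chroma`, and the sample values are assumptions made for this example.

```python
import numpy as np

def upsample_chroma(chroma, luma_shape):
    """Nearest-neighbor upsampling of a chroma block to the luma block size."""
    fy = luma_shape[0] // chroma.shape[0]  # vertical scale factor
    fx = luma_shape[1] // chroma.shape[1]  # horizontal scale factor
    return np.repeat(np.repeat(chroma, fy, axis=0), fx, axis=1)

cb_420 = np.array([[10, 20], [30, 40]])  # 2x2 chroma for a 4x4 luma block (4:2:0)
cb_422 = np.array([[10, 20]] * 4)        # 4 rows x 2 columns of chroma (4:2:2)
up_420 = upsample_chroma(cb_420, (4, 4))
up_422 = upsample_chroma(cb_422, (4, 4))
```

In 4:2:0 both dimensions are doubled, whereas in 4:2:2 only the horizontal dimension is doubled.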

In operation S2920, the apparatus for encoding an image may set an output label to be used for supervised learning of the CNN.

The output label means an explicit correct answer to the input data set in step S2910 and is used to calculate a squared error through comparison with the output data of the CNN. The output label may be the original pixel values of the peripheral region set as the input data, or the pixel values of another component of the peripheral region to which a quantization parameter (QP) value smaller than the QP value applied to the peripheral region has been applied.

The output data refers to data output through the output layer as a result of executing the CNN, and may be pixel values in which the pixel values of the peripheral area set as the input data are restored to a level before quantization (hereinafter referred to as the 'restored peripheral area'). The input data and output data of the CNN should basically be the same in the learning process and the inference process.

Specific examples of the layer configuration of the CNN are shown in Table 3.

CNN layer example	Input layer: data	Output layer: data	Output layer: label
Example 1	Peripheral area in line units; additional information	Restored peripheral area, or residual information of the restored peripheral area	Original pixel values of the peripheral area, or pixel values of a component previously decoded using a QP smaller than the current QP among the components of the peripheral area
Example 2	Peripheral area in block units; additional information	Restored peripheral area, or residual information of the restored peripheral area	Original pixel values of the peripheral area, or pixel values of a component previously decoded using a QP smaller than the current QP among the components of the peripheral area
Example 3	Peripheral area in line units or block units; additional information	Restored peripheral area, or residual information of the restored peripheral area	Original pixel values of the peripheral area, or pixel values of a component previously decoded using a QP smaller than the current QP among the components of the peripheral area

Referring to Table 3, the data of the input layer (input data) may be composed of a peripheral area in line units and/or block units together with additional information. In this case, the data of the output layer (output data) may be a restored peripheral area in which the peripheral area set as the input data approximates the original pixel values before quantization, or residual information (pixel values) of the restored peripheral area. In addition, the label of the output layer may be the original pixel values of the peripheral area set as the input data, or the pixel values of a component that, among the luma block and chroma blocks constituting the peripheral area, was decoded using a quantization parameter (QP) smaller than that of the component set as the input data. Such input data and output data should basically be the same in the learning process and the inference process of the CNN. However, it should be noted that this is merely exemplary and does not limit the present embodiment.

In operation S2930, the image encoding apparatus may calculate the filter coefficients of the CNN by repeatedly performing the supervised learning process of the CNN using an error backpropagation algorithm, based on the input data set in operation S2910 and the output label set in operation S2920. The image encoding apparatus may calculate a filter coefficient for each of the Y, Cb, and Cr components, or may calculate one filter coefficient commonly applied to all of the Y, Cb, and Cr components.
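The supervised learning loop of operations S2910 to S2930 can be illustrated with a deliberately reduced stand-in: a single linear 3-tap filter instead of a multi-layer CNN, trained by gradient descent on the mean squared error between the output data and the output label. For a single linear layer, error backpropagation reduces to the gradient shown here. The function `train_filter`, the learning rate, the step count, and the synthetic signals are all assumptions made for this sketch.

```python
import numpy as np

def train_filter(inputs, labels, taps=3, lr=0.05, steps=500):
    """Fit a 1-D filter so that filtering `inputs` approximates `labels`.

    Minimizes the mean squared error by gradient descent; for a single
    linear layer, backpropagation reduces to the gradient used here.
    """
    pad = taps // 2
    padded = np.pad(inputs, pad, mode="edge")
    # Each row holds the `taps` samples seen by the filter at one position.
    patches = np.stack([padded[i:i + taps] for i in range(len(inputs))])
    w = np.zeros(taps)
    w[pad] = 1.0  # start from the identity filter
    losses = []
    for _ in range(steps):
        err = patches @ w - labels  # output data minus output label
        losses.append(float(np.mean(err ** 2)))
        w -= lr * 2.0 * patches.T @ err / len(labels)
    return w, losses

rng = np.random.default_rng(0)
original = np.sin(np.linspace(0.0, 3.0, 64))         # stand-in for original samples
degraded = original + rng.normal(0.0, 0.1, size=64)  # stand-in for quantization error
coeffs, losses = train_filter(degraded, original)
```

The returned `coeffs` play the role of the filter coefficients that the image encoding apparatus would later apply or signal; a real CNN would have many such kernels and nonlinear layers between them.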

Hereinafter, a configuration and method for filtering a reference region used for intra prediction based on CNN will be described in detail with reference to the accompanying drawings.

FIG. 32 is a block diagram illustrating a configuration of a CNN-based filter unit according to an embodiment of the present invention. Referring to FIG. 32, the CNN-based filter unit 3200 may include a CNN setting unit 3210 and a CNN execution unit 3230.

The structure of the CNN is as described above with reference to FIG. 5, and the CNN may further include an upsampling layer or a pooling layer to adjust the size of the layer.

The CNN setting unit 3210 may set input data of the CNN. As described above with reference to FIG. 29, the input data may include a reference region reconstructed before the current block. The reference region may include at least one of a peripheral region adjacent to the current block and a block of a component that is encoded before the block of the component to be encoded, among the luma block and chroma blocks constituting the current block. That is, the peripheral region may be a region of the same component as the current block or a region of another component. In addition, the reference region may further include a new region (i.e., an average block, an average line, or an average pixel) generated by averaging pixel values of the peripheral region.

The peripheral region may be configured in pixel units (i.e., peripheral lines or peripheral pixels) or in block units (i.e., peripheral blocks). Since a peripheral region in block units gives the convolution kernel a wider application range than a peripheral region in pixel units, the accuracy of the output data can be improved by setting the peripheral region in block units as the input data of the CNN.

The input data may be composed of at least one layer and input to the CNN. For example, the input data may be composed of one layer for each neighboring block and input to the CNN. In addition, the input data may be composed of one layer in which all or part of the neighboring blocks (Ar and Br) are integrated and input to the CNN.

The input data may further include additional information to improve the accuracy of the CNN learning process and the CNN inference process. The additional information may include at least one of a quantization parameter (QP) value of the reference region, a quantization parameter value of the current block (in the case of an image decoding apparatus), and information about a residual of the reference region. Here, the information about the residual of the reference region may be the respective absolute values or the absolute sum of the transform coefficients of the reference region in the frequency domain, or may be the respective absolute values or the absolute sum of the residual signals in the spatial domain. In addition, the additional information may further include intra prediction mode information (e.g., directional information of intra prediction) of the current block.

When the input data includes a periphery area in line units and k additional information, a convolution kernel of 'n × 1 × k' or '1 × n × k' may be applied to the input layer of the CNN. In addition, when the input data includes a peripheral area of block unit and k additional information, a convolution kernel of 'n × m × k' may be applied to the input layer of the CNN.

The CNN setting unit 3210 may set the filter coefficient calculated by the image encoding apparatus as a filter coefficient for intra prediction of the current block. In this case, the filter coefficient may be a value calculated by a video encoding apparatus in a predetermined unit, for example, a CU unit or a frame unit.

When the filter coefficient is set in units of frames, the filter coefficient may be commonly used for intra prediction of a plurality of current blocks included in the frame.

When the image encoding apparatus and the image decoding apparatus maintain a filter coefficient set, the CNN setting unit 3210 may also set the filter coefficient for intra prediction of the current block from the filter coefficient set, based on the index information of the filter coefficient transmitted from the image encoding apparatus.

Meanwhile, although FIG. 32 illustrates that the CNN setting unit 3210 is included in the CNN-based filter unit 3200, it should be noted that this is exemplary and the present embodiment is not limited thereto. That is, the CNN setting unit 3210 may be implemented as a separate unit from the CNN-based filter unit 3200. In addition, the CNN setting unit 3210 may be integrated with the CNN execution unit 3230 and implemented as one unit.

The CNN execution unit 3230 may perform a CNN-based inference process on the input data using the filter coefficients set by the CNN setting unit 3210, that is, the coefficient values of the convolution kernel, thereby generating the output data, that is, a prediction block for the current block.

In this case, the generated prediction block may be i) delivered to the subtractor in the case of the image encoding apparatus and used to generate a residual block of the current block, or ii) delivered to the adder in the case of the image decoding apparatus and added to the residual block of the current block to reconstruct the current block.

FIG. 33 is a flowchart illustrating a filtering process of a reference region according to an aspect of the present embodiment.
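The roles of the subtractor and the adder can be sketched as below. The function names, the 8-bit default sample range, and the sample values are assumptions, and the residual is passed through without transform or quantization so that the round trip is exact; in an actual codec the decoder would add a dequantized residual instead.

```python
import numpy as np

def make_residual(current, prediction):
    """Encoder side (subtractor): residual = current block - prediction block."""
    return current.astype(np.int16) - prediction.astype(np.int16)

def reconstruct(prediction, residual, bit_depth=8):
    """Decoder side (adder): prediction + residual, clipped to the sample range."""
    total = prediction.astype(np.int16) + residual
    return np.clip(total, 0, (1 << bit_depth) - 1).astype(np.uint16)

current = np.array([[120, 130], [140, 150]], dtype=np.uint8)
prediction = np.array([[118, 133], [140, 149]], dtype=np.uint8)
residual = make_residual(current, prediction)
restored = reconstruct(prediction, residual)
```

The closer the prediction block is to the current block, the smaller the residual that has to be coded, which is the motivation for improving the prediction.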

Referring to FIG. 33, in operation S3310, the intra predictor may determine an intra prediction mode to be used for encoding or decoding a current block. The intra prediction mode may include a plurality of modes according to the prediction direction as described above with reference to FIG. 3. For example, the intra prediction mode may include 65 directional modes and a non-directional mode including a planar mode and a DC mode.

The intra prediction unit may select a reference region to be used for intra prediction of the current block according to the determined intra prediction mode. That is, the reference region may be configured differently according to the intra prediction mode of the current block. The reference region may include a restored peripheral region adjacent to the current block as described above with reference to FIG. 29. In addition, the reference region may further include additional information related to intra prediction of the current block.

In operation S3320, the intra predictor may determine whether to perform filtering on the reference region selected in operation S3310 by determining whether the preset filtering condition is satisfied.

The reference region selected for intra prediction of the current block may have a quantization error with original pixel values while undergoing quantization / dequantization. Since the quantization error causes a decrease in the accuracy of intra prediction, it is necessary to filter the reference region before performing the intra prediction on the current block in order to minimize the quantization error. However, filtering the reference region does not guarantee minimization of the quantization error, and there is a problem that the complexity of the image encoding and decoding process may increase due to the filtering. Therefore, the intra prediction unit according to the present embodiment may adaptively perform filtering on the reference region to be used for intra prediction of the current block only under specific conditions.

The filtering condition may be set based on the size of each of the reference regions selected for intra prediction of the current block. For example, the intra predictor may perform filtering on a neighboring block only when the size of the neighboring block included in the reference region is greater than or equal to '4 × 4'. In addition, the filtering condition may be set based on the intra prediction mode of the current block and the size of the current block. For example, when the intra prediction mode is the DC mode, filtering is not performed on the reference region regardless of the size of the current block, and when the intra prediction mode is the directional mode having the prediction direction of 'Vertical-Right', filtering of the reference region may be performed only when the size of the current block is greater than or equal to '8 × 8'. However, it should be noted that this is exemplary and does not limit the present embodiment.
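The example conditions above can be combined into a single decision function, as in the following sketch. The text presents the size-based and mode-based conditions as separate examples; applying them together, requiring every neighboring block to meet the size threshold, and filtering by default for modes the text does not mention are assumptions of this sketch, as are the mode labels.

```python
def should_filter_reference(intra_mode, block_w, block_h, neighbor_sizes):
    """Return True when the reference region should be filtered.

    intra_mode: "DC", "VERTICAL_RIGHT", or any other mode label (hypothetical names).
    neighbor_sizes: list of (width, height) for the neighboring blocks.
    """
    if intra_mode == "DC":
        return False  # DC mode: never filter, regardless of block size
    if intra_mode == "VERTICAL_RIGHT" and (block_w < 8 or block_h < 8):
        return False  # Vertical-Right: filter only for blocks of 8x8 or larger
    # Size condition on the reference region: every neighbor at least 4x4.
    return all(w >= 4 and h >= 4 for (w, h) in neighbor_sizes)
```

For instance, `should_filter_reference("VERTICAL_RIGHT", 8, 8, [(4, 4)])` evaluates to `True`, while the same call with a 4 × 4 current block evaluates to `False`.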

As such, the intra predictor may adaptively filter the reference region selected for intra prediction of the current block, thereby minimizing the increase in the complexity of the image encoding and decoding process while improving the accuracy of the intra prediction.

If the result of the determination in step S3320 satisfies the preset filtering condition (Yes), the intra prediction unit may determine to perform filtering on the reference region selected for intra prediction of the current block, and proceed to step S3330.

If it is determined in operation S3320 that the preset filtering condition is not satisfied (No), the intra prediction unit may determine not to perform filtering on the reference region selected for intra prediction of the current block, and proceed to operation S3340.

In operation S3330, the intra predictor may perform filtering on the reference region selected in operation S3310 to generate a reference region in which the pixel values are restored to a level before quantization (hereinafter referred to as the 'filtered reference region'). Here, a CNN-based filter may be used to filter the corresponding reference region. Hereinafter, a process of filtering the reference region by using the CNN-based filter will be described in detail.

In operation S3332, the intra prediction unit may set the input data and the filter coefficient of the CNN to perform CNN-based filtering on the reference region.

Specifically, the input data of the CNN is set to the reference region to be used for intra prediction of the current block, that is, the reference region selected in step S3310.

The filter coefficient of the CNN is set to the filter coefficient calculated through the supervised learning process of the image encoding apparatus. The filter coefficients calculated by the image encoding apparatus may be signaled to the image decoding apparatus through the bitstream and used in the CNN-based filtering process of the image decoding apparatus. In addition, when the filter coefficient is preset to a specific value and stored in each of the image encoding apparatus and the image decoding apparatus, the filter coefficient stored in each apparatus may be used in the CNN-based filtering process without additional signaling. In addition, when a plurality of specific values of the filter coefficient are preset to form one set and the set is stored in each of the image encoding apparatus and the image decoding apparatus, the specific filter coefficient in the set that is identified by the selection information of the filter coefficient selected by the image encoding apparatus (for example, index information of the filter coefficient in the set) may be used in the CNN-based filtering process of the image decoding apparatus. In this case, the plurality of specific values may be preset according to the range of the quantization parameter value to configure one set, and a specific example thereof is shown in Table 4. However, since the contents of Table 4 are exemplary, it should be noted that the present embodiment is not limited thereto.

Group (index)	QP	Filter coefficients
Group 1 (G01)	0 to 10	W00, W01, W02, W03
Group 2 (G02)	11 to 20	W04, W05, W06, W07
Group 3 (G03)	21 to 30	W08, W09, W10, W11
Group 4 (G04)	31 to 40	W12, W13, W14, W15
Group 5 (G05)	41 or more	W16, W17, W18, W19

Referring to Table 4, when the quantization parameter of the reference region selected for intra prediction of the current block has a value of '0 to 10', the filter coefficients 'W00 to W03' set to group 1 (G01) may be used in the CNN-based filtering process for the reference region. Similarly, when the quantization parameter of the reference region selected for intra prediction of the current block has a value of '41' or more, the filter coefficients 'W16 to W19' set to group 5 (G05) may be used in the CNN-based filtering process for the reference region. In each case, the group index information 'G01 to G05' corresponding to the filter coefficient selected by the image encoding apparatus may be signaled to the image decoding apparatus through a bitstream and used in the CNN-based filtering process of the image decoding apparatus.
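The lookup described with reference to Table 4 can be sketched as a small table-driven function. The coefficient names are kept as placeholder strings exactly as they appear in the table; the data layout and the function names `select_filter_group` and `coefficients_for_index` are hypothetical.

```python
# Table 4 as (lowest QP, highest QP or None for open-ended, group index, coefficients).
FILTER_GROUPS = [
    (0, 10, "G01", ("W00", "W01", "W02", "W03")),
    (11, 20, "G02", ("W04", "W05", "W06", "W07")),
    (21, 30, "G03", ("W08", "W09", "W10", "W11")),
    (31, 40, "G04", ("W12", "W13", "W14", "W15")),
    (41, None, "G05", ("W16", "W17", "W18", "W19")),
]

def select_filter_group(qp):
    """Encoder side: map the QP of the reference region to a group index."""
    for low, high, index, coeffs in FILTER_GROUPS:
        if qp >= low and (high is None or qp <= high):
            return index, coeffs
    raise ValueError(f"QP {qp} is outside Table 4")

def coefficients_for_index(index):
    """Decoder side: recover the coefficients from the signaled group index."""
    for _, _, group_index, coeffs in FILTER_GROUPS:
        if group_index == index:
            return coeffs
    raise KeyError(index)
```

Because both apparatuses store the same set, only the short group index needs to be carried in the bitstream rather than the coefficients themselves.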

In operation S3334, the intra prediction unit may execute the CNN to which the filter coefficient set in operation S3332 is applied to perform the inference process on the input data to generate output data. In this case, the generated output data may be a filtered reference region in which pixel values of the reference region set as input data are restored to a level before quantization.

In operation S3340, the intra predictor may perform intra prediction on the current block by using the reference region.

In this case, the reference region used for intra prediction may be an unfiltered reference region selected in step S3310. In addition, the reference region used for intra prediction may be a reference region filtered to a level before quantization in step S3330.

When intra prediction is performed on the current block by using the filtered reference region, the intra predictor may perform intra prediction on the current block using the pixel values of the filtered reference region as they are. Alternatively, in this case, the intra predictor may perform intra prediction on the current block using pixel values (that is, weighted average values) calculated by applying preset weights to the pixel values of the filtered reference region and the pixel values of the reference region before filtering. Here, the weights may be experimentally determined values chosen to improve the accuracy of the intra prediction result and the image encoding and decoding efficiency.

FIG. 34 is a flowchart illustrating a filtering process of a reference region according to another aspect of the present embodiment. The filtering process of FIG. 34 differs from the filtering process of FIG. 33 in the specific method of filtering the reference region, and a description of overlapping portions will be omitted or given only briefly below.
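The weighted average of the filtered and unfiltered reference samples can be sketched as follows. The weight of 0.75 is a hypothetical value chosen only for illustration, since the text states that the weights are determined experimentally; the function name and the sample values are likewise assumptions.

```python
import numpy as np

def blend_reference(filtered, unfiltered, weight=0.75):
    """Weighted average: weight * filtered + (1 - weight) * unfiltered, rounded."""
    mixed = (weight * np.asarray(filtered, dtype=np.float64)
             + (1.0 - weight) * np.asarray(unfiltered, dtype=np.float64))
    return np.rint(mixed).astype(np.int64)

blended = blend_reference([100, 104, 108], [96, 104, 116], weight=0.75)
```

A weight of 1.0 reproduces the case in which the filtered pixel values are used as they are.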

Referring to FIG. 34, in operation S3410, the intra predictor may determine an intra prediction mode to be used to encode or decode a current block. The intra prediction unit may select a reference region to be used for intra prediction of the current block according to the determined intra prediction mode. The reference region may include a restored peripheral region adjacent to the current block. In addition, the reference region may further include additional information related to intra prediction of the current block.

In operation S3420, the intra predictor may determine whether to perform the filtering on the reference region selected in operation S3410 by determining whether the preset filtering condition is satisfied.

The reference region selected for intra prediction of the current block may have a quantization error with original pixel values while undergoing quantization / dequantization. Accordingly, in order to minimize such quantization error and not to greatly increase the complexity of the image encoding / decoding process, the intra prediction unit may adaptively filter the reference region only under specific conditions.

If it is determined in operation S3420 that the predetermined filtering condition is satisfied (Yes), the intra prediction unit may determine to perform filtering on the reference region selected for intra prediction of the current block, and proceed to operation S3430.

If it is determined in operation S3420 that the preset filtering condition is not satisfied (No), the intra prediction unit may determine not to perform filtering on the reference region selected for intra prediction of the current block, and proceed to operation S3440.

In operation S3430, the intra predictor may perform filtering on the reference region selected in operation S3410 to generate a reference region in which the pixel values are restored to a level before quantization (hereinafter referred to as the 'filtered reference region'). In this case, both a low-pass filter (e.g., a 2-tap filter, a 3-tap filter, etc.) and a CNN-based filter may be used to filter the corresponding reference region.

In this case, the filtering of the reference region may be performed by first filtering the reference region using the low-pass filter (or the CNN-based filter) and then secondarily filtering it using the CNN-based filter (or the low-pass filter). Hereinafter, a process of filtering the reference region using the low-pass filter and the CNN-based filter will be described in detail.

In operation S3432, the intra predictor may first filter the reference region using the low pass filter. Since a method for filtering the reference region using the low pass filter is obvious to those skilled in the art, a detailed description thereof will be omitted.

In operation S3434, the intra predictor may secondarily filter the reference region filtered in operation S3432 using the CNN-based filter.

Specifically, the input data of the CNN is set to the reference region to be used for intra prediction of the current block, that is, the reference region filtered in step S3432.

The filter coefficient of the CNN is set to the filter coefficient calculated through the supervised learning process of the image encoding apparatus. The filter coefficients calculated by the image encoding apparatus may be signaled to the image decoding apparatus through the bitstream and used in the CNN-based filtering process of the image decoding apparatus. In addition, when the filter coefficient is preset to a specific value and stored in each of the image encoding apparatus and the image decoding apparatus, the filter coefficient stored in each apparatus may be used in the CNN-based filtering process without additional signaling. In addition, when a plurality of specific values of the filter coefficient are preset to form one set and the set is stored in each of the image encoding apparatus and the image decoding apparatus, the specific filter coefficient in the set that is identified by the selection information of the filter coefficient selected by the image encoding apparatus (for example, index information of the filter coefficient in the set) may be used in the CNN-based filtering process of the image decoding apparatus. In this case, the plurality of specific values may be preset according to the range of the quantization parameter value to configure one set, and a specific example thereof is as described above with reference to Table 4.

The intra predictor may generate output data by executing an inference process on input data by executing a CNN to which a specific filter coefficient is applied. In this case, the generated output data may be a filtered reference region in which pixel values of the reference region set as input data are restored to a level before quantization.
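The two-stage filtering of operations S3432 and S3434 can be sketched as below. The [1, 2, 1] / 4 kernel is one common choice of 3-tap low-pass filter for intra reference smoothing and is an assumption here, since the text does not specify the taps. The CNN is represented by an arbitrary callable, and the identity function used in the example is only a placeholder for a trained network.

```python
import numpy as np

def low_pass_3tap(samples):
    """[1, 2, 1] / 4 smoothing of a 1-D reference line; end samples are kept."""
    s = np.asarray(samples, dtype=np.int64)
    out = s.copy()
    out[1:-1] = (s[:-2] + 2 * s[1:-1] + s[2:] + 2) >> 2  # +2 rounds to nearest
    return out

def filter_reference(samples, cnn_filter, low_pass_first=True):
    """Apply both filters in either order, as the text allows."""
    if low_pass_first:
        return cnn_filter(low_pass_3tap(samples))
    return low_pass_3tap(cnn_filter(samples))

def identity_cnn(x):
    """Placeholder for the trained CNN-based filter (hypothetical)."""
    return np.asarray(x)

smoothed = filter_reference([10, 10, 50, 10, 10], identity_cnn)
```

Swapping `low_pass_first` to `False` gives the alternative order in which the CNN-based filter runs first and the low-pass filter second.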

In operation S3440, the intra predictor may perform intra prediction on the current block by using the reference region.

In this case, the reference region used for intra prediction may be an unfiltered reference region selected in step S3410. Alternatively, the reference region used for intra prediction may be a reference region filtered to a level before quantization in step S3430.

In the various flowcharts and related descriptions illustrating the techniques of this disclosure, a plurality of steps are described as being performed sequentially, but this merely illustrates the technical idea of the embodiments of the present invention. In other words, a person of ordinary skill in the art to which the present invention pertains may, without departing from the essential characteristics of the embodiments of the present invention, change the order of the steps or perform some of the steps in parallel. Since such modifications and changes are possible, the flowcharts illustrated in the drawings are not limited to a time-series order.

Each of the functions, operations, and methods of the aforementioned devices can be embodied as computer readable code on a computer readable recording medium. The computer-readable recording medium includes all kinds of recording devices in which data that can be read by a computer system is stored. That is, computer-readable recording media include storage media such as magnetic storage media (eg, ROM, floppy disk, hard disk, etc.), and optical reading media (eg, CD-ROM, DVD, etc.). The computer readable recording medium can also be distributed over network coupled computer systems so that the computer readable code is stored and executed in a distributed fashion.

The above description is merely illustrative of the technical idea of the present invention, and those skilled in the art to which the present invention pertains may make various modifications and changes without departing from the essential characteristics of the present embodiment. Accordingly, the embodiments are not intended to limit the technical spirit of the present invention but to describe the present invention, and the scope of the technical spirit of the present invention is not limited by these embodiments. The protection scope of the present invention should be interpreted by the following claims, and all technical ideas within the equivalent scope should be interpreted as being included in the scope of the present invention.

CROSS-REFERENCE TO RELATED APPLICATION

This patent application claims priority to (1) Patent Application No. 10-2018-0040588, filed in Korea on April 6, 2018, (2) Patent Application No. 10-2018-0072499, filed in Korea on June 25, 2018, (3) Patent Application No. 10-2018-0072506, filed in Korea on June 25, 2018, (4) Patent Application No. 10-2018-0081123, filed in Korea on July 12, 2018, and (5) Patent Application No. 10-2018-0099166, filed in Korea on August 24, 2018, the entire contents of which are incorporated herein by reference.

Claims

An image decoding method using a convolutional neural network (CNN)-based filter, the method comprising:

inputting a first picture and at least one of a quantization parameter map and a block division map to the CNN-based filter; and

outputting a second picture,

wherein the quantization parameter map indicates information about coding units constituting the first picture, and the block division map indicates information about divided regions constituting the first picture.

The method of claim 1, wherein a coefficient of the CNN-based filter is trained using a third picture, an original picture, and at least one of a quantization parameter map and a block division map.

The method of claim 1, wherein the quantization parameter map is set to the same resolution as the first picture.

The method of claim 1, further comprising inputting a block mode map indicating an encoding mode to the CNN-based filter.

The method of claim 1, wherein the block division map represents the divided boundary of a block and the internal region of the block with different values.

The method of claim 1, wherein the number of pixels representing the divided boundary of a block in the block division map is determined by at least one of the size of a coding block, the value of a quantization parameter, an encoding mode, the number of pixels to be updated, and the number of pixels to be referred to for filtering.

The method of claim 1, wherein, in the block division map, the value of a pixel representing the divided boundary of a block is determined by at least one of the size of a coding block, the value of a quantization parameter, an encoding mode, the number of pixels to be updated, and the number of pixels to be referred to for filtering.

The method of claim 1, wherein the coefficient of the CNN-based filter is received from an image encoding apparatus.
An image decoding apparatus using a convolutional neural network (CNN)-based filter, the apparatus comprising:

an input unit configured to receive a first picture and at least one of a quantization parameter map and a block division map;

a filter unit configured to apply coefficients of the CNN-based filter to the first picture and the at least one of the quantization parameter map and the block division map input to the input unit; and

an output unit configured to output a second picture obtained by applying the coefficients of the CNN-based filter to the first picture and the at least one of the quantization parameter map and the block division map,

wherein the quantization parameter map indicates information about coding units constituting the first picture, and the block division map indicates information about divided regions constituting the first picture.

The apparatus of claim 9, wherein the coefficient of the CNN-based filter is trained using at least one of a quantization parameter map, a block division map, a third picture, and an original picture.

The apparatus of claim 9, wherein the quantization parameter map is set to the same resolution as the first picture.

The apparatus of claim 9, wherein the input unit further receives a block mode map indicating an encoding mode, and a coefficient of the CNN-based filter is additionally trained by further inputting the block mode map.

The apparatus of claim 9, wherein the block division map represents the divided boundary of a block and the internal region of the block with different values.

The apparatus of claim 9, wherein the number of pixels representing the divided boundary of a block in the block division map is determined by at least one of the size of a coding block, the value of a quantization parameter, an encoding mode, the number of pixels to be updated, and the number of pixels to be referred to for filtering.

The apparatus of claim 9, wherein, in the block division map, the value of a pixel representing the divided boundary of a block is determined by at least one of the size of a coding block, the value of a quantization parameter, an encoding mode, the number of pixels to be updated, and the number of pixels to be referred to for filtering.

The apparatus of claim 9, wherein a coefficient of the CNN-based filter is received from an image encoding apparatus.