CN116746146A - Multiple neural network models for filtering during video coding - Google Patents

Info

Publication number
CN116746146A
Authority
CN
China
Prior art keywords
data
unit
neural network
video
filtering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202280008942.7A
Other languages
Chinese (zh)
Inventor
王洪涛
V·M·S·A·科特拉
J·陈
M·卡尔切维茨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qualcomm Inc
Original Assignee
Qualcomm Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US17/566,282 (US20220215593A1)
Application filed by Qualcomm Inc
Priority claimed from PCT/US2022/011021 (WO2022147494A1)
Publication of CN116746146A

Landscapes

  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

An example device for filtering decoded video data includes one or more processors configured to operate a neural network filtering unit to: receive data from one or more other units of the device, the data from the one or more other units being different from data of a decoded picture of the video data, wherein, to receive the data from the one or more other units, the one or more processors are configured to operate the neural network filtering unit to receive boundary strength data from a deblocking unit of the device; determine one or more neural network models to be used to filter a portion of the decoded picture; and filter the portion of the decoded picture using the one or more neural network models and the data from the one or more other units of the device, including the boundary strength data.

Description

Multiple neural network models for filtering during video coding
Cross Reference to Related Applications
The present application claims priority to U.S. patent application Ser. No. 17/566,282, filed on 12/30/2021, and U.S. provisional application Ser. No. 63/133,733, filed on 4/1/2021, each of which is incorporated herein by reference in its entirety. U.S. patent application Ser. No. 17/566,282 claims the benefit of U.S. provisional application Ser. No. 63/133,733.
Technical Field
The present disclosure relates to video coding, including video encoding and video decoding.
Background
Digital video capabilities can be incorporated into a wide range of devices, including digital televisions, digital direct broadcast systems, wireless broadcast systems, personal digital assistants (PDAs), laptop or desktop computers, tablet computers, e-book readers, digital cameras, digital recording devices, digital media players, video gaming devices, video game consoles, cellular or satellite radio telephones, so-called "smartphones", video teleconferencing devices, video streaming devices, and the like. Digital video devices implement video coding techniques, such as those described in the standards defined by MPEG-2, MPEG-4, ITU-T H.263, ITU-T H.264/MPEG-4 Part 10, Advanced Video Coding (AVC), ITU-T H.265/High Efficiency Video Coding (HEVC), and extensions of these standards. By implementing such video coding techniques, video devices may more efficiently transmit, receive, encode, decode, and/or store digital video information.
Video coding techniques include spatial (intra-picture) prediction and/or temporal (inter-picture) prediction to reduce or remove redundancy inherent in video sequences. For block-based video coding, a video slice (e.g., a video picture or a portion of a video picture) may be partitioned into video blocks, which may also be referred to as coding tree units (CTUs), coding units (CUs), and/or coding nodes. Video blocks in an intra-coded (I) slice of a picture are encoded using spatial prediction with respect to reference samples in neighboring blocks in the same picture. Video blocks in an inter-coded (P or B) slice of a picture may use spatial prediction with respect to reference samples in neighboring blocks in the same picture or temporal prediction with respect to reference samples in other reference pictures. A picture may be referred to as a frame, and a reference picture may be referred to as a reference frame.
Disclosure of Invention
In general, this disclosure describes techniques for filtering decoded pictures that may be distorted. The filtering process may be based on neural network techniques. The filtering process may be used in the context of advanced video codecs, such as extensions of ITU-T H.266/Versatile Video Coding (VVC) or successors to that video coding standard, as well as any other video codec. In one example, a neural network filtering unit may receive boundary strength values calculated by a deblocking filter and use the boundary strength values to further filter the deblocked video data, for example, using one or more neural network models.
In one example, a method of filtering decoded video data includes: receiving, by a neural network filtering unit of a video decoding device, data of a decoded picture of video data; receiving, by the neural network filtering unit, data from one or more other units of the video decoding device, the data from the one or more other units being different from the data of the decoded picture, and wherein receiving the data from the one or more other units of the video decoding device comprises receiving boundary strength data from a deblocking unit of the video decoding device; determining, by a neural network filtering unit, one or more neural network models for filtering a portion of the decoded picture; and filtering, by the neural network filtering unit, a portion of the decoded picture using the one or more neural network models and data from one or more other units of the video decoding device, including boundary strength data.
In another example, a device for filtering decoded video data includes: a memory configured to store a decoded picture of video data; and one or more processors implemented in circuitry and configured to operate a neural network filtering unit to: receive data from one or more other units of the device, the data from the one or more other units being different from data of the decoded picture, wherein, to receive the data from the one or more other units, the one or more processors are configured to operate the neural network filtering unit to receive boundary strength data from a deblocking unit of the device; determine one or more neural network models to be used to filter a portion of the decoded picture; and filter the portion of the decoded picture using the one or more neural network models and the data from the one or more other units of the device, including the boundary strength data.
In another example, a computer-readable storage medium has stored thereon instructions that, when executed, cause a processor of a video decoding device to operate a neural network filtering unit to: receive data of a decoded picture of video data; receive data from one or more other units of the video decoding device, the data from the one or more other units being different from the data of the decoded picture, wherein the instructions that cause the processor to receive the data from the one or more other units comprise instructions that cause the processor to receive boundary strength data from a deblocking unit of the video decoding device; determine one or more neural network models to be used to filter a portion of the decoded picture; and filter the portion of the decoded picture using the one or more neural network models and the data from the one or more other units of the video decoding device, including the boundary strength data.
In another example, an apparatus for filtering decoded video data includes a filtering unit comprising: means for receiving data of a decoded picture of video data; means for receiving data from one or more other units of the video decoding device, the data from the one or more other units being different from the data of the decoded picture, and wherein the means for receiving data from the one or more other units of the video decoding device comprises means for receiving boundary strength data from a deblocking unit of the video decoding device; means for determining one or more neural network models for filtering a portion of the decoded picture; and means for filtering a portion of the decoded picture using the one or more neural network models and data from one or more other units of the video decoding device, including boundary strength data.
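The data flow common to the examples above (receive decoded samples, receive boundary strength data from the deblocking unit, determine a model, filter the portion) can be sketched as follows. This is only an illustrative sketch, not the disclosed implementation: the two "models" are plain Python stand-ins rather than trained neural networks, and the names `identity_model`, `edge_smoothing_model`, `determine_model`, and `nn_filter`, as well as the selection rule, are hypothetical.

```python
def identity_model(samples, bs_plane):
    """Stand-in 'model' that returns the samples unchanged."""
    return [row[:] for row in samples]


def edge_smoothing_model(samples, bs_plane):
    """Stand-in 'model': smooths only samples whose boundary strength is > 0."""
    out = [row[:] for row in samples]
    for y, row in enumerate(samples):
        for x, s in enumerate(row):
            if bs_plane[y][x] > 0:
                left = row[max(x - 1, 0)]
                right = row[min(x + 1, len(row) - 1)]
                # Rounded (1, 2, 1) / 4 weighted average with horizontal neighbors.
                out[y][x] = (left + 2 * s + right + 2) // 4
    return out


# Hypothetical model table; a real system would hold trained networks here.
MODELS = {"identity": identity_model, "smooth": edge_smoothing_model}


def determine_model(bs_plane):
    """Hypothetical rule: pick the smoothing model if any edge was flagged."""
    flagged = any(v > 0 for row in bs_plane for v in row)
    return "smooth" if flagged else "identity"


def nn_filter(decoded_portion, bs_plane, models=MODELS):
    """Filter a picture portion using the chosen model plus the BS side data."""
    name = determine_model(bs_plane)
    return models[name](decoded_portion, bs_plane)


portion = [[10, 10, 20, 20]]          # decoded samples across a block edge
bs = [[0, 1, 1, 0]]                   # boundary strength from the deblocking unit
print(nn_filter(portion, bs))         # [[10, 13, 18, 20]]
```

The point of the sketch is the interface: the filter takes the boundary strength plane as a second input alongside the decoded samples, so side information from the deblocking unit can influence both model choice and the filtering itself.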
The details of one or more examples of the disclosure are set forth in the accompanying drawings and the description below. Other features, objects, and advantages of the disclosure will be apparent from the description and drawings, and from the claims.
Drawings
Fig. 1 is a block diagram illustrating an example video encoding and decoding system that may perform the techniques of this disclosure.
Fig. 2 is a conceptual diagram illustrating a hybrid video coding framework.
Fig. 3 is a conceptual diagram illustrating a hierarchical prediction structure using a group of pictures (GOP) size of 16.
Fig. 4 is a conceptual diagram illustrating a neural network-based filter having four layers.
Fig. 5 is a conceptual diagram illustrating an example portion of a picture including boundaries, boundary samples, and internal samples.
Fig. 6 is a block diagram illustrating an example video encoder that may perform the techniques of this disclosure.
Fig. 7 is a block diagram illustrating an example video decoder that may perform the techniques of this disclosure.
Fig. 8 is a flowchart illustrating an example method for encoding a current block in accordance with the techniques of this disclosure.
Fig. 9 is a flowchart illustrating an example method for decoding a current block in accordance with the techniques of this disclosure.
Fig. 10 is a flowchart illustrating an example method of filtering decoded video data in accordance with the techniques of this disclosure.
Detailed Description
Video coding standards include ITU-T H.261, ISO/IEC MPEG-1 Visual, ITU-T H.262 or ISO/IEC MPEG-2 Visual, ITU-T H.263, ISO/IEC MPEG-4 Visual, ITU-T H.264 (also known as ISO/IEC MPEG-4 AVC), and High Efficiency Video Coding (HEVC) or ITU-T H.265, including its range extension, multiview extension (MV-HEVC), and scalable extension (SHVC). Another example video coding standard is Versatile Video Coding (VVC) or ITU-T H.266, developed by the Joint Video Experts Team (JVET) of the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG). Version 1 of the VVC specification (hereinafter "VVC FDIS") is available from http://phenix.int-evry.fr/jvet/doc_end_user/documents/19_Teleconference/wg11/JVET-S2001-v17.zip.
Fig. 1 is a block diagram illustrating an example video encoding and decoding system 100 that may perform the techniques of this disclosure. The techniques of this disclosure are generally directed to coding (encoding and/or decoding) video data. In general, video data includes any data for processing a video. Thus, video data may include raw, unencoded video, encoded video, decoded (e.g., reconstructed) video, and video metadata, such as signaling data.
As shown in fig. 1, in this example, the system 100 includes a source device 102 that provides encoded video data to be decoded and displayed by a destination device 116. Specifically, the source device 102 provides video data to the destination device 116 via the computer readable medium 110. The source device 102 and the destination device 116 may comprise any of a variety of devices including desktop computers, notebook (i.e., laptop) computers, tablet computers, set-top boxes, hand-held telephones such as smartphones, televisions, cameras, display devices, digital media players, video game consoles, video streaming devices, and the like. In some cases, the source device 102 and the destination device 116 may be equipped for wireless communication, and thus may be referred to as wireless communication devices.
In the example of fig. 1, source device 102 includes video source 104, memory 106, video encoder 200, and output interface 108. Destination device 116 includes input interface 122, video decoder 300, memory 120, and display device 118. In accordance with the present disclosure, the video encoder 200 of the source device 102 and the video decoder 300 of the destination device 116 may be configured to apply techniques that use multiple neural network models for filtering. Thus, source device 102 represents an example of a video encoding device, and destination device 116 represents an example of a video decoding device. In other examples, the source device and the destination device may include other components or arrangements. For example, source device 102 may receive video data from an external video source, such as an external camera. Similarly, the destination device 116 may interface with an external display device instead of including an integrated display device.
The system 100 as shown in fig. 1 is merely one example. In general, any digital video encoding and/or decoding device may perform techniques that use multiple neural network models for filtering. The source device 102 and the destination device 116 are merely examples of such coding devices, in which the source device 102 generates coded video data for transmission to the destination device 116. The present disclosure refers to a "coding" device as a device that performs coding (encoding and/or decoding) of data. Thus, the video encoder 200 and the video decoder 300 represent examples of coding devices, specifically, a video encoder and a video decoder, respectively. In some examples, source device 102 and destination device 116 may operate in a substantially symmetrical manner such that each of source device 102 and destination device 116 includes video encoding and decoding components. Thus, the system 100 may support unidirectional or bidirectional video transmission (e.g., for video streaming, video playback, video broadcasting, or video telephony) between the source device 102 and the destination device 116.
In general, video source 104 represents a source of video data (i.e., raw, unencoded video data) and provides a sequential series of pictures (also referred to as "frames") of the video data to video encoder 200, which encodes data for the pictures. The video source 104 of the source device 102 may include a video capture device, such as a video camera, a video archive containing previously captured raw video, and/or a video feed interface that receives video from a video content provider. As a further alternative, video source 104 may generate computer graphics-based data as the source video, or a combination of live video, archived video, and computer-generated video. In each case, the video encoder 200 encodes the captured, pre-captured, or computer-generated video data. The video encoder 200 may rearrange the pictures from the received order (sometimes referred to as "display order") into a coding order for coding. The video encoder 200 may generate a bitstream including the encoded video data. The source device 102 may then output the encoded video data via the output interface 108 onto the computer readable medium 110 for reception and/or retrieval by, e.g., the input interface 122 of the destination device 116.
The memory 106 of the source device 102 and the memory 120 of the destination device 116 represent general purpose memory. In some examples, the memories 106, 120 may store raw video data, such as raw video from the video source 104 and raw decoded video data from the video decoder 300. Additionally or alternatively, the memories 106, 120 may store software instructions executable by, for example, the video encoder 200 and the video decoder 300, respectively. Although in this example, memory 106 and memory 120 are shown separately from video encoder 200 and video decoder 300, it should be understood that video encoder 200 and video decoder 300 may also include internal memory for functionally similar or equivalent purposes. Further, the memories 106, 120 may store encoded video data, for example, data output from the video encoder 200 and input to the video decoder 300. In some examples, a portion of the memory 106, 120 may be allocated as one or more video buffers, e.g., to store raw, decoded, and/or encoded video data.
Computer-readable medium 110 may represent any type of medium or device capable of transmitting encoded video data from source device 102 to destination device 116. In one example, the computer-readable medium 110 represents a communication medium to enable the source device 102 to transmit encoded video data directly to the destination device 116 in real-time, e.g., via a radio frequency network or a computer-based network. According to a communication standard, such as a wireless communication protocol, output interface 108 may modulate a transmission signal including encoded video data, and input interface 122 may demodulate a received transmission signal. The communication medium may include any wireless or wired communication medium such as a Radio Frequency (RF) spectrum or one or more physical transmission lines. The communication medium may form part of a packet-based network, such as a local area network, a wide area network, or a global network such as the internet. The communication medium may include a router, switch, base station, or any other device that facilitates communication from source device 102 to destination device 116.
In some examples, source device 102 may output encoded data from output interface 108 to storage device 112. Similarly, destination device 116 may access encoded data from storage device 112 via input interface 122. Storage device 112 may include any of a variety of distributed or locally accessed data storage media such as a hard disk, Blu-ray disc, DVD, CD-ROM, flash memory, volatile or non-volatile memory, or any other suitable digital storage media for storing encoded video data.
In some examples, source device 102 may output encoded video data to file server 114 or another intermediate storage device that may store the encoded video generated by source device 102. The destination device 116 may access the stored video data from the file server 114 via streaming or download.
File server 114 may be any type of server device capable of storing encoded video data and transmitting the encoded video data to destination device 116. File server 114 may represent a web server (e.g., for a website), a server configured to provide a file transfer protocol service (such as File Transfer Protocol (FTP) or File Delivery over Unidirectional Transport (FLUTE) protocol), a content delivery network (CDN) device, a hypertext transfer protocol (HTTP) server, a Multimedia Broadcast Multicast Service (MBMS) or Enhanced MBMS (eMBMS) server, and/or a network attached storage (NAS) device. The file server 114 may additionally or alternatively implement one or more HTTP streaming protocols, such as Dynamic Adaptive Streaming over HTTP (DASH), HTTP Live Streaming (HLS), Real Time Streaming Protocol (RTSP), HTTP Dynamic Streaming, and the like.
The destination device 116 may access the encoded video data from the file server 114 via any standard data connection, including an Internet connection. This may include a wireless channel (e.g., a Wi-Fi connection), a wired connection (e.g., digital subscriber line (DSL), cable modem, etc.), or a combination of both, that is suitable for accessing encoded video data stored on file server 114. The input interface 122 may be configured to operate according to any one or more of the various protocols discussed above for retrieving or receiving media data from the file server 114, or other such protocols for retrieving media data.
Output interface 108 and input interface 122 may represent wireless transmitters/receivers, modems, wired networking components (e.g., Ethernet cards), wireless communication components that operate according to any of a variety of IEEE 802.11 standards, or other physical components. In examples where output interface 108 and input interface 122 comprise wireless components, output interface 108 and input interface 122 may be configured to transfer data, such as encoded video data, according to a cellular communication standard, such as 4G, 4G-LTE (Long-Term Evolution), LTE Advanced, 5G, or the like. In some examples where output interface 108 comprises a wireless transmitter, output interface 108 and input interface 122 may be configured to transfer data, such as encoded video data, according to other wireless standards, such as an IEEE 802.11 specification, an IEEE 802.15 specification (e.g., ZigBee), a Bluetooth standard, or the like. In some examples, source device 102 and/or destination device 116 may include respective system-on-chip (SoC) devices. For example, source device 102 may comprise a SoC device that performs the functions attributed to video encoder 200 and/or output interface 108, and destination device 116 may comprise a SoC device that performs the functions attributed to video decoder 300 and/or input interface 122.
The techniques of this disclosure may be applied to video coding in support of any of a variety of multimedia applications, such as over-the-air television broadcasts, cable television transmissions, satellite television transmissions, Internet streaming video transmissions (such as dynamic adaptive streaming over HTTP (DASH)), digital video that is encoded onto a data storage medium, decoding of digital video stored on a data storage medium, or other applications.
The input interface 122 of the destination device 116 receives an encoded video bitstream from the computer readable medium 110 (e.g., a communication medium, storage device 112, file server 114, or the like). The encoded video bitstream may include signaling information defined by the video encoder 200, which is also used by the video decoder 300, such as syntax elements having values that describe characteristics and/or processing of video blocks or other coded units (e.g., slices, pictures, groups of pictures, sequences, or the like). The display device 118 displays decoded pictures of the decoded video data to a user. The display device 118 may represent any of a variety of display devices, such as a liquid crystal display (LCD), a plasma display, an organic light emitting diode (OLED) display, or another type of display device.
Although not shown in fig. 1, in some examples, the video encoder 200 and the video decoder 300 may each be integrated with an audio encoder and/or an audio decoder, and may include appropriate MUX-DEMUX units, or other hardware and/or software, to handle multiplexed streams including both audio and video in a common data stream. If applicable, the MUX-DEMUX units may conform to the ITU H.223 multiplexer protocol or other protocols such as the User Datagram Protocol (UDP).
Video encoder 200 and video decoder 300 may each be implemented as any of a variety of suitable encoder and/or decoder circuitry, such as one or more microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), discrete logic, software, hardware, firmware, or any combinations thereof. When the techniques are implemented partially in software, a device may store instructions for the software in a suitable non-transitory computer-readable medium and execute the instructions in hardware using one or more processors to perform the techniques of this disclosure. Each of the video encoder 200 and the video decoder 300 may be included in one or more encoders or decoders, either of which may be integrated as part of a combined encoder/decoder (CODEC) in a respective device. A device including video encoder 200 and/or video decoder 300 may comprise an integrated circuit, a microprocessor, and/or a wireless communication device, such as a cellular telephone.
Video encoder 200 and video decoder 300 may operate according to a video coding standard, such as ITU-T H.265, also referred to as High Efficiency Video Coding (HEVC), or extensions thereto, such as the multiview and/or scalable video coding extensions. Alternatively, the video encoder 200 and video decoder 300 may operate according to other proprietary or industry standards, such as Versatile Video Coding (VVC). A draft of the VVC standard is described in Bross et al., "Versatile Video Coding (Draft 9)," Joint Video Experts Team (JVET) of ITU-T SG 16 WP 3 and ISO/IEC JTC 1/SC 29/WG 11, 18th Meeting, 15-24 April, JVET-R2001-v8 (hereinafter "VVC Draft 9"). However, the techniques of this disclosure are not limited to any particular coding standard.
In general, the video encoder 200 and the video decoder 300 may perform block-based coding of pictures. The term "block" generally refers to a structure that includes data to be processed (e.g., data encoded, decoded, or otherwise used in an encoding process and/or a decoding process). For example, a block may comprise a two-dimensional matrix of samples of luminance and/or chrominance data. In general, video encoder 200 and video decoder 300 may code video data represented in a YUV (e.g., Y, Cb, Cr) format. That is, rather than coding red, green, and blue (RGB) data for samples of a picture, the video encoder 200 and the video decoder 300 may code luminance and chrominance components, where the chrominance components may include both red-hue and blue-hue chrominance components. In some examples, the video encoder 200 converts received RGB-formatted data to a YUV representation prior to encoding, and the video decoder 300 converts the YUV representation to the RGB format. Alternatively, pre-processing and post-processing units (not shown) may perform these conversions.
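As an illustration of the RGB-to-YUV conversion mentioned above, the following sketch converts a single 8-bit sample in both directions. The text does not specify a conversion matrix; the full-range BT.601 (JFIF-style) coefficients used here are an assumption, and the function names `rgb_to_ycbcr` and `ycbcr_to_rgb` are hypothetical.

```python
def _clamp8(value):
    """Round to the nearest integer and clamp to the 8-bit range [0, 255]."""
    return min(255, max(0, round(value)))


def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB sample to Y, Cb, Cr (assumed full-range BT.601)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return _clamp8(y), _clamp8(cb), _clamp8(cr)


def ycbcr_to_rgb(y, cb, cr):
    """Inverse conversion under the same assumed coefficients."""
    r = y + 1.402 * (cr - 128)
    g = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128)
    b = y + 1.772 * (cb - 128)
    return _clamp8(r), _clamp8(g), _clamp8(b)


print(rgb_to_ycbcr(255, 255, 255))                # (255, 128, 128)
print(ycbcr_to_rgb(*rgb_to_ycbcr(200, 100, 50)))  # close to (200, 100, 50)
```

Neutral grays map to Cb = Cr = 128, which is why the chrominance planes carry only color-difference information and can be coded separately from luminance.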
This disclosure may generally refer to coding (e.g., encoding and decoding) of pictures to include the process of encoding or decoding data of the picture. Similarly, this disclosure may refer to coding of blocks of a picture to include the process of encoding or decoding data for the blocks, e.g., prediction and/or residual coding. An encoded video bitstream generally includes a series of values for syntax elements representative of coding decisions (e.g., coding modes) and partitioning of pictures into blocks. Thus, references to coding a picture or block should generally be understood as coding values for syntax elements forming the picture or block.
HEVC defines various blocks, including coding units (CUs), prediction units (PUs), and transform units (TUs). According to HEVC, a video coder (such as video encoder 200) partitions a coding tree unit (CTU) into CUs according to a quadtree structure. That is, the video coder partitions CTUs and CUs into four equal, non-overlapping squares, and each node of the quadtree has either zero or four child nodes. A node without child nodes may be referred to as a "leaf node," and a CU of such a leaf node may include one or more PUs and/or one or more TUs. The video coder may further partition PUs and TUs. For example, in HEVC, a residual quadtree (RQT) represents the partitioning of TUs. In HEVC, PUs represent inter-prediction data, while TUs represent residual data. CUs that are intra-predicted include intra-prediction information, such as an intra-mode indication.
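The quadtree rule described above (every node has either zero or four children, each child being one of four equal, non-overlapping squares) can be sketched as a short recursion. The function name `quadtree_leaves`, the `should_split` callback, and the `min_size` limit are hypothetical; how an encoder actually decides to split is not covered by this sketch.

```python
def quadtree_leaves(x, y, size, should_split, min_size=8):
    """Return the leaf blocks (x, y, size) of a quadtree rooted at a square block.

    should_split is a caller-supplied decision function (an assumption here);
    min_size is a hypothetical lower bound on the block size.
    """
    if size > min_size and should_split(x, y, size):
        half = size // 2
        leaves = []
        # Four equal, non-overlapping squares: top-left, top-right,
        # bottom-left, bottom-right.
        for dy in (0, half):
            for dx in (0, half):
                leaves.extend(
                    quadtree_leaves(x + dx, y + dy, half, should_split, min_size)
                )
        return leaves
    return [(x, y, size)]  # leaf node: no children


# Split a 64x64 CTU once, then split only its top-left 32x32 child again.
decide = lambda x, y, size: size == 64 or (size == 32 and (x, y) == (0, 0))
leaves = quadtree_leaves(0, 0, 64, decide)
print(len(leaves))  # 7 leaf blocks: four 16x16 and three 32x32
```

Because every split produces exactly four children that tile the parent, the leaf areas always sum to the area of the root block.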
As another example, the video encoder 200 and the video decoder 300 may be configured to operate according to VVC. According to VVC, a video coder (such as video encoder 200) partitions a picture into a plurality of coding tree units (CTUs). The video encoder 200 may partition a CTU according to a tree structure, such as a quadtree-binary tree (QTBT) structure or a multi-type tree (MTT) structure. The QTBT structure removes the concepts of multiple partition types, such as the separation between CUs, PUs, and TUs of HEVC. A QTBT structure includes two levels: a first level partitioned according to quadtree partitioning and a second level partitioned according to binary tree partitioning. A root node of the QTBT structure corresponds to a CTU. Leaf nodes of the binary trees correspond to coding units (CUs).
In an MTT partitioning structure, blocks may be partitioned using a quadtree (QT) partition, a binary tree (BT) partition, and one or more types of triple tree (TT) (also called ternary tree (TT)) partitions. A triple or ternary tree partition is a partition in which a block is split into three sub-blocks. In some examples, a triple or ternary tree partition divides a block into three sub-blocks without dividing the original block through the center. The partitioning types in MTT (e.g., QT, BT, and TT) may be symmetrical or asymmetrical.
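A ternary split that avoids cutting through the block center can be sketched as below. The 1:2:1 ratio is an assumption made for illustration (the text only says the block is divided into three sub-blocks without a split through the center), and `ternary_split` is a hypothetical helper name.

```python
def ternary_split(x, y, width, height, vertical):
    """Split a block into three sub-blocks (x, y, width, height).

    Assumes a 1:2:1 ratio, so the middle sub-block straddles the center
    of the original block and no split line passes through it.
    """
    if vertical:
        q = width // 4
        return [
            (x, y, q, height),
            (x + q, y, width - 2 * q, height),
            (x + width - q, y, q, height),
        ]
    q = height // 4
    return [
        (x, y, width, q),
        (x, y + q, width, height - 2 * q),
        (x, y + height - q, width, q),
    ]


print(ternary_split(0, 0, 32, 16, vertical=True))
# [(0, 0, 8, 16), (8, 0, 16, 16), (24, 0, 8, 16)]
```

In the printed example the split lines fall at x = 8 and x = 24, so the center column at x = 16 stays inside the middle sub-block, unlike a binary split, which would cut exactly there.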
In some examples, the video encoder 200 and the video decoder 300 may use a single QTBT or MTT structure to represent each of the luma component and the chroma components, while in other examples, the video encoder 200 and the video decoder 300 may use two or more QTBT or MTT structures, such as one QTBT/MTT structure for the luma component and another QTBT/MTT structure for the two chroma components (or two QTBT/MTT structures for the respective chroma components).
The video encoder 200 and video decoder 300 may be configured to use quadtree partitioning per HEVC, QTBT partitioning, MTT partitioning, or other partitioning structures. For purposes of explanation, the description of the techniques of this disclosure is presented with respect to QTBT partitioning. However, it should be understood that the techniques of this disclosure may also be applied to video coders configured to use quadtree partitioning or other types of partitioning.
In some examples, a CTU includes a coding tree block (CTB) of luma samples and two corresponding CTBs of chroma samples of a picture that has three sample arrays, or a CTB of samples of a monochrome picture or of a picture that is coded using three separate color planes, together with the syntax structures used to code the samples. A CTB may be an N×N block of samples for some value of N, such that the division of a component into CTBs is a partitioning. A component is an array or a single sample from one of the three arrays (luma and two chroma) that compose a picture in 4:2:0, 4:2:2, or 4:4:4 color format, or the array or a single sample of the array that composes a picture in monochrome format. In some examples, a coding block is an M×N block of samples for some values of M and N, such that the division of a CTB into coding blocks is a partitioning.
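To make the relationship between the luma CTB and the two chroma CTBs concrete, the sketch below derives chroma CTB dimensions from the luma CTB size for each color format named above. The subsampling factors are the conventional ones for these formats; the table name `SUBSAMPLING` and function name `ctb_sizes` are hypothetical.

```python
# (horizontal, vertical) chroma subsampling divisors per color format.
SUBSAMPLING = {"4:2:0": (2, 2), "4:2:2": (2, 1), "4:4:4": (1, 1)}


def ctb_sizes(luma_n, chroma_format):
    """Return (width, height) of each CTB in a CTU with an N x N luma CTB."""
    if chroma_format == "monochrome":
        # A monochrome picture has a single sample array, hence one CTB.
        return {"luma": (luma_n, luma_n)}
    sub_w, sub_h = SUBSAMPLING[chroma_format]
    chroma = (luma_n // sub_w, luma_n // sub_h)
    return {"luma": (luma_n, luma_n), "cb": chroma, "cr": chroma}


print(ctb_sizes(128, "4:2:0"))
# {'luma': (128, 128), 'cb': (64, 64), 'cr': (64, 64)}
print(ctb_sizes(128, "4:2:2"))
# {'luma': (128, 128), 'cb': (64, 128), 'cr': (64, 128)}
```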
Blocks (e.g., CTUs or CUs) may be grouped in various ways in a picture. As one example, a brick may refer to a rectangular region of CTU rows within a particular tile in a picture. A tile may be a rectangular region of CTUs within a particular tile column and a particular tile row in a picture. A tile column refers to a rectangular region of CTUs having a height equal to the height of the picture and a width specified by syntax elements (e.g., in a picture parameter set). A tile row refers to a rectangular region of CTUs having a height specified by syntax elements (e.g., in a picture parameter set) and a width equal to the width of the picture.
In some examples, a tile may be partitioned into multiple bricks, each of which may include one or more CTU rows within the tile. A tile that is not partitioned into multiple bricks may also be referred to as a brick. However, a brick that is a true subset of a tile may not be referred to as a tile.
The bricks in a picture may also be arranged in slices. A slice may be an integer number of bricks of a picture that may be exclusively contained in a single network abstraction layer (NAL) unit. In some examples, a slice includes either a number of complete tiles or only a consecutive sequence of complete bricks of one tile.
The present disclosure may use "N×N" and "N by N" interchangeably to represent the sample size of a block (such as a CU or other video block) in the vertical and horizontal dimensions, e.g., 16×16 samples or 16 by 16 samples. Generally, a 16×16 CU has 16 samples in the vertical direction (y=16) and 16 samples in the horizontal direction (x=16). Similarly, an N×N CU typically has N samples in the vertical direction and N samples in the horizontal direction, where N represents a non-negative integer value. The samples in a CU may be arranged in rows and columns. Further, a CU does not necessarily have the same number of samples in the horizontal direction as in the vertical direction. For example, a CU may include N×M samples, where M is not necessarily equal to N.
The video encoder 200 encodes video data of a CU representing prediction and/or residual information, as well as other information. The prediction information indicates how the CU is to be predicted in order to form a prediction block of the CU. The residual information generally represents sample-by-sample differences between samples of the CU and the prediction block prior to encoding.
To predict a CU, video encoder 200 may typically form a prediction block of the CU by inter-prediction or intra-prediction. Inter-prediction generally refers to predicting a CU from data of a previously decoded picture, while intra-prediction generally refers to predicting a CU from previously decoded data of the same picture. To perform inter prediction, the video encoder 200 may generate a prediction block using one or more motion vectors. Video encoder 200 typically performs a motion search to identify reference blocks that closely match the CU (e.g., in terms of differences between the CU and the reference blocks). The video encoder 200 may calculate a difference metric using a Sum of Absolute Differences (SAD), a Sum of Squared Differences (SSD), a Mean Absolute Difference (MAD), a Mean Squared Difference (MSD), or other such difference calculation to determine whether the reference block closely matches the current CU. In some examples, video encoder 200 may use unidirectional prediction or bi-directional prediction to predict the current CU.
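The difference metrics named above can be sketched in a few lines. The following is a minimal illustration only, assuming blocks are represented as equally sized lists of rows; the function names and the `best_match` helper are hypothetical and are not taken from any codec implementation.

```python
def _pairs(block, ref):
    """Yield co-located sample pairs from two equally sized 2-D blocks."""
    for row_a, row_b in zip(block, ref):
        yield from zip(row_a, row_b)

def sad(block, ref):
    """Sum of Absolute Differences."""
    return sum(abs(a - b) for a, b in _pairs(block, ref))

def ssd(block, ref):
    """Sum of Squared Differences."""
    return sum((a - b) ** 2 for a, b in _pairs(block, ref))

def mad(block, ref):
    """Mean Absolute Difference."""
    return sad(block, ref) / (len(block) * len(block[0]))

def msd(block, ref):
    """Mean Squared Difference."""
    return ssd(block, ref) / (len(block) * len(block[0]))

def best_match(block, candidates, metric=sad):
    """Index of the candidate reference block with the lowest cost (hypothetical helper)."""
    costs = [metric(block, cand) for cand in candidates]
    return min(range(len(candidates)), key=costs.__getitem__)

current = [[10, 12], [14, 16]]
candidates = [
    [[0, 0], [0, 0]],
    [[10, 13], [14, 15]],
    [[20, 20], [20, 20]],
]
print(best_match(current, candidates))  # 1
```

A real motion search would evaluate candidates at many displacements within a search window; the exhaustive list here only illustrates how a difference metric ranks candidate reference blocks.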
Some examples of VVC also provide an affine motion compensation mode, which may be considered an inter prediction mode. In affine motion compensation mode, the video encoder 200 may determine two or more motion vectors representing non-translational motion (such as zoom in or out, rotation, perspective motion, or other irregular motion types).
To perform intra prediction, the video encoder 200 may select an intra prediction mode to generate a prediction block. Some examples of VVC provide sixty-seven intra-prediction modes, including various directional modes, as well as a planar mode and a DC mode. In general, video encoder 200 selects an intra prediction mode that describes the neighboring samples of a current block (e.g., a block of a CU) from which samples of the current block are predicted. Assuming that the video encoder 200 encodes CTUs and CUs in raster scan order (left to right, top to bottom), such samples may generally be above, above and to the left of, or to the left of the current block in the same picture as the current block.
The video encoder 200 encodes data representing a prediction mode of the current block. For example, for inter prediction modes, the video encoder 200 may encode data indicating which of various available inter prediction modes is used, and motion information for the corresponding mode. For unidirectional or bi-directional inter prediction, for example, the video encoder 200 may use Advanced Motion Vector Prediction (AMVP) or merge mode to encode the motion vectors. The video encoder 200 may encode the motion vectors for the affine motion compensation mode using a similar mode.
After prediction (such as intra prediction or inter prediction of a block), the video encoder 200 may calculate residual data for the block. Residual data (such as a residual block) represents sample-by-sample differences between a block and a prediction block of the block, the prediction block being formed using a corresponding prediction mode. The video encoder 200 may apply one or more transforms to the residual block to produce transformed data in the transform domain instead of the sample domain. For example, video encoder 200 may apply a Discrete Cosine Transform (DCT), an integer transform, a wavelet transform, or a conceptually similar transform to the residual video data. Additionally, the video encoder 200 may apply a secondary transform after the first transform, such as a mode-dependent non-separable secondary transform (MDNSST), a signal-dependent transform, a Karhunen-Loeve transform (KLT), and the like. The video encoder 200 generates transform coefficients after applying one or more transforms.
As described above, following any transforms that produce transform coefficients, the video encoder 200 may perform quantization of the transform coefficients. Quantization generally refers to a process in which transform coefficients are quantized to possibly reduce the amount of data used to represent the transform coefficients, providing further compression. By performing the quantization process, the video encoder 200 may reduce the bit depth associated with some or all of the transform coefficients. For example, the video encoder 200 may round an n-bit value down to an m-bit value during quantization, where n is greater than m. In some examples, to perform quantization, the video encoder 200 may perform a bit-wise right shift of the value to be quantized.
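As a rough illustration of quantization by bit-wise right shift, the sketch below drops `shift` low-order bits from a coefficient. The rounding offset and the sign handling are assumptions added for the example; the text above only states that a right shift may be used.

```python
def quantize_shift(value, shift):
    """Drop `shift` low-order bits, rounding the magnitude to nearest (assumed rounding rule)."""
    if shift == 0:
        return value
    offset = 1 << (shift - 1)            # half of the quantization step
    magnitude = (abs(value) + offset) >> shift
    return magnitude if value >= 0 else -magnitude

def dequantize_shift(level, shift):
    """Approximate reconstruction by shifting back; the dropped bits are lost."""
    return level << shift

level = quantize_shift(1000, 4)          # step size 16
print(level, dequantize_shift(level, 4))  # 63 1008
```

The reconstructed value (1008) differs from the input (1000), which is the quantization error that later in-loop filtering attempts to mitigate.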
After quantization, the video encoder 200 may scan the transform coefficients, producing a one-dimensional vector from a two-dimensional matrix comprising the quantized transform coefficients. The scan may be designed to place higher energy (and therefore lower frequency) transform coefficients in front of the vector and lower energy (and therefore higher frequency) transform coefficients in back of the vector. In some examples, video encoder 200 may scan the quantized transform coefficients using a predefined scan order to produce a serialized vector and then entropy encode the quantized transform coefficients of the vector. In other examples, video encoder 200 may perform adaptive scanning. After scanning the quantized transform coefficients to form a one-dimensional vector, the video encoder 200 may entropy encode the one-dimensional vector, e.g., according to context-adaptive binary arithmetic coding (CABAC). The video encoder 200 may also entropy encode values of syntax elements that describe metadata associated with the encoded video data for use by the video decoder 300 in decoding the video data.
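The serialization step can be illustrated with a simple anti-diagonal scan. This is a simplified sketch, not the scan defined by any particular standard: it orders positions by row + column so that low-frequency (top-left) coefficients land at the front of the vector.

```python
def diagonal_scan_order(height, width):
    """Positions sorted by anti-diagonal (row + col), walking each diagonal bottom-left to top-right."""
    positions = [(r, c) for r in range(height) for c in range(width)]
    return sorted(positions, key=lambda rc: (rc[0] + rc[1], rc[1]))

def scan(coeffs):
    """Serialize a 2-D coefficient matrix into a 1-D list using the scan order above."""
    order = diagonal_scan_order(len(coeffs), len(coeffs[0]))
    return [coeffs[r][c] for r, c in order]

quantized = [
    [9, 4, 1],
    [5, 2, 0],
    [1, 0, 0],
]
print(scan(quantized))  # [9, 5, 4, 1, 2, 1, 0, 0, 0]
```

The trailing run of zeros in the serialized vector is what makes the subsequent entropy coding stage efficient.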
To perform CABAC, the video encoder 200 may assign a context within a context model to a symbol to be transmitted. The context may relate to, for example, whether neighboring values of the symbol are zero-valued. The probability determination may be based on the context assigned to the symbol.
The video encoder 200 may also generate syntax data to the video decoder 300, such as block-based syntax data, picture-based syntax data, and sequence-based syntax data, for example, in a picture header, a block header, a slice header, or other syntax data, such as a Sequence Parameter Set (SPS), a Picture Parameter Set (PPS), or a Video Parameter Set (VPS). The video decoder 300 may similarly decode this syntax data to determine how to decode the corresponding video data.
In this manner, video encoder 200 may generate a bitstream including encoded video data, e.g., syntax elements describing the partitioning of pictures into blocks (e.g., CUs) and prediction and/or residual information for the blocks. Finally, the video decoder 300 may receive the bitstream and decode the encoded video data.
In general, the video decoder 300 performs a process reciprocal to that performed by the video encoder 200 to decode encoded video data of a bitstream. For example, the video decoder 300 may decode the values of the syntax elements of the bitstream using CABAC in a manner substantially similar to (although reciprocal to) the CABAC encoding process of the video encoder 200. The syntax element may define partition information for partitioning a picture into CTUs and partitioning each CTU according to a corresponding partition structure such as a QTBT structure to define CUs of the CTU. Syntax elements may also define prediction and residual information for blocks (e.g., CUs) of video data.
The residual information may be represented by, for example, quantized transform coefficients. The video decoder 300 may inverse quantize and inverse transform the quantized transform coefficients of the block to reconstruct the residual block of the block. The video decoder 300 forms a prediction block of a block using a signaled prediction mode (intra prediction or inter prediction) and associated prediction information (e.g., motion information for inter prediction). The video decoder 300 may then combine the prediction block and the residual block (on a sample-by-sample basis) to reproduce the original block. The video decoder 300 may perform additional processing, such as performing a deblocking process to reduce visual artifacts along the boundaries of the blocks.
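The sample-by-sample combination of the prediction block and the residual block can be sketched as follows. Clipping the result to the valid sample range for an assumed bit depth is an addition for the example and is not stated in the text above.

```python
def reconstruct_block(prediction, residual, bit_depth=8):
    """Add residual to prediction sample by sample, clipping to [0, 2^bit_depth - 1] (assumed)."""
    max_val = (1 << bit_depth) - 1
    return [
        [min(max_val, max(0, p + r)) for p, r in zip(pred_row, res_row)]
        for pred_row, res_row in zip(prediction, residual)
    ]

prediction = [[100, 250], [3, 128]]
residual = [[5, 10], [-7, 0]]
print(reconstruct_block(prediction, residual))  # [[105, 255], [0, 128]]
```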
The present disclosure may generally refer to "signaling" certain information, such as syntax elements. The term "signaling" may generally refer to the communication of values of syntax elements and/or other data used to decode encoded video data. That is, the video encoder 200 may signal the values of syntax elements in the bitstream. In general, signaling refers to generating a value in the bitstream. As described above, the source device 102 may transport the bitstream to the destination device 116 substantially in real time, or not in real time, such as may occur when storing syntax elements to the storage device 112 for later retrieval by the destination device 116.
Fig. 2 is a conceptual diagram illustrating a hybrid video codec framework. Since H.261, video codec standards have been based on the so-called hybrid video codec principle, as illustrated in fig. 2. The term "hybrid" refers to the combination of two means of reducing redundancy in a video signal, namely prediction and transform coding with quantization of the prediction residual. Whereas prediction and transforms reduce redundancy in the video signal by decorrelation, quantization reduces the data of the transform coefficient representation by reducing its precision, ideally by removing only irrelevant details. This hybrid video codec design principle is also used in the two most recent standards, ITU-T H.265/HEVC and ITU-T H.266/VVC.
As shown in fig. 2, a modern hybrid video codec 130 typically performs block partitioning, motion compensation or inter-picture prediction, intra-picture prediction, transformation, quantization, entropy coding, and post-loop/in-loop filtering. In the example of fig. 2, video codec 130 includes a summing unit 134, a transform unit 136, a quantization unit 138, an entropy coding unit 140, an inverse quantization unit 142, an inverse transform unit 144, a summing unit 146, a loop filter unit 148, a Decoded Picture Buffer (DPB) 150, an intra prediction unit 152, an inter prediction unit 154, and a motion estimation unit 156.
In general, the video codec 130 may receive input video data 132 when encoding the video data. Block segmentation is used to divide a picture (image) of received video data into smaller blocks for operation of the prediction and transformation processes. Early video codec standards used fixed block sizes, typically 16 x 16 samples. Recent standards such as HEVC and VVC employ tree-based partitioning structures to provide flexible partitioning.
Motion estimation unit 156 and inter-prediction unit 154 may, for example, predict input video data 132 from previously decoded data of DPB 150. Motion compensation or inter-picture prediction exploits redundancy that exists between pictures (hence the name "inter") of a video sequence. According to the block-based motion compensation used in all modern video codecs, the prediction is obtained from one or more previously decoded pictures (i.e., reference picture(s)). The corresponding region from which the inter prediction is generated is indicated by motion information, including a motion vector and a reference picture index.
The summing unit 134 may calculate residual data as a difference between the input video data 132 and prediction data from the intra prediction unit 152 or the inter prediction unit 154. The summing unit 134 provides the residual block to the transform unit 136, and the transform unit 136 applies one or more transforms to the residual block to generate a transformed block. The quantization unit 138 quantizes the transform block to form quantized transform coefficients. The entropy encoding unit 140 entropy encodes the quantized transform coefficients, as well as other syntax elements, such as motion information or intra-prediction information, to generate an output bitstream 158.
Meanwhile, the inverse quantization unit 142 inversely quantizes the quantized transform coefficient, and the inverse transform unit 144 inversely transforms the transform coefficient to reproduce the residual block. The summing unit 146 combines the residual block with the prediction block (on a sample-by-sample basis) to generate a decoded block of video data. Loop filter unit 148 applies one or more filters (e.g., at least one of a neural network-based filter, a neural network-based loop filter, a neural network-based post-loop filter, an adaptive in-loop filter, or a predefined adaptive in-loop filter) to the decoded block to produce a filtered decoded block.
In accordance with the techniques of this disclosure, the neural network filtering unit of loop filter unit 148 may receive data of the decoded pictures of video data from summing unit 146 and one or more other units of hybrid video codec 130 (e.g., transform unit 136, quantization unit 138, intra-prediction unit 152, inter-prediction unit 154, motion estimation unit 156, and/or one or more other filtering units within loop filter unit 148). For example, the neural network filtering unit may receive data from a deblocking filtering unit (also referred to as a "deblocking unit") of the loop filter unit 148. The neural network filtering unit may receive, for example, a boundary strength value that indicates whether a particular boundary is to be filtered to deblock, and if so, the extent to which the boundary is to be filtered. For example, the boundary strength value may correspond to the number of samples on either side of the boundary to be modified and/or the degree to which the samples are to be modified.
In other examples, the neural network filtering unit may receive any or all of Coding Unit (CU) partition data, Prediction Unit (PU) partition data, Transform Unit (TU) partition data, deblocking filter data, Quantization Parameter (QP) data, intra prediction data, inter prediction data, data representing a distance between a decoded picture and one or more reference pictures, or motion information of one or more decoded blocks of the decoded picture, in addition to or as an alternative to the boundary strength values. The deblocking filter data may also include one or more of whether a long filter or a short filter is used for deblocking or whether a strong filter or a weak filter is used for deblocking. The data representing the distance between the decoded picture and a reference picture may be expressed as the difference between the Picture Order Count (POC) values of the pictures.
The neural network filtering unit may determine one or more neural network models for filtering at least a portion of the decoded picture. The neural network filtering unit may also filter at least a portion of the decoded picture using the determined one or more neural network models and data from other units, including boundary strength data. For example, the neural network filtering unit may provide the additional data as one or more additional input planes to a Convolutional Neural Network (CNN).
A block of video data, such as a CTU or CU, may actually include multiple color components, for example, a luminance or "luma" component, a blue-hue chrominance or "chroma" component, and a red-hue chrominance (chroma) component. The luminance component may have a greater spatial resolution than the chrominance components, and one of the chrominance components may have a greater spatial resolution than the other chrominance component. Alternatively, the luminance component may have a greater spatial resolution than the chrominance components, and the two chrominance components may have spatial resolutions equal to each other. For example, in a 4:2:2 format, the luminance component may be twice as large as the chrominance components in the horizontal direction and equal to the chrominance components in the vertical direction. As another example, in a 4:2:0 format, the luminance component may be twice as large as the chrominance components in both the horizontal and vertical directions. The various operations discussed above may generally be applied to each of the luma and chroma components separately (although some codec information, such as motion information or intra prediction direction, may be determined for the luma component and inherited by the corresponding chroma components).
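The relationship between luma and chroma resolution in the formats mentioned above can be expressed as a small lookup. This is a sketch only; the table and function names are hypothetical, and picture dimensions are assumed to be divisible by the subsampling factors.

```python
# (horizontal factor, vertical factor) by which chroma is subsampled relative to luma
CHROMA_SUBSAMPLING = {
    "4:4:4": (1, 1),
    "4:2:2": (2, 1),
    "4:2:0": (2, 2),
}

def chroma_size(luma_width, luma_height, chroma_format):
    """Return (width, height) of each chroma plane for the given luma size."""
    sub_w, sub_h = CHROMA_SUBSAMPLING[chroma_format]
    return luma_width // sub_w, luma_height // sub_h

print(chroma_size(1920, 1080, "4:2:0"))  # (960, 540)
print(chroma_size(1920, 1080, "4:2:2"))  # (960, 1080)
```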
Fig. 3 is a conceptual diagram illustrating a hierarchical prediction structure 166 using a group of pictures (GOP) size of 16. In recent video codecs, a hierarchical prediction structure within a group of pictures (GOP) is applied to improve codec efficiency.
Referring again to fig. 2, intra-picture prediction exploits spatial redundancy present within a picture (hence the term "intra-picture") by deriving the prediction of a block from spatially adjacent (reference) samples that have already been coded/decoded. Directional angular prediction, DC prediction, and plane or planar prediction are used in recent video codecs, including AVC, HEVC, and VVC.
Hybrid video codec standards apply a block transform to the prediction residual (whether it comes from inter-picture prediction or intra-picture prediction). In early standards, including H.261, H.262, and H.263, the Discrete Cosine Transform (DCT) was employed. In HEVC and VVC, more transform kernels are applied in addition to the DCT in order to account for different statistics in a particular video signal.
Quantization aims at reducing the precision of an input value or set of input values in order to reduce the amount of data needed to represent these values. In hybrid video codecs, quantization is typically applied to individual transformed residual samples, i.e., to transform coefficients, resulting in integer coefficient levels. In recent video codec standards, the step size is derived from a so-called Quantization Parameter (QP) that controls the fidelity and bit rate. A larger step size reduces the bit rate but also degrades quality, e.g., causing video pictures to exhibit blocking artifacts and blurred details.
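To make the QP-to-step-size relationship concrete, the sketch below uses the commonly cited approximation for AVC/HEVC-style codecs, in which the step size doubles for every increase of 6 in QP and equals 1 at QP 4. That formula is an assumption brought in for illustration; it is not stated in this disclosure, and actual codecs use integer scaling tables rather than floating-point arithmetic.

```python
import math

def quant_step(qp):
    """Approximate quantization step size: doubles every 6 QP, equals 1.0 at QP 4 (assumed model)."""
    return 2.0 ** ((qp - 4) / 6.0)

def quantize(coeff, qp):
    """Map a transform coefficient to an integer level, rounding the magnitude to nearest."""
    level = math.floor(abs(coeff) / quant_step(qp) + 0.5)
    return level if coeff >= 0 else -level

def dequantize(level, qp):
    """Reconstruct an approximate coefficient value from its level."""
    return level * quant_step(qp)

for qp in (22, 34):
    level = quantize(100, qp)
    print(qp, quant_step(qp), level, dequantize(level, qp))
# 22 8.0 13 104.0
# 34 32.0 3 96.0
```

The higher QP produces fewer distinct levels (lower bit rate) at the cost of a coarser reconstruction.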
Due to its high efficiency, Context-Adaptive Binary Arithmetic Coding (CABAC) is the form of entropy coding used in recent video codecs (e.g., AVC, HEVC, and VVC).
Post-loop/in-loop filtering is a filtering process (or a combination of such processes) applied to a reconstructed picture to reduce codec artifacts. The input to the filtering process is typically the reconstructed picture, which is the combination of the reconstructed residual signal (which includes quantization error) and the prediction. As shown in fig. 2, the reconstructed picture after in-loop filtering is stored and used as a reference for inter-picture prediction of subsequent pictures. The codec artifacts are mainly determined by the QP, so QP information is typically used in the design of the filtering process. In HEVC, the in-loop filters include deblocking filtering and Sample Adaptive Offset (SAO) filtering. In the VVC standard, an Adaptive Loop Filter (ALF) is introduced as a third filter. The filtering process of ALF is as follows:
$$R'(i,j) = R(i,j) + \left(\left(\sum_{k \neq 0}\sum_{l \neq 0} f(k,l) \times K\big(R(i+k,j+l) - R(i,j),\; c(k,l)\big) + 64\right) \gg 7\right) \tag{1}$$
where $R(i,j)$ is the sample before the filtering process and $R'(i,j)$ is the sample value after the filtering process. $f(k,l)$ denotes the filter coefficients, $K(x,y)$ is the clipping function, and $c(k,l)$ denotes the clipping parameters. The variables $k$ and $l$ vary between $-L/2$ and $L/2$, where $L$ denotes the filter length. The clipping function is $K(x,y) = \min(y, \max(-y, x))$, which corresponds to the function Clip3(−y, y, x). The clipping operation introduces a non-linearity that makes ALF more efficient by reducing the effect of neighboring sample values that differ too much from the current sample value. In VVC, the filtering parameters may be signaled in the bitstream or may be selected from predefined filter sets. The ALF filtering process can also be summarized using the following formula:
$$R'(i,j) = R(i,j) + \mathrm{ALF\_residual\_output}(R) \tag{2}$$
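The clipped ALF update of equation (1) can be sketched for a single sample as follows. The dictionaries mapping (k, l) offsets to coefficients and clipping parameters, the function names, and the four-tap example filter are all hypothetical choices made for the illustration; picture-boundary handling is omitted.

```python
def clip_k(x, y):
    """Clipping function K(x, y) = min(y, max(-y, x)), i.e. Clip3(-y, y, x)."""
    return min(y, max(-y, x))

def alf_filter_sample(samples, i, j, coeffs, clips):
    """Apply equation (1) at position (i, j).

    coeffs and clips map (k, l) offsets to f(k, l) and c(k, l); coefficients are in
    7-bit fixed point, matching the "+ 64) >> 7" rounding in equation (1).
    """
    center = samples[i][j]
    acc = 0
    for (k, l), f in coeffs.items():
        diff = samples[i + k][j + l] - center
        acc += f * clip_k(diff, clips[(k, l)])
    return center + ((acc + 64) >> 7)

samples = [
    [100, 100, 100],
    [100, 120, 100],
    [100, 100, 100],
]
offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]
coeffs = {off: 16 for off in offsets}        # hypothetical filter taps
tight = {off: 10 for off in offsets}         # small clipping parameter limits the correction
loose = {off: 100 for off in offsets}        # large clipping parameter leaves differences intact

print(alf_filter_sample(samples, 1, 1, coeffs, tight))  # 115
print(alf_filter_sample(samples, 1, 1, coeffs, loose))  # 110
```

With the tighter clipping parameter the outlying center sample is pulled less strongly toward its neighbors, which is the non-linear behavior described above.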
Fig. 4 is a conceptual diagram illustrating a neural network-based filter 170 having four layers. Various studies have shown that embedding a Neural Network (NN) into, for example, the hybrid video codec framework of fig. 2 can improve compression efficiency. Neural networks have been used in intra-prediction and inter-prediction modules to improve prediction efficiency. NN-based in-loop filtering is also an active research topic. Sometimes, the filtering process is applied as post-loop filtering. In this case, the filtering process is applied only to the output picture, and the unfiltered picture is used as a reference picture.
The NN-based filter 170 may be applied in addition to existing filters, such as deblocking filters, Sample Adaptive Offset (SAO), and/or Adaptive Loop Filtering (ALF). The NN-based filter may also be applied exclusively, wherein the NN-based filter is designed to replace all existing filters. Additionally or alternatively, NN-based filters (such as NN-based filter 170) may be designed to supplement, augment, or replace any or all of the other filters.
As shown in fig. 4, the NN-based filtering process may take reconstructed samples as inputs, and the intermediate outputs are residual samples that are added back to the inputs to refine the input samples. The NN filter may use all color components (e.g., Y, U, and V, or Y, Cb, and Cr, i.e., luma, blue chroma, and red chroma) as inputs to exploit cross-component correlation. The different color components may share the same filters (including network structure and model parameters), or each component may have its own specific filter.
The filtering process can also be summarized as follows: $R'(i,j) = R(i,j) + \mathrm{NN\_filter\_residual\_output}(R)$. Model structure and model parameters of the NN-based filter(s) may be predefined and stored in the encoder and decoder. The filter may also be signaled in the bitstream.
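The residual formulation above can be sketched with any callable standing in for the network. In the illustration below, `toy_residual_model` is a hypothetical placeholder (a mild horizontal smoothing), not a trained neural network, and the clipping to an 8-bit range is an assumption.

```python
def toy_residual_model(plane):
    """Hypothetical stand-in for a trained network: returns a small smoothing residual per sample."""
    residual = []
    for row in plane:
        last = len(row) - 1
        residual.append([
            (row[max(x - 1, 0)] + row[min(x + 1, last)] - 2 * v) // 4
            for x, v in enumerate(row)
        ])
    return residual

def nn_filter(reconstructed, model, bit_depth=8):
    """Compute R' = R + model(R), clipping to the valid sample range (assumed)."""
    max_val = (1 << bit_depth) - 1
    residual = model(reconstructed)
    return [
        [min(max_val, max(0, r + d)) for r, d in zip(rec_row, res_row)]
        for rec_row, res_row in zip(reconstructed, residual)
    ]

reconstructed = [[100, 100, 120, 120]]
print(nn_filter(reconstructed, toy_residual_model))  # [[100, 105, 115, 120]]
```

Because the model only predicts a correction, an all-zero output leaves the reconstructed picture unchanged, which is one reason residual formulations are commonly used for such filters.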
The present disclosure recognizes that in some cases, when a video codec, such as video encoder 200 or video decoder 300 of fig. 1, applies Neural Network (NN) -based filtering as an add-on module, the video codec may generate different kinds of information that may be used by the NN-based filter to further improve filtering performance.
In general, in accordance with the techniques of this disclosure, the video encoder 200 and the video decoder 300 may include respective filtering units configured to perform NN-based filtering. The filtering unit may use information generated by other units (e.g., block segmentation information, motion information, deblocking filter information, or other such information) when performing the NN-based filtering process.
When applying the NN filter, the filtering unit may use any available information generated by other units or modules. Examples of such units or modules that may generate information that may be used by the filtering unit include an intra prediction unit, an inter prediction unit, a transform processing unit, a quantization unit, a loop filtering unit (e.g., a deblocking filter unit, a Sample Adaptive Offset (SAO) unit, an Adaptive Loop Filter (ALF) unit, etc.), a preprocessing unit (e.g., a motion compensated temporal filtering unit), and/or another NN-based module or unit that co-exists with an NN filtering unit or module.
The video encoder 200 and/or the video decoder 300 may comprise a plurality of NN-based filtering units, one of which performs NN-based filtering before another (current) NN-based filtering unit, for example as shown in fig. 4. In this case, when NN filtering is performed, the current NN-based filtering unit may use information generated by one or more previous NN-based filtering units.
In various examples, when performing NN-based filtering, the NN-based filtering unit(s) may use any or all of the following information: CU, PU, and/or TU partition information, deblocking filtering information (e.g., boundary strength values for the deblocking filtering process, long or short filters, strong or weak filters, etc.), Quantization Parameters (QP) for the current picture/block and/or reference picture(s), intra and/or inter prediction mode information, distance (e.g., Picture Order Count (POC) value differences) between the current picture and the reference picture used to predict the current picture, and/or motion information of the codec block. The boundary strength value may also be referred to as boundary filter strength. In general, the boundary strength value or boundary filter strength value may represent whether a particular block boundary is to be deblocked, and if so, the extent to which deblocking is to be applied. For example, a relatively strong deblocking filter may modify more samples on either side of a block boundary, and to a greater extent, than a relatively weak deblocking filter.
In some examples, when the NN-based filtering unit uses information from other units or modules, the NN-based filtering unit may provide some similar functionality provided by the other units or modules. In such examples, the NN-based filtering unit may modify elements of other units or modules to improve cooperation between the NN-based filtering unit and the other units or modules. For example, the NN-based filtering unit may be interfaced with a deblocking filter. In such an example, the deblocking filter unit may generate boundary strength information, but does not perform actual filtering. The NN-based filtering unit may receive the boundary strength information from the deblocking filter unit and provide the received boundary strength information as an input to the NN-based filter.
The NN-based filtering unit may use information received from other units or modules in various ways. For example, the NN-based filtering unit may use information as an additional input plane to a Convolutional Neural Network (CNN). As another example, the NN-based filtering unit may use the information to modify or adjust the output of the NN-based filter. For example, after the NN-based filter is applied to the picture to form a filtered picture, the video encoder 200 or video decoder 300 may also adjust the filtered picture based on other information (such as QP).
Information from other units or modules may be converted to be more suitable for NN-based filtering units. For example, the NN-based filtering unit may convert values between integer and floating point values, scale values to a range that is more appropriate for the NN filter (e.g., the boundary strength values of the deblocking filter may be scaled to the same range as the input pixels), or scale values to any other range (where the range may be predefined or signaled in the bitstream).
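The conversions described above might look like the following sketch, which maps integer side-information values either onto the range of the input pixels or onto [0, 1] as floating-point values. The function names, the 10-bit default, and the example value range of [0, 2] are assumptions for the illustration.

```python
def scale_to_pixel_range(plane, value_max, bit_depth=10):
    """Linearly map integer values in [0, value_max] to floats in [0, 2^bit_depth - 1]."""
    pixel_max = (1 << bit_depth) - 1
    return [[v * pixel_max / value_max for v in row] for row in plane]

def normalize(plane, value_max):
    """Map integer values in [0, value_max] to floats in [0.0, 1.0]."""
    return [[v / value_max for v in row] for row in plane]

strengths = [[0, 1, 2]]
print(scale_to_pixel_range(strengths, value_max=2))  # [[0.0, 511.5, 1023.0]]
print(normalize(strengths, value_max=2))             # [[0.0, 0.5, 1.0]]
```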
Fig. 5 is a conceptual diagram illustrating an example portion of a picture 180 including boundaries, boundary samples, and internal samples. Specifically, the portion of picture 180 includes vertical boundaries 182A-182D (vertical boundaries 182) and horizontal boundaries 184A-184C (horizontal boundaries 184). As shown in the example of fig. 5, two adjacent boundary samples (labeled "N" and shaded gray in fig. 5) define a respective boundary 182, 184 (represented by a solid black line in fig. 5) between the two boundary samples. The internal (i.e., non-boundary) samples are labeled "M" in fig. 5 and are not shaded. The boundaries may be CU, PU, and/or TU boundaries. When performing NN-based filtering, the NN-based filtering unit may use information of the internal/boundary samples and boundary positions depicted in fig. 5.
For example, the NN-based filtering unit may use CU, PU, and TU partition information as one or more additional input planes for the CNN-based filter. First, the video encoder 200 and/or the video decoder 300 may convert partition information (for CUs, PUs, and/or TUs) into planes by setting boundary samples to a different value (e.g., a predefined value N) than the internal samples (e.g., M) shown in fig. 5. In one example, the video encoder 200 and the video decoder 300 may set values of n=1 and m=0.
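A partition plane of this kind could be built as in the sketch below, which marks the outermost ring of samples of each block with N and leaves interior samples at M. The rectangle representation `(top, left, height, width)` is hypothetical, and marking samples along the picture edge as boundary samples is an assumption; an implementation could equally leave picture edges unmarked.

```python
def partition_plane(height, width, blocks, boundary_value=1, interior_value=0):
    """Build a plane where samples adjoining a block boundary are N and all others are M.

    blocks: list of (top, left, block_height, block_width) rectangles covering the region.
    """
    plane = [[interior_value] * width for _ in range(height)]
    for top, left, block_h, block_w in blocks:
        bottom, right = top + block_h - 1, left + block_w - 1
        for x in range(left, right + 1):       # top and bottom rows of the block
            plane[top][x] = boundary_value
            plane[bottom][x] = boundary_value
        for y in range(top, bottom + 1):       # left and right columns of the block
            plane[y][left] = boundary_value
            plane[y][right] = boundary_value
    return plane

# Two side-by-side 4x4 CUs in a 4x8 region, with N = 1 and M = 0
cu_plane = partition_plane(4, 8, [(0, 0, 4, 4), (0, 4, 4, 4)])
for row in cu_plane:
    print(row)
# [1, 1, 1, 1, 1, 1, 1, 1]
# [1, 0, 0, 1, 1, 0, 0, 1]
# [1, 0, 0, 1, 1, 0, 0, 1]
# [1, 1, 1, 1, 1, 1, 1, 1]
```

Columns 3 and 4 form the pair of adjacent boundary samples on either side of the vertical CU boundary.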
Video encoder 200 and video decoder 300 may generate multiple partition planes, e.g., one each for CU partitions, PU partitions, and/or TU partitions. In some examples, planes may be combined, e.g., one for CU partitions and another for PU and TU partitions. In the case where "dual tree" partitioning is enabled, where the luma and chroma components may be partitioned differently (and/or the two chroma components may be partitioned differently from each other), the different color components may have different partition planes. The NN-based filter may use any or all of these different planes as input planes of the CNN-based filter. Various examples of processing multiple partition planes include: using the partition planes as separate input planes for the CNN-based filter; combining multiple partition planes into one plane (e.g., for each pixel sample at position (i, j), $\mathrm{Plane}_{combined}(i,j) = \mathrm{MAX}(\mathrm{Plane}_a(i,j), \mathrm{Plane}_b(i,j), \ldots)$); or a combination of separate and/or combined partition planes (e.g., combining the CU and TU partition planes for each color component into one plane, and using the multiple combined planes, one per color component, as inputs to the CNN-based filter). That is, in one example, to combine planes to form a combined input plane, for each position (i, j) of the input planes, the video encoder 200 and the video decoder 300 may set the value at position (i, j) of the combined input plane equal to the maximum of the values at position (i, j) of the plurality of input planes.
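The position-wise maximum described above can be sketched as follows, assuming each plane is an equally sized list of rows; the function name and the example planes are hypothetical.

```python
def combine_max(*planes):
    """Combine any number of equally sized planes by taking the maximum at each position."""
    return [
        [max(values) for values in zip(*rows)]
        for rows in zip(*planes)
    ]

cu_plane = [[1, 0, 0, 1],
            [1, 0, 0, 1]]
tu_plane = [[1, 1, 0, 1],
            [0, 0, 0, 0]]
print(combine_max(cu_plane, tu_plane))  # [[1, 1, 0, 1], [1, 0, 0, 1]]
```

With binary planes (N = 1, M = 0), the maximum acts as a logical OR, so a sample is marked in the combined plane if it lies on a boundary in any of the input planes.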
After the plane(s) are created, the values of the planes may be converted as needed in various examples before they are used as inputs to the CNN-based filter. For example, the values may be converted between integer and floating point, the values may be scaled to have the same range as the input pixel values, and/or the values may be scaled to any other range, which may be predefined or signaled in the bitstream.
As described above, in some examples, boundary strength computation logic of the deblocking filter (e.g., the VVC boundary strength calculation of the deblocking (DB) filter) may be used to derive boundary strength parameters. The NN-based filtering unit may use the boundary strength parameters as additional input plane(s) of the CNN-based filter. When applying the CNN filter, the actual filtering process of the deblocking filter may be disabled.
First, the deblocking filtering unit may derive boundary strength values of edges conforming to the deblocking filtering. The conversion may be applied as needed for an example of the techniques of this disclosure (e.g., conversion between integer and floating point value types, scaling the values to have the same range as the input pixels or any other range deemed suitable for CNN filter use, etc.).
The NN-based filtering unit or another unit of the video codec may convert the boundary strength values into a plane(s) that may be used as input to the CNN-based filter along with other input planes. One example of such a transformation is similar to that described above with respect to fig. 5, where boundary samples may be set to boundary strength values, and non-boundary samples may be set to 0. In the case of VVC, the range of boundary samples is [0,2].
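The conversion of boundary strength values into a plane can be sketched as below. This is a simplified illustration under stated assumptions: the edge tuple layout, the choice of which sample counts as the "boundary sample", and keeping the larger value where edges cross are all hypothetical and not specified by the source.

```python
def boundary_strength_plane(height, width, edges):
    """Build a plane where boundary samples hold the boundary strength
    and all other samples hold 0.

    edges: list of (orientation, position, start, length, bs), where
      'V' is a vertical edge at column `position` covering rows
          start .. start + length - 1, and
      'H' is a horizontal edge at row `position` covering columns
          start .. start + length - 1.
    This tuple layout is an assumption made for the example.
    """
    plane = [[0] * width for _ in range(height)]
    for orientation, pos, start, length, bs in edges:
        for k in range(start, start + length):
            i, j = (k, pos) if orientation == "V" else (pos, k)
            # Assumption: where edges overlap, keep the larger strength
            plane[i][j] = max(plane[i][j], bs)
    return plane


# One vertical edge with strength 2 and one horizontal edge with strength 1
edges = [("V", 4, 0, 4, 2), ("H", 2, 0, 8, 1)]
bs_plane = boundary_strength_plane(4, 8, edges)
# bs_plane[0] -> [0, 0, 0, 0, 2, 0, 0, 0]
# bs_plane[2] -> [1, 1, 1, 1, 2, 1, 1, 1]
```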
Since the boundary strengths of the different color components can be calculated separately, the horizontal and vertical boundaries can also be calculated separately. For a picture or decoded region, multiple boundary strength planes may be generated. Similar to the discussion above, in one example, the NN-based filtering unit may choose to use a single input plane or multiple input planes. When multiple planes are used, different ways may be applied to organize the planes. Several examples include: using the plane as a separate input plane for the CNN-based filter; combining a plurality of boundary strength planes into a single plane; or a combination of these examples.
As an example of combining two planes A and B, for each sample position (i, j), $\mathrm{Plane}_{\mathrm{combined}}(i,j) = \max(\mathrm{Plane}_{A}(i,j), \mathrm{Plane}_{B}(i,j))$. As another example, given two planes A and B, $\mathrm{Plane}_{\mathrm{combined}}(i,j) = \mathrm{Plane}_{A}(i,j) + \mathrm{Plane}_{B}(i,j)$. As another example, given two planes A and B, let $R_B$ be the range of values in plane B; for each sample position (i, j), $\mathrm{Plane}_{\mathrm{combined}}(i,j) = R_B \cdot \mathrm{Plane}_{A}(i,j) + \mathrm{Plane}_{B}(i,j)$. In this example, the effect can be considered similar to using plane A as the principal factor and plane B as the refinement. The above techniques may be applied multiple times in order to combine more than two planes, and different techniques may be used at different stages. For example, for planes A, B, and C, a combined plane AB may be obtained using one of the above techniques, and AB may then be combined with C using another of the above techniques to obtain ABC.
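The three combination rules above (maximum, sum, and range-weighted sum) can be sketched as follows. The function names are hypothetical, and interpreting the range of plane B as the number of distinct values it can take (3 for values 0, 1, 2) is an assumption; the source only says "range of values".

```python
def combine(plane_a, plane_b, op):
    """Apply a per-sample binary operation to two equally sized planes."""
    return [[op(a, b) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(plane_a, plane_b)]


def combine_max(plane_a, plane_b):
    return combine(plane_a, plane_b, max)


def combine_sum(plane_a, plane_b):
    return combine(plane_a, plane_b, lambda a, b: a + b)


def combine_weighted(plane_a, plane_b, range_b):
    # Plane A acts as the principal factor, plane B as the refinement
    return combine(plane_a, plane_b, lambda a, b: range_b * a + b)


plane_a = [[0, 1], [2, 2]]
plane_b = [[2, 0], [1, 2]]

by_max = combine_max(plane_a, plane_b)               # [[2, 1], [2, 2]]
by_sum = combine_sum(plane_a, plane_b)               # [[2, 1], [3, 4]]
by_weight = combine_weighted(plane_a, plane_b, 3)    # [[2, 3], [7, 8]]

# Combining three planes in stages with different rules
plane_c = [[1, 1], [0, 0]]
abc = combine_max(combine_sum(plane_a, plane_b), plane_c)
```

With the assumed range of 3, the weighted rule maps every (A, B) pair to a distinct value, so the combined plane loses no information, unlike the maximum or the plain sum.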
To combine the above techniques, in one example, the boundary strength planes of the vertical and horizontal boundaries may be combined using any of the various techniques described above, and then the boundary strength planes of the different color components may be provided as separate input planes to the CNN-based filter. As described above, the values of the planes may be converted as desired, such as converted between integer and floating point, and/or scaled to a particular range, which may be predetermined or signaled in the bitstream.
In some examples, information indicating whether a long or short deblocking filter is used may serve as additional or alternative input plane(s) for the CNN-based filter. Similar to the case of using boundary strength, the long/short filter information may be generated for use by the CNN-based filter(s), and multiple planes may be used as separate planes or combined prior to use as CNN filter inputs.
In some examples, information indicating whether a strong or weak deblocking filter is used may serve as additional or alternative input plane(s) for the CNN-based filter. Similar to the case of using boundary strength, the strong/weak filter information may be generated for use by the CNN-based filter(s), and multiple planes may be used as separate planes or combined prior to use as CNN filter inputs.
The various techniques discussed above may be combined in various ways. For example, the following planes may be generated and used in the CNN filtering process: boundary strength (range of values: 0,1, 2), long/short and strong/weak filters (value: 2 for long & strong filter, 1 for short & strong filter, 0 for short & weak filter (in VVC, strong filter conditions must be satisfied to have a long filter)). Similar to the other examples discussed above, the generated planes may be used as separate input planes for the CNN filter, or some/all of the planes may be combined together.
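The combined long/short and strong/weak value assignment in this example can be written as a small mapping. This is a sketch only; the function name and the choice to raise an error for the disallowed combination are assumptions.

```python
def filter_type_value(is_strong, is_long):
    """Map deblocking filter decisions to a single plane value:
    2 = long and strong, 1 = short and strong, 0 = short and weak.
    """
    if is_long and not is_strong:
        # In VVC, the strong filter conditions must hold to have a long filter
        raise ValueError("a long filter requires the strong filter condition")
    if is_long:
        return 2
    return 1 if is_strong else 0


values = [
    filter_type_value(is_strong=True, is_long=True),    # 2
    filter_type_value(is_strong=True, is_long=False),   # 1
    filter_type_value(is_strong=False, is_long=False),  # 0
]
```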
Downsampling/upsampling may be needed in order to feed pictures or decoded regions to the CNN filter. For example, in video with a YUV420 or YUV422 color format, the luminance and chrominance components have different resolutions. In this case, downsampling/upsampling of the color components may be required to create the input planes of the CNN filter. Some techniques include: upsampling the chrominance components to have the same resolution as the luminance component; downsampling the luminance component to have the same resolution as the chrominance components; or converting one luminance pixel plane into several smaller pixel planes of the same size as a chrominance plane.
When downsampling/upsampling of a color plane is desired, the corresponding information planes introduced in the present disclosure may also be downsampled/upsampled. Downsampling/upsampling may follow the same rules as for the corresponding pixel plane. In one example, where one luma pixel plane is converted to several smaller pixel planes to align its size with chroma, video encoder 200 or video decoder 300 may keep only one downsampled information plane, rather than keeping all of the planes as is done for the luma pixels.
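Two of the listed techniques can be sketched for the 4:2:0 case, where chroma has half the luma resolution in each dimension. The function names, the 2x2 phase ordering of the sub-planes, and nearest-neighbor upsampling are assumptions for illustration; the source does not fix a particular resampling filter.

```python
def luma_to_subplanes(luma):
    """Split an H x W luma plane (H and W even) into four (H/2) x (W/2)
    planes, one per 2x2 sampling phase."""
    return [
        [row[dx::2] for row in luma[dy::2]]
        for dy in (0, 1)
        for dx in (0, 1)
    ]


def upsample_chroma_nearest(chroma):
    """Double a chroma plane in both dimensions by sample repetition."""
    out = []
    for row in chroma:
        wide = [v for v in row for _ in (0, 1)]
        out.append(wide)
        out.append(list(wide))
    return out


luma = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]
subplanes = luma_to_subplanes(luma)
# subplanes[0] -> [[0, 2], [8, 10]]   (even rows, even columns)
# subplanes[3] -> [[5, 7], [13, 15]]  (odd rows, odd columns)

chroma_up = upsample_chroma_nearest([[1, 2]])
# chroma_up -> [[1, 1, 2, 2], [1, 1, 2, 2]]
```

Splitting luma into sub-planes keeps every luma sample, whereas downsampling luma discards information; that trade-off is why the sub-plane option is listed separately.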
Fig. 6 is a block diagram illustrating an example video encoder 200 that may perform the techniques of this disclosure. Fig. 6 is provided for purposes of explanation and should not be considered limiting of the techniques broadly illustrated and described in this disclosure. For purposes of explanation, the present disclosure describes video encoder 200 in the context of video codec standards, such as the ITU-T H.265/HEVC video codec standard and the VVC video codec standard being developed. However, the techniques of this disclosure are not limited to these video coding and decoding standards, and are generally applicable to other video coding and decoding standards.
In the example of fig. 6, video encoder 200 includes video data memory 230, mode selection unit 202, residual generation unit 204, transform processing unit 206, quantization unit 208, inverse quantization unit 210, inverse transform processing unit 212, reconstruction unit 214, filter unit 216, decoded Picture Buffer (DPB) 218, and entropy encoding unit 220. Any or all of video data memory 230, mode selection unit 202, residual generation unit 204, transform processing unit 206, quantization unit 208, inverse quantization unit 210, inverse transform processing unit 212, reconstruction unit 214, filter unit 216, DPB 218, and entropy encoding unit 220 may be implemented in one or more processors or processing circuits. For example, the elements of video encoder 200 may be implemented as one or more circuits or logic elements as part of a hardware circuit or as part of a processor, ASIC, or FPGA. Furthermore, the video encoder 200 may include additional or alternative processors or processing circuits to perform these and other functions.
Video data memory 230 may store video data to be encoded by components of video encoder 200. Video encoder 200 may receive video data stored in video data memory 230 from, for example, video source 104 (fig. 1). DPB 218 may serve as a reference picture memory that stores reference video data for use by video encoder 200 in predicting subsequent video data. Video data memory 230 and DPB 218 may be formed from any of a variety of memory devices, such as Dynamic Random Access Memory (DRAM), including Synchronous DRAM (SDRAM), magnetoresistive RAM (MRAM), resistive RAM (RRAM), or other types of memory devices. Video data memory 230 and DPB 218 may be provided by the same memory device or separate memory devices. In various examples, video data memory 230 may be on-chip (as illustrated) with other components of video encoder 200, or off-chip with respect to those components.
In this disclosure, references to video data memory 230 should not be construed as limited to memory internal to video encoder 200 unless specifically described as such, or to memory external to video encoder 200 unless specifically described as such. Rather, references to video data memory 230 should be understood to refer to a memory that stores video data received by video encoder 200 for encoding (e.g., video data of a current block to be encoded). The memory 106 of fig. 1 may also provide temporary storage of the output from the various units of the video encoder 200.
Various units of fig. 6 are illustrated to aid in understanding the operations performed by video encoder 200. These units may be implemented as fixed function circuits, programmable circuits, or a combination thereof. A fixed function circuit refers to a circuit that provides a specific function and is preset with respect to the operations that can be performed. Programmable circuitry refers to circuitry that can be programmed to perform various tasks and provide flexible functionality in the operations that can be performed. For example, the programmable circuit may run software or firmware that causes the programmable circuit to operate in a manner defined by instructions of the software or firmware. Fixed function circuitry may execute software instructions (e.g., to receive parameters or output parameters), but the types of operations that fixed function circuitry performs are typically not variable. In some examples, one or more of the units may be distinct circuit blocks (fixed function or programmable), and in some examples, one or more of the units may be integrated circuits.
The video encoder 200 may include arithmetic logic units (ALUs), elementary function units (EFUs), digital circuitry, analog circuitry, and/or a programmable core formed from programmable circuitry. In examples where the operation of video encoder 200 is performed using software executed by programmable circuitry, memory 106 (fig. 1) may store instructions (e.g., object code) of the software received and executed by video encoder 200, or another memory (not shown) within video encoder 200 may store such instructions.
The video data memory 230 is configured to store received video data. The video encoder 200 may retrieve pictures of the video data from the video data memory 230 and provide the video data to the residual generation unit 204 and the mode selection unit 202. The video data in the video data memory 230 may be raw video data to be encoded.
The mode selection unit 202 comprises a motion estimation unit 222, a motion compensation unit 224 and an intra prediction unit 226. The mode selection unit 202 may include additional functional units to perform video prediction according to other prediction modes. As an example, mode selection unit 202 may include a palette unit, an intra block copy unit (which may be part of motion estimation unit 222 and/or motion compensation unit 224), an affine unit, a Linear Model (LM) unit, and the like.
Mode selection unit 202 typically coordinates multiple encoding passes to test combinations of encoding parameters and the resulting rate-distortion (RD) values for such combinations. The encoding parameters may include CTU-to-CU partitioning, prediction modes of the CU, transform types of residual data of the CU, quantization parameters of residual data of the CU, and the like. The mode selection unit 202 may finally select the combination of encoding parameters having better rate-distortion values than the other tested combinations.
Video encoder 200 may segment pictures retrieved from video data memory 230 into a series of CTUs and encapsulate one or more CTUs within a slice. The mode selection unit 202 may partition CTUs of a picture according to a tree structure, such as the QTBT structure or the quadtree structure of HEVC described above. As described above, the video encoder 200 may form one or more CUs by dividing CTUs according to a tree structure. Such CUs may also be commonly referred to as "video blocks" or "blocks."
Typically, mode selection unit 202 also controls its components (e.g., motion estimation unit 222, motion compensation unit 224, and intra prediction unit 226) to generate a prediction block for the current block (e.g., the current CU, or in HEVC, the overlapping portion of a PU and a TU). To inter-predict the current block, motion estimation unit 222 may perform a motion search to identify one or more closely matching reference blocks in one or more reference pictures (e.g., one or more previously encoded pictures stored in DPB 218). Specifically, the motion estimation unit 222 may calculate a value indicating how similar a potential reference block is to the current block, for example, according to a Sum of Absolute Differences (SAD), a Sum of Squared Differences (SSD), a Mean Absolute Difference (MAD), a Mean Squared Difference (MSD), and the like. The motion estimation unit 222 may typically perform these calculations using sample-by-sample differences between the current block and the reference block under consideration. The motion estimation unit 222 may identify the reference block having the lowest value resulting from these calculations, indicating the reference block that most closely matches the current block.
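As a rough illustration of selecting among tested parameter combinations, the sketch below ranks candidates by a Lagrangian cost J = D + λ·R. The source only refers to "rate-distortion values"; the Lagrangian form, the candidate names, and all numbers are assumptions chosen for the example.

```python
def select_best(candidates, lam):
    """Return the candidate with the lowest cost D + lam * R.

    candidates: list of (name, distortion, rate_bits) tuples.
    The cost formula is a common formulation assumed here, not one
    stated in the source.
    """
    return min(candidates, key=lambda c: c[1] + lam * c[2])


# Hypothetical results of three encoding passes for one block
candidates = [
    ("intra_dc", 1200.0, 40),
    ("inter_2Nx2N", 900.0, 95),
    ("inter_NxN", 850.0, 160),
]

best_low_lambda = select_best(candidates, lam=4.0)    # favors low distortion
best_high_lambda = select_best(candidates, lam=10.0)  # favors low rate
```

With these illustrative numbers, a small λ selects `inter_2Nx2N` (cost 1280) and a larger λ selects `intra_dc` (cost 1600), showing how the weighting shifts the decision toward cheaper modes.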
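The motion search described above can be sketched as an exhaustive integer-sample search using SAD as the matching value. This is a minimal illustration: the function names, the full-search strategy, and the skipping of candidates outside the reference picture are assumptions; practical encoders use faster search patterns and fractional-sample refinement.

```python
def sad(cur, ref, top, left):
    """Sum of absolute differences between `cur` and the block of the
    same size located at (top, left) in `ref`."""
    return sum(
        abs(cur[i][j] - ref[top + i][left + j])
        for i in range(len(cur))
        for j in range(len(cur[0]))
    )


def full_search(cur, ref, cur_top, cur_left, search_range):
    """Return (lowest SAD, (dx, dy)) over all integer displacements
    within +/- search_range of the current block position."""
    h, w = len(cur), len(cur[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            top, left = cur_top + dy, cur_left + dx
            # Skip candidates that fall outside the reference picture
            if top < 0 or left < 0 or top + h > len(ref) or left + w > len(ref[0]):
                continue
            cost = sad(cur, ref, top, left)
            if best is None or cost < best[0]:
                best = (cost, (dx, dy))
    return best


# 6x6 reference picture with unique sample values
ref = [[r * 6 + c for c in range(6)] for r in range(6)]
# Current 2x2 block at (row 2, column 2) whose content matches the
# reference block one row below
cur = [[20, 21], [26, 27]]

cost, motion_vector = full_search(cur, ref, cur_top=2, cur_left=2, search_range=1)
# cost -> 0, motion_vector -> (0, 1)
```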
The motion estimation unit 222 may form one or more Motion Vectors (MVs) defining the position of a reference block in a reference picture relative to the position of a current block in a current picture. The motion estimation unit 222 may then provide the motion vectors to the motion compensation unit 224. For example, for unidirectional inter prediction, the motion estimation unit 222 may provide a single motion vector, while for bi-directional inter prediction, the motion estimation unit 222 may provide two motion vectors. The motion compensation unit 224 may then generate a prediction block using the motion vector. For example, the motion compensation unit 224 may retrieve data of the reference block using the motion vector. As another example, if the motion vector has fractional sample precision, the motion compensation unit 224 may interpolate the values of the prediction block according to one or more interpolation filters. Furthermore, for bi-directional inter prediction, the motion compensation unit 224 may retrieve data of two reference blocks identified by respective motion vectors and combine the retrieved data (e.g., by a sample-wise average or a weighted average).
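Combining two reference blocks for bi-directional inter prediction can be sketched as a sample-wise (optionally weighted) average. The function name, integer weights, and round-half-up rule are assumptions for illustration.

```python
def bi_predict(ref0, ref1, w0=1, w1=1):
    """Combine two reference blocks sample by sample.

    With w0 == w1 this is a plain average; other integer weights give
    a weighted average. Rounding is round-half-up (an assumption).
    """
    total = w0 + w1
    return [
        [(w0 * a + w1 * b + total // 2) // total for a, b in zip(row0, row1)]
        for row0, row1 in zip(ref0, ref1)
    ]


block0 = [[100, 104]]
block1 = [[110, 105]]

averaged = bi_predict(block0, block1)              # [[105, 105]]
weighted = bi_predict(block0, block1, w0=3, w1=1)  # [[103, 104]]
```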
As another example, for intra prediction, or intra-prediction coding, the intra prediction unit 226 may generate a prediction block from samples neighboring the current block. For example, for directional modes, intra-prediction unit 226 may typically mathematically combine the values of neighboring samples and populate these calculated values in a defined direction across the current block to produce the prediction block. As another example, for the DC mode, the intra prediction unit 226 may calculate an average value of the neighboring samples of the current block and generate the prediction block to include this resulting average value for each sample of the prediction block.
The mode selection unit 202 supplies the prediction block to the residual generation unit 204. The residual generation unit 204 receives the original, unencoded version of the current block from the video data memory 230 and the prediction block from the mode selection unit 202. The residual generation unit 204 calculates sample-by-sample differences between the current block and the prediction block. The resulting sample-by-sample differences define the residual block of the current block. In some examples, residual generation unit 204 may also determine differences between sample values in the residual block to generate the residual block using Residual Differential Pulse Code Modulation (RDPCM). In some examples, residual generation unit 204 may be formed using one or more subtractor circuits that perform binary subtraction.
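The DC mode described above can be sketched as follows. The function name, the use of both the above and left neighbors, and the round-half-up integer average are assumptions; actual standards define the exact neighbor set and rounding.

```python
def dc_predict(above, left, height, width):
    """Fill a height x width prediction block with the rounded average
    of the neighboring samples above and to the left."""
    neighbors = list(above) + list(left)
    dc = (sum(neighbors) + len(neighbors) // 2) // len(neighbors)
    return [[dc] * width for _ in range(height)]


above = [100, 102, 104, 106]  # reconstructed row above the block
left = [98, 100, 102, 104]    # reconstructed column left of the block

pred = dc_predict(above, left, height=4, width=4)
# Every sample of pred equals 102
```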
In examples where mode selection unit 202 partitions a CU into PUs, each PU may be associated with a luma prediction unit and a corresponding chroma prediction unit. Video encoder 200 and video decoder 300 may support PUs having various sizes. As indicated above, the size of a CU may refer to the size of the luma codec block of the CU, while the size of a PU may refer to the size of the luma prediction unit of the PU. Assuming that the size of a particular CU is 2Nx2N, video encoder 200 may support PU sizes of 2Nx2N or NxN for intra prediction, as well as 2Nx2N, 2NxN, Nx2N, NxN, or similar symmetric PU sizes for inter prediction. The video encoder 200 and the video decoder 300 may also support asymmetric partitioning with PU sizes of 2NxnU, 2NxnD, nLx2N, and nRx2N for inter prediction.
In examples where the mode selection unit 202 does not further partition the CUs into PUs, each CU may be associated with a luma codec block and a corresponding chroma codec block. As above, the size of a CU may refer to the size of a luma codec block of the CU. The video encoder 200 and the video decoder 300 may support CU sizes of 2Nx2N, 2NxN, or Nx2N.
For other video coding techniques, such as intra block copy mode coding, affine mode coding, and Linear Model (LM) mode coding, as some examples, mode selection unit 202 generates a prediction block of the current block being coded via a corresponding unit associated with the coding technique. In some examples, such as palette mode codec, the mode selection unit 202 may not generate a prediction block, but rather generate a syntax element indicating a manner of reconstructing the block based on the selected palette. In such a mode, the mode selection unit 202 may provide these syntax elements to the entropy encoding unit 220 to encode it.
As described above, the residual generation unit 204 receives video data of the current block and the corresponding prediction block. Then, the residual generating unit 204 generates a residual block of the current block. In order to generate the residual block, the residual generation unit 204 calculates a sample-by-sample difference between the prediction block and the current block.
The transform processing unit 206 applies one or more transforms to the residual block to generate a block of transform coefficients (referred to herein as a "transform coefficient block"). The transform processing unit 206 may apply various transforms to the residual block to form the transform coefficient block. For example, transform processing unit 206 may apply a Discrete Cosine Transform (DCT), a directional transform, a Karhunen-Loeve transform (KLT), or a conceptually similar transform to the residual block. In some examples, transform processing unit 206 may perform multiple transforms on the residual block, e.g., a primary transform and a secondary transform, such as a rotational transform. In some examples, transform processing unit 206 does not apply a transform to the residual block.
The quantization unit 208 may quantize the transform coefficients in the block of transform coefficients to generate a block of quantized transform coefficients. The quantization unit 208 may quantize transform coefficients of the block of transform coefficients according to a Quantization Parameter (QP) value associated with the current block. The video encoder 200 (e.g., via the mode selection unit 202) may adjust the degree of quantization applied to the transform coefficient block associated with the current block by adjusting the QP value associated with the CU. Quantization may introduce information loss and, as a result, the quantized transform coefficients may have lower precision than the original transform coefficients generated by the transform processing unit 206.
The inverse quantization unit 210 and the inverse transform processing unit 212 may apply inverse quantization and inverse transform, respectively, to the quantized transform coefficient block to reconstruct a residual block from the transform coefficient block. The reconstruction unit 214 may generate a reconstructed block corresponding to the current block (although possibly with some degree of distortion) based on the reconstructed residual block and the prediction block generated by the mode selection unit 202. For example, the reconstruction unit 214 may add samples of the reconstructed residual block to corresponding samples of the prediction block generated from the mode selection unit 202 to generate a reconstructed block.
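The encoder-side loop of residual generation, quantization, inverse quantization, and reconstruction can be sketched as below. To stay short, the sketch corresponds to the case where no transform is applied to the residual block; the uniform step size stands in for the QP-derived scaling, and all function names are assumptions.

```python
def residual(cur, pred):
    """Sample-by-sample difference between current and prediction blocks."""
    return [[c - p for c, p in zip(rc, rp)] for rc, rp in zip(cur, pred)]


def quantize(block, step):
    """Uniform scalar quantization, rounding magnitudes to nearest."""
    return [
        [((abs(v) + step // 2) // step) * (1 if v >= 0 else -1) for v in row]
        for row in block
    ]


def dequantize(levels, step):
    return [[lv * step for lv in row] for row in levels]


def reconstruct(pred, res):
    """Add the reconstructed residual back onto the prediction block."""
    return [[p + r for p, r in zip(rp, rr)] for rp, rr in zip(pred, res)]


cur = [[105, 98], [110, 101]]
pred = [[100, 100], [100, 100]]

res = residual(cur, pred)            # [[5, -2], [10, 1]]
levels = quantize(res, step=4)       # [[1, -1], [3, 0]]
res_rec = dequantize(levels, step=4) # [[4, -4], [12, 0]]
recon = reconstruct(pred, res_rec)   # [[104, 96], [112, 100]]
```

The reconstructed block differs from `cur` by at most half the step size per sample, which is the distortion introduced by quantization that the text refers to.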
The filter unit 216 may perform one or more filtering operations on the reconstructed block. For example, the filter unit 216 may perform a deblocking operation to reduce blocking artifacts along edges of the CU. When performing the deblocking operation, the filter unit 216 (or its deblocking filter unit) may first calculate the boundary strength values. The deblocking filter unit of filter unit 216 may then use the boundary strength values to determine other deblocking parameters, such as $t_C$ and beta (β), which may generally represent the filter strength and coefficients for the deblocking and/or deblocking decision functions. In some examples, the operation of filter unit 216 may be skipped.
The filter unit 216 may be configured to perform various techniques of the present disclosure, such as determining one or more neural network models (NN models) 232 to be used for filtering the decoded picture and/or whether to apply NN model filtering. The mode selection unit 202 may perform RD calculations using both filtered and unfiltered pictures to determine RD costs, to determine whether to perform NN model filtering, and then provide data to the entropy encoding unit 220 that represents, for example, whether to perform NN model filtering, one or more NN models 232 for the current picture, and so forth.
In particular, the filter unit 216 may receive the decoded (reconstructed) picture from the reconstruction unit 214. The filter unit 216 may also obtain (e.g., receive) additional data from one or more other units (e.g., the mode selection unit 202, the motion estimation unit 222, the motion compensation unit 224, the intra-prediction unit 226, the transform processing unit 206, the quantization unit 208, another filtering unit (e.g., separate from the filter unit 216 or contained within the filter unit 216), etc.). For example, the filter unit 216 may perform NN-based filtering and other types of filtering, such as deblocking filtering, SAO filtering, ALF filtering, and the like. The filter unit 216 may thus obtain deblocking parameters, such as boundary strength values, for the block of the current picture. The filter unit 216 may provide boundary strength values (and/or other received data) to the NN filter unit of the filter unit 216.
The NN filtering unit may use the additional data (e.g., boundary strength values and/or other data received from other units of the video encoder 200) to select an NN model from the NN models 232 for performing NN-based filtering and to perform the actual NN-based filtering. For example, the NN filtering unit of the filter unit 216 may provide the additional data (e.g., boundary strength values) to the selected one or more NN models of the NN models 232 in the form of additional input layers. In some examples, filter unit 216 may modify the additional data, for example, by converting the representation format (such as between floating point and decimal) or modifying the range of the additional values to correspond to the input sample range.
Video encoder 200 stores the reconstructed block in DPB 218. For example, in examples where the operation of filter unit 216 is not required, reconstruction unit 214 may store the reconstructed block to DPB 218. In examples where the operation of filter unit 216 is required, filter unit 216 may store the filtered reconstructed block to DPB 218. Motion estimation unit 222 and motion compensation unit 224 may retrieve a reference picture formed from the reconstructed (and possibly filtered) blocks from DPB 218 to inter-predict blocks of subsequently encoded pictures. Furthermore, intra-prediction unit 226 may use the reconstructed blocks of the current picture in DPB 218 to intra-predict other blocks in the current picture.
In general, entropy encoding unit 220 may entropy encode syntax elements received from other functional components of video encoder 200. For example, entropy encoding unit 220 may entropy encode the quantized transform coefficient block from quantization unit 208. As another example, the entropy encoding unit 220 may entropy encode a prediction syntax element (e.g., motion information for inter prediction or intra mode information for intra prediction) from the mode selection unit 202. The entropy encoding unit 220 may perform one or more entropy encoding operations on syntax elements that are another example of video data to generate entropy encoded data. For example, the entropy encoding unit 220 may perform a Context Adaptive Variable Length Coding (CAVLC) operation, a CABAC operation, a variable-to-variable (V2V) length coding operation, a syntax-based context adaptive binary arithmetic coding (SBAC) operation, a Probability Interval Partitioning Entropy (PIPE) coding operation, an exponential Golomb coding operation, or another type of entropy coding operation on the data. In some examples, entropy encoding unit 220 may operate in a bypass mode that does not entropy encode syntax elements.
The video encoder 200 may output a bitstream that includes entropy encoded syntax elements required to reconstruct blocks of slices or pictures. In particular, the entropy encoding unit 220 may output a bitstream.
The above operations are described with respect to blocks. Such description should be understood as the operation for the luma codec block and/or the chroma codec block. As described above, in some examples, the luma codec block and the chroma codec block are a luma component and a chroma component of the CU. In some examples, the luma and chroma codec blocks are luma and chroma components of the PU.
In some examples, operations performed with respect to a luma codec block need not be repeated for the chroma codec blocks. As one example, the operations for identifying a Motion Vector (MV) and a reference picture of a luma codec block need not be repeated to identify the MV and the reference picture for the chroma blocks. Rather, the MV of the luma codec block may be scaled to determine the MV of the chroma blocks, and the reference picture may be the same. As another example, the intra prediction process may be the same for both luma and chroma codec blocks.
Fig. 7 is a block diagram illustrating an example video decoder 300 that may perform the techniques of this disclosure. Fig. 7 is provided for purposes of explanation and is not limiting of the techniques broadly illustrated and described in the present disclosure. For purposes of explanation, this disclosure describes video decoder 300 in terms of techniques for VVC and HEVC (ITU-T H.265). However, the techniques of this disclosure may be performed by video codec devices configured according to other video codec standards.
In the example of fig. 7, video decoder 300 includes a coded picture buffer (CPB) memory 320, an entropy decoding unit 302, a prediction processing unit 304, an inverse quantization unit 306, an inverse transform processing unit 308, a reconstruction unit 310, a filter unit 312, and a decoded picture buffer (DPB) 314. Any or all of CPB memory 320, entropy decoding unit 302, prediction processing unit 304, inverse quantization unit 306, inverse transform processing unit 308, reconstruction unit 310, filter unit 312, and DPB 314 may be implemented in one or more processors or processing circuits. For example, the units of video decoder 300 may be implemented as one or more circuits or logic elements as part of a hardware circuit or as part of a processor, ASIC, or FPGA. Furthermore, the video decoder 300 may include additional or alternative processors or processing circuits to perform these and other functions.
The prediction processing unit 304 includes a motion compensation unit 316 and an intra prediction unit 318. The prediction processing unit 304 may include additional units to perform prediction according to other prediction modes. As an example, the prediction processing unit 304 may include a palette unit, an intra block copy unit (which may form part of the motion compensation unit 316), an affine unit, a Linear Model (LM) unit, and the like. In other examples, video decoder 300 may include more, fewer, or different functional components.
The CPB memory 320 may store video data, such as an encoded video bitstream, to be decoded by components of the video decoder 300. For example, video data stored in the CPB memory 320 may be obtained from the computer-readable medium 110 (fig. 1). The CPB memory 320 may include a CPB that stores encoded video data (e.g., syntax elements) from an encoded video bitstream. Also, the CPB memory 320 may store video data other than syntax elements of a coded picture, such as temporary data representing outputs from various units of the video decoder 300. DPB 314 typically stores decoded pictures that video decoder 300 may output and/or use as reference video data when decoding subsequent data or pictures of an encoded video bitstream. CPB memory 320 and DPB 314 may be formed of any of a variety of memory devices, such as Dynamic Random Access Memory (DRAM), including Synchronous DRAM (SDRAM), magnetoresistive RAM (MRAM), resistive RAM (RRAM), or other types of memory devices. CPB memory 320 and DPB 314 may be provided by the same memory device or separate memory devices. In various examples, CPB memory 320 may be on-chip with other components of video decoder 300, or off-chip with respect to those components.
Additionally or alternatively, in some examples, video decoder 300 may retrieve coded video data from memory 120 (fig. 1). That is, memory 120 may store data as discussed above with respect to CPB memory 320. Likewise, when some or all of the functions of video decoder 300 are implemented in software to be executed by the processing circuitry of video decoder 300, memory 120 may store instructions to be executed by video decoder 300.
Various units shown in fig. 7 are illustrated to aid in understanding the operations performed by video decoder 300. These units may be implemented as fixed function circuits, programmable circuits or a combination thereof. Similar to fig. 6, the fixed function circuit refers to a circuit that provides a specific function and presets operations that can be performed. Programmable circuitry refers to circuitry that can be programmed to perform various tasks and provide flexible functionality in the operations that can be performed. For example, the programmable circuit may run software or firmware that causes the programmable circuit to operate in a manner defined by instructions of the software or firmware. Fixed function circuitry may execute software instructions (e.g., to receive parameters or output parameters) but the type of operation that fixed function circuitry performs is typically not variable. In some examples, one or more of the units may be different circuit blocks (fixed function or programmable), and in some examples, one or more of the units may be integrated circuits.
The video decoder 300 may include an ALU, an EFU, a digital circuit, an analog circuit, and/or a programmable core formed of programmable circuits. In examples where the operations of video decoder 300 are performed by software running on programmable circuits, on-chip or off-chip memory may store instructions (e.g., object code) of the software received and run by video decoder 300.
The entropy decoding unit 302 may receive encoded video data from the CPB and entropy decode the video data to reproduce syntax elements. The prediction processing unit 304, the inverse quantization unit 306, the inverse transform processing unit 308, the reconstruction unit 310, and the filter unit 312 may generate decoded video data based on syntax elements extracted from the bitstream.
Typically, video decoder 300 reconstructs the pictures on a block-by-block basis. The video decoder 300 may perform the reconstruction operation on each block separately (where the block currently being reconstructed (i.e., decoded) may be referred to as the "current block").
The entropy decoding unit 302 may entropy decode syntax elements defining quantized transform coefficients of the quantized transform coefficient block, as well as transform information, such as Quantization Parameter (QP) and/or transform mode indication(s). The inverse quantization unit 306 may determine a quantization degree using a QP associated with the quantized transform coefficient block and, as such, an inverse quantization degree for the inverse quantization unit 306 to apply. The inverse quantization unit 306 may, for example, perform a bitwise left shift operation to inversely quantize the quantized transform coefficients. The inverse quantization unit 306 may thus form a transform coefficient block including the transform coefficients.
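For purposes of illustration only, the relationship between the QP and the bitwise left shift may be sketched as follows. The sketch assumes an HEVC-style scheme in which the quantization step size doubles for every increase of 6 in QP; the `LEVEL_SCALE` table follows that scheme, while the function name `dequantize` and the omission of scaling lists, normalization shifts, and clipping are simplifying assumptions rather than requirements of this disclosure.

```python
# HEVC-style level scale table: one multiplier per value of QP % 6.
LEVEL_SCALE = [40, 45, 51, 57, 64, 72]


def dequantize(levels, qp):
    """Scale quantized levels back to transform coefficients (simplified).

    Assumption: the step size doubles every 6 QP, so QP // 6 becomes a
    bitwise left shift and QP % 6 selects a multiplier. Normalization
    shifts, scaling lists, and clipping are intentionally omitted.
    """
    scale = LEVEL_SCALE[qp % 6]
    shift = qp // 6
    return [[(level * scale) << shift for level in row] for row in levels]


quantized_block = [[3, -1], [0, 2]]
coeff_block = dequantize(quantized_block, qp=12)
```

In this sketch, increasing the QP by 6 doubles every reconstructed coefficient, which is why the operation can be expressed as a shift.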
After the inverse quantization unit 306 forms the transform coefficient block, the inverse transform processing unit 308 may apply one or more inverse transforms to the transform coefficient block to generate a residual block associated with the current block. For example, the inverse transform processing unit 308 may apply an inverse DCT, an inverse integer transform, an inverse Karhunen-Loeve transform (KLT), an inverse rotational transform, an inverse directional transform, or another inverse transform to the transform coefficient block.
Further, the prediction processing unit 304 generates a prediction block from the prediction information syntax element entropy-decoded by the entropy decoding unit 302. For example, if the prediction information syntax element indicates that the current block is inter predicted, the motion compensation unit 316 may generate the prediction block. In this case, the prediction information syntax element may indicate a reference picture in the DPB 314 from which the reference block is retrieved, and a motion vector identifying a position of the reference block in the reference picture relative to a position of the current block in the current picture. Motion compensation unit 316 may generally perform the inter-prediction process in a substantially similar manner as described with respect to motion compensation unit 224 (fig. 6).
As another example, if the prediction information syntax element indicates that the current block is intra-predicted, the intra-prediction unit 318 may generate the prediction block according to the intra-prediction mode indicated by the prediction information syntax element. Again, intra-prediction unit 318 may generally perform an intra-prediction process in a manner substantially similar to that described with respect to intra-prediction unit 226 (fig. 6). Intra-prediction unit 318 may retrieve data for neighboring samples of the current block from DPB 314.
The reconstruction unit 310 may reconstruct the current block using the prediction block and the residual block. For example, the reconstruction unit 310 may add samples of the residual block to corresponding samples of the prediction block to reconstruct the current block.
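As a minimal sketch of this sample-wise addition, consider the following. Clipping the sum to the valid sample range for a given bit depth is an assumption added for the illustration, and the function name `reconstruct` is hypothetical.

```python
def reconstruct(pred_block, resid_block, bit_depth=8):
    """Add residual samples to the corresponding prediction samples.

    Assumption: the result is clipped to [0, 2**bit_depth - 1].
    """
    max_val = (1 << bit_depth) - 1
    return [
        [min(max(p + r, 0), max_val) for p, r in zip(pred_row, resid_row)]
        for pred_row, resid_row in zip(pred_block, resid_block)
    ]


pred = [[100, 250], [5, 128]]
resid = [[-4, 10], [-9, 0]]
recon = reconstruct(pred, resid)
```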
The filter unit 312 may perform one or more filtering operations on the reconstructed block. For example, the filter unit 312 may perform a deblocking operation to reduce blocking artifacts along edges of the reconstructed block. The operation of the filter unit 312 is not necessarily performed in all examples. For example, the video decoder 300 may explicitly or implicitly determine whether to perform neural network model filtering using the NN model 322, e.g., using any or all of the various techniques discussed herein. Further, the video decoder 300 may explicitly or implicitly determine the one or more NN models 322 and/or the grid size of the current picture to be decoded and filtered. Accordingly, when filtering is turned on, the filter unit 312 may use one or more NN models 322 to filter a portion of the currently decoded picture.
In some examples, filter unit 312 may perform deblocking operations to reduce blockiness artifacts along edges of a CU, PU, or TU. When performing the deblocking operation, the filter unit 312 (or its deblocking filter unit) may first calculate the boundary strength value. The deblocking filter unit of filter unit 312 may then use the boundary strength values to determine other deblocking parameters, such as tC and beta (β), which may generally represent the filter strength and coefficients for the deblocking and/or deblocking decision functions. In some examples, the operation of filter unit 312 may be skipped.
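For illustration only, one way a deblocking filter unit might derive a boundary strength value is sketched below. The rules are a simplified form of an HEVC-style derivation: an intra-coded block on either side gives 2; nonzero residual coefficients, different reference pictures, or a motion vector difference of at least one integer sample give 1; otherwise the value is 0. The `BlockInfo` type and the `boundary_strength` function are hypothetical names, a single reference picture per block is assumed, and the subsequent derivation of tC and β is omitted.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class BlockInfo:
    """Hypothetical per-block data for the blocks on either side of an edge."""
    is_intra: bool
    has_nonzero_coeffs: bool
    ref_poc: int = 0                # POC of the reference picture (inter only)
    mv: Tuple[int, int] = (0, 0)    # motion vector in quarter-sample units


def boundary_strength(p: BlockInfo, q: BlockInfo) -> int:
    """Return a boundary strength of 0, 1, or 2 for the edge between p and q."""
    if p.is_intra or q.is_intra:
        return 2
    if p.has_nonzero_coeffs or q.has_nonzero_coeffs:
        return 1
    if p.ref_poc != q.ref_poc:
        return 1
    # A difference of 4 quarter-sample units equals one integer sample.
    if abs(p.mv[0] - q.mv[0]) >= 4 or abs(p.mv[1] - q.mv[1]) >= 4:
        return 1
    return 0


intra = BlockInfo(is_intra=True, has_nonzero_coeffs=False)
inter_a = BlockInfo(is_intra=False, has_nonzero_coeffs=False, ref_poc=8, mv=(0, 0))
inter_b = BlockInfo(is_intra=False, has_nonzero_coeffs=False, ref_poc=8, mv=(5, 0))
```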
The filter unit 312 may be configured to perform various techniques of the present disclosure, such as determining one or more neural network models (NN models) 322 to be used for filtering the decoded picture and/or whether to apply NN model filtering. The entropy decoding unit 302 may decode data representing whether boundary filtering is performed on blocks of a particular picture, slice, or other unit.
The filter unit 312 may receive the decoded (reconstructed) picture from the reconstruction unit 310. The filter unit 312 may also obtain (e.g., receive) additional data from one or more other units (e.g., the prediction processing unit 304, the motion compensation unit 316, the intra-prediction unit 318, the inverse transform processing unit 308, the inverse quantization unit 306, another filter unit (e.g., separate from the filter unit 312 or contained within the filter unit 312), etc.). For example, the filter unit 312 may perform both NN-based filtering and other types of filtering (such as deblocking filtering, SAO filtering, ALF filtering, etc.). The filter unit 312 may thus obtain deblocking parameters, such as boundary strength values, for the block of the current picture. The filter unit 312 may provide boundary strength values (and/or other received data) to the NN filter unit of the filter unit 312.
The NN filtering unit may use additional data (e.g., boundary strength values and/or other data received from other units of the video decoder 300) to select an NN model from the NN models 322 for performing NN-based filtering and perform actual NN-based filtering. For example, the NN filtering unit of the filter unit 312 may provide additional data (e.g., boundary strength values) to the selected one or more NN models of the NN model 322 in the form of additional input layers. In some examples, filter unit 312 may modify the additional data, for example, by converting the representation format (such as between floating point and decimal) or modifying the range of additional values to correspond to the input sample range.
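A minimal sketch of modifying the range of the additional data and supplying it alongside the picture samples follows. It assumes that boundary strength values lie in 0..2 and are stretched linearly to the sample range for the given bit depth; the function names `bs_to_input_plane` and `stack_inputs` and the channels-first layout are hypothetical choices made for the illustration.

```python
def bs_to_input_plane(bs_plane, bit_depth=10, max_bs=2):
    """Rescale boundary strength values to the range of the input samples."""
    max_sample = (1 << bit_depth) - 1
    return [[bs * max_sample // max_bs for bs in row] for row in bs_plane]


def stack_inputs(sample_plane, *extra_planes):
    """Form a channels-first input: [samples, extra_1, extra_2, ...]."""
    height, width = len(sample_plane), len(sample_plane[0])
    for plane in extra_planes:
        if len(plane) != height or any(len(row) != width for row in plane):
            raise ValueError("additional planes must match the sample plane size")
    return [sample_plane, *extra_planes]


samples = [[512, 600], [480, 700]]
bs_plane = bs_to_input_plane([[0, 2], [1, 0]])
nn_input = stack_inputs(samples, bs_plane)
```

Keeping the additional plane in the same numeric range as the samples is one way to avoid the additional input being dominated by, or dominating, the sample values.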
Video decoder 300 may store the reconstructed (and filtered) blocks in DPB 314. For example, in an example in which the operation of filter unit 312 is not performed, reconstruction unit 310 may store the reconstructed block to DPB 314. In an example of performing the operation of filter unit 312, filter unit 312 may store the filtered reconstructed block to DPB 314. As described above, DPB 314 may provide reference information to prediction processing unit 304, such as samples of a current picture for intra prediction and a previously decoded picture for subsequent motion compensation. Further, video decoder 300 may output the decoded pictures from DPB 314 for subsequent presentation on a display device, such as display device 118 of fig. 1.
Fig. 8 is a flowchart illustrating an example method for encoding a current block in accordance with the techniques of this disclosure. The current block may include the current CU. Although described with respect to video encoder 200 (fig. 1 and 6), it should be understood that other devices may also be configured to perform a method similar to that of fig. 8.
In this example, video encoder 200 first predicts the current block (350). For example, the video encoder 200 may form a prediction block of the current block. The video encoder 200 may then calculate a residual block for the current block (352). To calculate the residual block, the video encoder 200 may calculate the difference between the original, unencoded block and the prediction block of the current block. The video encoder 200 may then transform and quantize coefficients of the residual block (354). Next, video encoder 200 may scan the quantized transform coefficients of the residual block (356). During or after scanning, the video encoder 200 may entropy encode the transform coefficients (358). For example, the video encoder 200 may encode the coefficients using CAVLC or CABAC. The video encoder 200 may then output entropy encoded data of the block (360).
The video encoder 200 may also decode the current block after encoding the current block, in order to use the decoded version of the current block as reference data for subsequently coded data (e.g., in inter or intra prediction mode). Accordingly, the video encoder 200 may inverse quantize and inverse transform the coefficients to reproduce the residual block (362). The video encoder 200 may combine the residual block with the prediction block to form a decoded block (364). The video encoder 200 may decode all blocks of the current picture in this manner, forming a fully decoded picture. Video encoder 200 may also filter the decoded picture including the decoded block according to any of the various techniques of this disclosure (366). Video encoder 200 may then store the decoded picture in DPB 218 (368).
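The reason the encoder decodes its own output can be seen in a toy round trip, sketched below under strong simplifying assumptions: the transform is omitted, quantization is uniform with a single step size, and the function name `encode_and_reconstruct` is hypothetical. The decoded block generally differs from the original block, so using it as the reference keeps the encoder and a decoder in agreement.

```python
def encode_and_reconstruct(original, prediction, step):
    """Toy version of steps 352-354 and 362-364 (no transform, uniform step)."""
    # (352) residual = original - prediction
    residual = [[o - p for o, p in zip(orow, prow)]
                for orow, prow in zip(original, prediction)]

    # (354) quantize with rounding to the nearest level, using integers only
    def quantize(r):
        magnitude = (abs(r) + step // 2) // step
        return magnitude if r >= 0 else -magnitude

    levels = [[quantize(r) for r in row] for row in residual]
    # (362) inverse quantize to reproduce an approximate residual
    recon_residual = [[level * step for level in row] for row in levels]
    # (364) combine with the prediction to form the decoded block
    decoded = [[p + r for p, r in zip(prow, rrow)]
               for prow, rrow in zip(prediction, recon_residual)]
    return levels, decoded


original = [[52, 60], [47, 41]]
prediction = [[50, 50], [50, 50]]
levels, decoded = encode_and_reconstruct(original, prediction, step=4)
```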
Fig. 9 is a flowchart illustrating an example method for decoding a current block in accordance with the techniques of this disclosure. The current block may include the current CU. Although described with respect to video decoder 300 (fig. 1 and 7), it should be understood that other devices may be configured to perform a method similar to that of fig. 9.
The video decoder 300 may receive entropy encoded data of the current block, such as entropy encoded prediction information and entropy encoded data of coefficients of a residual block corresponding to the current block (370). The video decoder 300 may entropy decode the entropy encoded data to determine prediction information of the current block and reproduce coefficients of the residual block (372). The video decoder 300 may predict the current block (374) (e.g., using intra or inter prediction modes indicated by the prediction information of the current block) to calculate a prediction block of the current block. The video decoder 300 may then inverse scan the reproduced coefficients (376) to create a block of quantized transform coefficients. The video decoder 300 may then inverse quantize and inverse transform the quantized transform coefficients to generate a residual block (378). The video decoder 300 may ultimately decode the current block by combining the prediction block and the residual block (380). The video decoder 300 may also filter (382) the decoded video data, e.g., using one or more NN models as discussed above in accordance with the techniques of the present disclosure. Video decoder 300 may also store the (filtered) decoded video data in, for example, DPB 314 (384).
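As one illustration of the inverse scan of step 376, the sketch below places a one-dimensional list of coefficients back into a two-dimensional block. It assumes a diagonal scan that visits each anti-diagonal from bottom-left to top-right; the actual scan order depends on the codec and coding mode, and the names `diagonal_scan_order` and `inverse_scan` are hypothetical.

```python
def diagonal_scan_order(size):
    """Return (row, col) positions of a size x size block in diagonal order."""
    order = []
    for diagonal in range(2 * size - 1):
        for row in range(size - 1, -1, -1):
            col = diagonal - row
            if 0 <= col < size:
                order.append((row, col))
    return order


def inverse_scan(coeff_list, size):
    """Place scanned coefficients back into a 2-D block of quantized levels."""
    block = [[0] * size for _ in range(size)]
    for value, (row, col) in zip(coeff_list, diagonal_scan_order(size)):
        block[row][col] = value
    return block


levels_2x2 = inverse_scan([9, 3, -2, 0], size=2)
```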
Fig. 10 is a flowchart illustrating an example method of filtering decoded video data in accordance with the techniques of this disclosure. The method of fig. 10 may be performed by the video encoder 200 or the video decoder 300. For example, the method of fig. 10 may be performed by a Neural Network (NN) filtering unit of the filter unit 216 of the video encoder 200, e.g., during step 366 of the method of fig. 8. As another example, the method of fig. 10 may be performed by the NN filtering unit of the filter unit 312 of the video decoder 300, for example, during step 382 of the method of fig. 9. For purposes of example and explanation, the method of fig. 10 is explained with respect to the video decoder 300 (and in particular, the NN filtering unit of the filter unit 312 of the video decoder 300).
First, the NN filtering unit of the filter unit 312 receives the decoded video data (400). The decoded video data may be at least a portion of a picture, e.g., a block set, a slice, one or more slices, a sub-picture, or an entire picture.
Although not shown in the example of fig. 10, the deblocking filter of the filter unit 312 and/or other filtering units (such as an SAO filter, an ALF filter, and/or another NN filtering unit) may first filter the decoded video data. Thus, the received decoded video data may have been pre-filtered (e.g., by deblocking, SAO filtering, ALF filtering, and/or NN filtering) prior to being received by the NN filtering unit that performs the method of fig. 10 as discussed herein. As such, the term "decoded video data" should be understood to include filtered, decoded video data (which may include, for example, deblocked, decoded video data).
Accordingly, the at least a portion may include a plurality of blocks including edges that were deblocked, for example, by a deblocking filter of filter unit 312. The deblocking filter may calculate boundary strength values for boundaries between blocks. The boundary strength value may generally represent whether and to what extent samples near the block boundary are modified by the deblocking filter.
The NN filtering unit of the filter unit 312 may receive additional data from one or more other units of the video decoder 300 (402). For example, the NN filtering unit of the filter unit 312 may receive the boundary strength values from the deblocking filter of the filter unit 312. The NN filtering unit of the filter unit 312 may additionally or alternatively receive other data from other units, such as Coding Unit (CU) partition data, Prediction Unit (PU) partition data, Transform Unit (TU) partition data, deblocking filter data, Quantization Parameter (QP) data, intra prediction data, inter prediction data, data representing a distance (e.g., POC distance) between a decoded picture and one or more reference pictures, or motion information of one or more decoded blocks of the decoded picture.
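By way of a non-limiting sketch, data representing a POC distance may be expanded from per-block values into a sample-aligned plane as follows. The block description format, the use of the absolute POC difference, and the value 0 for intra-predicted blocks are assumptions made for the illustration; `poc_distance_plane` is a hypothetical name.

```python
def poc_distance_plane(height, width, blocks, current_poc):
    """Build a plane holding, per sample, the POC distance of its block.

    blocks: list of (top, left, block_h, block_w, ref_poc), where ref_poc is
    None for an intra-predicted block (assumed to map to distance 0).
    """
    plane = [[0] * width for _ in range(height)]
    for top, left, block_h, block_w, ref_poc in blocks:
        distance = 0 if ref_poc is None else abs(current_poc - ref_poc)
        for y in range(top, top + block_h):
            for x in range(left, left + block_w):
                plane[y][x] = distance
    return plane


blocks = [(0, 0, 2, 2, 8), (0, 2, 2, 2, None)]
distance_plane = poc_distance_plane(2, 4, blocks, current_poc=10)
```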
The NN filtering unit of the filter unit 312 may then determine one or more of the Neural Network (NN) models 322 for filtering at least a portion of the current picture (404). The NN filtering unit of the filter unit 312 may then use the received additional data and the determined one or more NN models to filter at least a portion of the current picture (406). In some examples, the NN filtering unit of the filter unit 312 may modify the additional data, for example, by adjusting a range of values of the additional data to fit a range of input sample values and/or by converting the values of the additional data to a different format, such as integer or floating point. The NN filtering unit of the filter unit 312 may convert the additional data to one or more input planes, similar to the luma and/or chroma planes to be filtered.
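The flow of steps 402 to 406 can be sketched as below. This is an illustration only: the "models" are hand-written stand-ins rather than trained neural networks, selecting a model by comparing the QP against a threshold is an assumption, and all names (`make_toy_model`, `NN_MODELS`, `nn_filter`, `qp_threshold`) are hypothetical.

```python
def make_toy_model(strength):
    """Hypothetical stand-in for a trained NN model.

    Pulls each sample toward the mean of its row, more strongly where the
    normalized boundary strength is higher.
    """
    def model(samples, bs_plane):
        filtered = []
        for sample_row, bs_row in zip(samples, bs_plane):
            mean = sum(sample_row) / len(sample_row)
            filtered.append([s + strength * b * (mean - s)
                             for s, b in zip(sample_row, bs_row)])
        return filtered
    return model


NN_MODELS = {"low_qp": make_toy_model(0.1), "high_qp": make_toy_model(0.4)}


def nn_filter(samples, bs_values, qp, qp_threshold=32, max_bs=2):
    # (402) additional data arrives as integers; convert to floats in [0, 1].
    bs_plane = [[b / max_bs for b in row] for row in bs_values]
    # (404) determine the model; choosing by QP is an assumption.
    model = NN_MODELS["high_qp" if qp >= qp_threshold else "low_qp"]
    # (406) filter using both the samples and the additional input plane.
    return model(samples, bs_plane)


filtered = nn_filter([[100.0, 120.0]], [[0, 2]], qp=37)
```

In this toy setting, a sample whose boundary strength is 0 passes through unchanged, while a sample at a strong boundary is modified by the selected model.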
In this way, the method of fig. 10 represents an example of a method of filtering decoded video data, comprising: receiving, by a neural network filtering unit of a video decoding device, data of a decoded picture of video data; receiving, by the neural network filtering unit, data from one or more other units of the video decoding device, the data from the one or more other units being different from the data of the decoded picture, and wherein receiving the data from the one or more other units comprises receiving boundary strength data from the deblocking unit; determining, by a neural network filtering unit, one or more neural network models for filtering a portion of the decoded picture; and filtering, by the neural network filtering unit, a portion of the decoded picture using the one or more neural network models and data from one or more other units of the video decoding device, including boundary strength data.
Some examples of the technology of the present disclosure are summarized in the following clauses:
Clause 1: a method of filtering decoded video data, the method comprising: receiving, by a filtering unit of a video decoding device, data from one or more other units of the video decoding device; determining, by a filtering unit, one or more neural network models to be used for filtering a portion of a decoded picture of video data; and filtering, by the filtering unit, a portion of the decoded picture using the one or more neural network models and data from one or more other units of the video decoding device.
Clause 2: the method of clause 1, wherein receiving data from one or more other units comprises receiving data from one or more of the following: an intra prediction unit of the video decoding apparatus; an inter prediction unit of the video decoding apparatus; a transform processing unit of the video decoding apparatus; a quantization unit of the video decoding apparatus; a loop filter unit of the video decoding apparatus; a preprocessing unit of the video decoding apparatus; or a second filtering unit of the video decoding apparatus.
Clause 3: the method of clause 2, wherein the filtering unit comprises a first neural network based filtering unit.
Clause 4: the method of any of clauses 2 and 3, wherein the loop filter unit comprises at least one of a deblocking filter unit, a Sample Adaptive Offset (SAO) filter unit, or an Adaptive Loop Filter (ALF) unit.
Clause 5: the method of any of clauses 1-4, wherein receiving data comprises receiving one or more of Coding Unit (CU) partition data, Prediction Unit (PU) partition data, Transform Unit (TU) partition data, deblocking filter data, Quantization Parameter (QP) data, intra prediction data, inter prediction data, data representing a distance between a decoded picture and one or more reference pictures, or motion information of one or more decoded blocks of the decoded picture.
Clause 6: the method of clause 5, wherein the deblocking filter data includes one or more of boundary strength values, whether a long or short filter is used for deblocking, or whether a strong or weak filter is used for deblocking.
Clause 7: the method according to any one of clauses 5 and 6, wherein the intra-prediction data comprises an intra-prediction mode.
Clause 8: the method of any of clauses 5-7, wherein the data representing the distance comprises data representing a difference between a Picture Order Count (POC) value of the decoded picture and a POC value of a reference picture used to predict a block of the decoded picture.
Clause 9: the method according to any of clauses 1-8, further comprising performing, by the filtering unit, a function attributed to one or more other units.
Clause 10: the method of any of clauses 1-9, wherein filtering a portion of the decoded picture using the one or more neural network models and data from one or more other units of the video decoding device comprises providing the data from the one or more other units of the video decoding device to a Convolutional Neural Network (CNN) as one or more additional input planes.
Clause 11: the method of any of clauses 1-10, wherein filtering a portion of the decoded picture using the one or more neural network models and data from one or more other units of the video decoding device comprises adjusting an output of the one or more neural network models using data from the one or more other units of the video decoding device.
Clause 12: the method of any of clauses 1-11, further comprising adjusting data from one or more other units of the video decoding device prior to filtering a portion of the decoded picture.
Clause 13: the method of clause 12, wherein adjusting the data includes converting values of the data between an integer representation and a floating point representation.
Clause 14: the method of any of clauses 12 and 13, wherein adjusting the data comprises scaling the value of the data to be within a range of values suitable for one or more neural network models.
Clause 15: the method of any of clauses 1-14, wherein receiving data comprises receiving partition data of the decoded picture, and wherein filtering a portion of the decoded picture using one or more neural network models and data from one or more other units of a video decoding device comprises: setting a value in an input plane, at a position collocated with a position of a boundary sample that defines a partition boundary in the decoded picture as indicated by the partition data, to a first value; setting a value in the input plane, at a position collocated with a position of an interior sample that is not a boundary sample, to a second value; and filtering a portion of the decoded picture using the input plane as an input to at least one of the one or more neural network models.
Clause 16: the method of clause 15, wherein the first value comprises 1 and the second value comprises 0.
Clause 17: the method of any of clauses 15 and 16, wherein the partition data comprises Coding Unit (CU) partition data and the input plane comprises a first input plane, the method further comprising: receiving Prediction Unit (PU) partition data; forming a second input plane using the PU partition data; receiving Transform Unit (TU) partition data; and forming a third input plane using the TU partition data, wherein filtering a portion of the decoded picture using the one or more neural network models and data from the one or more other units of the video decoding device includes filtering a portion of the decoded picture using the first input plane, the second input plane, and the third input plane as inputs to at least one of the one or more neural network models.
Clause 18: the method of any of clauses 1-14, wherein receiving data comprises receiving deblocking filter data of the decoded picture, and wherein filtering a portion of the decoded picture using one or more neural network models and data from one or more other units of a video decoding device comprises: converting deblocking filter data of the decoded picture to one or more input planes of at least one of the one or more neural network models; and filtering a portion of the decoded picture using the one or more input planes as input to at least one of the one or more neural network models.
Clause 19: the method of clause 18, wherein the deblocking filter data comprises one or more of boundary strength data, long or short filter data, or strong or weak filter data.
Clause 20: the method of any of clauses 10 or 15-19, further comprising upsampling or downsampling data of the input plane(s).
Clause 21: the method according to any of clauses 1-20, further comprising: encoding a current picture; and decoding the current picture to form a decoded picture.
Clause 22: the method of clause 21, wherein determining the one or more neural network models comprises determining the one or more neural network models based on a rate-distortion calculation.
Clause 23: an apparatus for filtering decoded video data, the apparatus comprising one or more components for performing the method of any of clauses 1-22.
Clause 24: the apparatus of clause 23, wherein the one or more components comprise one or more processors implemented in circuitry.
Clause 25: the apparatus of clause 23, further comprising a display configured to display the decoded video data.
Clause 26: the device of clause 23, wherein the device comprises one or more of a camera, a computer, a mobile device, a broadcast receiver device, or a set-top box.
Clause 27: the apparatus of clause 23, further comprising a memory configured to store video data.
Clause 28: a computer readable storage medium having instructions stored thereon that, when executed, cause a processor to perform the method of any of clauses 1-22.
Clause 29: an apparatus for filtering decoded video data, the apparatus comprising a filtering unit comprising: means for receiving data from one or more units of the video decoding apparatus other than the filtering unit; means for determining one or more neural network models for filtering a portion of a decoded picture of video data; and means for filtering a portion of the decoded picture using the one or more neural network models and data from one or more other units of the video decoding device.
Clause 30: a method of filtering decoded video data, the method comprising: receiving, by a neural network filtering unit of a video decoding device, data of a decoded picture of video data; receiving, by the neural network filtering unit, data from one or more other units of the video decoding device, the data from the one or more other units being different from the data of the decoded picture, and wherein receiving the data from the one or more other units of the video decoding device comprises receiving boundary strength data from a deblocking unit of the video decoding device; determining, by a neural network filtering unit, one or more neural network models for filtering a portion of the decoded picture; and filtering, by the neural network filtering unit, a portion of the decoded picture using the one or more neural network models and data from one or more other units of the video decoding device, including boundary strength data.
Clause 31: the method of clause 30, wherein receiving data from one or more other units of the video decoding device further comprises receiving data from one or more of the following: an intra prediction unit of the video decoding apparatus; an inter prediction unit of the video decoding apparatus; a transform processing unit of the video decoding apparatus; a quantization unit of the video decoding apparatus; a loop filter unit of the video decoding apparatus; a preprocessing unit of the video decoding apparatus; or a second neural network filtering unit of the video decoding device.
Clause 32: the method of clause 31, wherein the loop filter unit comprises at least one of a Sample Adaptive Offset (SAO) filtering unit or an Adaptive Loop Filtering (ALF) unit.
Clause 33: the method of clause 32, wherein receiving data further comprises receiving one or more of Coding Unit (CU) partition data, Prediction Unit (PU) partition data, Transform Unit (TU) partition data, deblocking filter data, Quantization Parameter (QP) data, intra prediction data, inter prediction data, data representing a distance between a decoded picture and one or more reference pictures, or motion information of one or more decoded blocks of the decoded picture.
Clause 34: the method of clause 33, wherein the deblocking filter data includes one or more of whether a long or short filter is used for deblocking or whether a strong or weak filter is used for deblocking.
Clause 35: the method of clause 33, wherein the intra-prediction data comprises an intra-prediction mode.
Clause 36: the method of clause 33, wherein the data representing the distance comprises data representing a difference between a Picture Order Count (POC) value of the decoded picture and a POC value of a reference picture used to predict a block of the decoded picture.
Clause 37: the method of clause 30, wherein filtering the portion of the decoded picture using the one or more neural network models and the data from the one or more other units of the video decoding device comprises providing the data from the one or more other units of the video decoding device to a Convolutional Neural Network (CNN) as one or more additional input planes.
Clause 38: the method of clause 30, wherein filtering the portion of the decoded picture using the one or more neural network models and data from the one or more other units of the video decoding device comprises adjusting an output of the one or more neural network models using data from the one or more other units of the video decoding device.
Clause 39: the method of clause 30, further comprising adjusting data from one or more other units of the video decoding device prior to filtering a portion of the decoded picture.
Clause 40: the method of clause 39, wherein adjusting the data includes converting the value of the data between an integer representation and a floating point representation.
Clause 41: the method of clause 39, wherein adjusting the data comprises scaling the value of the data to be within a range of values suitable for the one or more neural network models.
Clause 42: the method of clause 30, wherein receiving the data comprises receiving partition data of the decoded picture, and wherein filtering a portion of the decoded picture using the one or more neural network models and the data from the one or more other units of the video decoding device comprises: setting a value in an input plane, at a position collocated with a position of a boundary sample that defines a partition boundary in the decoded picture as indicated by the partition data, to a first value; setting a value in the input plane, at a position collocated with a position of an interior sample that is not a boundary sample, to a second value; and filtering a portion of the decoded picture using the input plane as an input to at least one of the one or more neural network models.
Clause 43: the method of clause 42, wherein the first value comprises 1 and the second value comprises 0.
Clause 44: the method of clause 42, wherein the partition data comprises Coding Unit (CU) partition data and the input plane comprises a first input plane, the method further comprising: receiving Prediction Unit (PU) partition data; forming a second input plane using the PU partition data; receiving Transform Unit (TU) partition data; and forming a third input plane using the TU partition data, wherein filtering a portion of the decoded picture using the one or more neural network models and data from the one or more other units of the video decoding device includes filtering a portion of the decoded picture using the first input plane, the second input plane, and the third input plane as inputs to at least one of the one or more neural network models.
Clause 45: the method of clause 30, wherein filtering a portion of the decoded picture using the one or more neural network models and data from one or more other units of the video decoding device comprises: converting deblocking filter data of the decoded picture, received from the deblocking unit, to one or more input planes of at least one of the one or more neural network models; and filtering a portion of the decoded picture using the one or more input planes as input to at least one of the one or more neural network models.
Clause 46: the method according to clause 30, further comprising: encoding a current picture; and decoding the current picture to form a decoded picture.
Clause 47: the method of clause 46, wherein determining the one or more neural network models comprises determining the one or more neural network models from a rate-distortion calculation.
Clause 48: an apparatus for filtering decoded video data, the apparatus comprising: a memory configured to store decoded pictures of video data; and one or more processors implemented in circuitry and configured to operate a neural network filtering unit to: receive data from one or more other units of the apparatus, the data from the one or more other units of the apparatus being different from the data of the decoded picture, and wherein to receive the data from the one or more other units of the apparatus, the one or more processors are configured to operate the neural network filtering unit to receive boundary strength data from a deblocking unit of the apparatus; determine one or more neural network models for filtering a portion of the decoded picture; and filter a portion of the decoded picture using the one or more neural network models and data from one or more other units of the apparatus, including boundary strength data.
Clause 49: the apparatus of clause 48, wherein to receive data from one or more other units of the apparatus, the one or more processors are further configured to operate the neural network filtering unit to receive data from one or more of the following: an intra prediction unit of the apparatus; an inter prediction unit of the apparatus; a transform processing unit of the apparatus; a quantization unit of the apparatus; a loop filter unit of the apparatus; a preprocessing unit of the apparatus; or a second neural network filtering unit of the apparatus.
Clause 50: the apparatus of clause 48, wherein to receive data from one or more other units of the apparatus, the one or more processors are further configured to operate the neural network filtering unit to receive one or more of Coding Unit (CU) partition data, Prediction Unit (PU) partition data, Transform Unit (TU) partition data, deblocking filter data, Quantization Parameter (QP) data, intra prediction data, inter prediction data, data representing a distance between a decoded picture and one or more reference pictures, or motion information of one or more decoded blocks of the decoded picture.
Clause 51: the apparatus of clause 48, wherein to filter a portion of the decoded picture using the one or more neural network models and data from one or more other units of the apparatus, the one or more processors are configured to run the neural network filtering unit to provide the data from the one or more other units of the apparatus as one or more additional input planes to a Convolutional Neural Network (CNN).
Clause 52: the apparatus of clause 48, wherein to filter a portion of the decoded picture using the one or more neural network models and data from one or more other units of the apparatus, the one or more processors are configured to run the neural network filtering unit to adjust an output of the one or more neural network models using the data from the one or more other units of the apparatus.
Clause 53: the apparatus of clause 48, wherein to filter a portion of the decoded picture using the one or more neural network models and data from one or more other units of the apparatus, the one or more processors are configured to operate the neural network filtering unit to adjust the data from the one or more other units of the apparatus prior to filtering the portion of the decoded picture.
Clause 54: the device of clause 48, wherein to filter a portion of the decoded picture using the one or more neural network models and data from one or more other units of the device, the one or more processors are configured to operate the neural network filtering unit to: convert deblocking filter data of the decoded picture from the deblocking unit to one or more input planes of at least one of the one or more neural network models; and filter a portion of the decoded picture using the one or more input planes as input to at least one of the one or more neural network models.
Clause 55: the apparatus of clause 48, further comprising a display configured to display the decoded picture of the video data.
Clause 56: the device of clause 48, wherein the device comprises one or more of a camera, a computer, a mobile device, a broadcast receiver device, or a set-top box.
Clause 57: a computer readable storage medium having instructions stored thereon that, when executed, cause a processor of a video decoding device to operate a neural network filtering unit to: receiving data of a decoded picture of video data; receiving data from one or more other units of the video decoding device, the data from the one or more other units of the video decoding device being different from the data of the decoded picture, and wherein the instructions that cause the processor to receive the data from the one or more other units of the video decoding device comprise instructions that cause the processor to receive boundary strength data from a deblocking unit of the video decoding device; determining one or more neural network models for filtering a portion of the decoded picture; and filtering a portion of the decoded picture using the one or more neural network models and data from one or more other units of the video decoding device, including boundary strength data.
Clause 58: an apparatus for filtering decoded video data, the apparatus comprising a filtering unit comprising: means for receiving data of a decoded picture of video data; means for receiving data from one or more other units of the video decoding device, the data from the one or more other units being different from the data of the decoded picture, and wherein the means for receiving data from the one or more other units of the video decoding device comprises means for receiving boundary strength data from a deblocking unit of the video decoding device; means for determining one or more neural network models for filtering a portion of the decoded picture; and means for filtering a portion of the decoded picture using the one or more neural network models and data from one or more other units of the video decoding device, including boundary strength data.
Clause 59: a method of filtering decoded video data, the method comprising: receiving, by a neural network filtering unit of a video decoding device, data of a decoded picture of video data; receiving, by the neural network filtering unit, data from one or more other units of the video decoding device, the data from the one or more other units being different from the data of the decoded picture, and wherein receiving the data from the one or more other units of the video decoding device comprises receiving boundary strength data from a deblocking unit of the video decoding device; determining, by a neural network filtering unit, one or more neural network models for filtering a portion of the decoded picture; and filtering, by the neural network filtering unit, a portion of the decoded picture using the one or more neural network models and data from one or more other units of the video decoding device, including boundary strength data.
Clause 60: the method of clause 59, wherein receiving data from one or more other units of the video decoding device further comprises receiving data from one or more of the following: an intra prediction unit of the video decoding device; an inter prediction unit of the video decoding device; a transform processing unit of the video decoding device; a quantization unit of the video decoding device; a loop filter unit of the video decoding device; a preprocessing unit of the video decoding device; or a second neural network filtering unit of the video decoding device.
Clause 61: the method of clause 60, wherein the loop filter unit comprises at least one of a Sample Adaptive Offset (SAO) filtering unit or an Adaptive Loop Filtering (ALF) unit.
Clause 62: the method of any of clauses 59-61, wherein receiving data comprises receiving one or more of Coding Unit (CU) partition data, Prediction Unit (PU) partition data, Transform Unit (TU) partition data, deblocking filter data, Quantization Parameter (QP) data, intra prediction data, inter prediction data, data representing a distance between a decoded picture and one or more reference pictures, or motion information of one or more decoded blocks of the decoded picture.
Clause 63: the method of clause 62, wherein the deblocking filter data indicates one or more of whether a long or short filter is used for deblocking or whether a strong or weak filter is used for deblocking.
Clause 64: the method of any of clauses 62 and 63, wherein the intra-prediction data comprises intra-prediction modes.
Clause 65: the method of any of clauses 62-64, wherein the data representing the distance comprises data representing a difference between a Picture Order Count (POC) value of the decoded picture and a POC value of a reference picture used to predict a block of the decoded picture.
Clause 66: the method of any of clauses 59-65, wherein filtering a portion of the decoded picture using the one or more neural network models and data from one or more other units of the video decoding device comprises providing the data from the one or more other units of the video decoding device to a Convolutional Neural Network (CNN) as one or more additional input planes.
Clause 67: the method of any of clauses 59-66, wherein filtering a portion of the decoded picture using the one or more neural network models and data from one or more other units of the video decoding device comprises adjusting an output of the one or more neural network models using data from the one or more other units of the video decoding device.
Clause 68: the method of any of clauses 59-67, further comprising adjusting data from one or more other units of the video decoding device prior to filtering a portion of the decoded picture.
Clause 69: the method of clause 68, wherein adjusting the data includes converting the value of the data between an integer representation and a floating point representation.
Clause 70: the method of any of clauses 68 and 69, wherein adjusting the data comprises scaling the value of the data to be within a range of values suitable for one or more neural network models.
Clause 71: the method of any of clauses 59-70, wherein receiving data comprises receiving partition data of the decoded picture, and wherein filtering a portion of the decoded picture using the one or more neural network models and data from one or more other units of the video decoding device comprises: setting a value in an input plane at a position collocated with a position of a boundary sample, which defines a partition boundary in the decoded picture as indicated by the partition data, to a first value; setting a value in the input plane at a position collocated with a position of an interior sample, which is a non-boundary sample, to a second value; and filtering a portion of the decoded picture using the input plane as an input to at least one of the one or more neural network models.
Clause 72: the method of clause 71, wherein the first value comprises 1 and the second value comprises 0.
Clause 73: the method of any of clauses 71 and 72, wherein the partition data comprises Coding Unit (CU) partition data and the input plane comprises a first input plane, the method further comprising: receiving Prediction Unit (PU) partition data; forming a second input plane using the PU partition data; receiving Transform Unit (TU) partition data; and forming a third input plane using the TU partition data, wherein filtering a portion of the decoded picture using the one or more neural network models and data from the one or more other units of the video decoding device includes filtering a portion of the decoded picture using the first input plane, the second input plane, and the third input plane as inputs to at least one of the one or more neural network models.
Clause 74: the method of any of clauses 59-73, wherein filtering a portion of the decoded picture using one or more neural network models and data from one or more other units of the video decoding device comprises: converting deblocking filter data of the decoded picture from the deblocking unit to one or more input planes of at least one of the one or more neural network models; and filtering a portion of the decoded picture using the one or more input planes as input to at least one of the one or more neural network models.
Clause 75: the method according to any of clauses 59-74, further comprising: encoding a current picture; and decoding the current picture to form a decoded picture.
Clause 76: the method of any of clauses 59-75, wherein determining one or more neural network models comprises determining the one or more neural network models from a rate-distortion calculation.
Clause 77: an apparatus for filtering decoded video data, the apparatus comprising: a memory configured to store decoded pictures of video data; and one or more processors implemented in the circuitry and configured to operate the neural network filtering unit to: receiving data from one or more other units of the device, the data from the one or more other units of the device being different from the data of the decoded picture, and wherein to receive the data from the one or more other units of the device, the one or more processors are configured to operate the neural network filtering unit to receive boundary strength data from the deblocking unit of the device; determining one or more neural network models for filtering a portion of the decoded picture; and filtering a portion of the decoded picture using the one or more neural network models and data from one or more other units of the device, including boundary strength data.
Clause 78: the device of clause 77, wherein to receive data from one or more other units of the device, the one or more processors are further configured to operate the neural network filtering unit to receive data from one or more of the following: an intra prediction unit of the device; an inter prediction unit of the device; a transform processing unit of the device; a quantization unit of the device; a loop filter unit of the device; a preprocessing unit of the device; or a second neural network filtering unit of the device.
Clause 79: the apparatus of any one of clauses 77 and 78, wherein to receive data from one or more other units of the apparatus, the one or more processors are further configured to operate the neural network filtering unit to receive one or more of Coding Unit (CU) partition data, Prediction Unit (PU) partition data, Transform Unit (TU) partition data, deblocking filter data, Quantization Parameter (QP) data, intra prediction data, inter prediction data, data representing a distance between a decoded picture and one or more reference pictures, or motion information of one or more decoded blocks of the decoded picture.
Clause 80: the apparatus of any of clauses 77-79, wherein to filter a portion of the decoded picture using the one or more neural network models and data from one or more other units of the apparatus, the one or more processors are configured to run the neural network filtering unit to provide the data from the one or more other units of the apparatus as one or more additional input planes to a Convolutional Neural Network (CNN).
Clause 81: the device of any of clauses 77-80, wherein to filter a portion of the decoded picture using the one or more neural network models and data from one or more other units of the device, the one or more processors are configured to run the neural network filtering unit to adjust an output of the one or more neural network models using the data from the one or more other units of the device.
Clause 82: the device of any of clauses 77-81, wherein to filter a portion of the decoded picture using the one or more neural network models and data from one or more other units of the device, the one or more processors are configured to run the neural network filtering unit to adjust the data from the one or more other units of the device prior to filtering the portion of the decoded picture.
Clause 83: the device of any of clauses 77-82, wherein to filter a portion of the decoded picture using one or more neural network models and data from one or more other units of the device, the one or more processors are configured to operate the neural network filtering unit to: convert deblocking filter data of the decoded picture from the deblocking unit to one or more input planes of at least one of the one or more neural network models; and filter a portion of the decoded picture using the one or more input planes as input to at least one of the one or more neural network models.
Clause 84: the device of any of clauses 77-83, further comprising a display configured to display the decoded picture of the video data.
Clause 85: the device of any of clauses 77-84, wherein the device comprises one or more of a camera, a computer, a mobile device, a broadcast receiver device, or a set top box.
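As an illustration of the partition input planes described in clauses 71 to 73, the following Python sketch shows one way such a plane could be built. It is a minimal sketch and not the disclosed implementation: the function name `partition_plane`, the representation of partition data as `(x, y, w, h)` rectangles, and the choice to treat every sample on the perimeter of a block as a boundary sample are assumptions made only for this example.

```python
def partition_plane(width, height, blocks, boundary_value=1, interior_value=0):
    """Build one input plane that marks partition-boundary samples.

    width, height: picture size in samples.
    blocks: iterable of (x, y, w, h) rectangles (hypothetical layout of
            the partition data; the disclosure does not fix a format).
    Samples on a block perimeter get boundary_value (the "first value"),
    all other samples get interior_value (the "second value").
    """
    plane = [[interior_value] * width for _ in range(height)]
    for x, y, w, h in blocks:
        for j in range(y, min(y + h, height)):
            for i in range(x, min(x + w, width)):
                # Assumption: a sample is a boundary sample when it lies
                # on the first or last row or column of its block.
                if i in (x, x + w - 1) or j in (y, y + h - 1):
                    plane[j][i] = boundary_value
    return plane


# Hypothetical 8x4 picture region with separate CU, PU, and TU partitions.
cu_blocks = [(0, 0, 4, 4), (4, 0, 4, 4)]
pu_blocks = [(0, 0, 4, 4), (4, 0, 4, 4)]
tu_blocks = [(0, 0, 2, 4), (2, 0, 2, 4), (4, 0, 4, 4)]

cu_plane = partition_plane(8, 4, cu_blocks)
pu_plane = partition_plane(8, 4, pu_blocks)
tu_plane = partition_plane(8, 4, tu_blocks)

# First, second, and third input planes, as in clause 73.
planes = [cu_plane, pu_plane, tu_plane]
```

In this sketch the three planes are kept separate so that each can be supplied to a neural network model as its own input plane alongside the decoded samples.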
It should be appreciated that certain acts or events of any of the techniques described herein can be performed in a different order, may be added, combined, or omitted entirely, depending on the example (e.g., not all of the described acts or events are necessary for the practice of the technique). Further, in some examples, acts or events may be performed concurrently (e.g., through multi-threaded processing, interrupt processing, or multiple processors) rather than sequentially.
In one or more examples, the described functions may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, these functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium and executed by a hardware-based processing unit. The computer-readable medium may include a computer-readable storage medium corresponding to a tangible medium, such as a data storage medium, or a communication medium including any medium that facilitates transfer of a computer program from one place to another, e.g., according to a communication protocol. In this manner, a computer-readable medium may generally correspond to (1) a non-transitory tangible computer-readable storage medium, or (2) a communication medium such as a signal or carrier wave. Data storage media may be any available media that can be accessed by one or more computers or one or more processors to retrieve instructions, code and/or data structures for use in implementing the techniques described in this disclosure. The computer program product may include a computer-readable medium.
By way of example, and not limitation, such computer-readable storage media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, flash memory, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer. Further, any connection is properly termed a computer-readable medium. For example, if the instructions are transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. However, it should be understood that computer-readable storage media and data storage media do not include connections, carrier waves, signals, or other transitory media, but are instead directed to non-transitory tangible storage media. Disk and disc, as used herein, include Compact Disc (CD), laser disc, optical disc, Digital Versatile Disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
The instructions may be executed by one or more processors, such as one or more Digital Signal Processors (DSPs), general purpose microprocessors, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs), or other equivalent integrated or discrete logic circuitry. Thus, the terms "processor" and "processing circuitry" as used herein may refer to any of the foregoing structures or any other structure suitable for implementation of the techniques described herein. Furthermore, in some aspects, the functionality described herein may be provided in dedicated hardware and/or software modules configured for encoding and decoding, or incorporated in a combined codec. Moreover, the techniques may be implemented entirely in one or more circuits or logic elements.
The techniques of this disclosure may be implemented in a wide variety of devices or apparatuses including a wireless handset, an Integrated Circuit (IC), or a group of ICs (e.g., a chipset). Various components, modules, or units are described in this disclosure to emphasize functional aspects of devices configured to perform the disclosed techniques but do not necessarily require realization by different hardware units. Rather, as described above, the various units may be combined in a codec hardware unit or provided by a collection of interoperable hardware units including one or more processors as described above in combination with appropriate software and/or firmware.
Various examples have been described. These and other examples are within the scope of the following claims.

Claims (32)

1. A method of filtering decoded video data, the method comprising:
receiving, by a neural network filtering unit of a video decoding device, data of a decoded picture of video data;
receiving, by the neural network filtering unit, data from one or more other units of the video decoding device, the data from the one or more other units being different from the data of the decoded picture, and wherein receiving the data from the one or more other units of the video decoding device comprises receiving boundary strength data from a deblocking unit of the video decoding device;
determining, by the neural network filtering unit, one or more neural network models for filtering a portion of the decoded picture; and
filtering, by the neural network filtering unit, the portion of the decoded picture using the one or more neural network models and the data from the one or more other units of the video decoding device, including the boundary strength data.
2. The method of claim 1, wherein receiving the data from the one or more other units of the video decoding device further comprises receiving data from one or more of:
an intra prediction unit of the video decoding device;
an inter prediction unit of the video decoding device;
a transform processing unit of the video decoding device;
a quantization unit of the video decoding device;
a loop filter unit of the video decoding device;
a preprocessing unit of the video decoding device; or
a second neural network filtering unit of the video decoding device.
3. The method of claim 2, wherein the loop filter unit comprises at least one of a sample adaptive offset (SAO) filter unit or an adaptive loop filter (ALF) unit.
4. The method of claim 1, wherein receiving the data further comprises receiving one or more of coding unit (CU) partition data, prediction unit (PU) partition data, transform unit (TU) partition data, deblocking filter data, quantization parameter (QP) data, intra prediction data, inter prediction data, data representing a distance between the decoded picture and one or more reference pictures, or motion information of one or more decoded blocks of the decoded picture.
5. The method of claim 4, wherein the deblocking filter data indicates one or more of whether deblocking is performed using a long filter or a short filter, or whether deblocking is performed using a strong filter or a weak filter.
6. The method of claim 4, wherein the intra-prediction data comprises an intra-prediction mode.
7. The method of claim 4, wherein the data representing the distance comprises data representing a difference between a picture order count (POC) value of the decoded picture and a POC value of a reference picture used to predict a block of the decoded picture.
8. The method of claim 1, wherein filtering the portion of the decoded picture using the one or more neural network models and the data from the one or more other units of the video decoding device comprises providing the data from the one or more other units of the video decoding device to a convolutional neural network (CNN) as one or more additional input planes.
9. The method of claim 1, wherein filtering the portion of the decoded picture using the one or more neural network models and the data from the one or more other units of the video decoding device comprises:
combining a plurality of input planes into a combined input plane, including, for each position (i, j) of the plurality of input planes, setting a value at the position (i, j) of the combined input plane equal to a maximum of the values at the position (i, j) of the plurality of input planes; and
providing the combined input plane to a convolutional neural network (CNN).
10. The method of claim 1, wherein filtering the portion of the decoded picture using the one or more neural network models and the data from the one or more other units of the video decoding device comprises adjusting an output of the one or more neural network models using the data from the one or more other units of the video decoding device.
11. The method of claim 1, further comprising adjusting the data from the one or more other units of the video decoding apparatus prior to filtering the portion of the decoded picture.
12. The method of claim 11, wherein adjusting the data comprises converting a value of the data between an integer representation and a floating point representation.
13. The method of claim 11, wherein adjusting the data comprises scaling values of the data to be within a range of values suitable for the one or more neural network models.
14. The method of claim 1, wherein receiving the data comprises receiving partition data of the decoded picture, and wherein filtering the portion of the decoded picture using the one or more neural network models and the data from the one or more other units of the video decoding device comprises:
setting a value in an input plane at a position collocated with a position of a boundary sample, which defines a partition boundary in the decoded picture as indicated by the partition data, to a first value;
setting a value in the input plane at a position collocated with a position of an interior sample, which is a non-boundary sample, to a second value; and
filtering the portion of the decoded picture using the input plane as an input to at least one of the one or more neural network models.
15. The method of claim 14, wherein the first value comprises 1 and the second value comprises 0.
16. The method of claim 14, wherein the partition data comprises coding unit (CU) partition data and the input plane comprises a first input plane, the method further comprising:
receiving prediction unit (PU) partition data;
forming a second input plane using the PU partition data;
receiving transform unit (TU) partition data; and
forming a third input plane using the TU partition data,
wherein filtering the portion of the decoded picture using the one or more neural network models and the data from the one or more other units of the video decoding device includes filtering the portion of the decoded picture using the first input plane, the second input plane, and the third input plane as inputs to at least one of the one or more neural network models.
17. The method of claim 1, wherein filtering the portion of the decoded picture using the one or more neural network models and the data from the one or more other units of the video decoding device comprises:
converting deblocking filter data of the decoded picture from the deblocking unit to one or more input planes of at least one of the one or more neural network models; and
filtering the portion of the decoded picture using the one or more input planes as input to at least one of the one or more neural network models.
18. The method of claim 1, further comprising:
encoding a current picture; and
decoding the current picture to form the decoded picture.
19. The method of claim 18, wherein determining the one or more neural network models comprises determining the one or more neural network models from a rate-distortion calculation.
20. The method of claim 1, wherein the boundary strength data indicates a boundary strength value of zero.
21. The method of claim 1, wherein the boundary strength data indicates that a boundary strength value is one or two.
22. An apparatus for filtering decoded video data, the apparatus comprising:
a memory configured to store decoded pictures of video data; and
one or more processors implemented in circuitry and configured to operate a neural network filtering unit to:
receive data from one or more other units of the device, the data from the one or more other units of the device being different from the data of the decoded picture, wherein to receive the data from the one or more other units of the device, the one or more processors are configured to operate the neural network filtering unit to receive boundary strength data from a deblocking unit of the device;
determine one or more neural network models for filtering a portion of the decoded picture; and
filter the portion of the decoded picture using the one or more neural network models and the data from the one or more other units of the device, including the boundary strength data.
23. The device of claim 22, wherein to receive the data from the one or more other units of the device, the one or more processors are further configured to operate the neural network filtering unit to receive data from one or more of the following:
an intra prediction unit of the device;
an inter prediction unit of the device;
a transform processing unit of the device;
a quantization unit of the device;
a loop filter unit of the device;
a preprocessing unit of the device; or
a second neural network filtering unit of the device.
24. The device of claim 22, wherein to receive the data from the one or more other units of the device, the one or more processors are further configured to run the neural network filtering unit to receive one or more of coding unit (CU) partition data, prediction unit (PU) partition data, transform unit (TU) partition data, deblocking filter data, quantization parameter (QP) data, intra prediction data, inter prediction data, data representing a distance between the decoded picture and one or more reference pictures, or motion information of one or more decoded blocks of the decoded picture.
25. The apparatus of claim 22, wherein to filter the portion of the decoded picture using the one or more neural network models and the data from the one or more other units of the apparatus, the one or more processors are configured to run the neural network filtering unit to provide the data from the one or more other units of the apparatus as one or more additional input planes to a convolutional neural network (CNN).
26. The apparatus of claim 22, wherein to filter the portion of the decoded picture using the one or more neural network models and the data from the one or more other units of the apparatus, the one or more processors are configured to run the neural network filtering unit to adjust an output of the one or more neural network models using the data from the one or more other units of the apparatus.
27. The apparatus of claim 22, wherein to filter the portion of the decoded picture using the one or more neural network models and the data from the one or more other units of the apparatus, the one or more processors are configured to run the neural network filtering unit to adjust the data from the one or more other units of the apparatus prior to filtering the portion of the decoded picture.
28. The apparatus of claim 22, wherein to filter the portion of the decoded picture using the one or more neural network models and the data from the one or more other units of the apparatus, the one or more processors are configured to run the neural network filtering unit to:
convert deblocking filter data of the decoded picture from the deblocking unit to one or more input planes of at least one of the one or more neural network models; and
filter the portion of the decoded picture using the one or more input planes as input to at least one of the one or more neural network models.
29. The device of claim 22, further comprising a display configured to display the decoded picture of the video data.
30. The device of claim 22, wherein the device comprises one or more of a camera, a computer, a mobile device, a broadcast receiver device, or a set-top box.
31. A computer readable storage medium having instructions stored thereon that, when executed, cause a processor of a video decoding device to operate a neural network filtering unit to:
receive data of a decoded picture of video data;
receive data from one or more other units of the video decoding device, the data from the one or more other units of the video decoding device being different from the data of the decoded picture, wherein the instructions that cause the processor to receive the data from the one or more other units of the video decoding device comprise instructions that cause the processor to receive boundary strength data from a deblocking unit of the video decoding device;
determine one or more neural network models for filtering a portion of the decoded picture; and
filter the portion of the decoded picture using the one or more neural network models and the data from the one or more other units of the video decoding device, including the boundary strength data.
32. An apparatus for filtering decoded video data, the apparatus comprising a filtering unit comprising:
means for receiving data of a decoded picture of video data;
means for receiving data from one or more other units of the video decoding device, the data from the one or more other units being different from the data of the decoded picture, and wherein the means for receiving the data from the one or more other units of the video decoding device comprises means for receiving boundary strength data from a deblocking unit of the video decoding device;
means for determining one or more neural network models for filtering a portion of the decoded picture; and
means for filtering the portion of the decoded picture using the one or more neural network models and the data from the one or more other units of the video decoding device, including the boundary strength data.
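By way of non-limiting illustration only, the operations recited in claims 25, 26, and 28 may be sketched in Python as follows. The sketch is not taken from the disclosure: the names bs_to_plane, filter_portion, toy_model, and MAX_BS, the (row, col, bs) representation of the boundary strength data, the normalization to the range [0, 1], and the use of the boundary strength plane to scale the model output are all assumptions made for the example. Nested Python lists stand in for sample planes, and toy_model stands in for a trained convolutional neural network.

```python
from typing import Callable, List, Sequence, Tuple

Plane = List[List[float]]

# Assumption: boundary strength takes the values 0, 1 or 2, as in HEVC/VVC deblocking.
MAX_BS = 2


def bs_to_plane(height: int, width: int,
                edge_samples: Sequence[Tuple[int, int, int]]) -> Plane:
    """Convert sparse boundary strength data into a dense input plane in [0, 1].

    edge_samples holds hypothetical (row, col, bs) triples for samples lying
    on a block edge; every other sample is set to 0.
    """
    plane = [[0.0] * width for _ in range(height)]
    for row, col, bs in edge_samples:
        plane[row][col] = bs / MAX_BS
    return plane


def filter_portion(decoded: Plane, bs_plane: Plane,
                   model: Callable[[List[Plane]], Plane]) -> Plane:
    """Filter a portion of a decoded picture with a model and a BS plane."""
    # The BS plane is supplied to the model as an additional input plane.
    residual = model([decoded, bs_plane])
    # Hypothetical output adjustment: scale the model's residual by the BS plane.
    return [
        [d + b * r for d, b, r in zip(d_row, b_row, r_row)]
        for d_row, b_row, r_row in zip(decoded, bs_plane, residual)
    ]


def toy_model(planes: List[Plane]) -> Plane:
    """Stand-in for a trained CNN: proposes moving each sample toward 0.5."""
    decoded = planes[0]
    return [[0.5 - sample for sample in row] for row in decoded]


decoded_portion = [[0.0, 1.0], [0.0, 1.0]]
bs_plane = bs_to_plane(2, 2, [(0, 1, 2), (1, 1, 1)])
filtered_portion = filter_portion(decoded_portion, bs_plane, toy_model)
```

In this sketch, samples away from a block edge have a boundary strength of 0 and are left unchanged, whereas samples on an edge are corrected in proportion to their boundary strength. Other ways of combining the additional data with the model input or output are equally possible.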
CN202280008942.7A 2021-01-04 2022-01-03 Multiple neural network models for filtering during video codec Pending CN116746146A (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US63/133,733 2021-01-04
US17/566,282 US20220215593A1 (en) 2021-01-04 2021-12-30 Multiple neural network models for filtering during video coding
US17/566,282 2021-12-30
PCT/US2022/011021 WO2022147494A1 (en) 2021-01-04 2022-01-03 Multiple neural network models for filtering during video coding

Publications (1)

Publication Number Publication Date
CN116746146A (en) 2023-09-12

Family

ID=87913745

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202280008942.7A Pending CN116746146A (en) 2021-01-04 2022-01-03 Multiple neural network models for filtering during video codec

Country Status (1)

Country Link
CN (1) CN116746146A (en)

Similar Documents

Publication Publication Date Title
US11206400B2 (en) Low-frequency non-separable transform (LFNST) simplifications
CN113940069A (en) Transform and last significant coefficient position signaling for low frequency non-separable transforms in video coding
CN114258675A (en) Cross-component adaptive loop filtering for video coding
JP2023542841A (en) Multiple neural network models for filtering during video coding
JP2023542840A (en) Filtering process for video coding
US11825101B2 (en) Joint-component neural network based filtering during video coding
CN114223202A (en) Low-frequency non-separable transform (LFNST) signaling
US20220030232A1 (en) Multiple adaptive loop filter sets
CN114424570B (en) Transform unit design for video coding and decoding
CN113728629A (en) Motion vector derivation in video coding
CN113557734A (en) Coefficient domain block differential pulse code modulation in video coding
CN114080805A (en) Non-linear extension of adaptive loop filtering for video coding
CN113924776A (en) Video coding with unfiltered reference samples using different chroma formats
US11778213B2 (en) Activation function design in neural network-based filtering process for video coding
CN114846803A (en) Residual coding supporting both lossy and lossless coding
US20220215593A1 (en) Multiple neural network models for filtering during video coding
EP4272448A1 (en) Multiple neural network models for filtering during video coding
US20240015284A1 (en) Reduced complexity multi-mode neural network filtering of video data
US11729381B2 (en) Deblocking filter parameter signaling
CN116325738A (en) Adaptively deriving Rice parameter values for high bit depth video coding
US20240015312A1 (en) Neural network based filtering process for multiple color components in video coding
CN116746146A (en) Multiple neural network models for filtering during video codec
WO2024010790A1 (en) Reduced complexity multi-mode neural network filtering of video data
KR20230075443A (en) Limiting the operational bit depth of adaptive loop filtering for coding of video data at different bit depths
CN116235495A (en) Fixed bit depth processing for cross-component linear model (CCLM) mode in video coding

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination