HK40080098A - Methods, apparatus and storage medium for decoding video data - Google Patents

Methods, apparatus and storage medium for decoding video data

Info

Publication number
HK40080098A
Authority
HK
Hong Kong
Prior art keywords
transform
type
block
partition
predetermined
Prior art date
Application number
HK62023069256.3A
Other languages
Chinese (zh)
Other versions
HK40080098B (en)
Inventor
马杜·柏林加色·克里什南
赵欣
刘杉
Original Assignee
腾讯美国有限责任公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯美国有限责任公司
Publication of HK40080098A
Publication of HK40080098B

Description

Interaction between transform partitioning and primary/secondary transform type selection
Incorporation by Reference
This application is based on and claims priority from U.S. provisional application No. 63/175,897, filed on 16/4/2021 and U.S. non-provisional application No. 17/568,275, filed on 4/2022, both of which are hereby incorporated by reference in their entireties.
Technical Field
This disclosure describes a set of advanced video coding techniques. More particularly, the disclosed technology relates to the interaction between transform partitioning schemes and primary/secondary transform type selection in video encoding and decoding.
Background
The background description provided herein is for the purpose of generally presenting the context of the disclosure. Work of the presently named inventors, to the extent it is described in this background section, as well as aspects of the description, is neither expressly nor impliedly admitted as prior art to the present disclosure by the filing of this application.
Video encoding and decoding may be performed using inter-picture prediction with motion compensation. Uncompressed digital video may comprise a series of pictures, each picture having a spatial dimension of, for example, 1920 × 1080 luma samples and associated full or subsampled chroma samples. The series of pictures may have a fixed or variable picture rate (alternatively referred to as a frame rate) of, for example, 60 pictures per second or 60 frames per second. Uncompressed video has specific bit rate requirements for streaming or data processing. For example, video with a pixel resolution of 1920 × 1080, a frame rate of 60 frames/second, and chroma subsampling of 4:2:0 at 8 bits per pixel per color channel requires a bandwidth of approximately 1.5 Gbit/s. An hour of such video requires more than 600 GB of storage space.
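The bandwidth and storage figures quoted above can be checked with a short calculation. The sketch below is illustrative only; the function name and the `samples_per_pixel` parameter are assumptions introduced here, with 4:2:0 subsampling modeled as 1.5 samples per pixel (one luma sample plus two chroma planes at quarter resolution).

```python
def raw_video_bitrate(width, height, fps, bits_per_sample, samples_per_pixel):
    """Return the uncompressed bit rate in bits per second.

    samples_per_pixel counts luma plus both chroma planes:
    1.5 for 4:2:0, 2.0 for 4:2:2, 3.0 for 4:4:4.
    """
    samples_per_picture = width * height * samples_per_pixel
    return samples_per_picture * bits_per_sample * fps


# 1920 x 1080, 60 frames/second, 8 bits per sample, 4:2:0 subsampling
bitrate = raw_video_bitrate(1920, 1080, 60, 8, 1.5)
bytes_per_hour = bitrate * 3600 / 8

print(f"{bitrate / 1e9:.2f} Gbit/s")        # 1.49 Gbit/s
print(f"{bytes_per_hour / 1e9:.0f} GB/h")   # 672 GB/h
```

The result, roughly 1.49 Gbit/s and 672 GB per hour, agrees with the approximate figures of 1.5 Gbit/s and more than 600 GB given in the text.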
One purpose of video encoding and decoding may be to reduce redundancy in an uncompressed input video signal by compression. Compression may help reduce the bandwidth and/or storage requirements described above, in some cases by two orders of magnitude or more. Lossless compression and lossy compression, and combinations thereof, may be employed. Lossless compression refers to a technique by which an exact copy of an original signal can be reconstructed from a compressed original signal by a decoding process. Lossy compression refers to the following encoding/decoding process: the original video information is not completely retained during encoding and cannot be completely recovered during decoding. When lossy compression is used, the reconstructed signal may be different from the original signal, but the distortion between the original signal and the reconstructed signal is small enough that the reconstructed signal is usable for the intended application, although some information is lost. In the case of video, lossy compression is widely employed in many applications. The amount of distortion that can be tolerated depends on the application. For example, some users consuming video streaming applications may tolerate higher distortion than users of movie or television broadcast applications. The compression rate achievable by a particular encoding algorithm may be selected or adjusted to reflect various distortion tolerances: higher tolerable distortion generally allows the encoding algorithm to produce higher losses and higher compression rates.
Video encoders and decoders may utilize techniques and steps from a number of broad categories including, for example, motion compensation, Fourier transforms, quantization, and entropy coding.
Video codec techniques may include a technique referred to as intra coding. In intra coding, sample values are represented without reference to samples or other data from previously reconstructed reference pictures. In some video codecs, a picture is spatially subdivided into blocks of samples. When all sample blocks are encoded in intra mode, the picture may be referred to as an intra picture. Intra pictures and their derivatives (e.g., independent decoder refresh pictures) may be used to reset the decoder state and may therefore be used as the first picture in an encoded video bitstream and video session, or as a still image. The samples of a block after intra prediction may then be transformed into the frequency domain, and the resulting transform coefficients may be quantized before entropy encoding. Intra prediction represents a technique that minimizes sample values in the pre-transform domain. In some cases, the smaller the DC value after the transform and the smaller the AC coefficients, the fewer bits are needed to represent the block after entropy encoding at a given quantization step size.
Conventional intra coding, known for example from MPEG-2 generation coding techniques, does not use intra prediction. However, some newer video compression techniques include techniques that attempt to encode/decode a block based on, for example, surrounding sample data and/or metadata obtained during the encoding and/or decoding of data blocks that are spatially adjacent to, and precede in decoding order, the block being intra encoded or decoded. Such techniques are hereinafter referred to as "intra prediction" techniques. It should be noted that, at least in some cases, intra prediction uses only reference data from the current picture being reconstructed, and not from other reference pictures.
Intra prediction can take many different forms. When more than one such technique may be used in a given video coding technique, the technique in use may be referred to as an intra prediction mode. One or more intra prediction modes may be provided in a particular codec. In some cases, a mode may have sub-modes and/or may be associated with various parameters, and the mode/sub-mode information and intra coding parameters of a video block may be encoded separately or jointly included in a mode codeword. Which codeword is used for a given combination of mode, sub-mode, and/or parameters may affect the coding efficiency gain achieved through intra prediction, and so may the entropy coding technique used to convert the codeword into a bitstream.
A certain mode of intra prediction was introduced with H.264, refined in H.265, and further refined in newer coding techniques such as the Joint Exploration Model (JEM), Versatile Video Coding (VVC), and the Benchmark Set (BMS). In general, for intra prediction, a predictor block may be formed using neighboring sample values that have become available. For example, the available values of a particular set of neighboring samples along a certain direction and/or line may be copied into the predictor block. A reference to the direction in use may be encoded in the bitstream or may itself be predicted.
Referring to fig. 1A, depicted in the lower right is a subset of nine prediction directions drawn from the 33 possible intra prediction directions of H.265 (corresponding to the 33 angular modes of the 35 intra modes specified in H.265). The point (101) where the arrows converge represents the sample being predicted. The arrows indicate the directions from which neighboring samples are used to predict the sample at (101). For example, arrow (102) indicates that the sample (101) is predicted from one or more neighboring samples to the upper right, at a 45-degree angle from the horizontal. Similarly, arrow (103) indicates that the sample (101) is predicted from one or more neighboring samples to the lower left of the sample (101), at a 22.5-degree angle from the horizontal.
Still referring to fig. 1A, a square block (104) of 4 × 4 samples is depicted at the upper left (indicated by the bold dashed line). The square block (104) includes 16 samples, each labeled with "S", its position in the Y dimension (e.g., row index), and its position in the X dimension (e.g., column index). For example, sample S21 is the second sample in the Y dimension (from the top) and the first sample in the X dimension (from the left). Similarly, sample S44 is the fourth sample in block (104) in both the Y and X dimensions. Since the block size is 4 × 4 samples, S44 is located in the lower right corner. Exemplary reference samples following a similar numbering scheme are also shown. Each reference sample is labeled with "R", its Y position (e.g., row index), and its X position (e.g., column index) relative to block (104). In both H.264 and H.265, prediction samples adjacent to the block being reconstructed are used.
Intra picture prediction for block (104) may begin by copying reference sample values from neighboring samples according to the signaled prediction direction. For example, assume that the encoded video bitstream includes signaling that, for this block (104), indicates the prediction direction of arrow (102); that is, samples are predicted from one or more prediction samples to the upper right, at a 45-degree angle from the horizontal. In this case, samples S41, S32, S23 and S14 are predicted from the same reference sample R05. Sample S44 is then predicted from reference sample R08.
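The copy operation in this example can be sketched as follows. This is a minimal illustration under the assumption that each sample S(y, x), with 1-based row y and column x, takes the value of the top reference sample R0(x + y), which reproduces the S41/S32/S23/S14 and S44 cases above. The function name and the reference values are hypothetical, and practical codecs may additionally filter or interpolate the reference samples.

```python
def predict_diagonal_up_right(top_refs, size=4):
    """Predict a size x size block along the 45-degree up-right direction.

    top_refs[k] holds reference sample R0k (k = 0 .. 2 * size), i.e. the
    row of reference samples directly above the block, extended to the right.
    """
    block = []
    for y in range(1, size + 1):          # row index, 1-based as in S<y><x>
        row = []
        for x in range(1, size + 1):      # column index, 1-based
            row.append(top_refs[x + y])   # S<y><x> copies R0<x+y>
        block.append(row)
    return block


# Hypothetical reference values for R00 .. R08
top_refs = [100, 101, 102, 103, 104, 105, 106, 107, 108]
pred = predict_diagonal_up_right(top_refs)

# S41, S32, S23 and S14 all copy R05; S44 copies R08
print(pred[3][0], pred[2][1], pred[1][2], pred[0][3], pred[3][3])
```

All samples on the same anti-diagonal share one reference sample, which is why a single copy per diagonal suffices for this direction.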
In some cases, the values of multiple reference samples may be combined, for example by interpolation, to compute a reference sample; especially when the direction is not evenly divisible by 45 degrees.
As video coding techniques have continued to evolve, the number of possible directions has increased. In H.264 (2003), for example, nine different directions may be used for intra prediction. In H.265 (2013), this increased to 33 directions, and, as of the present disclosure, JEM/VVC/BMS can support up to 65 directions. Experimental research has been conducted to help identify the most suitable intra prediction directions, and certain entropy coding techniques may be used to encode those most suitable directions in a small number of bits, accepting a certain bit penalty for the less likely directions. Furthermore, the direction itself may sometimes be predicted from the neighboring directions used in intra prediction of already decoded neighboring blocks.
Fig. 1B shows a schematic diagram (180) depicting 65 intra prediction directions according to JEM to illustrate the increase in the number of prediction directions in various coding techniques developed over time.
In the encoded video bitstream, the manner of mapping the bits that represent the intra prediction direction to the prediction direction may differ from one video coding technique to another; it may range, for example, from a simple direct mapping of prediction direction to intra prediction mode, to codewords, to complex adaptive schemes involving most probable modes, and the like. In all cases, however, there may be certain intra prediction directions that are statistically less likely to appear in the video content than certain other directions. Since the goal of video compression is to reduce redundancy, in a well-designed video coding technique those less likely directions can be represented by a larger number of bits than the more likely directions.
Inter picture prediction or inter prediction may be based on motion compensation. In motion compensation, sample data from a previously reconstructed picture or part thereof (reference picture), after being spatially offset along a direction indicated by a motion vector (hereafter MV), may be used to predict a newly reconstructed picture or picture part (e.g., block). In some cases, the reference picture may be the same as the picture currently being reconstructed. The MV may have two dimensions of X and Y, or three dimensions, the third being an indication of the reference picture being used (similar to the temporal dimension).
In some video compression techniques, a current MV applicable to a region of sample data may be predicted from other MVs, for example from MVs that are related to another region of sample data spatially adjacent to the region being reconstructed and that precede the current MV in decoding order. This can greatly reduce the total amount of data required to encode the MV by eliminating redundancy among correlated MVs, thereby increasing compression efficiency. MV prediction works effectively because, for example, when encoding an input video signal obtained from a camera (referred to as natural video), there is a statistical likelihood that an area larger than the area to which a single MV applies moves in a similar direction in the video sequence; therefore, in some cases, the larger area may be predicted using similar motion vectors derived from the MVs of neighboring areas. As a result, the actual MV for a given region is similar or identical to the MV predicted from the surrounding MVs. In turn, after entropy coding, the MV can be represented using fewer bits than would be used if the MV were coded directly (rather than predicted from neighboring MVs). In some cases, MV prediction may be an example of lossless compression of a signal (i.e., the MV) derived from an original signal (i.e., the sample stream). In other cases, MV prediction itself may be lossy, for example due to rounding errors that occur when calculating a predictor from multiple surrounding MVs.
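The idea of coding an MV as a small difference from a prediction can be sketched as follows. The component-wise median predictor used here is an assumption chosen for illustration and is not stated in the text; the function names are likewise hypothetical. With integer MVs and an exact predictor, the round trip is lossless.

```python
def predict_mv(neighbor_mvs):
    """Component-wise median of the neighboring MVs (assumed predictor)."""
    xs = sorted(mv[0] for mv in neighbor_mvs)
    ys = sorted(mv[1] for mv in neighbor_mvs)
    mid = len(neighbor_mvs) // 2
    return (xs[mid], ys[mid])


def encode_mv(actual_mv, neighbor_mvs):
    """Return the residual that would be entropy coded instead of the MV."""
    px, py = predict_mv(neighbor_mvs)
    return (actual_mv[0] - px, actual_mv[1] - py)


def decode_mv(residual, neighbor_mvs):
    """Rebuild the MV from the residual and the same neighboring MVs."""
    px, py = predict_mv(neighbor_mvs)
    return (residual[0] + px, residual[1] + py)


# Hypothetical neighboring MVs of a region that moves almost uniformly
neighbors = [(12, -3), (13, -3), (11, -2)]
actual = (12, -3)

residual = encode_mv(actual, neighbors)
print(residual)                        # (0, 0)
print(decode_mv(residual, neighbors))  # (12, -3)
```

Because neighboring regions tend to move together, the residual is frequently zero or close to zero, and small values are cheap to represent after entropy coding.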
Various MV prediction mechanisms are described in H.265/HEVC (ITU-T Recommendation H.265, "High Efficiency Video Coding", December 2016). Among the various MV prediction mechanisms specified by H.265, described below is a technique referred to hereinafter as "spatial merge".
In particular, referring to fig. 2, the current block (201) comprises samples that the encoder has found, during the motion search process, to be predictable from a previous block of the same size that has been spatially shifted. Instead of encoding the MV directly, the MV may be derived from metadata associated with one or more reference pictures, for example from the most recent reference picture (in decoding order), using the MV associated with any of the five surrounding samples labeled A0, A1 and B0, B1, B2 (202 to 206, respectively). In H.265, MV prediction may use predictors from the same reference picture that the neighboring block is using.
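A spatial merge of this kind can be sketched as building a short candidate list from the five neighboring positions and letting the bitstream carry only an index into that list. The candidate scan order, the duplicate pruning, and all names below are assumptions made for illustration; the text above does not specify them.

```python
# Assumed scan order over the five neighboring positions of fig. 2
CANDIDATE_ORDER = ("A1", "B1", "B0", "A0", "B2")


def build_merge_list(neighbor_mvs):
    """Collect the available, distinct neighboring MVs in scan order.

    neighbor_mvs maps a position name to an (x, y) MV, or to None when the
    neighbor is unavailable (e.g. outside the picture or intra coded).
    """
    merge_list = []
    for name in CANDIDATE_ORDER:
        mv = neighbor_mvs.get(name)
        if mv is not None and mv not in merge_list:
            merge_list.append(mv)
    return merge_list


def merged_mv(neighbor_mvs, merge_index):
    """Return the MV selected by a signaled merge index."""
    return build_merge_list(neighbor_mvs)[merge_index]


# Hypothetical neighborhood: B0 is unavailable, A0 and B2 repeat A1
neighbors = {"A0": (4, 0), "A1": (4, 0), "B0": None, "B1": (5, -1), "B2": (4, 0)}

print(build_merge_list(neighbors))  # [(4, 0), (5, -1)]
print(merged_mv(neighbors, 1))      # (5, -1)
```

Signaling an index into a two-entry list costs far fewer bits than coding both MV components explicitly, which is the benefit the merge technique aims for.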
Disclosure of Invention
The present disclosure describes various embodiments of methods, apparatuses, and computer-readable storage media for video encoding and/or decoding.
According to an aspect, one embodiment of the present disclosure provides a method for encoding/decoding video data in a decoder. The method comprises the following steps: receiving an encoded video stream of a data block; extracting a transform partition type associated with a data block from an encoded video bitstream; and in response to the transform partition type belonging to a subset of a predetermined set of transform partition types, each transform partition type in the predetermined set of transform partition types specifying a splitting mode for splitting the block of data into transform blocks: extracting a transform type of a transform associated with a transform block split from a data block, the transform type signaled in an encoded video bitstream, wherein the transform type belongs to a first set of predetermined transform types; and performing an inverse transform on the transform block according to the transform type.
According to another aspect, an embodiment of the present disclosure provides a method for encoding/decoding video data. The method comprises the following steps: receiving an encoded video stream of a data block; extracting a transform partition type associated with a data block of video data from an encoded video bitstream; extracting a transform type associated with a transform block of a data block from the encoded video bitstream in response to the transform partition type belonging to a subset of a predetermined set of transform partition types; and identifying a transform type of the data block in a default manner in response to the transform partition type not belonging to the set of predetermined transform partition types.
According to another aspect, an embodiment of the present disclosure provides a method for encoding/decoding video data. The method comprises the following steps: receiving an encoded video bitstream of a data block; extracting, from the encoded video bitstream, a transform type of a transform associated with a transform block of the data block; and, in response to the transform type belonging to a predetermined set of transform types, extracting a transform partition type associated with the data block from the encoded video bitstream.
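The two signaling orders described in the aspects above can be sketched as conditional parsing. Everything concrete in this sketch is hypothetical: the partition type names, the transform type names, the membership of the subsets, the default values, and the representation of the bitstream as a list of already entropy-decoded symbols are all assumptions for illustration. In particular, falling back to no partitioning in the second function is an assumption, since the text does not state what happens when the transform type is outside the predetermined set.

```python
from enum import Enum


class PartitionType(Enum):           # hypothetical transform partition types
    NONE = 0
    SPLIT = 1
    HORZ = 2
    VERT = 3


TRANSFORM_TYPES = ("DCT_DCT", "ADST_ADST", "IDTX")   # hypothetical names
DEFAULT_TRANSFORM = "DCT_DCT"                         # assumed default
TYPE_SIGNALED_FOR = {PartitionType.NONE}              # assumed subset
PARTITION_SIGNALED_FOR = {"DCT_DCT"}                  # assumed set


def parse_partition_then_type(symbols):
    """Partition type first; transform type only for the chosen subset."""
    reader = iter(symbols)
    partition = PartitionType(next(reader))
    if partition in TYPE_SIGNALED_FOR:
        transform = TRANSFORM_TYPES[next(reader)]     # explicitly signaled
    else:
        transform = DEFAULT_TRANSFORM                 # identified by default
    return partition, transform


def parse_type_then_partition(symbols):
    """Transform type first; partition type only for the chosen set."""
    reader = iter(symbols)
    transform = TRANSFORM_TYPES[next(reader)]
    if transform in PARTITION_SIGNALED_FOR:
        partition = PartitionType(next(reader))       # explicitly signaled
    else:
        partition = PartitionType.NONE                # assumed fallback
    return partition, transform


print(parse_partition_then_type([0, 1]))  # NONE partition, ADST_ADST read
print(parse_partition_then_type([2]))     # HORZ partition, default transform
print(parse_type_then_partition([0, 3]))  # DCT_DCT read, VERT partition read
```

In both orders the second syntax element is read only when the first one permits it, so no bits are spent on combinations that the scheme rules out.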
According to another aspect, an embodiment of the present disclosure provides an apparatus for video encoding and/or decoding. The device includes: a memory storing instructions; and a processor in communication with the memory. When the processor executes the instructions, the processor is configured to cause the apparatus to perform the above-described method for video decoding and/or encoding.
According to yet another aspect, an embodiment of the present disclosure provides a non-transitory computer-readable medium storing instructions that, when executed by a computer for video decoding and/or encoding, cause the computer to perform the above-described method for video decoding and/or encoding.
The above and other aspects and implementations thereof are described in more detail in the accompanying drawings, description and claims.
Drawings
Further features, properties and various advantages of the disclosed subject matter will become more apparent from the following detailed description and the accompanying drawings, in which:
Fig. 1A shows a schematic illustration of an exemplary subset of intra-prediction directional modes.
Fig. 1B shows a diagram of exemplary intra prediction directions.
Fig. 2 shows a schematic illustration of a current block and its surrounding spatial merge candidates for motion vector prediction, in one example.
Fig. 3 shows a schematic illustration of a simplified block diagram of a communication system (300) according to an example embodiment.
Fig. 4 shows a schematic illustration of a simplified block diagram of a communication system (400) according to an example embodiment.
Fig. 5 shows a schematic illustration of a simplified block diagram of a video decoder according to an exemplary embodiment.
Fig. 6 shows a schematic illustration of a simplified block diagram of a video encoder according to an example embodiment.
Fig. 7 shows a block diagram of a video encoder according to another exemplary embodiment.
Fig. 8 shows a block diagram of a video decoder according to another exemplary embodiment.
Fig. 9 illustrates a directional intra prediction mode according to an exemplary embodiment of the present disclosure.
Fig. 10 illustrates a non-directional intra prediction mode according to an exemplary embodiment of the present disclosure.
Fig. 11 illustrates a recursive intra prediction mode according to an exemplary embodiment of the present disclosure.
Fig. 12 illustrates transform block partitioning and scanning of an intra-predicted block according to an exemplary embodiment of the present disclosure.
Fig. 13 illustrates transform block division and scanning of an inter-predicted block according to an exemplary embodiment of the present disclosure.
Fig. 14 illustrates a low frequency non-separable transformation process according to an exemplary embodiment of the present disclosure.
Fig. 15 illustrates an intra prediction scheme based on respective reference lines according to an exemplary embodiment of the present disclosure.
Fig. 16 illustrates a non-recursive block partitioning scheme according to an exemplary embodiment of the present disclosure.
Fig. 17 shows a flowchart according to an exemplary embodiment of the present disclosure.
Fig. 18 shows a schematic illustration of a computer system according to an exemplary embodiment of the present disclosure.
Detailed Description
Fig. 3 shows a simplified block diagram of a communication system (300) according to one embodiment of the present disclosure. The communication system (300) comprises a plurality of terminal devices that can communicate with each other via, for example, a network (350). For example, the communication system (300) includes a first pair of terminal devices (310) and (320) interconnected by the network (350). In the example of fig. 3, the first pair of terminal devices (310) and (320) may perform unidirectional data transmission. For example, the terminal device (310) may encode video data (e.g., video data of a video picture stream captured by the terminal device (310)) for transmission over the network (350) to the other terminal device (320). The encoded video data may be transmitted in the form of one or more encoded video bitstreams. The terminal device (320) may receive the encoded video data from the network (350), decode the encoded video data to recover the video pictures, and display the video pictures according to the recovered video data. Unidirectional data transmission may be implemented in applications such as media services.
In another example, the communication system (300) includes a second pair of terminal devices (330) and (340) that perform bidirectional transmission of encoded video data, which may be implemented, for example, during a video conferencing application. For bidirectional data transmission, in one example, each of the terminal devices (330) and (340) may encode video data (e.g., video data of a video picture stream captured by that terminal device) for transmission over the network (350) to the other of the terminal devices (330) and (340). Each of the terminal devices (330) and (340) may also receive the encoded video data transmitted by the other of the terminal devices (330) and (340), decode the encoded video data to recover the video pictures, and display the video pictures on an accessible display device according to the recovered video data.
In the example of fig. 3, terminal device (310), terminal device (320), terminal device (330), and terminal device (340) may be implemented as servers, personal computers, and smartphones, although the applicability of the underlying principles of the disclosure may not be limited thereto. Embodiments of the present disclosure may be implemented in desktop computers, laptop computers, tablets, media players, wearable computers, dedicated video conferencing devices, and the like. Network (350) represents any number or type of networks that convey encoded video data between terminal device (310), terminal device (320), terminal device (330), and terminal device (340), including, for example, wired and/or wireless communication networks. The communication network (350) may exchange data in circuit-switched channels, packet-switched channels, and/or other types of channels. Representative networks include telecommunications networks, local area networks, wide area networks, and/or the Internet. For purposes of this discussion, the architecture and topology of the network (350) may be immaterial to the operation of the present disclosure, unless explicitly stated herein.
As an example of an application for the disclosed subject matter, fig. 4 shows the placement of a video encoder and a video decoder in a video streaming environment. The disclosed subject matter is equally applicable to other video applications including, for example, video conferencing, digital TV, broadcasting, gaming, virtual reality, storing compressed video on digital media including CDs, DVDs, memory sticks, and the like.
The video streaming system may comprise a video capture subsystem (413), which may include a video source (401), such as a digital camera, used to create an uncompressed stream of video pictures or images (402). In one example, the video picture stream (402) includes samples recorded by the digital camera of the video source (401). The video picture stream (402), depicted as a bold line to emphasize its high data volume when compared to the encoded video data (404) (or encoded video bitstream), may be processed by an electronic device (420) that includes a video encoder (403) coupled to the video source (401). The video encoder (403) may comprise hardware, software, or a combination of hardware and software to implement or embody aspects of the disclosed subject matter as described in more detail below. The encoded video data (404) (or encoded video bitstream (404)), depicted as a thin line to emphasize its lower data volume when compared to the uncompressed video picture stream (402), may be stored on a streaming server (405) for future use or sent directly to a downstream video device (not shown). One or more streaming client subsystems, such as client subsystem (406) and client subsystem (408) in fig. 4, may access the streaming server (405) to retrieve copies (407) and (409) of the encoded video data (404). The client subsystem (406) may include, for example, a video decoder (410) in an electronic device (430). The video decoder (410) decodes the incoming copy (407) of the encoded video data and generates an uncompressed output video picture stream (411) that can be presented on a display (412) (e.g., a display screen) or other presentation device (not depicted). The video decoder (410) may be configured to perform some or all of the various functions described in this disclosure.
In some streaming systems, the encoded video data (404), (407), and (409) (e.g., video bitstreams) may be encoded according to certain video encoding/compression standards. Examples of such standards include ITU-T Recommendation H.265. In one example, a video coding standard under development is informally referred to as Versatile Video Coding (VVC). The disclosed subject matter can be used in the context of VVC and other video coding standards.
It should be noted that electronic device (420) and electronic device (430) may include other components (not shown). For example, the electronic device (420) may include a video decoder (not shown), and the electronic device (430) may also include a video encoder (not shown).
In the following, fig. 5 shows a block diagram of a video decoder (510) according to any embodiment of the present disclosure. The video decoder (510) may be included in an electronic device (530). The electronic device (530) may include a receiver (531) (e.g., a receive circuit). The video decoder (510) may be used in place of the video decoder (410) in the example of fig. 4.
The receiver (531) may receive one or more encoded video sequences to be decoded by the video decoder (510). In the same or another embodiment, encoded video sequences may be decoded one at a time, with each encoded video sequence being decoded independently of the other encoded video sequences. Each video sequence may be associated with a plurality of video frames or images. The encoded video sequence may be received from a channel (501), which may be a hardware/software link to a storage device that stores encoded video data or to a streaming source that transmits encoded video data. The receiver (531) may receive the encoded video data together with other data, such as encoded audio data and/or auxiliary data streams, which may be forwarded to their respective processing circuits (not depicted). The receiver (531) may separate the encoded video sequence from the other data. To combat network jitter, a buffer memory (515) may be provided between the receiver (531) and the entropy decoder/parser (520) (hereinafter "parser (520)"). In some applications, the buffer memory (515) may be implemented as part of the video decoder (510). In other applications, the buffer memory (515) may be located external to, and separate from, the video decoder (510) (not depicted). In yet other applications, a buffer memory (not depicted) may be provided external to the video decoder (510), for example to combat network jitter, and an additional buffer memory (515) may be provided inside the video decoder (510), for example to handle playback timing. When the receiver (531) receives data from a store/forward device with sufficient bandwidth and controllability, or from an isochronous network, the buffer memory (515) may not be needed or may be made small. For use over best-effort packet networks such as the Internet, a buffer memory (515) of sufficient size may be required, and its size may be relatively large. Such a buffer memory may be implemented with an adaptive size, and may be implemented at least partially in an operating system or similar element (not depicted) external to the video decoder (510).
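The role of a buffer placed ahead of the parser can be sketched as a queue that withholds data until a minimum fill level has been reached, so that irregular packet arrival does not immediately starve the decoder. The class name, the `prefill` parameter, and the packet representation are hypothetical, and the fixed threshold is a simplification of the adaptive sizing mentioned above.

```python
from collections import deque


class JitterBuffer:
    """Minimal sketch of a buffer between a receiver and a parser."""

    def __init__(self, prefill):
        self.prefill = prefill      # packets required before playout starts
        self.queue = deque()
        self.started = False

    def push(self, packet):
        """Called by the receiver whenever a packet arrives."""
        self.queue.append(packet)
        if len(self.queue) >= self.prefill:
            self.started = True

    def pop(self):
        """Called by the parser; returns None while filling or on underflow."""
        if not self.started or not self.queue:
            return None
        return self.queue.popleft()


buf = JitterBuffer(prefill=2)
buf.push("p0")
first = buf.pop()       # None: still filling
buf.push("p1")
second = buf.pop()      # "p0": playout has started
third = buf.pop()       # "p1"
fourth = buf.pop()      # None: buffer underflow
print(first, second, third, fourth)
```

A larger `prefill` tolerates more arrival jitter at the cost of added latency, which is consistent with the observation that a small buffer suffices on isochronous networks while best-effort networks may need a larger one.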
The video decoder (510) may comprise a parser (520) to reconstruct symbols (521) from the encoded video sequence. The categories of these symbols include information used to manage the operation of the video decoder (510), and potentially information to control a rendering device, such as a display (512) (e.g., a display screen), which may or may not be an integral part of the electronic device (530) but which may be coupled to the electronic device (530), as shown in fig. 5. The control information for the rendering device may be in the form of Supplemental Enhancement Information (SEI) messages or Video Usability Information (VUI) parameter set fragments (not depicted). The parser (520) may parse/entropy decode the encoded video sequence that it receives. Entropy coding of the encoded video sequence may be performed in accordance with a video coding technique or standard, and may follow various principles, including variable length coding, Huffman coding, arithmetic coding with or without context sensitivity, and so forth. The parser (520) may extract from the encoded video sequence, based on at least one parameter corresponding to a subgroup, a set of subgroup parameters for at least one of the subgroups of pixels in the video decoder. A subgroup may include a group of pictures (GOP), a picture, a tile, a slice, a macroblock, a Coding Unit (CU), a block, a Transform Unit (TU), a Prediction Unit (PU), and so on. The parser (520) may also extract information from the encoded video sequence, such as transform coefficients (e.g., Fourier transform coefficients), quantizer parameter values, motion vectors, and so forth.
The parser (520) may perform entropy decoding/parsing operations on the video sequence received from the buffer memory (515) to create symbols (521).
The reconstruction of the symbols (521) may involve a number of different processes or functional units depending on the type of the encoded video picture or portion of the encoded video picture (e.g., inter and intra pictures, inter and intra blocks), among other factors. The units involved, and the manner in which the units are involved, may be controlled by a parser (520) via sub-group control information parsed from the coded video sequence. For simplicity, such sub-group control information flow between parser (520) and various processing or functional units below is not depicted.
In addition to the already mentioned functional blocks, the video decoder (510) may be conceptually subdivided into a plurality of functional units as described below. In a practical implementation operating under business constraints, many of these functional units interact closely with each other and may be at least partially integrated with each other. However, for the purpose of clearly describing various functions of the disclosed subject matter, a number of functions conceptually subdivided are employed in the context of the present disclosure.
The first unit may include a scaler/inverse transform unit (551). The scaler/inverse transform unit (551) may receive the quantized transform coefficients as symbols (521) from the parser (520) along with control information including information indicating which type of inverse transform to use, block size, quantization factor/parameter, quantization scaling matrix, etc. The scaler/inverse transform unit (551) may output a block comprising sample values, which may be input into an aggregator (555).
In some cases, the output samples of the scaler/inverse transform unit (551) may belong to an intra-coded block; that is, a block that does not use prediction information from previously reconstructed pictures, but may use prediction information from previously reconstructed portions of the current picture. Such prediction information may be provided by an intra picture prediction unit (552). In some cases, the intra picture prediction unit (552) may use surrounding block information that has already been reconstructed and stored in the current picture buffer (558) to generate a block that is the same size and shape as the block being reconstructed. For example, the current picture buffer (558) buffers a partially reconstructed current picture and/or a fully reconstructed current picture. In some implementations, the aggregator (555) may add the prediction information generated by the intra picture prediction unit (552) to the output sample information provided by the scaler/inverse transform unit (551) on a per sample basis.
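The per-sample addition performed by the aggregator (555) can be illustrated with a minimal sketch. The function name `aggregate`, the nested-list block representation, and the clipping of the sum to the valid sample range are assumptions made for illustration; the description above only states that prediction and residual are added sample by sample.

```python
def aggregate(prediction, residual, bit_depth=8):
    """Add residual samples to prediction samples, sample by sample.

    Clipping to [0, 2**bit_depth - 1] is an assumption of this sketch.
    """
    if len(prediction) != len(residual) or any(
        len(p_row) != len(r_row) for p_row, r_row in zip(prediction, residual)
    ):
        raise ValueError("prediction and residual must have the same shape")
    max_val = (1 << bit_depth) - 1
    return [
        [min(max(p + r, 0), max_val) for p, r in zip(p_row, r_row)]
        for p_row, r_row in zip(prediction, residual)
    ]


# Hypothetical 2x2 block: prediction from the intra unit, residual from the
# scaler/inverse transform unit.
prediction = [[100, 102], [101, 250]]
residual = [[-3, 4], [0, 10]]
reconstructed = aggregate(prediction, residual)
```

In this sketch the last sample saturates at the 8-bit maximum because the unclipped sum would exceed it.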
In other cases, the output samples of the scaler/inverse transform unit (551) may belong to an inter-coded and potentially motion-compensated block. In this case, the motion compensated prediction unit (553) may access a reference picture memory (557) to fetch samples for inter picture prediction. After motion compensating the fetched samples according to the symbols (521) belonging to the block, these samples may be added by the aggregator (555) to the output of the scaler/inverse transform unit (551) (the output of unit (551) may be referred to as residual samples or a residual signal), thereby generating output sample information. The addresses within the reference picture memory (557) from which the motion compensated prediction unit (553) fetches prediction samples may be controlled by motion vectors, which are available to the motion compensated prediction unit (553) in the form of symbols (521) that may have, for example, an X component, a Y component (offset), and a reference picture component (temporal). Motion compensation may also include interpolation of sample values fetched from the reference picture memory (557) when sub-sample precision motion vectors are in use, and may also be associated with motion vector prediction mechanisms, and so forth.
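As a rough illustration of how a motion vector controls the addresses from which prediction samples are fetched, the following sketch copies a block from a reference picture at an integer-sample offset. The function name `fetch_prediction`, the nested-list picture layout, and the clamping of coordinates at the picture edge are assumptions of this sketch; sub-sample interpolation is deliberately omitted.

```python
def fetch_prediction(reference, x, y, width, height, mv_x, mv_y):
    """Copy a width x height block from `reference`, displaced by (mv_x, mv_y).

    (x, y) is the top-left corner of the current block. Integer-sample motion
    vectors only; coordinates outside the picture are clamped to the nearest
    edge sample (an assumption of this sketch).
    """
    ref_h, ref_w = len(reference), len(reference[0])
    block = []
    for row in range(height):
        src_y = min(max(y + mv_y + row, 0), ref_h - 1)
        line = []
        for col in range(width):
            src_x = min(max(x + mv_x + col, 0), ref_w - 1)
            line.append(reference[src_y][src_x])
        block.append(line)
    return block


# Hypothetical 6x6 reference picture whose sample value encodes its position.
reference = [[10 * r + c for c in range(6)] for r in range(6)]
pred = fetch_prediction(reference, x=2, y=2, width=2, height=2, mv_x=1, mv_y=-1)
```

The fetched block would then be added to the residual samples by the aggregator, as described above.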
The output samples of the aggregator (555) may be subjected to various loop filtering techniques in the loop filter unit (556). The video compression techniques may include in-loop filter techniques that are controlled by parameters included in the encoded video sequence (also referred to as the encoded video stream) and made available to the loop filter unit (556) as symbols (521) from the parser (520). However, the in-loop filter techniques may also be responsive to meta-information obtained during decoding of previous (in decoding order) portions of the encoded picture or encoded video sequence, as well as to previously reconstructed and loop-filtered sample values. Various types of loop filters may be included in various orders as part of the loop filter unit (556), as will be described in further detail below.
The output of the loop filter unit (556) may be a sample stream that may be output to a rendering device (512) and stored in a reference picture memory (557) for future inter picture prediction.
Once fully reconstructed, some of the coded pictures may be used as reference pictures for future inter-picture prediction. For example, once the encoded picture corresponding to the current picture is fully reconstructed and the encoded picture is identified (by, e.g., parser (520)) as a reference picture, current picture buffer (558) may become part of reference picture memory (557) and a new current picture buffer may be reallocated before starting reconstruction of a subsequent encoded picture.
The video decoder (510) may perform decoding operations according to a predetermined video compression technique employed in a standard such as ITU-T Recommendation H.265. The encoded video sequence may conform to the syntax specified by the video compression technique or standard used, in the sense that the encoded video sequence adheres to both the syntax of the video compression technique or standard and the profiles documented in the video compression technique or standard. In particular, a profile may select certain tools from all the tools available in the video compression technique or standard as the only tools available for use under that profile. To comply with the standard, the complexity of the encoded video sequence may also be within the bounds defined by the level of the video compression technique or standard. In some cases, levels restrict the maximum picture size, the maximum frame rate, the maximum reconstruction sample rate (measured in, e.g., megasamples per second), the maximum reference picture size, and so on. In some cases, the limits set by levels may be further restricted through Hypothetical Reference Decoder (HRD) specifications and metadata for HRD buffer management signaled in the encoded video sequence.
In some example embodiments, the receiver (531) may receive additional (redundant) data when receiving encoded video. The additional data may be included as part of the encoded video sequence. The additional data may be used by the video decoder (510) to properly decode the data and/or to more accurately reconstruct the original video data. The additional data may take the form of, for example, temporal, spatial, or signal-to-noise ratio (SNR) enhancement layers, redundant slices, redundant pictures, forward error correction codes, and the like.
Fig. 6 shows a block diagram of a video encoder (603) according to an example embodiment of the present disclosure. The video encoder (603) may be included in an electronic device (620). The electronic device (620) may further include a transmitter (640) (e.g., a transmission circuit). The video encoder (603) may be used in place of the video encoder (403) in the example of fig. 4.
The video encoder (603) may receive video samples from a video source (601) (not part of the electronic device (620) in the example of fig. 6), and the video source (601) may capture video images to be encoded by the video encoder (603). In another example, the video source (601) may be implemented as part of the electronic device (620).
The video source (601) may provide a source video sequence in the form of a digital video sample stream to be encoded by the video encoder (603). The sample stream may have any suitable bit depth (e.g., 8-bit, 10-bit, 12-bit, ...), any color space (e.g., BT.601 YCrCb, RGB, XYZ, ...), and any suitable sampling structure (e.g., YCrCb 4...). In a media service system, the video source (601) may be a storage device capable of storing previously prepared video. In a video conferencing system, the video source (601) may be a camera that captures local image information as a video sequence. Video data may be provided as a plurality of individual pictures or images that impart motion when viewed in sequence. The pictures themselves may be organized as spatial arrays of pixels, where each pixel may comprise one or more samples, depending on the sampling structure, color space, etc. in use. A person of ordinary skill in the art can readily understand the relationship between pixels and samples. The following description focuses on samples.
According to some example embodiments, the video encoder (603) may encode and compress pictures of a source video sequence into an encoded video sequence (643), in real time or under any other temporal constraint required by the application. The implementation of the appropriate encoding speed constitutes a function of the controller (650). In some embodiments, the controller (650) may be functionally coupled to and control other functional units as described below. For simplicity, the coupling is not depicted in the figures. The parameters set by the controller (650) may include rate control related parameters (picture skip, quantizer, lambda value of rate distortion optimization technique), picture size, group of pictures (GOP) layout, maximum motion vector search range, etc. The controller (650) may be configured with other suitable functions relating to the video encoder (603) optimized for a certain system design.
In some example embodiments, the video encoder (603) may be configured to operate in an encoding loop. As an oversimplified description, in one example, the encoding loop may include a source encoder (630) (e.g., responsible for creating symbols, such as a symbol stream, based on an input picture to be encoded and one or more reference pictures) and a (local) decoder (633) embedded in the video encoder (603). The decoder (633) reconstructs the symbols to create sample data in a manner similar to that in which a (remote) decoder would create it, even though the embedded decoder (633) processes the video stream encoded by the source encoder (630) without entropy encoding (because any compression between the symbols and the encoded video stream in entropy encoding may be lossless in the video compression techniques contemplated in the disclosed subject matter). The reconstructed sample stream (sample data) is input to a reference picture memory (634). Since the decoding of the symbol stream produces bit-exact results independent of decoder location (local or remote), the contents of the reference picture memory (634) are also bit-exact between the local encoder and the remote decoder. In other words, the reference picture samples that the prediction portion of the encoder "sees" are identical to the sample values that the decoder would "see" when using prediction during decoding. This principle of reference picture synchronicity (and the drift that results if synchronicity cannot be maintained, e.g., due to channel errors) is used to improve coding quality.
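The reference synchronization principle can be illustrated with a deliberately simple one-dimensional predictive coder. This is a minimal sketch under stated assumptions: the names `quantize`, `encode`, and `decode`, the previous-sample predictor, and the uniform quantizer are all hypothetical and stand in for the far more elaborate prediction and quantization of a real codec. The point shown is that the encoder predicts from its own locally reconstructed samples, so its reference data matches what a remote decoder rebuilds from the same symbols.

```python
def quantize(value, step):
    """Uniform scalar quantizer (rounds to the nearest level)."""
    return int(round(value / step))


def decode(symbols, step, start=0):
    """Decoder: rebuild samples from symbols, predicting from its own output."""
    reconstructed, previous = [], start
    for level in symbols:
        previous = previous + level * step
        reconstructed.append(previous)
    return reconstructed


def encode(samples, step, start=0):
    """Encoder with an embedded (local) decoder.

    Each sample is predicted from the previously *reconstructed* sample,
    not from the original one, so encoder and decoder stay in sync.
    """
    symbols, local_recon, previous = [], [], start
    for sample in samples:
        level = quantize(sample - previous, step)  # lossy step
        symbols.append(level)
        previous = previous + level * step         # same update as decode()
        local_recon.append(previous)
    return symbols, local_recon


samples = [10, 13, 19, 18]
symbols, local_recon = encode(samples, step=4)
remote_recon = decode(symbols, step=4)
```

Because both sides apply the same update to the same symbols, `local_recon` and `remote_recon` are identical even though the reconstruction differs from the source samples. Predicting from the original samples instead would let quantization error accumulate at the decoder, which is the drift mentioned above.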
The operation of the "local" decoder (633) may be the same as a "remote" decoder, such as the video decoder (510) that has been described in detail above in connection with fig. 5. However, referring briefly to fig. 5 additionally, since symbols are available and the entropy encoder (645) and parser (520) are able to losslessly encode/decode the symbols into an encoded video sequence, the entropy decoding portion of the video decoder (510) including the buffer memory (515) and parser (520) may not be fully implemented in the local decoder (633) in the encoder.
It can be observed at this point that any decoder technique other than just parsing/entropy decoding, which may be present in a decoder, may also necessarily need to be present in the corresponding encoder in substantially the same functional form. For this reason, the disclosed subject matter may sometimes focus on decoder operations, which are integrated with the decoding portion of the encoder. Thus, the description of the encoder techniques may be simplified, as the encoder techniques are reciprocal to the fully described decoder techniques. In the following, a more detailed description of the encoder is provided only in certain areas or aspects.
During operation, in some example implementations, the source encoder (630) may perform motion compensated predictive coding, which codes an input picture predictively with reference to one or more previously coded pictures from the video sequence that are designated as "reference pictures". In this manner, the encoding engine (632) codes the differences (or residue), in the color channels, between the pixel blocks of an input picture and the pixel blocks of the reference picture(s) that may be selected as prediction reference(s) for the input picture. The term "residue" and its adjective form "residual" are used interchangeably.
The local video decoder (633) may decode the encoded video data of pictures that may be designated as reference pictures, based on the symbols created by the source encoder (630). The operation of the encoding engine (632) may advantageously be a lossy process. When the encoded video data is decoded at a video decoder (not shown in fig. 6), the reconstructed video sequence may typically be a replica of the source video sequence with some errors. The local video decoder (633) replicates the decoding process that may be performed by the remote video decoder on reference pictures, and may cause the reconstructed reference pictures to be stored in the reference picture memory (634). In this way, the video encoder (603) may locally store copies of reconstructed reference pictures that have the same content as the reconstructed reference pictures to be obtained by the remote video decoder (absent transmission errors).
Predictor (635) may perform a prediction search for coding engine (632). That is, for a new picture to be encoded, the predictor (635) may search the reference picture memory (634) for sample data (as candidate reference pixel blocks) or some metadata, such as reference picture motion vectors, block shape, etc., that may be used as a suitable prediction reference for the new picture. The predictor (635) may operate on a block-by-block basis of samples to find a suitable prediction reference. In some cases, the input picture may have prediction references taken from multiple reference pictures stored in a reference picture memory (634), as determined by search results obtained by the predictor (635).
The controller (650) may manage the encoding operations of the source encoder (630), including, for example, setting parameters and subgroup parameters for encoding the video data.
The outputs of all of the above functional units may be entropy encoded in an entropy encoder (645). The entropy encoder (645) losslessly compresses the symbols generated by the various functional units according to techniques such as huffman coding, variable length coding, arithmetic coding, etc., to convert the symbols into an encoded video sequence.
The transmitter (640) may buffer the encoded video sequence created by the entropy encoder (645) in preparation for transmission over the communication channel (660), which may be a hardware/software link to a storage device that may store the encoded video data. The transmitter (640) may combine the encoded video data from the video encoder (603) with other data to be transmitted, such as encoded audio data and/or an auxiliary data stream (source not shown).
The controller (650) may manage the operation of the video encoder (603). During encoding, the controller (650) may assign a certain encoded picture type to each encoded picture, which may affect the encoding techniques applicable to the respective picture. For example, pictures may generally be assigned to any of the following picture types:
Intra pictures (I pictures), which may be pictures that can be encoded and decoded without using any other picture in the sequence as a prediction source. Some video codecs allow different types of intra pictures, including, for example, independent decoder refresh ("IDR") pictures. Those of ordinary skill in the art are aware of these variants of I pictures and their respective applications and features.
A predictive picture (P picture), which may be a picture that can be encoded and decoded using intra prediction or inter prediction that uses at most one motion vector and a reference index to predict sample values of each block.
Bi-predictive pictures (B-pictures), which may be pictures that can be encoded and decoded using intra prediction or inter prediction that uses at most two motion vectors and reference indices to predict the sample values of each block. Similarly, multiple predictive pictures may use more than two reference pictures and associated metadata for reconstructing a single block.
A source picture may typically be spatially subdivided into a plurality of encoded blocks of samples (e.g., blocks of 4 × 4, 8 × 8, 4 × 8, or 16 × 16 samples) and encoded block by block. These blocks may be predictively encoded with reference to other (encoded) blocks determined by the encoding allocation applied to the respective pictures of the block. For example, a block of an I picture may be non-predictively encoded, or a block of an I picture may be predictively encoded (spatial prediction or intra prediction) with reference to an encoded block of the same picture. The pixel block of the P picture may be predictively encoded by spatial prediction or by temporal prediction with reference to one previously encoded reference picture. A block of a B picture may be predictively encoded by spatial prediction or by temporal prediction with reference to one or two previously encoded reference pictures. The source picture or the intermediately processed picture may be subdivided into other types of blocks for other purposes. The division of the coded blocks and other types of blocks may or may not follow the same manner, as described in further detail below.
The video encoder (603) may perform encoding operations according to a predetermined video encoding technique or standard, such as the ITU-T h.265 recommendation. In operation, the video encoder (603) may perform various compression operations, including predictive encoding operations that exploit temporal and spatial redundancies in the input video sequence. Thus, the encoded video data may conform to syntax specified by the video coding technique or standard used.
In some example embodiments, the transmitter (640) may transmit the additional data while transmitting the encoded video. The source encoder (630) may include such data as part of an encoded video sequence. The additional data may include temporal/spatial/SNR enhancement layers, other forms of redundant data such as redundant pictures and slices, SEI messages, VUI parameter set fragments, and the like.
The captured video may take the form of a plurality of source pictures (video pictures) in a temporal sequence. Intra picture prediction, often abbreviated as intra prediction, exploits spatial correlation within a given picture, while inter picture prediction exploits temporal or other correlation between pictures. For example, a particular picture being encoded/decoded, referred to as the current picture, may be divided into blocks. When a block in the current picture is similar to a reference block in a reference picture that was previously encoded and is still buffered, the block in the current picture may be encoded by a vector called a motion vector. The motion vector points to the reference block in the reference picture and, when multiple reference pictures are in use, may have a third dimension that identifies the reference picture.
In some example embodiments, bi-directional prediction techniques may be used for inter-picture prediction. According to such bi-prediction techniques, two reference pictures are used, e.g., a first reference picture and a second reference picture that precede the current picture in video in decoding order (but may be past or future, respectively, in display order). A block in a current picture may be encoded by a first motion vector pointing to a first reference block in a first reference picture and a second motion vector pointing to a second reference block in a second reference picture. The block may be jointly predicted by a combination of the first reference block and the second reference block.
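One possible way to combine the two reference blocks is an equal-weight average, sketched below. The text above only states that the block is jointly predicted by a combination of the two reference blocks; the equal weighting, the round-half-up integer arithmetic, and the function name `bi_predict` are assumptions made for illustration.

```python
def bi_predict(block0, block1):
    """Combine two reference blocks into one prediction block.

    Uses an equal-weight average with rounding; other weightings are
    possible and this choice is an assumption of the sketch.
    """
    if len(block0) != len(block1) or any(
        len(r0) != len(r1) for r0, r1 in zip(block0, block1)
    ):
        raise ValueError("reference blocks must have the same shape")
    return [
        [(a + b + 1) >> 1 for a, b in zip(r0, r1)]
        for r0, r1 in zip(block0, block1)
    ]


# Hypothetical blocks fetched via the first and second motion vectors.
first_ref = [[10, 20], [30, 40]]
second_ref = [[13, 20], [31, 44]]
joint_prediction = bi_predict(first_ref, second_ref)
```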
Furthermore, merge mode techniques may be used for inter picture prediction to improve coding efficiency.
According to some exemplary embodiments of the present disclosure, prediction such as inter picture prediction and intra picture prediction is performed in units of blocks. For example, pictures in a sequence of video pictures are divided into Coding Tree Units (CTUs) for compression, and the CTUs in the pictures may have the same size, e.g., 64 × 64 pixels, 32 × 32 pixels, or 16 × 16 pixels. In general, a CTU may comprise three parallel Coding Tree Blocks (CTBs): one luminance CTB and two chrominance CTBs. Each CTU may be recursively split in a quadtree into one or more Coding Units (CUs). For example, a CTU of 64 × 64 pixels may be split into one CU of 64 × 64 pixels, or 4 CUs of 32 × 32 pixels. Each block of the one or more 32 x 32 blocks may be further split into 4 CUs of 16 x 16 pixels. In some example embodiments, each CU may be analyzed during encoding to determine a prediction type for the CU among a plurality of prediction types, e.g., an inter prediction type or an intra prediction type. A CU may be split into one or more Prediction Units (PUs) according to temporal and/or spatial predictability. Typically, each PU includes a luma Prediction Block (PB) and two chroma PBs. In one embodiment, the prediction operation in encoding (encoding/decoding) is performed in units of prediction blocks. Splitting a CU into PUs (or PBs of different color channels) may be performed in various spatial modes. For example, the luma or chroma PB may include a matrix of values (e.g., luma values) for samples, such as 8 × 8 pixels, 16 × 16 pixels, 8 × 16 pixels, 16 × 8 pixels, and so on.
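The recursive quadtree split of a CTU into CUs can be sketched as follows. The function name `split_ctu`, the `should_split` callback, and the example decision rule are hypothetical; in an actual encoder the split decision would come from analysis such as rate-distortion optimization, and in a decoder from signaled syntax.

```python
def split_ctu(x, y, size, should_split, min_size=16):
    """Return a list of (x, y, size) coding units covering a square block.

    `should_split(x, y, size)` is a caller-supplied decision function
    (hypothetical here); blocks are never split below `min_size`.
    """
    if size > min_size and should_split(x, y, size):
        half = size // 2
        units = []
        for dy in (0, half):
            for dx in (0, half):
                units.extend(
                    split_ctu(x + dx, y + dy, half, should_split, min_size)
                )
        return units
    return [(x, y, size)]


def example_rule(x, y, size):
    """Split the 64x64 CTU once, then split only its top-left 32x32 block."""
    return size == 64 or (size == 32 and x == 0 and y == 0)


cus = split_ctu(0, 0, 64, example_rule)
```

With this example rule, the 64 × 64 CTU yields four 16 × 16 CUs in its top-left quadrant and three 32 × 32 CUs elsewhere, and together the CUs cover the CTU exactly once.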
Fig. 7 shows a diagram of a video encoder (703) according to another example embodiment of the present disclosure. A video encoder (703) is configured to receive a processing block (e.g., a prediction block) of sample values within a current video picture in a sequence of video pictures, and encode the processing block into an encoded picture that is part of an encoded video sequence. In an example, the video encoder (703) may be used in place of the video encoder (403) in the example of fig. 4.
For example, the video encoder (703) receives a matrix of sample values for a processing block, e.g., a prediction block of 8 × 8 samples, or the like. The video encoder (703) then determines, using, for example, rate-distortion optimization (RDO), whether the processing block is best encoded in intra mode, inter mode, or bi-prediction mode. When it is determined that the processing block is to be encoded in intra mode, the video encoder (703) may use intra prediction techniques to encode the processing block into the encoded picture; and when it is determined that the processing block is to be encoded in inter mode or bi-prediction mode, the video encoder (703) may use inter prediction or bi-prediction techniques, respectively, to encode the processing block into the encoded picture. In some example embodiments, a merge mode may be used as a sub-mode of inter picture prediction, in which motion vectors are derived from one or more motion vector predictors without the benefit of a coded motion vector component outside the predictors. In some other example embodiments, there may be a motion vector component that is applicable to the subject block. Accordingly, the video encoder (703) may include components not explicitly shown in fig. 7, such as a mode decision module for determining the prediction mode of the processing block.
In the example of fig. 7, the video encoder (703) includes an inter-frame encoder (730), an intra-frame encoder (722), a residual calculator (723), a switch (726), a residual encoder (724), a general-purpose controller (721), and an entropy encoder (725) coupled together as shown in the exemplary arrangement in fig. 7.
The inter encoder (730) is configured to receive samples of a current block (e.g., a processing block), compare the block to one or more reference blocks in reference pictures (e.g., blocks in previous and subsequent pictures in display order), generate inter prediction information (e.g., a description of redundant information according to inter coding techniques, motion vectors, merge mode information), and calculate an inter prediction result (e.g., a predicted block) based on the inter prediction information using any suitable technique. In some examples, the reference pictures are decoded reference pictures that are decoded based on the encoded video information using the decoding unit (633) embedded in the example encoder (620) of fig. 6 (shown as the residual decoder (728) in fig. 7, as described in further detail below).
An intra encoder (722) is configured to receive samples of a current block (e.g., a processed block), compare the block to already encoded blocks in the same picture, and generate quantized coefficients after transformation, and in some cases, intra prediction information (e.g., intra prediction direction information according to one or more intra coding techniques). The intra encoder (722) may calculate an intra prediction result (e.g., a predicted block) based on the intra prediction information and a reference block in the same picture.
The general purpose controller (721) may be configured to determine general purpose control data and control other components of the video encoder (703) based on the general purpose control data. In one example, a general purpose controller (721) determines a prediction mode for a block and provides a control signal to a switch (726) based on the prediction mode. For example, when the prediction mode is intra mode, the general controller (721) controls the switch (726) to select an intra mode result for use by the residual calculator (723), and controls the entropy encoder (725) to select and include intra prediction information in the code stream; and when the prediction mode of the block is inter mode, the general purpose controller (721) controls the switch (726) to select an inter prediction result for use by the residual calculator (723), and controls the entropy encoder (725) to select and include inter prediction information in the code stream.
The residual calculator (723) may be configured to calculate the difference (residual data) between the received block and the prediction result for the block selected from the intra encoder (722) or the inter encoder (730). The residual encoder (724) may be configured to encode the residual data to generate transform coefficients. For example, the residual encoder (724) may be configured to transform the residual data from the spatial domain to the frequency domain to generate the transform coefficients. The transform coefficients are then subjected to a quantization process to obtain quantized transform coefficients. In various example embodiments, the video encoder (703) further includes a residual decoder (728). The residual decoder (728) is configured to perform the inverse transform and generate decoded residual data. The decoded residual data may be used as appropriate by the intra encoder (722) and the inter encoder (730). For example, the inter encoder (730) may generate a decoded block based on the decoded residual data and the inter prediction information, and the intra encoder (722) may generate a decoded block based on the decoded residual data and the intra prediction information. The decoded blocks are processed appropriately to generate decoded pictures, and the decoded pictures may be buffered in a memory circuit (not shown) and used as reference pictures.
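The chain of residual calculation, transform, quantization, and inverse transform can be sketched on a single row of samples. This is a minimal illustration under stated assumptions: it uses a one-dimensional orthonormal DCT-II (real codecs apply two-dimensional transforms and may select among several transform types), a uniform quantizer, and hypothetical function names `dct`, `idct`, `encode_residual`, and `decode_residual`.

```python
import math


def _scale(k, n):
    """Normalization factor of the orthonormal DCT-II basis function k."""
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)


def dct(values):
    """Orthonormal 1-D DCT-II: spatial domain to frequency domain."""
    n = len(values)
    return [
        _scale(k, n)
        * sum(v * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
              for i, v in enumerate(values))
        for k in range(n)
    ]


def idct(coeffs):
    """Inverse of dct(): frequency domain back to spatial domain."""
    n = len(coeffs)
    return [
        sum(_scale(k, n) * c * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for k, c in enumerate(coeffs))
        for i in range(n)
    ]


def encode_residual(block, prediction, step):
    """Residual calculator + residual encoder: returns quantized levels."""
    residual = [b - p for b, p in zip(block, prediction)]
    return [int(round(c / step)) for c in dct(residual)]


def decode_residual(levels, step):
    """Residual decoder: dequantize, then inverse transform."""
    return idct([level * step for level in levels])


block = [52, 55, 61, 66]        # hypothetical source samples
prediction = [50, 50, 60, 60]   # hypothetical prediction result
levels = encode_residual(block, prediction, step=2)
decoded = [p + r for p, r in zip(prediction, decode_residual(levels, step=2))]
```

Because quantization discards information, `decoded` only approximates `block`; the encoder's residual decoder produces the same approximation that a decoder would, which is why it is the decoded blocks that serve as reference data.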
The entropy encoder (725) may be configured to format the codestream to include the encoded blocks, and to perform entropy encoding. The entropy encoder (725) is configured to include various information in the codestream. For example, the entropy encoder (725) may be configured to include general control data, selected prediction information (e.g., intra prediction information or inter prediction information), residual information, and other suitable information in the code stream. When a block is encoded in a merge sub-mode of an inter mode or a bi-directional prediction mode, there may be no residual information.
Fig. 8 shows a diagram of an exemplary video decoder (810) according to another embodiment of the present disclosure. A video decoder (810) is configured to receive an encoded picture that is part of an encoded video sequence and decode the encoded picture to generate a reconstructed picture. In one example, the video decoder (810) may be used in place of the video decoder (410) in the example of fig. 4.
In the example of fig. 8, the video decoder (810) includes an entropy decoder (871), an inter-frame decoder (880), a residual decoder (873), a reconstruction module (874), and an intra-frame decoder (872) coupled together as shown in the exemplary arrangement in fig. 8.
The entropy decoder (871) may be configured to reconstruct, from the encoded picture, certain symbols that represent the syntax elements of which the encoded picture is made up. Such symbols may include, for example, the mode in which a block is encoded (e.g., intra mode, inter mode, bi-prediction mode, merge sub-mode, or another sub-mode), prediction information (e.g., intra prediction information or inter prediction information) that may identify certain samples or metadata used for prediction by the intra decoder (872) or the inter decoder (880), residual information in the form of, for example, quantized transform coefficients, and so forth. In one example, when the prediction mode is inter mode or bi-prediction mode, the inter prediction information is provided to the inter decoder (880); and when the prediction type is an intra prediction type, the intra prediction information is provided to the intra decoder (872). The residual information may be subjected to inverse quantization and provided to the residual decoder (873).
The inter-frame decoder (880) may be configured to receive the inter-frame prediction information and generate an inter-frame prediction result based on the inter-frame prediction information.
An intra-frame decoder (872) may be configured to receive intra-frame prediction information and generate a prediction result based on the intra-frame prediction information.
The residual decoder (873) may be configured to perform inverse quantization to extract dequantized transform coefficients, and to process the dequantized transform coefficients to convert the residual from the frequency domain to the spatial domain. The residual decoder (873) may also use certain control information (including the quantizer parameter (QP)), which may be provided by the entropy decoder (871) (data path not depicted, as this may be low-volume control information only).
The reconstruction module (874) may be configured to combine the residuals output by the residual decoder (873) and the prediction (which may be output by an inter prediction module or an intra prediction module, as the case may be) in the spatial domain to form reconstructed blocks, which form part of reconstructed pictures, which are part of reconstructed video. It should be noted that other suitable operations, such as deblocking operations, may also be performed to improve visual quality.
It should be noted that the video encoders (403), (603), and (703) and the video decoders (410), (510), and (810) may be implemented using any suitable technique. In some example embodiments, the video encoders (403), (603), and (703) and the video decoders (410), (510), and (810) may be implemented using one or more integrated circuits. In another embodiment, the video encoders (403), (603), and (703) and the video decoders (410), (510), and (810) may be implemented using one or more processors executing software instructions.
Returning to the intra prediction process, samples in a block (e.g., a luma prediction block or a chroma prediction block, or a coding block if not further split into prediction blocks) are predicted from samples of a neighboring row, a next neighboring row, one or more other rows, or a combination thereof, to generate a prediction block. The residual between the actual block being coded and the prediction block can then be processed by a transform, followed by quantization. Various intra prediction modes may be made available, and parameters related to intra mode selection and other parameters may be signaled in the codestream. For example, the various intra prediction modes may involve one or more row positions for the prediction samples, a direction along which the prediction samples are selected from one or more prediction rows, and other special intra prediction modes.
For example, a set of intra prediction modes (interchangeably referred to as "intra modes") may include a predetermined number of directional intra prediction modes. As described above in relation to the example implementation of fig. 1, these intra prediction modes may correspond to a predetermined number of directions along which samples outside the block (out-of-block samples) are selected as predictions of the samples being predicted in a particular block. In another particular exemplary implementation, eight (8) primary directional modes may be supported and predefined, the 8 primary directional modes corresponding to angles of 45 to 207 degrees from the horizontal axis.
In some other implementations of intra prediction, to further exploit the greater variety of spatial redundancy in directional textures, the directional intra modes may be further extended to an angle set with finer granularity. For example, the 8-angle implementation described above may be configured to provide 8 nominal angles, as shown in fig. 9, referred to as V_PRED, H_PRED, D45_PRED, D135_PRED, D113_PRED, D157_PRED, D203_PRED, and D67_PRED, and for each nominal angle, a predetermined number (e.g., 7) of finer angles may be added. With this extension, a larger total number (e.g., 56 in this example) of directional angles may be available for intra prediction, corresponding to the same number of predetermined directional intra modes. A prediction angle may be represented by a nominal intra angle plus an angle delta. For the particular example above with 7 finer angular directions for each nominal angle, the angle delta may be -3 to 3 times a step size of 3 degrees.
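The relationship between nominal angles and angle deltas can be sketched as follows. This is a minimal illustration, not a codec implementation: the numeric angles for D45_PRED through D67_PRED are read off the mode names, while the values 90 and 180 degrees for V_PRED and H_PRED, and the function names, are assumptions made for this example.

```python
# Nominal angles in degrees. The Dxx values follow the mode names in the text;
# 90 for V_PRED and 180 for H_PRED are assumptions made for this sketch.
NOMINAL_ANGLES = {
    "V_PRED": 90, "H_PRED": 180, "D45_PRED": 45, "D135_PRED": 135,
    "D113_PRED": 113, "D157_PRED": 157, "D203_PRED": 203, "D67_PRED": 67,
}
ANGLE_STEP = 3   # step size in degrees, as stated in the text
MAX_DELTA = 3    # angle delta index ranges from -3 to +3 (7 values)


def prediction_angle(mode: str, delta: int) -> int:
    """Return the prediction angle: nominal angle plus delta times the step."""
    if not -MAX_DELTA <= delta <= MAX_DELTA:
        raise ValueError("angle delta index must be in [-3, 3]")
    return NOMINAL_ANGLES[mode] + delta * ANGLE_STEP


def all_directional_modes():
    """Enumerate every (nominal mode, delta, angle) combination."""
    return [
        (mode, delta, prediction_angle(mode, delta))
        for mode in NOMINAL_ANGLES
        for delta in range(-MAX_DELTA, MAX_DELTA + 1)
    ]
```

Enumerating all combinations yields 8 × 7 = 56 directional modes, matching the count given above.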
In some implementations, a predetermined number of non-directional intra prediction modes may be predefined and made available in place of or in addition to the directional intra modes described above. For example, 5 non-directional intra modes, referred to as smooth intra prediction modes, may be specified. These non-directional intra prediction modes may be specifically referred to as the DC, PAETH, SMOOTH, SMOOTH_V, and SMOOTH_H intra modes. Fig. 10 illustrates prediction of samples of a particular block under these exemplary non-directional modes. As an example, fig. 10 shows a 4 × 4 block 1002 predicted from samples of the top neighboring row and/or the left neighboring row. A particular sample 1010 in block 1002 may correspond to a top sample 1004 located directly above the sample 1010 in the top neighboring row of block 1002, a top-left sample 1006 located at the intersection of the top neighboring row and the left neighboring row, and a left sample 1008 located directly to the left of the sample 1010 in the left neighboring row of block 1002. For the exemplary DC intra prediction mode, the average of the left neighboring sample 1008 and the top neighboring sample 1004 may be used as the prediction value of the sample 1010. For the exemplary PAETH intra prediction mode, the top reference sample 1004, the left reference sample 1008, and the top-left reference sample 1006 may be fetched, and then whichever of these three reference samples is closest in value to (top + left - top-left) may be set as the prediction value of the sample 1010. For the exemplary SMOOTH_V intra prediction mode, the sample 1010 may be predicted by quadratic interpolation, along the vertical direction, of the top-left neighboring sample 1006 and the left neighboring sample 1008.
For the exemplary SMOOTH_H intra prediction mode, the sample 1010 may be predicted by quadratic interpolation, along the horizontal direction, of the top-left neighboring sample 1006 and the top neighboring sample 1004. For the exemplary SMOOTH intra prediction mode, the sample 1010 may be predicted by the average of the quadratic interpolations along the vertical and horizontal directions. The non-directional intra mode implementations described above are illustrated only as non-limiting examples. Other neighboring rows, other non-directional selections of samples, and other manners of combining prediction samples to predict a particular sample in a prediction block are also contemplated.
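The DC and PAETH rules described above can be sketched per sample as follows. This is a minimal illustration under stated assumptions: the function names, the rounding used in the average, and the tie-breaking order among equally close references are not specified in the text and are chosen here only for the example.

```python
def dc_predict(top: int, left: int) -> int:
    """Average of the top and left reference samples.

    Rounding to nearest is an assumption of this sketch.
    """
    return (top + left + 1) // 2


def paeth_predict(top: int, left: int, top_left: int) -> int:
    """Pick the reference sample closest to (top + left - top_left).

    Ties are resolved in the order left, top, top_left, which is an
    assumption of this sketch (min() keeps the first minimal candidate).
    """
    base = top + left - top_left
    candidates = (left, top, top_left)
    return min(candidates, key=lambda ref: abs(base - ref))
```

For example, with top = 100, left = 50, and top-left = 60, the base value is 90, and the top sample (100) is the closest of the three references, so it becomes the prediction.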
At different coding levels (pictures, slices, blocks, units, etc.), a particular intra-prediction mode is selected by the encoder from the directional or non-directional modes described above, which may be signaled in the codestream. In some exemplary implementations, exemplary 8 nominal directional modes and 5 non-angular smooth modes (13 options total) may be signaled first. Then, if the signaled mode is one of the 8 nominal angle intra modes, the index is further signaled to indicate the selected angle increment to the corresponding signaled nominal angle. In some other example implementations, all intra-prediction modes may be indexed together (e.g., 56 directional modes plus 5 non-directional modes to produce 61 intra-prediction modes) for signaling.
In some example implementations, an example 56 or other number of directional intra prediction modes may be implemented using a unified directional predictor that projects each sample of a block to a reference subsample position and interpolates the reference samples through a 2-tap bilinear filter.
In some implementations, to capture the decaying spatial correlation with references on the edges, additional filter modes referred to as FILTER INTRA modes can be designed. For these modes, in addition to samples outside the block, predicted samples within the block may be used as intra prediction reference samples for some patches within the block. For example, these modes may be predefined and made available for intra prediction of at least luma blocks (or luma blocks only). A predetermined number (e.g., five) of filter intra modes can be pre-designed, each represented by a set of n-tap filters (e.g., 7-tap filters) reflecting the correlation between, for example, the samples in a 4 × 2 patch and the n neighbors adjacent to the patch. In other words, the weighting factors of an n-tap filter may be position dependent. Taking an 8 × 8 block, 4 × 2 patches, and 7-tap filtering as an example, as shown in fig. 11, the 8 × 8 block 1102 may be split into eight 4 × 2 patches. In fig. 11, these patches are denoted B0, B1, B2, B3, B4, B5, B6, and B7. For each patch, its 7 neighbors (indicated by R0 to R6 in fig. 11) can be used to predict the samples in the current patch. For patch B0, all neighbors may have already been reconstructed. For the other patches, however, some neighbors are located in the current block and thus may not have been reconstructed, so the predicted values of the immediate neighbors are used as references. For example, not all neighbors of patch B7 as indicated in fig. 11 are reconstructed, so the predicted samples of those neighbors are used instead.
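The patch layout described above can be sketched as follows. This is an illustration only: the raster ordering of the patches and the neighbor layout (one top-left, four above, two to the left) are assumptions made for this example, since the figure itself is not reproduced here, and all function names are hypothetical.

```python
def split_into_patches(block_w, block_h, patch_w=4, patch_h=2):
    """Split a block into patches, returned as (x, y, w, h) in raster order."""
    return [
        (x, y, patch_w, patch_h)
        for y in range(0, block_h, patch_h)
        for x in range(0, block_w, patch_w)
    ]


def patch_neighbors(x, y, patch_w=4, patch_h=2):
    """Positions of the 7 neighbors of a 4x2 patch (assumed layout)."""
    top_left = [(x - 1, y - 1)]
    top = [(x + i, y - 1) for i in range(patch_w)]
    left = [(x - 1, y + j) for j in range(patch_h)]
    return top_left + top + left


def needs_predicted_refs(patch, block_w, block_h):
    """True if any neighbor lies inside the current block, i.e. it is not yet
    reconstructed and its predicted value would be used as the reference."""
    x, y, w, h = patch
    return any(
        0 <= nx < block_w and 0 <= ny < block_h
        for nx, ny in patch_neighbors(x, y, w, h)
    )
```

Under these assumptions, an 8 × 8 block yields eight patches, and only the first patch has all of its neighbors outside the block.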
In some implementations of intra prediction, one color component may be predicted using one or more other color components. A color component may be any one of the components of a YCrCb, RGB, or XYZ color space, and the like. For example, prediction of a chroma component (e.g., a chroma block) from a luma component (e.g., luma reference samples), referred to as chroma from luma or CfL, may be implemented. In some example implementations, cross-color prediction may be allowed only from luma to chroma. For example, the chroma samples in a chroma block may be modeled as a linear function of the co-located reconstructed luma samples. CfL prediction can be implemented as follows:
$$\mathrm{CfL}(\alpha) = \alpha \times L^{AC} + DC \qquad (1)$$
where $L^{AC}$ denotes the AC contribution of the luma component, α denotes a parameter of the linear model, and DC denotes the DC contribution of the chroma component. For example, the AC component is obtained for each sample of the block, while the DC component is obtained for the entire block. In particular, the reconstructed luma samples may be subsampled to the chroma resolution, and then the average luma value (the DC of luma) may be subtracted from each luma value to form the AC contribution of luma. The AC contribution of luma is then used in the linear model of equation (1) to predict the AC values of the chroma component. To estimate or predict the chroma AC component from the luma AC contribution (rather than requiring the decoder to calculate scaling parameters), an exemplary CfL implementation may determine the parameter α based on the original chroma samples and signal the parameter α in the codestream. This reduces decoder complexity and produces a more accurate prediction. As for the DC contribution of the chroma component, in some example implementations, it may be calculated using the intra DC mode within the chroma component.
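Equation (1) can be sketched as follows. This is a floating-point illustration under stated assumptions: the input is taken to be luma already subsampled to chroma resolution, the function name and argument names are hypothetical, and a practical codec would likely use fixed-point arithmetic and clipping, which are omitted here.

```python
def cfl_predict(luma_subsampled, alpha, chroma_dc):
    """Predict a chroma block as alpha * L_AC + DC, per equation (1).

    luma_subsampled: 2-D list of reconstructed luma samples, assumed to be
                     already subsampled to the chroma resolution.
    alpha:           signaled linear-model parameter.
    chroma_dc:       DC contribution of the chroma component (one per block).
    """
    count = sum(len(row) for row in luma_subsampled)
    luma_dc = sum(sum(row) for row in luma_subsampled) / count
    # The AC contribution is each luma sample minus the block average.
    return [
        [alpha * (value - luma_dc) + chroma_dc for value in row]
        for row in luma_subsampled
    ]
```

Because the AC contribution has zero mean over the block, the average of the predicted chroma block equals the chroma DC value regardless of α.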
In some exemplary implementations, multi-line intra prediction may be used. In these implementations, more than one reference row is available for selection in intra prediction, and the encoder decides and signals which reference row is used to generate the intra prediction. The reference row index may be signaled before the intra prediction mode, and if a non-zero reference row index is signaled, only the most probable prediction modes are allowed. Fig. 15 depicts an example of 4 reference rows (reference row 0 to reference row 3) and the top-left reference sample, where each reference row consists of six sections, section A to section F (as indicated by 1502-1512). Further, sections A and F may be padded with the closest samples from sections B and E, respectively.
Then, the transformation of the residual of the intra-predicted block or the inter-predicted block may be implemented, followed by the quantization of the transform coefficients. To perform the transform, the intra-coded blocks and the inter-coded blocks may be further divided into a plurality of transform blocks (sometimes interchangeably used as "transform units," even though the term "unit" is often used to refer to a set of three color channels, e.g., a "coding unit" may include luma-coded blocks and chroma-coded blocks) prior to the transform. In some implementations, a maximum partition depth of a coding block (or prediction block) may be specified (the term "coded block" may be used interchangeably with "coding block"). For example, such a division does not exceed 2 levels. The operation of dividing a prediction block into transform blocks may be differently processed between an intra-prediction block and an inter-prediction block. However, in some implementations, such partitioning may be similar between intra-predicted blocks and inter-predicted blocks.
In some example implementations, for an intra-coded block, transform partitioning may be performed in such a way that all transform blocks have the same size, and the transform blocks are encoded in raster scan order. Fig. 12 shows an example of such a transform block division of an intra-coded block. In particular, fig. 12 shows the encoding block 1202 divided into 16 transform blocks of the same block size by an intermediate level quadtree split 1204, as shown at 1206. An exemplary raster scan order for encoding is illustrated by the sequential arrows in fig. 12.
In some example implementations, transform unit partitioning may be performed in a recursive manner for inter-coded blocks, where the partitioning depth may be up to a predetermined number of levels (e.g., 2 levels). As shown in fig. 13, the split may stop or continue recursively for any sub-partition and at any level. In particular, fig. 13 shows an example where block 1302 is split into four quadtree sub-blocks 1304, where one sub-block is further split into four second-level transform blocks, while the division of the other sub-blocks stops after the first level, resulting in a total of 7 transform blocks of two different sizes. An exemplary raster scan order for encoding is further illustrated by the sequential arrows in fig. 13. Although fig. 13 shows an exemplary implementation of quadtree splitting into square transform blocks of up to two levels, in some general implementations, the transform partitioning may support 1:1 (square), 1:2/2:1, and 1:4/4:1 transform block shapes and sizes ranging from 4 × 4 to 64 × 64. In some example implementations, if the coding block is smaller than or equal to 64 × 64, the transform block division may be applied only to the luma component (in other words, the chroma transform block is the same as the coding block under this condition). Otherwise, if the coding block width or height is greater than 64, the luma coding block and the chroma coding block may be implicitly split into a plurality of min (W, 64) × min (H, 64) transform blocks and min (W, 32) × min (H, 32) transform blocks, respectively.
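The implicit split for large coding blocks can be sketched as follows. This is a minimal illustration: the function name is hypothetical, W and H are assumed to be the dimensions of the block of the component being split, and raster ordering of the resulting transform blocks is an assumption of the example.

```python
def implicit_transform_blocks(width, height, is_chroma=False):
    """Tile a block into min(W, limit) x min(H, limit) transform blocks.

    The limit is 64 for luma and 32 for chroma, following the text.
    Returns (x, y, w, h) tuples in raster order (an assumption of this sketch).
    """
    limit = 32 if is_chroma else 64
    tb_w, tb_h = min(width, limit), min(height, limit)
    return [
        (x, y, tb_w, tb_h)
        for y in range(0, height, tb_h)
        for x in range(0, width, tb_w)
    ]
```

For instance, a 128 × 128 luma block would be tiled into four 64 × 64 transform blocks, while a 128 × 32 luma block would be tiled into two 64 × 32 transform blocks.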
In some example implementations, another alternative example scheme is provided for dividing an encoded block or a predicted block into transform blocks, as shown in fig. 16. As shown in fig. 16, instead of using recursive transform partitioning, a predetermined set of partition types may be applied to an encoding block according to the transform type of the encoding block. In the particular example shown in fig. 16, one of 6 exemplary partition types may be applied to split a coded block into various numbers of transform blocks. Such a scheme may be applicable to coding blocks or prediction blocks. In this disclosure, the term "partition type" may generally refer to a manner in which a block (e.g., a prediction block or a coding block) is partitioned, which may refer to a "transform partition type", a "prediction block partition type", or a "coding block partition type". Furthermore, for the description under "transform partition type", the same concept can also be applied to "coding block partition type", and vice versa.
In more detail, the partitioning scheme of fig. 16 provides up to 6 partition types for any given transform type. In this scheme, each coding block or prediction block may be assigned a transform partition type based on, for example, a rate-distortion cost. In one example, the partition type assigned to a coding block or prediction block may be determined based on the transform type of the coding block or prediction block. A particular partition type may correspond to a transform block split size and pattern, as shown by the 6 partition types illustrated in fig. 16. The correspondence between the various transform types and the various partition types may be predefined. An exemplary set is shown below, where the capitalized labels indicate the transform partition types that may be assigned to a coding block or prediction block based on a rate-distortion cost:
PARTITION_NONE: a transform size equal to the block size is allocated.
PARTITION_SPLIT: a transform size with 1/2 the width and 1/2 the height of the block size is allocated.
PARTITION_HORZ: a transform size with the same width as the block size and 1/2 the height of the block size is allocated.
PARTITION_VERT: a transform size with 1/2 the width of the block size and the same height as the block size is allocated.
PARTITION_HORZ4: a transform size with the same width as the block size and 1/4 the height of the block size is allocated.
PARTITION_VERT4: a transform size with 1/4 the width of the block size and the same height as the block size is allocated.
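The six partition types listed above can be expressed as width and height divisors. The following is a minimal sketch, assuming uniform transform sizes within each partition type as in this example; the table name, function name, and raster ordering are hypothetical choices for illustration.

```python
# (width divisor, height divisor) for each transform partition type.
PARTITION_DIVISORS = {
    "PARTITION_NONE": (1, 1),
    "PARTITION_SPLIT": (2, 2),
    "PARTITION_HORZ": (1, 2),
    "PARTITION_VERT": (2, 1),
    "PARTITION_HORZ4": (1, 4),
    "PARTITION_VERT4": (4, 1),
}


def partition_transform_blocks(block_w, block_h, partition_type):
    """Return the (x, y, w, h) transform blocks produced by a partition type."""
    div_w, div_h = PARTITION_DIVISORS[partition_type]
    tb_w, tb_h = block_w // div_w, block_h // div_h
    return [
        (x, y, tb_w, tb_h)
        for y in range(0, block_h, tb_h)
        for x in range(0, block_w, tb_w)
    ]
```

For a 16 × 16 block, PARTITION_SPLIT gives four 8 × 8 transform blocks and PARTITION_HORZ4 gives four 16 × 4 transform blocks.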
In the above example, the partition types as shown in fig. 16 each contain a uniform transform size for the divided transform blocks. This is merely an example and not a limitation. In some other implementations, in a particular partition type (or mode), a hybrid transform block size may be used for the partitioned transform blocks.
Then, the primary transform may be performed on each transform block obtained as above. The primary transform essentially moves the residual in a transform block from the spatial domain to the frequency domain. In some implementations of the actual primary transform, to support the exemplary extended coding block partitioning described above, a variety of transform sizes (ranging from 4 points to 64 points for each of the two dimensions) and transform shapes (square; rectangular with aspect ratios of 2:1/1:2 and 4:1/1:4) may be allowed.
Turning to the actual primary transform, in some example implementations, the 2-D transform process may involve the use of hybrid transform kernels (e.g., a hybrid transform kernel may consist of a different 1-D transform for each dimension of the coded residual transform block). Exemplary 1-D transform kernels may include, but are not limited to: a) 4-point, 8-point, 16-point, 32-point, and 64-point DCT-2; b) 4-point, 8-point, and 16-point asymmetric DST (DST-4, DST-7) and flipped versions thereof; c) 4-point, 8-point, 16-point, and 32-point identity transforms. The selection of the transform kernel for each dimension may be based on a rate-distortion (RD) criterion. For example, the basis functions of the DCT-2 and asymmetric DST that may be implemented are listed in Table 1.
Table 1: exemplary Primary transform basis functions (DCT-2, DST-4, and DST-7 for N-point input)
In some example implementations, the availability of a hybrid transform core for a particular primary transform implementation may be based on transform block size and prediction mode. Exemplary dependencies are listed in table 2. For chroma components, the transform type selection may be performed in an implicit manner. For example, for intra prediction residuals, the transform type may be selected according to the intra prediction mode, as specified in table 3. For inter prediction residuals, the transform type of the chroma block may be selected according to a transform type selection of a co-located luma block. Thus, for the chroma component, there is no transform type signaling in the codestream.
Table 2: AV1 hybrid transform kernels and their availability based on prediction mode and block size. Here, → and ↓ indicate a horizontal dimension and a vertical dimension; v and X indicate the availability of the kernel for the block size and prediction mode
Table 3: Transform type selection for chroma component intra prediction residuals

| Intra prediction | Vertical transform | Horizontal transform |
|---|---|---|
| DC_PRED | DCT | DCT |
| V_PRED | ADST | DCT |
| H_PRED | DCT | ADST |
| D45_PRED | DCT | DCT |
| D135_PRED | ADST | ADST |
| D113_PRED | ADST | DCT |
| D157_PRED | DCT | ADST |
| D203_PRED | DCT | ADST |
| D67_PRED | ADST | DCT |
| SMOOTH_PRED | ADST | ADST |
| SMOOTH_V_PRED | ADST | DCT |
| SMOOTH_H_PRED | DCT | ADST |
| PAETH_PRED | ADST | ADST |
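The implicit selection in Table 3 amounts to a lookup from the intra prediction mode to a pair of 1-D transforms. A minimal sketch is given below; the dictionary and function names are hypothetical, and the entries simply transcribe Table 3.

```python
# Intra prediction mode -> (vertical transform, horizontal transform),
# transcribed from Table 3.
CHROMA_INTRA_TRANSFORMS = {
    "DC_PRED": ("DCT", "DCT"),
    "V_PRED": ("ADST", "DCT"),
    "H_PRED": ("DCT", "ADST"),
    "D45_PRED": ("DCT", "DCT"),
    "D135_PRED": ("ADST", "ADST"),
    "D113_PRED": ("ADST", "DCT"),
    "D157_PRED": ("DCT", "ADST"),
    "D203_PRED": ("DCT", "ADST"),
    "D67_PRED": ("ADST", "DCT"),
    "SMOOTH_PRED": ("ADST", "ADST"),
    "SMOOTH_V_PRED": ("ADST", "DCT"),
    "SMOOTH_H_PRED": ("DCT", "ADST"),
    "PAETH_PRED": ("ADST", "ADST"),
}


def chroma_intra_transform(intra_mode: str):
    """Return (vertical, horizontal) transforms for a chroma intra residual."""
    return CHROMA_INTRA_TRANSFORMS[intra_mode]
```

Because both encoder and decoder can evaluate this lookup from the already-known intra prediction mode, no transform type needs to be signaled for the chroma component.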
In some implementations, a secondary transform may be performed on the primary transform coefficients. For example, as shown in fig. 14, a low-frequency non-separable transform (LFNST), also known as a reduced secondary transform, may be applied between the forward primary transform and quantization (at the encoder) and between dequantization and the inverse primary transform (at the decoder) to further decorrelate the primary transform coefficients. In essence, the LFNST may take a portion of the primary transform coefficients, e.g., the low-frequency portion (hence "reduced" relative to the complete set of primary transform coefficients of the transform block), and subject it to the secondary transform. In an exemplary LFNST, a 4 × 4 non-separable transform or an 8 × 8 non-separable transform may be applied according to the transform block size. For example, a 4 × 4 LFNST may be applied to smaller transform blocks (e.g., min (width, height) < 8), while an 8 × 8 LFNST may be applied to larger transform blocks (e.g., min (width, height) > 8). For example, if an 8 × 8 transform block is subjected to a 4 × 4 LFNST, only the low-frequency 4 × 4 portion of the 8 × 8 primary transform coefficients undergoes the secondary transform.
Specifically, as shown in fig. 14, the transform block may be 8 × 8 (or 16 × 16). Thus, the forward primary transform 1402 of the transform block generates an 8 × 8 (or 16 × 16) primary transform coefficient matrix 1404, where each square cell represents a 2 × 2 (or 4 × 4) portion. The input to the forward LFNST may not be the full 8 × 8 (or 16 × 16) set of primary transform coefficients. For example, a 4 × 4 (or 8 × 8) LFNST may be used for the secondary transform. Thus, only the 4 × 4 (or 8 × 8) low-frequency primary transform coefficients of the primary transform coefficient matrix 1404 may be used as input to the LFNST, as indicated by the shaded portion (upper left corner) 1406. The remaining part of the primary transform coefficient matrix may not be subjected to the secondary transform. Thus, after the secondary transform, the portion of the primary transform coefficients subjected to LFNST becomes secondary transform coefficients, while the remaining portion not subjected to LFNST (e.g., the unshaded portion of matrix 1404) retains the corresponding primary transform coefficients. In some example implementations, the remaining portion not subjected to the secondary transform may be set entirely to zero coefficients.
An application example of the non-separable transform used in LFNST is described below. To apply the exemplary 4 × 4 LFNST, the 4 × 4 input block X (e.g., representing the 4 × 4 low-frequency portion of the primary transform coefficient block, such as the shaded portion 1406 of the primary transform coefficient matrix 1404 of fig. 14) may be represented as:
$$X = \begin{bmatrix} X_{00} & X_{01} & X_{02} & X_{03} \\ X_{10} & X_{11} & X_{12} & X_{13} \\ X_{20} & X_{21} & X_{22} & X_{23} \\ X_{30} & X_{31} & X_{32} & X_{33} \end{bmatrix}$$
The 2-D input matrix may first be linearized or scanned into a vector $\vec{X}$ in an exemplary order:
$$\vec{X} = [X_{00}\; X_{01}\; X_{02}\; X_{03}\; X_{10}\; X_{11}\; X_{12}\; X_{13}\; X_{20}\; X_{21}\; X_{22}\; X_{23}\; X_{30}\; X_{31}\; X_{32}\; X_{33}]^{T}$$
Then, the non-separable transform of the 4 × 4 LFNST can be calculated as $\vec{F} = T \cdot \vec{X}$, where $\vec{F}$ indicates the output transform coefficient vector and T is a 16 × 16 transform matrix. The resulting 16 × 1 coefficient vector $\vec{F}$ is then scanned back into a 4 × 4 block using the scan order of the block (e.g., horizontal, vertical, or diagonal). The coefficients with smaller indices may be placed at the smaller scan indices in the 4 × 4 coefficient block. In this manner, redundancy in the primary transform coefficients X can be further exploited by the secondary transform T, providing additional compression enhancement.
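The vectorize, multiply, and scan-back steps of the 4 × 4 non-separable transform can be sketched as follows. This is an illustration only: row-major linearization is assumed as the exemplary order, the scan is passed in as an explicit list of positions, the function names are hypothetical, and the matrix T is a placeholder supplied by the caller rather than an actual trained kernel.

```python
def vectorize(block):
    """Linearize a 4x4 block row by row (assumed exemplary order)."""
    return [coeff for row in block for coeff in row]


def forward_lfnst_4x4(block, transform):
    """Compute F = T * vec(X) for a 16x16 matrix T given as nested lists."""
    x = vectorize(block)
    return [sum(t * v for t, v in zip(row, x)) for row in transform]


def scan_back(coeffs, scan):
    """Place coefficient i at the i-th (row, col) position of the scan."""
    out = [[0] * 4 for _ in range(4)]
    for value, (r, c) in zip(coeffs, scan):
        out[r][c] = value
    return out


# A horizontal (row-by-row) scan, used here only as an example scan order.
HORIZONTAL_SCAN = [(r, c) for r in range(4) for c in range(4)]
```

With an identity matrix as a stand-in for T, the output vector equals the linearized input, which is a convenient sanity check on the indexing before real kernels are plugged in.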
The exemplary LFNST described above applies the non-separable transform by direct matrix multiplication, so that the LFNST is implemented in a single pass without multiple iterations. In some further exemplary implementations, the dimension of the non-separable transform matrix (T) of the exemplary 4 × 4 LFNST may be further reduced to minimize computational complexity and the memory space required for storing transform coefficients. This implementation may be referred to as a reduced secondary transform (RST). In more detail, the main idea of RST is to map an N-dimensional vector (N = 4 × 4 = 16 in the above example, but N may be equal to 64 for an 8 × 8 block) to an R-dimensional vector in a different space, where N/R (R < N) represents the dimension reduction factor. Thus, instead of an N × N transform matrix, the RST matrix becomes an R × N matrix as follows:
$$T_{R \times N} = \begin{bmatrix} t_{11} & t_{12} & t_{13} & \cdots & t_{1N} \\ t_{21} & t_{22} & t_{23} & \cdots & t_{2N} \\ \vdots & & \ddots & & \vdots \\ t_{R1} & t_{R2} & t_{R3} & \cdots & t_{RN} \end{bmatrix}$$
where the R rows of the transform matrix are R reduced bases of the N-dimensional space. The transform thus converts an N-dimensional input vector into a reduced R-dimensional output vector. Accordingly, as shown in fig. 14, the secondary transform coefficients 1408 transformed from the primary coefficients 1406 are reduced in dimension by a factor of N/R. The three squares around 1408 in fig. 14 may be filled with zeros.
The inverse transform matrix of the RST may be the transpose of its forward transform. For an exemplary 8 × 8 LFNST (which is more versatile than the 4 × 4 LFNST described above), an exemplary reduction factor of 4 may be applied, so that the 64 × 64 direct non-separable transform matrix is correspondingly reduced to a 16 × 64 direct matrix. Furthermore, in some implementations, a portion (but not all) of the input primary coefficients may be linearized into the input vector of the LFNST. For example, only a portion of the exemplary 8 × 8 input primary transform coefficients may be linearized into the above-described vector. As one particular example, among the four 4 × 4 quadrants of the 8 × 8 primary transform coefficient matrix, the lower-right quadrant (high-frequency coefficients) may be ignored, and only the other three quadrants are linearized into a 48 × 1 vector, instead of a 64 × 1 vector, using a predetermined scan order. In such implementations, the non-separable transform matrix may be further reduced from 16 × 64 to 16 × 48.
Thus, an exemplary reduced 48 × 16 inverse RST matrix may be used at the decoder side to generate the upper-left, upper-right, and lower-left 4 × 4 quadrants of the 8 × 8 core (primary) transform coefficients. In particular, when the further reduced 16 × 48 RST matrix (instead of the 16 × 64 RST matrix) is applied with the same transform set configuration, the non-separable secondary transform takes as input the 48 vectorized matrix elements from the three 4 × 4 quadrant blocks of the 8 × 8 primary coefficient block (excluding the lower-right 4 × 4 block). In such an implementation, the omitted lower-right 4 × 4 primary transform coefficients are ignored in the secondary transform. This further reduced transform converts a 48 × 1 vector into a 16 × 1 output vector, which is scanned back into a 4 × 4 matrix to fill 1408 of fig. 14. The three squares surrounding the secondary transform coefficients 1408 may be filled with zeros.
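The reduced 8 × 8 case can be sketched as follows: gather 48 coefficients from three quadrants, apply a 16 × 48 matrix in the forward direction, and apply its transpose (48 × 16) in the inverse direction. This is an illustration under stated assumptions: the gathering order (row-major, skipping the lower-right quadrant) stands in for the unspecified predetermined scan order, the function names are hypothetical, and the matrix is a caller-supplied placeholder rather than a trained kernel.

```python
def gather_48(coeffs_8x8):
    """Collect the 48 coefficients outside the lower-right 4x4 quadrant.

    Row-major gathering is an assumption of this sketch.
    """
    return [
        coeffs_8x8[r][c]
        for r in range(8)
        for c in range(8)
        if not (r >= 4 and c >= 4)
    ]


def forward_rst_8x8(coeffs_8x8, rst_16x48):
    """Map the 48-element input vector to 16 secondary coefficients."""
    x = gather_48(coeffs_8x8)
    return [sum(t * v for t, v in zip(row, x)) for row in rst_16x48]


def inverse_rst_8x8(secondary_16, rst_16x48):
    """Apply the transpose (48x16) of the forward matrix to 16 coefficients."""
    return [
        sum(rst_16x48[r][n] * secondary_16[r] for r in range(16))
        for n in range(48)
    ]
```

Using the transpose as the inverse recovers the input exactly only when the rows of the forward matrix are orthonormal and the input lies in their span; in general the round trip is a projection onto the retained 16 bases.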
With this dimensionality reduction of the RST, the amount of memory used to store all LFNST matrices is reduced. For example, in the above example, the memory usage may be reduced from 10KB to 8KB with a relatively small reduction in performance compared to an implementation without dimension reduction.
In some implementations, to reduce complexity, LFNST may be further restricted to be applicable only if all coefficients outside the portion of the primary transform coefficients subjected to LFNST (e.g., outside portion 1406 of 1404 in fig. 14) are non-significant. Thus, when LFNST is applied, all primary-only transform coefficients (e.g., the unshaded portion of the primary coefficient matrix 1404 of fig. 14) may be close to zero. This restriction allows the LFNST index signaling to be conditioned on the last significant position, and thus avoids the extra coefficient scanning that would otherwise be needed to check for significant coefficients at specific positions. In some implementations, the worst-case handling of LFNST (in terms of multiplications per pixel) may restrict the non-separable transforms for 4 × 4 blocks and 8 × 8 blocks to 8 × 16 transforms and 8 × 48 transforms, respectively. In these cases, when LFNST is applied, the last-significant scan position must be less than 8; for other block sizes, it must be less than 16. For blocks with shapes 4 × N and N × 4 and N > 8, the above restriction means that LFNST is applied only once, to the upper-left 4 × 4 region only. Since all primary-only coefficients are zero when LFNST is applied, the number of operations needed for the primary transform is reduced in such cases. From the encoder perspective, the quantization of coefficients can be simplified when LFNST transforms are tested. Rate-distortion optimized quantization (RDOQ) needs to be performed at most for the first 16 coefficients (in scan order), and the remaining coefficients can be forced to zero.
In some example implementations, the available RST kernels may be designated as multiple transformation sets, where each transformation set includes multiple non-separable transformation matrices. For example, there may be a total of 4 transform sets, each with 2 non-separable transform matrices (kernels) for use by LFNST. These kernels may be trained offline in advance, so they are data driven. The offline trained transform cores may be stored in memory or may be hard-coded in the encoding or decoding device for use during the encoding/decoding process. The selection of a transform set during an encoding or decoding process may be determined by an intra-prediction mode. The mapping from the intra prediction mode to the transform set may be predefined. An example of such a predetermined mapping is shown in table 4. For example, as shown in table 4, if one of three cross-component linear model (CCLM) modes (INTRA _ LT _ CCLM, INTRA _ T _ CCLM, or INTRA _ L _ CCLM) is used for the current block (i.e., 81< = predModeIntra < = 83), a transform set 0 may be selected for the current chroma block. For each transform set, the selected non-separable secondary transform candidates may be further specified by explicitly signaled LFNST indices. For example, after transforming the coefficients, the index may be signaled once for each intra-CU in the codestream.
Table 4: Transform set selection table

| IntraPredMode | Transform set index |
|---|---|
| IntraPredMode < 0 | 1 |
| 0 <= IntraPredMode <= 1 | 0 |
| 2 <= IntraPredMode <= 12 | 1 |
| 13 <= IntraPredMode <= 23 | 2 |
| 24 <= IntraPredMode <= 44 | 3 |
| 45 <= IntraPredMode <= 55 | 2 |
| 56 <= IntraPredMode <= 80 | 1 |
| 81 <= IntraPredMode <= 83 | 0 |
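The mapping in Table 4 can be sketched as a simple range lookup. The function name is hypothetical, and raising an error for mode values above 83 is a choice made for this example, since the table does not cover them.

```python
def lfnst_transform_set(intra_pred_mode: int) -> int:
    """Return the transform set index for an intra prediction mode (Table 4)."""
    if intra_pred_mode < 0:
        return 1
    # (inclusive upper bound of the range, transform set index)
    ranges = [(1, 0), (12, 1), (23, 2), (44, 3), (55, 2), (80, 1), (83, 0)]
    for upper, set_index in ranges:
        if intra_pred_mode <= upper:
            return set_index
    raise ValueError("intra prediction mode outside the range of Table 4")
```

For instance, the three CCLM modes (81 to 83) map to transform set 0, consistent with the description of Table 4 above.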
Since, in the above exemplary implementation, the LFNST is restricted from being applicable only when all coefficients lying outside the first subgroup or portion of coefficients are not significant, the LFNST index coding depends on the position of the last significant coefficient (last significant coefficient). Furthermore, the LFNST index may be context coded, but not dependent on the intra prediction mode, and only the first bin may be context coded. Furthermore, LFNST may be applied to intra CUs in intra slices and inter slices, as well as luma and chroma. If dual trees are enabled, the LFNST indices of luminance and chrominance may be signaled separately. For inter slices (dual tree disabled), a single LFNST index may be signaled, for both luma and chroma.
In some example implementations, when an intra sub-partition (ISP) mode is selected, LFNST may be disabled and RST index may not be signaled, since the improvement in performance may be insignificant even if RST is applied to every feasible partition block. Furthermore, disabling RST on the residual of ISP prediction can reduce coding complexity. In some further implementations, when a multiple linear regression intra prediction (MIP) mode is selected, LFNST may also be disabled and the RST index may not be signaled.
Given that large CUs larger than 64 x 64 (or any other predetermined size representing the maximum transform block size) are implicitly split (e.g., TU tiling) due to existing maximum transform size limitations (e.g., 64 x 64), LFNST index searches may increase data buffering by a factor of four for a certain number of decoding pipeline stages. Thus, in some implementations, the maximum size of the allowed LFNST may be limited to, for example, 64 × 64. In some implementations, LFNST may be enabled only when DCT2 is the primary transform.
In some other implementations, an intra secondary transform (IST) is provided for the luma component by defining, for example, 12 sets of secondary transforms, with, for example, 3 kernels in each set. An intra-mode-dependent index may be used to select the transform set. The kernel selection within a set may be based on a signaled syntax element. IST may be enabled when DCT2 or ADST is used as both the horizontal and the vertical primary transform. In some implementations, depending on the block size, either a 4 × 4 non-separable transform or an 8 × 8 non-separable transform may be selected. If min(tx_width, tx_height) < 8, the 4 × 4 IST may be selected. For larger blocks, the 8 × 8 IST may be used. Here, tx_width and tx_height correspond to the width and height of the transform block, respectively. The input to the IST may be the low-frequency primary transform coefficients in zigzag scan order.
Various transforms in the video encoding or decoding process, such as a primary transform of samples in a residual block or a secondary transform of a block of primary transform coefficients, may not be very effective in capturing directional texture patterns, such as edges in a 45-degree direction (e.g., directions far from the horizontal or vertical direction), when only separable transform schemes are used. As described above, in some example implementations, one or more non-separable transform designs may be used for the secondary transform of the primary transform coefficients.
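The block-size rule for choosing between the 4 × 4 and 8 × 8 non-separable transforms can be sketched as follows. The function name `select_ist_size` is hypothetical; only the comparison min(tx_width, tx_height) < 8 is taken from the description.

```python
def select_ist_size(tx_width, tx_height):
    """Return the non-separable IST size (4 or 8) for a transform block.

    Hypothetical helper illustrating the size rule only; it does not
    check whether IST is enabled for the block.
    """
    if tx_width <= 0 or tx_height <= 0:
        raise ValueError("transform block dimensions must be positive")
    # 4x4 IST when the shorter side is below 8 samples, 8x8 IST otherwise.
    return 4 if min(tx_width, tx_height) < 8 else 8
```

For example, a 4 × 16 transform block would use the 4 × 4 transform under this rule, whereas an 8 × 8 or larger block would use the 8 × 8 transform.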
The transform block partitioning and the transform type applied to the partitioned transform blocks may be related to each other. For example, certain transform types may be more suitable for a particular partition type. For example, in contrast to recursive partitioning (e.g., the partitioning previously described in fig. 13), the transform partitioning scheme illustrated in fig. 16 and described above provides non-recursive partition types. If all available transform types are allowed for the partitioned transform blocks in all available partition modes (e.g., the transform partition types in fig. 16), the encoder needs to perform optimization over a large parameter space when determining which transform partition type to use for obtaining the transform blocks and which transform type to use for each partitioned transform block. In practice, a certain set of transform types is generally more suitable for one particular transform partition type than other transform types. In various implementations described below, the interaction between the transform partition type and the transform type may be considered and used to obtain a scheme that restricts the allowable transform types for a particular partition type and, similarly, restricts the allowable partition types for a particular transform type. Such implementations may reduce the optimization space of the encoder when determining the transform partitioning mode and the transform type selection for each partitioned transform block, particularly where non-recursive transform partitioning is used.
These exemplary implementations may be used separately or combined in any order or in any manner. In the disclosure above and below, the terms "coding block," "coded block," and the like may be used to refer to a picture unit on which prediction or transformation is performed. The coding block may be a luma coding block or a chroma coding block. In some cases, the coding/coded block may refer to a prediction block. The term "block size" is used to refer to the width or height, or the maximum of the width and height, or the minimum of the width and height, or the area size (width × height), or the aspect ratio (width:height, or height:width) of a coding block.
Multiple candidate primary transform types
In one embodiment, a block may have multiple candidate primary transform types. The block may comprise a transform block resulting from the partitioning. The selection and/or signaling of the primary transform type may be limited to a set of predetermined transform partition types. The set of predetermined transform partition types may be a subset of a larger set of available transform partition types (e.g., the complete set of partition types in fig. 16). In other words, the selection of the primary transform is signaled only if the transform partition type of the block belongs to the set of predetermined transform partition types. Otherwise, for example, if the transform partition type is another type, a default transform type may be used instead of selecting and signaling the primary transform type.
In one implementation, the primary transform type may include at least one of: Discrete Cosine Transform (DCT) type 1 to DCT type 8; Asymmetric Discrete Sine Transform (ADST); Discrete Sine Transform (DST) type 1 to DST type 8; Line Graph Transform (LGT); or Karhunen-Loeve Transform (KLT).
In one implementation, the predetermined set of transform partition types includes only, for example, PARTITION_NONE among the various transform partition types of fig. 16, i.e., the transform block size is equal to the prediction block (or coding block) size. Thus, the primary transform type may be selected and/or signaled only if the transform partition type belongs to the predetermined set.
In one implementation, the set of predetermined transform partition types may also be determined in consideration of the number of partitions. For example, a particular transform partition type may be considered to be part of a set of predetermined transform partition types, for which a primary transform type may be selected and/or signaled, only when the number of partitions for the particular transform partition type is less than or equal to a predetermined threshold. In one implementation, the predetermined threshold may be an integer from 1 to 16.
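The partition-count criterion can be sketched as a simple filter. In this sketch only PARTITION_NONE is taken from the description; the other partition-type names and their sub-block counts in `PARTITION_COUNTS`, as well as the function names, are illustrative assumptions and are not the partition types of fig. 16.

```python
# Hypothetical partition types mapped to the number of transform blocks each
# produces. Only PARTITION_NONE is named in the description.
PARTITION_COUNTS = {
    "PARTITION_NONE": 1,
    "PARTITION_HORZ": 2,
    "PARTITION_VERT": 2,
    "PARTITION_SPLIT": 4,
}


def allowed_partition_types(threshold):
    """Partition types whose partition count does not exceed the threshold."""
    if not 1 <= threshold <= 16:
        raise ValueError("threshold is assumed to be an integer from 1 to 16")
    return {name for name, count in PARTITION_COUNTS.items() if count <= threshold}


def primary_type_is_signaled(partition_type, threshold=1):
    """True if the primary transform type would be selected and signaled."""
    return partition_type in allowed_partition_types(threshold)
```

With a threshold of 1, this sketch reduces to the PARTITION_NONE-only case described above; larger thresholds admit partition types that produce more transform blocks.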
Multiple candidate secondary transform types
In one embodiment, a block may have multiple candidate secondary transform types. The block may comprise a transform block resulting from the partitioning. The selection and/or signaling of the secondary transform type is applicable only to a set of predetermined transform partition types. The set of predetermined transform partition types may be a subset of a larger set of available transform partition types (e.g., the complete set of transform partition types in fig. 16). In other words, the selection of the secondary transform is signaled only if the transform partition type of the block belongs to the set of predetermined transform partition types. Otherwise, for example, if the transform partition type is another type, a default secondary transform type may be used instead of selecting and signaling the secondary transform type, or no secondary transform may be performed.
In one implementation, the secondary transform type may include KLT. The KLT may be configured with different kernels.
In one implementation, the predetermined set of transform partition types may include only, for example, PARTITION_NONE among the various transform partition types of fig. 16, i.e., the transform block size is equal to the prediction block (or coding block) size. Thus, a secondary transform type may be selected and/or signaled only if the transform partition type belongs to the predetermined set.
In one implementation, the set of predetermined transform partition types may also be determined taking into account the number of partitions. For example, a particular transform partition type may be considered to be part of the set of predetermined transform partition types, for which a secondary transform type may be selected and/or signaled, only when the number of partitions for the particular transform partition type is less than or equal to a predetermined threshold. In one implementation, the predetermined threshold may be an integer from 1 to 16.
In one implementation, the selection and/or signaling of the secondary transform type may be based on a combination of the transform partition type and the primary transform type. The combination may include a transform partition type in a predetermined set of transform partition types and a primary transform type in a predetermined set of transform types. For example, a secondary transform type may need to be selected and/or signaled only when the transform partition type is PARTITION_NONE and the primary transform type used by the block is DCT or ADST. Otherwise, for example, if the transform partition type is another type, a default secondary transform type may be used instead of selecting and signaling the secondary transform type, or no secondary transform may be performed.
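The combined condition on partition type and primary transform type can be sketched as follows. The set contents mirror the example in the preceding paragraph (PARTITION_NONE with DCT or ADST); the constant and function names are hypothetical, and string labels stand in for whatever enumerations an actual codec would use.

```python
# Example sets taken from the paragraph above; the names are illustrative.
SECONDARY_PARTITION_SET = frozenset({"PARTITION_NONE"})
SECONDARY_PRIMARY_SET = frozenset({"DCT", "ADST"})


def secondary_type_is_signaled(partition_type, primary_type):
    """True only when both the partition type and primary type qualify."""
    return (
        partition_type in SECONDARY_PARTITION_SET
        and primary_type in SECONDARY_PRIMARY_SET
    )
```

When the function returns False, a decoder following this sketch would fall back to a default secondary transform type or skip the secondary transform, as described above.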
Transform related signaling
In this disclosure, various signaling mechanisms are disclosed with the purpose of improving signaling efficiency, taking into account the order of the transform-related syntax elements/parameters.
In one embodiment, the transform partition type information may be signaled prior to the primary/secondary transform type selection information. The primary/secondary transform type selection needs to be signaled only when the transform partition type belongs to a predetermined set of transform partition types (e.g., PARTITION_NONE). Otherwise, if the transform partition type does not belong to the set of predetermined transform partition types, the primary/secondary transform type selection may not need to be signaled. Instead, the primary/secondary transform type may be derived as a predetermined default transform type.
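A decoder-side parsing order consistent with this embodiment might look like the sketch below. The bitstream is modeled as an iterator of already entropy-decoded symbols, which is a simplification; the default transform type "DCT2" and all names are assumptions introduced for illustration and are not specified by the description.

```python
SIGNALED_PARTITION_SET = frozenset({"PARTITION_NONE"})
DEFAULT_TRANSFORM_TYPE = "DCT2"  # assumed default, not specified in the text


def parse_partition_then_type(symbols):
    """Read the partition type first; read the transform type only if allowed.

    `symbols` is an iterator over decoded syntax elements (a stand-in for a
    real entropy decoder). Returns (partition_type, transform_type).
    """
    partition_type = next(symbols)
    if partition_type in SIGNALED_PARTITION_SET:
        # The transform type is present in the bitstream and is consumed.
        transform_type = next(symbols)
    else:
        # Nothing is consumed; the transform type is derived as the default.
        transform_type = DEFAULT_TRANSFORM_TYPE
    return partition_type, transform_type
```

The point of the sketch is that no transform type symbol is consumed when the partition type falls outside the predetermined set, which is where the signaling saving would come from.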
In one embodiment, the primary/secondary transform type selection information may be signaled prior to the transform partition type information. In this case, the selection and/or signaling of the transform partition type may depend on the primary/secondary transform type selection information.
In one implementation, the transform partition type information may need to be signaled only when the primary transform type belongs to a predetermined set of transform types. Otherwise, the transform partition type information may not need to be signaled. For example, the set of predetermined transform types may include, but is not limited to, DCT type 1 through DCT type 8, ADST, DST type 1 through DST type 8, LGT, and KLT.
In one implementation, the transform partition type information may need to be signaled only when the secondary transform type belongs to a predetermined set of transform types. As an example, the set of predetermined transform types may include, but is not limited to, a particular KLT having a kernel associated with a predetermined KLT index. Otherwise, if the secondary transform type does not belong to the set of predetermined transform types, the transform partition type information may not need to be signaled. Instead, the transform partition type may be derived as a predetermined default transform partition type (e.g., PARTITION_NONE).
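The reverse signaling order can be sketched in the same style. Here the transform type is read first and the partition type is read only when the transform type belongs to a caller-supplied predetermined set; otherwise PARTITION_NONE is used as the default, as in the example above. The function name and the label "KLT_0" (standing for a KLT with a particular kernel index) are hypothetical.

```python
DEFAULT_PARTITION_TYPE = "PARTITION_NONE"


def parse_type_then_partition(symbols, partition_signaling_types):
    """Read the transform type first; read the partition type only if allowed.

    `symbols` is an iterator over decoded syntax elements (a stand-in for a
    real entropy decoder). Returns (transform_type, partition_type).
    """
    transform_type = next(symbols)
    if transform_type in partition_signaling_types:
        # The partition type is present in the bitstream and is consumed.
        partition_type = next(symbols)
    else:
        # Nothing is consumed; the partition type is derived as the default.
        partition_type = DEFAULT_PARTITION_TYPE
    return transform_type, partition_type
```

For instance, with `partition_signaling_types = {"KLT_0"}`, a block whose secondary transform type is any other value would have its partition type inferred rather than parsed.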
Fig. 17 illustrates an example method 1700 for decoding video data. The method 1700 may include some or all of the following steps: step 1710, receiving an encoded video bitstream of a data block; step 1720, extracting a transform partition type associated with the data block from the encoded video bitstream; step 1730, in response to the transform partition type belonging to a subset of a set of predetermined transform partition types, each transform partition type of the set of predetermined transform partition types specifying a splitting mode for splitting the data block into transform blocks: extracting a transform type of a transform associated with a transform block split from the data block, the transform type being signaled in the encoded video bitstream, wherein the transform type belongs to a first set of predetermined transform types; and performing an inverse transform on the transform block according to the transform type.
In embodiments of the present disclosure, any steps and/or operations may be combined or arranged in any number or order as desired. Two or more steps and/or operations may be performed in parallel.
The embodiments of the present disclosure may be used alone or combined in any order. Further, each method (or embodiment), encoder, and decoder may be implemented by processing circuitry (e.g., one or more processors or one or more integrated circuits). In one example, one or more processors execute a program stored in a non-transitory computer readable medium. Embodiments of the present disclosure may be applicable to either luma or chroma blocks.
The techniques described above may be implemented as computer software using computer readable instructions and physically stored on one or more computer readable media. For example, fig. 18 illustrates a computer system (1800) suitable for implementing certain embodiments of the disclosed subject matter.
The computer software may be coded using any suitable machine code or computer language that may be subject to assembly, compilation, linking, or similar mechanisms to create code comprising instructions that may be executed directly by one or more computer Central Processing Units (CPUs), Graphics Processing Units (GPUs), and the like, or through interpretation, microcode execution, and the like.
The instructions may be executed on various types of computers or components thereof, including, for example, personal computers, tablet computers, servers, smart phones, gaming devices, internet of things devices, and the like.
The components of computer system (1800) shown in FIG. 18 are exemplary in nature and are not intended to suggest any limitation as to the scope of use or functionality of the computer software implementing embodiments of the present disclosure. Neither should the configuration of the components be interpreted as having any dependency or requirement relating to any one or combination of components illustrated in the exemplary embodiments of the computer system (1800).
The computer system (1800) may include some human interface input devices. Such human interface input devices may be responsive to input by one or more human users through, for example: tactile input (e.g., keystrokes, strokes, data glove movements), audio input (e.g., speech, clapping hands), visual input (e.g., gestures), olfactory input (not depicted). The human interface device may also be used to capture certain media that are not necessarily directly related to human conscious input, such as audio (e.g., speech, music, ambient sounds), images (e.g., scanned images, captured images taken from a still image camera), video (e.g., two-dimensional video, three-dimensional video including stereoscopic video).
The human interface input devices may include one or more of the following (only one of each is depicted): keyboard (1801), mouse (1802), touch pad (1803), touch screen (1810), data glove (not shown), joystick (1805), microphone (1806), scanner (1807), camera (1808).
The computer system (1800) may also include certain human interface output devices. Such human interface output devices may stimulate the senses of one or more human users through, for example, tactile output, sound, light, and smell/taste. Such human interface output devices may include tactile output devices (e.g., tactile feedback by a touch screen (1810), a data glove (not shown), or a joystick (1805), although there may also be tactile feedback devices that do not serve as input devices), audio output devices (e.g., speakers (1809), headphones (not depicted)), visual output devices (e.g., screens (1810) including CRT screens, LCD screens, plasma screens, and OLED screens, each with or without touch screen input capability and each with or without tactile feedback capability, some of which may be capable of outputting two-dimensional visual output or more than three-dimensional output through means such as stereoscopic image output; virtual reality glasses (not depicted); holographic displays and smoke tanks (not depicted)), and printers (not depicted).
The computer system (1800) may also include human-accessible storage devices and their associated media, such as optical media including CD/DVD ROM/RW (1820) with CD/DVD or similar media (1821), thumb drives (1822), removable hard drives or solid state drives (1823), legacy magnetic media (not depicted) such as magnetic tape and floppy disks, specialized ROM/ASIC/PLD based devices (not depicted) such as security dongles, and so forth.
Those skilled in the art will also appreciate that the term "computer-readable medium" used in connection with the presently disclosed subject matter does not encompass transmission media, carrier waves, or other transitory signals.
The computer system (1800) may also include an interface (1854) to one or more communication networks (1855). The network may be, for example, a wireless network, a wired network, or an optical network. The network may further be a local area network, a wide area network, a metropolitan area network, a vehicular or industrial network, a real-time network, a delay-tolerant network, and so on. Examples of networks include local area networks such as Ethernet and wireless LANs; cellular networks including GSM, 3G, 4G, 5G, LTE, and the like; wired or wireless wide-area digital television networks including cable television, satellite television, and terrestrial broadcast television; vehicular and industrial networks including CANBus; and so forth. Some networks typically require external network interface adapters attached to certain general-purpose data ports or peripheral buses (1849) (e.g., USB ports of the computer system (1800)); others are typically integrated into the core of the computer system (1800) by attachment to a system bus as described below (e.g., an Ethernet interface in a PC computer system or a cellular network interface in a smartphone computer system). The computer system (1800) may communicate with other entities using any of these networks. Such communication may be unidirectional and receive-only (e.g., broadcast television), unidirectional and send-only (e.g., CANBus to certain CANBus devices), or bidirectional, e.g., to other computer systems using local or wide-area digital networks. As noted above, certain protocols and protocol stacks may be used on each of those networks and network interfaces.
The human interface devices, human-accessible storage devices, and network interfaces described above may be attached to the core (1840) of the computer system (1800).
The core (1840) may include one or more Central Processing Units (CPUs) (1841), Graphics Processing Units (GPUs) (1842), special purpose programmable processing units in the form of Field Programmable Gate Arrays (FPGAs) (1843), hardware accelerators (1844) for certain tasks, graphics adapters (1850), and so forth. These devices, as well as Read Only Memory (ROM) (1845), Random Access Memory (RAM) (1846), and internal mass storage (1847) such as internal non-user-accessible hard drives, SSDs, and the like, may be connected by a system bus (1848). In some computer systems, the system bus (1848) may be accessible in the form of one or more physical plugs to enable expansion by additional CPUs, GPUs, and the like. Peripheral devices may be attached directly to the system bus (1848) of the core or through a peripheral bus (1849). In one example, the screen (1810) may be connected to the graphics adapter (1850). Architectures for the peripheral bus include PCI, USB, and the like.
The CPU (1841), GPU (1842), FPGA (1843), and accelerator (1844) may execute certain instructions, which may be combined to form the computer code described above. The computer code may be stored in ROM (1845) or RAM (1846). Transitional data may also be stored in RAM (1846), while persistent data may be stored in internal mass storage (1847), for example. Fast storage and retrieval to any storage device may be performed by using a cache, which may be closely associated with: one or more CPUs (1841), GPUs (1842), mass storage (1847), ROMs (1845), RAMs (1846), and the like.
The computer-readable medium may have computer code thereon for performing various computer-implemented operations. The media and computer code may be those specially designed and constructed for the purposes of the present disclosure, or they may be of the kind well known and available to those having skill in the computer software arts.
By way of non-limiting example, a computer system having the architecture (1800), and in particular the core (1840), may provide functionality as a result of one or more processors (including CPUs, GPUs, FPGAs, accelerators, and the like) executing software embodied in one or more tangible computer-readable media. Such computer-readable media may be media associated with user-accessible mass storage as described above, as well as certain non-transitory memory of the core (1840), such as core-internal mass storage (1847) or ROM (1845). Software implementing various embodiments of the present disclosure may be stored in such devices and executed by the core (1840). The computer-readable medium may include one or more memory devices or chips, according to particular needs. The software may cause the core (1840), and in particular the processors therein (including CPUs, GPUs, FPGAs, and the like), to perform certain processes or certain portions of certain processes described herein, including defining data structures stored in RAM (1846) and modifying such data structures according to processes defined by the software. Additionally or alternatively, the computer system may provide functionality as a result of logic that is hardwired or otherwise embodied in circuitry (e.g., accelerator (1844)), which may operate in place of or in conjunction with software to perform certain processes or certain portions of certain processes described herein. Where appropriate, reference to software may encompass logic, and vice versa. Where appropriate, reference to a computer-readable medium may encompass circuitry (e.g., an Integrated Circuit (IC)) that stores software for execution, circuitry embodying logic for execution, or both. The present disclosure encompasses any suitable combination of hardware and software.
While this disclosure has described several exemplary embodiments, there are alterations, permutations, and various substitute equivalents, which fall within the scope of this disclosure. It will thus be appreciated that those skilled in the art will be able to devise numerous systems and methods which, although not explicitly shown or described herein, embody the principles of the disclosure and are thus within the spirit and scope of the disclosure.
Appendix A: acronyms
JEM: joint exploration model
VVC: versatile video coding
BMS: benchmark set
MV: motion vector
HEVC: high efficiency video coding
SEI: supplementary enhancement information
VUI: video usability information
GOP: group of pictures
TU: transform unit
PU: prediction unit
CTU: coding tree unit
CTB: coding tree block
PB: prediction block
HRD: hypothetical reference decoder
SNR: signal to noise ratio
CPU: central processing unit
GPU: graphics processing unit
CRT: cathode ray tube
LCD: liquid crystal display
OLED: organic light emitting diode
CD: compact disc
DVD: digital video disc
ROM: read-only memory
RAM: random access memory
ASIC: application specific integrated circuit
PLD: programmable logic device
LAN: local area network
GSM: global system for mobile communications
LTE: long term evolution
CANBus: controller area network bus
USB: universal serial bus
PCI: peripheral component interconnect
FPGA: field programmable gate array
SSD: solid state drive
IC: integrated circuit
HDR: high dynamic range
SDR: standard dynamic range
JVET: joint video exploration team
MPM: most probable mode
WAIP: wide-angle intra prediction
CU: coding unit
PU: prediction unit
TU: transform unit
CTU: coding tree unit
PDPC: position dependent prediction combination
ISP: intra sub-partition
SPS: sequence parameter set
PPS: picture parameter set
APS: adaptive parameter set
VPS: video parameter set
DPS: decoding parameter set
ALF: adaptive loop filter
SAO: sample adaptive offset
CC-ALF: cross-component adaptive loop filter
CDEF: constrained directional enhancement filter
CCSO: cross-component sample offset
LSO: local sample offset
LR: loop restoration filter
AV1: AOMedia Video 1
AV2: AOMedia Video 2

Claims (22)

1. A method for decoding video data in a decoder, the method comprising:
receiving an encoded video bitstream of a data block;
extracting a transform partition type associated with the data block from the encoded video bitstream; and
in response to the transform partition type belonging to a subset of a predetermined set of transform partition types, each transform partition type in the predetermined set of transform partition types specifying a splitting mode for splitting the data block into transform blocks:
extracting a transform type of a transform associated with a transform block split from the data block, the transform type being signaled in the encoded video bitstream, wherein the transform type belongs to a first set of predetermined transform types; and
performing an inverse transform on the transform block according to the transform type.
2. The method of claim 1, further comprising: in response to the transform partition type not belonging to a subset of the set of predetermined transform partition types:
determining a transform type associated with the transform block as a predetermined default transform type.
3. The method of claim 1, wherein the transform is a primary transform and the first set of predetermined transform types comprises:
Discrete Cosine Transform (DCT) type 1 to DCT type 8;
Asymmetric Discrete Sine Transform (ADST);
Discrete Sine Transform (DST) type 1 to DST type 8;
Line Graph Transform (LGT); and
Karhunen-Loeve Transform (KLT).
4. The method of claim 1, wherein the subset of the set of predetermined transform partition types comprises PARTITION_NONE without transform block partitioning.
5. The method of claim 1, wherein a number of transform partitions associated with each transform partition type in the subset of the set of predetermined transform partition types is less than or equal to a predetermined threshold.
6. The method of claim 5, wherein the predetermined threshold is an integer from 1 to 16, inclusive.
7. The method according to any of claims 1 to 6, wherein the transform is a secondary transform, the first set of predetermined transform types comprising KLT.
8. The method of claim 7, wherein the subset of the predetermined set of transform partition types comprises PARTITION_NONE without transform block partitioning.
9. The method of claim 7, wherein a number of transform partitions associated with each transform partition type in the subset of the set of predetermined transform partition types is less than or equal to a predetermined threshold.
10. The method of claim 7, wherein the transform type further indicates: the type of primary transform associated with the transform block belongs to a second set of predetermined transform types.
11. The method of claim 10, wherein the subset of the set of predetermined transform partition types includes PARTITION_NONE without transform block partitioning, and wherein the second set of predetermined transform types includes DCT and ADST.
12. A method for decoding video data, the method comprising:
receiving an encoded video bitstream of a data block;
extracting a transform partition type associated with a data block of the video data from the encoded video bitstream;
extracting a transform type associated with a transform block of the data block from the encoded video bitstream in response to the transform partition type belonging to a subset of a predetermined set of transform partition types; and
in response to the transform partition type not belonging to the set of predetermined transform partition types, identifying a transform type for the data block in a default manner.
13. The method of claim 12, wherein the subset of the set of predetermined transform partition types includes PARTITION_NONE without transform block partitioning.
14. The method of claim 12, wherein the transform type comprises a primary transform type or a secondary transform type.
15. A method for decoding video data, the method comprising:
receiving an encoded video bitstream of a data block;
extracting a transform type of a transform associated with a transform block of the data block from the encoded video bitstream; and
in response to the transform type belonging to a predetermined set of transform types:
extracting a transform partition type associated with the data block from the encoded video bitstream.
16. The method of claim 15, further comprising:
in response to the transform type not belonging to the set of predetermined transform types:
determining a transform partition type associated with the data block as a predetermined default transform partition type, the predetermined default transform partition type comprising PARTITION_NONE.
17. The method of claim 15, wherein the transform is a primary transform, the set of predetermined transform types comprising:
DCT type 2;
ADST;
DCT type 1 to DCT type 8;
DST type 1 to DST type 8;
LGT; or
KLT.
18. The method of claim 15, wherein the transform is a secondary transform, the set of predetermined transform types including a KLT having a kernel associated with a predetermined KLT index.
19. An apparatus comprising circuitry configured to implement the method of any of claims 1 to 6 or the method of any of claims 12 to 18.
20. An apparatus comprising circuitry configured to implement the method of claim 7.
21. A computer program product comprising a non-transitory computer-readable program medium having stored thereon computer code that, when executed by one or more processors, causes the one or more processors to implement the method of any of claims 1-6 or the method of any of claims 12-18.
22. A computer program product comprising a non-transitory computer-readable program medium having stored thereon computer code that, when executed by one or more processors, causes the one or more processors to implement the method of claim 7.
HK62023069256.3A 2021-04-16 2022-01-28 Methods, apparatus and storage medium for decoding video data HK40080098B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US63/175,897 2021-04-16
US17/568,275 2022-01-04

Publications (2)

Publication Number Publication Date
HK40080098A true HK40080098A (en) 2023-04-28
HK40080098B HK40080098B (en) 2025-06-20
