US20130089154A1 - Adaptive frame size support in advanced video codecs - Google Patents
- Publication number: US20130089154A1 (application US13/648,174)
- Authority: United States
- Prior art keywords: sequence, sub, parameter set, resolution, video
- Legal status: Abandoned (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
All under H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals:
- H04N19/573—Motion compensation with multiple frame prediction using two or more reference frames in a given prediction direction
- H04N19/423—Implementation details or hardware specially adapted for video compression or decompression, characterised by memory arrangements
- H04N19/46—Embedding additional information in the video signal during the compression process
- H04N19/70—Syntax aspects related to video coding, e.g. related to compression standards
- H04N19/33—Hierarchical techniques, e.g. scalability, in the spatial domain
- H04N19/58—Motion compensation with long-term prediction, i.e. the reference frame for a current frame not being the temporally closest one
- H04N19/61—Transform coding in combination with predictive coding
Description
- This disclosure relates to video coding and, more particularly, to techniques for coding video data.
- Digital video capabilities can be incorporated into a wide range of devices, including digital televisions, digital direct broadcast systems, wireless broadcast systems, personal digital assistants (PDAs), laptop or desktop computers, tablet computers, e-book readers, digital cameras, digital recording devices, digital media players, video gaming devices, video game consoles, cellular or satellite radio telephones, so-called “smart phones,” video teleconferencing devices, video streaming devices, and the like.
- Digital video devices implement video compression techniques, such as those described in the standards defined by MPEG-2, MPEG-4, ITU-T H.263, ITU-T H.264/MPEG-4, Part 10, Advanced Video Coding (AVC), the High Efficiency Video Coding (HEVC) standard presently under development, and extensions of such standards.
- The video devices may transmit, receive, encode, decode, and/or store digital video information more efficiently by implementing such video compression techniques.
- Video compression techniques perform spatial (intra-picture) prediction and/or temporal (inter-picture) prediction to reduce or remove redundancy inherent in video sequences.
- For block-based video coding, a video slice (i.e., a video picture or a portion of a video picture) may be partitioned into video blocks, which may also be referred to as treeblocks, coding tree blocks (CTBs), coding tree units (CTUs), coding units (CUs), and/or coding nodes.
- Video blocks in an intra-coded (I) slice of a picture are encoded using spatial prediction with respect to reference samples in neighboring blocks in the same picture.
- Video blocks in an inter-coded (P or B) slice of a picture may use spatial prediction with respect to reference samples in neighboring blocks in the same picture or temporal prediction with respect to reference samples in other reference pictures.
- Pictures may be referred to as frames, and reference pictures may be referred to as reference frames.
- Residual data represents pixel differences between the original block to be coded and the predictive block.
- An inter-coded block is encoded according to a motion vector that points to a block of reference samples forming the predictive block, and the residual data indicating the difference between the coded block and the predictive block.
- An intra-coded block is encoded according to an intra-coding mode and the residual data.
- The residual data may be transformed from the pixel domain to a transform domain, resulting in residual transform coefficients, which then may be quantized.
- The quantized transform coefficients, initially arranged in a two-dimensional array, may be scanned in order to produce a one-dimensional vector of transform coefficients, and entropy coding may be applied to achieve even more compression.
- This disclosure describes techniques for coding video sequences that include frames, or “pictures,” having different spatial resolutions.
- One aspect of this disclosure includes using multiple sequence parameter sets in a single resolution-adaptive coded video sequence to indicate a resolution of a sequence of pictures in coded video.
- The resolution-adaptive coded video sequence may comprise two or more coded sub-sequences, wherein each sub-sequence may comprise a set of pictures with a common spatial resolution, and may refer to a same active sequence parameter set.
- Another aspect of this disclosure includes a novel activation process for activating a sequence parameter set when using multiple sequence parameter sets in a single resolution-adaptive coded video sequence, as described above.
- A size of a decoded picture buffer (DPB) is not indicated using a number of frame buffers (e.g., a number of storage locations each capable of storing a frame, or “picture,” of a fixed size), consistent with some techniques, but rather using a different unit of size.
- The availability of the DPB to store a decoded picture is determined based on a spatial resolution of the decoded picture to be inserted, so as to ensure that the DPB includes sufficient empty buffer space for inserting the decoded picture.
- The availability of the DPB to store a subsequent decoded picture is determined based on a spatial resolution of a removed decoded picture, and a spatial resolution of the subsequent decoded picture to be inserted into the DPB.
- The proportion of the DPB unavailable to store decoded pictures, or a “fullness” of the DPB, after removing a decoded picture is not decreased by an amount corresponding to a single decoded picture of a fixed size, consistent with some techniques, but rather by a varying amount, depending on the spatial resolution of the removed decoded picture.
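- The buffer accounting described above can be sketched as follows. This is an illustrative model only, not the patent's normative process: it assumes DPB capacity is measured in total luma samples, and all class and method names are hypothetical.

```python
# Sketch (illustrative assumption): a DPB whose capacity and fullness are
# tracked in total luma samples rather than in fixed-size frame buffers,
# so pictures of different resolutions consume different amounts of space.

class SampleBudgetDPB:
    def __init__(self, capacity_samples):
        self.capacity = capacity_samples   # DPB size in luma samples
        self.fullness = 0                  # samples currently occupied
        self.pictures = []                 # (poc, width, height)

    def can_insert(self, width, height):
        # availability depends on the incoming picture's resolution
        return self.fullness + width * height <= self.capacity

    def insert(self, poc, width, height):
        if not self.can_insert(width, height):
            raise MemoryError("insufficient DPB space")
        self.pictures.append((poc, width, height))
        self.fullness += width * height

    def remove(self, poc):
        for pic in self.pictures:
            if pic[0] == poc:
                self.pictures.remove(pic)
                # fullness drops by a resolution-dependent amount,
                # not by one fixed frame buffer
                self.fullness -= pic[1] * pic[2]
                return
        raise KeyError(poc)
```

For example, a DPB sized for two 1080p pictures cannot hold a 1080p and a 720p picture plus another 1080p picture, but removing the 720p picture frees exactly 1280×720 samples, which is enough for the second 1080p insertion.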
- FIG. 1 is a block diagram illustrating an example video encoding and decoding system that may utilize techniques described in this disclosure.
- FIG. 2 is a block diagram illustrating an example video encoder that may implement the techniques described in this disclosure.
- FIG. 3 is a block diagram illustrating an example video decoder that may implement the techniques described in this disclosure.
- FIGS. 4A-4D are conceptual diagrams illustrating an example video sequence that includes a plurality of pictures that are encoded and transmitted in accordance with the techniques of this disclosure.
- FIG. 5 is a conceptual diagram illustrating the operation of a decoded picture buffer of a hypothetical reference decoder (HRD) model in accordance with the techniques of this disclosure.
- FIG. 6 is a flowchart illustrating an example operation of using a first sub-sequence and a second sub-sequence to decode video in accordance with the techniques of this disclosure.
- FIG. 7 is a flowchart illustrating an example operation of managing a decoded picture buffer in accordance with the techniques of this disclosure.
- The techniques of this disclosure generally relate to using multiple sequence parameter sets (SPSs) for communicating video data at different resolutions, and to managing the multiple SPSs.
- Additional syntax information for the CVS, also signaled in the SPS, includes the Largest Coding Unit (LCU) size and the Smallest Coding Unit (SCU) size, which define a largest and a smallest block, or coding unit, size for each picture, respectively.
- A CVS may refer to a sequence of coded pictures starting from an instantaneous decoding refresh (IDR) picture to another IDR picture, exclusive, in decoding order, or to the end of the coded video bitstream if the starting IDR picture is the last IDR picture in the coded video bitstream.
- HEVC may support resolution-adaptive video sequences that include frames with different resolutions.
- One method for adaptive frame size support is described in JCTVC-F158: Resolution switching for coding efficiency and resilience, Davies, 6th Meeting, Turin, IT, 14-22 Jul. 2011, referred to as JCTVC-F158 hereinafter.
- Each SPS of the multiple SPSs may include information related to a sequence of pictures that has a different resolution.
- This disclosure also introduces a new sequence, referred to as a resolution sub-sequence (RSS), that may refer back to one of the multiple SPSs in order to indicate the resolution of a sequence of pictures.
- This disclosure also describes techniques for activating a single SPS when multiple parameter sets may be utilized within a single CVS, as well as different techniques and orders for transmitting the different SPSs.
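- The idea of keeping multiple SPSs available within one CVS, with each sub-sequence referring back to the SPS that carries its resolution, can be sketched as follows. This is a simplified illustration; the field names and the activation rule shown are assumptions, not HEVC syntax.

```python
# Illustrative sketch: several SPSs are retained for one coded video
# sequence, and the SPS a picture refers to is activated when it changes,
# i.e., at the boundary of a sub-sequence with a new resolution.

class SPS:
    def __init__(self, sps_id, width, height, lcu_size):
        self.sps_id = sps_id        # identifier referenced by pictures
        self.width = width          # picture width for this sub-sequence
        self.height = height        # picture height for this sub-sequence
        self.lcu_size = lcu_size    # largest coding unit size

class Decoder:
    def __init__(self):
        self.sps_by_id = {}         # all SPSs received for the CVS
        self.active_sps = None

    def receive_sps(self, sps):
        self.sps_by_id[sps.sps_id] = sps

    def decode_picture(self, referenced_sps_id):
        sps = self.sps_by_id[referenced_sps_id]
        if sps is not self.active_sps:
            # activation: a new sub-sequence (new resolution) begins
            self.active_sps = sps
        return (self.active_sps.width, self.active_sps.height)
```

Under this sketch, a bitstream can switch between 1080p and 720p sub-sequences simply by referencing a different stored SPS, without starting a new CVS.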
- A video coder (e.g., a video encoder or a video decoder) includes a decoded picture buffer (DPB).
- the DPB stores decoded pictures, including reference pictures. Reference pictures are pictures that can potentially be used for inter-predicting a picture.
- The video coder may predict a picture, during coding (encoding or decoding) of that picture, based on one or more reference pictures stored in the DPB.
- DPB management processes, including a storage process of decoded pictures into the DPB, a marking process of reference pictures, and output and removal processes of decoded pictures from the DPB, are specified.
- DPB management includes at least the following aspects: (1) Picture identification and reference picture identification; (2) Reference picture list construction; (3) Reference picture marking; (4) Picture output from the DPB; (5) Picture insertion into the DPB; and (6) Picture removal from the DPB.
- Each CVS may include a number of reference pictures, which may be used to predict pixel values of other pictures (e.g., pictures that come before or after the reference picture).
- A video coder marks each reference picture, and stores the reference picture in the DPB.
- The DPB includes a maximum number, referred to as M (num_ref_frames), of reference pictures used for inter-prediction, as indicated in the active sequence parameter set.
- When a reference picture is decoded, it is marked as “used for reference.” If the decoding of the reference picture caused more than M pictures to be marked as “used for reference,” at least one picture must be marked as “unused for reference.” The DPB removal process then removes pictures marked as “unused for reference” from the DPB if they are not needed for output as well.
- When a picture is decoded, the decoded picture may be either a non-reference picture or a reference picture.
- A reference picture may be a long-term reference picture or a short-term reference picture, and when a decoded picture is marked as “unused for reference,” the decoded picture may become no longer needed for reference.
- The operation mode for reference picture marking may be selected on a picture basis, whereas the sliding window operation mode works as a first-in-first-out queue with a fixed number of short-term reference pictures. In other words, the short-term reference pictures with the earliest decoding times are the first to be removed (marked as pictures not used for reference), in an implicit fashion.
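- The sliding window operation mode described above can be sketched as a bounded first-in-first-out queue. This is an illustrative simplification; function and variable names are hypothetical.

```python
# Sketch of sliding-window reference picture marking: at most M
# (num_ref_frames) short-term reference pictures are kept, and the one
# with the earliest decoding time is implicitly marked "unused for
# reference" first, in first-in-first-out order.

from collections import deque

def mark_sliding_window(short_term_refs, new_ref, m):
    """short_term_refs: deque of picture ids in decoding order."""
    unused = []
    short_term_refs.append(new_ref)               # newly decoded reference
    while len(short_term_refs) > m:               # more than M marked?
        unused.append(short_term_refs.popleft())  # oldest is removed first
    return unused  # pictures now marked "unused for reference"
```

With M = 3, decoding references 0, 1, 2, 3 in order leaves {1, 2, 3} marked as used for reference and implicitly marks picture 0 as unused.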
- The video coder may also be tasked with constructing reference picture lists that indicate which reference pictures may be used for inter-prediction purposes. Two of these reference picture lists are referred to as List 0 and List 1, respectively.
- The video coder first employs default construction techniques to construct List 0 and List 1 (e.g., preconfigured construction schemes for constructing List 0 and List 1).
- The video decoder may decode syntax elements, when present, that instruct the video decoder to modify the initial List 0 and List 1.
- The video encoder may signal syntax elements that are indicative of identifier(s) of reference pictures in the DPB, and the video encoder may also signal syntax elements that include indices, within List 0, List 1, or both List 0 and List 1, that indicate which reference picture or pictures to use to decode a coded block of a current picture.
- The video decoder uses the received identifier to identify the index value or values for a reference picture or reference pictures listed in List 0, List 1, or both List 0 and List 1.
- The video decoder retrieves the reference picture or reference pictures, or part(s) thereof, from the DPB, and decodes the coded block of the current picture based on the retrieved reference picture or pictures and one or more motion vectors that identify blocks within the reference picture or pictures that are used for decoding the coded block.
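- The default list construction and index-based lookup described above can be sketched as follows. This is a deliberately simplified illustration for a P slice: it orders past short-term references nearest-first and omits long-term pictures, list modification, and List 1.

```python
# Sketch (simplified assumption): default List 0 for a P slice orders the
# short-term reference pictures that precede the current picture so that
# the temporally closest one sits at index 0.  The decoder then uses a
# signaled index (a ref_idx-style syntax element) to select a reference.

def init_list0(dpb_ref_pocs, current_poc):
    past = [p for p in dpb_ref_pocs if p < current_poc]
    return sorted(past, reverse=True)   # nearest past picture at index 0

def pick_reference(list0, ref_idx):
    # the decoder maps the signaled index to a reference picture in List 0
    return list0[ref_idx]
```

For example, with reference pictures at picture order counts 0, 2, 4, and 8 and a current picture at 6, List 0 becomes [4, 2, 0], so index 0 selects the closest past reference.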
- A coded video sequence refers to a sequence of coded frames, or “pictures,” ranging from an instantaneous decoding refresh (IDR) picture to another IDR picture, exclusive, in decoding order, or to an end of a coded video bitstream if the starting IDR picture is the last IDR picture in the coded video bitstream.
- A sub-sequence of pictures with one resolution may have different coding parameters, such as an LCU size, than another sub-sequence of pictures with another, different resolution. Accordingly, it may not be sufficient to use a single active SPS to describe characteristics of a CVS comprising sub-sequences of pictures with different resolutions.
- Different sub-sequences of a CVS may have reference pictures having different sizes, that is, different spatial resolutions.
- One particular parameter included in an SPS for the CVS, e.g., max_num_ref_frames, may be optimal for one sub-sequence, but sub-optimal for other sub-sequences included in the CVS.
- Some techniques for DPB management may no longer be effective when coding a single CVS that includes pictures having different resolutions.
- For example, a size of a DPB used to store the pictures can no longer be indicated using a number of frame buffers, e.g., a number of storage locations each capable of storing a frame, or “picture,” of a fixed size.
- Under such techniques, to store a decoded picture, the DPB must include an empty frame buffer of a size that is sufficiently large to store the decoded picture.
- A frame buffer of a fixed size may not correspond to a size of a particular decoded picture to be inserted. Accordingly, merely determining whether the DPB includes an empty frame buffer of a fixed size may be insufficient to determine whether the DPB is available to store the decoded picture. As one example, the DPB may have less buffer space than is required to store the decoded picture.
- For example, the decoded picture may have a resolution that corresponds to a size that is different than the size of the frame buffer.
- Similarly, determining that a decoded picture has been removed from the DPB may be insufficient to determine whether the DPB is actually available to store a subsequent decoded picture having a particular resolution.
- The above determination is also insufficient to indicate the actual buffer space that may be available within the DPB for storing additional decoded pictures.
- For example, a single empty frame buffer of a fixed size may exist within the DPB, and the DPB may store decoded picture(s) having a particular resolution in the frame buffer.
- If a video coder removes a decoded picture from the DPB, and the removed picture has a resolution that is smaller than the size of the frame buffer, sufficient buffer space may exist within the DPB to insert a decoded picture with a resolution that corresponds to a size that is larger than the size of the removed decoded picture. Accordingly, merely determining that a particular decoded picture has been removed from the DPB may be insufficient to indicate the actual buffer space that may be available within the DPB for storing additional decoded pictures.
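- The contrast drawn above, between a check that only counts empty fixed-size frame buffers and a check that accounts for the incoming picture's resolution, can be sketched as follows. Both functions and the sample-based accounting are illustrative assumptions, not the patent's normative rules.

```python
# Sketch contrasting the two DPB availability checks the text describes.

def fixed_buffer_check(empty_slots):
    # legacy rule: any empty fixed-size slot is assumed big enough
    return empty_slots > 0

def size_aware_check(free_samples, width, height):
    # resolution-adaptive rule: the picture must actually fit in the
    # remaining buffer space, measured here in luma samples
    return width * height <= free_samples
```

For instance, after removing a 720p picture the fixed-buffer check reports one free slot, yet a 1080p picture does not fit in the freed 1280×720 samples; the size-aware check catches this.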
- FIG. 1 is a block diagram illustrating an example video encoding and decoding system 10 that may utilize techniques described in this disclosure.
- A reference picture set is defined as a set of reference pictures associated with a picture, consisting of all reference pictures that are prior to the associated picture in decoding order, that may be used for inter-prediction of the associated picture or any picture following the associated picture in decoding order.
- The reference pictures that are prior to the associated picture may be reference pictures until the next instantaneous decoding refresh (IDR) picture or broken link access (BLA) picture.
- Reference pictures in the reference picture set may all be prior to the current picture in decoding order.
- The reference pictures in the reference picture set may be used for inter-predicting the current picture and/or inter-predicting any picture following the current picture in decoding order, until the next IDR picture or BLA picture.
- Some of the reference pictures in the reference picture set can potentially be used to inter-predict a block of the current picture, but not blocks of pictures following the current picture in decoding order.
- Some of the reference pictures in the reference picture set can potentially be used to inter-predict a block of the current picture, and blocks in one or more pictures following the current picture in decoding order.
- Some of the reference pictures in the reference picture set can potentially be used to inter-predict blocks in one or more pictures following the current picture in decoding order, but cannot be used to inter-predict a block in the current picture.
- Reference pictures that can potentially be used for inter-prediction are reference pictures that can be used for inter-prediction, but do not necessarily have to be used.
- In other words, the reference picture set may identify reference pictures that can potentially be used for inter-prediction. However, this does not mean that all of the identified reference pictures must be used for inter-prediction; rather, one or more of these identified reference pictures could be used, but all of them do not necessarily have to be.
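- The roles distinguished above can be sketched by partitioning a reference picture set into pictures usable for the current picture and pictures retained only for later pictures in decoding order. The `usable_for_current` flag is an illustrative stand-in for the actual signaling.

```python
# Sketch: split a reference picture set into the two roles the text
# distinguishes.  Each entry pairs a picture order count with a flag
# saying whether the picture may predict the current picture.

def split_reference_picture_set(rps):
    """rps: list of (poc, usable_for_current) pairs."""
    curr = [poc for poc, used in rps if used]      # may predict current picture
    foll = [poc for poc, used in rps if not used]  # kept only for later pictures
    return curr, foll
```

Pictures in the second list must stay in the DPB even though the current picture never references them, because pictures that follow in decoding order may.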
- System 10 includes a source device 12 that generates encoded video for decoding by a destination device 14.
- Source device 12 and destination device 14 may each be an example of a video coding device.
- Source device 12 may transmit the encoded video to destination device 14 via communication channel 16 or may store the encoded video on a storage medium 17 or a file server 19 , such that the encoded video may be accessed by the destination device 14 as desired.
- Source device 12 and destination device 14 may comprise any of a wide range of devices, including wireless handsets such as so-called “smart” phones, so-called “smart” pads, or other such wireless devices equipped for wireless communication. Additional examples of source device 12 and destination device 14 include, but are not limited to, a digital television, a device in a digital direct broadcast system, a device in a wireless broadcast system, a personal digital assistant (PDA), a laptop computer, a desktop computer, a tablet computer, an e-book reader, a digital camera, a digital recording device, a digital media player, a video gaming device, a video game console, a cellular radio telephone, a satellite radio telephone, a video teleconferencing device, a video streaming device, a wireless communication device, or the like.
- Source device 12 and/or destination device 14 may be equipped for wireless communication.
- Communication channel 16 may comprise a wireless channel, a wired channel, or a combination of wireless and wired channels suitable for transmission of encoded video data.
- The file server 19 may be accessed by the destination device 14 through any standard data connection, including an Internet connection. This may include a wireless channel (e.g., a Wi-Fi connection), a wired connection (e.g., DSL, cable modem, etc.), or a combination of both that is suitable for accessing encoded video data stored on a file server.
- System 10 may be configured to support one-way or two-way video transmission to support applications such as video streaming, video playback, video broadcasting, and/or video telephony.
- Source device 12 includes a video source 18, a video encoder 20, a modulator/demodulator (modem) 22, and an output interface 24.
- Video source 18 may include a source such as a video capture device (e.g., a video camera), a video archive containing previously captured video, a video feed interface to receive video from a video content provider, and/or a computer graphics system for generating computer graphics data as the source video, or a combination of such sources.
- Source device 12 and destination device 14 may form so-called camera phones or video phones.
- The techniques described in this disclosure may be applicable to video coding in general, and may be applied to wireless and/or wired applications.
- The captured, pre-captured, or computer-generated video may be encoded by video encoder 20.
- The encoded video information may be modulated by modem 22 according to a communication standard, such as a wireless communication protocol, and transmitted to destination device 14 via output interface 24.
- Modem 22 may include various mixers, filters, amplifiers or other components designed for signal modulation.
- Output interface 24 may include circuits designed for transmitting data, including amplifiers, filters, and one or more antennas.
- The captured, pre-captured, or computer-generated video that is encoded by the video encoder 20 may also be stored onto a storage medium 17 or a file server 19 for later consumption.
- The storage medium 17 may include Blu-ray discs, DVDs, CD-ROMs, flash memory, or any other suitable digital storage media for storing encoded video.
- The encoded video stored on the storage medium 17 may then be accessed by destination device 14 for decoding and playback.
- File server 19 may be any type of server capable of storing encoded video and transmitting that encoded video to the destination device 14 .
- Example file servers include a web server (e.g., for a website), an FTP server, network attached storage (NAS) devices, a local disk drive, or any other type of device capable of storing encoded video data and transmitting it to a destination device.
- The transmission of encoded video data from the file server 19 may be a streaming transmission, a download transmission, or a combination of both.
- The file server 19 may be accessed by the destination device 14 through any standard data connection, including an Internet connection.
- This may include a wireless channel (e.g., a Wi-Fi connection), a wired connection (e.g., DSL, cable modem, Ethernet, USB, etc.), or a combination of both that is suitable for accessing encoded video data stored on a file server.
- Destination device 14, in the example of FIG. 1, includes an input interface 26, a modem 28, a video decoder 30, and a display device 32.
- Input interface 26 of destination device 14 receives information over channel 16, as one example, or from storage medium 17 or file server 19, as alternate examples, and modem 28 demodulates the information to produce a demodulated bitstream for video decoder 30.
- The demodulated bitstream may include a variety of syntax information generated by video encoder 20 for use by video decoder 30 in decoding video data. Such syntax may also be included with the encoded video data stored on a storage medium 17 or a file server 19.
- The syntax may be embedded with the encoded video data, although aspects of this disclosure should not be considered limited to such a requirement.
- The syntax information defined by video encoder 20 may include syntax elements that describe characteristics and/or processing of video blocks, such as coding tree units (CTUs), coding tree blocks (CTBs), prediction units (PUs), coding units (CUs), or other units of coded video, e.g., video slices, video pictures, and video sequences or groups of pictures (GOPs).
- Each of video encoder 20 and video decoder 30 may form part of a respective encoder-decoder (CODEC) that is capable of encoding or decoding video data.
- Display device 32 may be integrated with, or external to, destination device 14 .
- Destination device 14 may include an integrated display device and also be configured to interface with an external display device.
- Alternatively, destination device 14 may be a display device.
- In general, display device 32 displays the decoded video data to a user, and may comprise any of a variety of display devices such as a liquid crystal display (LCD), a plasma display, an organic light emitting diode (OLED) display, or another type of display device.
- Communication channel 16 may comprise any wireless or wired communication medium, such as a radio frequency (RF) spectrum or one or more physical transmission lines, or any combination of wireless and wired media.
- Communication channel 16 may form part of a packet-based network, such as a local area network, a wide-area network, or a global network such as the Internet.
- Communication channel 16 generally represents any suitable communication medium, or collection of different communication media, for transmitting video data from source device 12 to destination device 14 , including any suitable combination of wired or wireless media.
- Communication channel 16 may include routers, switches, base stations, or any other equipment that may be useful to facilitate communication from source device 12 to destination device 14 .
- Video encoder 20 and video decoder 30 may operate according to a video compression standard, such as ITU-T H.261, ISO/IEC MPEG-1 Visual, ITU-T H.262 or ISO/IEC MPEG-2 Visual, ITU-T H.263, ISO/IEC MPEG-4 Visual, and ITU-T H.264 (also known as ISO/IEC MPEG-4 AVC), including its Scalable Video Coding (SVC) and Multiview Video Coding (MVC) extensions.
- Video encoder 20 and video decoder 30 may each be integrated with an audio encoder and decoder, and may include appropriate MUX-DEMUX units, or other hardware and software, to handle encoding of both audio and video in a common data stream or separate data streams.
- MUX-DEMUX units may conform to the ITU H.223 multiplexer protocol, or other protocols such as the user datagram protocol (UDP).
- Video encoder 20 and video decoder 30 each may be implemented as any of a variety of suitable encoder circuitry, such as one or more processors including microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), discrete logic, software, hardware, firmware or any combinations thereof.
- A device may store instructions for the software in a suitable, non-transitory computer-readable medium and execute the instructions in hardware using one or more processors to perform the techniques of this disclosure.
- Each of video encoder 20 and video decoder 30 may be included in one or more encoders or decoders, either of which may be integrated as part of a combined encoder/decoder (CODEC) in a respective device.
- video encoder 20 and video decoder 30 may be commonly referred to as a video coder that codes information (e.g., pictures and syntax elements).
- the coding of information may refer to encoding when the video coder corresponds to video encoder 20 .
- the coding of information may refer to decoding when the video coder corresponds to video decoder 30 .
- FIG. 2 is a block diagram illustrating an example video encoder 20 that may implement the techniques described in this disclosure.
- Video encoder 20 may perform intra- and inter-coding of video blocks within video slices.
- Intra coding relies on spatial prediction to reduce or remove spatial redundancy in video within a given video frame or picture.
- Inter-coding relies on temporal prediction to reduce or remove temporal redundancy in video within adjacent frames or pictures of a video sequence.
- Intra-mode may refer to any of several spatial based compression modes.
- Inter-modes such as uni-directional prediction (P mode) or bi-prediction (B mode), may refer to any of several temporal-based compression modes.
- video encoder 20 includes a partitioning unit 35 , prediction processing unit 41 , summer 50 , transform processing unit 52 , quantization unit 54 , entropy encoding unit 56 , decoded picture buffer (DPB) 64 , and DPB management unit 65 .
- Prediction processing unit 41 includes motion estimation unit 42 , motion compensation unit 44 , and intra prediction unit 46 .
- video encoder 20 also includes inverse quantization unit 58 , inverse transform unit 60 , and summer 62 .
- a deblocking filter (not shown in FIG. 2 ) may also be included to filter block boundaries to remove blockiness artifacts from reconstructed video. If desired, the deblocking filter would typically filter the output of summer 62 . Additional loop filters (in loop or post loop) may also be used in addition to the deblocking filter.
- video encoder 20 receives video data, and partitioning unit 35 partitions the data into video blocks. This partitioning may also include partitioning into slices, tiles, or other larger units, as well as video block partitioning, e.g., according to a quadtree structure of LCUs and CUs.
- Video encoder 20 generally illustrates the components that encode video blocks within a video slice to be encoded. The slice may be divided into multiple video blocks (and possibly into sets of video blocks referred to as tiles).
- Prediction processing unit 41 may select one of a plurality of possible coding modes, such as one of a plurality of intra coding modes or one of a plurality of inter coding modes, for the current video block based on error results (e.g., coding rate and the level of distortion). Prediction processing unit 41 may provide the resulting intra- or inter-coded block to summer 50 to generate residual block data and to summer 62 to reconstruct the encoded block for use as a reference picture.
- Intra prediction unit 46 within prediction processing unit 41 may perform intra-predictive coding of the current video block relative to one or more neighboring blocks in the same picture or slice as the current block to be coded to provide spatial compression.
- Motion estimation unit 42 and motion compensation unit 44 within prediction processing unit 41 perform inter-predictive coding of the current video block relative to one or more predictive blocks in one or more reference pictures to provide temporal compression.
- Motion estimation unit 42 may be configured to determine the inter-prediction mode for a video slice according to a predetermined pattern for a video sequence.
- the predetermined pattern may designate video slices in the sequence as P slices or B slices.
- Motion estimation unit 42 and motion compensation unit 44 may be highly integrated, but are illustrated separately for conceptual purposes.
- Motion estimation, performed by motion estimation unit 42, is the process of generating motion vectors, which estimate motion for video blocks.
- A motion vector, for example, may indicate the displacement of a PU of a video block within a current video picture relative to a predictive block within a reference picture.
- a predictive block is a block that is found to closely match the PU of the video block to be coded in terms of pixel difference, which may be determined by sum of absolute difference (SAD), sum of square difference (SSD), or other difference metrics.
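- The SAD metric mentioned above can be sketched as follows. This is a minimal illustration, not the disclosure's implementation; the function name and list-of-lists block layout are assumptions for the example.

```python
def sad(block_a, block_b):
    # sum of absolute pixel differences over two equally sized blocks
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))

# |10-11| + |12-12| + |8-7| + |9-10| = 3
cost = sad([[10, 12], [8, 9]], [[11, 12], [7, 10]])
```

During motion search, the candidate block with the smallest SAD (or SSD) would typically be selected as the predictive block.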
- video encoder 20 may calculate values for sub-integer pixel positions of reference pictures stored in decoded picture buffer 64 . For example, video encoder 20 may interpolate values of one-quarter pixel positions, one-eighth pixel positions, or other fractional pixel positions of the reference picture. Therefore, motion estimation unit 42 may perform a motion search relative to the full pixel positions and fractional pixel positions and output a motion vector with fractional pixel precision.
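- As a simplified illustration of how sub-integer pixel values might be interpolated, the sketch below averages adjacent full-pel samples with rounding to produce half-pel positions. Real codecs such as H.264/AVC use longer filters (e.g., a 6-tap filter for half-pel positions), so this bilinear stand-in is an assumption for illustration only.

```python
def half_pel_horizontal(row):
    # average adjacent full-pel samples with rounding to obtain
    # half-pel positions between them (simplified bilinear filter)
    return [(row[i] + row[i + 1] + 1) // 2 for i in range(len(row) - 1)]

# full-pel samples [0, 4, 8] yield half-pel samples [2, 6]
halves = half_pel_horizontal([0, 4, 8])
```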
- Motion estimation unit 42 calculates a motion vector for a PU of a video block in an inter-coded slice by comparing the position of the PU to the position of a predictive block of a reference picture.
- the reference picture may be selected from a first reference picture list (List 0 ) or a second reference picture list (List 1 ), each of which identifies one or more reference pictures stored in decoded picture buffer 64 .
- Motion estimation unit 42 sends the calculated motion vector to entropy encoding unit 56 and motion compensation unit 44 .
- Motion compensation performed by motion compensation unit 44 may involve fetching or generating the predictive block based on the motion vector determined by motion estimation, possibly performing interpolations to sub-pixel precision.
- motion compensation unit 44 may locate the predictive block to which the motion vector points in one of the reference picture lists.
- Video encoder 20 forms a residual video block by subtracting pixel values of the predictive block from the pixel values of the current video block being coded, forming pixel difference values.
- the pixel difference values form residual data for the block, and may include both luma and chroma difference components.
- Summer 50 represents the component or components that perform this subtraction operation.
- Motion compensation unit 44 may also generate syntax elements associated with the video blocks and the video slice for use by video decoder 30 in decoding the video blocks of the video slice.
- Intra-prediction unit 46 may intra-predict a current block, as an alternative to the inter-prediction performed by motion estimation unit 42 and motion compensation unit 44 , as described above. In particular, intra-prediction unit 46 may determine an intra-prediction mode to use to encode a current block. In some examples, intra-prediction unit 46 may encode a current block using various intra-prediction modes, e.g., during separate encoding passes, and intra-prediction unit 46 (or mode select unit 40 , in some examples) may select an appropriate intra-prediction mode to use from the tested modes.
- intra-prediction unit 46 may calculate rate-distortion values using a rate-distortion analysis for the various tested intra-prediction modes, and select the intra-prediction mode having the best rate-distortion characteristics among the tested modes.
- Rate-distortion analysis generally determines an amount of distortion (or error) between an encoded block and an original, unencoded block that was encoded to produce the encoded block, as well as a bit rate (that is, a number of bits) used to produce the encoded block.
- Intra-prediction unit 46 may calculate ratios from the distortions and rates for the various encoded blocks to determine which intra-prediction mode exhibits the best rate-distortion value for the block.
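- The mode selection described above is commonly expressed as minimizing a Lagrangian cost J = D + λ·R over the tested modes. The helper below is a hypothetical sketch of that selection, assuming each candidate carries a precomputed distortion and rate; it is not the disclosure's exact procedure.

```python
def best_mode(candidates, lam):
    # candidates: (mode_name, distortion, rate_bits)
    # select the mode minimizing the Lagrangian cost J = D + lam * R
    return min(candidates, key=lambda m: m[1] + lam * m[2])[0]

modes = [("DC", 120, 10), ("planar", 90, 30), ("angular", 70, 60)]
# lam = 1.0 gives costs 130, 120, 130, so "planar" wins;
# a larger lam penalizes rate more heavily and favors "DC"
```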
- intra-prediction unit 46 may provide information indicative of the selected intra-prediction mode for the block to entropy encoding unit 56 .
- Entropy encoding unit 56 may encode the information indicating the selected intra-prediction mode in accordance with the techniques of this disclosure.
- Video encoder 20 may include in the transmitted bitstream configuration data, which may include a plurality of intra-prediction mode index tables and a plurality of modified intra-prediction mode index tables (also referred to as codeword mapping tables), definitions of encoding contexts for various blocks, and indications of a most probable intra-prediction mode, an intra-prediction mode index table, and a modified intra-prediction mode index table to use for each of the contexts.
- video encoder 20 forms a residual video block by subtracting the predictive block from the current video block.
- the residual video data in the residual block may be included in one or more TUs and applied to transform processing unit 52 .
- Transform processing unit 52 transforms the residual video data into residual transform coefficients using a transform, such as a discrete cosine transform (DCT) or a conceptually similar transform.
- Transform processing unit 52 may convert the residual video data from a pixel domain to a transform domain, such as a frequency domain.
- Transform processing unit 52 may send the resulting transform coefficients to quantization unit 54 .
- Quantization unit 54 quantizes the transform coefficients to further reduce bit rate. The quantization process may reduce the bit depth associated with some or all of the coefficients. The degree of quantization may be modified by adjusting a quantization parameter.
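- A minimal sketch of scalar quantization controlled by a quantization parameter. The step-size formula (roughly doubling for every increase of 6 in QP, as in H.264/AVC) and the truncation rule are illustrative assumptions, not the normative quantization process.

```python
def quantize(coeffs, qp):
    # step size roughly doubles for every increase of 6 in QP
    qstep = 0.625 * (2 ** (qp / 6.0))
    # truncation toward zero; larger QP -> coarser levels, lower bit rate
    return [int(c / qstep) for c in coeffs]

# qp = 24 gives qstep = 10.0, so [100, -40, 3] -> [10, -4, 0]
levels = quantize([100, -40, 3], 24)
```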
- quantization unit 54 may then perform a scan of the matrix including the quantized transform coefficients. Alternatively, entropy encoding unit 56 may perform the scan.
- entropy encoding unit 56 entropy encodes the quantized transform coefficients.
- entropy encoding unit 56 may perform context adaptive variable length coding (CAVLC), context adaptive binary arithmetic coding (CABAC), syntax-based context-adaptive binary arithmetic coding (SBAC), probability interval partitioning entropy (PIPE) coding or another entropy encoding methodology or technique.
- the encoded bitstream may be transmitted to video decoder 30 , or archived for later transmission or retrieval by video decoder 30 .
- Entropy encoding unit 56 may also entropy encode the motion vectors and the other syntax elements for the current video slice being coded.
- Inverse quantization unit 58 and inverse transform unit 60 apply inverse quantization and inverse transformation, respectively, to reconstruct the residual block in the pixel domain for later use as a reference block of a reference picture.
- Motion compensation unit 44 may calculate a reference block by adding the residual block to a predictive block of one of the reference pictures within one of the reference picture lists. Motion compensation unit 44 may also apply one or more interpolation filters to the reconstructed residual block to calculate sub-integer pixel values for use in motion estimation.
- Summer 62 adds the reconstructed residual block to the motion compensated prediction block produced by motion compensation unit 44 to produce a reference block for storage in decoded picture buffer 64 .
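- The reconstruction performed by summer 62 can be sketched as a sample-wise sum of prediction and residual, clipped to the valid sample range. The clipping to a bit-depth-dependent range is a standard detail assumed here for completeness.

```python
def reconstruct(pred, resid, bit_depth=8):
    # sample-wise sum of prediction and residual, clipped to [0, 2^bd - 1]
    max_val = (1 << bit_depth) - 1
    return [[min(max(p + r, 0), max_val) for p, r in zip(pr, rr)]
            for pr, rr in zip(pred, resid)]

# 250 + 10 clips to 255; 4 - 8 clips to 0
recon = reconstruct([[250, 4]], [[10, -8]])
```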
- the reference block may be used by motion estimation unit 42 and motion compensation unit 44 as a reference block to inter-predict a block in a subsequent video frame or picture.
- prediction processing unit 41 represents one example unit for performing the example functions described above.
- prediction processing unit 41 may encode syntax elements that support the use of adaptive resolution CVSs.
- Prediction processing unit 41 may also generate SPSs that may be activated by one or more resolution sub-sequences, and transmit the SPSs and RSSs to a video decoder.
- Each of the SPSs may include resolution information for one or more sequences of pictures.
- Prediction processing unit 41 may also receive and order one or more SPSs and cause video encoder 20 to code information indicative of the reference pictures that belong to the reference picture set.
- DPB management unit 65 may also perform techniques related to the management of DPB 64 .
- prediction processing unit 41 may construct the plurality of reference picture subsets that each identifies one or more of the reference pictures. Prediction processing unit 41 may also derive the reference picture set from the constructed plurality of reference picture subsets. Also, prediction processing unit 41 and DPB management unit 65 may implement any one or more of the sets of example pseudo code described below to implement one or more example techniques described in this disclosure.
- prediction processing unit 41 may generate a coded video sequence comprising a first sub-sequence and a second sub-sequence, wherein the first sub-sequence includes one or more frames each having a first resolution.
- the second sub-sequence may include one or more frames each having a second resolution.
- the first sub-sequence may be different than the second sub-sequence, and the first resolution may be different than the second resolution.
- Prediction processing unit 41 may further generate a first sequence parameter set and a second sequence parameter set for the video sequence.
- the first sequence parameter set may indicate the first resolution of the one or more frames of the first sub-sequence
- the second sequence parameter set may indicate the second resolution of the one or more frames of the second sub-sequence.
- the first sequence parameter set may be different than the second sequence parameter set.
- Prediction processing unit 41 may transmit the coded video sequence comprising the first sub-sequence and the second sub-sequence, and the first sequence parameter set and the second sequence parameter set.
- the resolution may comprise a spatial resolution.
- Prediction processing unit 41 may also alter the coding of the sequence parameter sets. For example, prediction processing unit 41 may code the first sequence parameter set and the second sequence parameter set in a transmitted bitstream prior to either the first sub-sequence or the second sub-sequence. Prediction processing unit 41 may also interleave in the coded video sequence the one or more frames of the first sub-sequence and the one or more frames of the second sub-sequence.
- prediction processing unit 41 may be configured to transmit both the first sequence parameter set and the second sequence parameter set prior to transmitting either of the first sub-sequence and the second sub-sequence. In another example, to transmit the first sequence parameter set and the second sequence parameter set of the coded video sequence, prediction processing unit 41 may be configured to transmit the second sequence parameter set after transmitting at least one frame of the one or more frames of the first sub-sequence, and prior to transmitting the second sub-sequence.
- prediction processing unit 41 may code the first sequence parameter set in a transmitted bitstream prior to coding the first sub-sequence and prediction processing unit 41 may also code the second sequence parameter set in the transmitted bitstream after at least one frame of the one or more frames of the first sub-sequence and prior to the second sub-sequence.
- Decoded picture buffer 64 may also perform the techniques of this disclosure.
- decoded picture buffer 64 may receive a first decoded frame of video data, wherein the first decoded frame is associated with a first resolution; determine whether a decoded picture buffer is available to store the first decoded frame based on the first resolution; in the event the decoded picture buffer is available to store the first decoded frame, store the first decoded frame in the decoded picture buffer; and determine whether decoded picture buffer 64 is available to store a second decoded frame of video data, wherein the second decoded frame is associated with a second resolution, based on the first resolution and the second resolution, wherein the first decoded frame is different than the second decoded frame.
- DPB management unit 65 may determine an amount of information that may be stored within decoded picture buffer 64 , determine an amount of information associated with the first decoded frame based on the first resolution, and compare the amount of information that may be stored within decoded picture buffer 64 , and the amount of information associated with the first decoded frame.
- DPB management unit 65 may be configured to determine an amount of information that may be stored within decoded picture buffer 64 based on the first resolution, determine an amount of information associated with the second decoded frame based on the second resolution, and compare the amount of information that may be stored within decoded picture buffer 64 and the amount of information associated with the second decoded frame. DPB management unit 65 may also be configured to remove the first decoded frame from decoded picture buffer 64 , and in some examples, the resolution may comprise a spatial resolution.
- the techniques described in this disclosure may refer to video encoder 20 signaling information.
- When video encoder 20 signals information, the techniques of this disclosure generally refer to any manner in which video encoder 20 provides the information in a coded bitstream.
- When video encoder 20 signals syntax elements to video decoder 30 , it may mean that video encoder 20 transmitted the syntax elements to video decoder 30 as part of a coded bitstream via output interface 24 and communication channel 16 , or that video encoder 20 stored the syntax elements in a coded bitstream on storage medium 17 and/or file server 19 for eventual reception by video decoder 30 .
- Signaling from video encoder 20 to video decoder 30 should not be interpreted as requiring transmission directly from video encoder 20 to video decoder 30 , although this may be one possibility for real-time video applications. Rather, signaling from video encoder 20 to video decoder 30 should be interpreted as any technique with which video encoder 20 provides information in a bitstream for eventual reception by video decoder 30 , either directly or via intermediate storage (e.g., in storage medium 17 and/or file server 19 ).
- Video encoder 20 and video decoder 30 may be configured to implement the example techniques described in this disclosure for coding, transmitting, receiving and activating SPSs and RSSs, as well as for managing the DPB.
- video decoder 30 may invoke the techniques to support adaptive resolution CVSs and to add and remove reference pictures from the DPB.
- Video decoder 30 may invoke the process in a similar manner.
- An RSS may indicate information, such as a resolution of a series of coded video pictures of a CVS.
- Prediction processing unit 41 may use one resolution sub-sequence (RSS) at a given time.
- Each RSS may reference a single SPS.
- If there are “n” RSSs in a given CVS, there may be, altogether, “n” active SPSs when decoding the CVS.
- multiple RSSs may refer to a single SPS in a CVS.
- the SPS or PPS may indicate the different resolution of each RSS.
- the SPS or PPS may include a resolution ID as well as a syntax element that indicates the resolution associated with each resolution ID.
- a computer-readable storage medium may include a data structure that represents CVSs, SPSs, and RSSs.
- the data structure may include a coded video sequence comprising a first sub-sequence and a second sub-sequence.
- the first sub-sequence may include one or more frames each having a first resolution
- the second sub-sequence may include one or more frames each having a second resolution.
- the first sub-sequence may also be different than the second sub-sequence
- the first resolution may be different than the second resolution.
- the data structure may further comprise a first sequence parameter set and a second sequence parameter set for the coded video sequence.
- the first sequence parameter set may indicate the first resolution of the one or more frames of the first sub-sequence
- the second sequence parameter set may indicate the second resolution of the one or more frames of the second sub-sequence
- the first sequence parameter set may be different than the second sequence parameter set.
- Prediction processing unit 41 of video encoder 20 may order or restrict each of the RSSs according to spatial resolution characteristics of each RSS.
- prediction processing unit 41 may order the SPSs based on their horizontal resolutions. As an example, if a horizontal size of a resolution “A” of an SPS is greater than that of a resolution “B” of an SPS, a vertical size of the resolution “A” may not be less than that of the resolution “B.” With this restriction, a resolution “C” of an SPS may be considered to be larger than a resolution “D” of an SPS as long as one of a horizontal size and a vertical size of the resolution “C” is greater than a corresponding size of the resolution “D.”
- Video encoder 20 may assign an RSS with a largest spatial resolution a resolution ID equal to “0,” an RSS with a second largest spatial resolution a resolution ID equal to “1,” and so forth.
- prediction processing unit 41 may not signal a resolution ID. Rather, video encoder 20 may derive the resolution ID according to the spatial resolutions of the RSSs. Prediction processing unit 41 may still order each of the RSSs in each CVS according to the spatial resolutions of each RSS, as described above. The RSS with the largest spatial resolution is assigned a resolution ID equal to 0, and the RSS with the second largest spatial resolution is assigned a resolution ID equal to 1, and so on.
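- The resolution ID derivation described above can be sketched as an ordering of the RSS resolutions. Ordering by total pixel count is a simplifying assumption for this sketch (the disclosure orders by horizontal size with a restriction on vertical size); the function name and (width, height) representation are illustrative.

```python
def assign_resolution_ids(resolutions):
    # order (width, height) pairs from largest to smallest spatial
    # resolution; the largest gets resolution_id 0, the next gets 1, etc.
    ordered = sorted(resolutions, key=lambda wh: wh[0] * wh[1], reverse=True)
    return {res: rid for rid, res in enumerate(ordered)}

ids = assign_resolution_ids([(640, 360), (1920, 1080), (1280, 720)])
# (1920, 1080) -> 0, (1280, 720) -> 1, (640, 360) -> 2
```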
- prediction processing unit 41 may refer to decoded pictures only within the same RSS, within an RSS with a resolution ID equal to “rId-1,” or within an RSS with a resolution ID equal to “rId+1.” Prediction processing unit 41 may not refer to decoded pictures within other RSSs when performing inter-prediction.
- When coding an RSS, prediction processing unit 41 may perform inter-prediction of blocks only from the two adjacent RSSs, i.e., the RSS with the immediately larger spatial resolution and the RSS with the immediately smaller spatial resolution.
- prediction processing unit 41 may not be limited to performing inter-prediction using spatially-neighboring RSSs, and prediction processing unit 41 may perform inter-prediction using any RSS, not just spatially neighboring RSSs (e.g., RSSs with rId+1 or rId-1).
- the techniques of this disclosure may also include processes and techniques for transmitting and activating picture parameter sets (PPSs).
- PPSs may decouple the transmission of infrequently changing information from the transmission of coded block data for the CVSs.
- Video encoder 20 and decoder 30 may, in some applications, convey or signal the SPSs and PPSs “out-of-band,” or using a different communication channel than that used to communicate the coded block data of the CVSs, e.g., using a reliable transport mechanism.
- a PPS raw byte sequence payload may include parameters to which coded slice network abstraction layer (NAL) units of one or more coded pictures may refer.
- Each PPS RBSP is initially considered not active at a start of a decoding process. At most, one PPS RBSP is considered active at any given moment during the decoding process, and activation of any particular PPS RBSP results in deactivation of a previously-active PPS RBSP, if any.
- prediction processing unit 41 of video encoder 20 and prediction processing unit 81 of video decoder 30 may support RSSs each having the same resolution aspect ratio.
- video encoder 20 and decoder 30 may support different RSSs having different resolution aspect ratios among the different RSSs.
- the resolution aspect ratio of an RSS may be defined as the proportion of the width of an RSS versus the height of the RSS.
- prediction processing units 41 and 81 may crop a portion of a block of a reference picture having a first resolution aspect ratio in order to predict the values of a predictive block having a second, different resolution aspect ratio.
- the techniques of this disclosure define a number of syntax elements, referred to as cropping parameters, which may be signaled in the RBSP of an SPS to indicate how a reference picture should be cropped.
- the cropped area of the reference picture may be referred to as a “cropping window.”
- the syntax elements may include a profile indicator or a flag that indicates the existence of more than one spatial resolution in the CVS. Alternatively no flag may be added, but the existence of the more than one spatial resolution in the CVS may be indicated by a particular value of the profile indicator, which may be denoted as profile_idc. Additionally, the syntax elements may include a resolution ID, a syntax element that indicates a spatial relationship between the current resolution sub-sequence and an adjacent spatial resolution sub-sequence, and a syntax element that indicates the required size of the DPB in units of 8×8 blocks.
- a modified SPS RBSP syntax structure may be expressed as shown below in Table I:
- adaptive_spatial_resolution_flag When equal to “1,” the flag indicates that a CVS containing an RSS referring to an SPS may contain pictures with different spatial resolutions. When equal to “0,” the flag indicates that all pictures in the CVS have a same spatial resolution, or equivalently, that there is only one RSS in the CVS. This syntax element applies to the entire CVS, and its value shall be identical for all SPSs that may be activated for the CVS.
- The adaptive_spatial_resolution_flag is only one example of how adaptive resolution CVSs may be implemented. As another example, there may be one or more profiles defined that enable adaptive spatial resolution. Accordingly, the value of the profile_idc syntax element, which may indicate the selection of an adaptive resolution profile, may signal the enablement of adaptive resolution.
- resolution_id Specifies an identifier of the RSS referring to the SPS.
- a value of resolution_id may be in a range of “0” to “7,” inclusive.
- An RSS with a largest spatial resolution among all RSSs in the CVS may have resolution_id equal to “0.”
- cropping_resolution_idc[i] Indicates whether cropping is needed to specify a reference region of a reference picture from a target RSS, as defined below, used for inter-prediction as a reference when decoding a coded picture from a current RSS.
- the pseudocode that follows describes one example of how the numbering of an RSS using the resolution_id value that refers to an SPS may be implemented according to the techniques of this disclosure.
- video encoder 20 may predict the pixel values of a block from a block of a reference picture that has a different aspect ratio. Because of the difference in the aspect ratios, video encoder 20 may crop a portion of the block of the reference picture in order to obtain a block with a resolution aspect ratio similar to that of the predictive block.
- the following syntax elements describe how video encoder 20 may perform cropping of blocks to obtain blocks with different resolution aspect ratios.
- Cropping_resolution_idc[i] equal to “0” indicates that the target RSS does not exist, or that no cropping is needed.
- Cropping_resolution_idc[i] equal to “1” indicates that cropping at a left and/or right side is needed.
- Cropping_resolution_idc[i] equal to “2” indicates that cropping at a top and/or bottom is needed.
- Cropping_resolution_idc[i] equal to “3” indicates that cropping at both the left/right and the top/bottom is needed.
- Table II illustrates the various values of Cropping_resolution_idc[i], and the corresponding indications.
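- The value-to-direction mapping described above can be sketched as a small decoder. Only the value “3” is stated explicitly in the text; the “0,” “1,” and “2” assignments are inferred here from the order of the descriptions, so treat them as assumptions consistent with the missing Table II.

```python
def cropping_directions(idc):
    # decode cropping_resolution_idc[i] into a pair of flags:
    # (crop at left/right needed, crop at top/bottom needed)
    return (idc in (1, 3), idc in (2, 3))

# idc 0 -> no cropping; idc 3 -> crop both horizontally and vertically
flags = cropping_directions(3)
```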
- the RBSP of an SPS may also include syntax elements that may indicate the number of pixels to be cropped from the top, bottom, left, and/or right of a reference picture from an RSS. These additional cropping syntax elements are described in further detail below.
- cropped_left[i] Specifies a number of pixels to be cropped at a left side of a luma component of the reference picture from the target RSS, to specify the reference region.
- video encoder 20 may infer the value to be equal to “0.”
- cropped_right[i] Specifies a number of pixels to be cropped at a right side of the luma component of the reference picture from the target RSS, to specify the reference region.
- video encoder 20 may infer the value to be equal to “0.”
- cropped_top[i] Specifies a number of pixels to be cropped at a top of the luma component of the reference picture from the target RSS, to specify the reference region.
- video encoder 20 may infer the value to be equal to “0.”
- cropped_bottom[i] Specifies a number of pixels to be cropped at a bottom of the luma component of the reference picture from the target RSS, to specify the reference region.
- video encoder 20 may infer the value to be equal to “0.”
- video encoder 20 may signal the cropping window in other ways.
- video encoder 20 may signal the cropping window as the starting vertical and horizontal positions plus the width and height.
- video encoder 20 may signal the cropping window as the starting vertical and horizontal positions and the ending vertical and horizontal positions.
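- The two alternative signalings above describe the same rectangular region. The sketch below (hypothetical helper names; not the disclosure's syntax) normalizes both to a common start/end representation to show their equivalence.

```python
def window_from_pos_size(x0, y0, width, height):
    # cropping window signaled as starting position plus width and height
    return (x0, y0, x0 + width, y0 + height)

def window_from_pos_end(x0, y0, x1, y1):
    # cropping window signaled as starting and ending positions
    return (x0, y0, x1, y1)

# both signalings yield the same window (16, 8, 80, 40)
same = window_from_pos_size(16, 8, 64, 32) == window_from_pos_end(16, 8, 80, 40)
```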
- When inter-predicting a coded picture in the current RSS, prediction processing unit 41 may crop a decoded picture from the target RSS as specified by the above cropping syntax elements. Prediction processing unit 41 may also scale the cropped reference picture to the same resolution as the coded picture in the current RSS, and scale the motion vectors of the cropped block accordingly.
- Video encoder 20 may include DPB 64 , which may contain decoded pictures.
- DPB management unit 65 may manage DPB 64 .
- Each decoded picture contained within DPB 64 may be needed for either inter-prediction as a reference, or for future output.
- DPB 64 may be modified to support adaptive-resolution CVSs, and more generally to store frames of different sizes.
- Initially, the DPB may be empty (i.e., an indication of a proportion of DPB 64 that is unavailable to store decoded pictures, or DPB “fullness,” is set to “0”).
- DPB management unit 65 may increment the “fullness” of the DPB by the number of blocks (e.g., CUs or 8×8 pixel blocks) in the picture.
- DPB management unit 65 may decrease the fullness of the DPB by the number of blocks (e.g., CUs or 8×8 pixel blocks) in the removed picture.
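- Tracking fullness in block units rather than in whole frames is what lets the buffer hold frames of different sizes. The class below is a minimal sketch of such bookkeeping under that assumption; the class and method names are illustrative, not from the disclosure.

```python
class DpbTracker:
    """Tracks DPB fullness in units of 8x8 blocks, so frames of
    different resolutions consume proportionally different amounts
    of the buffer."""

    def __init__(self, max_blocks):
        self.max_blocks = max_blocks  # e.g., signaled max_dec_pic_buffering
        self.fullness = 0             # the DPB starts empty

    @staticmethod
    def blocks(width, height):
        # number of 8x8 blocks covering the picture (rounded up)
        return ((width + 7) // 8) * ((height + 7) // 8)

    def can_store(self, width, height):
        return self.fullness + self.blocks(width, height) <= self.max_blocks

    def store(self, width, height):
        # increment fullness by the picture's block count
        self.fullness += self.blocks(width, height)

    def remove(self, width, height):
        # decrement fullness by the removed picture's block count
        self.fullness -= self.blocks(width, height)
```

For example, a 1920x1080 frame occupies 240 x 135 = 32400 blocks, so a budget of 65000 blocks fits two such frames but not two frames plus a 640x360 frame.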
- the RBSP of an SPS may include a syntax element that specifies a size of the DPB in 8×8 blocks.
- The parameter, denoted max_dec_pic_buffering, specifies a required size of a decoded picture buffer (DPB), in units of 8×8 blocks, for decoding the CVS.
- This syntax element may apply to the entire CVS, and its value is identical for all SPSs that may be activated for the CVS. Further detail of the operation of the DPB is described with respect to FIG. 5 , below.
- FIG. 3 is a block diagram illustrating an example video decoder 30 that may implement the techniques described in this disclosure.
- video decoder 30 includes an entropy decoding unit 80 , prediction processing unit 81 , inverse quantization unit 86 , inverse transformation unit 88 , summer 90 , decoded picture buffer (DPB) 92 , and DPB management unit 93 .
- Prediction processing unit 81 includes motion compensation unit 82 and intra prediction unit 84 .
- Video decoder 30 may, in some examples, perform a decoding pass generally reciprocal to the encoding pass described with respect to video encoder 20 from FIG. 2 .
- video decoder 30 receives an encoded video bitstream that represents video blocks of an encoded video slice and associated syntax elements from video encoder 20 .
- Entropy decoding unit 80 of video decoder 30 entropy decodes the bitstream to generate quantized coefficients, motion vectors, and other syntax elements.
- Entropy decoding unit 80 forwards the motion vectors and other syntax elements to prediction processing unit 81 .
- Video decoder 30 may receive the syntax elements at the video slice level and/or the video block level.
- intra prediction unit 84 of prediction processing unit 81 may generate prediction data for a video block of the current video slice based on a signaled intra prediction mode and data from previously decoded blocks of the current picture.
- motion compensation unit 82 of prediction processing unit 81 produces predictive blocks for a video block of the current video slice based on the motion vectors and other syntax elements received from entropy decoding unit 80 .
- the predictive blocks may be produced from one of the reference pictures within one of the reference picture lists.
- Video decoder 30 may construct the reference frame lists, List 0 and List 1 , using default construction techniques based on reference pictures stored in decoded picture buffer 92 . In some examples, video decoder 30 may construct List 0 and List 1 from the reference pictures identified in the derived reference picture set.
- Motion compensation unit 82 determines prediction information for a video block of the current video slice by parsing the motion vectors and other syntax elements, and uses the prediction information to produce the predictive blocks for the current video block being decoded. For example, motion compensation unit 82 uses some of the received syntax elements to determine a prediction mode (e.g., intra- or inter-prediction) used to code the video blocks of the video slice, an inter-prediction slice type (e.g., B slice or P slice), construction information for one or more of the reference picture lists for the slice, motion vectors for each inter-encoded video block of the slice, inter-prediction status for each inter-coded video block of the slice, and other information to decode the video blocks in the current video slice.
- Motion compensation unit 82 may also perform interpolation based on interpolation filters. Motion compensation unit 82 may use interpolation filters as used by video encoder 20 during encoding of the video blocks to calculate interpolated values for sub-integer pixels of reference blocks. In this case, motion compensation unit 82 may determine the interpolation filters used by video encoder 20 from the received syntax elements and use the interpolation filters to produce predictive blocks.
- Inverse quantization unit 86 inverse quantizes, i.e., de-quantizes, the quantized transform coefficients provided in the bitstream and decoded by entropy decoding unit 80 .
- the inverse quantization process may include use of a quantization parameter calculated by video encoder 20 for each video block in the video slice to determine a degree of quantization and, likewise, a degree of inverse quantization that should be applied.
- Inverse transform unit 88 applies an inverse transform, e.g., an inverse DCT, an inverse integer transform, or a conceptually similar inverse transform process, to the transform coefficients in order to produce residual blocks in the pixel domain.
- video decoder 30 forms a decoded video block by summing the residual blocks from inverse transform unit 88 with the corresponding predictive blocks generated by prediction processing unit 81 .
- Summer 90 represents the component or components that perform this summation operation.
- a deblocking filter may also be applied to filter the decoded blocks in order to remove blockiness artifacts.
- Other loop filters may also be used to smooth pixel transitions, or otherwise improve the video quality.
- DPB management unit 93 may store the decoded video blocks of a given picture in decoded picture buffer 92 , which stores reference pictures used for subsequent motion compensation.
- Decoded picture buffer 92 also stores decoded video for later presentation on a display device, such as display device 32 of FIG. 1 .
- prediction processing unit 81 and DPB management unit 93 represent example units for performing the example functions described above.
- prediction processing unit 81 may receive a coded video sequence comprising a first sub-sequence and a second sub-sequence, wherein the first sub-sequence includes one or more frames each having a first resolution, and the second sub-sequence includes one or more frames each having a second resolution, and wherein the first sub-sequence is different than the second sub-sequence, and the first resolution is different than the second resolution.
- Prediction processing unit 81 may also receive a first sequence parameter set and a second sequence parameter set for the coded video sequence, wherein the first sequence parameter set indicates the first resolution of the one or more frames of the first sub-sequence, and the second sequence parameter set indicates the second resolution of the one or more frames of the second sub-sequence, and wherein the first sequence parameter set is different than the second sequence parameter set. Prediction processing unit 81 may also use the first sequence parameter set and the second sequence parameter set to decode the coded video sequence.
- prediction processing unit 81 may also receive a first decoded frame of video data, wherein the first decoded frame is associated with a first resolution.
- DPB management unit 93 may determine whether DPB 92 is available to store the first decoded frame based on the first resolution and, in the event the decoded picture buffer is available to store the first decoded frame, store the first decoded frame in DPB 92 . DPB management unit 93 may then determine whether DPB 92 is available to store a second decoded frame of video data, wherein the second decoded frame is associated with a second resolution, based on the first resolution and the second resolution, and wherein the first decoded frame is different than the second decoded frame.
- video decoder 30 may perform any of the techniques of this disclosure. In some examples, video decoder 30 may perform some or all of the techniques described above with respect to video encoder 20 in FIG. 2 . In some examples, video decoder 30 may perform the techniques described with respect to FIG. 2 in a reciprocal ordering or manner to that described with respect to video encoder 20 .
- FIGS. 4A-4D are conceptual diagrams that illustrate examples of a coded bitstream including coded video data in accordance with the techniques of this disclosure.
- a coded bitstream 400 may comprise one or more coded video sequences (CVSs), in particular, CVS 402 and CVS 404 .
- each of CVS 402 and CVS 404 may comprise one or more frames, or “pictures,” PIC_ 1 ( 0 )-PIC_ 1 (N), and PIC_ 2 ( 0 )-PIC_ 2 (M), respectively.
- each of CVS 402 and CVS 404 may further comprise a single sequence parameter set (SPS), in particular, SPS 1 and SPS 2 , respectively.
- SPS 1 and SPS 2 may define parameters for the corresponding one of CVS 402 and CVS 404 , including LCU size, SCU size, and other syntax information for the respective CVS that is common to all frames, or “pictures” within the CVS.
- a particular CVS, CVS 406 may further comprise one or more picture parameter sets (PPSs), in particular, PPS 1 and PPS 2 .
- each of PPS 1 and PPS 2 may define parameters for CVS 406 , including syntax information that indicates picture resolution, that are common to one or more pictures within CVS 406 , but not to all pictures within CVS 406 .
- syntax information included within each of PPS 1 and PPS 2 (e.g., picture resolution syntax information) may apply to a sub-set of the pictures included within CVS 406 .
- PPS 1 may indicate picture resolution for PIC_ 1 ( 0 )-PIC_ 1 (N)
- PPS 2 may indicate picture resolution for PIC_ 2 ( 0 )-PIC_ 2 (M).
- CVS 406 may comprise pictures having different resolutions, wherein picture resolution for a particular one or more pictures (e.g., PIC_ 1 ( 0 )-PIC_ 1 (N)) within CVS 406 that share a common picture resolution may be specified by a corresponding one of PPS 1 and PPS 2 .
- a PPS may have to be signaled prior to each picture having a different picture resolution relative to a previous picture in the decoding order, to indicate the picture resolution for the currently decoded picture. Accordingly, in such cases, multiple PPSs may need to be signaled throughout decoding the CVS, which may increase coding overhead.
- a PPS RBSP may include parameters that can be referred to by coded slice NAL units of one or more coded pictures.
- Each PPS RBSP is initially considered not active at a start of a decoding process. In most examples, one PPS RBSP is considered active at any given moment during the decoding process, and activation of any particular PPS RBSP results in deactivation of a previously-active PPS RBSP, if any.
- when a PPS RBSP (with a particular value of the pic_parameter_set_id syntax element) is not active, and is referred to by a coded slice NAL unit (using the particular value of pic_parameter_set_id), the PPS referred to by that pic_parameter_set_id is activated.
- This PPS RBSP is referred to as an “active PPS RBSP,” until it is deactivated by an activation of another PPS.
- Video encoder 20 or decoder 30 may require a PPS with the referenced pic_parameter_set_id value to have been received before activating the PPS with that pic_parameter_set_id.
- a NAL unit may refer to PPS 1 .
- Video encoder 20 or decoder 30 may activate PPS 1 based on the reference to PPS 1 in the NAL unit.
- PPS 1 is the active PPS RBSP.
- PPS 1 remains the active PPS RBSP until a NAL unit references PPS 2 , at which point video encoder 20 or decoder 30 may activate PPS 2 .
- PPS 2 becomes the active PPS RBSP, and PPS 1 is no longer the active PPS RBSP.
- Any PPS NAL unit that has the same pic_parameter_set_id value for the active PPS RBSP for a coded picture may have the same content as that of the active PPS RBSP for the coded picture. That is, if the pic_parameter_set_id of the PPS NAL is the same as that of the active PPS RBSP, the content of the active PPS RBSP may not change. There may be an exception to this rule, however.
- a PPS NAL has the same pic_parameter_set_id as the active PPS RBSP, and the PPS NAL follows the last Video Coding Layer (VCL) NAL unit of the coded picture, and precedes the first VCL NAL unit of another coded picture, then the content of the active PPS RBSP may change (e.g., the pic_parameter_set_id value may indicate a different set of parameters).
- syntax information that indicates picture resolution for one or more pictures within a CVS, wherein the CVS comprises one or more pictures having different sizes may be indicated using multiple SPSs for the CVS, rather than using a plurality of PPSs, as described above with reference to FIGS. 4A-4B .
- an SPS RBSP may include parameters that can be referred to by one or more PPS RBSPs, or one or more Supplemental Enhancement Information (SEI) NAL units containing a buffering period SEI message.
- Each SPS is initially considered not active at a start of a decoding process. At most, one SPS may be considered active for each RSS at any given moment during the decoding process, and the activation of any particular SPS may result in a deactivation of a previously-active SPS for the same resolution sub-sequence, if any. Also, if there are “n” resolution sub-sequences within the CVS, at most “n” SPS RBSPs may be considered active for the entire CVS at any given moment during the decoding process.
- an SPS RBSP (with a particular value of seq_parameter_set_id) is not already active, and is referred to by activation of a PPS RBSP (using the particular value of seq_parameter_set_id), or is referred to by an SEI NAL unit containing a buffering period SEI message (using the particular value of seq_parameter_set_id), the SPS RBSP is activated.
- This SPS RBSP may be referred to as an “active SPS RBSP” for the associated RSS (the RSS in which the coded pictures refer to the active SPS RBSP through the PPS RBSPs), until it is deactivated by an activation of another SPS RBSP.
- Video encoder 20 or decoder 30 may require the SPS RBSP with a particular value of seq_parameter_set_id to be available to video encoder 20 or video decoder 30 prior to the activation of that SPS. Additionally, the SPS may remain active for the entire RSS in the CVS.
- an SPS RBSP may only be activated by a buffering period SEI message when the buffering period SEI message is part of an IDR access unit.
- Any SPS NAL unit containing the particular value of seq_parameter_set_id for the active SPS RBSP for a RSS in a CVS may have the same content as that of the active SPS RBSP for the RSS in the CVS, unless it follows a last access unit of the CVS, and precedes the first VCL NAL unit and the first SEI NAL unit containing a buffering period SEI message (when present) of another CVS.
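The per-RSS SPS activation described above (at most one active SPS per resolution sub-sequence, hence at most “n” active SPS RBSPs for “n” RSSs) can be sketched as follows. The class and its identifiers are illustrative assumptions; it only models the bookkeeping, not real NAL unit parsing.

```python
class SpsActivation:
    def __init__(self):
        self.received = {}       # seq_parameter_set_id -> SPS content
        self.active_by_rss = {}  # rss_id -> active seq_parameter_set_id

    def receive_sps(self, sps_id, content):
        """Make an SPS RBSP available prior to any activation."""
        self.received[sps_id] = content

    def activate(self, rss_id, sps_id):
        """Activation for an RSS deactivates that RSS's previous SPS, if any."""
        if sps_id not in self.received:
            raise ValueError("SPS must be available before activation")
        self.active_by_rss[rss_id] = sps_id

    def active_count(self) -> int:
        # With "n" RSSs, at most "n" SPS RBSPs are active at once.
        return len(self.active_by_rss)
```

With two RSSs referring to two SPSs (as in the two-resolution example later in this disclosure), exactly two SPS RBSPs are active at once, one per RSS.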
- when a PPS RBSP or SPS RBSP is conveyed within the bitstream, these constraints impose an order constraint on the NAL units that contain the PPS RBSP or the SPS RBSP, respectively. Otherwise, if the PPS RBSP or SPS RBSP is conveyed by other means not specified in this disclosure, it should be available to the decoding process in a timely fashion such that these constraints are obeyed.
- constraints that are expressed on the relationship between the values of the syntax elements (and the values of variables derived from those syntax elements) in SPS and PPS, and other syntax elements, are typically expressions of constraints that apply only to the active SPS and the active PPS. If any SPS RBSP is present that is not activated in the bitstream, its syntax elements usually have values that would conform to the specified constraints if it were activated by reference in an otherwise conforming bitstream. If any PPS RBSP is present that is not ever activated in the bitstream, the syntax elements of the PPS RBSP may have values that would conform to the specified constraints if the PPS were activated by reference in an otherwise-conforming bitstream.
- the values of parameters of the active PPS and the active SPS may be considered to be in effect.
- the values of the parameters of the PPS and SPS that are active for the operation of the decoding process for the VCL NAL units of the primary coded picture in the same access unit may be considered in effect unless otherwise specified in the SEI message semantics.
- CVS 408 may include one or more SPSs, in particular, SPS 1 and SPS 2 , that each indicate picture resolution for PIC_ 1 ( 0 ), PIC_ 1 ( 1 ), etc., and PIC_ 2 ( 0 ), PIC_ 2 ( 1 ), etc., respectively.
- SPS 1 indicates picture resolution information for PIC_ 1 ( 0 ), PIC_ 1 ( 1 ), etc.
- SPS 2 indicates picture resolution information for PIC_ 2 ( 0 ), PIC_ 2 ( 1 ), etc.
- CVS 408 may further comprise one or more PPSs (not shown), wherein the one or more PPSs may specify syntax information for one or more pictures of CVS 408 , but wherein the one or more PPSs do not include any syntax information that indicates picture resolution for any of the one or more pictures of CVS 408 .
- SPS 1 and SPS 2 may indicate picture resolution information for all pictures within CVS 408 , even in cases where pictures having different resolutions are alternated within a CVS in the decoding order. Accordingly, after indicating the picture resolution information for all pictures within CVS 408 using SPS 1 and SPS 2 , no additional indication of the information may be needed.
- the multiple SPSs may be located at the beginning of the corresponding CVS, e.g., CVS 408 , prior to any of PIC_ 1 ( 0 ), PIC_ 1 ( 1 ) and PIC_ 2 ( 0 ), PIC_ 2 ( 1 ).
- an SPS that indicates picture resolution information for one or more pictures may be located before a first one of such pictures in a decoding sequence. For example, as shown in FIG. 4D , SPS 2 is located within CVS 410 prior to a first one of pictures PIC_ 2 ( 0 ), PIC_ 2 ( 1 ), etc., but after a first one of PIC_ 1 ( 0 ), PIC_ 1 ( 1 ), etc.
- FIG. 5 is a conceptual diagram illustrating the operation of a decoded picture buffer of a hypothetical reference decoder (HRD) model in accordance with the techniques of this disclosure.
- FIG. 5 includes coded picture buffer (CPB) 502 , decoded picture buffer (DPB) 504 , and DPB management unit 506 .
- DPB management unit 506 may remove a picture in coded picture buffer (CPB) 502 .
- Video encoder 20 or decoder 30 may decode the picture, and DPB management unit 506 may store the decoded picture in decoded picture buffer 504 . Based on various criteria, such as an output time, output flag, or a picture count, DPB management unit 506 may remove a picture from DPB 504 .
- video encoder 20 or decoder 30 may output the decoded picture.
- CPB 502 may contain encoded pictures, which may be removed and decoded so that video encoder 20 or decoder 30 may utilize the decoded pictures that may be needed for inter-prediction as a reference, or for future output.
- DPB 504 may include a maximum capacity. In previous video coding standards, DPB 504 may include a maximum number of frames that can be stored in the DPB. However, to support adaptive-resolution CVSs, DPB management unit 506 may maintain a count of blocks contained within the DPB to measure the “fullness” of the DPB.
- DPB management unit 506 of video decoder 30 may remove decoded pictures based on an output time if the pictures are intended for output.
- DPB management unit 506 may remove decoded pictures based on the picture order count (POC) values if the pictures are intended for output.
- DPB management unit 506 may remove decoded pictures that are not needed for output (i.e., outputted already or not intended for output) when the decoded picture is not in the reference picture set, and prior to decoding the current picture.
- video encoder 20 and DPB management unit 506 of video encoder 20 may also perform any of the DPB management techniques described in this disclosure.
- DPB 504 may include a plurality of buffers, and each buffer may store a decoded picture that is to be used as a reference picture or is held for future output. Initially, the DPB is empty (i.e., the DPB fullness is set to zero). In the described example techniques, the removal of the decoded pictures from the DPB may occur before the decoding of the current picture, but after video decoder 30 parses the slice header of the first slice of the current picture.
- t r (n) is CPB removal time (i.e., decoding time) of the access unit n containing the current picture.
- the techniques occurring instantaneously may mean that, in the HRD model, it is assumed that decoding of a picture is instantaneous, with a time period for decoding a picture equal to zero.
- decoder 30 may invoke the derivation process for a reference picture set. If the current picture, which DPB management unit 506 may retrieve from CPB 502 , is an IDR picture, DPB management unit 506 may remove all decoded pictures from DPB 504 , and may set the DPB fullness to 0. If the decoded picture is not an IDR picture, DPB management unit 506 may remove all pictures not included in the reference picture set of the current picture from DPB 504 .
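The removal step just described can be sketched as follows. This is an illustrative sketch, not an implementation: pictures are plain dicts keyed by hypothetical names, and the reference picture set is a set of POC values. An IDR picture flushes the whole DPB and resets fullness to 0; otherwise only pictures outside the current picture's reference picture set are removed, with fullness decremented by each removed picture's 8×8-block count.

```python
def remove_before_decoding(dpb, is_idr: bool, reference_picture_set):
    """Remove pictures from the DPB prior to decoding the current picture."""
    if is_idr:
        # IDR picture: remove all decoded pictures and reset fullness to 0.
        dpb["pictures"].clear()
        dpb["fullness"] = 0
    else:
        # Non-IDR: keep only pictures in the reference picture set.
        kept = [p for p in dpb["pictures"] if p["poc"] in reference_picture_set]
        for p in dpb["pictures"]:
            if p not in kept:
                # Decrement fullness by the removed picture's 8x8 blocks.
                dpb["fullness"] -= (p["width"] * p["height"]) >> 6
        dpb["pictures"] = kept
```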
- the OutputFlag may indicate that video decoder 30 should output the picture (e.g., for display or for transmission in the case of an encoder).
- DPB management unit 506 may decrement the fullness of DPB 504 by the number of 8×8 blocks in the picture, i.e., (pic_width_in_luma_samples*pic_height_in_luma_samples)>>6.
- after DPB management unit 506 has removed any pictures from the DPB, video decoder 30 may decode and store the received picture “n” in the DPB. DPB management unit 506 may increment the DPB fullness by the number of 8×8 blocks in the stored decoded picture, i.e., (pic_width_in_luma_samples*pic_height_in_luma_samples)>>6.
- Each picture may also have an OutputFlag, as described above.
- the DPB output time, denoted as t o,dpb ( n ), of the picture may be derived by the following equation: t o,dpb ( n )=t r ( n )+t c *dpb_output_delay( n ), where t c is the clock tick.
- dpb_output_delay( n ) may be the value of dpb_output_delay specified in the picture timing SEI message associated with access unit “n.”
- if the value of OutputFlag is equal to 1, and t o,dpb ( n )=t r ( n ), video decoder 30 may output the current picture. If the value of OutputFlag is equal to 0, video decoder 30 may not output the current picture. Otherwise (i.e., if OutputFlag is equal to 1 and t o,dpb ( n )>t r ( n )), video decoder 30 may output the current picture later, at time t o,dpb ( n ).
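The output-time logic above can be sketched numerically. This is a minimal sketch assuming the conventional HRD relation t o,dpb ( n )=t r ( n )+t c *dpb_output_delay( n ); the function names are illustrative, and times are expressed here in integer clock ticks (e.g., a 90 kHz clock) purely for the example.

```python
def dpb_output_time(t_r: int, dpb_output_delay: int, t_c: int) -> int:
    """t_o,dpb(n) = t_r(n) + t_c * dpb_output_delay(n)."""
    return t_r + t_c * dpb_output_delay

def output_action(output_flag: int, t_o_dpb: int, t_r: int) -> str:
    """Decide whether the current picture is output, and when."""
    if output_flag == 0:
        return "not output"
    if t_o_dpb == t_r:
        return "output now"       # output time equals the CPB removal time
    return "output later"          # at time t_o,dpb(n) > t_r(n)
```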
- video decoder 30 may crop the picture in the decoded picture buffer.
- Video decoder 30 may utilize the cropping rectangle specified in the active sequence parameter set for the picture to determine the cropping rectangle.
- video decoder 30 may determine a difference between the DPB output time for a picture and the DPB output time for a picture following the picture in output order.
- the output time difference of picture “n,” Δt o,dpb ( n ), may be defined according to the following equation.
- Δt o,dpb ( n )=t o,dpb ( n n )−t o,dpb ( n )
- n n may denote the picture that follows picture “n” in output order and has OutputFlag equal to 1.
- the HRD may implement the techniques instantaneously when DPB management unit 506 removes an access unit from CPB 502 .
- video decoder 30 and DPB management unit 506 of video decoder 30 may implement the removing of decoded pictures from DPB 504
- video decoder 30 may not necessarily include CPB 502 .
- video decoder 30 and video encoder 20 may not require CPB 502 . Rather, CPB 502 is described as part of the HRD model for purposes of illustration only.
- DPB management unit 506 may remove the pictures from the DPB before the decoding of the current picture, but after parsing the slice header of the first slice of the current picture. Also, similar to the first perspective for removing decoded pictures, in the second perspective, video decoder 30 and DPB management unit 506 may perform similar functions to those described above with respect to the first perspective when the current picture is an IDR picture.
- DPB management unit 506 may empty, without output, buffers of the DPB that store a picture that is marked as “not needed for output” and that store pictures not included in the reference picture set of the current picture. DPB management unit 506 may also decrement the DPB fullness by the number of buffers that DPB management unit 506 emptied. When there is no empty buffer (i.e., the DPB fullness is equal to the DPB size), DPB management unit 506 may implement a “bumping” process described below. In some examples, when there is no empty buffer, DPB management unit 506 may implement the bumping process repeatedly until there is an empty buffer in which video decoder 30 can store the current decoded picture.
- video decoder 30 may implement the following steps to implement the bumping process.
- Video decoder 30 may first determine the picture to be outputted. For example, video decoder 30 may select the picture having the smallest PicOrderCnt (POC) value of all the pictures in DPB 504 that are marked as “needed for output.”
- Video decoder 30 may crop the selected picture using the cropping rectangle specified in the active sequence parameter set for the picture.
- Video decoder 30 may output the cropped picture, and may mark the picture as “not needed for output.”
- Video decoder 30 may check the buffer of DPB 504 that stored the cropped and outputted picture. If the picture is not included in the reference picture set, DPB management unit 506 may empty that buffer and may decrement the DPB fullness by the number of 8×8 blocks in the removed picture.
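The bumping steps listed above can be sketched in one function. This is an illustrative sketch under stated assumptions: pictures are plain dicts, the reference picture set is a set of POC values, and the cropping step is elided; none of the names come from an actual decoder.

```python
def bump(dpb, reference_picture_set):
    """Output one picture; free its buffer if it is no longer a reference.
    Returns the POC of the outputted picture."""
    # 1. Select the smallest POC among pictures marked "needed for output".
    pic = min((p for p in dpb["pictures"] if p["needed_for_output"]),
              key=lambda p: p["poc"])
    # 2-3. Crop (elided here) and output the picture, then mark it.
    pic["needed_for_output"] = False
    # 4. If the picture is not in the reference picture set, empty its
    #    buffer and decrement fullness by its number of 8x8 blocks.
    if pic["poc"] not in reference_picture_set:
        dpb["pictures"].remove(pic)
        dpb["fullness"] -= (pic["width"] * pic["height"]) >> 6
    return pic["poc"]
```

Repeating this until a buffer is free mirrors the repeated bumping described above when the DPB fullness equals the DPB size.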
- video encoder 20 may implement similar techniques. However, video encoder 20 implementing similar techniques is not required in every example. In some examples, video decoder 30 may implement these techniques, and video encoder 20 may not implement these techniques.
- a video coder may implement techniques to support CVSs having adaptive resolution.
- the reference picture set may identify the reference pictures that can potentially be used for inter-predicting the current picture and for inter-predicting one or more pictures following the current picture in decoding order.
- the DPB size or fullness may be signaled with respect to the number of 8×8 blocks of a picture stored in the DPB.
- the fullness of the DPB, i.e., the max_dec_pic_buffering syntax element, may be signaled based on the number of smallest coding units (SCUs) of a picture. For example, if the smallest SCU among all active SPSs is 16×16, then the unit of max_dec_pic_buffering may be 16×16 blocks.
- video encoder 20 or decoder 30 may signal the DPB size, indicated by the max_dec_pic_buffering syntax element, using units of frame buffers that are specific to the spatial resolution indicated by the SPS. For example, if there are two RSSs, rss 1 and rss 2 , with resolution res 1 and resolution res 2 , referring to SPS sps 1 and SPS sps 2 respectively, wherein res 1 is greater than res 2 , then max_dec_pic_buffering in sps 1 is counted in frame buffers of res 1 , and max_dec_pic_buffering in sps 2 is counted in frame buffers of res 2 .
- video encoder 20 or decoder 30 may be subject to the restriction that the DPB size, if counted in units of 8×8 blocks, indicated by the max_dec_pic_buffering value in sps 1 may not be less than that indicated by the max_dec_pic_buffering value in sps 2 . Consequently, in the DPB operations, when video decoder 30 removes one frame buffer of res 1 from DPB 504 , the freed buffer space may be sufficient for insertion of a decoded picture of either resolution. However, when decoder 30 removes one frame buffer of res 2 from DPB 504 , the freed buffer space may not be sufficient for insertion of a decoded picture of res 1 . Rather, video decoder 30 may remove multiple frame buffers of res 2 from DPB 504 in this case.
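The two-resolution accounting above can be made concrete with a small sketch. The helper names are assumptions for illustration: freeing one res 1 frame buffer always leaves room for a picture of either resolution, while several freed res 2 buffers may be needed to hold one res 1 picture.

```python
def frame_buffer_blocks(width: int, height: int) -> int:
    """Size of one frame buffer of a given resolution, in 8x8 blocks."""
    return (width * height) >> 6

def res2_buffers_needed_for_res1(res1, res2) -> int:
    """How many freed res2 frame buffers are needed to hold one res1 picture."""
    b1 = frame_buffer_blocks(*res1)
    b2 = frame_buffer_blocks(*res2)
    return -(-b1 // b2)  # ceiling division
```

For res 1 = 1920×1080 and res 2 = 960×540, one res 1 buffer (32400 blocks) holds a picture of either resolution, but four res 2 buffers (8100 blocks each) must be freed to fit one res 1 picture.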
- the video decoder 30 may derive the reference picture set in any manner, including the example techniques described above. Video decoder 30 may determine whether a decoded picture stored in the decoded picture buffer is not needed for output and is not identified in the reference picture set. When video decoder 30 has outputted the decoded picture and the decoded picture is not identified in the reference picture set, the video decoder 30 may remove the decoded picture from the decoded picture buffer. Subsequent to removing the decoded picture, video decoder 30 may code the current picture. For example, video decoder 30 may construct the reference picture list(s) as described above, and code the current picture based on the reference picture list(s).
- FIG. 6 is a flowchart illustrating an example operation of using a first sub-sequence and a second sub-sequence to decode video in accordance with the techniques of this disclosure.
- the method of FIG. 6 may be performed by a video coder corresponding to either video encoder 20 or video decoder 30 .
- the video coder may process a coded video sequence comprising a first sub-sequence and a second sub-sequence ( 601 ).
- the first sub-sequence may include one or more frames each having a first resolution
- the second sub-sequence may include one or more frames each having a second resolution.
- the first sub-sequence may be different than the second sub-sequence, and the first resolution may be different than the second resolution.
- the video coder may also process a first sequence parameter set (SPS) and a second sequence parameter set for the coded video sequence ( 602 ).
- the first sequence parameter set may indicate the first resolution of the one or more frames of the first sub-sequence
- the second sequence parameter set may indicate the second resolution of the one or more frames of the second sub-sequence.
- the first sequence parameter set may also be different than the second sequence parameter set.
- the video coder (e.g., video encoder 20 or video decoder 30 ) may then use the first sequence parameter set and the second sequence parameter set to code the coded video sequence.
- the video coder may comprise an encoder, e.g., encoder 20 of FIGS. 1-2 .
- processing SPSs and sub-sequences may comprise receiving the SPSs and sub-sequences.
- coding the first and second video sequences may comprise decoding the first and second video sequences.
- processing SPSs and sub-sequences may comprise generating the SPSs and sub-sequences.
- coding the first and second video sequences may comprise encoding the first and second video sequences.
- the video encoder may transmit the coded video sequence comprising the first sub-sequence and the second sub-sequence instead of receiving the video sequence comprising the first and second sub-sequences.
- the first resolution and the second resolution may each comprise a spatial resolution.
- the video coder may code the first sequence parameter set and the second sequence parameter in a received bitstream prior to either the first sub-sequence or the second sub-sequence.
- the video coder may be configured to receive both the first sequence parameter set and the second sequence parameter set prior to receiving either of the first sub-sequence and the second sub-sequence.
- the video coder may code the first sequence parameter set in a received bitstream prior to the first sub-sequence and the second sequence parameter set is coded in the received bitstream after at least one frame of the one or more frames of the first sub-sequence, and prior to the second sub-sequence.
- the video coder may be configured to receive the second sequence parameter set after receiving at least one frame of the one or more frames of the first sub-sequence, and prior to receiving the second sub-sequence.
- the video coder may interleave the one or more frames of the first sub-sequence and the one or more frames of the second sub-sequence in the coded video sequence.
- FIG. 7 is a flowchart illustrating an example operation of managing a decoded picture buffer.
- the method of FIG. 7 may be performed by a video coder corresponding to either video encoder 20 or video decoder 30 .
- a video coder may receive a coded video sequence comprising a first sub-sequence and a second sub-sequence ( 701 ).
- the first sub-sequence may include one or more frames each having a first resolution
- the second sub-sequence may include one or more frames each having a second resolution.
- the first sub-sequence may be different than the second sub-sequence, and the first resolution may be different than the second resolution.
- the video coder may receive a first decoded frame of video data, and the first decoded frame may be associated with a first resolution.
- the resolution may comprise a spatial resolution.
- the video coder may also determine whether a decoded picture buffer is available to store the first decoded frame based on the first resolution ( 702 ). In the event the decoded picture buffer is available to store the first decoded frame, the video coder may store the first decoded frame in the decoded picture buffer, and determine whether the decoded picture buffer is available to store a second decoded frame of video data. The second decoded frame of video data may be associated with a second resolution. The video coder may also determine whether the decoded picture buffer is available to store the second decoded frame based on the first resolution and the second resolution ( 704 ). The first decoded frame may also be different than the second decoded frame.
- the video coder may be configured to determine an amount of information that may be stored within the decoded picture buffer, determine an amount of information associated with the first decoded frame based on the first resolution, and compare the amount of information that may be stored within the decoded picture buffer and the amount of information associated with the first decoded frame.
- the video coder may be configured to determine an amount of information that may be stored within the decoded picture buffer based on the first resolution, determine an amount of information associated with the second decoded frame based on the second resolution, and compare the amount of information that may be stored within the decoded picture buffer and the amount of information associated with the second decoded frame.
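The availability determinations described above (steps 702 and 704 of FIG. 7) can be sketched as follows. This is a minimal illustration only: the `DecodedPictureBuffer` class, the capacity value, and the use of luma samples as the unit of "information" are assumptions made for the sketch, not part of the described techniques.

```python
class DecodedPictureBuffer:
    """Toy DPB whose capacity is measured in samples, not frame buffers.

    Illustrative assumption: the "amount of information" for a frame is
    taken to be width * height (luma samples); a real coder would also
    account for chroma planes and bit depth.
    """

    def __init__(self, capacity_samples):
        self.capacity = capacity_samples
        self.stored = []  # (width, height) of each buffered decoded frame

    def used(self):
        return sum(w * h for (w, h) in self.stored)

    def is_available_for(self, width, height):
        # Compare the DPB's free space against the amount of information
        # associated with the frame, derived from its spatial resolution.
        return self.used() + width * height <= self.capacity

    def insert(self, width, height):
        if not self.is_available_for(width, height):
            raise MemoryError("DPB has insufficient space for this resolution")
        self.stored.append((width, height))


dpb = DecodedPictureBuffer(capacity_samples=2 * 1920 * 1080)
dpb.insert(1920, 1080)                       # first decoded frame, first resolution (702)
can_store = dpb.is_available_for(1280, 720)  # second frame, second resolution (704)
```

Because availability is judged against each frame's own resolution, the same DPB may hold, for example, two 1920x1080 frames or a larger number of smaller frames.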
- such computer-readable storage media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage, or other magnetic storage devices, flash memory, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer.
- any connection is properly termed a computer-readable medium.
- For example, if instructions are transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium.
- Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
- The instructions may be executed by one or more processors, such as one or more digital signal processors (DSPs), general purpose microprocessors, application specific integrated circuits (ASICs), field programmable logic arrays (FPGAs), or other equivalent integrated or discrete logic circuitry.
- The term "processor," as used herein, may refer to any of the foregoing structure or any other structure suitable for implementation of the techniques described herein.
- the functionality described herein may be provided within dedicated hardware and/or software modules configured for encoding and decoding, or incorporated in a combined codec. Also, the techniques could be fully implemented in one or more circuits or logic elements.
- the techniques of this disclosure may be implemented in a wide variety of devices or apparatuses, including a wireless handset, an integrated circuit (IC) or a set of ICs (e.g., a chip set).
- Various components, modules, or units are described in this disclosure to emphasize functional aspects of devices configured to perform the disclosed techniques, but do not necessarily require realization by different hardware units. Rather, as described above, various units may be combined in a codec hardware unit or provided by a collection of interoperative hardware units, including one or more processors as described above, in conjunction with suitable software and/or firmware.
Abstract
Techniques are described related to receiving first and second sub-sequences of video, wherein the first sub-sequence includes one or more frames each having a first resolution, and the second sub-sequence includes one or more frames each having a second resolution, receiving a first sequence parameter set and a second sequence parameter set for the coded video sequence, wherein the first sequence parameter set indicates the first resolution of the one or more frames of the first sub-sequence, and the second sequence parameter set indicates the second resolution of the one or more frames of the second sub-sequence, and wherein the first sequence parameter set is different than the second sequence parameter set, and using the first sequence parameter set and the second sequence parameter set to decode the coded video sequence.
Description
- This application claims the benefit of:
- U.S. Provisional Application No. 61/545,525, filed Oct. 10, 2011, and
- U.S. Provisional Application No. 61/550,276, filed Oct. 21, 2011, the entire contents of each of which are hereby incorporated by reference.
- This disclosure relates to video coding and, more particularly, to techniques for coding video data.
- Digital video capabilities can be incorporated into a wide range of devices, including digital televisions, digital direct broadcast systems, wireless broadcast systems, personal digital assistants (PDAs), laptop or desktop computers, tablet computers, e-book readers, digital cameras, digital recording devices, digital media players, video gaming devices, video game consoles, cellular or satellite radio telephones, so-called “smart phones,” video teleconferencing devices, video streaming devices, and the like. Digital video devices implement video compression techniques, such as those described in the standards defined by MPEG-2, MPEG-4, ITU-T H.263, ITU-T H.264/MPEG-4, Part 10, Advanced Video Coding (AVC), the High Efficiency Video Coding (HEVC) standard presently under development, and extensions of such standards. The video devices may transmit, receive, encode, decode, and/or store digital video information more efficiently by implementing such video compression techniques.
- Video compression techniques perform spatial (intra-picture) prediction and/or temporal (inter-picture) prediction to reduce or remove redundancy inherent in video sequences. For block-based video coding, a video slice (i.e., a video picture or a portion of a video picture) may be partitioned into video blocks, which may also be referred to as treeblocks, coding tree blocks (CTBs), coding tree units (CTUs), coding units (CUs) and/or coding nodes. Video blocks in an intra-coded (I) slice of a picture are encoded using spatial prediction with respect to reference samples in neighboring blocks in the same picture. Video blocks in an inter-coded (P or B) slice of a picture may use spatial prediction with respect to reference samples in neighboring blocks in the same picture or temporal prediction with respect to reference samples in other reference pictures. Pictures may be referred to as frames, and reference pictures may be referred to as reference frames.
- Spatial or temporal prediction results in a predictive block for a block to be coded. Residual data represents pixel differences between the original block to be coded and the predictive block. An inter-coded block is encoded according to a motion vector that points to a block of reference samples forming the predictive block, and the residual data indicating the difference between the coded block and the predictive block. An intra-coded block is encoded according to an intra-coding mode and the residual data. For further compression, the residual data may be transformed from the pixel domain to a transform domain, resulting in residual transform coefficients, which then may be quantized. The quantized transform coefficients, initially arranged in a two-dimensional array, may be scanned in order to produce a one-dimensional vector of transform coefficients, and entropy coding may be applied to achieve even more compression.
- In general, this disclosure describes techniques for coding video sequences that include frames, or “pictures,” having different spatial resolutions. One aspect of this disclosure includes using multiple sequence parameter sets in a single resolution-adaptive coded video sequence to indicate a resolution of a sequence of pictures in coded video. As one example, the resolution-adaptive coded video sequence may comprise two or more sub-sequences which may be coded, wherein each sub-sequence may comprise a set of pictures with a common spatial resolution, and may refer to a same active sequence parameter set. Another aspect of this disclosure includes a novel activation process for activating a sequence parameter set when using multiple sequence parameter sets in a single resolution-adaptive coded video sequence, as described above.
- Yet another aspect of this disclosure includes novel techniques for managing a decoded picture buffer (DPB). As one example, a size of a DPB is not indicated using a number of frame buffers (e.g., a number of storage locations each capable of storing a frame, or “picture,” of a fixed size), consistent with some techniques, but rather using a different unit of size. As another example, before inserting a decoded picture into a DPB, the availability of the DPB to store the decoded picture is determined based on a spatial resolution of the decoded picture to be inserted, so as to ensure that the DPB includes sufficient empty buffer space for inserting the decoded picture. As still another example, after removing a decoded picture from a DPB, the availability of the DPB to store a subsequent decoded picture is determined based on a spatial resolution of the removed decoded picture, and a spatial resolution of the subsequent decoded picture to be inserted into the DPB. In other words, the proportion of the DPB unavailable to store decoded pictures, or a “fullness” of the DPB, after removing the decoded picture, is not decreased by an amount corresponding to a single decoded picture of a fixed size, consistent with some techniques, but rather by a varying amount, depending on the spatial resolution of the removed decoded picture.
-
FIG. 1 is a block diagram illustrating an example video encoding and decoding system that may utilize techniques described in this disclosure. -
FIG. 2 is a block diagram illustrating an example video encoder that may implement the techniques described in this disclosure. -
FIG. 3 is a block diagram illustrating an example video decoder that may implement the techniques described in this disclosure. -
FIGS. 4A-4D are conceptual diagrams illustrating an example video sequence that includes a plurality of pictures that are encoded and transmitted in accordance with the techniques of this disclosure. -
FIG. 5 is a conceptual diagram illustrating the operation of a decoded picture buffer of a hypothetical reference decoder (HRD) model in accordance with the techniques of this disclosure. -
FIG. 6 is a flowchart illustrating an example operation of using a first sub-sequence and a second sub-sequence to decode video in accordance with the techniques of this disclosure. -
FIG. 7 is a flowchart illustrating an example operation of managing a decoded picture buffer in accordance with the techniques of this disclosure. - The techniques of this disclosure are generally related to techniques for using multiple sequence parameter sets (SPSs) for communicating video data at different resolutions, and techniques for managing the multiple SPSs. In the current High Efficiency Video Coding (HEVC) design, pictures in a same coded video sequence (CVS) have a same size, wherein the size is signaled in a sequence parameter set (SPS) for the CVS. Additional syntax information for the CVS, also signaled in the SPS, includes the Largest Coding Unit (LCU) size and the Smallest Coding Unit (SCU) size, which define a largest and a smallest block, or coding unit, size for each picture, respectively. In the context of H.264/AVC and High Efficiency Video Coding (HEVC), a CVS may refer to a sequence of coded pictures from an instantaneous decoding refresh (IDR) picture to another IDR picture, exclusive, in decoding order, or to the end of the coded video bitstream if the starting IDR picture is the last IDR picture in the coded video bitstream.
- However, HEVC may support resolution-adaptive video sequences that include frames with different resolutions. One method for adaptive frame size support is described in JCTVC-F158: Resolution switching for coding efficiency and resilience, Davies, 6th Meeting, Turin, IT, 14-22 Jul. 2011, referred to as JCTVC-F158 hereinafter.
- To support resolution-adaptive video, this disclosure describes techniques for coding multiple SPSs. Each SPS of the multiple SPSs may include information related to a sequence of pictures that has a different resolution. This disclosure also introduces a new sequence, referred to as a resolution sub-sequence (RSS), that may refer back to one of the multiple SPSs in order to indicate the resolution of a sequence of pictures. This disclosure also describes techniques for activating a single SPS when multiple parameter sets may be utilized within a single CVS, as well as different techniques and orders for transmitting the different SPSs.
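The use of multiple sequence parameter sets within one coded video sequence can be illustrated with a short sketch. The table and field names below are simplified stand-ins, not actual HEVC syntax structures; only the idea that each sub-sequence's pictures refer back to, and activate, one of several stored SPSs is taken from the description above.

```python
# Simplified SPS table: sps_id -> parameters signaled for a sub-sequence.
sps_table = {}

def receive_sps(sps_id, width, height):
    sps_table[sps_id] = {"width": width, "height": height}

def decode_picture(referenced_sps_id):
    # Activating the referenced SPS yields the resolution (and, in a real
    # codec, LCU/SCU sizes and other parameters) used to decode the picture.
    active_sps = sps_table[referenced_sps_id]
    return (active_sps["width"], active_sps["height"])

receive_sps(0, 1920, 1080)  # first sequence parameter set
receive_sps(1, 1280, 720)   # second sequence parameter set

# The first sub-sequence refers to SPS 0; the second refers to SPS 1.
first_sub_sequence = [decode_picture(0) for _ in range(3)]
second_sub_sequence = [decode_picture(1) for _ in range(2)]
```

Note that both SPSs are retained for the duration of the CVS, so frames of the two sub-sequences may be interleaved without re-signaling either parameter set.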
- The techniques of this disclosure are also related to techniques for managing a decoded picture buffer (DPB). For example, a video coder (e.g., a video encoder or a video decoder) includes a DPB. The DPB stores decoded pictures, including reference pictures. Reference pictures are pictures that can potentially be used for inter-predicting a picture. In other words, the video coder may predict a picture, during coding (encoding or decoding) of that picture, based on one or more reference pictures stored in the DPB.
- Decoded pictures used for predicting subsequent coded pictures, and for future output, are buffered in a Decoded Picture Buffer (DPB).
- To efficiently utilize memory of a DPB, DPB management processes, including a storage process of decoded pictures into the DPB, a marking process of reference pictures, and output and removal processes of decoded pictures from the DPB, are specified. DPB management includes at least the following aspects: (1) Picture identification and reference picture identification; (2) Reference picture list construction; (3) Reference picture marking; (4) Picture output from the DPB; (5) Picture insertion into the DPB; and (6) Picture removal from the DPB. Some introduction to reference picture marking and reference picture list construction is included below.
- Each CVS may include a number of reference pictures, which may be used to predict pixel values of other pictures (e.g., pictures that come before or after the reference picture). A video coder marks each reference picture, and stores the reference picture in the DPB. In previous video coding standards, such as H.264/AVC, the DPB includes a maximum number, referred to as M (num_ref_frames), of reference pictures used for inter-prediction in the active sequence parameter set. When a reference picture is decoded, the reference picture is marked as “used for reference.” If the decoding of the reference picture caused more than M pictures to be marked as “used for reference,” at least one picture must be marked as “unused for reference.” The DPB removal process then would remove pictures marked as “unused for reference” from the DPB if they are not needed for output as well.
- When a picture is decoded, the decoded picture may be either a non-reference picture or a reference picture. A reference picture may be a long-term reference picture or short-term reference picture, and when the decoded picture is marked as “unused for reference”, the decoded picture may become no longer needed for reference. In some video coding standards, there may be reference picture marking operations that change the status of the reference pictures.
- There may be at least two operation modes for reference picture marking: a sliding window operation mode and an adaptive memory control operation mode. The operation mode for reference picture marking may be selected on a picture basis. The sliding window operation mode may work as a first-in, first-out queue holding a fixed number of short-term reference pictures. In other words, the short-term reference pictures with the earliest decoding times may be the first to be removed (marked as pictures not used for reference), in an implicit fashion.
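The sliding window operation mode described above can be sketched as a first-in, first-out queue. The value of M and the picture identifiers are illustrative; real marking also interacts with long-term references and adaptive memory control, which are omitted here.

```python
from collections import deque

M = 4                    # e.g., num_ref_frames from the active SPS (example value)
short_term_refs = deque()
unused_for_reference = []

def mark_reference(pic_id):
    short_term_refs.append(pic_id)  # decoded picture marked "used for reference"
    if len(short_term_refs) > M:
        # Implicitly mark the short-term reference with the earliest
        # decoding time as "unused for reference" (first-in, first-out).
        unused_for_reference.append(short_term_refs.popleft())

for pic in range(6):
    mark_reference(pic)

# short_term_refs now holds the M most recently decoded references;
# pictures 0 and 1 were implicitly marked "unused for reference".
```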
- The video coder may also be tasked with constructing reference picture lists that indicate which reference pictures may be used for inter-prediction purposes. Two of these reference picture lists are referred to as
List 0 and List 1, respectively. The video coder firstly employs default construction techniques to construct List 0 and List 1 (e.g., preconfigured construction schemes for constructing List 0 and List 1). Optionally, after the initial List 0 and List 1 are constructed, the video decoder may decode syntax elements, when present, that instruct the video decoder to modify the initial List 0 and List 1. - The video encoder may signal syntax elements that are indicative of identifier(s) of reference pictures in the DPB, and the video encoder may also signal syntax elements that include indices, within
List 0, List 1, or both List 0 and List 1, that indicate which reference picture or pictures to use to decode a coded block of a current picture. The video decoder, in turn, uses the received identifier to identify the index value or values for a reference picture or reference pictures listed in List 0, List 1, or both List 0 and List 1. From the index value(s) as well as the identifier(s) of the reference picture or reference pictures, the video decoder retrieves the reference picture or reference pictures, or part(s) thereof, from the DPB, and decodes the coded block of the current picture based on the retrieved reference picture or pictures and one or more motion vectors that identify blocks within the reference picture or pictures that are used for decoding the coded block. - In the context of AVC and HEVC, a coded video sequence (CVS) refers to a sequence of coded frames, or “pictures,” ranging from an instantaneous decoding refresh (IDR) picture to another IDR picture, exclusive, in a decoding order, or to an end of a coded video bitstream if the starting IDR picture is the last IDR picture in the coded video bitstream.
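The list construction and index-based retrieval described above can be sketched as follows. The default ordering rule used here (past pictures nearest-first for List 0, future pictures nearest-first for List 1) is a simplification of the actual default construction processes, and the picture identifiers are hypothetical.

```python
# DPB contents: picture id -> decoded picture (placeholder strings here).
decoded_pictures = {10: "pic_A", 20: "pic_B", 30: "pic_C"}

def build_default_lists(dpb_ids, current_id):
    # Simplified default construction: List 0 favors past pictures,
    # List 1 favors future pictures, nearest to the current picture first.
    past = sorted((i for i in dpb_ids if i < current_id), reverse=True)
    future = sorted(i for i in dpb_ids if i > current_id)
    return past + future, future + past   # initial List 0, initial List 1

list0, list1 = build_default_lists(decoded_pictures.keys(), current_id=25)

# The encoder signals a reference index into List 0 (and/or List 1); the
# decoder uses it to retrieve the reference picture from the DPB.
ref_idx_l0 = 0
reference = decoded_pictures[list0[ref_idx_l0]]
```

Optional list-modification syntax, when present, would then reorder these initial lists before any indices are applied.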
- However, when coding a single CVS comprising pictures having at least two different spatial resolutions, with respect to some solutions based on HEVC, e.g., as described in JCTVC-F158, using a DPB having a size measured in pictures may cause a number of issues, which are described below.
- First, a sub-sequence of pictures with one resolution may have different coding parameters, such as an LCU size, than another sub-sequence of pictures with another, different resolution. Accordingly, it may not be sufficient to use a single active SPS to describe characteristics of a CVS comprising the sub-sequences of pictures with the different resolutions.
- Furthermore, different sub-sequences of a CVS may have reference pictures having different sizes, that is, different spatial resolutions. Accordingly, one set of particular parameters included in an SPS for the CVS, e.g., max_num_ref_frames, may be optimal for one sub-sequence, but sub-optimal for other sub-sequences included in the CVS.
- Additionally, some techniques for DPB management may no longer be effective when coding a single CVS that includes pictures having different resolutions. As one example, because the pictures having the different resolutions may correspond to the pictures having different sizes, a size of a DPB used to store the pictures can no longer be indicated using a number of frame buffers, e.g., a number of storage locations each capable of storing a frame, or “picture,” of a fixed size.
- Furthermore, to insert a decoded picture into the DPB, the DPB must include an empty frame buffer of a size that is sufficiently large to store the decoded picture. However, once again, because the pictures having the different resolutions may correspond to the pictures having different sizes, a frame buffer of a fixed size may not correspond to a size of a particular decoded picture to be inserted. Accordingly, merely determining whether the DPB includes an empty frame buffer of a fixed size may be insufficient to determine whether the DPB is available to store the decoded picture. As one example, the DPB may have less buffer space than is required to store the decoded picture.
- Similarly, after removing a decoded picture from the DPB, wherein the removed decoded picture has a resolution that corresponds to a size that is different than the size of the frame buffer, merely determining that the decoded picture has been removed from the DPB may be insufficient to determine whether the DPB is actually available to store a subsequent decoded picture having a particular resolution. Furthermore, the above determination is also insufficient to indicate the actual buffer space that may be available within the DPB for storing additional decoded pictures.
- In another example, a single empty frame buffer of a fixed size may exist within the DPB, and the DPB may store decoded picture(s) having a particular resolution in the frame buffer. However, if a video coder removes a decoded picture from the DPB, and the removed picture has a resolution that is smaller than the size of the frame buffer, sufficient buffer space may exist within the DPB to insert a decoded picture with a resolution that corresponds to a size that is larger than the size of the removed decoded picture. Accordingly, merely determining that a particular decoded picture has been removed from the DPB may be insufficient to indicate the actual buffer space that may be available within the DPB for storing additional decoded pictures.
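A small numeric sketch of the example above follows. The capacity and resolutions are arbitrary illustrative values; the point is that removing a small decoded picture can free enough space for a larger one, which a count of fixed-size frame buffers cannot express.

```python
capacity = 3_104_000                 # total DPB size, in samples (illustrative)
stored = [(1920, 1080), (640, 360)]  # (width, height) of buffered pictures

def free_samples():
    return capacity - sum(w * h for (w, h) in stored)

new_picture = 1280 * 720             # 921,600 samples to be inserted

fits_before = free_samples() >= new_picture   # DPB too full at first
stored.remove((640, 360))            # remove a smaller decoded picture
fits_after = free_samples() >= new_picture    # freed 640*360 = 230,400 samples
```

Under frame-buffer counting, removing one picture frees "one buffer" regardless of its size; under size-based accounting, the DPB's fullness decreases by exactly the removed picture's resolution-dependent amount.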
-
FIG. 1 is a block diagram illustrating an example video encoding and decoding system 10 that may utilize techniques described in this disclosure. In general, a reference picture set is defined as a set of reference pictures associated with a picture, consisting of all reference pictures that are prior to the associated picture in decoding order, that may be used for inter prediction of the associated picture or any picture following the associated picture in decoding order. In some examples, the reference pictures that are prior to the associated picture may be reference pictures until the next instantaneous decoding refresh (IDR) picture, or broken link access (BLA) picture. In other words, reference pictures in the reference picture set may all be prior to the current picture in decoding order. Also, the reference pictures in the reference picture set may be used for inter-predicting the current picture and/or inter-predicting any picture following the current picture in decoding order until the next IDR picture or BLA picture. -
- As used in this disclosure, reference pictures that can potentially be used for inter-prediction refer to reference pictures that can be used for inter-prediction, but do not necessarily have to be used for inter-prediction. For example, the reference picture set may identify reference pictures that can potentially be used for inter-prediction. However, this does not mean that all of the identified reference pictures must be used for inter-prediction. Rather, one or more of these identified reference pictures could be used for inter-prediction, but all do not necessarily have to be used for inter-prediction.
- As shown in
FIG. 1 ,system 10 includes asource device 12 that generates encoded video for decoding bydestination device 14.Source device 12 anddestination device 14 may each be an example of a video coding device.Source device 12 may transmit the encoded video todestination device 14 viacommunication channel 16 or may store the encoded video on astorage medium 17 or afile server 19, such that the encoded video may be accessed by thedestination device 14 as desired. -
Source device 12 and destination device 14 may comprise any of a wide range of devices, including wireless handsets such as so-called “smart” phones, so-called “smart” pads, or other such wireless devices equipped for wireless communication. Additional examples of source device 12 and destination device 14 include, but are not limited to, a digital television, a device in a digital direct broadcast system, a device in a wireless broadcast system, a personal digital assistant (PDA), a laptop computer, a desktop computer, a tablet computer, an e-book reader, a digital camera, a digital recording device, a digital media player, a video gaming device, a video game console, a cellular radio telephone, a satellite radio telephone, a video teleconferencing device, a video streaming device, a wireless communication device, or the like. - As indicated above, in many cases,
source device 12 and/or destination device 14 may be equipped for wireless communication. Hence, communication channel 16 may comprise a wireless channel, a wired channel, or a combination of wireless and wired channels suitable for transmission of encoded video data. Similarly, the file server 19 may be accessed by the destination device 14 through any standard data connection, including an Internet connection. This may include a wireless channel (e.g., a Wi-Fi connection), a wired connection (e.g., DSL, cable modem, etc.), or a combination of both that is suitable for accessing encoded video data stored on a file server. - The techniques of this disclosure, however, may be applied to video coding in support of any of a variety of multimedia applications, such as over-the-air television broadcasts, cable television transmissions, satellite television transmissions, streaming video transmissions, e.g., via the Internet, encoding of digital video for storage on a data storage medium, decoding of digital video stored on a data storage medium, or other applications. In some examples,
system 10 may be configured to support one-way or two-way video transmission to support applications such as video streaming, video playback, video broadcasting, and/or video telephony. - In the example of
FIG. 1 , source device 12 includes a video source 18, video encoder 20, a modulator/demodulator (MODEM) 22 and an output interface 24. In source device 12, video source 18 may include a source such as a video capture device, such as a video camera, a video archive containing previously captured video, a video feed interface to receive video from a video content provider, and/or a computer graphics system for generating computer graphics data as the source video, or a combination of such sources. As one example, if video source 18 is a video camera, source device 12 and destination device 14 may form so-called camera phones or video phones. However, the techniques described in this disclosure may be applicable to video coding in general, and may be applied to wireless and/or wired applications. - The captured, pre-captured, or computer-generated video may be encoded by
video encoder 20. The encoded video information may be modulated by modem 22 according to a communication standard, such as a wireless communication protocol, and transmitted to destination device 14 via output interface 24. Modem 22 may include various mixers, filters, amplifiers or other components designed for signal modulation. Output interface 24 may include circuits designed for transmitting data, including amplifiers, filters, and one or more antennas. - The captured, pre-captured, or computer-generated video that is encoded by the
video encoder 20 may also be stored onto a storage medium 17 or a file server 19 for later consumption. The storage medium 17 may include Blu-ray discs, DVDs, CD-ROMs, flash memory, or any other suitable digital storage media for storing encoded video. The encoded video stored on the storage medium 17 may then be accessed by destination device 14 for decoding and playback. -
File server 19 may be any type of server capable of storing encoded video and transmitting that encoded video to the destination device 14. Example file servers include a web server (e.g., for a website), an FTP server, network attached storage (NAS) devices, a local disk drive, or any other type of device capable of storing encoded video data and transmitting it to a destination device. The transmission of encoded video data from the file server 19 may be a streaming transmission, a download transmission, or a combination of both. The file server 19 may be accessed by the destination device 14 through any standard data connection, including an Internet connection. This may include a wireless channel (e.g., a Wi-Fi connection), a wired connection (e.g., DSL, cable modem, Ethernet, USB, etc.), or a combination of both that is suitable for accessing encoded video data stored on a file server. -
Destination device 14, in the example of FIG. 1 , includes an input interface 26, a modem 28, a video decoder 30, and a display device 32. Input interface 26 of destination device 14 receives information over channel 16, as one example, or from storage medium 17 or file server 19, as alternate examples, and modem 28 demodulates the information to produce a demodulated bitstream for video decoder 30. The demodulated bitstream may include a variety of syntax information generated by video encoder 20 for use by video decoder 30 in decoding video data. Such syntax may also be included with the encoded video data stored on a storage medium 17 or a file server 19. As one example, the syntax may be embedded with the encoded video data, although aspects of this disclosure should not be considered limited to such a requirement. The syntax information defined by video encoder 20, which is also used by video decoder 30, may include syntax elements that describe characteristics and/or processing of video blocks, such as coding tree units (CTUs), coding tree blocks (CTBs), prediction units (PUs), coding units (CUs) or other units of coded video, e.g., video slices, video pictures, and video sequences or groups of pictures (GOPs). Each of video encoder 20 and video decoder 30 may form part of a respective encoder-decoder (CODEC) that is capable of encoding or decoding video data. -
Display device 32 may be integrated with, or external to, destination device 14. In some examples, destination device 14 may include an integrated display device and also be configured to interface with an external display device. In other examples, destination device 14 may be a display device. In general, display device 32 displays the decoded video data to a user, and may comprise any of a variety of display devices such as a liquid crystal display (LCD), a plasma display, an organic light emitting diode (OLED) display, or another type of display device. - In the example of
FIG. 1 , communication channel 16 may comprise any wireless or wired communication medium, such as a radio frequency (RF) spectrum or one or more physical transmission lines, or any combination of wireless and wired media. Communication channel 16 may form part of a packet-based network, such as a local area network, a wide-area network, or a global network such as the Internet. Communication channel 16 generally represents any suitable communication medium, or collection of different communication media, for transmitting video data from source device 12 to destination device 14, including any suitable combination of wired or wireless media. Communication channel 16 may include routers, switches, base stations, or any other equipment that may be useful to facilitate communication from source device 12 to destination device 14. -
Video encoder 20 and video decoder 30 may operate according to a video compression standard, such as ITU-T H.261, ISO/IEC MPEG-1 Visual, ITU-T H.262 or ISO/IEC MPEG-2 Visual, ITU-T H.263, ISO/IEC MPEG-4 Visual, or ITU-T H.264 (also known as ISO/IEC MPEG-4 AVC), including its Scalable Video Coding (SVC) and Multiview Video Coding (MVC) extensions. In addition, there is a new video coding standard, namely the High Efficiency Video Coding (HEVC) standard, presently under development by the Joint Collaborative Team on Video Coding (JCT-VC) of the ITU-T Video Coding Experts Group (VCEG) and ISO/IEC Moving Picture Experts Group (MPEG). A recent Working Draft (WD) of HEVC, referred to as HEVC WD8 hereinafter, is available, as of Jul. 20, 2012, from http://phenix.int-evry.fr/jct/doc_end_user/documents/10_Stockholm/wg11/JCTVC-J1003-v8.zip. - The techniques of this disclosure, however, are not limited to any particular coding standard. For purposes of illustration only, the techniques are described in accordance with the HEVC standard.
- Although not shown in
FIG. 1, in some aspects, video encoder 20 and video decoder 30 may each be integrated with an audio encoder and decoder, and may include appropriate MUX-DEMUX units, or other hardware and software, to handle encoding of both audio and video in a common data stream or separate data streams. If applicable, MUX-DEMUX units may conform to the ITU H.223 multiplexer protocol, or other protocols such as the user datagram protocol (UDP). -
Video encoder 20 and video decoder 30 each may be implemented as any of a variety of suitable encoder circuitry, such as one or more processors including microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), discrete logic, software, hardware, firmware, or any combinations thereof. When the techniques are implemented partially in software, a device may store instructions for the software in a suitable, non-transitory computer-readable medium and execute the instructions in hardware using one or more processors to perform the techniques of this disclosure. - Each of
video encoder 20 and video decoder 30 may be included in one or more encoders or decoders, either of which may be integrated as part of a combined encoder/decoder (CODEC) in a respective device. In some instances, video encoder 20 and video decoder 30 may be commonly referred to as a video coder that codes information (e.g., pictures and syntax elements). The coding of information may refer to encoding when the video coder corresponds to video encoder 20. The coding of information may refer to decoding when the video coder corresponds to video decoder 30. -
FIG. 2 is a block diagram illustrating an example video encoder 20 that may implement the techniques described in this disclosure. Video encoder 20 may perform intra- and inter-coding of video blocks within video slices. Intra-coding relies on spatial prediction to reduce or remove spatial redundancy in video within a given video frame or picture. Inter-coding relies on temporal prediction to reduce or remove temporal redundancy in video within adjacent frames or pictures of a video sequence. Intra-mode (I mode) may refer to any of several spatial-based compression modes. Inter-modes, such as uni-directional prediction (P mode) or bi-prediction (B mode), may refer to any of several temporal-based compression modes. - In the example of
FIG. 2, video encoder 20 includes a partitioning unit 35, prediction processing unit 41, summer 50, transform processing unit 52, quantization unit 54, entropy encoding unit 56, decoded picture buffer (DPB) 64, and DPB management unit 65. Prediction processing unit 41 includes motion estimation unit 42, motion compensation unit 44, and intra prediction unit 46. For video block reconstruction, video encoder 20 also includes inverse quantization unit 58, inverse transform unit 60, and summer 62. A deblocking filter (not shown in FIG. 2) may also be included to filter block boundaries to remove blockiness artifacts from reconstructed video. If desired, the deblocking filter would typically filter the output of summer 62. Additional loop filters (in loop or post loop) may also be used in addition to the deblocking filter. - As shown in
FIG. 2, video encoder 20 receives video data, and partitioning unit 35 partitions the data into video blocks. This partitioning may also include partitioning into slices, tiles, or other larger units, as well as video block partitioning, e.g., according to a quadtree structure of LCUs and CUs. Video encoder 20 generally illustrates the components that encode video blocks within a video slice to be encoded. The slice may be divided into multiple video blocks (and possibly into sets of video blocks referred to as tiles). Prediction processing unit 41 may select one of a plurality of possible coding modes, such as one of a plurality of intra coding modes or one of a plurality of inter coding modes, for the current video block based on error results (e.g., coding rate and the level of distortion). Prediction processing unit 41 may provide the resulting intra- or inter-coded block to summer 50 to generate residual block data and to summer 62 to reconstruct the encoded block for use as a reference picture. -
Intra prediction unit 46 within prediction processing unit 41 may perform intra-predictive coding of the current video block relative to one or more neighboring blocks in the same picture or slice as the current block to be coded to provide spatial compression. Motion estimation unit 42 and motion compensation unit 44 within prediction processing unit 41 perform inter-predictive coding of the current video block relative to one or more predictive blocks in one or more reference pictures to provide temporal compression. - Motion estimation unit 42 may be configured to determine the inter-prediction mode for a video slice according to a predetermined pattern for a video sequence. The predetermined pattern may designate video slices in the sequence as P slices or B slices. Motion estimation unit 42 and
motion compensation unit 44 may be highly integrated, but are illustrated separately for conceptual purposes. Motion estimation, performed by motion estimation unit 42, is the process of generating motion vectors, which estimate motion for video blocks. A motion vector, for example, may indicate the displacement of a PU of a video block within a current video picture relative to a predictive block within a reference picture. - A predictive block is a block that is found to closely match the PU of the video block to be coded in terms of pixel difference, which may be determined by sum of absolute difference (SAD), sum of square difference (SSD), or other difference metrics. In some examples,
video encoder 20 may calculate values for sub-integer pixel positions of reference pictures stored in decoded picture buffer 64. For example, video encoder 20 may interpolate values of one-quarter pixel positions, one-eighth pixel positions, or other fractional pixel positions of the reference picture. Therefore, motion estimation unit 42 may perform a motion search relative to the full pixel positions and fractional pixel positions and output a motion vector with fractional pixel precision. - Motion estimation unit 42 calculates a motion vector for a PU of a video block in an inter-coded slice by comparing the position of the PU to the position of a predictive block of a reference picture. The reference picture may be selected from a first reference picture list (List 0) or a second reference picture list (List 1), each of which identifies one or more reference pictures stored in decoded
picture buffer 64. Motion estimation unit 42 sends the calculated motion vector to entropy encoding unit 56 and motion compensation unit 44. - Motion compensation, performed by
motion compensation unit 44, may involve fetching or generating the predictive block based on the motion vector determined by motion estimation, possibly performing interpolations to sub-pixel precision. Upon receiving the motion vector for the PU of the current video block, motion compensation unit 44 may locate the predictive block to which the motion vector points in one of the reference picture lists. Video encoder 20 forms a residual video block by subtracting pixel values of the predictive block from the pixel values of the current video block being coded, forming pixel difference values. The pixel difference values form residual data for the block, and may include both luma and chroma difference components. Summer 50 represents the component or components that perform this subtraction operation. Motion compensation unit 44 may also generate syntax elements associated with the video blocks and the video slice for use by video decoder 30 in decoding the video blocks of the video slice. -
Intra-prediction unit 46 may intra-predict a current block, as an alternative to the inter-prediction performed by motion estimation unit 42 and motion compensation unit 44, as described above. In particular, intra-prediction unit 46 may determine an intra-prediction mode to use to encode a current block. In some examples, intra-prediction unit 46 may encode a current block using various intra-prediction modes, e.g., during separate encoding passes, and intra-prediction unit 46 (or mode select unit 40, in some examples) may select an appropriate intra-prediction mode to use from the tested modes. For example, intra-prediction unit 46 may calculate rate-distortion values using a rate-distortion analysis for the various tested intra-prediction modes, and select the intra-prediction mode having the best rate-distortion characteristics among the tested modes. Rate-distortion analysis generally determines an amount of distortion (or error) between an encoded block and an original, unencoded block that was encoded to produce the encoded block, as well as a bit rate (that is, a number of bits) used to produce the encoded block. Intra-prediction unit 46 may calculate ratios from the distortions and rates for the various encoded blocks to determine which intra-prediction mode exhibits the best rate-distortion value for the block. - After selecting an intra-prediction mode for a block,
intra-prediction unit 46 may provide information indicative of the selected intra-prediction mode for the block to entropy encoding unit 56. Entropy encoding unit 56 may encode the information indicating the selected intra-prediction mode in accordance with the techniques of this disclosure. Video encoder 20 may include in the transmitted bitstream configuration data, which may include a plurality of intra-prediction mode index tables and a plurality of modified intra-prediction mode index tables (also referred to as codeword mapping tables), definitions of encoding contexts for various blocks, and indications of a most probable intra-prediction mode, an intra-prediction mode index table, and a modified intra-prediction mode index table to use for each of the contexts. - After
prediction processing unit 41 generates the predictive block for the current video block via either inter-prediction or intra-prediction, video encoder 20 forms a residual video block by subtracting the predictive block from the current video block. The residual video data in the residual block may be included in one or more TUs and applied to transform processing unit 52. Transform processing unit 52 transforms the residual video data into residual transform coefficients using a transform, such as a discrete cosine transform (DCT) or a conceptually similar transform. Transform processing unit 52 may convert the residual video data from a pixel domain to a transform domain, such as a frequency domain. - Transform processing
unit 52 may send the resulting transform coefficients to quantization unit 54. Quantization unit 54 quantizes the transform coefficients to further reduce bit rate. The quantization process may reduce the bit depth associated with some or all of the coefficients. The degree of quantization may be modified by adjusting a quantization parameter. In some examples, quantization unit 54 may then perform a scan of the matrix including the quantized transform coefficients. Alternatively, entropy encoding unit 56 may perform the scan. - Following quantization,
entropy encoding unit 56 entropy encodes the quantized transform coefficients. For example, entropy encoding unit 56 may perform context adaptive variable length coding (CAVLC), context adaptive binary arithmetic coding (CABAC), syntax-based context-adaptive binary arithmetic coding (SBAC), probability interval partitioning entropy (PIPE) coding, or another entropy encoding methodology or technique. Following the entropy encoding by entropy encoding unit 56, the encoded bitstream may be transmitted to video decoder 30, or archived for later transmission or retrieval by video decoder 30. Entropy encoding unit 56 may also entropy encode the motion vectors and the other syntax elements for the current video slice being coded. -
Inverse quantization unit 58 and inverse transform unit 60 apply inverse quantization and inverse transformation, respectively, to reconstruct the residual block in the pixel domain for later use as a reference block of a reference picture. Motion compensation unit 44 may calculate a reference block by adding the residual block to a predictive block of one of the reference pictures within one of the reference picture lists. Motion compensation unit 44 may also apply one or more interpolation filters to the reconstructed residual block to calculate sub-integer pixel values for use in motion estimation. Summer 62 adds the reconstructed residual block to the motion compensated prediction block produced by motion compensation unit 44 to produce a reference block for storage in decoded picture buffer 64. The reference block may be used by motion estimation unit 42 and motion compensation unit 44 as a reference block to inter-predict a block in a subsequent video frame or picture. - In accordance with this disclosure,
prediction processing unit 41 represents one example unit for performing the example functions described above. For example, prediction processing unit 41 may encode syntax elements that support the use of adaptive resolution CVSs. Prediction processing unit 41 may also generate SPSs that may be activated by one or more resolution sub-sequences, and transmit the SPSs and RSSs to a video decoder. Each of the SPSs may include resolution information for one or more sequences of pictures. Prediction processing unit 41 may also receive and order one or more SPSs and cause video encoder 20 to code information indicative of the reference pictures that belong to the reference picture set. In addition, DPB management unit 65 may also perform techniques related to the management of DPB 64. - Also, during the reconstruction process (e.g., the process used to reconstruct a picture for use as a reference picture and storage in DPB 64),
prediction processing unit 41 may construct the plurality of reference picture subsets that each identifies one or more of the reference pictures. Prediction processing unit 41 may also derive the reference picture set from the constructed plurality of reference picture subsets. Also, prediction processing unit 41 and DPB management unit 65 may implement any one or more of the sets of example pseudo code described below to implement one or more example techniques described in this disclosure. - In accordance with the techniques of this disclosure,
prediction processing unit 41 may generate a coded video sequence comprising a first sub-sequence and a second sub-sequence, wherein the first sub-sequence includes one or more frames each having a first resolution. The second sub-sequence may include one or more frames each having a second resolution. The first sub-sequence may be different than the second sub-sequence, and the first resolution may be different than the second resolution. Prediction processing unit 41 may further generate a first sequence parameter set and a second sequence parameter set for the video sequence. The first sequence parameter set may indicate the first resolution of the one or more frames of the first sub-sequence, and the second sequence parameter set may indicate the second resolution of the one or more frames of the second sub-sequence. Also, the first sequence parameter set may be different than the second sequence parameter set. Prediction processing unit 41 may transmit the coded video sequence comprising the first sub-sequence and the second sub-sequence, and the first sequence parameter set and the second sequence parameter set. In some examples, the resolution may comprise a spatial resolution. -
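The relationship described above, between sub-sequences, their frames, and the sequence parameter set each sub-sequence activates, can be sketched as a simple data model. The class and field names below are illustrative only, not part of any standard or of the claimed syntax:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SequenceParameterSet:
    sps_id: int
    width: int    # spatial resolution of the sub-sequence, in luma samples
    height: int

@dataclass
class SubSequence:
    sps_id: int                # the SPS this sub-sequence activates
    frames: List[str] = field(default_factory=list)

# Two distinct SPSs, one per resolution sub-sequence of the coded video sequence.
sps_list = [SequenceParameterSet(sps_id=0, width=1920, height=1080),
            SequenceParameterSet(sps_id=1, width=1280, height=720)]

# The coded video sequence may interleave frames of the two sub-sequences;
# each frame inherits its resolution from the SPS its sub-sequence refers to.
cvs = [SubSequence(sps_id=0, frames=["f0", "f2"]),
       SubSequence(sps_id=1, frames=["f1", "f3"])]
```

In this sketch, both parameter sets exist before any frame of either sub-sequence, which mirrors the option of transmitting both SPSs ahead of the coded frames.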
Prediction processing unit 41 may also alter the coding of the sequence parameter sets. For example, prediction processing unit 41 may code the first sequence parameter set and the second sequence parameter set in a transmitted bitstream prior to either the first sub-sequence or the second sub-sequence. Prediction processing unit 41 may also interleave in the coded video sequence the one or more frames of the first sub-sequence and the one or more frames of the second sub-sequence. - In some examples, to transmit the first sequence parameter set and the second sequence parameter set of the coded video sequence,
prediction processing unit 41 may be configured to transmit both the first sequence parameter set and the second sequence parameter set prior to transmitting either of the first sub-sequence and the second sub-sequence. In another example, to transmit the first sequence parameter set and the second sequence parameter set of the coded video sequence, prediction processing unit 41 may be configured to transmit the second sequence parameter set after transmitting at least one frame of the one or more frames of the first sub-sequence, and prior to transmitting the second sub-sequence. - In some examples,
prediction processing unit 41 may code the first sequence parameter set in a transmitted bitstream prior to coding the first sub-sequence, and prediction processing unit 41 may also code the second sequence parameter set in the transmitted bitstream after at least one frame of the one or more frames of the first sub-sequence and prior to the second sub-sequence. -
Decoded picture buffer 64, decoded picture buffer management unit 65, and video encoder 20 may also perform the techniques of this disclosure. In some examples, decoded picture buffer 64 may receive a first decoded frame of video data, wherein the first decoded frame is associated with a first resolution. DPB management unit 65 may determine whether decoded picture buffer 64 is available to store the first decoded frame based on the first resolution and, in the event decoded picture buffer 64 is available to store the first decoded frame, store the first decoded frame in decoded picture buffer 64. DPB management unit 65 may further determine whether decoded picture buffer 64 is available to store a second decoded frame of video data, wherein the second decoded frame is associated with a second resolution, based on the first resolution and the second resolution, wherein the first decoded frame is different than the second decoded frame. - In some additional examples,
DPB management unit 65 may determine an amount of information that may be stored within decoded picture buffer 64, determine an amount of information associated with the first decoded frame based on the first resolution, and compare the amount of information that may be stored within decoded picture buffer 64 and the amount of information associated with the first decoded frame. - In one example, to determine whether decoded
picture buffer 64 is available to store the second decoded frame based on the first resolution and the second resolution, DPB management unit 65 may be configured to determine an amount of information that may be stored within decoded picture buffer 64 based on the first resolution, determine an amount of information associated with the second decoded frame based on the second resolution, and compare the amount of information that may be stored within decoded picture buffer 64 and the amount of information associated with the second decoded frame. DPB management unit 65 may also be configured to remove the first decoded frame from decoded picture buffer 64, and in some examples, the resolution may comprise a spatial resolution. - The techniques described in this disclosure may refer to
video encoder 20 signaling information. When video encoder 20 signals information, the techniques of this disclosure generally refer to any manner in which video encoder 20 provides the information in a coded bitstream. For example, when video encoder 20 signals syntax elements to video decoder 30, it may mean that video encoder 20 transmitted the syntax elements to video decoder 30 as part of a coded bitstream via output interface 24 and communication channel 16, or that video encoder 20 stored the syntax elements in a coded bitstream on storage medium 17 and/or file server 19 for eventual reception by video decoder 30. In this way, signaling from video encoder 20 to video decoder 30 should not be interpreted as requiring transmission directly from video encoder 20 to video decoder 30, although this may be one possibility for real-time video applications. In other examples, however, signaling from video encoder 20 to video decoder 30 should be interpreted as any technique with which video encoder 20 provides information in a bitstream for eventual reception by video decoder 30, either directly or via intermediate storage (e.g., in storage medium 17 and/or file server 19). -
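The resolution-aware DPB availability determinations described above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the class and method names are invented, and the capacity accounting in 8×8 blocks follows the proposal elsewhere in this disclosure to express the required DPB size in units of 8×8 blocks:

```python
def blocks_8x8(width, height):
    # Amount of information a decoded frame occupies, in 8x8 luma blocks.
    return ((width + 7) // 8) * ((height + 7) // 8)

class DecodedPictureBuffer:
    def __init__(self, capacity_blocks):
        # Capacity in 8x8 blocks, e.g., derived from max_dec_pic_buffering.
        self.capacity = capacity_blocks
        self.used = 0
        self.frames = []

    def can_store(self, width, height):
        # Availability depends on the frame's resolution,
        # not on a fixed frame count.
        return self.used + blocks_8x8(width, height) <= self.capacity

    def store(self, frame, width, height):
        if not self.can_store(width, height):
            raise MemoryError("DPB cannot hold a frame of this resolution")
        self.frames.append((frame, width, height))
        self.used += blocks_8x8(width, height)

    def remove(self, frame):
        # Removing a frame frees capacity proportional to its resolution.
        for entry in self.frames:
            if entry[0] == frame:
                self.frames.remove(entry)
                self.used -= blocks_8x8(entry[1], entry[2])
                return
```

With this accounting, a buffer that already holds a 1920×1080 frame may have room for a small frame of a lower-resolution sub-sequence but not for a second large one, which is the comparison DPB management unit 65 is described as performing.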
Video encoder 20 and video decoder 30 may be configured to implement the example techniques described in this disclosure for coding, transmitting, receiving, and activating SPSs and RSSs, as well as for managing the DPB. For example, video encoder 20 may invoke the techniques to support adaptive resolution CVSs and to add and remove reference pictures from the DPB. Video decoder 30 may invoke the process in a similar manner. - To support SPSs in a single adaptive-resolution CVS,
prediction processing unit 41 may utilize RSSs. Each RSS may indicate information, such as a resolution of a series of coded video pictures of a CVS. Prediction processing unit 41 may use one resolution sub-sequence (RSS) at a given time. Each RSS may reference a single SPS. As an example, if there are “n” RSSs in a given CVS, there may be, altogether, “n” active SPSs when decoding the CVS. However, in some examples, multiple RSSs may refer to a single SPS in a CVS. The SPS or PPS may indicate the different resolution of each RSS. The SPS or PPS may include a resolution ID as well as a syntax element that indicates the resolution associated with each resolution ID. - In accordance with the techniques of this disclosure, a computer-readable storage medium may include a data structure that represents CVSs, SPSs, and RSSs. In particular, the data structure may include a coded video sequence comprising a first sub-sequence and a second sub-sequence. The first sub-sequence may include one or more frames each having a first resolution, and the second sub-sequence may include one or more frames each having a second resolution. The first sub-sequence may also be different than the second sub-sequence, and the first resolution may be different than the second resolution. The data structure may further comprise a first sequence parameter set and a second sequence parameter set for the coded video sequence. The first sequence parameter set may indicate the first resolution of the one or more frames of the first sub-sequence, the second sequence parameter set may indicate the second resolution of the one or more frames of the second sub-sequence, and the first sequence parameter set may be different than the second sequence parameter set.
-
Prediction processing unit 41 of video encoder 20 may order or restrict each of the RSSs according to spatial resolution characteristics of each RSS. In general, prediction processing unit 41 may order the SPSs based on their horizontal resolutions. As an example, if a horizontal size of a resolution “A” of an SPS is greater than that of a resolution “B” of an SPS, a vertical size of the resolution “A” may not be less than that of the resolution “B.” With this restriction, a resolution “C” of an SPS may be considered to be larger than a resolution “D” of an SPS as long as one of a horizontal size and a vertical size of the resolution “C” is greater than a corresponding size of the resolution “D.” Video encoder 20 may assign an RSS with a largest spatial resolution a resolution ID equal to “0,” an RSS with a second largest spatial resolution a resolution ID equal to “1,” and so forth. - In some examples,
prediction processing unit 41 may not signal a resolution ID. Rather, video encoder 20 may derive the resolution ID according to the spatial resolutions of the RSSs. Prediction processing unit 41 may still order each of the RSSs in each CVS according to the spatial resolutions of each RSS, as described above. The RSS with the largest spatial resolution is assigned a resolution ID equal to 0, the RSS with the second largest spatial resolution is assigned a resolution ID equal to 1, and so on. - For any RSS with a resolution ID equal to “rId,” during inter-prediction,
prediction processing unit 41 may refer to decoded pictures only within the same RSS, within an RSS with a resolution ID equal to “rId−1,” or within an RSS with a resolution ID equal to “rId+1.” Prediction processing unit 41 may not refer to decoded pictures within other RSSs when performing inter-prediction. - In some examples, there may be additional restrictions on inter-prediction amongst RSSs. In one instance,
prediction processing unit 41 may only perform inter-prediction of blocks from two adjacent RSSs, i.e., the RSS with the immediately larger spatial resolution and the RSS with the immediately smaller spatial resolution. In another example, prediction processing unit 41 may not be limited to performing inter-prediction using spatially-neighboring RSSs, and prediction processing unit 41 may perform inter-prediction using any RSS, not just spatially neighboring RSSs (e.g., RSSs with rId+1 or rId−1). - The techniques of this disclosure may also include processes and techniques for transmitting and activating picture parameter sets (PPSs). The use of PPSs may decouple the transmission of infrequently changing information from the transmission of coded block data for the CVSs.
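The RSS ordering rule and the adjacent-RSS reference restriction described above can be sketched together. Because the ordering restriction guarantees that a larger horizontal size never comes with a smaller vertical size, sorting by width (then height) in descending order yields the required ID assignment; the function names are illustrative only:

```python
def assign_resolution_ids(resolutions):
    # resolutions: one (width, height) pair per RSS in the CVS.
    # The largest spatial resolution receives resolution ID 0,
    # the second largest receives ID 1, and so on.
    ordered = sorted(resolutions, key=lambda wh: (wh[0], wh[1]), reverse=True)
    return {res: rid for rid, res in enumerate(ordered)}

def may_reference(current_rid, ref_rid):
    # Under the adjacency restriction, inter-prediction may use decoded
    # pictures only from the same RSS or from an RSS whose resolution ID
    # is rId - 1 or rId + 1.
    return abs(current_rid - ref_rid) <= 1
```

The second, looser example in the text (prediction from any RSS) simply drops the `may_reference` check.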
Video encoder 20 and video decoder 30 may, in some applications, convey or signal the SPSs and PPSs “out-of-band,” or using a different communication channel than that used to communicate the coded block data of the CVSs, e.g., using a reliable transport mechanism. - A PPS raw byte sequence payload (RBSP) may include parameters to which coded slice network abstraction layer (NAL) units of one or more coded pictures may refer. Each PPS RBSP is initially considered not active at a start of a decoding process. At most one PPS RBSP is considered active at any given moment during the decoding process, and activation of any particular PPS RBSP results in deactivation of a previously-active PPS RBSP, if any.
- In some examples,
prediction processing unit 41 of video encoder 20 and prediction processing unit 81 of video decoder 30 may support RSSs each having the same resolution aspect ratio. In other examples, video encoder 20 and video decoder 30 may support different RSSs having different resolution aspect ratios among the different RSSs. The resolution aspect ratio of an RSS may be defined as the proportion of the width of an RSS versus the height of the RSS. - In the example where
prediction processing units 41 and 81 support RSSs having different resolution aspect ratios, cropping of reference pictures may be needed when performing inter-prediction across RSSs. - In order to support CVSs with adaptive resolution, the techniques of this disclosure propose adding the following syntax structures to the SPS. The syntax elements may include a profile indicator or a flag that indicates the existence of more than one spatial resolution in the CVS. Alternatively, no flag may be added, but the existence of the more than one spatial resolution in the CVS may be indicated by a particular value of the profile indicator, which may be denoted as profile_idc. Additionally, the syntax elements may include a resolution ID, a syntax element that indicates a spatial relationship between the current resolution sub-sequence and an adjacent spatial resolution sub-sequence, and a syntax element that indicates the required size of the DPB in units of 8×8 blocks.
- According to the techniques of this disclosure, a modified SPS RBSP syntax structure may be expressed as shown below in Table I:
-
TABLE I

seq_parameter_set_rbsp( ) {                                    Descriptor
    profile_idc                                                u(8)
    reserved_zero_8bits /* equal to 0 */                       u(8)
    level_idc                                                  u(8)
    seq_parameter_set_id                                       ue(v)
    max_temporal_layers_minus1                                 u(3)
    pic_width_in_luma_samples                                  u(16)
    pic_height_in_luma_samples                                 u(16)
    bit_depth_luma_minus8                                      ue(v)
    bit_depth_chroma_minus8                                    ue(v)
    pcm_bit_depth_luma_minus1                                  u(4)
    pcm_bit_depth_chroma_minus1                                u(4)
    log2_max_pic_order_cnt_lsb_minus4                          ue(v)
    max_num_ref_frames                                         ue(v)
    log2_min_coding_block_size_minus3                          ue(v)
    log2_diff_max_min_coding_block_size                        ue(v)
    log2_min_transform_block_size_minus2                       ue(v)
    log2_diff_max_min_transform_block_size                     ue(v)
    log2_min_pcm_coding_block_size_minus3                      ue(v)
    max_transform_hierarchy_depth_inter                        ue(v)
    max_transform_hierarchy_depth_intra                        ue(v)
    chroma_pred_from_luma_enabled_flag                         u(1)
    loop_filter_across_slice_flag                              u(1)
    sample_adaptive_offset_enabled_flag                        u(1)
    adaptive_loop_filter_enabled_flag                          u(1)
    pcm_loop_filter_disable_flag                               u(1)
    cu_qp_delta_enabled_flag                                   u(1)
    temporal_id_nesting_flag                                   u(1)
    inter_4x4_enabled_flag                                     u(1)
    adaptive_spatial_resolution_flag                           u(1)
    if( adaptive_spatial_resolution_flag ) {
        resolution_id                                          ue(v)
        for( i = 0; i < 2; i++ ) {
            cropping_resolution_idc[ i ]                       u(2)
            if( cropping_resolution_idc[ i ] & 0x01 ) {
                cropped_left[ i ]                              ue(v)
                cropped_right[ i ]                             ue(v)
            }
            if( cropping_resolution_idc[ i ] & 0x10 ) {
                cropped_top[ i ]                               ue(v)
                cropped_bottom[ i ]                            ue(v)
            }
        }
    }
    max_dec_pic_buffering                                      ue(v)
    rbsp_trailing_bits( )
}

- An exemplary description of the new SPS syntax elements in Table I is set forth in more detail below.
- adaptive_spatial_resolution_flag: When equal to “1,” the flag indicates that a CVS containing an RSS referring to an SPS may contain pictures with different spatial resolutions. When equal to “0,” the flag indicates that all pictures in the CVS have a same spatial resolution, or equivalently, that there is only one RSS in the CVS. This syntax element applies to the entire CVS, and its value shall be identical for all SPSs that may be activated for the CVS.
- The adaptive_spatial_resolution flag is only one example of how adaptive resolution CVSs may be implemented. As another example, there may be one or more profiles defined that enable adaptive spatial resolution. Accordingly, the value of the profile_idc syntax element, which may indicate the selection of an adaptive resolution profile, may signal the enablement of adaptive resolution.
- resolution_id: Specifies an identifier of the RSS referring to the SPS. A value of resolution_id may be in a range of “0” to “7,” inclusive. An RSS with a largest spatial resolution among all RSSs in the CVS may have resolution_id equal to “0.”
- cropping_resolution_idc[i]: Indicates whether cropping is needed to specify a reference region of a reference picture from a target RSS, as defined below, used for inter-prediction as a reference when decoding a coded picture from a current RSS.
- The pseudocode that follows describes one example of how the numbering of an RSS using the resolution_id value that refers to an SPS may be implemented according to the techniques of this disclosure.
-
- Let “rId” be a resolution_id of the current RSS;
- The target RSS is the RSS with a resolution_id equal to: rId+(i==0?−1:1);
- If the current RSS has a resolution_id equal to 0, cropping_resolution_idc[0]=0
- If the current RSS has a largest resolution_id among all RSSs in the CVS, cropping_resolution_idc[1]=0
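The target-RSS derivation and the inferred defaults described in the pseudocode above can be sketched in Python as follows (a minimal illustration with hypothetical function names; not part of any codec specification):

```python
def target_resolution_id(rid, i):
    """Derive the resolution_id of the target RSS for index i.

    Per the pseudocode above, the target RSS has resolution_id equal to
    rId + (i == 0 ? -1 : 1): i == 0 points to the next-larger resolution,
    i == 1 to the next-smaller resolution.
    """
    return rid + (-1 if i == 0 else 1)


def default_cropping_idc(rid, max_rid):
    """Inferred cropping_resolution_idc defaults for an RSS.

    Returns a dict mapping i -> inferred value, or None where the value
    must be signaled. An RSS with resolution_id 0 has no larger target
    (cropping_resolution_idc[0] = 0); the RSS with the largest
    resolution_id has no smaller target (cropping_resolution_idc[1] = 0).
    """
    idc = {0: None, 1: None}
    if rid == 0:
        idc[0] = 0
    if rid == max_rid:
        idc[1] = 0
    return idc
```

For example, an RSS with resolution_id 3 has targets with resolution_id 2 (for i = 0) and 4 (for i = 1).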
- As described above, the techniques of this disclosure may enable RSSs and SPSs that may have different aspect ratios. When performing inter-prediction,
video encoder 20 may predict the pixel values of a block from a block of a reference picture that has a different aspect ratio. Because of the difference in the aspect ratios, video encoder 20 may crop a portion of the block of the reference picture in order to obtain a block with a resolution aspect ratio similar to that of the predictive block. The following syntax elements describe how video encoder 20 may perform cropping of blocks to obtain blocks with different resolution aspect ratios. - Cropping_resolution_idc[i] equal to “0” indicates that the target RSS does not exist, or that no cropping is needed.
- Cropping_resolution_idc[i] equal to “1” indicates that cropping at a left and/or right side is needed.
- Cropping_resolution_idc[i] equal to “2” indicates that cropping at a top and/or bottom is needed.
- Cropping_resolution_idc[i] equal to “3” indicates that cropping at both the left/right and the top/bottom is needed.
- Table II below illustrates the various values of Cropping_resolution_idc[i], and the corresponding indications.
-
TABLE II

cropping_resolution_idc[ i ]   Indication
0                              No cropping is needed
1                              Cropping may happen at the left and/or right side
2                              Cropping may happen at the top and/or bottom
3                              Cropping may happen at both left/right and top/bottom

- In addition to the “cropping_resolution_idc” value, the RBSP of an SPS may also include syntax elements that may indicate the number of pixels to be cropped from the top, bottom, left, and/or right of a reference picture from an RSS. These additional cropping syntax elements are described in further detail below.
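The conditional parsing of the cropping syntax elements in Table I, with the Table II semantics, can be sketched as follows. This is a hypothetical parser, not code from the disclosure; note that the `& 0x10` test in Table I is treated here as a test of bit 1 (`& 0x02`), since Table II assigns top/bottom cropping to the value 2 and `0x10` appears to be a typographical artifact:

```python
def parse_cropping(idc, read_ue):
    """Return (left, right, top, bottom) crop amounts for one target RSS.

    idc is cropping_resolution_idc[i] (0..3); read_ue is a callable that
    reads one ue(v)-coded value from the bitstream. Values that are not
    signaled are inferred to be 0, as specified for
    cropped_left/right/top/bottom.
    """
    left = right = top = bottom = 0
    if idc & 0x01:   # value 1 or 3: left/right cropping is signaled
        left = read_ue()
        right = read_ue()
    if idc & 0x02:   # value 2 or 3: top/bottom cropping is signaled
        top = read_ue()
        bottom = read_ue()
    return left, right, top, bottom
```

For example, with idc equal to 1 only cropped_left[i] and cropped_right[i] are read, and the top/bottom amounts are inferred as 0.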
- cropped_left[i]: Specifies a number of pixels to be cropped at a left side of a luma component of the reference picture from the target RSS, to specify the reference region. When not present,
video encoder 20 may infer the value to be equal to “0.” - cropped_right[i]: Specifies a number of pixels to be cropped at a right side of the luma component of the reference picture from the target RSS, to specify the reference region. When not present,
video encoder 20 may infer the value to be equal to “0.” - cropped_top[i]: Specifies a number of pixels to be cropped at a top of the luma component of the reference picture from the target RSS, to specify the reference region. When not present,
video encoder 20 may infer the value to be equal to “0.” - cropped_bottom[i]: Specifies a number of pixels to be cropped at a bottom of the luma component of the reference picture from the target RSS, to specify the reference region. When not present,
video encoder 20 may infer the value to be equal to “0.” - In addition to signaling a bottom, top, left, and/or right cropping,
video encoder 20 may signal the cropping window in other ways. As an example, video encoder 20 may signal the cropping window as the starting vertical and horizontal positions plus the width and height. As another example, video encoder 20 may signal the cropping window as the starting vertical and horizontal positions and the ending vertical and horizontal positions. - Before
prediction processing unit 41 may use a coded picture in the current RSS, prediction processing unit 41 may crop a decoded picture from the target RSS as specified by the above cropping syntax elements. Prediction processing unit 41 may also scale the cropped reference picture to be the same resolution as the coded picture in the current RSS, and scale the motion vectors of the cropped block accordingly. - As described above,
video encoder 20 may include DPB 64, which may contain decoded pictures. DPB management unit 65 may manage DPB 64. Each decoded picture contained within DPB 64 may be needed either for inter-prediction as a reference, or for future output. In accordance with the techniques of this disclosure, DPB 64 may be modified to support adaptive-resolution CVSs, and more generally to store frames of different sizes. - In accordance with the techniques of this disclosure, upon initialization, the DPB may be empty (i.e., an indication of a proportion of
DPB 64 that is unavailable to store decoded pictures, or DPB “fullness,” is set to “0”). When a decoded picture is stored in DPB 64, DPB management unit 65 may increment the “fullness” of the DPB by the number of blocks (e.g., CUs or 8×8 pixel blocks) in the picture. Similarly, when DPB management unit 65 removes a decoded picture from DPB 64, DPB management unit 65 may decrease the fullness of the DPB by the number of blocks (e.g., CUs or 8×8 pixel blocks) in the removed picture. - To support a DPB that utilizes a block count rather than a frame count to indicate the “fullness” of the DPB, the RBSP of an SPS may include a syntax element that specifies a size of the DPB in 8×8 blocks. The parameter, denoted as max_dec_pic_buffering, specifies a required size of a decoded picture buffer (DPB), in units of 8×8 blocks, for decoding the CVS. This syntax element may apply to the entire CVS, and its value is identical for all SPSs that may be activated for the CVS. Further detail of the operation of the DPB is described with respect to
FIG. 5 , below. -
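The block-based fullness accounting described above can be sketched in Python as follows (a minimal illustration; the class and its names are hypothetical, and fullness is counted in 8×8 luma blocks, i.e., (width × height) >> 6):

```python
class BlockCountDPB:
    """Toy DPB whose fullness is measured in 8x8 blocks, not frames."""

    def __init__(self, max_dec_pic_buffering):
        self.capacity = max_dec_pic_buffering  # DPB size in 8x8 blocks
        self.fullness = 0                      # initially empty
        self.pictures = []

    @staticmethod
    def blocks(width, height):
        # (pic_width_in_luma_samples * pic_height_in_luma_samples) >> 6
        return (width * height) >> 6

    def store(self, pic_id, width, height):
        cost = self.blocks(width, height)
        if self.fullness + cost > self.capacity:
            raise MemoryError("DPB full")
        self.pictures.append((pic_id, cost))
        self.fullness += cost

    def remove(self, pic_id):
        for i, (pid, cost) in enumerate(self.pictures):
            if pid == pic_id:
                del self.pictures[i]
                self.fullness -= cost
                return
        raise KeyError(pic_id)
```

With this accounting, a 1920×1080 picture occupies (1920×1080) >> 6 = 32400 blocks while a 960×540 picture occupies 8100 blocks, so the same DPB budget can hold different mixes of resolutions.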
FIG. 3 is a block diagram illustrating an example video decoder 30 that may implement the techniques described in this disclosure. In the example of FIG. 3, video decoder 30 includes an entropy decoding unit 80, prediction processing unit 81, inverse quantization unit 86, inverse transformation unit 88, summer 90, decoded picture buffer (DPB) 92, and DPB management unit 93. Prediction processing unit 81 includes motion compensation unit 82 and intra prediction unit 84. Video decoder 30 may, in some examples, perform a decoding pass generally reciprocal to the encoding pass described with respect to video encoder 20 from FIG. 2. - During the decoding process,
video decoder 30 receives an encoded video bitstream that represents video blocks of an encoded video slice and associated syntax elements from video encoder 20. Entropy decoding unit 80 of video decoder 30 entropy decodes the bitstream to generate quantized coefficients, motion vectors, and other syntax elements. Entropy decoding unit 80 forwards the motion vectors and other syntax elements to prediction processing unit 81. Video decoder 30 may receive the syntax elements at the video slice level and/or the video block level. - When the video slice is coded as an intra-coded (I) slice,
intra prediction unit 84 of prediction processing unit 81 may generate prediction data for a video block of the current video slice based on a signaled intra prediction mode and data from previously decoded blocks of the current picture. When the video picture is coded as an inter-coded (i.e., B or P) slice, motion compensation unit 82 of prediction processing unit 81 produces predictive blocks for a video block of the current video slice based on the motion vectors and other syntax elements received from entropy decoding unit 80. The predictive blocks may be produced from one of the reference pictures within one of the reference picture lists. Video decoder 30 may construct the reference frame lists, List 0 and List 1, using default construction techniques based on reference pictures stored in decoded picture buffer 92. In some examples, video decoder 30 may construct List 0 and List 1 from the reference pictures identified in the derived reference picture set. -
Motion compensation unit 82 determines prediction information for a video block of the current video slice by parsing the motion vectors and other syntax elements, and uses the prediction information to produce the predictive blocks for the current video block being decoded. For example, motion compensation unit 82 uses some of the received syntax elements to determine a prediction mode (e.g., intra- or inter-prediction) used to code the video blocks of the video slice, an inter-prediction slice type (e.g., B slice or P slice), construction information for one or more of the reference picture lists for the slice, motion vectors for each inter-encoded video block of the slice, inter-prediction status for each inter-coded video block of the slice, and other information to decode the video blocks in the current video slice. -
Motion compensation unit 82 may also perform interpolation based on interpolation filters. Motion compensation unit 82 may use interpolation filters as used by video encoder 20 during encoding of the video blocks to calculate interpolated values for sub-integer pixels of reference blocks. In this case, motion compensation unit 82 may determine the interpolation filters used by video encoder 20 from the received syntax elements and use the interpolation filters to produce predictive blocks. -
Inverse quantization unit 86 inverse quantizes, i.e., de-quantizes, the quantized transform coefficients provided in the bitstream and decoded by entropy decoding unit 80. The inverse quantization process may include use of a quantization parameter calculated by video encoder 20 for each video block in the video slice to determine a degree of quantization and, likewise, a degree of inverse quantization that should be applied. Inverse transform unit 88 applies an inverse transform, e.g., an inverse DCT, an inverse integer transform, or a conceptually similar inverse transform process, to the transform coefficients in order to produce residual blocks in the pixel domain. - After
prediction processing unit 81 generates the predictive block for the current video block based on either inter- or intra-prediction, video decoder 30 forms a decoded video block by summing the residual blocks from inverse transform unit 88 with the corresponding predictive blocks generated by prediction processing unit 81. Summer 90 represents the component or components that perform this summation operation. If desired, a deblocking filter may also be applied to filter the decoded blocks in order to remove blockiness artifacts. Other loop filters (either in the coding loop or after the coding loop) may also be used to smooth pixel transitions, or otherwise improve the video quality. DPB management unit 93 may store the decoded video blocks of a given picture in decoded picture buffer 92, which stores reference pictures used for subsequent motion compensation. Decoded picture buffer 92 also stores decoded video for later presentation on a display device, such as display device 32 of FIG. 1. - In accordance with this disclosure,
prediction processing unit 81 and DPB management unit 93 represent example units for performing the example functions described above. For example, prediction processing unit 81 may receive a coded video sequence comprising a first sub-sequence and a second sub-sequence, wherein the first sub-sequence includes one or more frames each having a first resolution, and the second sub-sequence includes one or more frames each having a second resolution, and wherein the first sub-sequence is different than the second sub-sequence, and the first resolution is different than the second resolution. Prediction processing unit 81 may also receive a first sequence parameter set and a second sequence parameter set for the coded video sequence, wherein the first sequence parameter set indicates the first resolution of the one or more frames of the first sub-sequence, and the second sequence parameter set indicates the second resolution of the one or more frames of the second sub-sequence, and wherein the first sequence parameter set is different than the second sequence parameter set. Prediction processing unit 81 may also use the first sequence parameter set and the second sequence parameter set to decode the coded video sequence. - As another example in accordance with the techniques of this disclosure,
prediction processing unit 81 may also receive a first decoded frame of video data, wherein the first decoded frame is associated with a first resolution. DPB management unit 93 may determine whether DPB 92 is available to store the first decoded frame based on the first resolution and, in the event the decoded picture buffer is available to store the first decoded frame, store the first decoded frame in DPB 92, and determine whether DPB 92 is available to store a second decoded frame of video data, wherein the second decoded frame is associated with a second resolution, based on the first resolution and the second resolution, wherein the first decoded frame is different than the second decoded frame. - In general,
video decoder 30 may perform any of the techniques of this disclosure. In some examples, video decoder 30 may perform some or all of the techniques described above with respect to video encoder 20 in FIG. 2. In some examples, video decoder 30 may perform the techniques described with respect to FIG. 2 in a reciprocal ordering or manner to that described with respect to video encoder 20. -
FIGS. 4A-4D are conceptual diagrams that illustrate examples of a coded bitstream including coded video data in accordance with the techniques of this disclosure. As shown in FIG. 4A, a coded bitstream 400 may comprise one or more coded video sequences (CVSs), in particular, CVS 402 and CVS 404. As also shown in FIG. 4A, each of CVS 402 and CVS 404 may comprise one or more frames, or “pictures,” PIC_1 (0)-PIC_1 (N), and PIC_2 (0)-PIC_2 (M), respectively. As still further shown in FIG. 4A, each of CVS 402 and CVS 404 may further comprise a single sequence parameter set (SPS), in particular, SPS1 and SPS2, respectively. As described above, each of SPS1 and SPS2 may define parameters for the corresponding one of CVS 402 and CVS 404, including LCU size, SCU size, and other syntax information for the respective CVS that is common to all frames, or “pictures,” within the CVS. - As shown in
FIG. 4B, a particular CVS, CVS 406, may further comprise one or more picture parameter sets (PPSs), in particular, PPS1 and PPS2. As described above, each of PPS1 and PPS2 may define parameters for CVS 406, including syntax information that indicates picture resolution, that are common to one or more pictures within CVS 406, but not to all pictures within CVS 406. For example, syntax information included within each of PPS1 and PPS2, e.g., picture resolution syntax information, may apply to a sub-set of the pictures included within CVS 406. As one example, PPS1 may indicate picture resolution for PIC_1 (0)-PIC_1 (N), and PPS2 may indicate picture resolution for PIC_2 (0)-PIC_2 (M). Accordingly, CVS 406 may comprise pictures having different resolutions, wherein picture resolution for a particular one or more pictures (e.g., PIC_1 (0)-PIC_1 (N)) within CVS 406 that share a common picture resolution may be specified by a corresponding one of PPS1 and PPS2.
- As described above, A PPS RBSP may include parameters that can be referred to by coded slice NAL units of one or more coded pictures. Each PPS RBSP is initially considered not active at a start of a decoding process. In most examples, one PPS RBSP is considered active at any given moment during the decoding process, and activation of any particular PPS RBSP results in deactivation of a previously-active PPS RBSP, if any.
- When a PPS RBSP (with a particular value of the pic_parameter_set_id syntax element) is not active, and is referred to by a coded slice NAL unit (using the particular value of pic_parameter_set_id), the PPS referred to by the pic_parameter_sed_id is activated. This PPS RBSP is referred to as an “active PPS RBSP,” until it is deactivated by an activation of another PPS.
Video encoder 20 or decoder 30 may require a PPS with the referenced pic_parameter_set_id value to have been received before activating the PPS with that pic_parameter_set_id. - As an example of the PPS activation process, a NAL unit may refer to PPS1.
Video encoder 20 or decoder 30 may activate PPS1 based on the reference to PPS1 in the NAL unit. PPS1 is the active PPS RBSP. PPS1 remains the active PPS RBSP until a NAL unit references PPS2, at which point video encoder 20 or decoder 30 may activate PPS2. Once activated, PPS2 becomes the active PPS RBSP, and PPS1 is no longer the active PPS RBSP.
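The activation rule described above can be sketched as a simple state machine in Python (a hypothetical structure for illustration; real parameter sets carry many more fields):

```python
class PPSActivation:
    """Tracks the single active PPS RBSP during decoding."""

    def __init__(self):
        self.received = {}    # pic_parameter_set_id -> PPS content
        self.active_id = None

    def receive(self, pps_id, content):
        self.received[pps_id] = content

    def activate(self, pps_id):
        """Called when a coded slice NAL unit refers to pps_id."""
        if pps_id not in self.received:
            raise ValueError(
                "PPS %d referenced before being received" % pps_id)
        # Activating one PPS RBSP deactivates the previously active one.
        self.active_id = pps_id
        return self.received[pps_id]
```

For example, activating PPS1 and then PPS2 leaves only PPS2 as the active PPS RBSP, mirroring the example above.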
- In accordance with the techniques of this disclosure, as shown in
FIGS. 4C-4D, syntax information that indicates picture resolution for one or more pictures within a CVS, wherein the CVS comprises one or more pictures having different sizes, may be indicated using multiple SPSs for the CVS, rather than using a plurality of PPSs, as described above with reference to FIGS. 4A-4B.
- When an SPS RBSP (with a particular value of seq_parameter_set_id) is not already active, and is referred to by activation of a PPS RBSP (using the particular value of seq_parameter_set_id), or is referred to by an SEI NAL unit containing a buffering period SEI message (using the particular value of seq_parameter_set_id), the SPS RBSP is activated. This SPS RBSP may be referred to as an “active SPS RBSP” for the associated RSS (the RSS in which the coded pictures refers to the active SPS RBSP through the PPS RBSPs), until it is deactivated by an activation of another SPS RBSP.
Video encoder 20 ordecoder 30 may require the SPS RBSP with a particular value of seq_parameter_set_id, to be available tovideo encoder 20 orvideo decoder 30 prior to the activation of that SPS. Additionally, the SPS may remain active for the entire RSS in the CVS. - Additionally, because an instantaneous decoder refresh (IDR) access unit begins a new CVS, and an activated SPS RBSP may remain active for the entire RSS in the CVS, an SPS RBSP may only be activated by a buffering period SEI message when the buffering period SEI message is part of an IDR access unit.
- Any SPS NAL unit containing the particular value of seq_parameter_set_id for the active SPS RBSP for a RSS in a CVS may have the same content as that of the active SPS RBSP for the RSS in the CVS, unless it follows a last access unit of the CVS, and precedes the first VCL NAL unit and the first SEI NAL unit containing a buffering period SEI message (when present) of another CVS.
- Also, if a PPS RBSP or an SPS RBSP is conveyed within the bitstream, these constraints impose an order constraint on the NAL units that contain the PPS RBSP or the SPS RBSP, respectively. Otherwise if PPS RBSP or SPS RBSP are conveyed by other means not specified in this disclosure, they should be available to the decoding process in a timely fashion such that these constraints are obeyed.
- The constraints that are expressed on the relationship between the values of the syntax elements (and the values of variables derived from those syntax elements) in SPS and PPS, and other syntax elements, are typically expressions of constraints that apply only to the active SPS and the active PPS. If any SPS RBSP is present that is not activated in the bitstream, its syntax elements usually have values that would conform to the specified constraints if it were activated by reference in an otherwise conforming bitstream. If any PPS RBSP is present that is not ever activated in the bitstream, the syntax elements of the PPS RBSP may have values that would conform to the specified constraints if the PPS were activated by reference in an otherwise-conforming bitstream.
- During the decoding process, the values of parameters of the active PPS and the active SPS may be considered to be in effect. For interpretation of SEI messages, the values of the parameters of the PPS and SPS that are active for the operation of the decoding process for the VCL NAL units of the primary coded picture in the same access unit may be considered in effect unless otherwise specified in the SEI message semantics.
- As one example, as shown in
FIG. 4C, CVS 408 may include one or more SPSs, in particular, SPS1 and SPS2, that each indicate picture resolution for PIC_1 (0), PIC_1 (1), etc., and PIC_2 (0), PIC_2 (1), etc., respectively. In other words, SPS1 indicates picture resolution information for PIC_1 (0), PIC_1 (1), etc., and SPS2 indicates picture resolution information for PIC_2 (0), PIC_2 (1), etc. In this example, CVS 408 may further comprise one or more PPSs (not shown), wherein the one or more PPSs may specify syntax information for one or more pictures of CVS 408, but wherein the one or more PPSs do not include any syntax information that indicates picture resolution for any of the one or more pictures of CVS 408.
CVS 408, even in cases where pictures having different resolutions are alternated within a CVS in the decoding order. Accordingly, after the indicating picture resolution information for all pictures withinCVS 408 using SPS1 and SPS2, no additional indication of the information may be needed. - As shown in
FIG. 4C, the multiple SPSs, e.g., SPS1 and SPS2, may be located at the beginning of the corresponding CVS, e.g., CVS 408, prior to any of PIC_1 (0), PIC_1 (1) and PIC_2 (0), PIC_2 (1). As shown in FIG. 4D, alternatively, an SPS that indicates picture resolution information for one or more pictures may be located before a first one of such pictures in a decoding sequence. For example, as shown in FIG. 4D, SPS2 is located within CVS 410 prior to a first one of pictures PIC_2 (0), PIC_2 (1), etc., but after a first one of PIC_1 (0), PIC_1 (1), etc. -
FIG. 5 is a conceptual diagram illustrating the operation of a decoded picture buffer of a hypothetical reference decoder (HRD) model in accordance with the techniques of this disclosure. FIG. 5 includes coded picture buffer (CPB) 502, decoded picture buffer (DPB) 504, and DPB management unit 506. DPB management unit 506 may remove a picture in coded picture buffer (CPB) 502. Video encoder 20 or decoder 30 may decode the picture, and DPB management unit 506 may store the decoded picture in decoded picture buffer 504. Based on various criteria, such as an output time, output flag, or a picture count, DPB management unit 506 may remove a picture from DPB 504. In some cases video encoder 20 or decoder 30 may output the decoded picture. CPB 502 may contain encoded pictures that are removed so that video encoder 20 or decoder 30 may utilize the decoded pictures that may be needed for inter-prediction as a reference, or for future output. In general, DPB 504 may include a maximum capacity. In previous video coding standards, DPB 504 may include a maximum number of frames that can be stored in the DPB. However, to support adaptive-resolution CVSs, DPB management unit 506 may maintain a count of blocks contained within the DPB to measure the “fullness” of the DPB. - This disclosure describes the removal techniques of decoded pictures in the DPB from at least two perspectives. In the first perspective, DPB management unit 506 of
video decoder 30 may remove decoded pictures based on an output time if the pictures are intended for output. In the second perspective, DPB management unit 506 may remove decoded pictures based on the picture order count (POC) values if the pictures are intended for output. In either perspective, DPB management unit 506 may remove decoded pictures that are not needed for output (i.e., outputted already or not intended for output) when the decoded picture is not in the reference picture set, and prior to decoding the current picture. Although described with respect to video decoder 30, video encoder 20 and DPB management unit 506 of video encoder 20 may also perform any of the DPB management techniques described in this disclosure. -
DPB 504 may include a plurality of buffers, and each buffer may store a decoded picture that is to be used as a reference picture or is held for future output. Initially, the DPB is empty (i.e., the DPB fullness is set to zero). In the described example techniques, the removal of the decoded pictures from the DPB may occur before the decoding of the current picture, but after video decoder 30 parses the slice header of the first slice of the current picture. - In the first perspective, the following techniques may occur instantaneously at time tr(n) in the following sequence. In this example, tr(n) is the CPB removal time (i.e., decoding time) of the access unit n containing the current picture. As described in this disclosure, the techniques occurring instantaneously may mean that, in the HRD model, it is assumed that decoding of a picture is instantaneous, with a time period for decoding a picture equal to zero. - In the first perspective, decoder 30 may invoke the derivation process for a reference picture set. If the current picture, which DPB management unit 506 may retrieve from CPB 502, is an IDR picture, DPB management unit 506 may remove all decoded pictures from DPB 504, and may set the DPB fullness to 0. If the decoded picture is not an IDR picture, DPB management unit 506 may remove all pictures not included in the reference picture set of the current picture from DPB 504. DPB management unit 506 may also remove all pictures having an OutputFlag value equal to “0”, or having a DPB output time less than or equal to the CPB removal time of the current picture, which may be referred to as “n” (i.e., to,dpb(m)<=tr(n)). The OutputFlag may indicate that video decoder 30 should output the picture (e.g., for display or for transmission in the case of an encoder). - Whenever DPB management unit 506 removes a picture from
DPB 504, DPB management unit 506 may decrement the fullness of DPB 504 by the number of 8×8 blocks in the picture, i.e., (pic_width_in_luma_samples*pic_height_in_luma_samples)>>6. - After DPB management unit 506 has removed any pictures from the DPB,
video decoder 30 may decode and store the received picture “n” in the DPB. DPB management unit 506 may increment the DPB fullness by the number of 8×8 blocks in the stored decoded picture, i.e., (pic_width_in_luma_samples*pic_height_in_luma_samples)>>6. - Each picture may also have an OutputFlag, as described above. When the picture has an OutputFlag value equal to 1, the DPB output time, denoted as to,dpb(n), of the picture may be derived by the following equation.
-
to,dpb(n) = tr(n) + tc * dpb_output_delay(n) - In the equation, dpb_output_delay(n) may be the value of dpb_output_delay specified in the picture timing SEI message associated with access unit “n.” - If the OutputFlag of a picture is equal to “1” and to,dpb(n)=tr(n), video decoder 30 may output the current picture. Otherwise, if the value of OutputFlag is equal to 0, video decoder 30 may not output the current picture. Otherwise (i.e., if OutputFlag is equal to 1 and to,dpb(n)>tr(n)), video decoder 30 may output the current picture later, at time to,dpb(n). - As described above, in some examples,
video decoder 30 may crop the picture in the decoded picture buffer. Video decoder 30 may utilize the cropping rectangle specified in the active sequence parameter set for the picture to determine the cropping rectangle. - In some examples,
video decoder 30 may determine a difference between the DPB output time for a picture and the DPB output time for a picture following the picture in output order. When picture “n” is a picture that is output and is not the last picture of the bitstream that is output, the output time of picture “n” Δto,dpb(n) may be defined according to the following equation. -
Δto,dpb(n) = to,dpb(nn) − to,dpb(n) - In the preceding equation, nn may denote the picture that follows after picture “n” in output order and has OutputFlag equal to 1.
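The two output-time formulas above can be sketched numerically in Python as follows (hypothetical helpers for illustration only; tc is the HRD clock tick, and each picture is represented here as a (tr, OutputFlag, dpb_output_delay) tuple):

```python
def dpb_output_time(t_r, t_c, dpb_output_delay):
    """to,dpb(n) = tr(n) + tc * dpb_output_delay(n)."""
    return t_r + t_c * dpb_output_delay


def output_schedule(pictures, t_c):
    """Compute (to,dpb, delta) pairs for the pictures that are output.

    pictures: list of (t_r, OutputFlag, dpb_output_delay) in output order.
    delta is Dt_o,dpb(n) = to,dpb(nn) - to,dpb(n), where nn is the next
    picture with OutputFlag equal to 1; the last output picture gets None,
    since the interval is undefined for the last output picture.
    """
    times = [dpb_output_time(t_r, t_c, delay)
             for (t_r, flag, delay) in pictures if flag == 1]
    deltas = [times[i + 1] - times[i] for i in range(len(times) - 1)] + [None]
    return list(zip(times, deltas))
```

For example, three pictures with removal times 0, 1, and 2, OutputFlags 1, 0, and 1, and delays 2, 0, and 3 (with tc = 1) yield output times 2 and 5 and a single output interval of 3.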
- In the second perspective for removing decoded pictures, the HRD may implement the techniques instantaneously when DPB management unit 506 removes an access unit from
CPB 502. Again, video decoder 30 and DPB management unit 506 of video decoder 30 may implement the removing of decoded pictures from DPB 504, and video decoder 30 may not necessarily include CPB 502. In some examples, video decoder 30 and video encoder 20 may not require CPB 502. Rather, CPB 502 is described as part of the HRD model for purposes of illustration only. - As above, in the second perspective for removing decoded pictures, DPB management unit 506 may remove the pictures from the DPB before the decoding of the current picture, but after parsing the slice header of the first slice of the current picture. Also, similar to the first perspective for removing decoded pictures, in the second perspective,
video decoder 30 and DPB management unit 506 may perform similar functions to those described above with respect to the first perspective when the current picture is an IDR picture. - Otherwise, if the current picture is not an IDR picture, DPB management unit 506 may empty, without output, buffers of the DPB that store a picture that is marked as “not needed for output” and that store pictures not included in the reference picture set of the current picture. DPB management unit 506 may also decrement the DPB fullness by the number of buffers that DPB management unit 506 emptied. When there is no empty buffer (i.e., the DPB fullness is equal to the DPB size), DPB management unit 506 may implement a “bumping” process described below. In some examples, when there is no empty buffer, DPB management unit 506 may implement the bumping process repeatedly until there is an empty buffer in which
video decoder 30 can store the current decoded picture. - In general,
video decoder 30 may implement the following steps to implement the bumping process. Video decoder 30 may first determine the picture to be outputted. For example, video decoder 30 may select the picture having the smallest PicOrderCnt (POC) value of all the pictures in DPB 504 that are marked as “needed for output.” Video decoder 30 may crop the selected picture using the cropping rectangle specified in the active sequence parameter set for the picture. Video decoder 30 may output the cropped picture, and may mark the picture as “not needed for output.” Video decoder 30 may check the buffer of DPB 504 that stored the cropped and outputted picture. If the picture is not included in the reference picture set, DPB management unit 506 may empty that buffer and may decrement the DPB fullness by the number of 8×8 blocks in the removed picture. - Although the above techniques for DPB management are described from the context of
video decoder 30 andDPB management unit 65, in some examples,video encoder 20, andDPB management unit 93 may implement similar techniques. However,video encoder 20 implementing similar techniques is not required in every example. In some examples,video decoder 30 may implement these techniques, andvideo encoder 20 may not implement these techniques. - In this manner, a video coder (e.g.,
video encoder 20 or video decoder 30) may implement techniques to support CVSs having adaptive resolution. Again, the reference picture set may identify the reference pictures that can potentially be used for inter-predicting the current picture and can potentially be used for inter-predicting one or more pictures following the current picture in decoding order. - In the above examples, the DPB size or fullness may be signaled with respect to the number of 8×8 blocks of a picture stored in the DPB. Alternatively, the fullness of the DPB, i.e., the max_dec_pic_buffering syntax element, may be signaled based on the number of smallest coding units (SCUs) of a picture. For example, if the smallest SCU among all active SPSs is 16×16, then the unit of max_dec_pic_buffering may be 16×16 blocks.
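The bumping process described above can be sketched in code. The following is a minimal illustrative model, not the codec's actual implementation: the DecodedPicture and Dpb names are hypothetical, and the cropping/output of the selected picture is elided.

```python
class DecodedPicture:
    """Hypothetical model of a picture stored in the DPB."""
    def __init__(self, poc, blocks_8x8, needed_for_output=True, in_rps=True):
        self.poc = poc                      # PicOrderCnt value
        self.blocks_8x8 = blocks_8x8        # picture size in 8x8 blocks
        self.needed_for_output = needed_for_output
        self.in_rps = in_rps                # in the current reference picture set?

class Dpb:
    """Hypothetical decoded picture buffer; fullness counted in 8x8 blocks."""
    def __init__(self):
        self.pictures = []
        self.fullness = 0

    def insert(self, pic):
        self.pictures.append(pic)
        self.fullness += pic.blocks_8x8

    def bump(self):
        # Step 1: select the picture with the smallest POC among those
        # marked "needed for output".
        pic = min((p for p in self.pictures if p.needed_for_output),
                  key=lambda p: p.poc)
        # Step 2: crop per the active SPS cropping rectangle and output (elided).
        # Step 3: mark the picture as "not needed for output".
        pic.needed_for_output = False
        # Step 4: if the picture is not in the reference picture set, empty its
        # buffer and decrement fullness by its size in 8x8 blocks.
        if not pic.in_rps:
            self.pictures.remove(pic)
            self.fullness -= pic.blocks_8x8
        return pic.poc
```

In this sketch, calling bump() repeatedly mirrors the behavior described for the non-IDR case, where the process repeats until an empty buffer exists for the current decoded picture.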
- As still another example,
video encoder 20 or decoder 30 may signal the DPB size, indicated by the max_dec_pic_buffering syntax element, using units of frame buffers that are specific to the spatial resolution indicated by the SPS. For example, if there are two RSSs, rss1 and rss2, with resolution res1 and resolution res2, referring to SPS sps1 and SPS sps2 respectively, wherein res1 is greater than res2, then max_dec_pic_buffering in sps1 is counted in frame buffers of res1, and max_dec_pic_buffering in sps2 is counted in frame buffers of res2. In this example, video encoder 20 or decoder 30 may be subject to the restriction that the DPB size indicated by the max_dec_pic_buffering value in sps1, if counted in units of 8×8 blocks, may not be less than that indicated by the max_dec_pic_buffering value in sps2. Consequently, in the DPB operations, when video decoder 30 removes one frame buffer of res1 from DPB 504, the freed buffer space may be sufficient for insertion of a decoded picture of either resolution. However, when decoder 30 removes one frame buffer of res2 from DPB 504, the freed buffer space may not be sufficient for insertion of a decoded picture of res1. Rather, video decoder 30 may remove multiple frame buffers of res2 from DPB 504 in this case. - The
video decoder 30 may derive the reference picture set in any manner, including the example techniques described above. Video decoder 30 may determine whether a decoded picture stored in the decoded picture buffer is not needed for output and is not identified in the reference picture set. When video decoder 30 has outputted the decoded picture and the decoded picture is not identified in the reference picture set, video decoder 30 may remove the decoded picture from the decoded picture buffer. Subsequent to removing the decoded picture, video decoder 30 may code the current picture. For example, video decoder 30 may construct the reference picture list(s) as described above, and code the current picture based on the reference picture list(s). -
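The resolution-specific frame-buffer accounting described above can be illustrated with a small sketch, assuming (for illustration only) that buffer sizes are measured in 8×8 blocks; the function names and the specific resolutions are hypothetical, not taken from the disclosure.

```python
def blocks_8x8(width, height):
    # Number of 8x8 blocks covering a width x height picture.
    return ((width + 7) // 8) * ((height + 7) // 8)

def buffers_to_free(removable_blocks, needed_blocks):
    """Count how many of the given buffers (sizes in 8x8 blocks) must be
    emptied before a picture of needed_blocks fits; None if not possible."""
    freed = 0
    for count, size in enumerate(removable_blocks, start=1):
        freed += size
        if freed >= needed_blocks:
            return count
    return None

res1 = blocks_8x8(1280, 720)   # 14400 blocks (higher resolution)
res2 = blocks_8x8(640, 360)    # 3600 blocks (lower resolution)

# Freeing one res1 buffer fits a picture of either resolution,
# but several res2 buffers must be freed to fit one res1 picture.
one_res1_frees = buffers_to_free([res1], res2)     # 1 buffer suffices
several_res2 = buffers_to_free([res2] * 4, res1)   # multiple buffers needed
```

This mirrors the restriction that the DPB size in sps1, counted in 8×8 blocks, is at least that in sps2: removing one high-resolution buffer always frees enough space for either resolution, while the reverse may require removing multiple buffers.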
FIG. 6 is a flowchart illustrating an example operation of using a first sub-sequence and a second sub-sequence to decode video in accordance with the techniques of this disclosure. For purposes of illustration only, the method of FIG. 6 may be performed by a video coder corresponding to either video encoder 20 or video decoder 30. In the method of FIG. 6, the video coder may process a coded video sequence comprising a first sub-sequence and a second sub-sequence (601). The first sub-sequence may include one or more frames each having a first resolution, and the second sub-sequence may include one or more frames each having a second resolution. The first sub-sequence may be different than the second sub-sequence, and the first resolution may be different than the second resolution. - The video coder (e.g.,
video encoder 20 or video decoder 30) may also process a first sequence parameter set (SPS) and a second sequence parameter set for the coded video sequence (602). The first sequence parameter set may indicate the first resolution of the one or more frames of the first sub-sequence, and the second sequence parameter set may indicate the second resolution of the one or more frames of the second sub-sequence. The first sequence parameter set may also be different than the second sequence parameter set. The video coder (e.g., video encoder 20 or video decoder 30) may use the first sequence parameter set and the second sequence parameter set to code the coded video sequence (603). - In some examples, the video coder may comprise an encoder, e.g.,
encoder 20 of FIGS. 1-2. In the case where the video coder comprises a decoder, processing SPSs and sub-sequences may comprise receiving the SPSs and sub-sequences. In this case, coding the first and second video sequences may comprise decoding the first and second video sequences. - In the case where the video coder comprises an encoder, processing SPSs and sub-sequences may comprise generating the SPSs and sub-sequences. In this case, coding the first and second video sequences may comprise encoding the first and second video sequences. Additionally, in the case where the video coder comprises an encoder, the video encoder may transmit the coded video sequence comprising the first sub-sequence and the second sub-sequence instead of receiving the video sequence comprising the first and second sub-sequences. In some examples, the first resolution and the second resolution may each comprise a spatial resolution.
- In some examples, the video coder may code the first sequence parameter set and the second sequence parameter set in a received bitstream prior to either the first sub-sequence or the second sub-sequence.
- In another example, to receive the first sequence parameter set and the second sequence parameter set of the coded video sequence, the video coder may be configured to receive both the first sequence parameter set and the second sequence parameter set prior to receiving either of the first sub-sequence and the second sub-sequence.
- In another example, the video coder may code the first sequence parameter set in a received bitstream prior to the first sub-sequence, and may code the second sequence parameter set in the received bitstream after at least one frame of the one or more frames of the first sub-sequence and prior to the second sub-sequence.
- In another example, to receive the first sequence parameter set and the second sequence parameter set of the coded video sequence, the video coder may be configured to receive the second sequence parameter set after receiving at least one frame of the one or more frames of the first sub-sequence, and prior to receiving the second sub-sequence.
- In yet another example, the video coder may interleave the one or more frames of the first sub-sequence and the one or more frames of the second sub-sequence in the coded video sequence.
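The SPS orderings described in the preceding examples, whether both parameter sets arrive up front or the second arrives mid-stream, lead to the same per-frame behavior: each frame activates the sequence parameter set it references, which yields that frame's resolution. A minimal sketch, with hypothetical names and resolutions:

```python
sps_table = {}   # sps_id -> (width, height) declared by that SPS

def process_sps(sps_id, resolution):
    # An SPS only needs to be processed before the first frame that references it.
    sps_table[sps_id] = resolution

def decode_frame(sps_id):
    # Activating the referenced SPS yields the frame's resolution.
    return sps_table[sps_id]

process_sps(0, (1280, 720))   # SPS for the first sub-sequence
process_sps(1, (640, 360))    # SPS for the second sub-sequence

# Interleaved coded video sequence: frames alternate between sub-sequences,
# and each frame's resolution follows from the SPS it references.
resolutions = [decode_frame(s) for s in [0, 1, 0, 1]]
```

The same sketch covers the non-interleaved orderings: as long as process_sps(1, ...) runs before the first frame referencing SPS 1, it may come before any frame or after frames of the first sub-sequence.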
-
FIG. 7 is a flowchart illustrating an example operation of managing a decoded picture buffer. For purposes of illustration only, the method of FIG. 7 may be performed by a video coder corresponding to either video encoder 20 or video decoder 30. In the method of FIG. 7, a video coder may receive a coded video sequence comprising a first sub-sequence and a second sub-sequence (701). The first sub-sequence may include one or more frames each having a first resolution, and the second sub-sequence may include one or more frames each having a second resolution. The first sub-sequence may be different than the second sub-sequence, and the first resolution may be different than the second resolution. The video coder may receive a first decoded frame of video data, and the first decoded frame may be associated with a first resolution. In some examples, the first resolution may comprise a spatial resolution. - In accordance with the method illustrated in
FIG. 7, the video coder may also determine whether a decoded picture buffer is available to store the first decoded frame based on the first resolution (702). In the event the decoded picture buffer is available to store the first decoded frame, the video coder may store the first decoded frame in the decoded picture buffer, and determine whether the decoded picture buffer is available to store a second decoded frame of video data. The second decoded frame of video data may be associated with a second resolution. The video coder may also determine whether the decoded picture buffer is available to store the second decoded frame based on the first resolution and the second resolution (704). The first decoded frame may also be different than the second decoded frame. - In some examples, to determine whether the decoded picture buffer is available to store the first decoded frame based on the first resolution, the video coder may be configured to determine an amount of information that may be stored within the decoded picture buffer, determine an amount of information associated with the first decoded frame based on the first resolution, and compare the amount of information that may be stored within the decoded picture buffer and the amount of information associated with the first decoded frame.
- In an example, to determine whether the decoded picture buffer is available to store the second decoded frame based on the first resolution and the second resolution, the video coder may be configured to determine an amount of information that may be stored within the decoded picture buffer based on the first resolution, determine an amount of information associated with the second decoded frame based on the second resolution, and compare the amount of information that may be stored within the decoded picture buffer and the amount of information associated with the second decoded frame.
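The capacity comparison in the two preceding paragraphs can be sketched as follows, assuming (as an illustration only) that the amount of information is measured in 8×8 blocks derived from a frame's resolution; the function names are hypothetical.

```python
def blocks_8x8(width, height):
    # Amount of information for a frame, measured in 8x8 blocks.
    return ((width + 7) // 8) * ((height + 7) // 8)

def dpb_can_store(capacity_blocks, fullness_blocks, width, height):
    """Compare the space left in the DPB against the frame's size."""
    return capacity_blocks - fullness_blocks >= blocks_8x8(width, height)

# With 20000 blocks of capacity and 14400 in use, a 640x360 frame
# (3600 blocks) fits, but a 1280x720 frame (14400 blocks) does not.
fits_small = dpb_can_store(20000, 14400, 640, 360)
fits_large = dpb_can_store(20000, 14400, 1280, 720)
```

Because both frames are measured in the same units, the same comparison handles the first decoded frame (based on the first resolution) and the second decoded frame (based on both resolutions, via the remaining capacity).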
- In some examples, the video coder may be further configured to remove the first decoded frame from the decoded picture buffer. The video coder may also be an encoder, e.g.,
encoder 20 of FIGS. 1-2, or a decoder, e.g., decoder 30 of FIGS. 1-2, in some examples. - By way of example, and not limitation, such computer-readable storage media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage, or other magnetic storage devices, flash memory, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer. Also, any connection is properly termed a computer-readable medium. For example, if instructions are transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. It should be understood, however, that computer-readable storage media and data storage media do not include connections, carrier waves, signals, or other transient media, but are instead directed to non-transient, tangible storage media. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
- Instructions may be executed by one or more processors, such as one or more digital signal processors (DSPs), general purpose microprocessors, application specific integrated circuits (ASICs), field programmable logic arrays (FPGAs), or other equivalent integrated or discrete logic circuitry. Accordingly, the term “processor,” as used herein may refer to any of the foregoing structure or any other structure suitable for implementation of the techniques described herein. In addition, in some aspects, the functionality described herein may be provided within dedicated hardware and/or software modules configured for encoding and decoding, or incorporated in a combined codec. Also, the techniques could be fully implemented in one or more circuits or logic elements.
- The techniques of this disclosure may be implemented in a wide variety of devices or apparatuses, including a wireless handset, an integrated circuit (IC) or a set of ICs (e.g., a chip set). Various components, modules, or units are described in this disclosure to emphasize functional aspects of devices configured to perform the disclosed techniques, but do not necessarily require realization by different hardware units. Rather, as described above, various units may be combined in a codec hardware unit or provided by a collection of interoperative hardware units, including one or more processors as described above, in conjunction with suitable software and/or firmware.
- Various examples have been described. These and other examples are within the scope of the following claims.
Claims (35)
1. A method of decoding video data, the method comprising:
receiving a coded video sequence comprising a first sub-sequence and a second sub-sequence, wherein the first sub-sequence includes one or more frames each having a first resolution, and the second sub-sequence includes one or more frames each having a second resolution, and wherein the first sub-sequence is different than the second sub-sequence, and the first resolution is different than the second resolution;
receiving a first sequence parameter set and a second sequence parameter set for the coded video sequence, wherein the first sequence parameter set indicates the first resolution of the one or more frames of the first sub-sequence, and the second sequence parameter set indicates the second resolution of the one or more frames of the second sub-sequence, and wherein the first sequence parameter set is different than the second sequence parameter set; and
using the first sequence parameter set and the second sequence parameter set to decode the coded video sequence.
2. The method of claim 1, wherein the first sequence parameter set and the second sequence parameter set are coded in a received bitstream prior to either the first sub-sequence or the second sub-sequence.
3. The method of claim 1, wherein receiving the first sequence parameter set and the second sequence parameter set of the coded video sequence comprises:
receiving both the first sequence parameter set and the second sequence parameter set prior to receiving either of the first sub-sequence and the second sub-sequence.
4. The method of claim 1, wherein the first sequence parameter set is coded in a received bitstream prior to the first sub-sequence and the second sequence parameter set is coded in the received bitstream after at least one frame of the one or more frames of the first sub-sequence, and prior to the second sub-sequence.
5. The method of claim 1, wherein receiving the first sequence parameter set and the second sequence parameter set of the coded video sequence comprises:
receiving the second sequence parameter set after receiving at least one frame of the one or more frames of the first sub-sequence, and prior to receiving the second sub-sequence.
6. The method of claim 1, wherein the one or more frames of the first sub-sequence and the one or more frames of the second sub-sequence are interleaved in the coded video sequence.
7. The method of claim 1, wherein the first resolution and the second resolution each comprise a spatial resolution.
8. An apparatus for decoding video data, the apparatus comprising a video decoder configured to:
receive a coded video sequence comprising a first sub-sequence and a second sub-sequence, wherein the first sub-sequence includes one or more frames each having a first resolution, and the second sub-sequence includes one or more frames each having a second resolution, and wherein the first sub-sequence is different than the second sub-sequence, and the first resolution is different than the second resolution;
receive a first sequence parameter set and a second sequence parameter set for the coded video sequence, wherein the first sequence parameter set indicates the first resolution of the one or more frames of the first sub-sequence, and the second sequence parameter set indicates the second resolution of the one or more frames of the second sub-sequence, and wherein the first sequence parameter set is different than the second sequence parameter set; and
use the first sequence parameter set and the second sequence parameter set to decode the coded video sequence.
9. The apparatus of claim 8, wherein the first sequence parameter set and the second sequence parameter set are coded in a received bitstream prior to either the first sub-sequence or the second sub-sequence.
10. The apparatus of claim 8, wherein to receive the first sequence parameter set and the second sequence parameter set of the coded video sequence, the apparatus is configured to:
receive both the first sequence parameter set and the second sequence parameter set prior to receiving either of the first sub-sequence and the second sub-sequence.
11. The apparatus of claim 8, wherein the first sequence parameter set is coded in a received bitstream prior to the first sub-sequence and the second sequence parameter set is coded in the received bitstream after at least one frame of the one or more frames of the first sub-sequence, and prior to the second sub-sequence.
12. The apparatus of claim 8, wherein to receive the first sequence parameter set and the second sequence parameter set of the coded video sequence, the apparatus is configured to:
receive the second sequence parameter set after receiving at least one frame of the one or more frames of the first sub-sequence, and prior to receiving the second sub-sequence.
13. The apparatus of claim 8, wherein the one or more frames of the first sub-sequence and the one or more frames of the second sub-sequence are interleaved in the coded video sequence.
14. The apparatus of claim 8, wherein the first resolution and the second resolution each comprise a spatial resolution.
15. An apparatus for decoding video data, the apparatus comprising:
means for receiving a coded video sequence comprising a first sub-sequence and a second sub-sequence, wherein the first sub-sequence includes one or more frames each having a first resolution, and the second sub-sequence includes one or more frames each having a second resolution, and wherein the first sub-sequence is different than the second sub-sequence, and the first resolution is different than the second resolution;
means for receiving a first sequence parameter set and a second sequence parameter set for the coded video sequence, wherein the first sequence parameter set indicates the first resolution of the one or more frames of the first sub-sequence, and the second sequence parameter set indicates the second resolution of the one or more frames of the second sub-sequence, and wherein the first sequence parameter set is different than the second sequence parameter set; and
means for using the first sequence parameter set and the second sequence parameter set to decode the coded video sequence.
16. A computer-readable storage medium comprising instructions that, when executed, cause at least one processor to decode video data, wherein the instructions cause the at least one processor to:
receive a coded video sequence comprising a first sub-sequence and a second sub-sequence, wherein the first sub-sequence includes one or more frames each having a first resolution, and the second sub-sequence includes one or more frames each having a second resolution, and wherein the first sub-sequence is different than the second sub-sequence, and the first resolution is different than the second resolution;
receive a first sequence parameter set and a second sequence parameter set for the coded video sequence, wherein the first sequence parameter set indicates the first resolution of the one or more frames of the first sub-sequence, and the second sequence parameter set indicates the second resolution of the one or more frames of the second sub-sequence, and wherein the first sequence parameter set is different than the second sequence parameter set; and
use the first sequence parameter set and the second sequence parameter set to decode the coded video sequence.
17. A method of encoding video data, the method comprising:
generating a coded video sequence comprising a first sub-sequence and a second sub-sequence, wherein the first sub-sequence includes one or more frames each having a first resolution, and the second sub-sequence includes one or more frames each having a second resolution, and wherein the first sub-sequence is different than the second sub-sequence, and the first resolution is different than the second resolution;
generating a first sequence parameter set and a second sequence parameter set for the video sequence, wherein the first sequence parameter set indicates the first resolution of the one or more frames of the first sub-sequence, and the second sequence parameter set indicates the second resolution of the one or more frames of the second sub-sequence, and wherein the first sequence parameter set is different than the second sequence parameter set; and
transmitting the coded video sequence comprising the first sub-sequence and the second sub-sequence, and the first sequence parameter set and the second sequence parameter set.
18. The method of claim 17, wherein the first sequence parameter set and the second sequence parameter set are coded in a transmitted bitstream prior to either the first sub-sequence or the second sub-sequence.
19. The method of claim 17, wherein transmitting the first sequence parameter set and the second sequence parameter set of the coded video sequence comprises:
transmitting both the first sequence parameter set and the second sequence parameter set prior to transmitting either of the first sub-sequence and the second sub-sequence.
20. The method of claim 17, wherein the first sequence parameter set is coded in a transmitted bitstream prior to the first sub-sequence and the second sequence parameter set is coded in the transmitted bitstream after at least one frame of the one or more frames of the first sub-sequence and prior to the second sub-sequence.
21. The method of claim 17, wherein transmitting the first sequence parameter set and the second sequence parameter set of the coded video sequence comprises:
transmitting the second sequence parameter set after transmitting at least one frame of the one or more frames of the first sub-sequence, and prior to transmitting the second sub-sequence.
22. The method of claim 17, wherein the one or more frames of the first sub-sequence and the one or more frames of the second sub-sequence are interleaved in the coded video sequence.
23. The method of claim 17, wherein the first resolution and the second resolution each comprise a spatial resolution.
24. An apparatus for coding video data, the apparatus comprising a video coder configured to:
generate a coded video sequence comprising a first sub-sequence and a second sub-sequence, wherein the first sub-sequence includes one or more frames each having a first resolution, and the second sub-sequence includes one or more frames each having a second resolution, and wherein the first sub-sequence is different than the second sub-sequence, and the first resolution is different than the second resolution;
generate a first sequence parameter set and a second sequence parameter set for the video sequence, wherein the first sequence parameter set indicates the first resolution of the one or more frames of the first sub-sequence, and the second sequence parameter set indicates the second resolution of the one or more frames of the second sub-sequence, and wherein the first sequence parameter set is different than the second sequence parameter set; and
transmit the coded video sequence comprising the first sub-sequence and the second sub-sequence, and the first sequence parameter set and the second sequence parameter set.
25. The apparatus of claim 24, wherein the first sequence parameter set and the second sequence parameter set are coded in a transmitted bitstream prior to either the first sub-sequence or the second sub-sequence.
26. The apparatus of claim 24, wherein to transmit the first sequence parameter set and the second sequence parameter set of the coded video sequence, the apparatus is configured to:
transmit both the first sequence parameter set and the second sequence parameter set prior to transmitting either of the first sub-sequence and the second sub-sequence.
27. The apparatus of claim 24, wherein the first sequence parameter set is coded in a transmitted bitstream prior to the first sub-sequence and the second sequence parameter set is coded in the transmitted bitstream after at least one frame of the one or more frames of the first sub-sequence and prior to the second sub-sequence.
28. The apparatus of claim 24, wherein to transmit the first sequence parameter set and the second sequence parameter set of the coded video sequence, the apparatus is configured to:
transmit the second sequence parameter set after transmitting at least one frame of the one or more frames of the first sub-sequence, and prior to transmitting the second sub-sequence.
29. The apparatus of claim 24, wherein the one or more frames of the first sub-sequence and the one or more frames of the second sub-sequence are interleaved in the coded video sequence.
30. The apparatus of claim 24, wherein the first resolution and the second resolution each comprise a spatial resolution.
31. An apparatus for encoding video data, the apparatus comprising:
means for generating a coded video sequence comprising a first sub-sequence and a second sub-sequence, wherein the first sub-sequence includes one or more frames each having a first resolution, and the second sub-sequence includes one or more frames each having a second resolution, and wherein the first sub-sequence is different than the second sub-sequence, and the first resolution is different than the second resolution;
means for generating a first sequence parameter set and a second sequence parameter set for the video sequence, wherein the first sequence parameter set indicates the first resolution of the one or more frames of the first sub-sequence, and the second sequence parameter set indicates the second resolution of the one or more frames of the second sub-sequence, and wherein the first sequence parameter set is different than the second sequence parameter set; and
means for transmitting the coded video sequence comprising the first sub-sequence and the second sub-sequence, and the first sequence parameter set and the second sequence parameter set.
32. A computer readable storage medium comprising instructions that, when executed, cause at least one processor of a video encoding device to:
generate a coded video sequence comprising a first sub-sequence and a second sub-sequence, wherein the first sub-sequence includes one or more frames each having a first resolution, and the second sub-sequence includes one or more frames each having a second resolution, and wherein the first sub-sequence is different than the second sub-sequence, and the first resolution is different than the second resolution;
generate a first sequence parameter set and a second sequence parameter set for the video sequence, wherein the first sequence parameter set indicates the first resolution of the one or more frames of the first sub-sequence, and the second sequence parameter set indicates the second resolution of the one or more frames of the second sub-sequence, and wherein the first sequence parameter set is different than the second sequence parameter set; and
transmit the coded video sequence comprising the first sub-sequence and the second sub-sequence, and the first sequence parameter set and the second sequence parameter set.
33. A computer readable storage medium, comprising a data structure stored thereon, the data structure comprising:
a coded video sequence comprising a first sub-sequence and a second sub-sequence, wherein the first sub-sequence includes one or more frames each having a first resolution, and the second sub-sequence includes one or more frames each having a second resolution, and wherein the first sub-sequence is different than the second sub-sequence, and the first resolution is different than the second resolution; and
a first sequence parameter set and a second sequence parameter set for the coded video sequence, wherein the first sequence parameter set indicates the first resolution of the one or more frames of the first sub-sequence, and the second sequence parameter set indicates the second resolution of the one or more frames of the second sub-sequence, and wherein the first sequence parameter set is different than the second sequence parameter set.
34. The computer readable medium of claim 33, wherein the first sequence parameter set and the second sequence parameter set are coded in a bitstream on the data structure prior to either the first sub-sequence or the second sub-sequence.
35. The computer readable medium of claim 33, wherein the first sequence parameter set is coded in a bitstream on the data structure prior to the first sub-sequence and the second sequence parameter set is coded in the bitstream on the data structure after at least one frame of the one or more frames of the first sub-sequence, and prior to the second sub-sequence.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/648,174 US20130089154A1 (en) | 2011-10-10 | 2012-10-09 | Adaptive frame size support in advanced video codecs |
PCT/US2012/059577 WO2013055806A1 (en) | 2011-10-10 | 2012-10-10 | Adaptive frame size support in advanced video codecs |
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201161545525P | 2011-10-10 | 2011-10-10 | |
US201161550276P | 2011-10-21 | 2011-10-21 | |
US13/648,174 US20130089154A1 (en) | 2011-10-10 | 2012-10-09 | Adaptive frame size support in advanced video codecs |
Publications (1)
Publication Number | Publication Date |
---|---|
US20130089154A1 (en) | 2013-04-11
Family
ID=48042061
Family Applications (3)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US13/647,257 Expired - Fee Related US9451284B2 (en) | 2011-10-10 | 2012-10-08 | Efficient signaling of reference picture sets |
US13/648,174 Abandoned US20130089154A1 (en) | 2011-10-10 | 2012-10-09 | Adaptive frame size support in advanced video codecs |
US13/648,189 Abandoned US20130089135A1 (en) | 2011-10-10 | 2012-10-09 | Adaptive frame size support in advanced video codecs |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US13/647,257 Expired - Fee Related US9451284B2 (en) | 2011-10-10 | 2012-10-08 | Efficient signaling of reference picture sets |
Family Applications After (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US13/648,189 Abandoned US20130089135A1 (en) | 2011-10-10 | 2012-10-09 | Adaptive frame size support in advanced video codecs |
Country Status (6)
Country | Link |
---|---|
US (3) | US9451284B2 (en) |
EP (1) | EP2767088A1 (en) |
JP (1) | JP5972984B2 (en) |
KR (1) | KR101569305B1 (en) |
CN (1) | CN103959793A (en) |
WO (3) | WO2013055681A1 (en) |
Cited By (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140003504A1 (en) * | 2012-07-02 | 2014-01-02 | Nokia Corporation | Apparatus, a Method and a Computer Program for Video Coding and Decoding |
US20140036998A1 (en) * | 2011-11-03 | 2014-02-06 | Matthias Narroschke | Quantization parameter for blocks coded in the PCM mode |
US20140185688A1 (en) * | 2012-12-28 | 2014-07-03 | Canon Kabushiki Kaisha | Coding system transform apparatus, coding system transform method, and storage medium |
US20150341654A1 (en) * | 2014-05-22 | 2015-11-26 | Apple Inc. | Video coding system with efficient processing of zooming transitions in video |
US20160086312A1 (en) * | 2013-05-15 | 2016-03-24 | Sony Corporation | Image processing apparatus and image processing method |
US9426462B2 (en) | 2012-09-21 | 2016-08-23 | Qualcomm Incorporated | Indication and activation of parameter sets for video coding |
US20160330453A1 (en) * | 2015-05-05 | 2016-11-10 | Cisco Technology, Inc. | Parameter Set Header |
US20170105018A1 (en) * | 2014-06-10 | 2017-04-13 | Hangzhou Hikvision Digital Technology Co., Ltd. | Image encoding method and device and image decoding method and device |
US9936197B2 (en) | 2011-10-28 | 2018-04-03 | Samsung Electronics Co., Ltd. | Method for inter prediction and device therefor, and method for motion compensation and device therefor |
US20180242015A1 (en) * | 2017-02-23 | 2018-08-23 | Netflix, Inc. | Techniques for selecting resolutions for encoding different shot sequences |
CN108449564A (en) * | 2017-02-16 | 2018-08-24 | 北京视联动力国际信息技术有限公司 | Method and system for supporting adaptive video source resolution in a video decoding chip |
US10666992B2 (en) | 2017-07-18 | 2020-05-26 | Netflix, Inc. | Encoding techniques for optimizing distortion and bitrate |
US20200186795A1 (en) * | 2018-12-07 | 2020-06-11 | Beijing Dajia Internet Information Technology Co., Ltd. | Video coding using multi-resolution reference picture management |
US10742708B2 (en) | 2017-02-23 | 2020-08-11 | Netflix, Inc. | Iterative techniques for generating multiple encoded versions of a media title |
US10798387B2 (en) * | 2016-12-12 | 2020-10-06 | Netflix, Inc. | Source-consistent techniques for predicting absolute perceptual video quality |
CN113228666A (en) * | 2018-12-31 | 2021-08-06 | 华为技术有限公司 | Supporting adaptive resolution change in video coding and decoding |
US11153585B2 (en) | 2017-02-23 | 2021-10-19 | Netflix, Inc. | Optimizing encoding operations when generating encoded versions of a media title |
US11166034B2 (en) | 2017-02-23 | 2021-11-02 | Netflix, Inc. | Comparing video encoders/decoders using shot-based encoding and a perceptual visual quality metric |
US20210392349A1 (en) * | 2019-03-01 | 2021-12-16 | Alibaba Group Holding Limited | Adaptive Resolution Video Coding |
US20220132153A1 (en) * | 2019-01-02 | 2022-04-28 | Tencent America LLC | Adaptive picture resolution rescaling for inter-prediction and display |
US11611768B2 (en) * | 2019-08-06 | 2023-03-21 | Op Solutions, Llc | Implicit signaling of adaptive resolution management based on frame type |
Families Citing this family (47)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR102494145B1 (en) | 2011-09-22 | 2023-01-31 | 엘지전자 주식회사 | Method and apparatus for signaling image information, and decoding method and apparatus using same |
US9451284B2 (en) * | 2011-10-10 | 2016-09-20 | Qualcomm Incorporated | Efficient signaling of reference picture sets |
US20130094774A1 (en) * | 2011-10-13 | 2013-04-18 | Sharp Laboratories Of America, Inc. | Tracking a reference picture based on a designated picture on an electronic device |
US8768079B2 (en) | 2011-10-13 | 2014-07-01 | Sharp Laboratories Of America, Inc. | Tracking a reference picture on an electronic device |
US8855433B2 (en) * | 2011-10-13 | 2014-10-07 | Sharp Kabushiki Kaisha | Tracking a reference picture based on a designated picture on an electronic device |
GB2497914B (en) * | 2011-10-20 | 2015-03-18 | Skype | Transmission of video data |
PL3917140T3 (en) | 2012-01-19 | 2023-12-04 | Vid Scale, Inc. | Method and apparatus for signaling and construction of video coding reference picture lists |
CN104160706B (en) * | 2012-01-20 | 2018-12-28 | 诺基亚技术有限公司 | Method and apparatus for encoding an image, and method and apparatus for decoding a video bitstream |
US11039138B1 (en) | 2012-03-08 | 2021-06-15 | Google Llc | Adaptive coding of prediction modes using probability distributions |
JP2013187905A (en) * | 2012-03-08 | 2013-09-19 | Panasonic Corp | Methods and apparatuses for encoding and decoding video |
US9838706B2 (en) * | 2012-04-16 | 2017-12-05 | Telefonaktiebolaget Lm Ericsson (Publ) | Encoder, decoder and methods thereof for video encoding and decoding |
ES2772028T3 (en) * | 2012-04-16 | 2020-07-07 | Ericsson Telefon Ab L M | Provisions and methods thereof for video processing |
US9420286B2 (en) | 2012-06-15 | 2016-08-16 | Qualcomm Incorporated | Temporal motion vector prediction in HEVC and its extensions |
RS64224B1 (en) * | 2012-06-25 | 2023-06-30 | Huawei Tech Co Ltd | Gradual temporal layer access pictures in video compression |
US9167248B2 (en) | 2012-07-13 | 2015-10-20 | Qualcomm Incorporated | Reference picture list modification for video coding |
US9992490B2 (en) | 2012-09-26 | 2018-06-05 | Sony Corporation | Video parameter set (VPS) syntax re-ordering for easy access of extension parameters |
US9392268B2 (en) | 2012-09-28 | 2016-07-12 | Qualcomm Incorporated | Using base layer motion information |
US20140140406A1 (en) * | 2012-11-16 | 2014-05-22 | General Instrument Corporation | Devices and methods for processing of non-idr related syntax for high efficiency video coding (hevc) |
US10219006B2 (en) | 2013-01-04 | 2019-02-26 | Sony Corporation | JCTVC-L0226: VPS and VPS_extension updates |
US10419778B2 (en) | 2013-01-04 | 2019-09-17 | Sony Corporation | JCTVC-L0227: VPS_extension with updates of profile-tier-level syntax structure |
MY169901A (en) * | 2013-04-12 | 2019-06-13 | Ericsson Telefon Ab L M | Constructing inter-layer reference picture lists |
JP6361866B2 (en) * | 2013-05-09 | 2018-07-25 | サン パテント トラスト | Image processing method and image processing apparatus |
US9648326B2 (en) * | 2013-07-02 | 2017-05-09 | Qualcomm Incorporated | Optimizations on inter-layer prediction signalling for multi-layer video coding |
US11284103B2 (en) | 2014-01-17 | 2022-03-22 | Microsoft Technology Licensing, Llc | Intra block copy prediction with asymmetric partitions and encoder-side search patterns, search ranges and approaches to partitioning |
EP3158734A1 (en) * | 2014-06-19 | 2017-04-26 | Microsoft Technology Licensing, LLC | Unified intra block copy and inter prediction modes |
EP3917146A1 (en) | 2014-09-30 | 2021-12-01 | Microsoft Technology Licensing, LLC | Rules for intra-picture prediction modes when wavefront parallel processing is enabled |
EP3313079B1 (en) * | 2015-06-18 | 2021-09-01 | LG Electronics Inc. | Image filtering method in image coding system |
US20160373763A1 (en) * | 2015-06-18 | 2016-12-22 | Mediatek Inc. | Inter prediction method with constrained reference frame acquisition and associated inter prediction device |
US10623755B2 (en) * | 2016-05-23 | 2020-04-14 | Qualcomm Incorporated | End of sequence and end of bitstream NAL units in separate file tracks |
CN108668132A (en) * | 2018-05-07 | 2018-10-16 | 联发科技(新加坡)私人有限公司 | Method for managing a decoded picture buffer, image decoder, and storage medium |
CN110572712B (en) * | 2018-06-05 | 2021-11-02 | 杭州海康威视数字技术股份有限公司 | Decoding method and device |
BR112021002832A2 (en) | 2018-08-17 | 2021-05-04 | Huawei Technologies Co., Ltd. | reference image management in video encoding |
US11463736B2 (en) * | 2018-09-21 | 2022-10-04 | Sharp Kabushiki Kaisha | Systems and methods for signaling reference pictures in video coding |
US11196988B2 (en) | 2018-12-17 | 2021-12-07 | Apple Inc. | Reference picture management and list construction |
US11303913B2 (en) * | 2019-06-19 | 2022-04-12 | Qualcomm Incorporated | Decoded picture buffer indexing |
US11968374B2 (en) * | 2019-07-03 | 2024-04-23 | Beijing Xiaomi Mobile Software Co., Ltd. | Method and device for coding and decoding |
CN114503581A (en) | 2019-08-06 | 2022-05-13 | Op方案有限责任公司 | Adaptive block-based resolution management |
MX2022001593A (en) * | 2019-08-06 | 2022-03-11 | Op Solutions Llc | Adaptive resolution management signaling. |
WO2021026334A1 (en) * | 2019-08-06 | 2021-02-11 | Op Solutions | Adaptive resolution management signaling |
EP4011084A4 (en) | 2019-08-06 | 2023-08-09 | OP Solutions | Adaptive resolution management prediction rescaling |
US20220353536A1 (en) * | 2019-08-22 | 2022-11-03 | Sharp Kabushiki Kaisha | Systems and methods for signaling picture information in video coding |
BR112022005530A2 (en) * | 2019-09-24 | 2022-06-21 | Fraunhofer Ges Forschung | Decoding apparatus, encoding apparatus, video decoding method and video encoding method |
WO2021091253A1 (en) * | 2019-11-05 | 2021-05-14 | 엘지전자 주식회사 | Slice type-based image/video coding method and apparatus |
MX2022005534A (en) | 2019-11-08 | 2022-08-04 | Op Solutions Llc | Methods and systems for adaptive cropping. |
US11785214B2 (en) * | 2019-11-14 | 2023-10-10 | Mediatek Singapore Pte. Ltd. | Specifying video picture information |
CN115988219B (en) | 2020-01-12 | 2024-01-16 | 华为技术有限公司 | Method and apparatus for coordinating weighted prediction using non-rectangular fusion patterns |
US11533472B2 (en) * | 2020-05-21 | 2022-12-20 | Alibaba Group Holding Limited | Method for reference picture processing in video coding |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100238822A1 (en) * | 2009-03-18 | 2010-09-23 | Kyohei Koyabu | Image processing device, image processing method, information processing device, and information processing method |
US20110216243A1 (en) * | 2010-03-05 | 2011-09-08 | Canon Kabushiki Kaisha | Image processing apparatus capable of extracting frame image data from video data and method for controlling the same |
US20130089135A1 (en) * | 2011-10-10 | 2013-04-11 | Qualcomm Incorporated | Adaptive frame size support in advanced video codecs |
Family Cites Families (27)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
TWI249356B (en) | 2002-11-06 | 2006-02-11 | Nokia Corp | Picture buffering for prediction references and display |
KR101094323B1 (en) | 2003-09-17 | 2011-12-19 | 톰슨 라이센싱 | Adaptive reference picture generation |
FI115589B (en) | 2003-10-14 | 2005-05-31 | Nokia Corp | Encoding and decoding redundant images |
US7400681B2 (en) | 2003-11-28 | 2008-07-15 | Scientific-Atlanta, Inc. | Low-complexity motion vector prediction for video codec with two lists of reference pictures |
US20050254526A1 (en) | 2004-05-12 | 2005-11-17 | Nokia Corporation | Parameter sets update in streaming applications |
KR100883603B1 (en) | 2005-04-13 | 2009-02-13 | 엘지전자 주식회사 | Method and apparatus for decoding video signal using reference pictures |
EP1869888B1 (en) | 2005-04-13 | 2016-07-06 | Nokia Technologies Oy | Method, device and system for effectively coding and decoding of video data |
KR100825743B1 (en) | 2005-11-15 | 2008-04-29 | 한국전자통신연구원 | A method of scalable video coding for varying spatial scalability of bitstream in real time and a codec using the same |
US8582663B2 (en) * | 2006-08-08 | 2013-11-12 | Core Wireless Licensing S.A.R.L. | Method, device, and system for multiplexing of video streams |
WO2008048499A2 (en) * | 2006-10-13 | 2008-04-24 | Thomson Licensing | Reference picture list management syntax for multiple view video coding |
US20080170793A1 (en) | 2007-01-12 | 2008-07-17 | Mitsubishi Electric Corporation | Image encoding device and image encoding method |
JP5026092B2 (en) | 2007-01-12 | 2012-09-12 | 三菱電機株式会社 | Moving picture decoding apparatus and moving picture decoding method |
US20090141809A1 (en) | 2007-12-04 | 2009-06-04 | Sony Corporation And Sony Electronics Inc. | Extension to the AVC standard to support the encoding and storage of high resolution digital still pictures in parallel with video |
KR101431543B1 (en) | 2008-01-21 | 2014-08-21 | 삼성전자주식회사 | Apparatus and method of encoding/decoding video |
WO2010086500A1 (en) | 2009-01-28 | 2010-08-05 | Nokia Corporation | Method and apparatus for video coding and decoding |
WO2010092740A1 (en) | 2009-02-10 | 2010-08-19 | パナソニック株式会社 | Image processing apparatus, image processing method, program and integrated circuit |
US20100218232A1 (en) | 2009-02-25 | 2010-08-26 | Cisco Technology, Inc. | Signalling of auxiliary information that assists processing of video according to various formats |
US20110013692A1 (en) | 2009-03-29 | 2011-01-20 | Cohen Robert A | Adaptive Video Transcoding |
US9008176B2 (en) * | 2011-01-22 | 2015-04-14 | Qualcomm Incorporated | Combined reference picture list construction for video coding |
US9008181B2 (en) * | 2011-01-24 | 2015-04-14 | Qualcomm Incorporated | Single reference picture list utilization for interprediction video coding |
US20120328005A1 (en) * | 2011-06-22 | 2012-12-27 | General Instrument Corporation | Construction of combined list using temporal distance |
MX2013014857A (en) * | 2011-06-30 | 2014-03-26 | Ericsson Telefon Ab L M | Reference picture signaling. |
MY166191A (en) * | 2011-06-30 | 2018-06-07 | Ericsson Telefon Ab L M | Absolute or explicit reference picture signaling |
US20140169449A1 (en) * | 2011-07-05 | 2014-06-19 | Telefonaktiebolaget L M Ericsson (Publ) | Reference picture management for layered video |
ES2800049T3 (en) * | 2011-08-25 | 2020-12-23 | Sun Patent Trust | Procedures and apparatus for encoding and decoding video using an updated buffer description |
ES2625097T3 (en) * | 2011-09-07 | 2017-07-18 | Sun Patent Trust | Image coding method and image coding apparatus |
US9462268B2 (en) | 2012-10-09 | 2016-10-04 | Cisco Technology, Inc. | Output management of prior decoded pictures at picture format transitions in bitstreams |
- 2012-10-08 US US13/647,257 patent/US9451284B2/en not_active Expired - Fee Related
- 2012-10-09 EP EP12784111.2A patent/EP2767088A1/en not_active Withdrawn
- 2012-10-09 US US13/648,174 patent/US20130089154A1/en not_active Abandoned
- 2012-10-09 US US13/648,189 patent/US20130089135A1/en not_active Abandoned
- 2012-10-09 WO PCT/US2012/059346 patent/WO2013055681A1/en active Application Filing
- 2012-10-09 KR KR1020147012508A patent/KR101569305B1/en not_active IP Right Cessation
- 2012-10-09 CN CN201280049743.7A patent/CN103959793A/en active Pending
- 2012-10-09 JP JP2014535786A patent/JP5972984B2/en not_active Expired - Fee Related
- 2012-10-10 WO PCT/US2012/059577 patent/WO2013055806A1/en active Application Filing
- 2012-10-10 WO PCT/US2012/059579 patent/WO2013055808A1/en active Application Filing
Cited By (48)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11962791B2 (en) | 2011-10-28 | 2024-04-16 | Samsung Electronics Co., Ltd. | Method for inter prediction and device therefor, and method for motion compensation and device therefor |
US9936197B2 (en) | 2011-10-28 | 2018-04-03 | Samsung Electronics Co., Ltd. | Method for inter prediction and device therefor, and method for motion compensation and device therefor |
US10575002B2 (en) | 2011-10-28 | 2020-02-25 | Samsung Electronics Co., Ltd. | Method for inter prediction and device therefor, and method for motion compensation and device therefor |
US10819989B2 (en) | 2011-10-28 | 2020-10-27 | Samsung Electronics Co., Ltd. | Method for inter prediction and device therefor, and method for motion compensation and device therefor |
US11206414B2 (en) | 2011-10-28 | 2021-12-21 | Samsung Electronics Co., Ltd. | Method for inter prediction and device therefor, and method for motion compensation and device therefor |
US9774851B2 (en) * | 2011-11-03 | 2017-09-26 | Sun Patent Trust | Quantization parameter for blocks coded in the PCM mode |
US20140036998A1 (en) * | 2011-11-03 | 2014-02-06 | Matthias Narroschke | Quantization parameter for blocks coded in the PCM mode |
US20140003504A1 (en) * | 2012-07-02 | 2014-01-02 | Nokia Corporation | Apparatus, a Method and a Computer Program for Video Coding and Decoding |
US9554146B2 (en) | 2012-09-21 | 2017-01-24 | Qualcomm Incorporated | Indication and activation of parameter sets for video coding |
US9426462B2 (en) | 2012-09-21 | 2016-08-23 | Qualcomm Incorporated | Indication and activation of parameter sets for video coding |
US20140185688A1 (en) * | 2012-12-28 | 2014-07-03 | Canon Kabushiki Kaisha | Coding system transform apparatus, coding system transform method, and storage medium |
US10424050B2 (en) * | 2013-05-15 | 2019-09-24 | Sony Semiconductor Solutions Corporation | Image processing apparatus and image processing method |
US20160086312A1 (en) * | 2013-05-15 | 2016-03-24 | Sony Corporation | Image processing apparatus and image processing method |
US10051281B2 (en) * | 2014-05-22 | 2018-08-14 | Apple Inc. | Video coding system with efficient processing of zooming transitions in video |
US20150341654A1 (en) * | 2014-05-22 | 2015-11-26 | Apple Inc. | Video coding system with efficient processing of zooming transitions in video |
US20170105018A1 (en) * | 2014-06-10 | 2017-04-13 | Hangzhou Hikvision Digital Technology Co., Ltd. | Image encoding method and device and image decoding method and device |
US10659800B2 (en) * | 2014-06-10 | 2020-05-19 | Hangzhou Hikvision Digital Technology Co., Ltd. | Inter prediction method and device |
US20160330453A1 (en) * | 2015-05-05 | 2016-11-10 | Cisco Technology, Inc. | Parameter Set Header |
US11758148B2 (en) | 2016-12-12 | 2023-09-12 | Netflix, Inc. | Device-consistent techniques for predicting absolute perceptual video quality |
US11503304B2 (en) | 2016-12-12 | 2022-11-15 | Netflix, Inc. | Source-consistent techniques for predicting absolute perceptual video quality |
US10834406B2 (en) | 2016-12-12 | 2020-11-10 | Netflix, Inc. | Device-consistent techniques for predicting absolute perceptual video quality |
US10798387B2 (en) * | 2016-12-12 | 2020-10-06 | Netflix, Inc. | Source-consistent techniques for predicting absolute perceptual video quality |
CN108449564A (en) * | 2017-02-16 | 2018-08-24 | 北京视联动力国际信息技术有限公司 | Method and system for supporting adaptive video source resolution in a video decoding chip |
US20180242015A1 (en) * | 2017-02-23 | 2018-08-23 | Netflix, Inc. | Techniques for selecting resolutions for encoding different shot sequences |
US11444999B2 (en) | 2017-02-23 | 2022-09-13 | Netflix, Inc. | Iterative techniques for generating multiple encoded versions of a media title |
US10897618B2 (en) | 2017-02-23 | 2021-01-19 | Netflix, Inc. | Techniques for positioning key frames within encoded video sequences |
US10917644B2 (en) | 2017-02-23 | 2021-02-09 | Netflix, Inc. | Iterative techniques for encoding video content |
CN110313183A (en) * | 2017-02-23 | 2019-10-08 | 奈飞公司 | Iterative techniques for encoding video content |
US11153585B2 (en) | 2017-02-23 | 2021-10-19 | Netflix, Inc. | Optimizing encoding operations when generating encoded versions of a media title |
US11166034B2 (en) | 2017-02-23 | 2021-11-02 | Netflix, Inc. | Comparing video encoders/decoders using shot-based encoding and a perceptual visual quality metric |
US11184621B2 (en) * | 2017-02-23 | 2021-11-23 | Netflix, Inc. | Techniques for selecting resolutions for encoding different shot sequences |
US11870945B2 (en) | 2017-02-23 | 2024-01-09 | Netflix, Inc. | Comparing video encoders/decoders using shot-based encoding and a perceptual visual quality metric |
US11871002B2 (en) | 2017-02-23 | 2024-01-09 | Netflix, Inc. | Iterative techniques for encoding video content |
US10715814B2 (en) | 2017-02-23 | 2020-07-14 | Netflix, Inc. | Techniques for optimizing encoding parameters for different shot sequences |
US11818375B2 (en) | 2017-02-23 | 2023-11-14 | Netflix, Inc. | Optimizing encoding operations when generating encoded versions of a media title |
US11758146B2 (en) | 2017-02-23 | 2023-09-12 | Netflix, Inc. | Techniques for positioning key frames within encoded video sequences |
US10742708B2 (en) | 2017-02-23 | 2020-08-11 | Netflix, Inc. | Iterative techniques for generating multiple encoded versions of a media title |
US10666992B2 (en) | 2017-07-18 | 2020-05-26 | Netflix, Inc. | Encoding techniques for optimizing distortion and bitrate |
US11910039B2 (en) | 2017-07-18 | 2024-02-20 | Netflix, Inc. | Encoding technique for optimizing distortion and bitrate |
US20200186795A1 (en) * | 2018-12-07 | 2020-06-11 | Beijing Dajia Internet Information Technology Co., Ltd. | Video coding using multi-resolution reference picture management |
US20220124317A1 (en) * | 2018-12-07 | 2022-04-21 | Beijing Dajia Internet Information Technology Co., Ltd. | Video coding using multi-resolution reference picture management |
CN113810689A (en) * | 2018-12-07 | 2021-12-17 | 北京达佳互联信息技术有限公司 | Video coding and decoding using multi-resolution reference picture management |
US12022059B2 (en) * | 2018-12-07 | 2024-06-25 | Beijing Dajia Internet Information Technology Co., Ltd. | Video coding using multi-resolution reference picture management |
US11652985B2 (en) | 2018-12-31 | 2023-05-16 | Huawei Technologies Co., Ltd. | Support of adaptive resolution change in video coding |
CN113228666A (en) * | 2018-12-31 | 2021-08-06 | 华为技术有限公司 | Supporting adaptive resolution change in video coding and decoding |
US20220132153A1 (en) * | 2019-01-02 | 2022-04-28 | Tencent America LLC | Adaptive picture resolution rescaling for inter-prediction and display |
US20210392349A1 (en) * | 2019-03-01 | 2021-12-16 | Alibaba Group Holding Limited | Adaptive Resolution Video Coding |
US11611768B2 (en) * | 2019-08-06 | 2023-03-21 | Op Solutions, Llc | Implicit signaling of adaptive resolution management based on frame type |
Also Published As
Publication number | Publication date |
---|---|
JP5972984B2 (en) | 2016-08-17 |
WO2013055808A1 (en) | 2013-04-18 |
WO2013055806A1 (en) | 2013-04-18 |
KR20140093229A (en) | 2014-07-25 |
US20130089134A1 (en) | 2013-04-11 |
JP2014532374A (en) | 2014-12-04 |
EP2767088A1 (en) | 2014-08-20 |
WO2013055681A1 (en) | 2013-04-18 |
US9451284B2 (en) | 2016-09-20 |
US20130089135A1 (en) | 2013-04-11 |
KR101569305B1 (en) | 2015-11-13 |
CN103959793A (en) | 2014-07-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11490119B2 (en) | Decoded picture buffer management | |
US20130089154A1 (en) | Adaptive frame size support in advanced video codecs | |
US9532052B2 (en) | Cross-layer POC alignment for multi-layer bitstreams that may include non-aligned IRAP pictures |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: QUALCOMM INCORPORATED, CALIFORNIA
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:CHEN, YING;WANG, YE-KUI;KARCZEWICZ, MARTA;REEL/FRAME:029472/0887
Effective date: 20121206
STCB | Information on status: application discontinuation |
Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |