CN111327893B - Apparatus, method and computer program for video encoding and decoding


Info

Publication number
CN111327893B
Authority
CN
China
Prior art keywords
residual
sample
decoding
prediction
channel
Prior art date
Legal status
Active
Application number
CN201911101171.2A
Other languages
Chinese (zh)
Other versions
CN111327893A (en)
Inventor
J·莱内玛
Current Assignee
Nokia Technologies Oy
Original Assignee
Nokia Technologies Oy
Priority date
Filing date
Publication date
Application filed by Nokia Technologies Oy
Publication of CN111327893A
Application granted
Publication of CN111327893B

Classifications

    • H — ELECTRICITY
    • H04 — ELECTRIC COMMUNICATION TECHNIQUE
    • H04N — PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 — Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/10 — … using adaptive coding
    • H04N 19/169 — … characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N 19/182 — … the unit being a pixel
    • H04N 19/50 — … using predictive coding
    • H04N 19/503 — … involving temporal prediction
    • H04N 19/51 — Motion estimation or motion compensation
    • H04N 19/513 — Processing of motion vectors
    • H04N 19/102 — … characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N 19/103 — Selection of coding mode or of prediction mode
    • H04N 19/132 — Sampling, masking or truncation of coding units, e.g. adaptive resampling, frame skipping, frame interpolation or high-frequency transform coefficient masking
    • H04N 19/134 — … characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N 19/146 — Data rate or code amount at the encoder output
    • H04N 19/147 — Data rate or code amount at the encoder output according to rate distortion criteria
    • H04N 19/17 — … the unit being an image region, e.g. an object
    • H04N 19/176 — … the region being a block, e.g. a macroblock
    • H04N 19/186 — … the unit being a colour or a chrominance component
    • H04N 19/42 — … characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
    • H04N 19/423 — … characterised by memory arrangements
    • H04N 19/44 — Decoders specially adapted therefor, e.g. video decoders which are asymmetric with respect to the encoder
    • H04N 19/46 — Embedding additional information in the video signal during the compression process
    • H04N 19/70 — … characterised by syntax aspects related to video coding, e.g. related to compression standards
    • H04N 19/85 — … using pre-processing or post-processing specially adapted for video compression
    • H04N 19/88 — … involving rearrangement of data among different coding units, e.g. shuffling, interleaving, scrambling or permutation of pixel data or permutation of transform coefficient data among different blocks
    • H04N 21/00 — Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/20 — Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N 21/21 — Server components or server architectures

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

An apparatus, method and computer program for video encoding and decoding, and a method for motion compensated prediction, the method comprising: determining a residual signal for at least one sample; determining whether the residual signal represents a residual for samples in more than one channel; and if affirmative, applying the residual signal for at least a first sample in a first channel for generating a first reconstructed sample; and applying the residual signal for at least a second sample in a second channel for generating a second reconstructed sample.

Description

Apparatus, method and computer program for video encoding and decoding
Technical Field
The present invention relates to an apparatus, method and computer program for video encoding and decoding.
Background
Video and image samples are typically encoded using a color representation such as YUV or YCbCr, consisting of one luma (luminance) channel and two chroma (chrominance) channels. In these cases, the luma channel, which mainly represents the illumination of the scene, is typically coded at a certain resolution, while the chroma channels, which typically represent differences between certain color components, are typically coded at a second resolution that is lower than that of the luma signal.
The intention of such a differential representation is to de-correlate the color components and to be able to compress the data more efficiently. However, in many cases, there is still some correlation left between channels, which can be used to represent data more efficiently.
Disclosure of Invention
Now, in order to at least alleviate the above problems, an enhanced method for color channel coding and decoding is presented herein.
The method according to the first aspect comprises: determining a residual signal for at least one sample; determining whether the residual signal represents a sample residual for more than one channel; and if affirmative, applying the residual signal for at least a first sample in a first channel for generating a first reconstructed sample; and applying the residual signal for at least a second sample in a second channel for generating a second reconstructed sample.
According to an embodiment, the method further comprises: the combined residual signal is applied for the chrominance channels of a still image or video sequence.
According to an embodiment, the method further comprises: a flag is included in the bitstream for indicating that at least one predefined condition for residual decoding is met.
According to an embodiment, the method further comprises: decoding the combined residual flag; decoding a single residual block in response to the combined residual flag being 1 or true; and applying the residual to both the first channel and the second channel.
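By way of a non-normative illustration, the following Python sketch shows how such a decoding flow could look for two chroma channels. The names decode_flag and decode_residual_block are hypothetical placeholders for the codec's entropy decoding routines, and 8-bit sample values are assumed.

```python
import numpy as np

def reconstruct_chroma_block(bitstream, pred_cb, pred_cr, decode_flag, decode_residual_block):
    """Sketch of combined-residual decoding for two chroma channels.

    decode_flag/decode_residual_block stand in for the codec's entropy
    decoding routines; they are placeholders, not real API calls.
    """
    combined_residual_flag = decode_flag(bitstream)       # 1/true: one residual block is shared
    if combined_residual_flag:
        residual = decode_residual_block(bitstream)       # single residual block
        rec_cb = pred_cb + residual                       # apply to the first channel
        rec_cr = pred_cr + residual                       # apply the same residual to the second channel
    else:
        rec_cb = pred_cb + decode_residual_block(bitstream)   # separate residual per channel
        rec_cr = pred_cr + decode_residual_block(bitstream)
    return np.clip(rec_cb, 0, 255), np.clip(rec_cr, 0, 255)
```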
The apparatus according to the second aspect comprises: means for determining a residual signal for at least one sample; means for determining whether the residual signal represents a residual for samples in more than one channel; means for applying the residual signal for at least a first sample in a first channel for generating a first reconstructed sample in response to the residual signal representing a residual for samples in more than one channel; and means for applying the residual signal for at least a second sample in a second channel for generating a second reconstructed sample.
According to an embodiment, the apparatus further comprises: means for applying the combined residual signal for a chrominance channel of a still image or video sequence.
According to an embodiment, the apparatus further comprises: means for decoding the combined residual flag; means for decoding a single residual block in response to the combined residual flag being 1 or true; and means for applying the residual to both the first channel and the second channel.
According to an embodiment, the apparatus further comprises: means for decoding a coded block flag for the first channel and a coded block flag for the second channel; means for decoding the combined residual flag in response to both coded block flags being 1 or true; means for decoding a single residual block in response to the combined residual flag being 1 or true; and means for applying the single residual block for both the first channel and the second channel.
According to an embodiment, the apparatus further comprises: means for decoding the combined residual flag; means for decoding a single residual block in response to the combined residual flag being 1 or true; means for adding a single residual block to a prediction block in a first channel; and means for subtracting the single residual block from the predicted block in the second channel.
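A minimal sketch of this add/subtract variant, assuming equally sized NumPy arrays for the prediction and residual blocks and an 8-bit sample range:

```python
import numpy as np

def apply_combined_residual_add_sub(pred_first, pred_second, residual, bit_depth=8):
    """Per this embodiment: one decoded residual block is added to the prediction
    of the first channel and subtracted from the prediction of the second channel.
    Array shapes and the clipping range are assumptions for illustration."""
    max_val = (1 << bit_depth) - 1
    rec_first = np.clip(pred_first + residual, 0, max_val)
    rec_second = np.clip(pred_second - residual, 0, max_val)
    return rec_first, rec_second
```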
According to an embodiment, the apparatus further comprises: means for decoding an identifier associated with the first channel; and means for subtracting the residual signal of the second channel from the predicted signal in the first channel.
According to an embodiment, the apparatus further comprises: means for applying the combined residual signal to a subset of blocks determined by bitstream signaling.
According to an embodiment, the apparatus further comprises: means for applying the combined residual signal to the block using prediction modes belonging to a predetermined set of prediction modes.
According to an embodiment, the apparatus further comprises: means for applying the combined residual signal to the block using a residual coding mode belonging to a predetermined set of residual coding modes.
The apparatus according to the third aspect comprises at least one processor and at least one memory having code stored thereon which, when executed by the at least one processor, causes the apparatus to at least perform: determining a residual signal for at least one sample; determining whether the residual signal represents a residual for samples in more than one channel; applying the residual signal for at least a first sample in a first channel for generating a first reconstructed sample in response to the residual signal representing a residual for samples in more than one channel; and applying the residual signal for at least a second sample in a second channel for generating a second reconstructed sample.
As described above, the apparatus and computer-readable storage medium having code stored thereon are thus configured to perform the methods described above and one or more embodiments related thereto.
Drawings
For a better understanding of the present invention, reference will now be made, by way of example, to the accompanying drawings in which:
FIG. 1 schematically illustrates an electronic device employing an embodiment of the invention;
FIG. 2 schematically illustrates a user equipment suitable for employing an embodiment of the present invention;
FIG. 3 further schematically illustrates an electronic device employing an embodiment of the invention, the electronic device being connected using wireless and wired network connections;
FIG. 4 schematically illustrates an encoder suitable for implementing embodiments of the present invention;
FIG. 5 shows a flow chart of a method according to an embodiment of the invention;
FIGS. 6a and 6b show examples of residual coding according to the prior art and according to an embodiment of the invention, respectively;
FIG. 7 shows a schematic diagram of a decoder suitable for implementing an embodiment of the invention; and
FIG. 8 illustrates a schematic diagram of an example multimedia communication system in which various embodiments may be implemented.
Detailed Description
Suitable apparatus and possible mechanisms for implementing the embodiments are described in further detail below. In this regard, reference is first made to FIGS. 1 and 2, where FIG. 1 shows a block diagram of a video coding system according to an example embodiment, as a schematic block diagram of an example apparatus or electronic device 50, which may incorporate a codec according to an embodiment of the present invention. FIG. 2 shows a layout of an apparatus according to an example embodiment. The elements of FIGS. 1 and 2 will be explained next.
The electronic device 50 may be, for example, a mobile terminal or user equipment of a wireless communication system. However, it is to be appreciated that embodiments of the present invention may be implemented within any electronic device or apparatus that may need to encode and decode or encode or decode video images.
The apparatus 50 may include a housing 30 for incorporating and protecting equipment. The device 50 may further include a display 32 in the form of a liquid crystal display. In other embodiments of the invention, the display may be any suitable display technology suitable for displaying images or video. The device 50 may further include a keypad 34. In other embodiments of the invention, any suitable data or user interface mechanism may be employed. For example, the user interface may be implemented as a virtual keyboard or data entry system as part of a touch-sensitive display.
The device may include a microphone 36 or any suitable audio input that may be a digital or analog signal input. The apparatus 50 may further include an audio output device, which in embodiments of the invention may be any of the earpiece 38, speaker, or analog audio or digital audio output connection. The apparatus 50 may also include a battery (or in other embodiments of the invention, the device may be powered by any suitable mobile energy device such as a solar cell, fuel cell or spring motor). The apparatus may further comprise a camera capable of recording or capturing images and/or video. The apparatus 50 may further include an infrared port for short-range line-of-sight communication with other devices. In other embodiments, the apparatus 50 may further include any suitable short-range communication solution, such as, for example, a Bluetooth wireless connection or a USB/firewire (firewire) wired connection.
The apparatus 50 may include a controller 56, a processor, or processor circuitry for controlling the apparatus 50. The controller 56 may be connected to a memory 58, which memory 58 may store data in the form of image and audio data and/or may also store instructions for implementation on the controller 56 in embodiments of the present invention. The controller 56 may further be connected to codec circuitry 54, which codec circuitry 54 is adapted to perform or assist in the encoding and decoding of audio and/or video data performed by the controller.
The apparatus 50 may further include a card reader 48 and a smart card 46, e.g., a UICC and UICC reader, for providing user information and adapted to provide authentication information for authenticating and authorizing a user at the network.
The apparatus 50 may comprise radio interface circuitry 52, which radio interface circuitry 52 is connected to the controller and adapted to generate wireless communication signals for communication with, for example, a cellular communication network, a wireless communication system or a wireless local area network. The apparatus 50 may further include an antenna 44 connected to the radio interface circuitry 52 for transmitting and receiving radio frequency signals generated at the radio interface circuitry 52 to and from other apparatus(s).
The apparatus 50 may include a camera capable of recording or detecting individual frames, which are then passed to a codec 54 or controller for processing. The apparatus may receive video image data from another device for processing prior to transmission and/or storage. The device 50 may also receive images for encoding/decoding either wirelessly or through a wired connection. The structural elements of the apparatus 50 described above represent examples of components for performing the corresponding functions.
With respect to fig. 3, an example of a system is shown in which embodiments of the present invention may be utilized. The system 10 includes a plurality of communication devices that can communicate over one or more networks. The system 10 may include any combination of wired or wireless networks including, but not limited to, wireless cellular telephone networks (such as GSM, UMTS, CDMA networks, etc.), wireless Local Area Networks (WLANs) such as defined by any IEEE 802.X standard, bluetooth personal area networks, ethernet local area networks, token ring local area networks, wide area networks, and the internet.
The system 10 may include wired and wireless communication devices and/or apparatus 50 suitable for implementing embodiments of the present invention.
For example, the system shown in FIG. 3 illustrates a representation of the mobile telephone network 11 and the Internet 28. The connection to the internet 28 may include, but is not limited to, long range wireless connections, short range wireless connections, and various wired connections including, but not limited to, telephone lines, cable lines, power lines, and the like.
Example communication devices shown in system 10 may include, but are not limited to, an electronic device or apparatus 50, a combination of a Personal Digital Assistant (PDA) and a mobile telephone 14, a PDA 16, an Integrated Messaging Device (IMD) 18, a desktop computer 20, and a notebook computer 22. The apparatus 50 may be stationary or mobile when carried by an individual who is moving. The apparatus 50 may also be located in a mode of transport including, but not limited to, an automobile, truck, taxi, bus, train, ship, airplane, bicycle, motorcycle, or any similar suitable mode of transport.
Embodiments may also be implemented in a set top box (i.e., a digital TV receiver that may/may not have a display or wireless functionality), in a tablet computer or (laptop) Personal Computer (PC) (with hardware or software or a combination of encoder/decoder implementations), in various operating systems, and in a chipset, processor, DSP, and/or embedded system that provides hardware/software based encoding.
Some or additional devices may send and receive calls and messages and communicate with the service provider through a wireless connection 25 with a base station 24. The base station 24 may be connected to a network server 26 that allows communication between the mobile telephone network and the internet 28. The system may include additional communication devices and various types of communication devices.
The communication devices may communicate using various transmission techniques including, but not limited to, Code Division Multiple Access (CDMA), Global System for Mobile communications (GSM), Universal Mobile Telecommunications System (UMTS), Time Division Multiple Access (TDMA), Frequency Division Multiple Access (FDMA), Transmission Control Protocol-Internet Protocol (TCP-IP), Short Message Service (SMS), Multimedia Messaging Service (MMS), email, Instant Messaging Service (IMS), Bluetooth, IEEE 802.11, and any similar wireless communication techniques. Communication devices involved in practicing various embodiments of the invention may communicate using various media including, but not limited to, radio, infrared, laser, cable connections, and any suitable connection.
In telecommunications and data networks, a channel may refer to a physical channel or a logical channel. A physical channel may refer to a physical transmission medium such as a wire, while a logical channel may refer to a logical connection over a multiplexed medium capable of carrying several logical channels. A channel may be used to convey an information signal (e.g., a bitstream) from one or several senders (or transmitters) to one or several receivers.
An MPEG-2 Transport Stream (TS), specified in ISO/IEC 13818-1 or equivalently in ITU-T Recommendation H.222.0, is a format for carrying audio, video, and other media as well as program metadata or other metadata in a multiplexed stream. A Packet Identifier (PID) is used to identify an elementary stream (also known as a packetized elementary stream) within the TS. Hence, a logical channel within an MPEG-2 TS may be considered to correspond to a specific PID value.
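As a rough illustration of how one logical channel can be extracted from a TS by its PID, the following sketch assumes plain 188-byte packets and ignores adaptation-field handling:

```python
def filter_ts_packets(ts_bytes, wanted_pid):
    """Minimal sketch: yield payloads of MPEG-2 TS packets carrying a given PID.
    Assumes 188-byte packets and no adaptation field (a simplification)."""
    PACKET_SIZE = 188
    for offset in range(0, len(ts_bytes) - PACKET_SIZE + 1, PACKET_SIZE):
        packet = ts_bytes[offset:offset + PACKET_SIZE]
        if packet[0] != 0x47:                           # sync byte
            continue
        pid = ((packet[1] & 0x1F) << 8) | packet[2]     # 13-bit packet identifier
        if pid == wanted_pid:
            yield packet[4:]                            # payload after the 4-byte header
```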
Available media file format standards include the ISO base media file format (ISO/IEC 14496-12, which may be abbreviated as ISOBMFF) and the file format of NAL unit structured video (ISO/IEC 14496-15), which is derived from ISOBMFF.
Some concepts, structures, and specifications of ISOBMFF are described below as examples of container file formats upon which embodiments may be implemented. Aspects of the invention are not limited to ISOBMFF, but rather are described with respect to one possible basis upon which the invention may be partially or fully implemented.
The basic building blocks of the ISO base media file format are called boxes. Each box has a header and a payload. The box header indicates the type of the box and the size of the box in bytes. A box may enclose other boxes, and the ISO file format specifies which box types are allowed within a box of a certain type. Furthermore, the presence of some boxes may be mandatory in each file, while the presence of other boxes may be optional. Additionally, for some box types, it may be allowable to have more than one box present in a file. Thus, the ISO base media file format may be considered to specify a hierarchical structure of boxes.
According to the ISO family of file formats, a file includes media data and metadata that are encapsulated in boxes. Each box is identified by a four-character code (4CC) and starts with a header that informs about the type and size of the box.
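A minimal sketch of walking such a box structure, assuming the common 32-bit size field together with the 64-bit "largesize" and "to end of file" escapes defined by ISOBMFF:

```python
import struct

def iter_boxes(data, start=0, end=None):
    """Sketch: iterate over ISOBMFF boxes in a byte buffer, yielding (type, payload).
    Handles the common 32-bit size and the 64-bit 'largesize' escape (size == 1)."""
    end = len(data) if end is None else end
    pos = start
    while pos + 8 <= end:
        size, = struct.unpack_from(">I", data, pos)        # 32-bit box size (includes header)
        box_type = data[pos + 4:pos + 8].decode("ascii")   # four-character code (4CC)
        header = 8
        if size == 1:                                      # 64-bit largesize follows the type
            size, = struct.unpack_from(">Q", data, pos + 8)
            header = 16
        elif size == 0:                                    # box extends to the end of the data
            size = end - pos
        yield box_type, data[pos + header:pos + size]
        pos += size
```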
In a file conforming to the ISO base media file format, media data may be provided in a media data 'mdat' box, and a movie 'moov' box may be used to enclose metadata. In some cases, it may be desirable to have 'mdat' and 'moov' boxes present in order for the file to be operational. The movie 'moov' box may include one or more tracks (tracks), and each track may reside in a corresponding track 'trak' box. The track may be one of a number of types, including a media track, which refers to samples formatted according to a media compression format (and its encapsulation to an ISO base media file format).
For example, movie fragments may be used when recording content to ISO files, e.g., in order to avoid losing data if the recording application crashes, runs out of memory space, or some other incident occurs. Without movie fragments, data loss may occur because the file format may require that all metadata, e.g., the movie box, be written in one contiguous area of the file. Furthermore, when recording a file, there may not be a sufficient amount of memory space (e.g., random access memory, RAM) to buffer a movie box for the size of the storage available, and re-computing the contents of the movie box when the movie is closed may be too slow. Moreover, movie fragments may enable simultaneous recording and playback of a file using a regular ISO file parser. Furthermore, a smaller duration of initial buffering may be required for progressive downloading, e.g., simultaneous reception and playback of a file, when movie fragments are used and the initial movie box is smaller compared to a file with the same media content but structured without movie fragments.
The movie fragment feature may enable the splitting of metadata that might otherwise reside in the movie box into multiple fragments. Each segment may correspond to a particular time period of the track. In other words, the movie fragment feature may enable interleaving of file metadata and media data. Thus, the size of the movie box may be limited, and the above-mentioned use cases may be implemented.
In some examples, the media samples for the movie fragments may reside in an mdat box if they are in the same file as the moov box. For the metadata of the movie fragments, however, a moof box may be provided. The moof box may include the information for a certain duration of playback time that would previously have been in the moov box. The moov box may still represent a valid movie on its own, but in addition, it may include an mvex box indicating that movie fragments will follow in the same file. The movie fragments may extend the presentation that is associated with the moov box in time.
Within the movie fragment there may be a set of track fragments, including anywhere from zero to a plurality per track. The track fragments may in turn include anywhere from zero to a plurality of track runs, each of which documents a contiguous run of samples for that track. Within these structures, many fields are optional and can be defaulted. The metadata that may be included in the moof box may be limited to a subset of the metadata that may be included in a moov box, and may be coded differently in some cases. Details regarding the boxes that can be included in a moof box may be found from the ISO base media file format specification. An independent movie fragment may be defined as consisting of a moof box and an mdat box that are consecutive in file order, and where the mdat box contains the samples of the movie fragment (for which the moof box provides the metadata) and does not contain samples of any other movie fragment (i.e., any other moof box).
A track reference mechanism can be used to associate tracks with each other. The TrackReferenceBox includes box(es), each of which provides a reference from the containing track to a set of other tracks. These references are labeled through the box type (i.e., the four-character code) of the contained box(es).
The ISO base media file format contains three mechanisms for timed metadata that can be associated with particular samples: sample groups, timed metadata tracks, and sample auxiliary information. A derived specification may provide similar functionality with one or more of these three mechanisms.
A sample grouping in the ISO base media file format and its derivatives (such as the AVC file format and the SVC file format) may be defined as the assignment of each sample in a track to be a member of one sample group, based on a grouping criterion. A sample group in a sample grouping is not limited to contiguous samples and may contain non-adjacent samples. As there may be more than one sample grouping for the samples in a track, each sample grouping may have a type field to indicate the type of grouping. Sample groupings are represented by two linked data structures: (1) a SampleToGroupBox (sbgp box) represents the assignment of samples to sample groups; and (2) a SampleGroupDescriptionBox (sgpd box) contains a sample group entry for each sample group, describing the properties of the group. There may be multiple instances of the SampleToGroupBox and SampleGroupDescriptionBox based on different grouping criteria. These may be distinguished by a type field used to indicate the type of grouping. The SampleToGroupBox may comprise a grouping_type_parameter field that can be used, e.g., to indicate a sub-type of the grouping.
The Matroska file format is capable of (but not limited to) storing any of video, audio, picture, or subtitle tracks in one file. Matroska may be used as a basis format for derived file formats, such as WebM. Matroska uses the Extensible Binary Meta Language (EBML) as a basis. EBML specifies a binary and octet (byte) aligned format inspired by the principle of XML. EBML itself is a generalized description of the technique of binary markup. A Matroska file consists of elements that make up an EBML "document". Elements incorporate an element ID, a descriptor for the size of the element, and the binary data itself. Elements can be nested. A segment element of Matroska is a container for other top-level (level 1) elements. A Matroska file may comprise (but is not limited to being composed of) one segment. Multimedia data in Matroska files is organized in clusters (or cluster elements), each typically containing a few seconds of multimedia data. A cluster comprises BlockGroup elements, which in turn comprise Block elements. A Cues element comprises metadata that may help in random access or seeking and may include file pointers or respective timestamps for seek points.
A video codec consists of an encoder that transforms the input video into a compressed representation suited for storage/transmission and a decoder that can decompress the compressed video representation back into a viewable form. A video encoder and/or a video decoder may also be separate from each other, i.e., they need not form a codec. Typically, the encoder discards some information in the original video sequence in order to represent the video in a more compact form (that is, at a lower bitrate).
Typical hybrid video encoders, for example many encoder implementations of ITU-T H.263 and H.264, encode the video information in two phases. Firstly, pixel values in a certain picture area (or "block") are predicted, for example by motion compensation means (finding and indicating an area in one of the previously coded video frames that corresponds closely to the block being coded) or by spatial means (using the pixel values around the block to be coded in a specified manner). Secondly, the prediction error, i.e., the difference between the predicted block of pixels and the original block of pixels, is coded. This is typically done by transforming the difference in pixel values using a specified transform, such as the Discrete Cosine Transform (DCT) or a variant of it, quantizing the coefficients, and entropy coding the quantized coefficients. By varying the fidelity of the quantization process, the encoder can control the balance between the accuracy of the pixel representation (picture quality) and the size of the resulting coded video representation (file size or transmission bitrate).
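The following toy sketch illustrates the second phase described above (transform, quantization, and the corresponding reconstruction). The SciPy DCT and the quantization step value are stand-ins for illustration only, not the transform or quantizer of any particular codec:

```python
import numpy as np
from scipy.fft import dctn, idctn  # 2-D DCT used here as a stand-in for the codec's transform

def encode_block(original, prediction, qstep=10.0):
    """Toy sketch of the second phase of a hybrid encoder: transform the prediction
    error with a DCT and quantize the coefficients. qstep is an arbitrary example value."""
    residual = original.astype(np.float64) - prediction      # prediction error
    coeffs = dctn(residual, norm="ortho")                    # transform to the frequency domain
    return np.round(coeffs / qstep).astype(np.int32)         # uniform quantization

def decode_block(levels, prediction, qstep=10.0):
    """Inverse path: dequantize, inverse-transform, and add back the prediction."""
    coeffs = levels.astype(np.float64) * qstep
    residual = idctn(coeffs, norm="ortho")
    return np.clip(prediction + residual, 0, 255)
```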
In temporal prediction, the sources of prediction are previously decoded pictures (also known as reference pictures). In intra block copy (IBC; also known as intra-block-copy prediction), prediction is applied similarly to temporal prediction, but the reference picture is the current picture and only previously decoded samples can be referred to in the prediction process. Inter-layer or inter-view prediction may be applied similarly to temporal prediction, but the reference picture is a decoded picture from another scalability layer or from another view, respectively. In some cases, inter prediction may refer to temporal prediction only, while in other cases inter prediction may refer collectively to temporal prediction and any of intra block copy, inter-layer prediction, and inter-view prediction, provided that they are performed with the same or a similar process as temporal prediction. Inter prediction or temporal prediction may sometimes be referred to as motion compensation or motion-compensated prediction.
Motion compensation can be performed with either full-sample or sub-sample accuracy. In the case of full-sample accurate motion compensation, motion can be represented as motion vectors with integer values for horizontal and vertical displacement, and the motion compensation process effectively copies samples from the reference picture using those displacements. In the case of sub-sample accurate motion compensation, motion vectors are represented by fractional or decimal values for the horizontal and vertical components of the motion vector. In the case where a motion vector refers to a non-integer position in the reference picture, a sub-sample interpolation process is typically invoked to calculate the predicted sample values based on the reference samples and the selected sub-sample position. The sub-sample interpolation process typically consists of horizontal filtering compensating for horizontal offsets with respect to full-sample positions, followed by vertical filtering compensating for vertical offsets with respect to full-sample positions. However, the vertical processing can also be done before the horizontal processing in some environments.
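For illustration, the sketch below performs sub-sample accurate motion compensation with quarter-sample motion vectors, using simple bilinear interpolation in place of the longer separable filters real codecs use; it assumes the area addressed by the motion vector lies inside the reference array:

```python
import numpy as np

def motion_compensate(ref, x, y, mv_x, mv_y, size=8):
    """Sketch of motion compensation at quarter-sample accuracy using bilinear
    interpolation (a simplification of the longer separable filters in real codecs).
    mv_x/mv_y are motion vector components in quarter-sample units."""
    fx, fy = (mv_x & 3) / 4.0, (mv_y & 3) / 4.0         # fractional offsets
    ix, iy = x + (mv_x >> 2), y + (mv_y >> 2)           # integer sample positions
    block = ref[iy:iy + size + 1, ix:ix + size + 1].astype(np.float64)
    # Horizontal filtering for the horizontal offset, then vertical filtering.
    horiz = (1 - fx) * block[:, :-1] + fx * block[:, 1:]
    pred = (1 - fy) * horiz[:-1, :] + fy * horiz[1:, :]
    return pred
```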
Inter prediction, which may also be referred to as temporal prediction, motion compensation, or motion-compensated prediction, reduces temporal redundancy. In inter prediction, the sources of prediction are previously decoded pictures. Intra prediction utilizes the fact that adjacent pixels within the same picture are likely to be correlated. Intra prediction can be performed in the spatial or transform domain, i.e., either sample values or transform coefficients can be predicted. Intra prediction is typically exploited in intra coding, where no inter prediction is applied.
One outcome of the coding procedure is a set of coding parameters, such as motion vectors and quantized transform coefficients. Many parameters can be entropy coded more efficiently if they are predicted first from spatially or temporally neighbouring parameters. For example, a motion vector may be predicted from spatially adjacent motion vectors, and only the difference relative to the motion vector predictor may be coded. Prediction of coding parameters and intra prediction may be collectively referred to as in-picture prediction.
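A small sketch of such differential motion vector coding with a median predictor (as used, for example, in H.264/AVC); the particular neighbour choice and values are illustrative only:

```python
def median_mv_predictor(mv_left, mv_above, mv_above_right):
    """Sketch of a median motion vector predictor formed from three neighbouring
    blocks; each argument is an (mvx, mvy) tuple."""
    def median3(a, b, c):
        return a + b + c - min(a, b, c) - max(a, b, c)
    return tuple(median3(mv_left[i], mv_above[i], mv_above_right[i]) for i in range(2))

# Differential coding: only the difference to the predictor is entropy coded.
mv = (14, -3)
pred = median_mv_predictor((12, -2), (16, -4), (15, -3))
mvd = (mv[0] - pred[0], mv[1] - pred[1])   # motion vector difference sent in the bitstream
```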
Fig. 4 shows a block diagram of a video encoder suitable for employing an embodiment of the invention. Fig. 4 presents an encoder for two layers, but it is to be understood that the presented encoder can be similarly extended to encode more than two layers. Fig. 4 illustrates an embodiment of a video encoder comprising a first encoder section 500 for a base layer and a second encoder section 502 for an enhancement layer. Each of the first encoder section 500 and the second encoder section 502 may include similar elements for encoding an input picture. The encoder sections 500, 502 may include pixel predictors 302, 402, prediction error encoders 303, 403, and prediction error decoders 304, 404. Fig. 4 also shows an embodiment of the pixel predictors 302, 402 as comprising: the inter predictors 306, 406, the intra predictors 308, 408, the mode selectors 310, 410, the filters 316, 416, and the reference frame memories 318, 418. The pixel predictor 302 of the first encoder section 500 receives 300 a base layer picture of the video stream that is to be encoded at the inter predictor 306 (determining the difference between the picture and the motion compensated reference frame 318) and the intra predictor 308 (determining the prediction of the picture block based only on the current frame or the processed portion of the picture). The outputs of both the inter predictor and the intra predictor are passed to a mode selector 310. The intra predictor 308 may have more than one intra prediction mode. Thus, each mode may perform intra prediction and provide a predicted signal to the mode selector 310. The mode selector 310 also receives a copy of the base layer picture 300. Accordingly, the pixel predictor 402 of the second encoder section 502 receives 400 enhancement layer images of the video stream that are to be encoded at the inter predictor 406 (determining the difference between the image and the motion compensated reference frame 418) and the intra predictor 408 (determining the prediction of the image block based only on the current frame or the processed portion of the image). The outputs of both the inter predictor and the intra predictor are passed to a mode selector 410. The intra predictor 408 may have more than one intra prediction mode. Thus, each mode may perform intra prediction and provide a predicted signal to the mode selector 410. The mode selector 410 also receives a copy of the enhancement layer picture 400.
Depending on which coding mode is selected to code the current block, the output of the inter predictor 306, 406 or the output of one of the alternative intra predictor modes or the output of the surface encoder within the mode selector is passed to the output of the mode selector 310, 410. The output of the mode selector is passed to a first summing device 321, 421. The first summing device may subtract the output of the pixel predictors 302, 402 from the base layer picture 300/enhancement layer picture 400 to generate a first prediction error signal 320, 420 that is input to the prediction error encoder 303, 403.
The pixel predictors 302, 402 further receive a combination of the predicted representation of the image block 312, 412 from the preliminary reconstructor 339, 439 and the output 338, 438 of the prediction error decoder 304, 404. The preliminary reconstructed images 314, 414 may be passed to intra predictors 308, 408 and filters 316, 416. Filters 316, 416 receiving the preliminary representation may filter the preliminary representation and output final reconstructed images 340, 440, which final reconstructed images 340, 440 may be stored in reference frame memories 318, 418. The reference frame memory 318 may be coupled to the inter-predictor 306 to be used as a reference image for comparison with future base layer pictures 300 in inter-prediction operations. According to some embodiments, in case the base layer is selected and indicated as a source for inter-layer sample prediction and/or inter-layer motion information prediction of the enhancement layer, the reference frame memory 318 may also be connected to the inter-frame predictor 406 to be used as a reference picture for comparison with future enhancement layer pictures 400 in inter-frame prediction operations. Also, a reference frame memory 418 may be connected to the inter predictor 406 to be used as a reference image for comparison with future enhancement layer pictures 400 in an inter prediction operation.
According to some embodiments, in case the base layer is selected and indicated as a source for predicting the filtering parameters of the enhancement layer, the filtering parameters from the filter 316 of the first encoder section 500 may be provided to the second encoder section 502.
The prediction error encoder 303, 403 comprises a transform unit 342, 442 and a quantizer 344, 444. The transform units 342, 442 transform the first prediction error signals 320, 420 into the transform domain. The transform is, for example, a DCT transform. The quantizers 344, 444 quantize the transform domain signals (e.g., DCT coefficients) to form quantized coefficients.
The prediction error decoder 304, 404 receives the output from the prediction error encoder 303, 403 and performs inverse processing of the prediction error encoder 303, 403 to produce a decoded prediction error signal 338, 438, which when combined with the predicted representation of the image block 312, 412 at the second summing device 339, 439 produces a preliminary reconstructed image 314, 414. The prediction error decoder may be considered to include dequantizers 361, 461 dequantizing quantized coefficient values (e.g., DCT coefficients) to reconstruct the transformed signal, and inverse transforming units 363, 463 performing an inverse transformation on the reconstructed transformed signal, wherein the output of the inverse transforming units 363, 463 contains the reconstructed block(s). The prediction error decoder may further comprise a block filter that may filter the reconstructed block(s) based on the further decoded information and the filter parameters.
The entropy encoders 330, 430 receive the outputs of the prediction error encoders 303, 403 and may perform appropriate entropy encoding/variable length encoding on the signals to provide error detection and correction capabilities. The outputs of the entropy encoders 330, 430 may be inserted into the bit stream, for example, by the multiplexer 508.
Entropy encoding/decoding may be performed in a number of ways. For example, context-based encoding/decoding may be applied, wherein both the encoder and decoder modify the context state of the encoding parameters based on previously encoded/decoded encoding parameters. The context-based coding may be, for example, context-adaptive binary arithmetic coding (CABAC) or context-based variable length coding (CAVLC) or any similar entropy coding. The entropy encoding/decoding may alternatively or additionally be performed using a variable length encoding scheme such as Huffman encoding/decoding or Exp-Golomb encoding/decoding. Decoding of coding parameters from entropy coded bitstreams or codewords may be referred to as parsing.
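As an illustration of one of the variable length coding schemes mentioned above, the following sketch encodes and decodes unsigned Exp-Golomb (ue(v)) codewords on a bit-string representation:

```python
def encode_exp_golomb(value):
    """Unsigned Exp-Golomb code (ue(v)) as used in H.264/AVC and HEVC syntax:
    value -> '0' * k + binary(value + 1), where k = len(binary(value + 1)) - 1."""
    code = bin(value + 1)[2:]            # binary representation of value + 1
    return "0" * (len(code) - 1) + code  # leading-zero prefix of equal length

def decode_exp_golomb(bits, pos=0):
    """Decode one ue(v) codeword from a bit string starting at pos; returns (value, new_pos)."""
    zeros = 0
    while bits[pos + zeros] == "0":
        zeros += 1
    value = int(bits[pos + zeros:pos + 2 * zeros + 1], 2) - 1
    return value, pos + 2 * zeros + 1

# Example: 0 -> '1', 1 -> '010', 2 -> '011', 3 -> '00100'
assert encode_exp_golomb(3) == "00100"
assert decode_exp_golomb("00100")[0] == 3
```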
The H.264/AVC standard was developed by the Joint Video Team (JVT) of the Video Coding Experts Group (VCEG) of the Telecommunication Standardization Sector of the International Telecommunication Union (ITU-T) and the Moving Picture Experts Group (MPEG) of the International Organization for Standardization (ISO)/International Electrotechnical Commission (IEC). The H.264/AVC standard is published by both parent standardization organizations and is referred to as ITU-T Recommendation H.264 and ISO/IEC International Standard 14496-10, also known as MPEG-4 Part 10 Advanced Video Coding (AVC). There have been multiple versions of the H.264/AVC standard, integrating new extensions or features into the specification. These extensions include Scalable Video Coding (SVC) and Multiview Video Coding (MVC).
Version 1 of the High Efficiency Video Coding (H.265/HEVC, also known as HEVC) standard was developed by the Joint Collaborative Team on Video Coding (JCT-VC) of VCEG and MPEG. The standard is published by both parent standardization organizations and is referred to as ITU-T Recommendation H.265 and ISO/IEC International Standard 23008-2, also known as MPEG-H Part 2 High Efficiency Video Coding (HEVC). Later versions of H.265/HEVC included scalable, multiview, fidelity range extension, three-dimensional, and screen content coding extensions, which may be abbreviated SHVC, MV-HEVC, REXT, 3D-HEVC, and SCC, respectively.
SHVC, MV-HEVC, and 3D-HEVC use a common basis specification, specified in Annex F of version 2 of the HEVC standard. This common basis comprises, for example, high-level syntax and semantics, e.g., specifying some of the characteristics of the layers of the bitstream, such as inter-layer dependencies, as well as decoding processes, such as reference picture list construction including inter-layer reference pictures and picture order count derivation for multi-layer bitstreams. Annex F may also be used in potential subsequent multi-layer extensions of HEVC. It is to be understood that even though a video encoder, a video decoder, encoding methods, decoding methods, bitstream structures, and/or embodiments may be described below with reference to specific extensions, such as SHVC and/or MV-HEVC, they are generally applicable to any multi-layer extension of HEVC, and even more generally to any multi-layer video coding scheme.
Some key definitions, bit streams and coding structures and concepts of h.264/AVC and HEVC are described in this section as examples of video encoders, decoders, coding methods, decoding methods and bit stream structures, where embodiments may be implemented. Some key definitions, bit stream and coding structures and concepts of h.264/AVC are the same as in HEVC, and thus will be described in conjunction below. The various aspects of the invention are not limited to h.264/AVC or HEVC, but rather are described with respect to one possible basis on which the invention may be implemented in part or in whole.
Similar to many early video coding standards, bitstream syntax and semantics and decoding procedures for error free bitstreams are specified in h.264/AVC and HEVC. The encoding process is not specified, but the encoder must generate a consistent bit stream (conforming bitstream). The Hypothetical Reference Decoder (HRD) can be utilized to verify bitstream and decoder consistency. These standards contain coding tools that help to account for transmission errors and losses, but the use of such tools in coding is optional and the decoding process has not been specified for erroneous bitstreams.
The elementary unit for the input to an H.264/AVC or HEVC encoder and the output of an H.264/AVC or HEVC decoder, respectively, is a picture. A picture given as an input to an encoder may also be referred to as a source picture, and a picture decoded by a decoder may be referred to as a decoded picture.
The source picture and the decoded picture are each composed of one or more sample arrays, such as one of the following sample array sets:
Luma (Y) only (monochrome).
Luma and two chroma (YCbCr or YCgCo).
Green, Blue and Red (GBR, also known as RGB).
Arrays representing other unspecified monochrome or tri-stimulus color samplings (for example, YZX, also known as XYZ).
In the following, these arrays may be referred to as luma (or L or Y) and chroma, where the two chroma arrays may be referred to as Cb and Cr, regardless of the actual color representation method in use. The actual color representation method in use can be indicated, e.g., in a coded bitstream, for example using the Video Usability Information (VUI) syntax of H.264/AVC and/or HEVC. A component may be defined as an array or a single sample from one of the three sample arrays (luma and two chroma), or the array or a single sample of the array that composes a picture in monochrome format.
In H.264/AVC and HEVC, a picture may either be a frame or a field. A frame comprises a matrix of luma samples and possibly the corresponding chroma samples. A field is a set of alternate sample rows of a frame and may be used as an encoder input when the source signal is interlaced. Chroma sample arrays may be absent (and hence monochrome sampling may be in use), or the chroma sample arrays may be subsampled when compared to the luma sample arrays. The chroma formats may be summarized as follows (a small dimension-computation sketch follows the list):
In monochrome sampling, there is only one sample array, which may be nominally considered the luma array.
In 4:2:0 sampling, each of the two chroma arrays has half the height and half the width of the luma array.
In 4:2:2 sampling, each of the two chroma arrays has the same height and half the width of the luma array.
In 4:4:4 sampling when no separate color planes are in use, each of the two chroma arrays has the same height and width as the luma array.
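For illustration, the chroma array dimensions implied by each format in the list above can be computed as follows (even luma dimensions are assumed):

```python
def chroma_dimensions(luma_width, luma_height, chroma_format):
    """Chroma sample-array dimensions implied by the chroma format (see the list above).
    'monochrome' has no chroma arrays at all."""
    if chroma_format == "monochrome":
        return None
    if chroma_format == "4:2:0":
        return luma_width // 2, luma_height // 2   # half width, half height
    if chroma_format == "4:2:2":
        return luma_width // 2, luma_height        # half width, same height
    if chroma_format == "4:4:4":
        return luma_width, luma_height             # same width and height
    raise ValueError(chroma_format)

print(chroma_dimensions(1920, 1080, "4:2:0"))      # (960, 540)
```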
In H.264/AVC and HEVC, it is possible to code sample arrays as separate color planes into the bitstream and, respectively, to decode separately coded color planes from the bitstream. When separate color planes are in use, each one of them is separately processed (by the encoder and/or the decoder) as a picture with monochrome sampling.
A partition may be defined as dividing a collection into subsets such that each element of the collection is in exactly one of the subsets.
When describing the operation of HEVC encoding and/or decoding, the following terms may be used. A coding block may be defined as an NxN block of samples for some value of N such that the division of a coding tree block into coding blocks is a partitioning. A coding tree block (CTB) may be defined as an NxN block of samples for some value of N such that the division of a component into coding tree blocks is a partitioning. A coding tree unit (CTU) may be defined as a coding tree block of luma samples, two corresponding coding tree blocks of chroma samples of a picture that has three sample arrays, or a coding tree block of samples of a monochrome picture or a picture that is coded using three separate color planes and syntax structures used to code the samples. A coding unit (CU) may be defined as a coding block of luma samples, two corresponding coding blocks of chroma samples of a picture that has three sample arrays, or a coding block of samples of a monochrome picture or a picture that is coded using three separate color planes and syntax structures used to code the samples. A CU with the maximum allowed size may be named an LCU (largest coding unit) or a coding tree unit (CTU), and the video picture is divided into non-overlapping LCUs.
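As a small illustration of the division of a picture into non-overlapping LCUs/CTUs, the sketch below computes the CTU grid for a picture, assuming a typical 64x64 CTU size and that CTUs at the right and bottom edges may extend beyond the picture:

```python
import math

def ctu_grid(pic_width, pic_height, ctu_size=64):
    """Sketch: number of CTUs (LCUs) covering a picture; 64 is a typical HEVC CTU size."""
    ctus_x = math.ceil(pic_width / ctu_size)
    ctus_y = math.ceil(pic_height / ctu_size)
    return ctus_x, ctus_y, ctus_x * ctus_y

print(ctu_grid(1920, 1080))   # (30, 17, 510)
```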
The CU includes: one or more Prediction Units (PUs) defining a prediction process for samples within a CU, and one or more Transform Units (TUs) defining a prediction error coding process for samples in the CU. Typically, a CU consists of square sample blocks of a selectable size from a predefined set of possible CU sizes. Each PU and TU may be further partitioned into smaller PUs and TUs to increase the granularity of the prediction process and the prediction error coding process, respectively. Each PU has prediction information associated with it that defines which prediction is to be applied to pixels within the PU (e.g., motion vector information of the PU for inter prediction and intra prediction directionality information of the PU for intra prediction).
Each TU may be associated with information describing a prediction error decoding process for samples within the TU, including, for example, DCT coefficient information. Whether prediction error coding is applied for each CU is typically signaled at the CU level. In the case where there is no prediction error residual associated with a CU, it may be considered that there is no TU for that CU. The splitting of pictures into CUs and the splitting of CUs into PUs and TUs is typically signaled in the bitstream to allow the decoder to recreate the intended structure of these units.
In HEVC, a picture can be partitioned into tiles, which are rectangular and contain an integer number of LCUs. In HEVC, the partitioning into tiles forms a regular grid, where the heights and widths of tiles differ from each other by at most one LCU. In HEVC, a slice is defined as an integer number of coding tree units contained in one independent slice segment and all subsequent dependent slice segments (if any) that precede the next independent slice segment (if any) within the same access unit. In HEVC, a slice segment is defined as an integer number of coding tree units ordered consecutively in the tile scan and contained in a single NAL unit. The division of each picture into slice segments is a partitioning. In HEVC, an independent slice segment is defined as a slice segment for which the values of the syntax elements of the slice segment header are not inferred from the values for a preceding slice segment, and a dependent slice segment is defined as a slice segment for which the values of some syntax elements of the slice segment header are inferred from the values for the preceding independent slice segment in decoding order. In HEVC, a slice header is defined as the slice segment header of the independent slice segment that is the current slice segment or the independent slice segment that precedes the current dependent slice segment, and a slice segment header is defined as a part of a coded slice segment containing the data elements pertaining to the first or all coding tree units represented in the slice segment. The CUs are scanned in the raster scan order of LCUs within tiles, or within a picture if tiles are not in use. Within an LCU, the CUs have a specific scan order.
A Motion Constrained Tile Set (MCTS) is such that the inter prediction process is constrained in encoding so that no sample value outside the motion constrained tile set, and no sample value at a fractional sample position derived using one or more sample values outside the motion constrained tile set, is used for inter prediction of any sample within the motion constrained tile set. In addition, the encoding of an MCTS is constrained in such a way that motion vector candidates are not derived from blocks outside the MCTS. This may be enforced by turning off temporal motion vector prediction of HEVC, or by disallowing the encoder to use the TMVP candidate, or any motion vector prediction candidate following the TMVP candidate in the merge or AMVP candidate list, for PUs located directly left of the right tile boundary of the MCTS, except the last one at the bottom-right corner of the MCTS. In general, an MCTS may be defined as a tile set that is independent of any sample values and coded data, such as motion vectors, outside the MCTS. In some cases, an MCTS may be required to form a rectangular area. It should be appreciated that, depending on the context, an MCTS may refer to a tile set within a picture or to the corresponding tile set in a sequence of pictures. The corresponding tile sets may, but in general need not, be collocated in the sequence of pictures.
It is noted that sample locations used in inter prediction may be saturated by the encoding and/or decoding process, such that a location that would otherwise lie outside the picture is saturated to point to the corresponding boundary sample of the picture. Thus, if a tile boundary is also a picture boundary, in some use cases the encoder may allow motion vectors to effectively cross that boundary, or allow a motion vector to cause fractional sample interpolation that would refer to a location outside that boundary, since the sample locations are saturated onto the boundary. In other use cases, specifically if a coded tile may be extracted from a bitstream where it is located adjacent to a picture boundary into another bitstream where the tile is located in a position not adjacent to a picture boundary, the encoder may constrain the motion vectors on picture boundaries in the same way as on any MCTS boundary.
Temporal motion constrained tile set SEI messages of HEVC may be used to indicate the presence of motion constrained tile sets in a bitstream.
The decoder reconstructs the output video by applying a prediction approach similar to the encoder to form a prediction representation of the pixel block (using motion or spatial information created by the encoder and stored in the compressed representation) and prediction error decoding (the inverse operation of the prediction error encoding restores the quantized prediction error signal in the spatial pixel domain). After applying the prediction and prediction error decoding approach, the decoder sums the prediction and prediction error signals (pixel values) to form an output video frame. The decoder (and encoder) may also apply additional filtering to improve the quality of the output video before delivering it for display and/or storage as a prediction reference for the upcoming frames in the video sequence.
For example, the filtering may include one or more of the following: deblocking, sample Adaptive Offset (SAO), and/or Adaptive Loop Filtering (ALF). h.264/AVC includes deblocking, while HEVC includes deblocking and SAO.
In a typical video codec, motion information is indicated with motion vectors associated with each motion compensated image block, such as a prediction unit. Each of these motion vectors represents the displacement of the image block in the picture to be encoded (on the encoder side) or decoded (on the decoder side) relative to the prediction source block in one of the previously encoded or decoded pictures. In order to represent motion vectors efficiently, they are typically coded differentially with respect to block-specific predicted motion vectors. In a typical video codec, the predicted motion vectors are created in a predefined way, for example by calculating the median of the encoded or decoded motion vectors of neighboring blocks. Another way to create motion vector predictions is to generate a list of candidate predictions from neighboring blocks and/or co-located blocks in temporal reference pictures and to signal the selected candidate as the motion vector predictor. In addition to predicting motion vector values, it is also possible to predict which reference picture(s) are used for motion compensated prediction, and this prediction information may be represented, for example, by a reference index of a previously encoded/decoded picture. The reference index is typically predicted from neighboring blocks and/or co-located blocks in a temporal reference picture. Moreover, typical high-efficiency video codecs employ an additional motion information coding/decoding mechanism, commonly called merging/merge mode, in which all motion field information (including a motion vector and a corresponding reference picture index for each available reference picture list) is predicted and used without any modification/correction. Likewise, predicting the motion field information is carried out using the motion field information of neighboring blocks and/or co-located blocks in temporal reference pictures, and the motion field information used is signaled among a list of motion field candidates filled with the motion field information of the available neighboring/co-located blocks.
In a typical video codec, the motion compensated prediction residual is first transformed using a transform kernel (e.g., DCT) and then encoded. The reason for this is that there is still typically some correlation between the residuals, and in many cases the transformation may help reduce this correlation and provide more efficient coding.
Typical video encoders utilize Lagrangian cost functions to find the best coding mode, e.g., the desired coding mode for a block and associated motion vector. Such a cost function uses a weight factor λ to relate the (exact or estimated) image distortion due to the lossy encoding method to the amount of (exact or estimated) information needed to represent the pixel values in the image region:
C=D+λR, (1)
where C is Lagrangian cost to be minimized, D is image distortion (e.g., mean square error) that takes into account mode and motion vectors, and R is the number of bits (including the amount of data representing candidate motion vectors) required to reconstruct the data of the image block in the decoder.
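As an illustration of how equation (1) may be used for mode selection, the following sketch compares hypothetical candidate modes; the distortion metric (sum of squared errors over flat sample lists) and the bit-count estimates are placeholders rather than part of any standard.

```python
# Minimal sketch of Lagrangian rate-distortion mode selection (equation (1)).
# Candidate modes, distortion metric and rate estimates are illustrative only.

def sse(block_a, block_b):
    """Sum of squared errors between two equally sized (flattened) sample blocks."""
    return sum((a - b) ** 2 for a, b in zip(block_a, block_b))

def choose_mode(original, candidates, lmbda):
    """Pick the candidate minimizing C = D + lambda * R.

    candidates: iterable of (mode_name, reconstructed_block, estimated_bits).
    """
    best, best_cost = None, float("inf")
    for mode, reconstruction, bits in candidates:
        cost = sse(original, reconstruction) + lmbda * bits
        if cost < best_cost:
            best, best_cost = mode, cost
    return best, best_cost
```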
Video coding standards and specifications may allow an encoder to divide a coded picture into coded slices or the like. In-picture prediction is typically disabled across slice boundaries; thus, slices can be regarded as a way to split a coded picture into independently decodable pieces, and a slice is therefore often regarded as an elementary unit for transmission. In many cases, the encoder may indicate in the bitstream which types of in-picture prediction are turned off across slice boundaries, and the decoder operation may take this information into account, for example when deducing which prediction sources are available. For example, samples from a neighboring CU may be regarded as unavailable for intra prediction if the neighboring CU resides in a different slice.
The elementary unit for the output of an H.264/AVC or HEVC encoder and for the input of an H.264/AVC or HEVC decoder, respectively, is a Network Abstraction Layer (NAL) unit. NAL units may be encapsulated into packets or similar structures for transport over packet-oriented networks or for storage into structured files. A byte stream format has been specified in H.264/AVC and HEVC for transmission or storage environments that do not provide framing structures. The byte stream format separates NAL units from each other by attaching a start code in front of each NAL unit. To avoid false detection of NAL unit boundaries, encoders run a byte-oriented start code emulation prevention algorithm, which adds an emulation prevention byte to the NAL unit payload if a start code would have occurred otherwise. In order to enable straightforward gateway operation between packet-oriented and stream-oriented systems, start code emulation prevention may always be performed, regardless of whether the byte stream format is in use. A NAL unit may be defined as a syntax structure containing an indication of the type of data to follow and bytes containing that data in the form of an RBSP, interspersed as necessary with emulation prevention bytes. A raw byte sequence payload (RBSP) may be defined as a syntax structure containing an integer number of bytes that is encapsulated in a NAL unit. An RBSP is either empty or has the form of a string of data bits containing syntax elements followed by an RBSP stop bit and zero or more subsequent bits equal to 0.
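A minimal sketch of the byte-oriented start code emulation prevention described above: an emulation prevention byte (0x03) is inserted whenever two consecutive zero bytes would otherwise be followed by a byte value of 0x03 or less, so that the start code patterns cannot occur inside the NAL unit payload.

```python
def add_emulation_prevention(rbsp: bytes) -> bytes:
    """Insert emulation prevention bytes (0x03) so that the byte patterns
    0x000000, 0x000001, 0x000002 and 0x000003 never occur in the payload."""
    out = bytearray()
    zero_run = 0
    for byte in rbsp:
        if zero_run >= 2 and byte <= 0x03:
            out.append(0x03)  # emulation prevention byte
            zero_run = 0
        out.append(byte)
        zero_run = zero_run + 1 if byte == 0x00 else 0
    return bytes(out)
```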
The NAL unit consists of a header and a payload. In H.264/AVC and HEVC, the NAL unit header indicates the type of NAL unit.
In HEVC, a two-byte NAL unit header is used for all specified NAL unit types. The NAL unit header contains one reserved bit, a six-bit NAL unit type indication, a six-bit nuh_layer_id syntax element, and a three-bit nuh_temporal_id_plus1 indication for the temporal level (which may be required to be greater than or equal to 1). The nuh_temporal_id_plus1 syntax element may be regarded as a temporal identifier for the NAL unit, and a zero-based TemporalId variable may be derived as follows: TemporalId = nuh_temporal_id_plus1 - 1. The abbreviation TID may be used interchangeably with the TemporalId variable. TemporalId equal to 0 corresponds to the lowest temporal level. The value of nuh_temporal_id_plus1 is required to be non-zero in order to avoid start code emulation involving the two NAL unit header bytes. The bitstream created by excluding all VCL NAL units having a TemporalId greater than or equal to a selected value and including all other VCL NAL units remains conforming. Consequently, a picture having TemporalId equal to tid_value does not use any picture having a TemporalId greater than tid_value as an inter prediction reference. A sub-layer or temporal sub-layer may be defined as a temporal scalable layer (or temporal layer, TL) of a temporally scalable bitstream, consisting of VCL NAL units with a particular value of the TemporalId variable and the associated non-VCL NAL units. nuh_layer_id can be understood as a scalability layer identifier.
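For illustration, a sketch of parsing the two-byte HEVC NAL unit header described above; the bit order follows the HEVC specification (forbidden_zero_bit, nal_unit_type, nuh_layer_id, nuh_temporal_id_plus1), and the returned dictionary keys are merely illustrative.

```python
def parse_hevc_nal_header(header: bytes) -> dict:
    """Parse a two-byte HEVC NAL unit header:
    1 bit forbidden_zero_bit, 6 bits nal_unit_type,
    6 bits nuh_layer_id, 3 bits nuh_temporal_id_plus1."""
    value = (header[0] << 8) | header[1]
    nal_unit_type = (value >> 9) & 0x3F
    nuh_layer_id = (value >> 3) & 0x3F
    temporal_id_plus1 = value & 0x07
    return {
        "nal_unit_type": nal_unit_type,
        "nuh_layer_id": nuh_layer_id,
        "TemporalId": temporal_id_plus1 - 1,  # zero-based temporal identifier
    }
```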
NAL units may be categorized into Video Coding Layer (VCL) NAL units and non-VCL NAL units. The VCL NAL units are typically coded slice NAL units. In HEVC, VCL NAL units include syntax elements that represent one or more CUs.
In HEVC, picture type abbreviations may be defined as follows: trailing (TRAIL) picture, Temporal Sub-layer Access (TSA), Step-wise Temporal Sub-layer Access (STSA), Random Access Decodable Leading (RADL) picture, Random Access Skipped Leading (RASL) picture, Broken Link Access (BLA) picture, Instantaneous Decoding Refresh (IDR) picture, and Clean Random Access (CRA) picture.
Random Access Point (RAP) pictures, which may also be referred to as Intra Random Access Point (IRAP) pictures in the independent layer, contain only intra-coded slices. IRAP pictures belonging to a prediction layer may contain P, B and I slices, cannot use inter-prediction from other pictures in the same predicted layer, and may use inter-layer prediction from its direct reference layer. In the current version of HEVC, IRAP pictures may be BLA pictures, CRA pictures, or IDR pictures. The first picture in the bitstream containing the base layer is an IRAP picture at the base layer. The IRAP picture at the independent layer and all subsequent non-RASL pictures at the independent layer can be correctly decoded in decoding order as long as the necessary parameter sets are available when activation is required, without the need to perform the decoding process of any picture preceding the IRAP picture in decoding order. When the necessary parameter sets are available when needed to be activated and decoding of each direct reference layer of the prediction layer that has been initialized, IRAP pictures belonging to the prediction layer and all subsequent non-RASL pictures in decoding order within the same prediction layer can be correctly decoded without performing the decoding process of any pictures of the same prediction layer preceding the IRAP pictures in decoding order. There may be pictures in the bitstream that contain only intra-coded slices that are not IRAP pictures.
The non-VCL NAL units may be, for example, one of the following types: sequence parameter set, picture parameter set, supplemental Enhancement Information (SEI) NAL unit, access unit delimiter, end of sequence NAL unit, end of bitstream NAL unit, or padding data NAL unit. Reconstruction of a decoded picture may require a parameter set, while reconstruction of decoded sample values does not necessitate many other non-VCL NAL units.
Parameters that remain unchanged throughout a coded video sequence may be included in a sequence parameter set. In addition to parameters that may be required by the decoding process, the sequence parameter set may optionally contain video usability information (VUI), which includes parameters that may be important for buffering, picture output timing, rendering, and resource reservation. In HEVC, a sequence parameter set RBSP includes parameters that can be referred to by one or more picture parameter set RBSPs or one or more SEI NAL units containing a buffering period SEI message. A picture parameter set contains such parameters that are likely to be unchanged in several coded pictures. A picture parameter set RBSP may include parameters that can be referred to by the coded slice NAL units of one or more coded pictures.
In HEVC, a Video Parameter Set (VPS) may be defined as a syntax structure containing syntax elements that apply to zero or more entire coded video sequences, as determined by the content of a syntax element found in the SPS referred to by a syntax element found in the PPS referred to by a syntax element found in each slice segment header.
The video parameter set RBSP can include one or more parameters that the sequence parameter set RBSP can refer to.
The relationship and hierarchy between the Video Parameter Set (VPS), the Sequence Parameter Set (SPS), and the Picture Parameter Set (PPS) may be described as follows. The VPS resides one level above the SPS in the parameter set hierarchy and in the context of scalability and/or 3D video. The VPS may include parameters that are common to all slices across all (scalability or view) layers in the entire coded video sequence. The SPS includes parameters that are common to all slices in a particular (scalability or view) layer in the entire coded video sequence and may be shared by multiple (scalability or view) layers. The PPS includes parameters that are common to all slices in a particular layer representation (the representation of one scalability or view layer in one access unit) and are likely to be shared by all slices in multiple layer representations.
The VPS may provide information about the dependency relationship of layers in the bitstream and many other information applicable to all slices across all (scalability or view) layers in the entire coded video sequence. A VPS may be considered to comprise two parts (a base VPS and a VPS extension), wherein the VPS extension may optionally exist.
Out-of-band transmission, signaling, or storage may additionally or alternatively be used for purposes other than tolerance against transmission errors, such as ease of access or session negotiation. For example, a sample entry of a track in a file conforming to the ISO base media file format may comprise parameter sets, while the coded data in the bitstream is stored elsewhere in the file or in another file. The phrase along the bitstream (e.g., indicating along the bitstream) or along a coded unit of a bitstream (e.g., indicating along a coded tile) may be used in the claims and the described embodiments to refer to out-of-band transmission, signaling, or storage in a manner that the out-of-band data is associated with the bitstream or the coded unit, respectively. The phrase decoding along the bitstream or along a coded unit of a bitstream, or the like, may refer to decoding the referred out-of-band data (which may be obtained from out-of-band transmission, signaling, or storage) that is associated with the bitstream or the coded unit, respectively.
An SEI NAL unit may contain one or more SEI messages, which are not required for the decoding of output pictures but may assist related processes, such as picture output timing, rendering, error detection, error concealment, and resource reservation. Several SEI messages are specified in H.264/AVC and HEVC, and the user data SEI messages enable organizations and companies to specify SEI messages for their own use. H.264/AVC and HEVC contain the syntax and semantics of the specified SEI messages but define no process for handling the messages in the recipient. Consequently, encoders are required to follow the H.264/AVC standard or the HEVC standard when they create SEI messages, and decoders conforming to the H.264/AVC standard or the HEVC standard, respectively, are not required to process SEI messages for output order conformance. One of the reasons to include the syntax and semantics of SEI messages in H.264/AVC and HEVC is to allow different system specifications to interpret the supplemental information identically and hence interoperate. It is intended that system specifications may require the use of particular SEI messages both in the encoding end and in the decoding end, and additionally, the process for handling particular SEI messages in the recipient may be specified.
In HEVC, there are two types of SEI NAL units, namely a suffix SEI NAL unit and a prefix SEI NAL unit, which have nal_unit_type values that are different from each other. The SEI message(s) contained in the suffix SEI NAL unit are associated with a VCL NAL unit that precedes the suffix SEI NAL unit in decoding order. The SEI message(s) contained in the prefix SEI NAL unit are associated with the VCL NAL unit that follows the prefix SEI NAL unit in decoding order.
An encoded picture is an encoded representation of a picture.
In HEVC, a coded picture may be defined as a coded representation of a picture containing all coding tree units of the picture. In HEVC, an Access Unit (AU) may be defined as a set of NAL units that are associated with each other according to a specified classification rule, are consecutive in decoding order, and contain at most one picture with any specific value of nuh_layer_id. In addition to containing the VCL NAL units of the coded picture, an access unit may also contain non-VCL NAL units. The specified classification rule may, for example, associate pictures with the same output time or picture order count value into the same access unit.
A bitstream may be defined as a sequence of bits in the form of a stream of NAL units or byte streams that form a representation of encoded pictures and associated data (forming one or more encoded video sequences). The first bit stream may be followed by the second bit stream in the same logical channel, such as in the same file or in the same connection of a communication protocol. An elementary stream (in the context of video coding) may be defined as a sequence of one or more bitstreams. The end of the first bitstream may be indicated by a specific NAL unit, which may be referred to as the end of the bitstream (EOB) NAL unit and which is the last NAL unit of the bitstream. In HEVC and its current draft extensions, EOB NAL units are required to have nuh_layer_id equal to 0.
In h.264/AVC, an encoded video sequence is defined as a sequence of consecutive access units from an IDR access unit (included) to a next IDR access unit (not included) or to the end of a bitstream (subject to earlier occurrence) in decoding order.
In HEVC, a Coded Video Sequence (CVS) may be defined, for example, as a sequence of access units that consists, in decoding order, of an IRAP access unit with NoRaslOutputFlag equal to 1, followed by zero or more access units that are not IRAP access units with NoRaslOutputFlag equal to 1, including all subsequent access units up to but not including any subsequent access unit that is an IRAP access unit with NoRaslOutputFlag equal to 1. An IRAP access unit may be defined as an access unit in which the base layer picture is an IRAP picture. The value of NoRaslOutputFlag is equal to 1 for each IDR picture, each BLA picture, and each IRAP picture that is the first picture in that particular layer in the bitstream in decoding order, or that is the first IRAP picture following an end of sequence NAL unit having the same nuh_layer_id value in decoding order. There may be means to provide the value of HandleCraAsBlaFlag to the decoder from an external entity, such as a player or a receiver, which may control the decoder. HandleCraAsBlaFlag may be set to 1, for example, by a player that seeks to a new position in a bitstream or tunes into a broadcast and starts decoding from a CRA picture. When HandleCraAsBlaFlag is equal to 1 for a CRA picture, the CRA picture is handled and decoded as if it were a BLA picture.
In HEVC, a coded video sequence may additionally or alternatively (according to the above description) be designated as ending when a particular NAL unit, which may be referred to as an end of sequence (EOS) NAL unit, appears in the bitstream and has a nuh layer id equal to 0.
A group of pictures (GOP) and its characteristics can be defined as follows. The GOP may be decoded whether or not any previous pictures were decoded. An open GOP is a group of pictures in which, when decoding starts from an initial intra picture of the open GOP, a picture preceding the initial intra picture in output order may not be correctly decoded. In other words, a picture of an open GOP may refer to a picture belonging to a previous GOP (in inter prediction). The HEVC decoder may identify intra pictures that begin an open GOP because a particular NAL unit type, CRA NAL unit type, may be used for its coded slices. A closed GOP is a group of pictures in which all pictures can be correctly decoded when decoding starts from an initial intra picture of the closed GOP. In other words, no picture in a closed GOP refers to any picture in a previous GOP. In H.264/AVC and HEVC, a closed GOP may start from an IDR picture. In HEVC, a closed GOP may also start from a bla_w_radl or bla_n_lp picture. The open GOP coding structure may be more efficient in compression than the closed GOP coding structure because of the greater flexibility in selecting reference pictures.
A Decoded Picture Buffer (DPB) may be used in the encoder and/or decoder. Buffering decoded pictures is used for reference in inter prediction and reordering decoded pictures into output order for two reasons. Since h.264/AVC and HEVC provide great flexibility for reference picture marking and output reordering, separate buffers for reference picture buffering and output picture buffering may waste memory resources. Thus, the DPB may include a unified decoded picture buffering process for reference pictures and output reordering. When the decoded picture is no longer used as a reference and no output is needed, it may be removed from the DPB.
In many coding modes of h.264/AVC and HEVC, reference pictures for inter prediction are indicated with an index to a reference picture list. The index may be encoded with a variable length coding, which typically causes a smaller index to have a shorter value for the corresponding syntax element. In h.264/AVC and HEVC, two reference picture lists (reference picture list 0 and reference picture list 1) are generated for each bi-directionally predicted (B) slice, and one reference picture list (reference picture list 0) is formed for each inter-coded (P) slice.
Many coding standards, including h.264/AVC and HEVC, may have a decoding process to derive a reference picture index to a reference picture list that may be used to indicate which of a plurality of reference pictures is used for inter prediction for a particular block. The reference picture index may be encoded into a bitstream by an encoder in some inter-coding modes, or may be derived (by both the encoder and decoder) using neighboring blocks, for example, in some other inter-coding modes.
Several candidate motion vectors may be derived for a single prediction unit. For motion vector prediction, HEVC, for example, includes two motion vector prediction schemes, namely Advanced Motion Vector Prediction (AMVP) and merge mode. In AMVP or merge mode, a list of motion vector candidates is derived for the PU. There are two kinds of candidates: spatial candidates and temporal candidates, where temporal candidates may also be referred to as TMVP candidates.
For example, candidate list derivation may be performed as follows, with the understanding that other possibilities exist for candidate list derivation. If the occupancy of the candidate list is not maximum, then the spatial candidates (if available and not present in the candidate list) are first included in the candidate list. Thereafter, if the occupancy of the candidate list has not reached a maximum, the temporal candidate is included in the candidate list. If the number of candidates has not reached the maximum allowed number, the combined bi-prediction candidates (for the B slice) and zero motion vectors are added. After the candidate list has been constructed, the encoder decides final motion information from the candidates, e.g., based on Rate Distortion Optimization (RDO) decisions, and encodes an index of the selected candidates into the bitstream. Likewise, the decoder decodes an index of the selected candidate from the bitstream, constructs a candidate list, and selects a motion vector predictor from the candidate list using the decoded index.
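The candidate list construction order described in the preceding paragraph can be sketched as follows; the representation of candidates as tuples and the maximum list size are illustrative assumptions, and the combined bi-prediction candidates are omitted for brevity.

```python
def build_candidate_list(spatial, temporal, max_candidates):
    """Fill a motion vector candidate list in the order described above:
    spatial candidates, then the temporal candidate, then zero motion vectors.
    Candidates are (mv_x, mv_y, ref_idx) tuples; None marks an unavailable one."""
    candidates = []
    for cand in spatial:
        if cand is not None and cand not in candidates and len(candidates) < max_candidates:
            candidates.append(cand)
    if temporal is not None and len(candidates) < max_candidates:
        candidates.append(temporal)
    # Combined bi-prediction candidates (for B slices) are omitted in this sketch.
    while len(candidates) < max_candidates:
        candidates.append((0, 0, 0))  # zero motion vector padding
    return candidates
```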
In HEVC, AMVP and merge mode may be characterized as follows. In AMVP, the encoder indicates whether unidirectional prediction or bi-directional prediction is used and which reference pictures are used and encodes the motion vector differences. In merge mode, only candidates selected from the candidate list are encoded into the bitstream to indicate that the current prediction unit has the same motion information as the indicated predictor. Thus, the merge mode creates regions consisting of neighboring prediction blocks sharing the same motion information, which is signaled only once for each region.
An example of the operation of advanced motion vector prediction is provided below; other similar realizations of advanced motion vector prediction are also possible, for example with different candidate position sets and different candidate locations within the candidate position sets. It should also be appreciated that other prediction modes, such as the merge mode, may operate similarly. Two spatial Motion Vector Predictors (MVPs) may be derived and a Temporal Motion Vector Predictor (TMVP) may be derived. They may be selected among the following positions: three spatial motion vector predictor candidate positions located above the current prediction block (B0, B1, B2) and two located on the left (A0, A1). The first motion vector predictor that is available (e.g., resides in the same slice, is inter-coded, etc.) in a predefined order of each candidate position set (B0, B1, B2) or (A0, A1) may be selected to represent that prediction direction (up or left) in the motion vector competition. A reference index for the temporal motion vector predictor may be indicated by the encoder in the slice header (e.g., as a collocated_ref_idx syntax element). The first motion vector predictor that is available (e.g., is inter-coded) in a predefined order of potential temporal candidate locations (e.g., in the order (C0, C1)) may be selected as the source of the temporal motion vector predictor. The motion vector obtained from the first available candidate location in the co-located picture may be scaled according to the proportions of the picture order count differences of the reference picture of the temporal motion vector predictor, the co-located picture, and the current picture. Moreover, a redundancy check may be performed among the candidates to remove identical candidates, which can lead to the inclusion of a zero motion vector in the candidate list. The motion vector predictor may be indicated in the bitstream, for example, by indicating the direction of the spatial motion vector predictor (up or left) or the selection of the temporal motion vector predictor candidate. The co-located picture may also be referred to as the collocated picture, the source for motion vector prediction, or the source picture for motion vector prediction.
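The scaling of a co-located motion vector by the ratio of picture order count differences, mentioned above, can be sketched as follows; the floating-point arithmetic and rounding are simplifications compared with the integer arithmetic an actual codec would use.

```python
def scale_temporal_mv(mv, poc_current, poc_ref, poc_colocated, poc_colocated_ref):
    """Scale a co-located motion vector by the ratio of picture order count
    differences, as used for the temporal motion vector predictor."""
    td = poc_colocated - poc_colocated_ref   # POC distance in the co-located picture
    tb = poc_current - poc_ref               # POC distance for the current picture
    if td == 0:
        return mv
    scale = tb / td
    return (round(mv[0] * scale), round(mv[1] * scale))
```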
The motion parameter types or motion information may include, but are not limited to, one or more of the following types (an illustrative data structure sketch follows this list):
- an indication of a prediction type (e.g., intra prediction, uni-prediction, bi-prediction) and/or the number of reference pictures;
- an indication of a prediction direction, such as inter (also known as temporal) prediction, inter-layer prediction, inter-view prediction, View Synthesis Prediction (VSP), and inter-component prediction (which may be indicated per reference picture and/or per prediction type, and where in some embodiments inter-view prediction and view synthesis prediction may be jointly considered as one prediction direction); and/or
- an indication of a reference picture type, such as a short-term reference picture and/or a long-term reference picture and/or an inter-layer reference picture (which may be indicated, for example, per reference picture);
- a reference index to a reference picture list and/or any other identifier of a reference picture (which may be indicated, for example, per reference picture, whose type may depend on the prediction direction and/or the reference picture type, and which may be accompanied by other relevant information, such as the reference picture list or the like to which the reference index applies);
- a horizontal motion vector component (which may be indicated, for example, per prediction block or per reference index or the like);
- a vertical motion vector component (which may be indicated, for example, per prediction block or per reference index or the like);
- one or more parameters, such as picture order count differences and/or relative camera separation between the picture containing or associated with the motion parameters and its reference picture, which may be used for scaling the horizontal motion vector component and/or the vertical motion vector component in one or more motion vector prediction processes (where said one or more parameters may be indicated, for example, per reference picture or per reference index or the like);
- the coordinates of the block to which the motion parameters and/or motion information applies, e.g., the coordinates of the top-left sample of the block in luma sample units;
- the extents (e.g., width and height) of the block to which the motion parameters and/or motion information applies.
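Purely for illustration, the parameter types listed above could be grouped into a data structure such as the following sketch; the field names are hypothetical and not taken from any specification.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class MotionInfo:
    """Illustrative container for the motion parameter types listed above."""
    prediction_type: str                           # e.g. "intra", "uni", "bi"
    prediction_direction: Optional[str] = None     # e.g. "temporal", "inter-layer"
    reference_picture_type: Optional[str] = None   # e.g. "short-term", "long-term"
    reference_index: Optional[int] = None
    mv_horizontal: int = 0
    mv_vertical: int = 0
    poc_difference: Optional[int] = None           # used for motion vector scaling
    block_position: Tuple[int, int] = (0, 0)       # top-left sample, luma units
    block_size: Tuple[int, int] = (0, 0)           # width, height
```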
In general, motion vector prediction mechanisms (such as those presented above as examples) may include predictions or inheritance of certain predefined or indicated motion parameters.
The motion field associated with a picture may be considered to include a set of motion information generated for each encoded block of the picture. For example, the motion field may be accessed by coordinates of the block. For example, motion fields may be used in TMVP or any other motion prediction mechanism, where sources or inheritance for prediction other than the current (decoded) coded picture are used.
Different spatial granularities or units may be applied to represent and/or store the motion field. For example, a regular grid of spatial units may be used. For example, a picture may be divided into rectangular blocks of a certain size (possibly except for blocks at the edges of the picture, such as at the right and bottom edges). For example, the size of the spatial unit may be equal to a minimum size for which the encoder may indicate significant motion in the bitstream, such as a 4x4 block of luma sample units. For example, a so-called compressed motion field may be used, wherein the spatial unit may be equal to a predefined or indicated size, such as a 16x16 block of luminance sample units, which may be larger than a minimum size for indicating significant motion. For example, an HEVC encoder and/or decoder may be implemented in a manner that performs Motion Data Storage Reduction (MDSR) or motion field compression (before using motion fields for any prediction between pictures) for each decoded motion field. In HEVC implementations, MDSR may reduce the granularity of motion data to 16x16 blocks of luma sample units by maintaining motion applicable to the top left samples of 16x16 blocks in a compressed motion field. The encoder may encode the indication(s) related to the spatial units of the compressed motion field as one or more syntax elements and/or syntax element values, for example in a sequence level syntax structure such as a video parameter set or a sequence parameter set. In some (decoding) encoding methods and/or apparatuses, a motion field may be represented and/or stored according to a block partition of motion prediction (e.g., a prediction unit according to the HEVC standard). In some (decoding) encoding methods and/or apparatus, a combination of regular meshing and block partitioning may be applied such that motions associated with partitions larger than a predefined or indicated spatial unit size are represented and/or stored in association with those partitions, while motions associated with partitions that are smaller than or not aligned with the predefined or indicated spatial unit size or meshing are represented and/or stored for the predefined or indicated units.
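A minimal sketch of the motion data storage reduction behaviour described above, assuming the motion field is stored on a 4x4 luma-sample grid and compressed to 16x16 blocks by keeping the motion of each block's top-left unit; the dictionary-based motion field representation is an assumption made for illustration.

```python
def compress_motion_field(motion_field, width_4x4, height_4x4, ratio=4):
    """Reduce motion granularity from 4x4 to 16x16 luma-sample units (MDSR-style)
    by keeping the motion of the top-left 4x4 unit of each 16x16 block.

    motion_field: dict mapping (x4, y4) grid coordinates to motion information.
    """
    compressed = {}
    for y in range(0, height_4x4, ratio):
        for x in range(0, width_4x4, ratio):
            compressed[(x // ratio, y // ratio)] = motion_field.get((x, y))
    return compressed
```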
Scalable video coding may refer to a coding structure in which one bitstream may contain multiple representations of content, for example, at different bit rates, resolutions, or frame rates. In these cases, the receiver may extract the desired representation based on its characteristics (e.g., resolution that best matches the display device). Alternatively, the server or network element may extract the portion of the bitstream to be transmitted to the receiver according to, for example, the network characteristics or processing power of the receiver. By decoding only certain parts of the scalable bit stream, a meaningful decoded representation can be produced. A scalable bit stream is typically composed of a "base layer" that provides the lowest quality video available and one or more enhancement layers that enhance video quality when received and decoded with lower layers. In order to increase the coding efficiency of an enhancement layer, the coded representation of that layer is typically dependent on lower layers. For example, motion and mode information of the enhancement layer may be predicted from a lower layer. Likewise, lower layer pixel data may be used to create predictions for the enhancement layer.
In some scalable video coding schemes, a video signal may be encoded into a base layer and one or more enhancement layers. The enhancement layer may enhance, for example, temporal resolution (i.e., frame rate), spatial resolution, or enhance the quality of video content represented by another layer or only a portion thereof. Each layer and all its dependent layers are one representation of the video signal, e.g. at a certain spatial resolution, temporal resolution and quality level. In this document, we refer to the scalable layer and all its subordinate layers as "scalable layer representation". The portion of the scalable bit stream corresponding to the scalable layer representation can be extracted and decoded to produce a representation of the original signal with a certain fidelity.
Scalability modes or scalability dimensions may include, but are not limited to, the following:
- Quality scalability: base layer pictures are coded at a lower quality than enhancement layer pictures, which may be achieved, for example, by using a larger quantization parameter value (i.e., a larger quantization step size for transform coefficient quantization) in the base layer than in the enhancement layer. Quality scalability may be further categorized into fine-grain or Fine Granularity Scalability (FGS), medium-grain or Medium Granularity Scalability (MGS), and/or Coarse Granularity Scalability (CGS), as described below.
- Spatial scalability: base layer pictures are coded at a lower resolution (i.e., with fewer samples) than enhancement layer pictures. Spatial scalability and quality scalability, particularly its coarse-grain scalability type, may sometimes be considered the same type of scalability.
- Bit-depth scalability: base layer pictures are coded at a lower bit depth (e.g., 8 bits) than enhancement layer pictures (e.g., 10 or 12 bits).
- Dynamic range scalability: the scalable layers represent different dynamic ranges and/or images obtained using different tone mapping functions and/or different optical transfer functions.
- Chroma format scalability: base layer pictures provide lower spatial resolution in the chroma sample arrays (e.g., coded in the 4:2:0 chroma format) than enhancement layer pictures (e.g., in the 4:4:4 format).
- Color gamut scalability: enhancement layer pictures have a richer/broader color representation range than base layer pictures; for example, the enhancement layer may have the UHDTV (ITU-R BT.2020) color gamut while the base layer may have the ITU-R BT.709 color gamut.
- View scalability, which may also be referred to as multiview coding: the base layer represents a first view, whereas an enhancement layer represents a second view. A view may be defined as a sequence of pictures representing one camera or viewpoint. It may be considered that in stereoscopic or two-view video, one video sequence or view is presented for the left eye while a parallel view is presented for the right eye.
- Depth scalability, which may also be referred to as depth-enhanced coding: a layer or some layers of a bitstream may represent texture view(s), while other layer(s) may represent depth view(s).
- Region-of-interest scalability (as described below).
- Interlaced-to-progressive scalability (also known as field-to-frame scalability): coded interlaced source content of the base layer is enhanced with an enhancement layer to represent progressive source content. The coded interlaced source content in the base layer may comprise coded fields, coded frames representing field pairs, or a mixture thereof. In interlaced-to-progressive scalability, the base layer picture may be resampled so that it becomes a suitable reference picture for one or more enhancement layer pictures.
- Hybrid codec scalability (also referred to as coding standard scalability): in hybrid codec scalability, the bitstream syntax, semantics, and decoding process of the base layer and the enhancement layer are specified in different video coding standards. Thus, base layer pictures are coded according to a different coding standard or format than enhancement layer pictures. For example, the base layer may be coded with H.264/AVC and an enhancement layer may be coded with an HEVC multi-layer extension.
It should be appreciated that many scalability types may be combined and applied together. For example, color gamut scalability and bit depth scalability may be combined.
The term layer may be used in the context of any type of scalability, including view scalability and depth enhancement. Enhancement layers may refer to any type of enhancement, such as SNR, spatial, multiview, depth, bit depth, chroma format, and/or color gamut enhancement. A base layer may refer to any type of base video sequence, such as a base view, a base layer for SNR/spatial scalability, or a texture base view for depth enhanced video coding.
Some scalable video coding schemes may require IRAP pictures to be aligned across layers in such a manner that either all pictures in an access unit are IRAP pictures or no picture in an access unit is an IRAP picture. Other scalable video coding schemes, such as the multi-layer extensions of HEVC, may allow non-aligned IRAP pictures, i.e., one or more pictures in an access unit are IRAP pictures while one or more other pictures in the access unit are not IRAP pictures. Scalable bitstreams with IRAP pictures or similar pictures that are not aligned across layers may be used, for example, to provide more frequent IRAP pictures in the base layer, where they may have a smaller coded size due to, for example, a smaller spatial resolution. A process or mechanism for layer-wise start-up of decoding may be included in a video decoding scheme. Decoders may hence start decoding of a bitstream when the base layer contains an IRAP picture and step-wise start decoding other layers when they contain IRAP pictures. In other words, in a layer-wise start-up of the decoding mechanism or process, decoders progressively increase the number of decoded layers (where a layer may represent an enhancement in spatial resolution, quality level, views, additional components such as depth, or a combination thereof) as subsequent pictures from additional enhancement layers are decoded in the decoding process. The progressive increase in the number of decoded layers may be perceived, for example, as a progressive improvement of picture quality (in the case of quality and spatial scalability).
A sender, a gateway, a client, or another entity may select the transmitted layers and/or sub-layers of a scalable video bitstream. The terms layer extraction, extraction of layers, or layer down-switching may refer to transmitting fewer layers than are available in the bitstream received by the sender, gateway, client, or other entity. Layer up-switching may refer to transmitting additional layer(s) compared to those transmitted prior to the layer up-switching by the sender, gateway, client, or other entity, i.e., restarting the transmission of one or more layers whose transmission was stopped earlier in layer down-switching. Similarly to layer down-switching and/or up-switching, the sender, gateway, client, or other entity may perform down- and/or up-switching of temporal sub-layers. The sender, gateway, client, or other entity may also perform both layer and sub-layer down-switching and/or up-switching. Layer and sub-layer down-switching and/or up-switching may be carried out in the same access unit or the like (i.e., virtually simultaneously), or may be carried out in different access units or the like (i.e., virtually at distinct times).
Scalability may be enabled in two basic ways: either by introducing new coding modes for performing prediction of pixel values or syntax from a lower layer of the scalable representation, or by placing the lower layer pictures in a reference picture buffer (e.g., the decoded picture buffer, DPB) of the higher layer. The first approach may be more flexible and can thus provide better coding efficiency in most cases. However, the second, reference-frame-based scalability approach can be implemented efficiently with minimal changes to single-layer codecs while still achieving the majority of the available coding efficiency gains. Essentially, a reference-frame-based scalability codec can be implemented by utilizing the same hardware or software implementation for all the layers, just taking care of the DPB management by external means.
A scalable video encoder for quality scalability (also referred to as signal-to-noise ratio or SNR) and/or spatial scalability may be implemented as follows. For the base layer, a conventional non-scalable video encoder and decoder may be used. The reconstructed/decoded picture of the base layer is included in a reference picture buffer and/or a reference picture list for the enhancement layer. In the case of spatial scalability, the reconstructed/decoded base layer picture may be upsampled before inserting the reference picture list of the enhancement layer picture. The base layer decoded picture may be inserted into the reference picture list(s) to encode/decode the enhancement layer picture similar to the decoded reference picture of the enhancement layer. Thus, the encoder may select the base layer reference picture as an inter prediction reference and indicate its use with a reference picture index in the encoded bitstream. The decoder decodes from the bitstream (e.g., from the reference picture index) the base layer picture as an inter prediction reference for the enhancement layer. When a decoded base layer picture is used as a prediction reference for an enhancement layer, it is referred to as an inter-layer reference picture.
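A hedged sketch of the reference-index-based approach described above: the (possibly upsampled) base layer picture is appended to the enhancement layer reference picture list, where the upsampling function and the list handling are simplified placeholders rather than an actual SHVC implementation.

```python
def build_enhancement_ref_list(el_ref_pictures, base_layer_picture, upsample):
    """Append an (optionally upsampled) base layer picture to the enhancement
    layer reference picture list so it can be selected with a reference index."""
    ref_list = list(el_ref_pictures)
    inter_layer_ref = upsample(base_layer_picture)  # identity for SNR scalability
    ref_list.append(inter_layer_ref)
    return ref_list
```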
While the above describes a scalable video codec with two scalability layers (an enhancement layer and a base layer), it is to be understood that the description can be generalized to any two layers in a scalability hierarchy with more than two layers. In this case, a second enhancement layer may depend on a first enhancement layer in the encoding and/or decoding process, and the first enhancement layer may therefore be regarded as the base layer for the encoding and/or decoding of the second enhancement layer. Furthermore, it is to be understood that there may be inter-layer reference pictures from more than one layer in a reference picture buffer or in the reference picture lists of an enhancement layer, and each of these inter-layer reference pictures may be considered to reside in a base layer or a reference layer for the enhancement layer being encoded and/or decoded. Furthermore, it is to be understood that other types of inter-layer processing may take place instead of, or in addition to, reference layer picture upsampling. For example, the bit depth of the samples of the reference layer picture may be converted to the bit depth of the enhancement layer and/or the sample values may undergo a mapping from the color space of the reference layer to the color space of the enhancement layer.
The scalable video coding and/or decoding scheme may use multi-loop coding and/or decoding, which may be characterized as follows. In encoding/decoding, a base layer picture may be reconstructed/decoded to be used as a motion compensated reference picture within the same layer for a subsequent picture in encoding/decoding order, or as a reference for inter-layer (or inter-view or inter-component) prediction. The reconstructed/decoded base layer picture may be stored in the DPB. Enhancement layer pictures can also be reconstructed/decoded to be used as motion compensated reference pictures in the same layer in coding/decoding order for subsequent pictures, or as references for inter-layer (or inter-view or inter-component) prediction of higher enhancement layers, if any. In addition to the reconstructed/decoded sample values, syntax element values of the base/reference layer or variables derived from the syntax element values of the base/reference layer may be used in inter-layer/inter-component/inter-view prediction.
Inter-layer prediction may be defined as prediction performed in a manner depending on data elements (e.g., sample values or motion vectors) of reference pictures from a layer different from the layer of the current picture (encoded or decoded). There are many types of inter-layer prediction and can be applied in scalable video encoders/decoders. The available type of inter-layer prediction may for example depend on the coding profile according to which the bitstream or a specific layer within the bitstream is encoded or on the coding profile according to which the bitstream or a specific layer within the bitstream is indicated to conform when decoded. Alternatively or additionally, the available type of inter-layer prediction may depend on the type of scalability being used or the type of scalable codec or video coding standard modification (e.g., SHVC, MV-HEVC, or 3D-HEVC).
A direct reference layer may be defined as a layer that may be used for inter-layer prediction of another layer for which the layer is the direct reference layer. A direct predicted layer may be defined as a layer for which another layer is a direct reference layer. An indirect reference layer may be defined as a layer that is not a direct reference layer of a second layer but is a direct reference layer of a third layer that is a direct reference layer, or an indirect reference layer of a direct reference layer, of the second layer for which the layer is the indirect reference layer. An indirect predicted layer may be defined as a layer for which another layer is an indirect reference layer. An independent layer may be defined as a layer that has no direct reference layers. In other words, an independent layer is not predicted using inter-layer prediction. A non-base layer may be defined as any layer other than the base layer, and the base layer may be defined as the lowest layer in the bitstream. An independent non-base layer may be defined as a layer that is both an independent layer and a non-base layer.
In some cases, the data in the enhancement layer may be truncated after a certain location, even at arbitrary locations, where each truncated location may include additional data representing increasingly enhanced visual quality. This scalability is known as fine grain (granularity) scalability (FGS).
Similar to MVC, in MV-HEVC, inter-view reference pictures may be included in reference picture list(s) of a current picture being encoded or decoded. SHVC uses a multi-loop decoding operation (unlike the SVC extension of h.264/AVC). SHVC may be considered to use a reference index based approach, i.e., an inter-layer reference picture may be included in one or more reference picture lists of a current picture being encoded or decoded (as described above).
For enhancement layer coding, the concept and coding tools of the HEVC base layer may be used in SHVC, MV-HEVC, etc. However, additional inter-layer prediction tools that can use already encoded data (including reconstructed picture samples and motion parameters, also referred to as motion information) in the reference layer to efficiently encode the enhancement layer may be integrated into SHVC, MV-HEVC, and/or similar codecs.
As described above, video and image samples are typically encoded using a color representation such as YUV or YCbCr, consisting of one luminance channel and two chrominance channels. In these cases, the luminance channel, which mainly represents the illumination of the scene, is typically coded at a certain resolution, while the chrominance channels, which typically represent differences between certain color components, are often coded at a second resolution lower than that of the luminance signal. The purpose of this kind of differential representation is to decorrelate the color components and to be able to compress the data more efficiently. However, in many cases there remains some correlation between the channels, which can be exploited to represent the data more efficiently.
An improved method for color channel coding is now presented.
A method according to an aspect is shown in fig. 5, the method comprising: determining a residual signal (500) for at least one sample; determining whether the residual signal represents a residual for samples in more than one channel (502); and if affirmative, applying the residual signal for at least a first sample in a first channel to generate a first reconstructed sample (504); and applying the residual signal for at least a second sample in a second channel to generate a second reconstructed sample (506).
According to an embodiment, the combined residual signal is applied for a chrominance channel of a still image or video sequence.
Thus, a method is provided for jointly encoding residual signals of two or more color channels, in particular chrominance channels. The bitstream identifier may be used to detect the case where the same residual signal is applied to a plurality of chroma channels. The compression efficiency of video encoding/decoding can be further improved by indicating and decoding an indication that the same residual signal can be applied to a plurality of chroma channels.
Unless otherwise indicated herein, the method and related embodiments are equally applicable to operations performed by an encoder or decoder. The method and related embodiments may be implemented in different ways. For example, the order of the operations described above may be changed, or the operations may be staggered in a different manner. Moreover, different additional operations may be applied in different stages of the process. For example, there may be additional filtering, scaling, mapping, or other processing applied to the final or intermediate results of the described operations. The final or intermediate results of the above operations may be further combined with the results of other operations.
Determining the residual signal may be performed in various ways, wherein the actual implementation of determining the residual signal typically varies between the decoder and the encoder. For example, a video or image decoder may decode syntax elements from a bitstream describing transform coefficients that may be further transformed into values of a residual sample block using, for example, an inverse Discrete Cosine Transform (DCT). Naturally, other transforms may also be used, such as Discrete Sine Transforms (DST) or wavelet transforms, or the transform steps may be omitted, and the residual sample values may be directly constructed based on syntax elements describing the sample differences of individual residual samples or residual sample blocks.
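As a sketch of the decoder-side residual determination described above, the following turns decoded coefficients into residual samples with a 2-D inverse DCT; SciPy is used for brevity, and the dequantization is reduced to a single scale factor, which is a simplification made for illustration.

```python
import numpy as np
from scipy.fft import idctn

def coefficients_to_residual(coeffs, qstep):
    """Sketch: dequantize decoded transform coefficients with a single scale
    factor and apply a 2-D inverse DCT to obtain a residual sample block."""
    dequantized = np.asarray(coeffs, dtype=float) * qstep
    return idctn(dequantized, norm="ortho")
```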
The residual signal may also be determined in various ways in the video or image encoder. In the case of sharing the residual between two or more channels, the joint residual may be generated, for example, by averaging or using a weighted average of the residual signal and optionally transforming, quantizing to the desired precision, inverse quantizing and inverse transforming to the sampling domain. The video or image encoder may, for example, be configured to add half the residual from one channel and subtract half the residual from the other channel when generating the joint residual signal or joint residual sample block. As part of the quantization process, a rate distortion optimized quantization process may be used. In this case, the multi-channel nature of the joint residual may be considered, and the quantization process may be adapted to weight the estimated quantization error, for example, unlike the case of single-channel residual coding.
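The averaging-based generation of a joint residual mentioned above might look as follows; this particular weighting (half of one channel's residual added, half of the other subtracted) is only one of the possibilities described, and the transform/quantization round trip is omitted from the sketch.

```python
import numpy as np

def make_joint_residual(res_cb, res_cr):
    """Form a joint chroma residual as described above: half of one channel's
    residual added and half of the other channel's residual subtracted."""
    return (np.asarray(res_cb, dtype=float) - np.asarray(res_cr, dtype=float)) / 2.0
```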
Whether the residue represents a sample residue for more than one channel may be determined in various ways. In a video or image encoder, different modes of operation may be tested and the lowest rate distortion cost may be selected, for example, for the coding unit, prediction unit, transform unit, or other block of samples. The relevant information may be signaled in a bitstream from which the video/image decoder may obtain and decode the information. Bit stream signaling and information decoding may also be accomplished in different ways.
According to an embodiment, the method further comprises: a flag is included in the bitstream for indicating that at least one predefined condition for residual decoding is met. Such a flag may be referred to as a combined residual flag. For example, a 1-bit identifier flag or a 1-bin context encoded identifier flag may be used as an indication in the bitstream, wherein such a combined residual flag may advantageously be included in the bitstream under one or more predefined conditions.
According to an embodiment, the combined residual flag is decoded and, in response to the combined residual flag being 1 or true, a single residual block is decoded and the residual is applied to the first and second channels. An example of such a condition is thus that the flag is indicated or decoded only if both chroma-related channels are first indicated to have a residual associated with them.
According to an embodiment, the coded block flag for the first channel is decoded, the coded block flag for the second channel is decoded, and in response to both being 1 or true, the combined residual flag is decoded, and in response to the combined residual flag being 1 or true, a single residual block is decoded and applied to the first and second channels. An example of such an indicated bitstream syntax is given below as transform_unit_sample_1, wherein first the chroma-related channels cb and cr are indicated with a coded block flag (cbf) with residues associated therewith, and then the joint residue is indicated to be applicable to both channels.
According to an embodiment, the combined residual flag is decoded and, in response to the combined residual flag being 1 or true, a single residual block is decoded, added to the prediction block in the first channel and subtracted from the prediction block in the second channel. In this case, the residual for the second channel may be omitted in the bitstream, as in the syntax above, and may instead be calculated from the joint residual received for the first channel.
According to an embodiment, an identifier associated with the first color channel is decoded and, in response to the identifier, a residual signal of the second channel is subtracted from the predicted signal in the first channel. Thus, another example of a conditional indication for a flag is to first indicate or decode an identifier that indicates whether a residual exists for the first channel. If the indication implies that there is a residual for the first channel, a flag is used to indicate whether the residual is a joint residual of two or more channels or is applied to the first channel only. In this example, in the case of indicating a joint residual, there is no need to decode a syntax element indicating the encoded residual of the second channel. An example bitstream syntax for such an indication is given below as transform_unit_sample_2, wherein a coded block flag (cbf) is used as an identifier to indicate that there is a residual associated with chroma-related channel cb. The syntax element tu_cb_cr_joint_residual is used to indicate that there is a joint residual for both chroma-related channels cb and cr.
Yet another possibility for conditional signaling is to first indicate that one of the channels has a residual. An example of this is given below as transform_unit_sample_3. In this case, only indicating the residual of the second channel, using the tu_cbf_cr flag, triggers decoding of the tu_cb_cr_joint_residual flag. Of course, the flag for the first channel (tu_cbf_cb) may also be used instead.
In another example, the indication of the joint residual mode may be done before the coded block flags. In this case, when the joint residual mode is signaled, the two coded block flags tu_cbf_cb and tu_cbf_cr do not need to be explicitly encoded or decoded and may both be assigned a value of true or 1. Thus, the embodiment mentioned above may also be applied here, wherein the combined residual flag is decoded and, in response to the combined residual flag being 1 or true, a single residual block is decoded and the residual is applied to the first channel and the second channel.
In another example, the indication of the joint residual mode may be done within the residual coding syntax of one of the residual blocks. This is illustrated by transform_unit_sample_4, which contains identifiers of the coded block flags of the Cb and Cr blocks, and by residual_coding_sample_4, which contains a check that the current channel index matches the first channel (in this example the Cr channel, with the comparison cIdx == 2) and that the second channel (in this example the Cb channel, with tu_cbf_cb) has its coded block flag set to 1 or true. If both conditions are true, the indication of the combined residual mode is decoded, and if it evaluates to true, further parsing operations for the Cr residual may be omitted, since the Cr residual can be derived from the Cb residual. Alternatively, some additional signaling may be present to indicate how the residuals relate to each other.
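A corresponding sketch of this in-residual-coding condition (hypothetical helpers; channel index 2 is taken as Cr as in the text, and the -1 weight used elsewhere in this description is assumed for deriving the Cr residual):

```python
def parse_cr_residual(read_flag, parse_coeffs, c_idx, tu_cbf_cb, cb_residual):
    """Residual parsing for one channel; for the Cr channel the joint flag may
    replace explicit coefficient parsing."""
    if c_idx == 2 and tu_cbf_cb:                 # Cr block while Cb already has a residual
        if read_flag("tu_cb_cr_joint_residual"):
            # Derive the Cr residual from the Cb residual; the -1 weight used
            # elsewhere in this description is assumed here.
            return -cb_residual
    return parse_coeffs(c_idx)                   # otherwise parse the coefficients normally
```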
According to an embodiment, the combined residual pattern is applied to a subset of blocks determined by bitstream signaling.
According to an embodiment, the combined residual mode is applied to the block using only prediction modes belonging to a predetermined set of prediction modes. The predetermined set of prediction modes may include, for example, all or a subset of all intra prediction modes.
According to an embodiment, the combined residual mode is applied to the block using only residual coding modes belonging to a predetermined set of residual coding modes. The predetermined set of residual coding modes may comprise, for example, all or a subset of all transform coding modes. The indication of the combined residual mode may be encoded or decoded for a single transform unit or covering multiple transform units. For example, the indication may be given in a coding tree unit, coding unit, prediction unit or root transform tree level, which may cover a plurality of transform units.
If it is determined that the residual signal represents sample residues for more than one channel, the coding unit, coding block, prediction unit, prediction block, transform unit or transform block for which a determination is made may be said to be a unit or block of "combined residual mode" or "shared residual mode" or "joint residual mode". Likewise, the "combined/shared/joint residual mode" of the unit or block can be said to be "on".
According to an embodiment, the method further comprises: applying post-processing means to the at least first reconstructed sample to generate at least a first output sample in the first channel; and applying post-processing means to the at least second reconstructed sample to generate at least a second output sample in the second channel.
According to an embodiment, the method further comprises: the quantization parameter or inverse quantization parameter used in the residual coding is set based on whether the combined residual coding mode is applied to the sample block. The inverse quantization parameter or inverse quantization step size may advantageously be set smaller for blocks to which the combined residual coding mode is applied than for those blocks for which the mode is turned off. Likewise, the combined residue may be scaled up before encoding and/or scaled down after decoding to achieve a similar effect.
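A minimal sketch of such a dependency (the scaling factor is purely an assumed example value):

```python
def dequant_step(base_qstep, joint_residual_mode, joint_scale=0.8):
    """Smaller inverse-quantization step when the combined residual mode is on,
    since a single residual serves two channels; 0.8 is an assumed example value."""
    return base_qstep * joint_scale if joint_residual_mode else base_qstep
```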
According to an embodiment, additional information is decoded from the bitstream and used together with the joint residual to determine the reconstructed block of samples. Such additional information may include, for example, a scalar or a matrix of scaling values; parameters for mapping sample values linearly or non-linearly; or additional transform-coded or non-transform-coded sample values or sample offsets.
If it is determined that the residual signal represents a sample residual for more than one channel, the residual may be applied to the channels in different ways. According to an embodiment, the coded block flag of the first channel is decoded, the coded block flag of the second channel is decoded, and in response to the coded block flag of the first channel being 0 or false and the coded block flag of the second channel being 1 or true, a scaled version of the residual of the second channel is added to the prediction of the first channel. For example, a scalar scale denoted s_u may be used for the first channel, while a different scalar value s_v may be used for the second channel. For a block of samples, this can be described in matrix form as:

O_u = P_u + s_u * R_uv
O_v = P_v + s_v * R_uv

where O_u and O_v denote the reconstructed sample blocks in the U and V channels, respectively (or single samples if a block size of 1x1 samples is used). Similarly, P_u and P_v denote the corresponding sample prediction blocks in those channels, and R_uv denotes the joint residual.
According to an embodiment, a residual signal associated with a first color channel is decoded, an identifier associated with a second color channel is decoded, and in response to the identifier, the residual signal of the first channel is subtracted from a predicted signal in the second channel. For YUV or YCbCr video, the scalar values s_u and s_v can thus advantageously be chosen as 1 and -1, allowing the equations to be expressed in the more compact form:

O_u = P_u + R_uv
O_v = P_v - R_uv
according to an embodiment, the difference between the residual signal of the first channel and the residual signal of the second channel is encoded in the bitstream.
According to an embodiment, the decoded residual signal is added to the block of samples in the first channel and subtracted from the block of samples in the second channel.
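Putting the compact form above into a short sketch (NumPy arrays stand in for the sample blocks; clipping to the valid sample range is omitted):

```python
import numpy as np

def apply_joint_residual(pred_u, pred_v, joint_res):
    """Reconstruct both chroma blocks from one decoded joint residual block."""
    out_u = np.asarray(pred_u, dtype=float) + joint_res   # added in the first channel
    out_v = np.asarray(pred_v, dtype=float) - joint_res   # subtracted in the second channel
    return out_u, out_v
```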
Fig. 6a and 6b show examples illustrating the benefits of a combined/shared residual mode according to an embodiment. Fig. 6a illustrates conventional coding of a residual signal. In this example, there are two chrominance channels, both with separate residual signals. The output blocks O_u and O_v are generated by adding the residual block to the prediction block on each channel. Fig. 6b in turn illustrates the operation in the shared residual mode. In this example, weights +1 and -1 are associated with the first and second channels, respectively. That is, the signaled residual is added to the prediction block in the first channel and subtracted from the prediction block in the second channel to form the two output blocks O_u and O_v.
Naturally, the solution can be generalized to use weight matrices instead of the scalar weights s_u and s_v. Alternatively or additionally, the weights may be signal dependent and derived separately for each sample using, for example, a linear, piecewise linear or non-linear mapping. One example of such a mapping is to use the reconstructed blocks O_u and O_v to calculate a linear model and apply the model to calculate the values of s_u(x, y) and s_v(x, y) at coordinates x, y within the sample block, giving:

O_u = P_u + s_u(x, y) * R_uv
O_v = P_v + s_v(x, y) * R_uv

or:

O_u(x, y) = P_u(x, y) + s_u(x, y) * R_uv(x, y)
O_v(x, y) = P_v(x, y) + s_v(x, y) * R_uv(x, y)

or, with the weights expressed as a function of the predicted signal:

O_u(x, y) = P_u(x, y) + s_u(P_u(x, y)) * R_uv(x, y)
O_v(x, y) = P_v(x, y) + s_v(P_v(x, y)) * R_uv(x, y)
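A sketch of this sample-wise weighted variant, with the weights supplied as mapping functions of the prediction samples (the linear mappings in the usage example are assumptions chosen only for illustration):

```python
import numpy as np

def apply_mapped_weights(pred_u, pred_v, joint_res, map_u, map_v):
    """Per-sample weights derived from the prediction samples via the supplied
    mappings map_u and map_v (linear, piecewise linear or non-linear)."""
    pred_u = np.asarray(pred_u, dtype=float)
    pred_v = np.asarray(pred_v, dtype=float)
    s_u = map_u(pred_u)                  # one weight per sample position
    s_v = map_v(pred_v)
    return pred_u + s_u * joint_res, pred_v + s_v * joint_res

# Usage with simple linear mappings chosen only for illustration:
out_u, out_v = apply_mapped_weights(
    np.full((4, 4), 120.0), np.full((4, 4), 130.0), np.ones((4, 4)),
    map_u=lambda p: 0.01 * p, map_v=lambda p: -0.01 * p)
```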
when applied in a video or image encoder, determining the combined or joint residual signal may include calculating one or more differences between the original or processed input samples and the predicted sample values. It may also include additional operations such as filtering or sample-based nonlinear operations applied to intermediate instances of the combined residual. The joint residual may be transformed to a different domain (e.g. using DCT) and quantized in the case of transform coding, or simply quantized if a coding mode such as transform skip is applied. If rate-distortion optimized quantization is applied to the combined residual, the encoder may advantageously adjust the lambda parameters for correlating the reconstruction error and the bits needed for encoding to be different from the lambda parameters used for the conventional residual coding mode, since in the case of combined residual coding, the residual is applied to multiple coded blocks. Alternatively, when making the mode selection, the encoder may choose to adjust the estimation of the quantization error.
Fig. 7 shows a block diagram of a video decoder suitable for employing an embodiment of the invention. Fig. 7 depicts the structure of a two-layer decoder, but it is to be appreciated that the decoding operation can similarly be used in a single-layer decoder.
The video decoder 550 includes a first decoder section 552 for a base layer and a second decoder section 554 for a prediction layer. Block 556 illustrates a demultiplexer for passing information about base layer pictures to the first decoder section 552 and information about prediction layer pictures to the second decoder section 554. Reference sign P'n represents a predicted representation of an image block. Reference sign D'n represents a reconstructed prediction error signal. Blocks 704, 804 illustrate the preliminary reconstructed image (I'n). Reference sign R'n represents the final reconstructed image. Blocks 703, 803 illustrate the inverse transform (T^-1). Blocks 702, 802 illustrate inverse quantization (Q^-1). Blocks 701, 801 illustrate entropy decoding (E^-1). Blocks 705, 805 illustrate a reference frame memory (RFM). Blocks 706, 806 illustrate prediction (P) (either inter prediction or intra prediction). Blocks 707, 807 illustrate filtering (F). Blocks 708, 808 may be used to combine the decoded prediction error information with the predicted base layer/prediction layer images to obtain the preliminary reconstructed images (I'n). The preliminary reconstructed and filtered base layer images may be output 709 from the first decoder section 552, and the preliminary reconstructed and filtered prediction layer images may be output 809 from the second decoder section 554.
In this context, a decoder should be interpreted to cover any unit of operation capable of performing decoding operations, such as a player, a receiver, a gateway, a demultiplexer and/or a decoder.
As another aspect, there is provided an apparatus comprising: at least one processor and at least one memory having code stored thereon which, when executed by the at least one processor, causes the apparatus to at least perform: determining a residual signal for at least one sample; determining whether the residual signal represents a sample residual for more than one channel; in response to the residual signal representing sample residuals in more than one channel, applying the residual signal for at least a first sample in a first channel for generating a first reconstructed sample; and applying the residual signal for at least a second sample in a second channel for generating a second reconstructed sample.
Such an apparatus further includes code stored in the at least one memory, which when executed by the at least one processor, causes the apparatus to perform one or more embodiments disclosed herein.
Fig. 8 is a graphical representation of an example multimedia communication system within which various embodiments may be implemented. The data source 1510 provides a source signal in an analog, uncompressed digital, or compressed digital format, or any combination of these formats. The encoder 1520 may include or be coupled to preprocessing, such as data format conversion and/or filtering of the source signal. The encoder 1520 encodes the source signal into an encoded media bitstream. It should be noted that the bitstream to be decoded may be received directly or indirectly from a remote device located within virtually any type of network. In addition, the bitstream may be received from local hardware or software. The encoder 1520 may be capable of encoding more than one media type, such as audio and video, or more than one encoder 1520 may be required to encode source signals of different media types. The encoder 1520 may also obtain synthetically produced input, such as graphics and text, or it may be capable of producing encoded bitstreams of synthetic media. In the following, only the processing of one encoded media bitstream of one media type is considered to simplify the description. It should be noted, however, that typical real-time broadcast services comprise several streams (typically at least one audio, video and text subtitle stream). It should also be noted that the system may include many encoders, but only one encoder 1520 is shown in the figure to simplify the description without a lack of generality. It should be further understood that although the text and examples contained herein may specifically describe an encoding process, one skilled in the art will understand that the same concepts and principles also apply to the corresponding decoding process and vice versa.
The encoded media bitstream may be transferred to a storage 1530. The storage 1530 may include any type of mass storage to store the coded media bitstream. The format of the encoded media bitstream in the storage 1530 may be a basic stand-alone bitstream format, or one or more encoded media bitstreams may be packaged into a container file, or the encoded media bitstreams may be packaged into a segment format suitable for DASH (or similar streaming system) and stored as a sequence of segments. If one or more media bitstreams are packaged in a container file, the one or more media bitstreams may be stored in the file using a file generator (not shown in the figures) and file format metadata is created or may be stored in the file. The encoder 1520 or the storage 1530 may include a file generator, or the file generator may be operatively attached to the encoder 1520 or the storage 1530. Some systems operate "in real-time", i.e., omit storage and transmit the encoded media bitstream directly from encoder 1520 to transmitter 1540. The encoded media bitstream may then be transmitted to a sender 1540 (also referred to as a server) as needed. The format used in the transmission may be a basic independent bitstream format, a packet stream format, a segmented format suitable for DASH (or similar streaming system), or one or more coded media bitstreams may be encapsulated into a container file. The encoder 1520, the storage 1530, and the server 1540 may reside in the same physical device, or they may be included in separate devices. The encoder 1520 and server 1540 may operate with real-time content, in which case the encoded media bitstream is typically not permanently stored, but is buffered for a short period of time in the content encoder 1520 and/or server 1540 to smooth out variations in processing delay, transmission delay, and encoded media bitrate.
Server 1540 transmits the coded media bitstream using a communication protocol stack. The stack may include, but is not limited to, one or more of real-time transport protocol (RTP), user Datagram Protocol (UDP), hypertext transfer protocol (HTTP), transmission Control Protocol (TCP), and Internet Protocol (IP). When the communication protocol stack is packet-oriented, server 1540 encapsulates the coded media bitstream into packets. For example, when RTP is used, the server 1540 encapsulates the coded media bitstream into RTP packets according to an RTP payload format. Typically, each media type has a dedicated RTP payload format. It should be noted again that a system may contain more than one server 1540, but for simplicity, the following description considers only one server 1540.
If the media content is packaged in a container file for the storage 1530 or for inputting data to the transmitter 1540, the transmitter 1540 may include or be operably attached to a "sending file parser" (not shown in the figure). In particular, if the container file is not transmitted as such, but at least one of the contained encoded media bitstreams is encapsulated for transport over a communication protocol, the sending file parser locates the appropriate portions of the encoded media bitstream to be conveyed over the communication protocol. The sending file parser may also assist in creating the correct format for the communication protocol, such as data headers and payloads. The multimedia container file may contain encapsulation instructions, such as hint tracks in the ISOBMFF, for encapsulating at least one of the contained media bitstreams over a communication protocol.
The server 1540 may or may not be connected to the gateway 1550 via a communication network, which may be, for example, a CDN, the Internet, and/or a combination of one or more access networks. The gateway may also or alternatively be referred to as a middlebox. For DASH, the gateway may be an edge server (of the CDN) or a network proxy. It is noted that the system may generally include any number of gateways or the like, but for simplicity, the following description considers only one gateway 1550. The gateway 1550 may perform different types of functions, such as translation of a packet stream according to one communication protocol stack to another communication protocol stack, merging and forking of data streams, and manipulation of the data stream according to downlink and/or receiver capabilities, such as controlling the bit rate of the forwarded stream according to prevailing downlink network conditions. In various embodiments, the gateway 1550 may be a server entity.
The system includes one or more receivers 1560 that are generally capable of receiving, demodulating, and decapsulating the transmitted signal into an encoded media bitstream. The encoded media bitstream may be transferred to a record storage 1570. The record storage 1570 may include any type of mass storage to store the encoded media bitstream. The record storage 1570 may alternatively or additionally include computing memory, such as random access memory. The format of the encoded media bitstream in the record storage 1570 may be a basic stand-alone bitstream format, or one or more encoded media bitstreams may be encapsulated into a container file. If there are multiple encoded media bitstreams associated with each other, such as an audio stream and a video stream, a container file is typically used, and the receiver 1560 includes or is attached to a container file generator that generates a container file from the input streams. Some systems operate "in real time", i.e., omit the record storage 1570 and transmit the encoded media bitstream directly from the receiver 1560 to the decoder 1580. In some systems, only the most recent portion of the recorded stream, e.g., the most recent 10-minute excerpt of the recorded stream, is kept in the record storage 1570, while any earlier recorded data is discarded from the record storage 1570.
The encoded media bitstream may be transferred from the record storage 1570 to the decoder 1580. If there are many encoded media bitstreams, such as an audio stream and a video stream, associated with each other and encapsulated into a container file, or if a single media bitstream is encapsulated in a container file, for example for easier access, a file parser (not shown in the figure) is used to decapsulate each encoded media bitstream from the container file. The record storage 1570 or the decoder 1580 may include the file parser, or the file parser may be attached to either the record storage 1570 or the decoder 1580. It should also be noted that the system may include many decoders, but only one decoder 1580 is discussed herein to simplify the description without a lack of generality.
The encoded media bitstream may be further processed by the decoder 1580, the output of which is one or more uncompressed media streams. Finally, the renderer 1590 may reproduce the uncompressed media streams, for example, using a loudspeaker or a display. The receiver 1560, the record storage 1570, the decoder 1580, and the renderer 1590 may reside in the same physical device, or they may be included in separate devices.
The transmitter 1540 and/or the gateway 1550 may be configured to perform switching between different representations, e.g., for switching between different viewports of 360-degree video content, view switching, bitrate adaptation and/or fast start-up, and/or the transmitter 1540 and/or the gateway 1550 may be configured to select the transmitted representation(s). Switching between different representations may take place for a variety of reasons, such as in response to a request by the receiver 1560 or a prevailing condition, such as throughput, of the network over which the bitstream is conveyed. In other words, the receiver 1560 may initiate a switch between representations. A request from the receiver may be, for example, a request for a segment or sub-segment from a different representation than the previous one, a request to change the transmitted scalability layers and/or sub-layers, or a change of a rendering device having different capabilities than the previous one. The request for a segment may be an HTTP GET request. The request for a sub-segment may be an HTTP GET request with a byte range. Additionally or alternatively, bitrate adjustment or bitrate adaptation may be used, for example, to provide a so-called fast start-up in streaming services, wherein the bitrate of the transmitted stream is lower than the channel bitrate after starting or random-accessing the stream in order to start playback immediately and to achieve a buffer occupancy level that tolerates occasional packet delays and/or retransmissions. Bitrate adaptation may include multiple representation or layer up-switching and representation or layer down-switching operations taking place in various orders.
The decoder 1580 may be configured to perform switching between different representations, e.g., for switching between different viewports of 360-degree video content, view switching, bitrate adaptation and/or fast start-up, and/or the decoder 1580 may be configured to select the transmitted representation(s). Switching between different representations may take place for a variety of reasons, such as to achieve faster decoding operation or to adapt the transmitted bitstream, e.g., in terms of bitrate, to prevailing conditions, such as throughput, of the network over which the bitstream is conveyed. Faster decoding operation may be needed, for example, if the device including the decoder 1580 is multi-tasking and uses computing resources for purposes other than decoding the video bitstream. In another example, faster decoding operation may be needed when content is played back at a faster pace than the normal playback speed, e.g., two or three times faster than conventional real-time playback.
In the foregoing, some embodiments have been described with reference to and/or using terminology of HEVC. It is to be appreciated that embodiments may be similarly implemented with any video encoder and/or video decoder.
In the above, where example embodiments have been described with reference to an encoder, it is to be understood that the resulting bitstream and decoder may have corresponding elements therein. Also, in the case where the example embodiments have been described with reference to a decoder, it is to be understood that an encoder may have a structure and/or a computer program for generating a bitstream to be decoded by the decoder. For example, some embodiments have been described in connection with generating a prediction block as part of encoding. An embodiment may be similarly implemented by generating a prediction block as part of decoding, except that encoding parameters such as horizontal and vertical offsets are decoded from the bitstream as determined by the encoder.
The embodiments of the invention described above describe the codec in terms of separate encoder and decoder devices to aid in understanding the processes involved. However, it is to be understood that the apparatus, structures and operations may be implemented as a single encoder-decoder apparatus/structure/operation. Furthermore, the encoder and decoder may share some or all of the common elements.
While the above examples describe embodiments of the invention operating within a codec within an electronic device, it is to be appreciated that the invention as defined in the claims may be implemented as part of any video codec. Thus, for example, embodiments of the invention may be implemented in a video codec that may implement video encoding over a fixed or wired communication path.
Thus, the user equipment may comprise a video codec such as described in the embodiments of the invention above. It should be appreciated that the term user equipment is intended to cover any suitable type of wireless user equipment, such as a mobile telephone, a portable data processing device or a portable web browser.
Furthermore, elements of the Public Land Mobile Network (PLMN) may also comprise a video codec as described above.
In general, the various embodiments of the invention may be implemented in hardware or special purpose circuits, software, logic or any combination thereof. For example, some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto. While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
Embodiments of the invention may be implemented by computer software executable by a data processor of a mobile device, such as in a processor entity, or by hardware, or by a combination of software and hardware. Further in this regard, it should be noted that any blocks of the logic flow as in the figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions. The software may be stored on such physical media as memory chips or memory blocks implemented within the processor, magnetic media such as hard disks or floppy disks, and optical media such as, for example, DVDs and CDs.
The memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory, and removable memory. The data processor may be of any type suitable to the local technical environment and may include, as non-limiting examples, one or more of a general purpose computer, a special purpose computer, a microprocessor, a Digital Signal Processor (DSP), and a processor based on a multi-core processor architecture.
Embodiments of the invention may be practiced in various components such as integrated circuit modules. The design of integrated circuits is generally a highly automated process. Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.
Programs, such as those provided by Synopsys, Inc. of Mountain View, California and Cadence Design, of San Jose, California, automatically route conductors and locate components on a semiconductor chip using well-established design rules as well as libraries of pre-stored design modules. Once the design of the semiconductor circuit has been completed, the resulting design, in a standardized electronic format (e.g., Opus, GDSII, or the like), may be transmitted to a semiconductor fabrication facility or "fab" for fabrication.
The foregoing description provides a complete and informative description of exemplary embodiments of the invention by way of exemplary and non-limiting examples. However, various modifications and adaptations may become apparent to those skilled in the relevant arts in view of the foregoing description, when read in conjunction with the accompanying drawings and the appended claims. However, all such and similar modifications of the teachings of this invention will still fall within the scope of this invention.

Claims (15)

1. An apparatus for motion compensated prediction, comprising:
means for decoding a bitstream, the bitstream comprising syntax elements describing transform coefficients or sample differences for residual samples;
means for determining, based on the syntax element, a residual signal for at least one sample for at least a first chroma channel, the residual signal comprising a value;
means for decoding a combined residual flag indicating whether the residual signal represents a residual for samples in more than one chroma channel;
means for applying the residual signal comprising the value to at least a first sample in a first chroma channel for generating a first reconstructed sample in response to the residual signal representing a residual for samples in more than one chroma channel, such that the first sample in the first chroma channel is set based on the value; and
means for applying the residual signal comprising the value for at least a second sample in a second chroma channel for generating a second reconstructed sample such that the second sample in the second chroma channel is set based on the value.
2. The apparatus of claim 1, further comprising
means for applying the combined residual signal for a chrominance channel of a still image or video sequence.
3. The apparatus of claim 1, further comprising
means for decoding a single residual block in response to the combined residual flag being 1 or true; and
means for applying the residual to both the first chrominance channel and the second chrominance channel.
4. The apparatus of claim 1, further comprising
means for decoding the encoded block flag for the first chroma channel and the encoded block flag for the second chroma channel;
means for decoding the combined residual flag in response to the two encoded block flags being either 1 or true;
means for decoding a single residual block in response to the combined residual flag being 1 or true; and
means for applying the single residual block for both the first chroma channel and the second chroma channel.
5. The apparatus of claim 1, further comprising
means for decoding a single residual block in response to the combined residual flag being 1 or true;
means for adding the single residual block to a prediction block in a first chroma channel; and
means for subtracting the single residual block from the prediction block in the second chroma channel.
6. The apparatus of claim 1, further comprising
means for decoding an identifier associated with the first chroma channel; and
means for subtracting a residual signal of the second chroma channel from a predicted signal in the first chroma channel.
7. The apparatus of claim 1, further comprising
means for applying the combined residual signal to a subset of blocks determined by bitstream signaling.
8. The apparatus of claim 7, further comprising
means for applying the combined residual signal to the block using prediction modes belonging to a predetermined set of prediction modes.
9. The apparatus of claim 7, further comprising
means for applying the combined residual signal to the block using a residual coding mode belonging to a predetermined set of residual coding modes.
10. The apparatus of any one of claims 1 to 9, further comprising
means for applying post-processing to at least the first reconstructed sample to generate at least a first output sample in the first chroma channel; and
means for applying post-processing to at least the second reconstructed sample to generate at least a second output sample in the second chroma channel.
11. A method for motion compensated prediction, the method comprising
decoding a bitstream, the bitstream comprising syntax elements describing transform coefficients or sample differences for residual samples;
determining a residual signal for at least one sample based on the syntax element for at least a first chroma channel, the residual signal comprising a value;
decoding a combined residual flag indicating whether the residual signal represents a residual for samples in more than one chroma channel;
in response to the residual signal representing a residual for samples in more than one chroma channel, applying the residual signal comprising the value to at least a first sample in a first chroma channel for generating a first reconstructed sample such that the first sample in the first chroma channel is set based on the value; and
the residual signal comprising the value is applied for at least a second sample in a second chroma channel for generating a second reconstructed sample such that the second sample in the second chroma channel is set based on the value.
12. The method of claim 11, further comprising
applying the combined residual signal for the chrominance channels of a still image or video sequence.
13. The method of claim 12, further comprising
including, in the bitstream, a flag for indicating that at least one predefined condition for residual decoding is met.
14. The method of any of claims 11 to 13, further comprising
decoding a single residual block in response to the combined residual flag being 1 or true; and
applying the residual to both the first chrominance channel and the second chrominance channel.
15. An apparatus for motion compensated prediction, comprising:
at least one processor and at least one memory having code stored thereon which, when executed by the at least one processor, causes the apparatus to at least perform:
decoding a bitstream, the bitstream comprising syntax elements describing transform coefficients or sample differences for residual samples;
determining a residual signal for at least one sample based on the syntax element for at least a first chroma channel, the residual signal comprising a value;
decoding a combined residual flag indicating whether the residual signal represents a residual for samples in more than one chroma channel;
in response to the residual signal representing sample residuals in more than one chroma channel, applying the residual signal including the value to at least a first sample in a first chroma channel for generating a first reconstructed sample such that the first sample in the first chroma channel is set based on the value; and
the residual signal comprising the value is applied for at least a second sample in a second chroma channel for generating a second reconstructed sample such that the second sample in the second chroma channel is set based on the value.
CN201911101171.2A 2018-12-17 2019-11-12 Apparatus, method and computer program for video encoding and decoding Active CN111327893B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
FI20186097 2018-12-17
FI20186097 2018-12-17

Publications (2)

Publication Number Publication Date
CN111327893A CN111327893A (en) 2020-06-23
CN111327893B true CN111327893B (en) 2023-10-17

Family

ID=68655383

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911101171.2A Active CN111327893B (en) 2018-12-17 2019-11-12 Apparatus, method and computer program for video encoding and decoding

Country Status (5)

Country Link
US (1) US11212548B2 (en)
EP (1) EP3672255A1 (en)
CN (1) CN111327893B (en)
PH (1) PH12019000380A1 (en)
ZA (1) ZA201908191B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3700205A1 (en) * 2019-02-19 2020-08-26 Nokia Technologies Oy Quantization parameter derivation for cross-channel residual encoding and decoding
KR20210139336A (en) * 2019-03-12 2021-11-22 프라운호퍼-게젤샤프트 추르 푀르데룽 데어 안제반텐 포르슝 에 파우 Optional inter-component transformation for image and video coding
KR20200109276A (en) 2019-03-12 2020-09-22 주식회사 엑스리스 Method for encoding/decoidng video signal and apparatus therefor
CN114365490B (en) * 2019-09-09 2024-06-18 北京字节跳动网络技术有限公司 Coefficient scaling for high precision image and video codecs
EP4018648A4 (en) 2019-09-21 2022-11-23 Beijing Bytedance Network Technology Co., Ltd. High precision transform and quantization for image and video coding
WO2021079951A1 (en) * 2019-10-25 2021-04-29 パナソニック インテレクチュアル プロパティ コーポレーション オブ アメリカ Coding device, decoding device, coding method, and decoding method
US11719160B2 (en) * 2020-02-03 2023-08-08 Rohr, Inc. Acoustic liner and method of forming same
CN112468818B (en) * 2021-01-22 2021-06-29 腾讯科技(深圳)有限公司 Video communication realization method and device, medium and electronic equipment
US11601656B2 (en) * 2021-06-16 2023-03-07 Western Digital Technologies, Inc. Video processing in a data storage device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102144391A (en) * 2008-09-05 2011-08-03 微软公司 Skip modes for inter-layer residual video coding and decoding
CN105580373A (en) * 2013-07-23 2016-05-11 诺基亚技术有限公司 An apparatus, a method and a computer program for video coding and decoding

Family Cites Families (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030112863A1 (en) * 2001-07-12 2003-06-19 Demos Gary A. Method and system for improving compressed image chroma information
KR100723408B1 (en) * 2004-07-22 2007-05-30 삼성전자주식회사 Method and apparatus to transform/inverse transform and quantize/dequantize color image, and method and apparatus to encode/decode color image using it
US8050915B2 (en) * 2005-07-11 2011-11-01 Lg Electronics Inc. Apparatus and method of encoding and decoding audio signals using hierarchical block switching and linear prediction coding
US8300698B2 (en) * 2006-10-23 2012-10-30 Qualcomm Incorporated Signalling of maximum dynamic range of inverse discrete cosine transform
AU2012281918C1 (en) * 2011-07-11 2016-11-17 Sun Patent Trust Decoding Method, Coding Method, Decoding Apparatus, Coding Apparatus, And Coding and Decoding Apparatus
US9948938B2 (en) * 2011-07-21 2018-04-17 Texas Instruments Incorporated Methods and systems for chroma residual data prediction
CN103918265B (en) * 2011-11-07 2018-09-18 英特尔公司 Across channel residual prediction
JP5325360B1 (en) * 2011-12-15 2013-10-23 パナソニック株式会社 Image encoding method and image encoding apparatus
WO2013118485A1 (en) * 2012-02-08 2013-08-15 パナソニック株式会社 Image-encoding method, image-decoding method, image-encoding device, image-decoding device, and image-encoding-decoding device
EP2868078A4 (en) * 2012-06-27 2016-07-27 Intel Corp Cross-layer cross-channel residual prediction
CN104604225B (en) * 2012-09-10 2018-01-26 太阳专利托管公司 Method for encoding images, picture decoding method, picture coding device, picture decoding apparatus and image encoding/decoding device
AU2012232992A1 (en) * 2012-09-28 2014-04-17 Canon Kabushiki Kaisha Method, apparatus and system for encoding and decoding the transform units of a coding unit
RU2641223C2 (en) * 2012-11-08 2018-01-16 Кэнон Кабусики Кайся Method, device and system for coding and decoding units of coding unit conversion
US9615090B2 (en) * 2012-12-28 2017-04-04 Qualcomm Incorporated Parsing syntax elements in three-dimensional video coding
US9148672B2 (en) * 2013-05-08 2015-09-29 Mediatek Inc. Method and apparatus for residue transform
KR20150027530A (en) * 2013-09-04 2015-03-12 한국전자통신연구원 High efficiency video coding intra frame prediction apparatus and method thereof
US10397607B2 (en) * 2013-11-01 2019-08-27 Qualcomm Incorporated Color residual prediction for video coding
EP3120561B1 (en) * 2014-03-16 2023-09-06 VID SCALE, Inc. Method and apparatus for the signaling of lossless video coding
WO2016203981A1 (en) * 2015-06-16 2016-12-22 シャープ株式会社 Image decoding device and image encoding device
US20180160118A1 (en) * 2015-06-18 2018-06-07 Sharp Kabushiki Kaisha Arithmetic decoding device and arithmetic coding device
CN109196863B (en) * 2016-05-27 2021-07-30 夏普株式会社 System and method for changing quantization parameter
WO2018037853A1 (en) * 2016-08-26 2018-03-01 シャープ株式会社 Image decoding apparatus and image coding apparatus
WO2018061550A1 (en) * 2016-09-28 2018-04-05 シャープ株式会社 Image decoding device and image coding device
US20200045305A1 (en) * 2016-09-30 2020-02-06 Lg Electronics Inc. Picture processing method and apparatus for same
US10523966B2 (en) * 2017-03-31 2019-12-31 Mediatek Inc. Coding transform blocks
JP2021010046A (en) * 2017-10-06 2021-01-28 シャープ株式会社 Image encoding device and image decoding device

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102144391A (en) * 2008-09-05 2011-08-03 微软公司 Skip modes for inter-layer residual video coding and decoding
CN105580373A (en) * 2013-07-23 2016-05-11 诺基亚技术有限公司 An apparatus, a method and a computer program for video coding and decoding

Also Published As

Publication number Publication date
US11212548B2 (en) 2021-12-28
ZA201908191B (en) 2022-06-29
EP3672255A1 (en) 2020-06-24
CN111327893A (en) 2020-06-23
US20200195953A1 (en) 2020-06-18
PH12019000380A1 (en) 2020-09-28

Similar Documents

Publication Publication Date Title
KR102191846B1 (en) Video encoding and decoding
CN111327893B (en) Apparatus, method and computer program for video encoding and decoding
US9800893B2 (en) Apparatus, a method and a computer program for video coding and decoding
KR102474636B1 (en) Quantization parameter derivation for cross-channel residual encoding and decoding
KR20170101983A (en) Interlayer Prediction for Scalable Video Coding and Decoding
JP2018524897A (en) Video encoding / decoding device, method, and computer program
US11223849B2 (en) Transform sign compression in video encoding and decoding
CN113711594A (en) Apparatus, method and computer program for video encoding and decoding
JP7390477B2 (en) Apparatus, method, and computer program for video coding and decoding
RU2795346C1 (en) Device method and computer program for encoding and decoding video
US20240007672A1 (en) An apparatus, a method and a computer program for video coding and decoding
US20220078481A1 (en) An apparatus, a method and a computer program for video coding and decoding
KR20240027829A (en) Apparatus, method and computer program for cross-component parameter calculation
GB2534591A (en) Video encoding and decoding
WO2024074754A1 (en) An apparatus, a method and a computer program for video coding and decoding
WO2019211522A2 (en) An apparatus, a method and a computer program for video coding and decoding

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40032097

Country of ref document: HK

GR01 Patent grant