AU2016203314A1 - Method, apparatus and system for encoding and decoding video data - Google Patents


Info

Publication number
AU2016203314A1
Authority
AU
Australia
Prior art keywords
data
video
output buffer
residual data
rice parameter
Prior art date
Legal status
Abandoned
Application number
AU2016203314A
Inventor
Jonathan GAN
Volodymyr KOLESNIKOV
Christopher James ROSEWARNE
Current Assignee
Canon Inc
Original Assignee
Canon Inc
Priority date
Application filed by Canon Inc
Priority to AU2016203314A
Publication of AU2016203314A1


Abstract

A method of encoding video data. Residual data of a block of the video data to be encoded is received. A Rice parameter to encode the residual data is determined, the Rice parameter being determined to increase a size of the received residual data such that a size of the encoded residual data exceeds a minimum output buffer utilisation. The minimum output buffer utilisation indicates an amount of space in an output buffer that the received residual data is required to fill to prevent buffer underrun of the output buffer for a video bitstream. The received residual data is stored to the output buffer using the determined Rice parameter to encode the video data.

[Fig. 2A: schematic block diagram of a general purpose computer system 200, showing the computer module 201 with processor 205, memory 206, I/O interfaces 207, 208 and 213, storage devices including HDD 210 and optical disk drive 212, and connections to a wide-area communications network 220 and a local-area communications network 222.]

Description

METHOD, APPARATUS AND SYSTEM FOR ENCODING AND DECODING
VIDEO DATA
TECHNICAL FIELD
The present invention relates generally to digital video signal processing and, in particular, to a method, apparatus and system for encoding and decoding video data. The present invention also relates to a computer program product including a computer readable medium having recorded thereon a computer program for encoding and decoding video data.
BACKGROUND
Many applications for video coding currently exist, including applications for transmission and storage of video data. Many video coding standards have also been developed and others are currently in development. Recent developments in video coding standardisation have led to the formation of a group called the “Joint Collaborative Team on Video Coding” (JCT-VC). The Joint Collaborative Team on Video Coding (JCT-VC) includes members of Study Group 16, Question 6 (SG16/Q6) of the Telecommunication Standardisation Sector (ITU-T) of the International Telecommunication Union (ITU), known as the Video Coding Experts Group (VCEG), and members of the International Organization for Standardization / International Electrotechnical Commission Joint Technical Committee 1 / Subcommittee 29 / Working Group 11 (ISO/IEC JTC1/SC29/WG11), also known as the Moving Picture Experts Group (MPEG).
The Joint Collaborative Team on Video Coding (JCT-VC) has produced a new video coding standard that significantly outperforms the “H.264/MPEG-4 AVC” (ISO/IEC 14496-10) video coding standard. The new video coding standard has been named “high efficiency video coding (HEVC)”. Further development of high efficiency video coding (HEVC) is directed towards introducing improved support for content known variously as ‘screen content’ or ‘discontinuous tone content’. Such content is typical of video output from a computer or a tablet device, e.g. from a DVI connector or as would be transmitted over a wireless HDMI link. The content is poorly handled by previous video compression standards and thus a new activity directed towards improving the achievable coding efficiency for this type of content is underway.
Other developments, e.g. in the Video Electronics Standards Association (VESA), have been directed towards video coding algorithms capable of latencies under one frame. Traditional video compression standards, such as H.264/AVC and HEVC, have latencies of multiple frames, as measured from the input of the encoding process to the output of the decoding process. Codecs complying with such standards may be termed ‘distribution codecs’, as they are intended to provide compression for distribution of video data from a source, such as a studio, to the end consumer, e.g. via terrestrial broadcast or internet streaming. Note that HEVC does have signalling support for latencies under one frame, in the form of a Decoding Unit Supplementary Enhancement Information (SEI) message.
The Decoding Unit SEI message is an extension of the timing signalling present in the Picture Timing SEI message, allowing specification of the timing of units less than one frame. However, the signalling is insufficient to achieve very low latencies with minimal buffering and the consequent tight coupling of the encoding and decoding processes. Applications requiring low latency are generally present within a broadcast studio.
In a broadcast studio, video may be captured by a camera before undergoing several transformations, including real-time editing, graphic and overlay insertion and muxing. Once the video has been adequately processed, a distribution encoder is used to encode the video data for final distribution to end consumers. Within the studio, the video data is generally transported in an uncompressed format. This necessitates the use of very high speed links. Variants of the Serial Digital Interface (SDI) protocol can transport different video formats. For example, 3G-SDI (operating with a 3Gbps electrical link) can transport 1080p HDTV (1920x1080 resolution) at 30fps and 8 bits per sample. Interfaces having a fixed bit rate are suited to transporting data having a constant bit rate (CBR). Uncompressed video data is generally CBR, and compressed video data may also be CBR. As bit rates increase, achievable cabling lengths reduce, which becomes problematic for cable routing through a studio. For example, UHDTV (3840x2160) requires a 4X increase in bandwidth compared to 1080p HDTV, implying a 12Gbps interface. Increasing the data rate of a single electrical channel reduces the achievable length of the cabling. At 3 Gbps, cable runs generally cannot exceed 150m, the minimum usable length for studio applications. One method of achieving higher rate links is by replicating cabling, e.g. by using four 3G-SDI links, with frame tiling or some other multiplexing scheme. However, the cabling replication method increases cable routing complexity, requires more physical space, and may reduce reliability compared to use of a single cable. Thus, a codec that can perform compression at relatively low compression ratios (e.g. 4:1) while retaining a ‘visually lossless’ (i.e. having no perceivable artefacts compared to the original video data) level of performance is desired.
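The bandwidth figures above can be checked with simple arithmetic. The following sketch is illustrative only; the function name is an assumption and SDI line/frame blanking overhead is deliberately excluded:

```python
def active_video_gbps(width, height, fps, bit_depth, channels=3):
    """Active-picture bit rate in Gb/s; SDI line and frame blanking
    add further overhead on the physical link."""
    return width * height * channels * bit_depth * fps / 1e9

hd = active_video_gbps(1920, 1080, 30, 8)    # ~1.49 Gb/s, within a 3G-SDI link
uhd = active_video_gbps(3840, 2160, 30, 8)   # four times the 1080p figure
print(f"1080p30: {hd:.2f} Gb/s, UHD30: {uhd:.2f} Gb/s, ratio: {uhd / hd:.0f}x")
# At a 4:1 compression ratio the UHD payload (~1.49 Gb/s) again fits a
# single 3G-SDI link, which is the motivation given above.
```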
Activity within VESA has produced a standard named Display Stream Compression (DSC) and is standardising a newer variant named Advanced Display Stream Compression (ADSC). However, this activity is directed more towards distribution of high-resolution video data within electronic devices, such as smart phones and tablets, as a means of reducing the printed circuit board (PCB) routing difficulties for supporting very high resolutions (e.g. as used in ‘retina’ displays), by reducing either the clock rate or the number of the required PCB traces. As such, ADSC is targeting applications where a single encode-decode cycle (‘single-generation’ operation) is anticipated. Within a broadcast studio, video data is typically passed between several processing stages prior to final encoding for distribution. For passing UHD video data through bandwidth-limited interfaces, such as 3G-SDI, multiple encode-decode cycles (‘multi-generational’ operation) are anticipated. Consequently, the quality level of the video data must remain visually lossless after as many as seven encode-decode cycles.
Video data includes one or more colour channels. Generally there is one primary colour channel and two secondary colour channels. The primary colour channel is generally referred to as the ‘luma’ channel and the secondary colour channel(s) are generally referred to as the ‘chroma’ channels. Video data is represented using a colour space, such as ‘YCbCr’ or ‘RGB’. For screen content applications, ‘RGB’ is commonly used, as this is the format generally used to drive LCD panels. Note that the greatest signal strength is present in the ‘G’ (green) channel, so generally the G channel is coded using the primary colour channel, and the remaining channels (i.e. ‘B’ and ‘R’) are coded using the secondary colour channels. This arrangement may be referred to as ‘GBR’. When the ‘YCbCr’ colour space is in use, the ‘Y’ channel is coded using the primary colour channel and the ‘Cb’ and ‘Cr’ channels are coded using the secondary colour channels.
Video data is also represented using a particular chroma format. The primary colour channel and the secondary colour channels are spatially sampled at the same spatial density when the 4:4:4 chroma format is in use. For screen content, the commonly used chroma format is 4:4:4, as generally LCD panels provide pixels at a 4:4:4 chroma format. The bit-depth defines the bit width of samples in the respective colour channel, which implies a range of available sample values. Generally, all colour channels have the same bit-depth, although they may alternatively have different bit-depths. Other chroma formats are also possible. For example, if the chroma channels are sampled at half the rate horizontally (compared to the luma channel), a 4:2:2 chroma format is said to be in use.
Also, if the chroma channels are sampled at half the rate horizontally and vertically (compared to the luma channel), a 4:2:0 chroma format is said to be in use. These chroma formats exploit a characteristic of the human visual system that sensitivity to intensity is higher than sensitivity to colour. As such, it is possible to reduce sampling of the colour channels without causing undue visual impact. However, this property is less applicable to studio environments, where multiple generations of encoding and decoding are common. Also, for screen content the use of chroma formats other than 4:4:4 can be problematic as distortion is introduced to aliased text and sharp object edges.
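As an illustration of the sampling densities just described, the following sketch (a hypothetical helper, not part of this specification) counts samples per frame for each chroma format:

```python
def samples_per_frame(width, height, chroma_format):
    """Total samples per frame for one luma and two chroma channels.

    4:4:4 - chroma at full resolution
    4:2:2 - chroma halved horizontally
    4:2:0 - chroma halved horizontally and vertically
    """
    sub_x, sub_y = {"4:4:4": (1, 1), "4:2:2": (2, 1), "4:2:0": (2, 2)}[chroma_format]
    luma = width * height
    chroma = (width // sub_x) * (height // sub_y)
    return luma + 2 * chroma

for fmt in ("4:4:4", "4:2:2", "4:2:0"):
    print(fmt, samples_per_frame(1920, 1080, fmt))
# 4:2:0 carries half the samples of 4:4:4, which is the saving these
# formats trade against chroma fidelity.
```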
Frame data may also contain a mixture of screen content and camera captured content. For example, a computer screen may include various windows, icons and control buttons, text, and also contain a video being played, or an image being viewed. Such content, in terms of the entirety of a computer screen, can be referred to as ‘mixed content’. Moreover, the level of detail (or ‘texture’) varies within a frame. Generally, regions containing detailed textures (e.g. foliage, text) or noise (e.g. from a camera sensor) are difficult to compress. The detailed textures can only be coded at a low compression ratio without losing detail. Conversely, regions with little detail (e.g. flat regions, sky, background from a computer application) can be coded with a high compression ratio, with little loss of detail.
In the context of sub-frame latency video compression, the buffering included in the video encoder and the video decoder is generally substantially smaller than one frame (e.g. only dozens of lines of samples). Then, the video encoder and video decoder must not only operate in real-time, but also with sufficiently tightly controlled timing that the available buffers do not underrun or overrun. In the context of real-time operation, it is not possible to stall input or delay output (e.g. due to buffer overrun or underrun). If input were stalled or output delayed, the result would be some highly noticeable distortion of the video data being passed through the video encoder and decoder. Thus, a need exists for algorithms to control the behaviour of the video encoder and decoder to avoid such situations.
SUMMARY
It is an object of the present invention to substantially overcome, or at least ameliorate, one or more disadvantages of existing arrangements.
According to one aspect of the present disclosure, there is provided a method of encoding video data, the method comprising: receiving residual data of a block of the video data to be encoded; determining a Rice parameter to encode the residual data, the Rice parameter being determined to increase a size of the received residual data such that a size of the encoded residual data exceeds a minimum output buffer utilisation, the minimum output buffer utilisation indicating an amount of space in an output buffer that the received residual data is required to fill to prevent buffer underrun of the output buffer for a video bitstream; and storing the received residual data to the output buffer using the determined Rice parameter to encode the video data.
According to another aspect of the present disclosure, there is provided an encoder for encoding video data, the encoder comprising: module for receiving residual data of a block of the video data to be encoded; module for determining a Rice parameter to encode the residual data, the Rice parameter being determined to increase a size of the received residual data such that a size of the encoded residual data exceeds a minimum output buffer utilisation, the minimum output buffer utilisation indicating an amount of space in an output buffer that the received residual data is required to fill to prevent buffer underrun of the output buffer for a video bitstream; and module for storing the received residual data to the output buffer using the determined Rice parameter to encode video data.
According to still another aspect of the present disclosure, there is provided a system for encoding video data to produce a band limited bit rate video bitstream, the system comprising: a memory for storing data and a computer program; a processor coupled to the memory for executing the computer program, the computer program comprising instructions for: receiving residual data of a block of the video data to be encoded; determining a Rice parameter to encode the residual data, the Rice parameter being determined to increase a size of the received residual data such that a size of the encoded residual data exceeds a minimum output buffer utilisation, the minimum output buffer utilisation indicating an amount of space in an output buffer that the received residual data is required to fill to prevent buffer underrun of the output buffer for a video bitstream; and storing the received residual data to the output buffer using the determined Rice parameter to encode the video data.
According to still another aspect of the present disclosure, there is provided an apparatus for encoding video data, the apparatus comprising: means for receiving residual data of a block of the video data to be encoded; means for determining a Rice parameter to encode the residual data, the Rice parameter being determined to increase a size of the received residual data such that a size of the encoded residual data exceeds a minimum output buffer utilisation, the minimum output buffer utilisation indicating an amount of space in an output buffer that the received residual data is required to fill to prevent buffer underrun of the output buffer for a video bitstream; and means for storing the received residual data to the output buffer using the determined Rice parameter to encode the video data.
According to still another aspect of the present disclosure, there is provided a computer readable medium having a program stored thereon for encoding video data to produce a band limited bit rate video bitstream, the program comprising: code for receiving residual data of a block of the video data to be encoded; code for determining a Rice parameter to encode the residual data, the Rice parameter being determined to increase a size of the received residual data such that a size of the encoded residual data exceeds a minimum output buffer utilisation, the minimum output buffer utilisation indicating an amount of space in an output buffer that the received residual data is required to fill to prevent buffer underrun of the output buffer for a video bitstream; and code for storing the received residual data to the output buffer using the determined Rice parameter to encode the video data. Other aspects are also disclosed.
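To make the ‘determining’ step recited in the aspects above concrete, the following is a minimal sketch, assuming unsigned residual magnitudes coded with plain Rice codes; the helper names, the value of k_max and the linear search are illustrative assumptions rather than the claimed implementation:

```python
def rice_code_length(value, k):
    """Bits for the Rice code of a non-negative value: a unary quotient
    (value >> k) with a terminating bit, then k remainder bits."""
    return (value >> k) + 1 + k

def coded_size(residuals, k):
    return sum(rice_code_length(v, k) for v in residuals)

def choose_rice_parameter(residuals, min_bits, k_max=15):
    """Smallest Rice parameter whose coded size meets the minimum output
    buffer utilisation. For predominantly small residuals, raising k pads
    each codeword with remainder bits, increasing the encoded size."""
    for k in range(k_max + 1):
        if coded_size(residuals, k) >= min_bits:
            return k
    return k_max

# A block of small residuals that would otherwise under-fill a CBR output
# buffer requiring at least 96 bits from this block to avoid underrun.
block = [0, 1, 0, 2, 0, 0, 1, 3, 0, 0, 0, 1, 2, 0, 1, 0]
k = choose_rice_parameter(block, min_bits=96)
print(k, coded_size(block, k))   # k = 5 gives exactly 96 bits here
```

The point of the sketch is the inversion of the usual goal: rather than minimising the coded size, the parameter is raised until the coded residual is large enough to keep the CBR output buffer fed.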
BRIEF DESCRIPTION OF THE DRAWINGS
At least one embodiment of the present invention will now be described with reference to the following drawings and appendices, in which:
Fig. 1 is a schematic block diagram showing a sub-frame latency video encoding and decoding system;
Figs. 2A and 2B form a schematic block diagram of a general purpose computer system upon which one or both of the video encoding and decoding system of Fig. 1 may be practiced;
Fig. 3 is a schematic block diagram showing functional modules of a video encoder;
Fig. 4 is a schematic block diagram showing functional modules of a video decoder;
Fig. 5A is a schematic block diagram showing square coding tree unit (CTU) configurations for the sub-frame latency video encoding and decoding system of Fig. 1;
Fig. 5B is a schematic block diagram showing non-square coding tree unit (CTU) configurations for the sub-frame latency video encoding and decoding system of Fig. 1;
Fig. 5C is a schematic block diagram showing square block configurations for the sub-frame latency video encoding and decoding system of Fig. 1;
Fig. 6A is a schematic block diagram showing a horizontally scanned sub-block;
Fig. 6B is a schematic block diagram showing a vertically scanned sub-block;
Fig. 6C is a schematic block diagram showing a diagonally-scanned sub-block;
Fig. 7 is a schematic block diagram showing a coefficient magnitude syntax structure;
Fig. 8 is a schematic block diagram showing a bitstream portion with residual data and a coded data buffer;
Fig. 9 is a schematic block diagram showing a truncated residual for a sub-block;
Fig. 10 is a schematic flow diagram showing a method of padding a bitstream with data to meet a minimum buffer utilisation requirement; and
Fig. 11 is a schematic flow diagram showing a method of padding a bitstream by adjusting a Rice parameter for coded residual data to meet a minimum buffer utilisation requirement.
DETAILED DESCRIPTION INCLUDING BEST MODE
Where reference is made in any one or more of the accompanying drawings to steps and/or features, which have the same reference numerals, those steps and/or features have for the purposes of this description the same function(s) or operation(s), unless the contrary intention appears.
Fig. 1 is a schematic block diagram showing functional modules of a sub-frame latency video encoding and decoding system 100. The system 100 may use a rate control mechanism to ensure delivery of portions of a frame by the video encoder 114 within a timeframe that allows the video decoder 134 to deliver decoded frame data in real time.
The rate control mechanism ensures that no buffer underruns, and resulting failures to deliver decoded video, occur (e.g. due to variations in the complexity of the incoming video data to the video encoder 114 and in the time taken for encoder searching of possible modes), so that decoded video frames from the video decoder 134 are delivered according to the timing of the interface over which the video frames are delivered. The interface over which the video frames are delivered may be, for example, SDI. Interfaces such as SDI have sample timing synchronised to a clock source, with horizontal and vertical blanking periods. As such, samples of the decoded video need to be delivered in accordance with the frame timing of the SDI link. Video data formatted for transmission over SDI may also be conveyed over Ethernet, e.g. using methods as specified in SMPTE ST. 2022-6. In the event that samples were not delivered according to the required timing, noticeable visual artefacts would result (e.g. from invalid data being interpreted as sample values by the downstream device). Accordingly, the rate control mechanism ensures that no buffer overruns and resulting inability to process incoming video data occur. A similar constraint exists for the inbound SDI link to the video encoder 114, which needs to encode samples in accordance with arrival timing and may not stall incoming video data to the video encoder 114, e.g. due to varying processing demand for encoding different regions of a frame.
As mentioned previously, the video encoding and decoding system 100 has a latency of less than one frame of video data. In particular, some applications require latencies as low as 32 lines of video data from the input of the video encoder 114 to the output of the video decoder 134. The latency may include time taken during input/output of video data and storage of partially-coded video data prior to and after transit over a communications channel. Generally, video data is transmitted and received in raster scan order, e.g. over an SDI link. However, the video encoding and decoding system 100 processes video data in coding tree units (CTUs). Each frame is divided into an array of square-shaped CTUs. The video encoder 114 requires all samples in a given CTU before encoding of that CTU can begin, as illustrated in the sketch below. The structure of a CTU is described further with reference to Fig. 5A. An alternative structure, using a non-square shaped CTU, is described with reference to Fig. 5B.
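The contribution of CTU buffering to latency can be illustrated with simple line arithmetic; the assumption of one CTU row buffered at each end, and the 16-line CTU height, are illustrative rather than limits stated in this specification:

```python
def round_trip_line_latency(ctu_height, extra_lines=0):
    """Illustrative structural lower bound: the encoder must accumulate a
    full CTU row of raster-scan lines before encoding can begin, and the
    decoder must hold a decoded CTU row before raster-scan output can
    begin; extra_lines stands in for channel and other delays."""
    return 2 * ctu_height + extra_lines

# With 16-line CTUs the structural buffering alone accounts for 32 lines,
# matching the latency target mentioned above (ignoring channel delay).
print(round_trip_line_latency(16))
```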
The system 100 includes a source device 110 and a destination device 130. A communication channel 120 is used to communicate encoded video information from the source device 110 to the destination device 130. In some arrangements, the source device 110 and destination device 130 may comprise respective broadcast studio equipment, such as overlay insertion and real-time editing modules, in which case the communication channel 120 may be an SDI link. In other arrangements, the source device 110 and destination device 130 may comprise a graphics driver as part of a system-on-chip (SOC) and an LCD panel (e.g. as found in a smart phone, tablet or laptop computer), in which case the communication channel 120 is typically a wired channel, such as PCB trackwork and associated connectors. Moreover, the source device 110 and the destination device 130 may comprise any of a wide range of devices, including devices supporting over the air television broadcasts, cable television applications, internet video applications and applications where encoded video data is captured on some storage medium or a file server. The source device 110 may also be a digital camera capturing video data and outputting the video data in a compressed format offering visually lossless compression; as such, the performance may be considered equivalent to a truly lossless (e.g. uncompressed) format.
As shown in Fig. 1, the source device 110 includes a video source 112, the video encoder 114 and a transmitter 116. The video source 112 typically comprises a source of captured video frame data, such as an imaging sensor, a previously captured video sequence stored on a non-transitory recording medium, or a video feed from a remote imaging sensor. The video source 112 may also be the output of a computer graphics card, e.g. displaying the video output of an operating system and various applications executing upon a computing device, for example a tablet computer. Such content is an example of ‘screen content’. Examples of source devices 110 that may include an imaging sensor as the video source 112 include smart-phones, video camcorders and network video cameras. The video encoder 114 converts the captured frame data from the video source 112 into encoded video data and will be described further with reference to Fig. 3.
The video encoder 114 encodes a given frame as the frame is being input to the video encoder 114. The frame is generally input to the video encoder 114 as a sequence of samples in raster scan order, from the uppermost line in the frame to the lowermost line in the frame. The video encoder 114 is required to process the incoming sample data in real-time, i.e. it is not able to stall the incoming sample data if the rate of processing the incoming data were to fall below the input data rate. The encoded video data is typically an encoded bitstream containing a sequence of blocks of compressed video data. In a video streaming application, the entire bitstream is not stored in any one location. Instead, the blocks of compressed video data are continually being produced by the encoder and consumed by the decoder, with intermediate storage, e.g. in the communication channel 120. Blocks of compressed video data are transmitted by the transmitter 116 over the communication channel 120 (e.g. an SDI link) as encoded video data (or “encoded video information”). The coded picture buffer is used to store a portion of the frame in encoded form and generally comprises a non-transitory memory buffer. It is also possible for the encoded video data to be stored in a non-transitory storage device 122, such as a “Flash” memory or a hard disk drive, until later being transmitted over the communication channel 120, or in lieu of transmission over the communication channel 120.
The destination device 130 includes a receiver 132, a video decoder 134 and a display device 136. The receiver 132 receives encoded video data from the communication channel 120 and passes received video data to the video decoder 134. The video decoder 134 then outputs decoded frame data to the display device 136. Examples of the display device 136 include a cathode ray tube and liquid crystal displays, such as in smart-phones, tablet computers, computer monitors or stand-alone television sets. It is also possible for the functionality of each of the source device 110 and the destination device 130 to be embodied in a single device, examples of which include mobile telephone handsets and tablet computers.
Notwithstanding the example devices mentioned above, each of the source device 110 and destination device 130 may be configured within a general purpose computing system, typically through a combination of hardware and software components. Fig. 2A illustrates such a computer system 200, which includes: a computer module 201; input devices such as a keyboard 202, a mouse pointer device 203, a scanner 226, a camera 227, which may be configured as the video source 112, and a microphone 280; and output devices including a printer 215, a display device 214, which may be configured as the display device 136, and loudspeakers 217. An external Modulator-Demodulator (Modem) transceiver device 216 may be used by the computer module 201 for communicating to and from a communications network 220 via a connection 221. The communications network 220, which may represent the communication channel 120, may be a wide-area network (WAN), such as the Internet, a cellular telecommunications network, or a private WAN. Where the connection 221 is a telephone line, the modem 216 may be a traditional “dialup” modem. Alternatively, where the connection 221 is a high capacity (e.g., cable) connection, the modem 216 may be a broadband modem. A wireless modem may also be used for wireless connection to the communications network 220. The transceiver device 216 may provide the functionality of the transmitter 116 and the receiver 132 and the communication channel 120 may be embodied in the connection 221.
The computer module 201 typically includes at least one processor unit 205, and a memory unit 206. For example, the memory unit 206 may have semiconductor random access memory (RAM) and semiconductor read only memory (ROM). The computer module 201 also includes a number of input/output (I/O) interfaces including: an audio-video interface 207 that couples to the video display 214, loudspeakers 217 and microphone 280; an I/O interface 213 that couples to the keyboard 202, mouse 203, scanner 226, camera 227 and optionally a joystick or other human interface device (not illustrated); and an interface 208 for the external modem 216 and printer 215. The signal from the audio-video interface 207 to the computer monitor 214 is generally the output of a computer graphics card and provides an example of ‘screen content’. In some implementations, the modem 216 may be incorporated within the computer module 201, for example within the interface 208. The computer module 201 also has a local network interface 211, which permits coupling of the computer system 200 via a connection 223 to a local-area communications network 222, known as a Local Area Network (LAN). As illustrated in Fig. 2A, the local communications network 222 may also couple to the wide network 220 via a connection 224, which would typically include a so-called “firewall” device or device of similar functionality. The local network interface 211 may comprise an Ethernet™ circuit card, a Bluetooth™ wireless arrangement or an IEEE 802.11 wireless arrangement; however, numerous other types of interfaces may be practiced for the interface 211. The local network interface 211 may also provide the functionality of the transmitter 116 and the receiver 132 and communication channel 120 may also be embodied in the local communications network 222.
The I/O interfaces 208 and 213 may afford either or both of serial and parallel connectivity, the former typically being implemented according to the Universal Serial Bus (USB) standards and having corresponding USB connectors (not illustrated). Storage devices 209 are provided and typically include a hard disk drive (HDD) 210. Other storage devices such as a floppy disk drive and a magnetic tape drive (not illustrated) may also be used. An optical disk drive 212 is typically provided to act as a non-volatile source of data. Portable memory devices, such as optical disks (e.g. CD-ROM, DVD, Blu-ray Disc™), USB-RAM, portable external hard drives, and floppy disks, for example, may be used as appropriate sources of data to the computer system 200. Typically, any of the HDD 210, optical drive 212, networks 220 and 222 may also be configured to operate as the video source 112, or as a destination for decoded video data to be stored for reproduction via the display 214. The source device 110 and the destination device 130 of the system 100 may be embodied in the computer system 200.
The components 205 to 213 of the computer module 201 typically communicate via an interconnected bus 204 and in a manner that results in a conventional mode of operation of the computer system 200 known to those in the relevant art. For example, the processor 205 is coupled to the system bus 204 using a connection 218. Likewise, the memory 206 and optical disk drive 212 are coupled to the system bus 204 by connections 219. Examples of computers on which the described arrangements can be practised include IBM-PCs and compatibles, Sun SPARCstations, Apple Mac™ or like computer systems.
Where appropriate or desired, the video encoder 114 and the video decoder 134, as well as methods described below, may be implemented using the computer system 200 wherein the video encoder 114, the video decoder 134 and methods to be described, may be implemented as one or more software application programs 233 executable within the computer system 200. In particular, the video encoder 114, the video decoder 134 and the steps of the described methods are effected by instructions 231 (see Fig. 2B) in the software 233 that are carried out within the computer system 200. The software instructions 231 may be formed as one or more code modules, each for performing one or more particular tasks. The software may also be divided into two separate parts, in which a first part and the corresponding code modules performs the described methods and a second part and the corresponding code modules manage a user interface between the first part and the user.
The software may be stored in a computer readable medium, including the storage devices described below, for example. The software is loaded into the computer system 200 from the computer readable medium, and then executed by the computer system 200. A computer readable medium having such software or computer program recorded on the computer readable medium is a computer program product. The use of the computer program product in the computer system 200 preferably effects an advantageous apparatus for implementing the video encoder 114, the video decoder 134 and the described methods.
The software 233 is typically stored in the HDD 210 or the memory 206. The software is loaded into the computer system 200 from a computer readable medium, and executed by the computer system 200. Thus, for example, the software 233 may be stored on an optically readable disk storage medium (e.g., CD-ROM) 225 that is read by the optical disk drive 212.
In some instances, the application programs 233 may be supplied to the user encoded on one or more CD-ROMs 225 and read via the corresponding drive 212, or alternatively may be read by the user from the networks 220 or 222. Still further, the software can also be loaded into the computer system 200 from other computer readable media. Computer readable storage media refers to any non-transitory tangible storage medium that provides recorded instructions and/or data to the computer system 200 for execution and/or processing. Examples of such storage media include floppy disks, magnetic tape, CD-ROM, DVD, Blu-ray Disc™, a hard disk drive, a ROM or integrated circuit, USB memory, a magneto-optical disk, or a computer readable card such as a PCMCIA card and the like, whether or not such devices are internal or external of the computer module 201. Examples of transitory or non-tangible computer readable transmission media that may also participate in the provision of the software, application programs, instructions and/or video data or encoded video data to the computer module 201 include radio or infra-red transmission channels as well as a network connection to another computer or networked device, and the Internet or Intranets including e-mail transmissions and information recorded on Websites and the like.
The second part of the application programs 233 and the corresponding code modules mentioned above may be executed to implement one or more graphical user interfaces (GUIs) to be rendered or otherwise represented upon the display 214. Through manipulation of typically the keyboard 202 and the mouse 203, a user of the computer system 200 and the application may manipulate the interface in a functionally adaptable manner to provide controlling commands and/or input to the applications associated with the GUI(s). Other forms of functionally adaptable user interfaces may also be implemented, such as an audio interface utilizing speech prompts output via the loudspeakers 217 and user voice commands input via the microphone 280.
Fig. 2B is a detailed schematic block diagram of the processor 205 and a “memory” 234. The memory 234 represents a logical aggregation of all the memory modules (including the HDD 209 and semiconductor memory 206) that can be accessed by the computer module 201 in Fig. 2A.
When the computer module 201 is initially powered up, a power-on self-test (POST) program 250 executes. The POST program 250 is typically stored in a ROM 249 of the semiconductor memory 206 of Fig. 2A. A hardware device such as the ROM 249 storing software is sometimes referred to as firmware. The POST program 250 examines hardware within the computer module 201 to ensure proper functioning and typically checks the processor 205, the memory 234 (209, 206), and a basic input-output systems software (BIOS) module 251, also typically stored in the ROM 249, for correct operation. Once the POST program 250 has run successfully, the BIOS 251 activates the hard disk drive 210 of Fig. 2A. Activation of the hard disk drive 210 causes a bootstrap loader program 252 that is resident on the hard disk drive 210 to execute via the processor 205. This loads an operating system 253 into the RAM memory 206, upon which the operating system 253 commences operation. The operating system 253 is a system level application, executable by the processor 205, to fulfil various high level functions, including processor management, memory management, device management, storage management, software application interface, and generic user interface.
The operating system 253 manages the memory 234 (209, 206) to ensure that each process or application running on the computer module 201 has sufficient memory in which to execute without colliding with memory allocated to another process.
Furthermore, the different types of memory available in the computer system 200 of Fig. 2A need to be used properly so that each process can run effectively. Accordingly, the aggregated memory 234 is not intended to illustrate how particular segments of memory are allocated (unless otherwise stated), but rather to provide a general view of the memory accessible by the computer system 200 and how such is used.
As shown in Fig. 2B, the processor 205 includes a number of functional modules including a control unit 239, an arithmetic logic unit (ALU) 240, and a local or internal memory 248, sometimes called a cache memory. The cache memory 248 typically includes a number of storage registers 244-246 in a register section. One or more internal busses 241 functionally interconnect these functional modules. The processor 205 typically also has one or more interfaces 242 for communicating with external devices via the system bus 204, using a connection 218. The memory 234 is coupled to the bus 204 using a connection 219.
The application program 233 includes a sequence of instructions 231 that may include conditional branch and loop instructions. The program 233 may also include data 232 which is used in execution of the program 233. The instructions 231 and the data 232 are stored in memory locations 228, 229, 230 and 235, 236, 237, respectively. Depending upon the relative size of the instructions 231 and the memory locations 228-230, a particular instruction may be stored in a single memory location as depicted by the instruction shown in the memory location 230. Alternately, an instruction may be segmented into a number of parts each of which is stored in a separate memory location, as depicted by the instruction segments shown in the memory locations 228 and 229.
In general, the processor 205 is given a set of instructions which are executed therein. The processor 205 waits for a subsequent input, to which the processor 205 reacts by executing another set of instructions. Each input may be provided from one or more of a number of sources, including data generated by one or more of the input devices 202, 203, data received from an external source across one of the networks 220, 222, data retrieved from one of the storage devices 206, 209 or data retrieved from a storage medium 225 inserted into the corresponding reader 212, all depicted in Fig. 2A. The execution of a set of the instructions may in some cases result in output of data. Execution may also involve storing data or variables to the memory 234.
The video encoder 114, the video decoder 134 and the described methods may use input variables 254, which are stored in the memory 234 in corresponding memory locations 255, 256, 257. The video encoder 114, the video decoder 134 and the described methods produce output variables 261, which are stored in the memory 234 in corresponding memory locations 262, 263, 264. Intermediate variables 258 may be stored in memory locations 259, 260, 266 and 267.
Referring to the processor 205 of Fig. 2B, the registers 244, 245, 246, the arithmetic logic unit (ALU) 240, and the control unit 239 work together to perform sequences of micro-operations needed to perform “fetch, decode, and execute” cycles for every instruction in the instruction set making up the program 233. Each fetch, decode, and execute cycle comprises: (a) a fetch operation, which fetches or reads an instruction 231 from a memory location 228, 229, 230; (b) a decode operation in which the control unit 239 determines which instruction has been fetched; and (c) an execute operation in which the control unit 239 and/or the ALU 240 execute the instruction.
Thereafter, a further fetch, decode, and execute cycle for the next instruction may be executed. Similarly, a store cycle may be performed by which the control unit 239 stores or writes a value to a memory location 232.
Each step or sub-process in the method of Fig. 11, to be described, is associated with one or more segments of the program 233 and is typically performed by the register section 244, 245, 246, the ALU 240, and the control unit 239 in the processor 205 working together to perform the fetch, decode, and execute cycles for every instruction in the instruction set for the noted segments of the program 233.
Fig. 3 is a schematic block diagram showing functional modules of the video encoder 114. Fig. 4 is a schematic block diagram showing functional modules of the video decoder 134. Generally, data is passed between functional modules within the video encoder 114 and the video decoder 134 in blocks or arrays (e.g., blocks of samples or blocks of transform coefficients). Where a functional module is described with reference to the behaviour of individual array elements (e.g., samples or transform coefficients), the behaviour shall be understood to be applied to all array elements. The video encoder 114 and video decoder 134 may be implemented using a general-purpose computer system 200, as shown in Figs. 2A and 2B, where the various functional modules may be implemented by dedicated hardware within the computer system 200, by software executable within the computer system 200 such as one or more software code modules of the software application program 233 resident on the hard disk drive 210 and being controlled in its execution by the processor 205, or alternatively by a combination of dedicated hardware and software executable within the computer system 200. The video encoder 114, the video decoder 134 and the described methods may alternatively be implemented in dedicated hardware, such as one or more integrated circuits performing the functions or sub-functions of the described methods. Such dedicated hardware may include graphic processors, digital signal processors, application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs) or one or more microprocessors and associated memories. In particular, the video encoder 114 comprises modules 320-348 and the video decoder 134 comprises modules 420-432 which may each be implemented as one or more software code modules of the software application program 233, or an FPGA ‘bitstream file’ that configures internal logic blocks in the FPGA to realise the video encoder 114 and the video decoder 134.
Although the video encoder 114 of Fig. 3 is an example of a low latency video encoding pipeline, other video codecs may also be used to perform the processing stages described herein. The video encoder 114 receives captured frame data, such as a series of frames, each frame including one or more colour channels.
The video encoder 114 divides each frame of the captured frame data, such as frame data 310, into regions generally referred to as ‘coding tree units’ (CTUs) with side sizes which are powers of two. The coding tree units (CTUs) in a frame are scanned in raster scan order and the sequentially scanned coding tree units (CTUs) are grouped into ‘slices’. A division of each frame into multiple slices provides ‘random access’ (commencing of decoding at a point other than the start of a bitstream) at each slice boundary within a frame. The term “coding tree unit” (CTU) refers collectively to all colour channels of the frame. Every coding tree unit (CTU) includes one coding tree block (CTB) for each colour channel. For example, in a frame coded using the YCbCr colour space, a coding tree unit (CTU) consists of three coding tree blocks (CTBs) for the Y, Cb and Cr colour planes corresponding to the same spatial location in the picture. The size of individual coding tree blocks (CTBs) may vary across colour components and generally depends on the selected ‘chroma format’. For example, for the 4:4:4 chroma format, the sizes of the coding tree blocks (CTBs) will be the same. For the 4:2:0 chroma format, the dimensions of chroma coding tree blocks (CTBs) in samples are halved (both horizontally and vertically) relative to the size of the luma coding tree block (CTB). The size of a coding tree unit (CTU) is specified as the size of the corresponding luma coding tree block (CTB). The sizes of the chroma coding tree blocks (CTBs) are inferred from the size of the coding tree unit (CTU) and the chroma format.
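The inference of chroma CTB sizes from the CTU size and chroma format can be sketched as follows (a hypothetical helper; the subsampling factors are those described above):

```python
def ctb_sizes(ctu_size, chroma_format):
    """Luma and chroma CTB dimensions (width, height) implied by the CTU
    size, which is specified in luma samples, and the chroma format."""
    sub_x, sub_y = {"4:4:4": (1, 1), "4:2:2": (2, 1), "4:2:0": (2, 2)}[chroma_format]
    return {"luma": (ctu_size, ctu_size),
            "chroma": (ctu_size // sub_x, ctu_size // sub_y)}

print(ctb_sizes(32, "4:4:4"))   # chroma CTBs match the luma CTB
print(ctb_sizes(32, "4:2:0"))   # chroma CTBs halved in both dimensions
```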
Each coding tree unit (CTU) includes a hierarchical quad-tree subdivision of a portion of the frame with a collection of ‘coding units’ (CUs), such that at each leaf node of the hierarchical quad-tree subdivision one coding unit (CU) exists. The subdivision can be continued until the coding units (CU) present at the leaf nodes have reached a specific predetermined minimum size. The specific minimum size is referred to as a smallest coding unit (SCU) size. Generally, the smallest coding unit (SCU) size is 8x8 luma samples, but other sizes are also possible, such as 16x16 or 32x32 luma samples. For low latency video coding, smaller CTUs are desirable, as the resulting smaller blocks require less buffering prior to encoding and less buffering after decoding for conversion to/from line-based raster scan input/output of samples. The corresponding coding block (CB) for the luma channel has the same dimensions as the coding unit (CU). The corresponding coding blocks (CBs) for the chroma channels have dimensions scaled according to the chroma format. If no subdivision of a coding tree unit (CTU) is done and a single coding unit (CU) occupies the whole coding tree unit (CTU), such a coding unit (CU) is referred to as a largest coding unit (LCU) (or maximum coding unit size). These dimensions are also specified in units of luma samples. As a result of the quad-tree hierarchy, the entirety of the coding tree unit (CTU) is occupied by one or more coding units (CUs). The largest coding unit size is signalled in the bitstream for a collection of frames known as a coded video sequence. For a given frame, the largest coding unit (LCU) size and the smallest coding unit (SCU) size do not vary.
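The quad-tree decomposition described above can be sketched recursively; the split-decision callback and the traversal order here are illustrative assumptions, not the encoder's actual mode decision:

```python
def enumerate_cus(x, y, size, split_decision, scu_size=8):
    """Recursively enumerate coding units (x, y, size) in a CTU quad-tree.

    split_decision(x, y, size) -> True to subdivide a CU into four
    quadrants; subdivision stops at the smallest coding unit size."""
    if size > scu_size and split_decision(x, y, size):
        half = size // 2
        cus = []
        for dy in (0, half):
            for dx in (0, half):
                cus += enumerate_cus(x + dx, y + dy, half, split_decision, scu_size)
        return cus
    return [(x, y, size)]   # leaf node: one CU occupies this region

# Example: split the top-left quadrant of a 32x32 CTU down to 8x8 SCUs.
split = lambda x, y, size: size == 32 or (x < 16 and y < 16)
print(enumerate_cus(0, 0, 32, split))
```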
The video encoder 114 produces one or more ‘prediction units’ (PUs) for each coding unit (CU). A PU includes all colour channels and is divided into one prediction block (PB) per colour channel. Various arrangements of prediction units (PUs) in each coding unit (CU) are possible and each arrangement of prediction units (PUs) in a coding unit (CU) is referred to as a ‘partition mode’. It is a requirement that the prediction units (PUs) do not overlap and that the entirety of the coding unit (CU) is occupied by the one or more prediction units (PUs). Such a requirement ensures that the prediction units (PUs) cover the entire frame area. A partitioning of a coding unit (CU) into prediction units (PUs) implies subdivision of coding blocks (CBs) for each colour component into ‘prediction blocks’ (PBs). Depending on the chroma format in use, the sizes of prediction blocks (PBs) corresponding to the same coding unit (CU) for different colour components may differ in size. For coding units (CUs) configured to use intra-prediction, two partition modes are possible, known as ‘PART_2Nx2N’ and ‘PART_NxN’. The PART_2Nx2N partition mode results in one prediction unit (PU) being associated with the coding unit (CU) and occupying the entirety of the coding unit (CU). The PART_NxN partition mode results in four prediction units (PUs) being associated with the coding unit (CU) and collectively occupying the entirety of the coding unit (CU) by each occupying one quadrant of the coding unit (CU).
The video encoder 114 operates by outputting a prediction unit (PU) 378. When intra-prediction is used, a transform block (TB)-based reconstruction process is applied for each colour channel. The TB-based reconstruction process results in the prediction unit (PU) 378 being derived on a TB basis. As such, a residual quad-tree decomposition of the coding unit (CU) associated with the prediction unit (PU) indicates the arrangement of TUs, and hence TBs, to be reconstructed to reconstruct the PU 378. A difference module 344 produces a ‘residual sample array’ 360. The residual sample array 360 is the difference between the PU 378 and a corresponding 2D array of data samples from a coding unit (CU) of the coding tree block (CTB) of the frame data 310. The difference is calculated for corresponding samples at each location in the array. The transform module 320 may apply a forward DCT to transform the residual sample array 360 into the frequency domain, producing ‘transform coefficients’. An 8x8 CU is always divided into an 8x8 TU, however multiple configurations of the 8x8 TU are possible, as described further with reference to Figs. 5A and 5B.
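A minimal sketch of the per-sample difference computed by the difference module 344 (a plain-list illustration; the encoder would operate on fixed-size blocks of samples):

```python
def residual_sample_array(source, prediction):
    """Per-sample difference between CU samples from the frame data and
    the prediction unit output (the difference module 344 above)."""
    return [[s - p for s, p in zip(src_row, pred_row)]
            for src_row, pred_row in zip(source, prediction)]

source = [[120, 122], [119, 121]]
prediction = [[118, 118], [118, 118]]
print(residual_sample_array(source, prediction))   # [[2, 4], [1, 3]]
```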
Within the TU, individual TBs are present and TB boundaries do not cross PB boundaries. As such, when the coding unit (CU) is configured to use a PART_NxN partition mode, the associated residual quad-tree (RQT) is inferred to have a subdivision at the top level of the hierarchy of subdivisions, resulting in four 4x4 TBs being associated with the luma channel of the CU. A rate control module 348 ensures that the bit rate of the encoded data meets a predetermined constraint. The predetermined constraint may be referred to as a rate control target. As the quantity of bits required to represent each CU varies, the rate control target can only be met by averaging across multiple CUs.
Moreover, each run of CUs (or CTUs) forms a ‘slice’ and the size of each slice is fixed. The fixed size of each slice facilitates architectures using parallelism, as it becomes possible to determine the start location of each slice without having to search for markers in the bitstream. The encoder may also encode multiple slices in parallel, storing the slices progressively as the slices are produced. The predetermined constraint may be determined by the capacity of the communications channel 120, or some other requirement. For example, the predefined constraint is for operation at a ‘constant bit rate’ (CBR). As such, the encoder rate control target may be determined according to a constant bit rate channel capacity for a target communication channel (e.g., the channel 120) to carry video data containing a video frame.
The constraint operates at a sub-frame level, and, due to channel rate limitations and intermediate buffer size limitations, also imposes timing constraints on the delivery of blocks of compressed video data by the video encoder 114. In particular, to ensure the fixed size requirement of each slice is met, the cumulative cost of the CTUs within each slice must not exceed the fixed size requirement. The cost may be less than the fixed size requirement. The timing constraints are discussed further with reference to Figs. 6A and 6B. The rate control module may also influence the selection of prediction modes within the video encoder 114, as discussed with reference to the method 800 of Fig. 8. For example, particular prediction modes have lower bit cost to code a block compared to other prediction modes and are thus considered low cost, albeit offering poor performance in terms of quality. Then, if the remaining available bits to code a given slice segment falls below a threshold (the threshold being updated for each coded block in the slice segment), the rate control module 348 enters a ‘fallback’ state where the remaining blocks in the slice segment are coded using this low cost prediction mode. As such, CBR operation is guaranteed, regardless of the complexity of the incoming uncompressed video data. A quantisation parameter (QP) 384 is output from the rate control module 348. The QP 384 varies on a block by block basis as the frame is being encoded. In particular, the QP 384 is signalled using a ‘delta QP’ syntax element, signalled at most once per transform unit (TU). Delta QP is only signalled when at least one significant residual coefficient is present for the TU. Other methods for controlling the QP 384 are also possible. The QP defines a divisor applied by a quantiser module 322 to the transform coefficients 362 to produce residual coefficients 364. The remainder of the division operation in the quantiser module 322 is discarded. Lower QPs result in larger magnitude residual coefficients but with a smaller range of remainders to discard. As such, lower QPs give a higher quality at the video decoder 134 output, at the expense of a lower compression ratio. Note that the compression ratio is influenced by a combination of the QP 384 and the magnitude of the transform coefficients 362. The magnitude of the transform coefficients 362 relates to the complexity of the incoming uncompressed video data and the ability of the selected prediction mode to predict the contents of the uncompressed video data. Thus, overall compression efficiency is only indirectly influenced by the QP 384 and varies along each slice segment as the complexity of the data at each block varies. The residual coefficients 364 are an array of values having the same dimensions as the residual sample array 360. The residual coefficients 364 provide a frequency domain representation of the residual sample array 360 when a transform is applied. The residual coefficients 364 and determined quantisation parameter 384 are taken as input to a dequantiser module 326.
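The divide-and-discard behaviour of the quantiser module 322, and the rescaling later performed by the dequantiser module 326, can be sketched as follows. The specification above says only that the QP defines a divisor; the HEVC-style mapping in which the step size doubles every six QP steps is an illustrative assumption:

```python
def qp_to_step(qp):
    """Assumed HEVC-style quantiser step size: doubles every 6 QP steps."""
    return 2.0 ** (qp / 6.0)

def quantise(transform_coeffs, qp):
    step = qp_to_step(qp)
    # int() truncates toward zero: the remainder of the division is
    # discarded, which is where the coding loss is introduced.
    return [int(c / step) for c in transform_coeffs]

def dequantise(levels, qp):
    step = qp_to_step(qp)
    return [level * step for level in levels]

coeffs = [100, -37, 12, 4, -2, 0]
for qp in (12, 24, 36):
    levels = quantise(coeffs, qp)
    recon = [round(r, 1) for r in dequantise(levels, qp)]
    print(qp, levels, recon)   # lower QP: larger levels, smaller loss
```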
The dequantiser module 326 reverses the scaling performed by the quantiser module 322 to produce rescaled transform coefficients 366. The rescaled transform coefficients are rescaled versions of the residual coefficients 364. The residual coefficients 364 and the determined quantisation parameter 384 are also taken as input to an entropy encoder module 324. The entropy encoder module 324 encodes the values of the residual coefficients 364 in the encoded bitstream 312 (or ‘video bitstream’). Due to the loss of precision resulting from the operation of the quantiser module 322, the rescaled transform coefficients 366 are not identical to the original values present in the transform coefficients 362. The rescaled transform coefficients 366 from the dequantiser module 326 are then output to an inverse transform module 328. The inverse transform module 328 performs an inverse transform from the frequency domain to the spatial domain to produce a spatial-domain representation 368 of the rescaled transform coefficients 366. The spatial-domain representation 368 is substantially identical to a spatial domain representation that is produced at the video decoder 134. The spatial-domain representation 368 is then input to a summation module 342.
The intra-frame prediction module 336 produces an intra-predicted prediction unit (PU) 378 using reconstructed samples 370 obtained from the summation module 342. In particular, the intra-frame prediction module 336 uses samples from neighbouring blocks (i.e. above, left or above-left of the current block) that have already been reconstructed to produce intra-predicted samples for the current prediction unit (PU). When a neighbouring block is not available (e.g. at the frame or independent slice segment boundary) the neighbouring samples are considered as ‘not available’ for reference. In such cases, a default value is used instead of the neighbouring sample values. Typically, the default value (or ‘half-tone’) is equal to half of the range implied by the bit-depth. For example, when the video encoder 114 is configured for a bit-depth of eight (8), the default value is 128. The summation module 342 sums the prediction unit (PU) 378 from the intra-frame prediction module 336 and the spatial domain output of the inverse transform module 328.
Prediction units (PUs) may be generated using an intra-prediction method. Intra-prediction methods make use of data samples adjacent to the prediction unit (PU) that have previously been reconstructed (typically above and to the left of the prediction unit) in order to generate reference data samples within the prediction unit (PU). Thirty-three angular intra-prediction modes are available. Additionally, a ‘DC mode’ and a ‘planar mode’ are also available for intra-prediction, to give a total of thirty-five (35) available intra-prediction modes. An intra-prediction mode 388 indicates which one of the thirty-five available intra-prediction modes is selected for the current prediction unit (PU) when the prediction unit (PU) is configured to use intra-prediction (i.e. as indicated by the prediction mode 386). The summation module 342 produces the reconstructed samples 370 that are stored in a reconstructed picture buffer 332. Standards such as HEVC specify filtering stages, such as sample adaptive offset (SAO) or deblocking. Such filtering is generally beneficial, e.g. for removing blocking artefacts, at the higher compression ratios (e.g. 50:1 to 100:1) typically seen in applications such as distribution of compressed video data across the internet to households, or broadcast. The video encoder 114 does not perform filtering operations such as adaptive loop filter, SAO or deblocking filtering. The video encoder 114 is intended for operation at lower compression ratios, e.g. 4:1 to 6:1 or even 8:1. At such compression ratios, these additional filtering stages have little impact on the frame data, and thus the complexity of the additional filtering operations is not justified by the resulting small improvement in quality. The reconstructed picture buffer 332 is configured within the memory 206 and provides storage for at least a portion of the frame, acting as an intermediate buffer for storage of samples to be used for reference for subsequent intra-predicted blocks.
The entropy encoder 324 encodes the residual coefficients 364, the QP 384 and other parameters, collectively referred to as ‘syntax elements’, into the encoded bitstream 312 as sequences of symbols. At targeted compression ratios of 4:1 to 8:1, the data rates for video data at UHD resolutions are very high. At such data rates, techniques such as arithmetic coding, in particular the context adaptive binary arithmetic coding (CABAC) algorithm of HEVC, are not feasible. One issue is that the use of adaptive contexts requires large memory bandwidth to the context memory for updating the probability associated with each context-coded bin in a syntax element. Another issue is the inherently serial nature of coding and decoding each bin into the bitstream. Even bins coded as so-called ‘equi-probable’ or ‘bypass-coded’ bins have a serial process that limits parallelism to only a few bins per clock cycle. At compression ratios such as 4:1 to 8:1, the bin rate is extremely high; for example, for UHD 4:4:4 10-bit 60 frame per second video data, the uncompressed data rate is 14.93 Gb/s, so compressed data rates between 1.866 and 3.732 Gb/s can be expected. Hence, in the video processing system 100, the use of adaptive probabilities for coding of bins is disabled. Consequently, all bins are coded in the ‘equi-probable’ state, i.e. bin probabilities equally assigned between ‘0’ bins and ‘1’ bins. As a consequence, there is alignment between bins and bits in the encoded bitstream 312, which results in the ability to directly code bins into the bitstream and read bins from the bitstream as bits. Then, the encoded bitstream effectively contains only variable length and fixed length codewords, each codeword including an integer number of (equi-probable) bits. The absence of misalignment between (bypass coded) bins and bits greatly simplifies the design of the entropy encoder 324, as the sequence of bins defining a given syntax element value can be directly stored into the encoded bitstream 312. Moreover, the absence of context coded bins also removes dependencies necessary for selecting contexts for bins. Such dependencies, when present, require buffers to store the values of previously coded bins, with those values used to select one context out of a set of contexts for a current bin. Then, encoding and decoding multiple bins per clock cycle is greatly simplified compared to when adaptive context coding is used, resulting in the potential to achieve the compressed data rates mentioned previously. In such architectures, the system clock can be expected to be in the order of several hundred MHz, with a bus sufficiently wide to achieve the required data rate. All these attributes of the entropy encoder 324 are also present in an entropy decoder 420 of the video decoder 134.
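The quoted data rates follow directly from the video format parameters. The following fragment simply reproduces that arithmetic; the conventional UHD dimensions of 3840x2160 samples are assumed.

    # Worked check of the quoted data rates for UHD 4:4:4 10-bit 60 fps video.
    width, height, fps, bit_depth, channels = 3840, 2160, 60, 10, 3
    uncompressed = width * height * fps * bit_depth * channels  # bits/second
    print(uncompressed / 1e9)      # ~14.93 Gb/s uncompressed
    print(uncompressed / 8 / 1e9)  # ~1.866 Gb/s at 8:1 compression
    print(uncompressed / 4 / 1e9)  # ~3.732 Gb/s at 4:1 compression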
The video decoder 134 of Fig. 4 is described with reference to a low latency video decoding pipeline, however other video codecs may also employ the processing stages of modules 420-432. The encoded video information may also be read from memory 206, the hard disk drive 210, a CD-ROM, a Blu-ray disk™ or other computer readable storage medium. Alternatively the encoded video information may be received from an external source, such as a server connected to the communications network 220 or a radio-frequency receiver.
As seen in Fig. 4, received video data, such as the encoded bitstream 312, is input to the video decoder 134. The encoded bitstream 312 may be read from memory 206, the hard disk drive 210, a CD-ROM, a Blu-ray disk™ or other non-transitory computer readable storage medium. Alternatively the encoded bitstream 312 may be received from an external source such as a server connected to the communications network 220 or a radio-frequency receiver. The encoded bitstream 312 contains encoded syntax elements representing the captured frame data to be decoded.
The encoded bitstream 312 is input to an entropy decoder module 420 which extracts the syntax elements from the encoded bitstream 312 and passes the values of the syntax elements to other blocks in the video decoder 134. The entropy decoder module 420 applies variable length coding to decode syntax elements from codes present in the encoded bitstream 312. The decoded syntax elements are used to reconstruct parameters within the video decoder 134. Parameters include zero or more residual data arrays 450, a prediction mode 454, an intra-prediction mode 457, and a QP 452. The residual data array 450 and the QP 452 are passed to a dequantiser module 421, and the intra-prediction mode 457 is passed to an intra-frame prediction module 426.
The dequantiser module 421 performs inverse scaling on the residual data of the residual data array 450 to create reconstructed data 455 in the form of transform coefficients. The dequantiser module 421 outputs the reconstructed data 455 to an inverse transform module 422. The inverse transform module 422 applies an ‘inverse transform’ to convert the reconstructed data 455 (i.e., the transform coefficients) from a frequency domain representation to a spatial domain representation, outputting a residual sample array 456 via a multiplexer module 423. The inverse transform module 422 performs the same operation as the inverse transform module 328. The transforms performed by the inverse transform module 422 are selected from a predetermined set of transform sizes required to decode an encoded bitstream 312.
When the prediction mode 454 indicates that the current prediction unit (PU) was coded using intra-prediction, the intra-frame prediction module 426 produces an intra-predicted prediction unit (PU) 464 for the prediction unit (PU) according to the intra-prediction mode 457. The intra-predicted prediction unit (PU) 464 is produced using data samples spatially neighbouring the prediction unit (PU) and a prediction direction also supplied by the intra-prediction mode 457. The spatially neighbouring data samples are obtained from reconstructed samples 458, output from a summation module 424. The prediction unit (PU) 466, which is output from the multiplexer module 428, is added to the residual sample array 456 from the inverse transform module 422 by the summation module 424 to produce reconstructed samples 458. The reconstructed samples 458 are stored in the frame buffer module 432 configured within the memory 206. The frame buffer module 432 provides sufficient storage to hold part of one frame, as required for just-in-time output of decoded video data by the video decoder 134. The decoded video data may be sent to devices such as a display device (e.g. 136, 214) or other equipment within a broadcast environment, such as a ‘distribution encoder’, graphics overlay insertion, or other video processing apparatus.
Fig. 5A is a schematic block diagram showing square coding tree unit (CTU) configurations for the sub-frame latency video encoding and decoding system 100. A frame of video data is decomposed into a series of CTUs in the video encoder 114. To meet an end-to-end latency requirement of 32 lines of samples, the CTU height is limited to eight rows. This latency is divided equally between the video encoder 114 and the video decoder 134, i.e. 16 lines in each portion of the video encoding and decoding system 100. Dividing the latency equally permits input of uncompressed data in raster scan order to fill half of the 16 line input buffer whilst the other half is being encoded. A similar division occurs in the video decoder 134. Using square CTUs, the resulting size is 8x8 (luma) samples, smaller than the minimum size of 16x16 specified in HEVC. Also, an additional source of latency results from the buffering of partially-coded slices in the video encoder 114 prior to transmission and the buffering of partially-received slices in the video decoder 134 prior to decoding. Further latency present in the communications channel 120 is not considered.
One consequence of an 8x8 CTU size is that no quadtree subdivision into multiple coding units (CUs) is performed. Instead, each CTU is always associated with one 8x8 CU. For an 8x8 CU, a residual quadtree is defined to always include one 8x8 transform unit (TU). As described in detail below, possible configurations of the 8x8 TU are shown in Fig. 5A, as TU configurations. The possible configurations are a result of the ‘partition mode’ of the CU and the chroma format of the video data.
For the primary colour channel (primary), the chroma format is not relevant and, as seen in Fig. 5A, an 8x8 transform block (TB) 501 is present when a PART_2Nx2N partition mode is used. As also seen in Fig. 5A, four 4x4 TBs (referenced at 502 in Fig. 5A) are present when a PART_NxN partition mode is used.
As seen in Fig. 5A, for the secondary colour channels (secondary), the possible arrangements of TBs also depend on the chroma format. When the video data is in the 4:2:0 chroma format, two 4x4 TBs (referenced at 503 in Fig. 5A) are present (one for each secondary colour channel), regardless of the partition mode of the CU.
When the video data is in the 4:2:2 chroma format, two pairs of 4x4 TBs (referenced at 504 in Fig. 5A) are present (one pair for each secondary colour channel), regardless of the partition mode of the CU.
When the video data is in the 4:4:4 chroma format, the partition mode of the CU influences the arrangement of TBs, such that the same arrangements as for the primary colour channel are used. In particular, as seen in Fig. 5A, one 8x8 TB is used per secondary colour channel (referenced at 505 in Fig. 5A) when the partition mode of the CU is PART_2Nx2N and four 4x4 TBs (referenced at 506 in Fig. 5A) per secondary colour channel are used when the partition mode of the CU is PART_NxN. For cases where multiple TBs are present for a given colour channel, the scan order of the TBs is shown in Fig. 5A using thick arrows. The scan order used is defined as a ‘Z-scan’ order, i.e. iterating over the blocks first left-to-right along the top, and then left-to-right along the bottom. The colour channels are processed with the primary colour channel first, followed by the secondary colour channels, i.e. Y, Cb, then Cr, or G, B, then R.
Fig. 5B is a schematic block diagram showing non-square coding tree unit (CTU) configurations for the sub-frame latency video encoding and decoding system 100 of Fig. 1. In non-square coding tree unit (CTU) configurations, the height of the CTU is retained at 8 lines, in order to meet the end-to-end latency requirement of 32 lines, as discussed earlier. However, the width of the CTU is doubled to 16 samples, resulting in non-square CTUs. Then, the CTU may contain two 8x8 CUs (referenced at 512 in Fig. 5B), each having the various structures of TBs as described with reference to Fig. 5A. Additionally, the 16x8 CTU may instead contain one non-square 16x8 CU 514. In such a case, the 16x8 CU 514 is divided into TBs as shown in Fig. 5B. The divisions of Fig. 5B for the primary colour channel (primary) and the secondary colour channels (secondary) are analogous to the divisions shown in Fig. 5A, noting that the width of each TB is doubled with respect to the cases shown in Fig. 5A, so that the possible TB sizes are 16x8 samples and 8x4 samples. The use of larger transforms enables more compact representation of the residual signal for a given area in the frame, resulting in improved compression efficiency. The improved compression efficiency is balanced against the possibility of highly detailed areas, for which larger transforms offer no benefit, and thus the original transform sizes are still available via the selection of 8x8 CUs. The selection of one 16x8 CU or two 8x8 CUs for a given 16x8 CTU is controlled using a ‘cu_split’ flag, coded in the bitstream. As the split results in two CUs, rather than four CUs, the split differs from the ‘quad-tree’ subdivision prevalent in HEVC.
Fig. 5C is a schematic block diagram showing a block configuration for the video encoding and decoding system 100 of Fig. 1. In particular, the configuration of Fig. 5C has a height of 4 luma samples in the buffering stages. Retaining a CTU width of 8 samples results in supporting two block configurations within the CTU: 4x4 and 8x4. Note that the 4:2:0 chroma format would result in a requirement to pair 4x4 chroma blocks with 8x8 (or 2x2 arrangements of 4x4 blocks) in the luma channel. As this would violate the height restriction of 4 luma samples, the 4:2:0 chroma format is not supported in the block configuration of Fig. 5C. Instead, only the 4:2:2 and 4:4:4 chroma formats are supported. The 4:4:4 chroma format is supported using one pair of 4x4 blocks for each of the three colour channels (i.e. blocks 542 for luma and blocks 546 for chroma). Alternatively, one 8x4 block may be used for each of the three colour channels in the case of 4:4:4 (i.e. block 548). For the 4:2:2 chroma format, either a pair of 4x4 blocks (i.e. blocks 542) or one 8x4 block (i.e. block 543) is present for the primary colour channel, and one 4x4 block is present per secondary colour channel (i.e. block 544). Each 4x4 block corresponds to one 4x4 TB and one 4x4 PB, hence there is no concept of multiple partition modes. As the configuration of Fig. 5C has a maximum buffering height of 4 luma samples, the end-to-end latency introduced by sample I/O and block processing is sixteen (16) lines; eight lines in the video encoder 114 (four for receiving video data in raster order and four for processing a row of CTUs) and eight lines in the video decoder 134 (four for decoding a row of CTUs and another four for outputting decoded video data in raster scan order). Additional latency may be introduced by the buffering of partially coded slices in the video encoder 114 prior to transmission and the buffering of partially received slices in the video decoder 134 prior to decoding. Such buffering is synchronised to the timing of encoding and decoding each row of 4x4 blocks. Accordingly, buffering of coded slices introduces an additional four lines of latency in each of the video encoder 114 and the video decoder 134, resulting in an overall latency of 16 + 8 = 24 lines for the video processing system 100 when using the block configuration of Fig. 5C. An alternative for the configuration of Fig. 5C involves using an 8x4 CTU size. The block configurations resulting in the use of an 8x4 TB are referred to as a ‘PART_2NxN’ partition mode. In such cases, the intra-prediction process is performed on an 8x4 PB.
Figs. 6A, 6B and 6C are schematic block diagrams showing scan patterns for 4x4 sub-blocks, with a division into 2x2 quadrants. In the example of Figs. 6A, 6B and 6C, a TB is decomposed into one or more 4x4 sub-blocks. The decomposition of the TB provides a regular block size for processing residual data for TBs of any size. A 4x4 sub-block 500 may be scanned in one of three possible scan patterns: horizontal, vertical and diagonal. The particular scan pattern in use for a given 4x4 sub-block depends on the intra prediction mode and is selected to approximate the anticipated distribution of residual coefficient magnitudes in the sub-block. The 4x4 sub-block 500 is scanned using a fixed relationship between the angular intra prediction direction and the scan pattern that accords with the HEVC specification. However, the structure of the scan patterns is altered with respect to the HEVC specification to enable the provision of fine-granularity significance signalling within each 4x4 sub-block by way of a ‘quadrant’ signalling syntax element ‘sig_coef_quadrant’. In particular, for a given 4x4 sub-block, a set of four sig_coef_quadrant flags is used to signal the potential for the presence of at least one significant residual coefficient in each quadrant of the 4x4 sub-block (i.e. each 2x2 group of residual coefficients). For a given 2x2 quadrant, even if the sig_coef_quadrant flag is set it is possible for all the residual coefficients contained therein to have values of zero. This potential redundancy in the syntax of the residual coding is necessary to facilitate the dual purpose of localised rate control. Moreover, there is no significance map flag coded for each coefficient, nor are there dedicated flags for particular magnitudes (e.g. greater than one or greater than two). As such, the residual is coded simply using blocks of ‘coef_abs_level’ codewords using a Rice-based coding scheme as described with reference to Fig. 7.
As seen in Figs. 6A, 6B and 6C, the scan patterns are modified to accommodate the quadrant-based scheme of coding each 4x4 sub-block. In particular, a horizontally-scanned sub-block 602, a vertically-scanned sub-block 603 and a diagonally-scanned sub-block 604 are scanned as shown in Figs. 6A, 6B and 6C. A set of sig_coef_quadrant flags 601, applied to the horizontally-scanned sub-block 602, each indicates the necessity to scan the corresponding quadrant in the horizontally-scanned sub-block 602. For cases where particular sig_coef_quadrant flags in the set of sig_coef_quadrant flags 601 indicate no need to scan particular quadrants of the horizontally-scanned sub-block 602, the corresponding quadrants are not scanned, and the resulting scan pattern is effectively modified accordingly. Similar provisions exist for the vertically-scanned sub-block 603 and the diagonally-scanned sub-block 604.
Fig. 7 is a schematic block diagram showing a syntax structure for coding a coefficient magnitude (i.e. a coef_abs_level syntax element) 700. The coef_abs_level syntax element includes a Truncated Rice (TR) prefix 702 and, optionally, either a Truncated Rice (TR) suffix 704 or a kth-order exponential-Golomb (EGk) prefix 706 and a kth-order exponential-Golomb (EGk) suffix 708. The TR prefix 702 is a unary codeword with a maximum length of four bits. If the codeword length is less than four, then the TR suffix 704 is also present. The TR suffix 704, if present, is a fixed-length codeword with a length equal to the Rice parameter value in use for coding the coef_abs_level syntax element 700. The TR suffix 704 values are depicted in Fig. 7 as ‘X’ or ‘XX’ for Rice parameter values of k=1 and k=2 respectively. The TR suffix 704 values should be considered to expand to the space of all possible values (i.e. ‘0’ and ‘1’ for ‘X’, and ‘00’, ‘01’, ‘10’ and ‘11’ for ‘XX’), for coding discrete residual coefficient values. If the TR prefix 702 has the value of ‘1111’ then a kth-order exponential-Golomb (EGk) codeword is present.
Example binarisations for the EGk prefix 706 and the EGk suffix 708 are shown in Fig. 7. As can be seen, coding the coef_abs_level syntax element 700 in this way results in a smooth transition in terms of codeword length between the truncated Rice portion and the EGk portion.
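The binarisation can be sketched as follows. The sketch below is illustrative only: the bin polarities, the escape offset and the exact EGk formulation are assumptions for the purpose of the example, and the helper names (encode_coef_abs_level, _fixed_length, _egk) are hypothetical.

    # Sketch of a Rice/EGk binarisation in the style of Fig. 7: a truncated
    # unary prefix of at most four bins, a k-bit fixed-length suffix when the
    # prefix is not the escape code, and a kth-order Exp-Golomb escape.
    def encode_coef_abs_level(value, k):
        prefix = value >> k
        if prefix < 4:
            bits = [1] * prefix + [0]                         # TR prefix
            bits += _fixed_length(value & ((1 << k) - 1), k)  # TR suffix
        else:
            bits = [1, 1, 1, 1]                   # escape: '1111'
            bits += _egk(value - (4 << k), k)     # EGk remainder (assumed offset)
        return bits

    def _fixed_length(v, n):
        return [(v >> (n - 1 - i)) & 1 for i in range(n)]

    def _egk(v, k):
        # kth-order Exp-Golomb: EG0 of (v >> k), then the k LSBs of v.
        q = (v >> k) + 1
        m = q.bit_length()
        return [0] * (m - 1) + _fixed_length(q, m) + _fixed_length(v & ((1 << k) - 1), k)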
Fig. 8 is a schematic block diagram showing a bitstream portion with residual data and a coded data buffer. The coded data buffer 330 is a FIFO that receives data from the entropy encoder 324 in the form of an encoded bitstream 311 and delivers data to the communications channel 120 in the form of the encoded bitstream 312. Note that the encoded bitstream 311 and the encoded bitstream 312 are generally not actually serial bitstreams, as feasible clock frequencies for ASIC and FPGA implementations are far below the bit rate of the communications channel 120. Instead, a bus is used with a fixed quantity of bits being transferred per clock cycle. For applications with a CBR channel, the supply of data into the communications channel 120 is fixed. For example, a fixed number of bits are supplied on each clock cycle in an FPGA or ASIC implementation to a serialiser (the transmit portion of a serialiser/deserialiser ‘SerDes’) to supply a serialised bitstream to the communications channel 120. In particular, note that the coded data buffer 330 lacks the capacity to hold an entire coded slice (e.g. an entire row of CTUs). Buffering an entire slice prior to transmission introduces excessive latency into the video encoder 114 (and corresponding latency into the video decoder 134). Instead, the coded data buffer 330 has a much smaller size, sufficient to hold a small number of coded CTUs (e.g. six to ten). As the cost of coding individual CTUs varies considerably, due to the varying complexity of the underlying textures and features being coded, the size of each coded CTU cannot be known in advance. The variations in size cause the utilisation of the coded data buffer 330 to vary during coding. To enable delivery of the encoded bitstream 312 in real-time and at a constant bit rate, the supply of data via the encoded bitstream 311 must never result in an underflow or an overflow condition within the coded data buffer 330. As such, the low latency afforded by the reduced size of the coded data buffer 330 results in more restrictive constraints on the localised bit rate of the output from the entropy encoder 324.
Fig. 9 is a schematic block diagram showing a truncated residual for a sub-block 902. A set of four coef_quadrant_flags 901 indicates, for each 2x2 quadrant of the sub-block 902, whether the corresponding set of four residual coefficients is to be coded. In the example of Fig. 9, the lower-right of the coef_quadrant_flags 901 has a value of zero, and thus the lower-right set of 2x2 residual coefficients in the sub-block 902 is not scanned. The use of the set of coef_quadrant_flags 901 allows a coarse granularity of coding the potential for significant coefficients in the sub-block 902. In HEVC, one significance flag is arithmetically coded for each residual coefficient. In the video processing system 100, arithmetic coding is not used, and thus the cost of signalling significant coefficients is absorbed into the coefficient magnitude coding (i.e. coef_abs_level). The use of coef_group_flags in the video processing system 100 controls the coding of the coef_quadrant_flags 901, which in turn control the coding of each set of 2x2 residual coefficients in the sub-block 902. If a coef_quadrant_flag indicates that a given 2x2 group of residual coefficients in the sub-block 902 is coded, this does not imply that at least one of the residual coefficients must be significant. It is still possible to code a block of 2x2 residual coefficients where each coef_abs_level has a value of zero (i.e. all four residual coefficients are not significant). Using the coef_group_flags in the video processing system 100 to control the coding of the coef_quadrant_flags 901, as described above, reduces the logic complexity, as there is no dependence for coding of values between neighbouring residual coefficients within the 2x2 set of residual coefficients. The absence of such a dependency also simplifies the cost estimation for a given set of residual coefficients, as the cost of coding each residual coefficient in the sub-block 902 can be estimated in parallel, and the costs summed for the entire sub-block 902.
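The gating of quadrants by flags can be illustrated with a short sketch. The names below (scan_sub_block, QUADRANT_ORIGINS) are hypothetical, and row-major coefficient storage with a Z-scan quadrant order is assumed.

    # Sketch of scanning a 4x4 sub-block under the quadrant scheme: each
    # coef_quadrant_flag gates the corresponding 2x2 group; quadrants whose
    # flag is zero are skipped, so their coefficients are implicitly zero.
    QUADRANT_ORIGINS = [(0, 0), (0, 2), (2, 0), (2, 2)]  # assumed Z-scan order

    def scan_sub_block(coeffs, quadrant_flags):
        # coeffs: 4x4 nested lists; quadrant_flags: four 0/1 values.
        scanned = []
        for flag, (r0, c0) in zip(quadrant_flags, QUADRANT_ORIGINS):
            if not flag:
                continue  # quadrant not coded
            for r in (r0, r0 + 1):
                for c in (c0, c0 + 1):
                    scanned.append(coeffs[r][c])
        return scanned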
Fig. 10 is a schematic flow diagram showing a method 1000 of padding a bitstream with data to meet a minimum buffer utilisation requirement. The method 1000 may be implemented by the video encoder 114, as one or more software code modules of the application program 233 resident in the hard disk drive 210 and being controlled in its execution by the processor 205. The method 1000 results in the creation of an encoded bitstream 312 that includes ‘padding’ data, inserted when insufficient data is generated from the entropy coding process in the entropy encoder 324 to fill the coded data buffer 330. The padding data prevents the coded data buffer 330 from under-flowing, in which case no valid data would be available for transmission over the communications channel 120.
The method 1000 starts at a quantise coefficients step 1002.
At the quantise coefficients step 1002, the quantiser 322, under control of the processor 205, quantises coefficients from the transform module 320 according to the quantisation parameter 384. Step 1002 results in residual coefficients for a TB, to be coded into the encoded bitstream 312. The TB is coded as a sequence of sub-blocks. Control in the processor 205 then passes to an encode sub-block step 1004.
At the encode sub-block step 1004, the entropy encoder 324, under control of the processor 205, encodes the residual coefficients of the considered sub-block into the encoded bitstream 311. For TBs sized larger than 4x4 a coefficient group flag (‘coef_group_flag’) is coded to indicate the presence of at least one significant residual coefficient in the sub-block. Then, for each 2x2 quadrant within the sub-block, a corresponding coef_quadrant_flag is coded, indicating the presence of at least one significant residual coefficient in the corresponding set of 2x2 residual coefficients in the sub-block. For each 2x2 quadrant, the absolute magnitudes of the residual coefficients are coded as coef_abs_level, and the signs coded as coef_sign, for the sets of 2x2 residual coefficients as indicated by each coef_quadrant_flag.
coef_abs_level is coded according to the binarisation scheme of Fig. 7, with the Rice parameter initialised according to a predictive scheme, based upon the coefficient magnitudes from previous sub-blocks, that accords with the HEVC specification. As a result of the encode sub-block step 1004, a given quantity of bits is stored in the coded data buffer 330. Control in the processor 205 then passes to a buffer underrun test step 1006.
At the buffer underrun test step 1006, the processor 205 tests the status of the coded data buffer 330 to determine the utilisation, or amount of data presently stored in the buffer. The utilisation can be measured in bits, as this is the granularity with which data is written to the coded data buffer 330. Despite this, the coded data buffer 330 is generally implemented using a fixed word width FIFO operating at the clock frequency of the FPGA or ASIC implementing the entropy encoder 324, with an indication of a bit offset b_offset within the input word of width w from which valid data starts. To complete the current word, b_remaining_word, equal to w - b_offset bits, needs to be supplied to the coded data buffer 330. Moreover, the delivery of the next sub-block is scheduled in n clock cycles time. If the utilisation of the coded data buffer 330 is less than b_remaining, equal to n * w + b_remaining_word, the coded data buffer 330 will not have sufficient contents to supply the communications channel 120 with data at the required constant bit rate. In such a case, control in the processor 205 passes to an insert VLC padding syntax element step 1016. Otherwise, control in the processor passes to an insert 1-bit padding syntax element step 1008.
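The underrun condition can be expressed compactly. The following sketch restates the test using the quantities just defined; the function name is hypothetical.

    # Sketch of the underrun test: with input word width w, bit offset
    # b_offset in the current word, and the next sub-block due in n clock
    # cycles, the buffer must hold at least b_remaining bits.
    def buffer_will_underrun(utilisation_bits, w, b_offset, n):
        b_remaining_word = w - b_offset
        b_remaining = n * w + b_remaining_word
        return utilisation_bits < b_remaining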
At the insert VLC padding syntax element step 1016, the entropy encoder 324, under control of the processor 205, inserts a variable length codeword syntax element into the encoded bitstream 311. The variable length codeword syntax element is coded as padding into the encoded bitstream 311. The syntax element is coded after each sub-block is coded, and the size of the syntax element needs to be sufficient to address any shortfall between b_remaining and the utilisation of the coded data buffer 330. A 0th-order exponential-Golomb (EG0) or a unary-coded value may be used for the padding syntax element. Control in the processor 205 then passes to a last sub-block test step 1010.
At the insert 1-bit padding syntax element step 1008, the entropy encoder 324, under control of the processor 205, inserts a minimal 1-bit value for the padding syntax element (e.g. signalling the shortest possible unary codeword, or EG0 codeword). Control in the processor 205 then passes to the last sub-block test step 1010.
At the last sub-block test step 1010, the processor 205 tests if the just-processed sub-block is the last sub-block in the TB. If the just-processed sub-block is not the last sub-block in the TB, control in the processor advances to the next sub-block in the TB, and control in the processor 205 passes to the encode sub-block step 1004. Otherwise, the method 1000 terminates.
As the padding syntax element represents unused capacity in the communications channel 120, the quantity of bits consumed by the syntax element may be minimised. A method 1100 of padding a bitstream by adjusting a Rice parameter for coded residual data to meet a minimum buffer utilisation requirement will now be described with reference to Fig. 11. The method 1100 may be implemented by the video encoder 114, as one or more software code modules of the application program 233 resident in the hard disk drive 210 and being controlled in its execution by the processor 205.
The video decoder 134 receives and parses the bitstream 312, produced in accordance with the method 1000, and decodes residual data. The residual data is decoded by decoding, for each sub-block, coef_group_flags (if present), coef_quadrant_flags (if present) and coef_abs_level and coef_sign syntax elements. After the syntax elements associated with a given sub-block are decoded, the padding syntax element is parsed, and the resulting value is discarded.
The method 1100, performed in the video encoder 114, for padding a bitstream by adjusting a Rice parameter for coded residual data to meet a minimum buffer utilisation requirement, will now be described. The method 1100 does not require the introduction of an explicit padding syntax element. Instead, the residual coding itself, in particular the coef_abs_level syntax elements, is used to implement the padding function. The padding function is implemented by adjusting the Rice parameter to vary the cost of coding the residual data. The method 1100 begins at a quantise coefficients step 1102.
At the quantise coefficients step 1102, the quantiser 322, under control of the processor 205, quantises coefficients from the transform module 320 according to the quantisation parameter 384. Step 1102 results in residual coefficients for a TB, to be coded into the encoded bitstream 312. The TB is coded as a sequence of sub-blocks. Control in the processor 205 then passes to a predict Rice parameter step 1104.
At the predict Rice parameter step 1104, an initial Rice parameter value is generated, under execution of the processor 205, based on the magnitudes of the residual coefficients of previous sub-blocks, in accordance with the HEVC specification. Using the initial Rice parameter, an initial cost, cost_initial, is produced for coding the residual. The cost is the sum of the number of required flags (e.g. coef_group_flag and coef_quadrant_flag) and the coef_sign bits and the lengths of the coef_abs_level syntax elements for the residual coefficients. Control in the processor 205 then passes to a buffer underrun test step 1106.
At the buffer underrun test step 1106, the processor 205 tests the status of the coded data buffer 330 to determine the utilisation, or amount of data presently stored in the buffer. The utilisation can be measured in bits, as this is the granularity with which data is written to the coded data buffer 330. The coded data buffer 330 may be implemented using a fixed word width FIFO operating at the clock frequency of the FPGA or ASIC implementing the entropy encoder 324, with an indication of a bit offset b_offset within the input word of width w from which valid data starts. To complete the current word, b_remaining_word, equal to w - b_offset bits, needs to be supplied to the coded data buffer 330.
Moreover, the delivery of the next sub-block is scheduled in n clock cycles time. If the utilisation of the coded data buffer 330, buf_util, is less than b_remaining, equal to n * w + b_remaining_word, then the coded data buffer 330 will not have sufficient contents to supply the communications channel 120 with data at the required constant bit rate. If the coded data buffer 330 does not have sufficient contents to supply the communications channel 120 with data at the required constant bit rate, control in the processor 205 passes to an adjust Rice parameter step 1108. Otherwise, control in the processor 205 passes to a buffer overrun test step 1110.
At the adjust Rice parameter step 1108, a candidate adjusted Rice parameter is derived under execution of the processor 205. Additionally, at step 1108, the coef_quadrant_flags for the sub-block are set to indicate that all residual coefficients of the sub-block are to be coded. The coef_quadrant_flags for the sub-block are set because a sub-block with no, or almost no, significant residual coefficients cannot consume much space in the coded data buffer 330 regardless of the chosen Rice parameter, or QP. If the coef_quadrant_flags are adjusted, the residual coding cost is revised accordingly. If the revised cost is sufficient to alleviate the buffer underrun condition, control in the processor 205 passes to the buffer overrun test step 1110. Otherwise, adjustment of the Rice parameter is required to further increase the residual coding cost. The required increase in the residual cost is determined from the increment in the Rice parameter and the magnitudes of the residual coefficients. An alternative is to ignore the magnitude of the residual coefficients and just use the tendency that each increment of the Rice parameter increases the codeword length of the coef_abs_level syntax element by one bit. Then, as a sub-block contains 16 residual coefficients to be coded (given that coef_quadrant_flags were thus set earlier), the required Rice parameter increase can be determined by a division of the required size to resolve the buffer underrun by sixteen (16), with rounding up by one (1) applied for any fractional component. For larger magnitude residual coefficients, increasing the Rice parameter can result in a shorter codeword length, due to the transfer of coding cost from the EGk suffix to the TR prefix. However, as the Rice parameter adjustment was determined based purely upon the increase in cost of the fixed length portion (i.e. the TR suffix or the EGk suffix, depending on the residual coefficient magnitude), the adjusted Rice parameter will prevent the buffer underrun from occurring. Control in the processor 205 then passes to the buffer overrun test step 1110.
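The simplified Rice adjustment just described reduces to a ceiling division. The sketch below is illustrative; the function name is hypothetical.

    # Sketch of the simplified Rice increase: assume each increment of the
    # Rice parameter lengthens each of the 16 coef_abs_level codewords in
    # the sub-block by one bit, and round any fractional component up.
    def required_rice_increase(shortfall_bits, coeffs_per_sub_block=16):
        return -(-shortfall_bits // coeffs_per_sub_block)  # ceiling division

    # Example: a 20-bit shortfall across 16 coefficients needs an increase of 2.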
In another arrangement of the adjust Rice parameter step 1108, the initial Rice parameter is not signalled in the encoded bitstream output by the video encoder 114. The determined adjusted Rice parameter is also not signalled in the encoded bitstream output by the video encoder 114. Instead, in such an arrangement, both the initial Rice parameter and adjusted Rice parameter are implicitly predicted by the video decoder 134 from a model of the coded data buffer 330. The advantage of implicitly predicting both the initial Rice parameter and adjusted Rice parameter instead of signalling the initial Rice parameter and the adjusted Rice parameter, is that the overall bitrate of the encoded bitstream is reduced. However, the disadvantage is that the video decoder 134 is required to maintain a model of the coded data buffer 330.
At the buffer overrun test step 1110, the expected buffer utilisation is tested after adding the residual coefficients, coded using the required Rice parameter (e.g. from step 1104 or step 1108). A buffer overrun occurs if the length of the coded residual coefficient data is too large to store inside the coded data buffer 330. In such a case, it is not possible to store the entirety of the coded residual data in the coded data buffer 330. Moreover, it is not possible to wait until more data is output over the communications channel 120, as the video encoder 114 must progress to the next sub-block in order to meet the real-time characteristic of the video processing system 100. Generally, buffer overrun occurs when a succession of sub-blocks with large magnitude residual coefficients is encountered. As such, it is not likely to occur when an adjusted Rice parameter (i.e. from the step 1108) is in use. Note, however, that for a limited capacity coded data buffer 330, buffer overrun may occur. If the storage of the coded residual data into the coded data buffer 330 results in an overflow, control in the processor 205 passes to a truncate residual step 1112. Otherwise, control in the processor 205 passes to an encode sub-block step 1114.
At the truncate residual step 1112, the residual data is truncated, under execution of the processor 205, by setting residual coefficients to zero, in order to reduce the cost of coding the residual. The residual data may be truncated by setting coef_quadrant_flags to zero, starting from the lower-rightmost flag, which corresponds to the highest frequency residual coefficients, and progressing backwards in a Z-scan order to the upper-left quadrant flag. As each flag is set to zero, the associated residual coefficients are implicitly discarded from the coding process, and the residual cost of the sub-block is reduced accordingly. Once the residual coding cost is reduced to a level where the residual data can be stored in the coded data buffer, control in the processor 205 passes to the encode sub-block step 1114. Note that the absence of inter-coefficient dependencies for determining the Rice parameter for each coefficient (as is the case for HEVC) simplifies the cost calculations, as the costs of the discarded significant coefficients can simply be removed from the overall cost. As the order of discarding coefficients progresses from the high frequency values, working towards the low-frequency coefficients, the scheme of HEVC would require recalculation of the entire residual cost, as this is also the direction in which the inter-coefficient Rice parameter dependency operates (i.e. the so-called ‘backward-adaptive’ approach). The tendency to remove higher frequency coefficients results in some visual distortions. However, the selection of coefficients to remove is intended to minimise the visual impact, at least for the case where the DCT is employed. The visual distortions are a trade-off against the much more severe impact of failing to meet buffer size requirements, which results in an overflow and later slippage of the transmission rate of the data. As the video processing system 100 has closely linked timing between the video encoder 114 and the video decoder 134, such slippage would result in a loss of synchronisation of the output decoded video data, which would impact the ability of a display to properly present the decoded video data. Note that the DCT can be skipped (‘transform skip’ mode), in which case the visual significance of coefficients does not tend to diminish with increasing frequency (although high frequency coefficients are important for sharp features, such as text).
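The truncation loop can be sketched as follows. The cost model is a stand-in: per-quadrant bit costs are assumed to be precomputed, and all names are hypothetical.

    # Sketch of step 1112: clear coef_quadrant_flags starting from the
    # lower-right (highest-frequency) quadrant and moving backwards in
    # Z-scan order until the revised cost fits the available buffer space.
    def truncate_residual(quadrant_flags, quadrant_costs, total_cost, budget_bits):
        # quadrant_flags and quadrant_costs are in Z-scan order.
        for q in reversed(range(len(quadrant_flags))):
            if total_cost <= budget_bits:
                break
            if quadrant_flags[q]:
                quadrant_flags[q] = 0
                total_cost -= quadrant_costs[q]  # discarded coefficients cost nothing
        return quadrant_flags, total_cost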
For transform-skipped blocks, the method of reducing excessive residual cost is performed uniformly across the sub-block, for example by decimating alternating input values in either a horizontal direction or a vertical direction prior to a delta-PCM stage and in accordance with the direction, then requantising the resulting input values to produce a new set of residual coefficients. In such a case, the decimation step results in a sparser residual (i.e. alternating rows or columns of non-significant residual coefficients).
At the encode sub-block step 1114, the entropy encoder 324, under control of the processor 205, encodes the residual coefficients of the considered sub-block into the encoded bitstream 311 (i.e. going into the coded data buffer 330). A delta Rice parameter syntax element, which codes the delta between the predicted Rice parameter (i.e. from the step 1104) and the final Rice parameter value (e.g. from the step 1108), is coded. For the case where no adjustment is performed, the coded delta is zero. The delta includes a sign flag and a magnitude component, coded using, for example, 0th-order exponential-Golomb coding. For TBs sized larger than 4x4, a coefficient group flag (‘coef_group_flag’) is coded to indicate the presence of at least one significant residual coefficient in the sub-block. Then, for each 2x2 quadrant within the sub-block, a corresponding coef_quadrant_flag is coded, generally indicating the presence of at least one significant residual coefficient in the corresponding set of 2x2 residual coefficients in the sub-block.
As per the step 1108, additional coef_quadrant_flags may be set even though the associated residual coefficients are not significant. Then, for the quadrants, the absolute magnitudes of the residual coefficients are coded as coef_abs_level, and the signs coded as coef_sign, for the sets of 2x2 residual coefficients as indicated by each coef_quadrant_flag. coef_abs_level is coded according to the binarisation scheme of Fig. 7, using the Rice parameter from the step 1104 and potentially the step 1108. As a result of the encode sub-block step 1114, a given quantity of bits is stored in the coded data buffer 330. Control in the processor 205 then passes to a last sub-block test step 1116.
At the last sub-block test step 1116, the processor 205 tests if the just-processed sub-block is the last sub-block in the TB. If the just-processed sub-block is not the last sub-block in the TB, control in the processor 205 advances to the next sub-block in the TB, and control in the processor 205 passes to the predict Rice parameter step 1104.
Otherwise, control in the processor 205 passes to an adjust QP step 1118.
At the adjust QP step 1118, the processor 205 may adjust the QP for use in subsequent TBs. Lowering the QP reduces the divisor applied to coefficients from the transform module 320, resulting in higher quality due to less discarded remainder at the expense of higher bit rate. If the processor 205 performed the adjust Rice parameter step 1108, the data rate is evidently well below the ceiling imposed by CBR operation, and thus a lowering of the QP used for subsequent TBs is performed. Lowering the QP used for subsequent TBs results in larger magnitude residual coefficients, taking more bits to code. Also, some residual coefficients that would otherwise be quantised to zero (not significant) may now quantise to a non-zero value, as a result of the reduced QP value.
If the truncate residual step 1112 is performed, the data rate is evidently excessive for the ceiling imposed by CBR operation, in which case, the QP used for subsequent TBs is raised. Raising the QP for subsequent TBs reduces the magnitudes of future residual coefficients, lowering the coding cost. Note that the rate of change of QP should be limited (e.g. to an increment or decrement of one per TB), to avoid excessive reaction to spatially localised variations in block complexity. The adjusted QP is signalled using a delta QP syntax element, signalled at the next TB. Then, the method 1100 terminates.
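The QP adaptation across steps 1108, 1112 and 1118 can be summarised with a small sketch. The clamping bounds and the function name are assumptions for illustration.

    # Sketch of the QP adjustment of step 1118, with the rate of change
    # clamped to one per TB as recommended above.
    def adjust_qp(qp, rice_was_adjusted, residual_was_truncated,
                  qp_min=0, qp_max=51):
        if residual_was_truncated:
            return min(qp + 1, qp_max)  # rate too high: coarser quantisation
        if rice_was_adjusted:
            return max(qp - 1, qp_min)  # rate well below ceiling: finer quantisation
        return qp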
The video decoder 134 decodes an encoded bitstream 312, produced by the video encoder 114 according to the method 1100 using the signalled Rice parameter. As such, a Rice parameter is predicted for the sub-block and a delta is applied in accordance with the signalled delta sign and delta magnitude.
In an arrangement of the video processing system 100, scanning is performed in a forward direction (i.e. from the DC coefficient up to the highest frequency coefficient). Scanning in a forward direction is possible because the ‘backward adaptive’ Rice parameter adaptation, where the Rice parameter used for lower-frequency residual coefficients is dependent on the value used for higher frequency coefficients, is not present. In the absence of the dependency, the scan order does not affect the coding efficiency, so a simpler, fixed, scan pattern may be used. Moreover, there is no need for different scan patterns for different intra-prediction modes, as there is no means to exploit statistical traits of residuals resulting from different intra-prediction modes. For example, the gains achieved from the arithmetic coding process of HEVC are not available to the video processing system 100 because such coding schemes preclude the high data rates necessary for the target applications.
In HEVC, the timing of the decoder is specified via a ‘hypothetical reference decoder’, which specifies the relationship between the arrival of coded pictures and the delivery of decoded pictures. Coded pictures are delivered via a ‘hypothetical stream scheduler’, which, in practice, is likely to be an encoder. The coded pictures are delivered according to a pre-set schedule (or predetermined schedule). Thus, when delivered over a bandwidth-limited channel, each picture is implicitly able to be transferred over the channel in a sufficiently short timeframe to meet the schedule. In the video processing system 100, this concept is extended down to the CTU level. As such, each CTU is required to be delivered to the video decoder 134 at a particular time, tied to the decoding of the CTU and supply of the decoded samples to the decoded picture buffer 432. A coded CTU may be delivered ahead of time, although the degree of advance delivery is constrained by the size of the coded data buffer 330 and the coded slice segment buffer 430.
In arrangements where an entire slice is buffered in the coded data buffer 330 prior to transmission, and the corresponding entire slice is buffered during reception in the coded slice segment buffer 430 prior to decoding, the timing constraint operates at the slice level only. As such, the timing of individual CTUs may vary more greatly, compared to the case of buffering less than a complete slice in the coded data buffer 330 and the coded slice segment buffer 430.
In another arrangement of the video processing system 100, the method 1100 is modified such that truncation of the residual data occurs on a coefficient-by-coefficient basis, with coef_quadrant_flags being altered to reflect the progressive elimination of significant residual coefficients. As residual truncation results in distortion in the decoded video data, truncation on a coefficient basis, rather than in sets of 2x2 residual coefficients, results in reduced distortion, at the expense of slightly greater computational complexity resulting from the finer granularity of the truncation process.
In HEVC, a total of thirty-three ‘angular’ intra-prediction modes are available, in addition to ‘DC’ intra-prediction and ‘planar’ intra-prediction. Such an approach is appropriate for specifying prediction over large prediction block sizes, e.g. up to 64x64 in size. The prediction operation is performed on a TB by TB basis, limiting the maximum region size within which one set of reconstructed samples is produced to the maximum TB size of 32x32. In the video processing system 100, the maximum block size is much smaller, i.e. 8x8. As a consequence, such a large number of intra-prediction modes is not required.
In the video processing system 100, the supported intra-prediction modes are as follows: horizontal, vertical, diagonal down-left and diagonal down-right. In this context, diagonal down-left and diagonal down-right refer to a downward prediction direction, with an angle of +/- 45 degrees. Note that there is no DC prediction mode. The selection of prediction modes was made to ensure simple mode coding via a fixed length code. The DC prediction mode is omitted because alternatives such as horizontal or vertical intra-prediction provide the same result when a neighbouring block has ‘flat’ (i.e. DC) contents. The particular one of these four intra-prediction modes used for a given PU is coded using a two-bit fixed length code. This approach is chosen because the overhead introduced by flags for predictive mode coding schemes cannot be reduced through arithmetic coding methods. As such, a predictive scheme, using bypass-coded bins, would struggle to produce a substantially lower bit rate on average compared to a simple fixed length codeword, in particular with relatively few intra-prediction modes to code. The ‘residual delta pulse coded modulation’ or ‘RDPCM’ feature of HEVC enables coding of a residual as a series of deltas between adjacent coefficients in the residual array. In the 2D residual array, the delta is performed in the direction of intra-prediction. Note that the diagonal modes are included in this RDPCM mode for the video processing system 100. RDPCM can be considered as an alternative form of transform, where the residual sample array is integrated (in the decoder) or differentiated (in the encoder) in the direction of intra prediction. The video processing system 100 supports RDPCM in each of the prediction modes, i.e. horizontal, vertical and both diagonal modes. One benefit of RDPCM when the DCT is skipped is that cumulative drift in the residual, which increases from the neighbouring reference samples (at the top and left edges of the PB) to the bottom and right edges of the PB, is captured in the coded residual deltas, rather than accumulating in the coded residual itself. Also, when neighbouring sample availability is limited, the available intra prediction modes perform in a manner similar to a DC intra prediction mode, as the unavailable reference samples (e.g. those reference samples that cross independent slice segment boundaries, or fall outside the frame boundary) are populated with default values instead.
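The differentiate/integrate view of RDPCM can be illustrated for the horizontal mode. The sketch assumes row-major residual storage; the function names are hypothetical, and the vertical and diagonal modes follow the same pattern along their respective directions.

    # Sketch of horizontal RDPCM: the encoder differentiates along the
    # prediction direction and the decoder integrates, so drift away from
    # the reference samples is captured in the coded deltas.
    def rdpcm_encode_horizontal(residual):
        out = [row[:] for row in residual]
        for row in out:
            for c in range(len(row) - 1, 0, -1):
                row[c] -= row[c - 1]   # delta against the left neighbour
        return out

    def rdpcm_decode_horizontal(deltas):
        out = [row[:] for row in deltas]
        for row in out:
            for c in range(1, len(row)):
                row[c] += row[c - 1]   # integrate to recover the residual
        return out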
In HEVC, for a TB, the position of the last significant residual coefficient is coded separately. This reduces the cost of coding the significance map, as significance map coding only proceeds in the range from the DC coefficient (top-left position in the TB) up to, but not including, the signalled position of the last significant position. Coding of both the last significant coefficient position and the significance map benefits from arithmetic coding in HEVC. As the video processing system 100 achieves higher coding throughput by not using arithmetic coding, the last significant coefficient position and the significance map are not coded. Instead, the coef_quadrant_flags are used, which provides a trade-off between the cost of coding the significance of each coefficient individually (which instead is incurred in the coef_abs_level coding) and the coding overhead of this signalling. HEVC also provides for ‘hiding’ the sign bit of one residual coefficient per sub-block in the parity sum of the residual coefficients of that sub-block.
As a result of quantisation, the parity sum may match the required sign value for the affected residual coefficient, in which case no further action is required in the encoder.
However, if the parity sum does not match the required value, then a process to determine the ‘least worst’ residual coefficient to adjust (increment or decrement) is performed. The least worst residual coefficient to adjust is the one that, when adjusted, will introduce the least distortion into the decoded video data. One approach to determining this is to perform the DCT on a series of adjusted residual coefficients, progressively searching and keeping track of the adjustment that resulted in minimum distortion. This approach increases computational complexity in the encoder quite substantially. Alternatively, when ‘rate-distortion optimised quantisation’ (RDOQ) is performed, such adjustment costs are already available, and the particular residual coefficient to increment or decrement can be determined using the already obtained data. However, RDOQ is generally too complex an operation for high throughput encoders to perform. Thus, for the high throughput operation of the video processing system 100, this ‘sign data hiding’ approach is not used, as it imposes a variable and somewhat costly complexity burden on the video encoder 114. Instead, for each significant residual coefficient (i.e. coef_abs_level is greater than zero), a corresponding sign bit is coded in the encoded bitstream 312.
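For contrast, the HEVC-style parity test that the system avoids can be sketched in a few lines. The parity convention and names below are assumptions for illustration only.

    # Sketch of the sign-hiding parity test used in HEVC (not used in the
    # described system): the hidden sign is inferred from the parity of the
    # sum of absolute levels in the sub-block.
    def hidden_sign_matches(abs_levels, required_sign_is_negative):
        parity = sum(abs_levels) & 1
        # Assumed convention: odd parity encodes a negative sign.
        return parity == (1 if required_sign_is_negative else 0)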
INDUSTRIAL APPLICABILITY
The arrangements described are applicable to the computer and data processing industries and particularly to digital signal processing for the encoding and decoding of signals such as video signals in a low-latency (sub-frame) video coding system.
The foregoing describes only some embodiments of the present invention, and modifications and/or changes can be made thereto without departing from the scope and spirit of the invention, the embodiments being illustrative and not restrictive. For example, one or more of the features of the various arrangements described above may be combined.
In the context of this specification, the word “comprising” means “including principally but not necessarily solely” or “having” or “including”, and not “consisting only of”. Variations of the word “comprising”, such as “comprise” and “comprises”, have correspondingly varied meanings.

Claims (9)

CLAIMS:
    1. A method of encoding video data, the method comprising: receiving residual data of a block of the video data to be encoded; determining a Rice parameter to encode the residual data, the Rice parameter being determined to increase a size of the received residual data such that a size of the encoded residual data exceeds a minimum output buffer utilisation, the minimum output buffer utilisation indicating an amount of space in an output buffer that the received residual data is required to fill to prevent buffer underrun of the output buffer for a video bitstream; and storing the received residual data to the output buffer using the determined Rice parameter to encode the video data.
2. A method according to claim 1, wherein the minimum output buffer utilisation is determined from a bit rate of the video bitstream output from the output buffer.
3. A method according to claim 1, wherein the determined Rice parameter is stored in the video bitstream.
4. A method according to claim 1, wherein the Rice parameter is determined for each sub-block in a transform block.
5. A method according to claim 1, wherein the size of the encoded residual data is measured in bits.
6. An encoder for encoding video data, the encoder comprising:
a module for receiving residual data of a block of the video data to be encoded;
a module for determining a Rice parameter to encode the residual data, the Rice parameter being determined to increase a size of the received residual data such that a size of the encoded residual data exceeds a minimum output buffer utilisation, the minimum output buffer utilisation indicating an amount of space in an output buffer that the received residual data is required to fill to prevent buffer underrun of the output buffer for a video bitstream; and
a module for storing the received residual data to the output buffer using the determined Rice parameter to encode the video data.
7. A system for encoding video data to produce a band limited bit rate video bitstream, the system comprising:
a memory for storing data and a computer program;
a processor coupled to the memory for executing the computer program, the computer program comprising instructions for:
receiving residual data of a block of the video data to be encoded;
determining a Rice parameter to encode the residual data, the Rice parameter being determined to increase a size of the received residual data such that a size of the encoded residual data exceeds a minimum output buffer utilisation, the minimum output buffer utilisation indicating an amount of space in an output buffer that the received residual data is required to fill to prevent buffer underrun of the output buffer for a video bitstream; and
storing the received residual data to the output buffer using the determined Rice parameter to encode the video data.
8. A computer readable medium having a program stored thereon for encoding video data to produce a band limited bit rate video bitstream, the program comprising:
code for receiving residual data of a block of the video data to be encoded;
code for determining a Rice parameter to encode the residual data, the Rice parameter being determined to increase a size of the received residual data such that a size of the encoded residual data exceeds a minimum output buffer utilisation, the minimum output buffer utilisation indicating an amount of space in an output buffer that the received residual data is required to fill to prevent buffer underrun of the output buffer for a video bitstream; and
code for storing the received residual data to the output buffer using the determined Rice parameter to encode the video data.
9. An apparatus for encoding video data, the apparatus comprising:
means for receiving residual data of a block of the video data to be encoded;
means for determining a Rice parameter to encode the residual data, the Rice parameter being determined to increase a size of the received residual data such that a size of the encoded residual data exceeds a minimum output buffer utilisation, the minimum output buffer utilisation indicating an amount of space in an output buffer that the received residual data is required to fill to prevent buffer underrun of the output buffer for a video bitstream; and
means for storing the received residual data to the output buffer using the determined Rice parameter to encode the video data.

CANON KABUSHIKI KAISHA
Patent Attorneys for the Applicant
Spruson & Ferguson
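As a rough illustration of the selection recited in claim 1, the following Python sketch uses hypothetical names and a simplified Rice-code length model (the claims do not prescribe any particular search). It picks the rate-minimising Rice parameter unless that choice would leave the output buffer below its minimum utilisation, in which case a parameter that expands the coded residual data to at least that threshold is preferred:

    def rice_code_length(level, k):
        # Bits to Rice-code a non-negative level with parameter k:
        # a unary prefix of (level >> k) bits plus a terminating bit,
        # followed by k suffix bits.
        return (level >> k) + 1 + k

    def choose_rice_parameter(levels, min_bits, k_max=7):
        # Coded size, in bits, of the absolute levels for each
        # candidate Rice parameter k.
        sizes = {k: sum(rice_code_length(abs(l), k) for l in levels)
                 for k in range(k_max + 1)}
        best_k = min(sizes, key=sizes.get)  # usual rate-minimising choice
        if sizes[best_k] >= min_bits:
            return best_k  # no underrun risk; keep the compact coding
        # Underrun risk: choose a parameter that expands the coded
        # residual data to at least the minimum buffer utilisation.
        padded = [k for k in sizes if sizes[k] >= min_bits]
        return min(padded, key=lambda k: sizes[k]) if padded else best_k

Note the inversion of the usual objective: the Rice parameter is chosen to expand, rather than minimise, the size of the coded residual data whenever the compact choice would risk underrun of the output buffer.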
AU2016203314A 2016-05-20 2016-05-20 Method, apparatus and system for encoding and decoding video data Abandoned AU2016203314A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2016203314A AU2016203314A1 (en) 2016-05-20 2016-05-20 Method, apparatus and system for encoding and decoding video data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
AU2016203314A AU2016203314A1 (en) 2016-05-20 2016-05-20 Method, apparatus and system for encoding and decoding video data

Publications (1)

Publication Number Publication Date
AU2016203314A1 true AU2016203314A1 (en) 2017-12-07

Family

ID=60516246

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2016203314A Abandoned AU2016203314A1 (en) 2016-05-20 2016-05-20 Method, apparatus and system for encoding and decoding video data

Country Status (1)

Country Link
AU (1) AU2016203314A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110233708A (en) * 2019-07-12 2019-09-13 中国电子科技集团公司第三十四研究所 A kind of data transmit-receive speed adjust device and its operation method
CN110233708B (en) * 2019-07-12 2023-12-29 中国电子科技集团公司第三十四研究所 Data receiving and transmitting rate adjusting device and operation method thereof

Similar Documents

Publication Publication Date Title
US10666948B2 (en) Method, apparatus and system for encoding and decoding video data
KR102140328B1 (en) Method and apparatus for video encoding and decoding based on constrained offset compensation and loop filter
EP3300365B1 (en) Encoding and decoding videos sharing sao parameters according to a color component
CN113491120A (en) Video coding and decoding method and device
US11172231B2 (en) Method, apparatus and system for encoding or decoding video data of precincts by using wavelet transform
KR102342660B1 (en) Substream multiplexing for display stream compression
CN112514386B (en) Grid coding and decoding quantization coefficient coding and decoding
US9706230B2 (en) Data encoding and decoding
EP2936818A1 (en) Low-delay buffering model in video coding
US11778188B2 (en) Scalar quantizer decision scheme for dependent scalar quantization
EP3120544A1 (en) Quantization processes for residue differential pulse code modulation
EP2984827A1 (en) Sample adaptive offset scaling based on bit-depth
JP2022510145A (en) Regularly coded bin reduction for coefficient decoding using thresholds and rice parameters
WO2017197434A1 (en) Method, apparatus and system for encoding and decoding video data
EP2936819B1 (en) Deblocking filter with reduced line buffer
TW202406352A (en) Image encoding and decoding method and apparatus
CN116071440A (en) Image decoding method, encoding method and device
AU2016203314A1 (en) Method, apparatus and system for encoding and decoding video data
US11412263B2 (en) Arithmetic coder byte stuffing signaling for video coding
WO2023083245A1 (en) Decoding method, encoding method and apparatuses
CN116074530A (en) Image decoding method, encoding method and device

Legal Events

Date Code Title Description
MK4 Application lapsed section 142(2)(d) - no continuation fee paid for the application