WO2020118351A1

WO2020118351A1 - Method, apparatus and system for encoding and decoding a transformed block of video samples

Info

Publication number: WO2020118351A1
Application number: PCT/AU2019/051178
Authority: WO
Inventors: Christopher James ROSEWARNE
Original assignee: Canon Kabushiki Kaisha; Canon Australia Pty Limited
Priority date: 2018-12-12
Filing date: 2019-10-25
Publication date: 2020-06-18
Also published as: TW202025785A; AU2018278914A1

Abstract

A system and method of decoding, from a bitstream, a coding unit in a current coding tree unit (CTU) to produce a frame, the coding unit referencing coding units previously decoded from the bitstream and the frame including a two-dimensional array of CTUs. The method comprises decoding a block vector for the coding unit from the bitstream, the coding unit located at a CTU at a left edge of the two-dimensional array of CTUs; and determining an offset block vector, the offset block vector locating the previously coded CTU to be adjacent to and left of the current CTU. The method further comprises producing a prediction block for the coding unit by fetching reference samples from the previous CTU according to the sum of the decoded block vector and the determined offset block vector; and forming the coding unit using the prediction block and decoded residual samples of the coding unit to produce a frame.

Description

METHOD, APPARATUS AND SYSTEM FOR ENCODING AND DECODING A TRANSFORMED BLOCK OF VIDEO SAMPLES

REFERENCE OT RLEATED APPLICATION(S)

[0001] This application claims the benefit under 35 U.S.C. §119 of the filing date of Australian Patent Application No. 2018278914, filed 12 December 2018, the contents of which are incorporated herein by reference in their entirety.

TECHNICAL FIELD

[0002] The present invention relates generally to digital video signal processing and, in particular, to a method, apparatus and system for encoding and decoding a transformed block of video samples. The present invention also relates to a computer program product including a computer readable medium having recorded thereon a computer program for encoding and decoding a transformed block of video samples.

BACKGROUND

[0003] Many applications for video coding currently exist, including applications for transmission and storage of video data. Many video coding standards have also been developed and others are currently in development. Recent developments in video coding standardisation have led to the formation of a group called the“Joint Video Experts Team” (JVET). The Joint Video Experts Team (JVET) includes members of Study Group 16, Question 6 (SG16/Q6) of the Telecommunication Standardisation Sector (ITU-T) of the International Telecommunication Union (ITU), also known as the“Video Coding Experts Group” (VCEG), and members of the International Organisations for Standardisation / International Electrotechnical Commission Joint Technical Committee 1 / Subcommittee 29 / Working Group 11 (ISO/IEC

JTC1/SC29/WG11), also known as the“Moving Picture Experts Group” (MPEG).

[0004] The Joint Video Experts Team (JVET) issued a Call for Proposals (CfP), with responses analysed at its 10^th meeting in San Diego, USA. The submitted responses demonstrated video compression capability significantly outperforming that of the current state-of-the-art video compression standard, i.e.:“high efficiency video coding” (HEVC). On the basis of this outperformance it was decided to commence a project to develop a new video compression standard, to be named‘versatile video coding’ (VVC). VVC is anticipated to address ongoing demand for ever-higher compression performance, especially as video formats increase in capability (e.g., with higher resolution and higher frame rate) and address increasing market demand for service delivery over Wireless Area Networks (WANs), where bandwidth costs are relatively high. At the same time, VVC must be implementable in contemporary silicon processes and offer an acceptable trade-off between the achieved performance versus the implementation cost (for example, in terms of silicon area, CPU processor load, memory utilisation and bandwidth).

[0005] Video data includes a sequence of frames of image data, each of which include one or more colour channels. Generally, one primary colour channel and two secondary colour channels are included. The primary colour channel is generally referred to as the‘luma’ channel and the secondary colour channel(s) are generally referred to as the‘chroma’ channels. Although video data is typically displayed in an RGB (red-green-blue) colour space, the colour space has a high degree of correlation between the three respective components. The video data representation seen by an encoder or a decoder is often using a colour space such as YCbCr. YCbCr concentrates luminance, mapped to‘luma’ according to a transfer function, in a Y (primary) channel and chroma in Cb and Cr (secondary) channels. Moreover, the Cb and Cr channels may be sampled spatially at a lower rate compared to the luma channel, for example half horizontally and half vertically, known as a‘4:2:0 chroma format’.

[0006] The VVC standard is a‘block based’ codec, in which frames are firstly divided into a square array of regions known as‘coding tree units’ (CTUs). CTUs generally occupy a relatively large area, such as 128x 128 luma samples. However, CTUs at the right and bottom edge of each frame may be smaller in area. Associated with each CTU is a‘coding tree’ that defines a decomposition of the area of the CTU into a set of areas, also referred to as‘coding units’ (CUs). The CUs are processed for encoding or decoding in a particular order. As a consequence of the coding tree and the use of the 4:2:0 chroma format, a given area in the frame is associated with a collection of collocated blocks across the colour channels. The luma block has a dimension of width ^c height and the chroma blocks have dimensions of width/2 ^c height/2 for each chroma block. The collections of collocated blocks for a given area are generally referred to as‘units’, for example the above-mentioned CUs, as well as‘prediction units’ (PUs), and‘transform units’ (TUs).

[0007] Notwithstanding the different dimensions of chroma blocks versus luma blocks for the same area, the size of a given‘unit’ is generally described in terms of the dimensions of the luma block for the unit. Individual blocks are typically identified by the type of unit for which the blocks are associated. For example,‘coding block’ (CB),‘transform block’ (TB)’, and prediction block (PB) are blocks for one colour channel and are associated with CU, TU, and PU, respectively. Notwithstanding the above distinction between‘units’ and‘blocks’, the term ‘block’ may be used as a general term for areas or regions of a frame for which operations are applied to all colour channels.

[0008] For each CU a prediction (PU) of the contents (sample values) of the corresponding area of frame data is generated (a‘prediction unit’). Further, a representation of the difference (or ‘residual’ in the spatial domain) between the prediction and the contents of the area as seen at input to the encoder is formed. The difference in each colour channel may be transformed coded as a sequence of residual coefficients, forming one or more TUs for a given CU. The applied transform may be a Discrete Cosine Transform (DCT) or other transform, applied to each block of residual values. The transform is applied separably, i.e. that is the two dimensional transform is performed in two passes. The block is firstly transformed by applying a one-dimensional transform to each row of samples in the block. Then, the partial result is transformed by applying a one-dimensional transform to each column of the partial result to produce a final block of transform coefficients that substantially decorrelates the residual samples. Transforms of various sizes are supported by the WC standard, including transforms of rectangular-shaped blocks, with each side dimension being a power of two. Transform coefficients are quantised for entropy encoding into a bitstream.

[0009] Implementations of the VVC standard typically use pipelining to divide the processing into a sequence of stages. Each stage operates concurrently and partially processed blocks are passed from one stage to the next, before fully processed (i.e. encoded or decoded) blocks are output. Efficient handling of transformed blocks in the context of pipelined architectures is needed to avoid excessive implementation cost for the VVC standard. Excessive

implementation cost can occur both with respect to memory consumption and with respect to functional modules required to process a‘worst case’ both in terms of the rate at which pipeline stages need to complete and the size of data processed at each stage.

SUMMARY

[00010] It is an object of the present invention to substantially overcome, or at least ameliorate, one or more disadvantages of existing arrangements. [00011] One aspect of the present disclosure provides a method of decoding, from a bitstream, a coding unit in a current coding tree unit (CTU) to produce a frame, the coding unit referencing coding units previously decoded from the bitstream and the frame including a two- dimensional array of CTUs, the method comprising: decoding a block vector for the coding unit from the bitstream, the coding unit located at a CTU at a left edge of the two-dimensional array of CTUs; determining an offset block vector, the offset block vector locating the previously coded CTU to be adjacent to and left of the current CTU; producing a prediction block for the coding unit by fetching reference samples from the previous CTU according to the sum of the decoded block vector and the determined offset block vector; and forming the coding unit using the prediction block and decoded residual samples of the coding unit to produce a frame.

[00012] According to another aspect, the offset vector has a Y component equal to negative of a CTU height.

[00013] According to another aspect, the offset vector has an X component equal to a width of the frame.

[00014] According to another aspect, the previously coded CTU is in a different row to the coding unit.

[00015] According to another aspect, the determined offset locates a portion of the previously coded CTU to be adjacent to and left of the current CTU.

[00016] Another aspect of the present disclosure provides a non-transitory computer readable medium having a computer program stored thereon to implement a method of decoding, from a bitstream, a coding unit in a current coding tree unit (CTU) to produce a frame, the coding unit referencing coding units previously decoded from the bitstream and the frame including a two- dimensional array of CTUs, the program comprising: code for decoding a block vector for the coding unit from the bitstream, the coding unit located at a CTU at a left edge of the two- dimensional array of CTUs; code for determining an offset block vector, the offset block vector locating the previously coded CTU to be adjacent to and left of the current CTU; code for producing a prediction block for the coding unit by fetching reference samples from the previous CTU according to the sum of the decoded block vector and the determined offset block vector; and code for forming the coding unit using the prediction block and decoded residual samples of the coding unit to produce a frame. [00017] Another aspect of the present disclosure provides a system, comprising: a memory; and a processor, wherein the processor is configured to execute code stored on the memory for implementing a method of decoding, from a bitstream, a coding unit in a current coding tree unit (CTU) to produce a frame, the coding unit referencing coding units previously decoded from the bitstream and the frame including a two-dimensional array of CTUs, the method comprising: decoding a block vector for the coding unit from the bitstream, the coding unit located at a CTU at a left edge of the two-dimensional array of CTUs; determining an offset block vector, the offset block vector locating the previously coded CTU to be adjacent to and left of the current CTU; producing a prediction block for the coding unit by fetching reference samples from the previous CTU according to the sum of the decoded block vector and the determined offset block vector; and forming the coding unit using the prediction block and decoded residual samples of the coding unit to produce a frame.

[00018] Another aspect of the present disclosure provides a video decoder configured to decode, from a bitstream, a coding unit in a current coding tree unit (CTU) to produce a frame, the coding unit referencing coding units previously decoded from the bitstream and the frame including a two-dimensional array of CTUs, by implementing a method comprising: decoding a block vector for the coding unit from the bitstream, the coding unit located at a CTU at a left edge of the two-dimensional array of CTUs; determining an offset block vector, the offset block vector locating the previously coded CTU to be adjacent to and left of the current CTU;

producing a prediction block for the coding unit by fetching reference samples from the previous CTU according to the sum of the decoded block vector and the determined offset block vector; and forming the coding unit using the prediction block and decoded residual samples of the coding unit to produce a frame.

[00019] Another aspect of the present disclosure provides a method of decoding a coding unit of a coding tree unit in an image frame from a bitstream, the coding tree unit including a plurality of processing regions, the method comprising: decoding, from a first coding unit of the coding tree unit, a first transform unit located in a first processing region of the plurality of processing regions, the first coding unit having transform units located within the first processing region and a second processing region of the plurality of processing regions;

decoding, from a second coding unit of the coding tree unit, a second transform unit located in the first processing region; decoding, from the first coding unit, a further transform unit located in the second processing region, the further transform unit being decoded after decoding the transform units of the first and second coding units located in the first processing region; and decoding the first coding unit of the coding tree unit by applying the decoded first, second and further transform units.

[00020] Another aspect of the present disclosure provides a non-transitory computer readable medium having a computer program stored thereon to implement a method of decoding a coding unit of a coding tree unit in an image frame from a bitstream, the coding tree unit including a plurality of processing regions, the program comprising: code for decoding, from a first coding unit of the coding tree unit, a first transform unit located in a first processing region of the plurality of processing regions, the first coding unit having transform units located within the first processing region and a second processing region of the plurality of processing regions; code for decoding, from a second coding unit of the coding tree unit, a second transform unit located in the first processing region; code for decoding, from the first coding unit, a further transform unit located in the second processing region, the further transform unit being decoded after decoding the transform units of the first and second coding units located in the first processing region; and code for decoding the first coding unit of the coding tree unit by applying the decoded first, second and further transform units.

[00021] Another aspect of the present disclosure provides a system, comprising: a memory; and a processor, wherein the processor is configured to execute code stored on the memory for implementing a method of decoding a coding unit of a coding tree unit in an image frame from a bitstream, the coding tree unit including a plurality of processing regions, the method comprising: decoding, from a first coding unit of the coding tree unit, a first transform unit located in a first processing region of the plurality of processing regions, the first coding unit having transform units located within the first processing region and a second processing region of the plurality of processing regions; decoding, from a second coding unit of the coding tree unit, a second transform unit located in the first processing region; decoding, from the first coding unit, a further transform unit located in the second processing region, the further transform unit being decoded after decoding the transform units of the first and second coding units located in the first processing region; and decoding the first coding unit of the coding tree unit by applying the decoded first, second and further transform unit

[00022] Another aspect of the present disclosure provides a video decoder configured to implement a method of decoding a coding unit of a coding tree unit in an image frame from a bitstream, the coding tree unit including a plurality of processing regions, the method comprising: decoding, from a first coding unit of the coding tree unit, a first transform unit located in a first processing region of the plurality of processing regions, the first coding unit having transform units located within the first processing region and a second processing region of the plurality of processing regions; decoding, from a second coding unit of the coding tree unit, a second transform unit located in the first processing region; decoding, from the first coding unit, a further transform unit located in the second processing region, the further transform unit being decoded after decoding the transform units of the first and second coding units located in the first processing region; and decoding the first coding unit of the coding tree unit by applying the decoded first, second and further transform unit.

[00023] Other aspects are also disclosed.

BRIEF DESCRIPTION OF THE DRAWINGS

[00024] At least one embodiment of the present invention will now be described with reference to the following drawings and and appendices, in which:

[00025] Fig. 1 is a schematic block diagram showing a video encoding and decoding system;

[00026] Figs. 2A and 2B form a schematic block diagram of a general purpose computer system upon which one or both of the video encoding and decoding system of Fig. 1 may be practiced;

[00027] Fig. 3 is a schematic block diagram showing functional modules of a video encoder;

[00028] Fig. 4 is a schematic block diagram showing functional modules of a video decoder;

[00029] Fig. 5 is a schematic block diagram showing the available divisions of a block into one or more blocks in the tree structure of versatile video coding;

[00030] Fig. 6 is a schematic illustration of a dataflow to achieve permitted divisions of a block into one or more blocks in a tree structure of versatile video coding;

[00031] Figs. 7A and 7B show an example division of a coding tree unit (CTU) into a number of coding units;

[00032] Fig. 8A is an example coding tree unit (CTU) with a conventional arrangement of transform units; [00033] Fig. 8B is an example coding tree units (CTUs) with a transform unit arrangement that allows processing according to a pipelined architecture;

[00034] Fig. 9A is a diagram showing a coding tree dividing a coding tree unit (CTU) into five coding units (CUs);

[00035] Fig 9B is a diagram showing the transform units resulting from the coding tree of Fig. 9 A;

[00036] Fig. 9C is a diagram showing a conventional coding order of the transform units of Fig. 9B in a CTU, divided into four virtual pipeline data units (VPDUs);

[00037] Fig. 9D is a diagram showing a conventional coding order of the transform units of Fig. 9C in the bitstream, divided into the four VPDUs of a CTU;

[00038] Fig. 9E is a diagram showing a coding order of the transform units of Fig. 9B in a CTU, the transform units being coded in a consecutive order with respect to the four VPDUs of the CTU;

[00039] Fig. 9F is a diagram showing a coding order of the transform units of Fig 9E in the bitstream, divided such that the transform units of each VPDU of the CTU are coded

adjacently;

[00040] Fig. 9G is a diagram showing a conventional order of two coding units and

corresponding transform units in a CTU;

[00041] Fig. 9H is a diagram showing a coding order of two coding units and corresponding transform units in a CTU, such that the transform units are coded in the order of VPDUs of the CTU;

[00042] Fig. 91 shows an example CTU with a top-level split being a binary split (horizontal direction);

[00043] Fig. 10 shows a method of encoding a coding unit using transforms, the method enabling pipelined implementations of the video encoder to be realised; and

[00044] Fig. 11 shows a method of decoding a coding unit using transforms, the method enabling pipelined implementations of the video decoder to be realised. [00045] Fig. 12 shows a method of generating a list of transform units for a coding unit, each transform unit being associated with one VPDU of the CTU;

[00046] Fig. 13 A shows reference areas in a current and left CTU for current picture referencing with a horizontal split at the top level of a current CTU;

[00047] Fig. 13B shows reference areas in a current and left CTU for current picture referencing with a vertical split at the top level of a current CTU;

[00048] Fig. 13C shows reference areas in a frame of CTUs with the CTUs grouped into tiles, with each tile allowed to have partial CTUs only along the leftmost column of the leftmost tiles and the lowermost row of the lowermost tiles;

[00049] Fig. 13D shows reference areas in a frame of CTUs with the CTUs grouped into tiles, with each tile allowed to have partial CTUs only along the leftmost column of each tile and the lowermost row of each tile;

[00050] Fig. 13E is a diagram showing reference areas in a frame of CTUs with the CTUs grouped into tiles, with tiles allowed to have partial CTUs along the rightmost column and topmost row of each tile;

[00051] Fig. 14 shows a method for encoding a coding unit using current picture referencing to a CTU in the above row of CTUs; and

[00052] Fig. 15 shows a method for decoding a coding unit using current picture referencing to a CTU in the above row of CTUs.

DETAILED DESCRIPTION INCLUDING BEST MODE

[00053] Where reference is made in any one or more of the accompanying drawings to steps and/or features, which have the same reference numerals, those steps and/or features have for the purposes of this description the same function(s) or operation(s), unless the contrary intention appears.

[00054] Fig. 1 is a schematic block diagram showing functional modules of a video encoding and decoding system 100. The system 100 may utilise implicit division of large blocks or coding units (CUs) into multiple, smaller, blocks or transform units (TUs) to enable processing the coding tree unit (CTU) in regions, or‘pipeline processing regions’ or‘virtual pipeline data units’ (VPDUs), smaller than the CTU size. The VPDUs effectively define processing regions of the CTU for processing/parsing in a pipelined manner. Moreover, the system 100 may also order the TUs in a bitstream such that the decoder is able to process the TUs according to a pipelined order irrespective of the actual coding tree of the CTU and without imposing constraints on the coding tree flexibility that would degrade compression efficiency. For example, the system 100 may process the CTU as four quadrants, each of which may contain many CUs and/or may contain parts of CUs that span across multiple regions. The system 100 may also constrain operation of prediction methods such as‘current picture referencing’ (CPR) to accord with the available memory buffering inherent in a VPDU-wise processing architecture.

[00055] The system 100 includes a source device 110 and a destination device 130. A communication channel 120 is used to communicate encoded video information from the source device 110 to the destination device 130. In some arrangements, the source device 110 and destination device 130 may either or both comprise respective mobile telephone handsets or “smartphones”, in which case the communication channel 120 is a wireless channel. In other arrangements, the source device 110 and destination device 130 may comprise video

conferencing equipment, in which case the communication channel 120 is typically a wired channel, such as an internet connection. Moreover, the source device 110 and the destination device 130 may comprise any of a wide range of devices, including devices supporting over- the-air television broadcasts, cable television applications, internet video applications

(including streaming) and applications where encoded video data is captured on some computer-readable storage medium, such as hard disk drives in a file server.

[00056] As shown in Fig. 1, the source device 110 includes a video source 112, a video encoder 114 and a transmitter 116. The video source 112 typically comprises a source of captured video frame data (shown as an arrow 113), such as an image capture sensor, a previously captured video sequence stored on a non-transitory recording medium, or a video feed from a remote image capture sensor. The video source 112 may also be an output of a computer graphics card, for example displaying the video output of an operating system and various applications executing upon a computing device, for example a tablet computer.

Examples of source devices 110 that may include an image capture sensor as the video source 112 include smart-phones, video camcorders, professional video cameras, and network video cameras. [00057] The video encoder 114 converts (or‘encodes’) the captured frame data (indicated by the arrow 113) from the video source 112 into a bitstream (indicated by an arrow 115) as described further with reference to Fig. 3. The bitstream 115 is transmitted by the

transmitter 116 over the communication channel 120 as encoded video data (or“encoded video information”). It is also possible for the bitstream 115 to be stored in a non-transitory storage device 122, such as a“Flash” memory or a hard disk drive, until later being transmitted over the communication channel 120, or in-lieu of transmission over the communication channel 120.

[00058] The destination device 130 includes a receiver 132, a video decoder 134 and a display device 136. The receiver 132 receives encoded video data from the communication channel 120 and passes received video data to the video decoder 134 as a bitstream (indicated by an arrow 133). The video decoder 134 then outputs decoded frame data (indicated by an arrow 135) to the display device 136. Examples of the display device 136 include a cathode ray tube, a liquid crystal display, such as in smart-phones, tablet computers, computer monitors or in stand-alone television sets. It is also possible for the functionality of each of the source device 110 and the destination device 130 to be embodied in a single device, examples of which include mobile telephone handsets and tablet computers.

[00059] Notwithstanding the example devices mentioned above, each of the source device 110 and destination device 130 may be configured within a general purpose computing system, typically through a combination of hardware and software components. Fig. 2A illustrates an example computer system 200, which includes: a computer module 201; input devices such as a keyboard 202, a mouse pointer device 203, a scanner 226, a camera 227, which may be configured as the video source 112, and a microphone 280; and output devices including a printer 215, a display device 214, which may be configured as the display device 136, and loudspeakers 217. An external Modulator-Demodulator (Modem) transceiver device 216 may be used by the computer module 201 for communicating to and from a communications network 220 via a connection 221. The communications network 220, which may represent the communication channel 120, may be a wide-area network (WAN), such as the Internet, a cellular telecommunications network, or a private WAN. Where the connection 221 is a telephone line, the modem 216 may be a traditional“dial-up” modem. Alternatively, where the connection 221 is a high capacity (e.g., cable or optical) connection, the modem 216 may be a broadband modem. A wireless modem may also be used for wireless connection to the communications network 220. The transceiver device 216 may provide the functionality of the transmitter 116 and the receiver 132 and the communication channel 120 may be embodied in the connection 221.

[00060] The computer module 201 typically includes at least one processor unit 205, and a memory unit 206. For example, the memory unit 206 may have semiconductor random access memory (RAM) and semiconductor read only memory (ROM). The computer module 201 also includes a number of input/output (I/O) interfaces including: an audio-video interface 207 that couples to the video display 214, loudspeakers 217 and microphone 280; an I/O interface 213 that couples to the keyboard 202, mouse 203, scanner 226, camera 227 and optionally a joystick or other human interface device (not illustrated); and an interface 208 for the external modem 216 and printer 215. The signal from the audio-video interface 207 to the computer monitor 214 is generally the output of a computer graphics card. In some implementations, the modem 216 may be incorporated within the computer module 201, for example within the interface 208. The computer module 201 also has a local network interface 211, which permits coupling of the computer system 200 via a connection 223 to a local-area communications network 222, known as a Local Area Network (LAN). As illustrated in Fig. 2A, the local communications network 222 may also couple to the wide network 220 via a connection 224, which would typically include a so-called“firewall” device or device of similar functionality. The local network interface 211 may comprise an Ethernet™ circuit card, a Bluetooth™ wireless arrangement or an IEEE 802.11 wireless arrangement; however, numerous other types of interfaces may be practiced for the interface 211. The local network interface 211 may also provide the functionality of the transmitter 116 and the receiver 132 and communication channel 120 may also be embodied in the local communications network 222.

[00061] The EO interfaces 208 and 213 may afford either or both of serial and parallel connectivity, the former typically being implemented according to the Universal Serial Bus (USB) standards and having corresponding USB connectors (not illustrated). Storage devices 209 are provided and typically include a hard disk drive (HDD) 210. Other storage devices such as a floppy disk drive and a magnetic tape drive (not illustrated) may also be used. An optical disk drive 212 is typically provided to act as a non-volatile source of data. Portable memory devices, such optical disks (e.g. CD-ROM, DVD, Blu ray Disc™), USB-RAM, portable, external hard drives, and floppy disks, for example, may be used as appropriate sources of data to the computer system 200. Typically, any of the HDD 210, optical drive 212, networks 220 and 222 may also be configured to operate as the video source 112, or as a destination for decoded video data to be stored for reproduction via the display 214. The source device 110 and the destination device 130 of the system 100 may be embodied in the computer system 200.

[00062] The components 205 to 213 of the computer module 201 typically communicate via an interconnected bus 204 and in a manner that results in a conventional mode of operation of the computer system 200 known to those in the relevant art. For example, the processor 205 is coupled to the system bus 204 using a connection 218. Likewise, the memory 206 and optical disk drive 212 are coupled to the system bus 204 by connections 219. Examples of computers on which the described arrangements can be practised include IBM-PC’s and compatibles, Sun SPARCstations, Apple Mac™ or alike computer systems.

[00063] Where appropriate or desired, the video encoder 114 and the video decoder 134, as well as methods described below, may be implemented using the computer system 200. In particular, the video encoder 114, the video decoder 134 and methods to be described, may be implemented as one or more software application programs 233 executable within the computer system 200. In particular, the video encoder 114, the video decoder 134 and the steps of the described methods are effected by instructions 231 (see Fig. 2B) in the software 233 that are carried out within the computer system 200. The software instructions 231 may be formed as one or more code modules, each for performing one or more particular tasks. The software may also be divided into two separate parts, in which a first part and the corresponding code modules performs the described methods and a second part and the corresponding code modules manage a user interface between the first part and the user.

[00064] The software may be stored in a computer readable medium, including the storage devices described below, for example. The software is loaded into the computer system 200 from the computer readable medium, and then executed by the computer system 200. A computer readable medium having such software or computer program recorded on the computer readable medium is a computer program product. The use of the computer program product in the computer system 200 preferably effects an advantageous apparatus for implementing the video encoder 114, the video decoder 134 and the described methods.

[00065] The software 233 is typically stored in the HDD 210 or the memory 206. The software is loaded into the computer system 200 from a computer readable medium, and executed by the computer system 200. Thus, for example, the software 233 may be stored on an optically readable disk storage medium (e.g., CD-ROM) 225 that is read by the optical disk drive 212. [00066] In some instances, the application programs 233 may be supplied to the user encoded on one or more CD-ROMs 225 and read via the corresponding drive 212, or alternatively may be read by the user from the networks 220 or 222. Still further, the software can also be loaded into the computer system 200 from other computer readable media. Computer readable storage media refers to any non-transitory tangible storage medium that provides recorded instructions and/or data to the computer system 200 for execution and/or processing. Examples of such storage media include floppy disks, magnetic tape, CD-ROM, DVD, Blu-ray Disc™, a hard disk drive, a ROM or integrated circuit, USB memory, a magneto-optical disk, or a computer readable card such as a PCMCIA card and the like, whether or not such devices are internal or external of the computer module 201. Examples of transitory or non-tangible computer readable transmission media that may also participate in the provision of the software, application programs, instructions and/or video data or encoded video data to the computer module 401 include radio or infra-red transmission channels, as well as a network connection to another computer or networked device, and the Internet or Intranets including e-mail transmissions and information recorded on Websites and the like.

[00067] The second part of the application program 233 and the corresponding code modules mentioned above may be executed to implement one or more graphical user interfaces (GUIs) to be rendered or otherwise represented upon the display 214. Through manipulation of typically the keyboard 202 and the mouse 203, a user of the computer system 200 and the application may manipulate the interface in a functionally adaptable manner to provide controlling commands and/or input to the applications associated with the GUI(s). Other forms of functionally adaptable user interfaces may also be implemented, such as an audio interface utilizing speech prompts output via the loudspeakers 217 and user voice commands input via the microphone 280.

[00068] Fig. 2B is a detailed schematic block diagram of the processor 205 and a

“memory” 234. The memory 234 represents a logical aggregation of all the memory modules (including the HDD 209 and semiconductor memory 206) that can be accessed by the computer module 201 in Fig. 2A.

[00069] When the computer module 201 is initially powered up, a power-on self-test (POST) program 250 executes. The POST program 250 is typically stored in a ROM 249 of the semiconductor memory 206 of Fig. 2A. A hardware device such as the ROM 249 storing software is sometimes referred to as firmware. The POST program 250 examines hardware within the computer module 201 to ensure proper functioning and typically checks the processor 205, the memory 234 (209, 206), and a basic input-output systems software (BIOS) module 251, also typically stored in the ROM 249, for correct operation. Once the POST program 250 has run successfully, the BIOS 251 activates the hard disk drive 210 of Fig. 2A. Activation of the hard disk drive 210 causes a bootstrap loader program 252 that is resident on the hard disk drive 210 to execute via the processor 205. This loads an operating system 253 into the RAM memory 206, upon which the operating system 253 commences operation. The operating system 253 is a system level application, executable by the processor 205, to fulfil various high level functions, including processor management, memory management, device management, storage management, software application interface, and generic user interface.

[00070] The operating system 253 manages the memory 234 (209, 206) to ensure that each process or application running on the computer module 201 has sufficient memory in which to execute without colliding with memory allocated to another process. Furthermore, the different types of memory available in the computer system 200 of Fig. 2A must be used properly so that each process can run effectively. Accordingly, the aggregated memory 234 is not intended to illustrate how particular segments of memory are allocated (unless otherwise stated), but rather to provide a general view of the memory accessible by the computer system 200 and how such is used.

[00071] As shown in Fig. 2B, the processor 205 includes a number of functional modules including a control unit 239, an arithmetic logic unit (ALU) 240, and a local or internal memory 248, sometimes called a cache memory. The cache memory 248 typically includes a number of storage registers 244-246 in a register section. One or more internal busses 241 functionally interconnect these functional modules. The processor 205 typically also has one or more interfaces 242 for communicating with external devices via the system bus 204, using a connection 218. The memory 234 is coupled to the bus 204 using a connection 219.

[00072] The application program 233 includes a sequence of instructions 231 that may include conditional branch and loop instructions. The program 233 may also include data 232 which is used in execution of the program 233. The instructions 231 and the data 232 are stored in memory locations 228, 229, 230 and 235, 236, 237, respectively. Depending upon the relative size of the instructions 231 and the memory locations 228-230, a particular instruction may be stored in a single memory location as depicted by the instruction shown in the memory location 230. Alternately, an instruction may be segmented into a number of parts each of which is stored in a separate memory location, as depicted by the instruction segments shown in the memory locations 228 and 229.

[00073] In general, the processor 205 is given a set of instructions which are executed therein. The processor 205 waits for a subsequent input, to which the processor 205 reacts to by executing another set of instructions. Each input may be provided from one or more of a number of sources, including data generated by one or more of the input devices 202, 203, data received from an external source across one of the networks 220, 202, data retrieved from one of the storage devices 206, 209 or data retrieved from a storage medium 225 inserted into the corresponding reader 212, all depicted in Fig. 2A. The execution of a set of the instructions may in some cases result in output of data. Execution may also involve storing data or variables to the memory 234.

[00074] The video encoder 114, the video decoder 134 and the described methods may use input variables 254, which are stored in the memory 234 in corresponding memory

locations 255, 256, 257. The video encoder 114, the video decoder 134 and the described methods produce output variables 261, which are stored in the memory 234 in corresponding memory locations 262, 263, 264. Intermediate variables 258 may be stored in memory locations 259, 260, 266 and 267.

[00075] Referring to the processor 205 of Fig. 2B, the registers 244, 245, 246, the arithmetic logic unit (ALU) 240, and the control unit 239 work together to perform sequences of micro operations needed to perform“fetch, decode, and execute” cycles for every instruction in the instruction set making up the program 233. Each fetch, decode, and execute cycle comprises: a fetch operation, which fetches or reads an instruction 231 from a memory location 228, 229, 230; a decode operation in which the control unit 239 determines which instruction has been fetched; and an execute operation in which the control unit 239 and/or the ALU 240 execute the instruction. [00076] Thereafter, a further fetch, decode, and execute cycle for the next instruction may be executed. Similarly, a store cycle may be performed by which the control unit 239 stores or writes a value to a memory location 232.

[00077] Each step or sub-process in the method of Figs. 12 and 13, to be described, is associated with one or more segments of the program 233 and is typically performed by the register section 244, 245, 247, the ALU 240, and the control unit 239 in the processor 205 working together to perform the fetch, decode, and execute cycles for every instruction in the instruction set for the noted segments of the program 233.

[00078] Fig. 3 is a schematic block diagram showing functional modules of the video encoder 114. Fig. 4 is a schematic block diagram showing functional modules of the video decoder 134. Generally, data passes between functional modules within the video encoder 114 and the video decoder 134 in groups of samples or coefficients, such as divisions of blocks into sub-blocks of a fixed size, or as arrays. The video encoder 114 and video decoder 134 may be implemented using a general-purpose computer system 200, as shown in Figs. 2A and 2B, where the various functional modules may be implemented by dedicated hardware within the computer system 200, by software executable within the computer system 200 such as one or more software code modules of the software application program 233 resident on the hard disk drive 205 and being controlled in its execution by the processor 205. Alternatively the video encoder 114 and video decoder 134 may be implemented by a combination of dedicated hardware and software executable within the computer system 200. The video encoder 114, the video decoder 134 and the described methods may alternatively be implemented in dedicated hardware, such as one or more integrated circuits performing the functions or sub functions of the described methods. Examples of dedicated hardware include graphic processing units (GPUs), digital signal processors (DSPs), application-specific standard products (ASSPs), application-specific integrated circuits (ASICs), field programmable gate arrays (FPGAs) or one or more microprocessors and associated memories. In particular, the video encoder 114 comprises modules 310-386 and the video decoder 134 comprises modules 420-496 which may each be implemented as one or more software code modules of the software application program 233.

[00079] Implementations of the video encoder 114 and the video decoder 134 described herein can reduce on-chip memory consumption by processing the image data or bitstream in regions smaller than a CTU. On-chip memory is particularly costly as on-chip memory consumes a large area on a die. Software implementations may also benefit by confining more memory access to low levels of cache (e.g LI and L2 cache), reducing a need to access external memory. Thus, for reduced memory consumption, implementations of the video encoder 114 and the video decoder 134 can process data at a granularity smaller than the granularity of one CTU at a time.

[00080] The smaller granularity spatial area for processing may be a region (or‘virtual pipeline data unit’) size of 64x64 luma samples, tiled within each CTU. VPDUs are similar to the four regions resulting from one quadtree subdivision of a CTU. Moreover, the smaller granularity defines a region, treated as an indivisible region in the sense that each processing stage operates upon all the samples in one VPDU before progressing to the next VPDU. As such, once a VPDU is processed, there is no need to revisit the same VPDU later to process some

unprocessed portion. The indivisible region is passed through each processing stage of a pipelined architecture. The pipelined processing region is considered indivisible in the sense that the processing region defines one aggregation or chunk of data (such as samples, a collection of blocks and coefficients, a portion of the bitstream). The aggregation of data corresponds to a particular area on a frame (such as the frame 800) and is passed through the pipeline. Within the processing region, there can be various arrangements of CUs, and CUs may span multiple of the smaller granularity regions. The processing regions allow each pipeline processing stage to locally store only data associated with the smaller region, for example 64x64 luma samples or less, as opposed to data associated with the full CTU size of 128x 128. A corresponding local memory reduction for the chroma data is also realised using pipeline processing according to a VPDU data granularity (that is processing or parsing a VPDU at a time).

[00081] Although the video encoder 114 of Fig. 3 is an example of a versatile video coding (VVC) video encoding pipeline, other video codecs may also be used to perform the processing stages described herein. The video encoder 114 receives captured frame data 113, such as a series of frames, each frame including one or more colour channels. A block partitioner 310 firstly divides the frame data 113 into CTUs, generally square in shape and configured such that a particular size for the CTUs is used. The size of the CTUs may be 64x64, 128x 128, or 256x256 luma samples for example. The block partitioner 310 further divides each CTU into one or more CUs, with the CUs having a variety of sizes, which may include both square and non-square aspect ratios. However, in the VVC standard, CUs, PUs, and TUs always have side lengths that are powers of two. Thus, a current CU, represented as 312, is output from the block partitioner 310, progressing in accordance with an iteration over the one or more blocks of the CTU, in accordance with the coding tree of the CTU. Options for partitioning CTUs into CUs are further described below with reference to Figs. 5 and 6.

[00082] The CTUs resulting from the first division of the frame data 113 may be scanned in raster scan order and may be grouped into one or more‘slices’. A slice may be an‘intra’ (or T) slice, indicating that every CU in the slice is intra predicted. Alternatively, a slice may be uni- or bi-predicted (‘P’ or‘B’ slice, respectively), indicating the additional availability of uni- and bi-prediction, respectively. As the frame data 113 typically includes multiple colour channels, the CTUs and CUs are associated with the samples from all colour channels that overlap with the block area defined from operation of the block partitioner 310. A CU includes one coding block (CB) for each colour channel of the frame data 113. Due to the potentially differing sampling rate of the chroma channels compared to the luma channel, the dimensions of CBs for chroma channels may differ from those of CBs for luma channels. When using the 4:2:0 chroma format, CBs of chroma channels of a CU have dimensions of half of the width and height of the CB for the luma channel of the CU.

[00083] For each CTU, the video encoder 114 operates in two stages. In the first stage (referred to as a‘search’ stage), the block partitioner 310 tests various potential configurations of the coding tree. Each potential configuration of the coding tree has associated‘candidate’ CUs. The first stage involves testing various candidate CUs to select CUs providing relatively high compression efficiency with relatively low distortion. The testing generally involves a Lagrangian optimisation whereby a candidate CU is evaluated based on a weighted

combination of the rate (coding cost) and the distortion (error with respect to the input frame data 113). The‘best’ candidate CUs (those with the lowest rate/distortion) are selected for subsequent encoding into the bitstream 115. Included in evaluation of candidate CUs is an option to use a CU for a given area or to split the area according to various splitting options and code each of the smaller resulting areas with further CUs, or split the areas even further. As a consequence, both the CUs and the coding tree themselves are selected in the search stage.

[00084] The video encoder 114 produces a prediction unit (PU), indicated by an arrow 320, for each CU, for example the CU 312. The PU 320 is a prediction of the contents of the associated CU 312. A subtracter module 322 produces a difference, indicated as 324 (or‘residual’, referring to the difference being in the spatial domain), between the PU 320 and the CU 312. The difference 324 is a block-size difference between corresponding samples in the PU 320 and the CU 312. The difference 324 is transformed, quantised and represented as a transform unit (TU), indicated by an arrow 336. The PU 320 and associated TU 336 are typically chosen as the‘best’ one of many possible candidate CUs. Selection as the‘best’ relates to selection based on associated efficiency and distortion.

[00085] A candidate coding unit (CU) is a CU resulting from one of the prediction modes available to the video encoder 114 for the associated PU and the resulting residual. Each candidate CU results in one or more corresponding TUs, as described hereafter with reference to Figs. 10-12. The TU 336 is a quantised and transformed representation of the difference 324. When combined with the predicted PU in the decoder 114, the TU 336 reduces the difference between decoded CUs and the original CU 312 at the expense of additional signalling in a bitstream.

[00086] Each candidate coding unit (CU), that is prediction unit (PU) in combination with a transform unit (TU), thus has an associated coding cost (or‘rate’) and an associated difference (or‘distortion’). The rate is typically measured in bits. The distortion of the CU is typically estimated as a difference in sample values, such as a sum of absolute differences (SAD) or a sum of squared differences (SSD). The estimate resulting from each candidate PU is determined by a mode selector 386 using the difference 324 to determine an intra prediction

mode (represented by an arrow 388). Estimation of the coding costs associated with each candidate prediction mode and corresponding residual coding can be performed at significantly lower cost than entropy coding of the residual. Accordingly, a number of candidate modes can be evaluated to determine an optimum mode in a rate-distortion sense.

[00087] Determining an optimum mode is typically achieved using a variation of Lagrangian optimisation. Selection of the intra prediction mode 388 typically involves determining a coding cost for the residual data resulting from application of a particular intra prediction mode. The coding cost may be approximated by using a‘sum of absolute transformed differences’ (SATD) whereby a relatively simple transform, such as a Hadamard transform, is used to obtain an estimated transformed residual cost. In some implementations using relatively simple transforms, the costs resulting from the simplified estimation method are monotonically related to the actual costs that would otherwise be determined from a full evaluation. In

implementations with monotonically related estimated costs, the simplified estimation method may be used to make the same decision (i.e. intra prediction mode) with a reduction in complexity in the video encoder 114. To allow for possible non-monotonicity in the relationship between estimated and actual costs, the simplified estimation method may be used to generate a list of best candidates. The non-monotonicity may result from further mode decisions available for the coding of residual data, for example. The list of best candidates may be of an arbitrary number. A more complete search may be performed using the best candidates to establish optimal mode choices for coding the residual data for each of the candidates, allowing a final selection of the intra prediction mode along with other mode decisions.

[00088] The other mode decisions include an ability to skip a forward transform, known as ‘transform skip’. Skipping the transforms is suited to residual data that lacks adequate correlation for reduced coding cost via expression as transform basis functions. Certain types of content, such as relatively simple computer generated graphics may exhibit similar behaviour. For a‘skipped transform’, residual coefficients are still coded even though the transform itself is not performed.

[00089] Lagrangian or similar optimisation processing can be employed to both select an optimal partitioning of a CTU into CUs (by the block partitioner 310) as well as the selection of a best prediction mode from a plurality of possibilities. Through application of a Lagrangian optimisation process of the candidate modes in the mode selector module 386, the intra prediction mode with the lowest cost measurement is selected as the best or current mode. The current mode is the selected intra prediction mode 388 and is also encoded in the bitstream 115 by an entropy encoder 338. The selection of the intra prediction mode 388 by operation of the mode selector module 386 extends to operation of the block partitioner 310. For example, candidates for selection of the intra prediction mode 388 may include modes applicable to a given block and additionally modes applicable to multiple smaller blocks that collectively are collocated with the given block. In cases including modes applicable to a given block and smaller collocated blocks, the process of selection of candidates implicitly is also a process of determining the best hierarchical decomposition of the CTU into CUs.

[00090] In the second stage of operation of the video encoder 114 (referred to as a‘coding’ stage), an iteration over the selected coding tree, and hence each selected CU, is performed in the video encoder 114. In the iteration, the CUs are encoded into the bitstream 115, as described further herein.

[00091] The entropy encoder 338 supports both variable-length coding of syntax elements and arithmetic coding of syntax elements. Arithmetic coding is supported using a context-adaptive binary arithmetic coding process. Arithmetically coded syntax elements consist of sequences of one or more‘bins’ . Bins, like bits, have a value of‘0’ or‘ G . However bins are not encoded in the bitstream 115 as discrete bits. Bins have an associated predicted (or‘likely’ or‘most probable’) value and an associated probability, known as a‘context’. When the actual bin to be coded matches the predicted value, a‘most probable symbol’ (MPS) is coded. Coding a most probable symbol is relatively inexpensive in terms of consumed bits. When the actual bin to be coded mismatches the likely value, a‘least probable symbol’ (LPS) is coded. Coding a least probable symbol has a relatively high cost in terms of consumed bits. The bin coding techniques enable efficient coding of bins where the probability of a‘0’ versus a‘ G is skewed. For a syntax element with two possible values (that is, a‘flag’), a single bin is adequate. For syntax elements with many possible values, a sequence of bins is needed.

[00092] The presence of later bins in the sequence may be determined based on the value of earlier bins in the sequence. Additionally, each bin may be associated with more than one context. The selection of a particular context can be dependent on earlier bins in the syntax element, the bin values of neighbouring syntax elements (i.e. those from neighbouring blocks) and the like. Each time a context-coded bin is encoded, the context that was selected for that bin (if any) is updated in a manner reflective of the new bin value. As such, the binary arithmetic coding scheme is said to be adaptive.

[00093] Also supported by the video encoder 114 are bins that lack a context (‘bypass bins’). Bypass bins are coded assuming an equiprobable distribution between a‘0’ and a‘ G. Thus, each bin occupies one bit in the bitstream 115. The absence of a context saves memory and reduces complexity, and thus bypass bins are used where the distribution of values for the particular bin is not skewed. One example of an entropy coder employing context and adaption is known in the art as CAB AC (context adaptive binary arithmetic coder) and many variants of this coder have been employed in video coding.

[00094] The entropy encoder 338 encodes the intra prediction mode 388 using a combination of context-coded and bypass-coded bins. Typically, a list of‘most probable modes’ is generated in the video encoder 114. The list of most probable modes is typically of a fixed length, such as three or six modes, and may include modes encountered in earlier blocks. A context-coded bin encodes a flag indicating if the intra prediction mode is one of the most probable modes. If the intra prediction mode 388 is one of the most probable modes, further signalling, using bypass- coded bins, is encoded. The encoded further signalling is indicative of which most probable mode corresponds with the intra prediction mode 388, for example using a truncated unary bin string. Otherwise, the intra prediction mode 388 is encoded as a‘remaining mode’. Encoding as a remaining mode uses an alternative syntax, such as a fixed-length code, also coded using bypass-coded bins, to express intra prediction modes other than those present in the most probable mode list.

[00095] A multiplexer module 384 outputs the PU 320 according to the determined best intra prediction mode 388, selecting from the tested prediction mode of each candidate CU. The candidate prediction modes need not include every conceivable prediction mode supported by the video encoder 114.

[00096] Prediction modes fall broadly into two categories. A first category is‘intra-frame prediction’ (also referred to as‘intra prediction’) In intra-frame prediction, predicted samples for a block are generated, and the generation method may use other samples obtained from the current frame. For an intra-predicted PU, it is possible for different intra-prediction modes to be used for luma and chroma, and thus intra prediction is described primarily in terms of operation upon PBs rather than PUs. In general, for intra prediction, samples are obtained according to a template which abuts the current block and the obtained samples are used to generate the predicted samples according to an‘intra prediction mode’ . Examples of intra prediction mode are‘DC’,‘planar’, and‘angular modes’ . Another type of prediction using samples from the current frame is known as‘current picture referencing’ (CPR). For CPR, instead of selecting samples using a fixed template relative to the current block, a block of samples is selected from samples already reconstructed in the current frame, the block being relative to the current block according to a block vector. The selected block is used to form predicted samples.

[00097] The second category of prediction modes is‘inter-frame prediction’ (also referred to as ‘inter prediction’). In inter-frame prediction a prediction for a block is produced using samples from one or two frames preceding the current frame in the order of coding frames in the bitstream (which may differ from the order of the frames when captured or displayed). When one frame is used for prediction, the block is said to be‘uni-predicted’ and has one associated motion vector. When two frames are used for prediction, the block is said to be‘bi-predicted’ and has two associated motion vectors. For a P slice, each CU may be intra predicted or uni- predicted and for a B slice, each CU may be intra predicted, uni -predicted, or bi-predicted. Frames are typically coded using a‘group of picture’ structure, enabling a temporal hierarchy of frames. A temporal hierarchy of frames allows a frame to reference a preceding and a subsequent picture in the order of displaying the frames, with the pictures being coded in the order necessary to ensure the dependencies for decoding each frame are met. The one of two frames are each selected from a‘reference picture list’, comprising previously decoded pictures from the video bitstream and a reference picture index is associated with each motion vector to select one picture from one reference picture list (RefListO) for uni-prediction or one picture from each reference picture list (RefListO and RefListl) for bi-prediction

[00098] In VVC, CPR is treated as an inter prediction mode in that the current picture is treated as referenceable by RefListO, however in constrast to inter prediction, the current picture is access prior to loop filtering and the accessable area of the current picture is constrained to reduce memory consumption, and references to samples that have not yet been reconstructed, i.e. from future blocks, are prohibited.

[00099] A subcategory of inter prediction is the‘skip mode’ . Inter prediction and skip mode will be described as two distinct modes, even though they both involve motion vectors referencing blocks of samples from preceding frames. Inter prediction involves a coded motion vector delta, providing a spatial offset to a selected motion vector prediction. Inter prediction also uses a coded residual in the bitstream 133. Skip mode uses only an index (a‘merge index’) to select one out of several motion vector candidates. The selected candidate is used without any further signalling. Also, skip mode does not support the coding of any residual.

The absence of coded residual when the skip mode is used means that there is no need to perform transforms for the skip mode and therefore skip mode does not result in pipeline processing issues, as may be the case for intra predicted CUs and inter predicted CUs. Due to the limited signalling of the skip mode, this mode is useful for achieving very high compression performance when high quality reference frames are available. Bi-predicted CUs in higher temporal layers of a random-access group-of-picture structure typically have high quality reference pictures and motion vector candidates that accurately reflect underlying motion. Consequently, skip mode is useful for bi-predicted blocks in frames at higher temporal layers in a‘random access’ group-of-picture structure.

[000100] The samples are selected according to a motion vector and reference picture index. The motion vector and reference picture index applies to all colour channels and thus inter prediction is described primarily in terms of operation upon PUs rather than PBs. Within each category (that is, intra- and inter-frame prediction), different techniques may be applied to generate the PU. For example, intra prediction may use values from adjacent rows and columns of previously reconstructed samples, in combination with a direction to generate a PU according to a prescribed filtering and generation process. Alternatively, the PU may be described using a relatively small number of parameters. Inter prediction methods may vary in the number of motion parameters and their precision. Motion parameters typically comprise a reference frame index, indicating which reference frame(s) from lists of reference frames are to be used plus a spatial translation for each of the reference frames, but may include more frames, special frames, or complex affine parameters such as scaling and rotation. In addition, a pre determined motion refinement process may be applied to generate dense motion estimates based on referenced sample blocks.

[000101] Having determined and selected a best PU 320, and subtracted the PU 320 from the original sample block at the subtractor 322, a residual with lowest coding cost, represented as 324, is obtained and subjected to lossy compression. The lossy compression process comprises the steps of transformation, quantisation and entropy coding. A transform module 326 applies a forward transform to the difference 324, converting the difference 324 from the spatial domain to the frequency domain, and producing transform coefficients represented by an arrow 332.

The forward transform is typically separable, transforming a set of rows and then a set of columns of each block. The transformation of each set of rows and columns is performed by applying one-dimensional transforms firstly to each row of a block to produce a partial result and then to each column of the partial result to produce a final result.

[000102] The transform coefficients 332 are passed to a quantiser module 334. At the module 334, quantisation in accordance with a‘quantisation parameter’ is performed to produce residual coefficients, represented by the arrow 336. The quantisation parameter is constant for a given TB and thus results in a uniform scaling for the production of residual coefficients for a TB. A non-uniform scaling is also possible by application of a‘quantisation matrix’, whereby the scaling factor applied for each residual coefficient is derived from a combination of the quantisation parameter and the corresponding entry in a scaling matrix, typically having a size equal to that of the TB. The residual coefficients 336 are supplied to the entropy encoder 338 for encoding in the bitstream 115. Typically, the residual coefficients of each TB with at least one significant residual coefficient of the TU are scanned to produce an ordered list of values, according to a scan pattern. The scan pattern generally scans the TB as a sequence of 4><4‘sub blocks’, providing a regular scanning operation at the granularity of 4x4 sets of residual coefficients, with the arrangement of sub-blocks dependent on the size of the TB. Additionally, the prediction mode 388 and the corresponding block partitioning are also encoded in the bitstream 115.

[000103] As described above, the video encoder 114 needs access to a frame representation corresponding to the frame representation seen in the video decoder 134. Thus, the residual coefficients 336 are also inverse quantised by a dequantiser module 340 to produce inverse transform coefficients, represented by an arrow 342. The inverse transform coefficients 342 are passed through an inverse transform module 348 to produce residual samples, represented by an arrow 350, of the TU. A summation module 352 adds the residual samples 350 and the PU 320 to produce reconstructed samples (indicated by an arrow 354) of the CU.

[000104] The reconstructed samples 354 are passed to a reference sample cache 356 and an in loop filters module 368. The reference sample cache 356, typically implemented using static RAM on an ASIC (thus avoiding costly off-chip memory access), is intended to minimise sample storage needed to satisfy the dependencies for generating intra-frame PBs for subsequent CUs in the frame. The minimal dependencies typically include a Tine buffer’ of samples along the bottom of a row of CTUs, for use by the next row of CTUs and column buffering the extent of which is set by the height of the CTU. The reference sample cache 356 supplies reference samples (represented by an arrow 358) to a reference sample filter 360. The sample filter 360 applies a smoothing operation to produce filtered reference samples (indicated by an arrow 362). The filtered reference samples 362 are used by an intra-frame prediction module 364 to produce an intra-predicted block of samples, represented by an arrow 366. For each candidate intra prediction mode the intra-frame prediction module 364 produces a block of samples, that is 366.

[000105] The reference sample cache 356 also holds reconstructed samples that may be needed for CPR. As such, a buffer generally amounting to one CTU is present, although the spatial arrangement of samples within the buffer need not be confined to the area of one CTU, instead the buffer may be divided into VPDUs, such that samples of the last few (e.g. four) VPDUs are held for reference. This buffering includes the VPDU within which CUs are presently being decoded. A current picture referencing module 390 produces a CPR PU 392, which is input to the multiplexier 384 and has a block vector. If CPR is chosen as the prediction mode then the CPR PU 392 is selected as the one for use by the CU.

[000106] The in-loop filters module 368 applies several filtering stages to the reconstructed samples 354. The filtering stages include a‘deblocking filter’ (DBF) which applies smoothing aligned to the CU boundaries to reduce artefacts resulting from discontinuities. Another filtering stage present in the in-loop filters module 368 is an‘adaptive loop filter’ (ALF), which applies a Wiener-based adaptive filter to further reduce distortion. A further available filtering stage in the in-loop filters module 368 is a‘sample adaptive offset’ (SAO) filter. The SAO filter operates by firstly classifying reconstructed samples into one or multiple categories and, according to the allocated category, applying an offset at the sample level.

[000107] Filtered samples, represented by an arrow 370, are output from the in-loop filters module 368. The filtered samples 370 are stored in a frame buffer 372. The frame buffer 372 typically has the capacity to store several (for example up to 16) pictures and thus is stored in the memory 206. The frame buffer 372 is not typically stored using on-chip memory due to the large memory consumption required. As such, access to the frame buffer 372 is costly in terms of memory bandwidth. The frame buffer 372 provides reference frames (represented by an arrow 374) to a motion estimation module 376 and a motion compensation module 380.

[000108] The motion estimation module 376 estimates a number of‘motion vectors’ (indicated as 378), each being a Cartesian spatial offset from the location of the present CU, referencing a block in one of the reference frames in the frame buffer 372. A filtered block of reference samples (represented as 382) is produced for each motion vector. The filtered reference samples 382 form further candidate modes available for potential selection by the mode selector 386. Moreover, for a given CU, the PU 320 may be formed using one reference block (‘uni -predicted’) or may be formed using two reference blocks (‘bi-predicted’). For the selected motion vector, the motion compensation module 380 produces the PU 320 in accordance with a filtering process supportive of sub-pixel accuracy in the motion vectors. As such, the motion estimation module 376 (which operates on many candidate motion vectors) may perform a simplified filtering process compared to that of the motion compensation module 380 (which operates on the selected candidate only) to achieve reduced computational complexity.

[000109] Although the video encoder 114 of Fig. 3 is described with reference to versatile video coding (VYC), other video coding standards or implementations may also employ the processing stages of modules 310-386. The frame data 113 (and bitstream 115) may also be read from (or written to) memory 206, the hard disk drive 210, a CD-ROM, a Blu-ray disk™ or other computer readable storage medium. Additionally, the frame data 113 (and bitstream 115) may be received from (or transmitted to) an external source, such as a server connected to the communications network 220 or a radio-frequency receiver.

[000110] The video decoder 134 is shown in Fig. 4. Although the video decoder 134 of Fig. 4 is an example of a versatile video coding (VVC) video decoding pipeline, other video codecs may also be used to perform the processing stages described herein. As shown in Fig. 4, the bitstream 133 is input to the video decoder 134. The bitstream 133 may be read from memory 206, the hard disk drive 210, a CD-ROM, a Blu-ray disk™ or other non-transitory computer readable storage medium. Alternatively, the bitstream 133 may be received from an external source such as a server connected to the communications network 220 or a radio frequency receiver. The bitstream 133 contains encoded syntax elements representing the captured frame data to be decoded.

[000111] The bitstream 133 is input to an entropy decoder module 420. The entropy decoder module 420 extracts syntax elements from the bitstream 133 and passes the values of the syntax elements to other modules in the video decoder 134. The entropy decoder module 420 applies a CAB AC algorithm to decode syntax elements from the bitstream 133. The decoded syntax elements are used to reconstruct parameters within the video decoder 134. Parameters include residual coefficients (represented by an arrow 424) and mode selection information such as an intra prediction mode (represented by an arrow 458). The mode selection information also includes information such as motion vectors, and the partitioning of each CTU into one or more CUs. Parameters are used to generate PUs, typically in combination with sample data from previously decoded CUs.

[000112] The residual coefficients 424 are input to a dequantiser module 428. The dequantiser module 428 performs inverse quantisation (or‘scaling’) on the residual coefficients 424 to create reconstructed transform coefficients, represented by an arrow 440, according to a quantisation parameter. Should use of a non-uniform inverse quantisation matrix be indicated in the bitstream 133, the video decoder 134 reads a quantisation matrix from the bitstream 133 as a sequence of scaling factors and arranges the scaling factors into a matrix. The inverse scaling uses the quantisation matrix in combination with the quantisation parameter to create the reconstructed intermediate transform coefficients.

[000113] The reconstructed transform coefficients 440 are passed to an inverse transform module 444. The module 444 transforms the coefficients from the frequency domain back to the spatial domain. The TB is effectively based on significant residual coefficients and non- significant residual coefficient values. The result of operation of the module 444 is a block of residual samples, represented by an arrow 448. The residual samples 448 are equal in size to the corresponding CU. The residual samples 448 are supplied to a summation module 450. At the summation module 450 the residual samples 448 are added to a decoded PU (represented as 452) to produce a block of reconstructed samples, represented by an arrow 456. The reconstructed samples 456 are supplied to a reconstructed sample cache 460 and an in-loop filtering module 488. The in-loop filtering module 488 produces reconstructed blocks of frame samples, represented as 492. The frame samples 492 are written to a frame buffer 496.

[000114] The reconstructed sample cache 460 operates similarly to the reconstructed sample cache 356 of the video encoder 114. The reconstructed sample cache 460 provides storage for reconstructed sample needed to intra predict subsequent CUs without the memory 206 (for example by using the data 232 instead, which is typically on-chip memory). Reference samples, represented by an arrow 464, are obtained from the reconstructed sample cache 460 and supplied to a reference sample filter 468 to produce filtered reference samples indicated by arrow 472. The filtered reference samples 472 are supplied to an intra-frame prediction module 476. The module 476 produces a block of intra-predicted samples, represented by an arrow 480, in accordance with the intra prediction mode parameter 458 signalled in the bitstream 133 and decoded by the entropy decoder 420.

[000115] When intra prediction is indicated in the bitstream 133 for the current CU, the intra- predicted samples 480 form the decoded PU 452 via a multiplexor module 484.

[000116] When inter prediction is indicated in the bitstream 133 for the current CU, a motion compensation module 434 produces a block of inter-predicted samples, represented as 438, using a motion vector and reference frame index to select and filter a block of samples from a frame buffer 496. The inter-predicted samples 438 form the decoded PU 452 via a multiplexor module 484. The block of samples 498 is obtained from a previously decoded frame stored in the frame buffer 496. For bi-prediction, two blocks of samples are produced and blended together to produce samples for the decoded PU 452. The frame buffer 496 is populated with filtered block data 492 from an in-loop filtering module 488. As with the in-loop filtering module 368 of the video encoder 114, the in-loop filtering module 488 applies any, at least, or all of the DBF, the ALF and SAO filtering operations. The in-loop filtering module 368 produces the filtered block data 492 from the reconstructed samples 456. [000117] When current picture referencing is indicated in the bitstream 133 for the current CU, a current picture referencing module 490 selects a block of samples from the reconstructed sample cache 460, according to the block vector of the current CU, as CPR samples 492. The CPR samples 492 are selected by the multiplexor 484 to form the decoded PU 452.

[000118] Fig. 5 is a schematic block diagram showing a collection 500 of available divisions or splits of a region into one or more sub-regions in the tree structure of versatile video coding.

The divisions shown in the collection 500 are available to the block partitioner 310 of the encoder 114 to divide each CTU into one or more CUs according to a coding tree, as determined by the Lagrangian optimisation, as described with reference to Fig. 3.

[000119] Although the collection 500 shows only square regions being divided into other, possibly non-square sub-regions, it should be understood that the diagram 500 is showing the potential divisions but not requiring the containing region to be square. If the containing region is non-square, the dimensions of the blocks resulting from the division are scaled according to the aspect ratio of the containing block. Once a region is not further split, that is, at a leaf node of the coding tree, a CU occupies that region. The particular subdivision of a CTU into one or more CUs by the block partitioner 310 is referred to as the‘coding tree’ of the CTU. The process of subdividiong regions into sub-regions terminates when the resulting sub-regions reach a a minimum CU size. In addition to constraining CUs to prohibit sizes smaller than for example 4><4, CUs are constrained to have a minimum width or height of four. Other minimums, both in terms of width and height or in terms of width or height are also possible. The process of subdivision may also terminate prior to the deepest level of decomposition, resulting in a CU larger than the minimum CU size. It is possible for no splitting to occur, resulting in a single CU occupying the entirety of the CTU.

[000120] At the leaf nodes of the coding tree exist CUs, with no further subdivision. For example, a leaf node 510 (or‘no split’) contains one CU. At the non-leaf nodes of the coding tree exist either a split into two or more further nodes, each of which could either contain a leaf node that thus one CU or contain further splits into smaller regions.

[000121] A quad-tree split 512 divides the containing region into four equal-size regions as shown in Fig. 5. Compared to HEVC, versatile video coding (VVC) achieves additional flexibility with the addition of a horizontal binary split 514 and a vertical binary split 516. Each of the splits 514 and 516 divides the containing region into two equal-size regions. The division is either along a horizontal boundary (514) or a vertical boundary (516) within the containing block.

[000122] Further flexibility is achieved in versatile video coding with the addition of a ternary horizontal split 518 and a ternary vertical split 520. The ternary splits 518 and 520 divide the block into three regions, bounded either horizontally (518) or vertically (520) along ¼ and ¾ of the containing region width or height. The combination of the quad tree, binary tree, and ternary tree is referred to as‘QTBTTT’ or alternatively as a multi-tree (MT).

[000123] Compared to HEVC, which supports only the quad tree and thus only supports square blocks, the QTBTTT results in many more possible CU sizes, particularly considering possible recursive application of binary tree and/or ternary tree splits. The potential for unusual (for example, non-square) block sizes may be reduced by constraining split options to eliminate splits that would result in a block width or height either being less than four samples or in not being a multiple of four samples. Generally, the constraint would apply in considering luma samples. However, the constraint may also apply separately to the blocks for the chroma channels, potentially resulting in differing minimum block sizes for luma versus chroma, for example when the frame data is in the 4:2:0 chroma format. Each split produces sub-regions with a side dimension either unchanged, halved or quartered, with respect to the containing region. Then, since the CTU size is a power of two, the side dimensions of all CUs are also powers of two.

[000124] Fig. 6 is a schematic flow diagram illustrating a data flow 600 of a QTBTTT (or ‘coding tree’) structure used in versatile video coding. The QTBTTT structure is used for each CTU to define a division of the CTU into one or more CUs. The QTBTTT structure of each CTU is determined by the block partitioner 310 in the video encoder 114 and encoded into the bitstream 115 or decoded from the bitstream 133 by the entropy decoder 420 in the video decoder 134. The data flow 600 further characterises the permissible combinations available to the block partitioner 310 for dividing a CTU into one or more CUs, according to the divisions shown in Fig. 5.

[000125] Starting from the top level of the hierarchy, that is at the CTU, zero or more quad-tree divisions are first performed. Specifically, a Quad-tree (QT) split decision 610 is made by the block partitioner 310. The decision at 610 returning a‘ 1’ symbol indicates a decision to split the current node into four sub-nodes according to the quad-tree split 512. The result is the generation of four new nodes, such as at 620, and for each new node, recursing back to the QT split decision 610. Each new node is considered in raster (or Z-scan) order. Alternatively, if the QT split decision 610 indicates that no further split is to be performed (returns a‘0’ symbol), quad-tree partitioning ceases and multi-tree (MT) splits are subsequently considered.

[000126] Firstly, an MT split decision 612 is made by the block partitioner 310. At 612, a decision to perform an MT split is indicated. Returning a‘0’ symbol at decision 612 indicates that no further splitting of the node into sub-nodes is to be performed. If no further splitting of a node is to be performed, then the node is a leaf node of the coding tree and corresponds to a CU. The leaf node is output at 622. Alternatively, if the MT split 612 indicates a decision to perform an MT split (returns a‘G symbol), the block partitioner 310 proceeds to a direction decision 614.

[000127] The direction decision 614 indicates the direction of the MT split as either horizontal (Ή’ or‘0’) or vertical (‘V’ or‘G). The block partitioner 310 proceeds to a decision 616 if the decision 614 returns a‘0’ indicating a horizontal direction. The block partitioner 310 proceeds to a decision 618 if the decision 614 returns a‘ G indicating a vertical direction.

[000128] At each of the decisions 616 and 618, the number of partitions for the MT split is indicated as either two (binary split or‘BT’ node) or three (ternary split or‘TT’) at the BT/TT split. That is, a BT/TT split decision 616 is made by the block partitioner 310 when the indicated direction from 614 is horizontal and a BT/TT split decision 618 is made by the block partitioner 310 when the indicated direction from 614 is vertical.

[000129] The BT/TT split decision 616 indicates whether the horizontal split is the binary split 514, indicated by returning a‘O’, or the ternary split 518, indicated by returning a‘G. When the BT/TT split decision 616 indicates a binary split, at a generate HBT CTU nodes step 625 two nodes are generated by the block partitioner 310, according to the binary horizontal split 514. When the BT/TT split 616 indicates a ternary split, at a generate HTT CTU nodes step 626 three nodes are generated by the block partitioner 310, according to the ternary horizontal split 518.

[000130] The BT/TT split decision 618 indicates whether the vertical split is the binary split 516, indicated by returning a‘O’, or the ternary split 520, indicated by returning a‘G. When the BT/TT split 618 indicates a binary split, at a generate VBT CTU nodes step 627 two nodes are generated by the block partitioner 310, according to the vertical binary split 516. When the BT/TT split 618 indicates a ternary split, at a generate VTT CTU nodes step 628 three nodes are generated by the block partitioner 310, according to the vertical ternary split 520. For each node resulting from steps 625-628 recursion of the data flow 600 back to the MT split decision 612 is applied, in a left-to-right or top-to-bottom order, depending on the direction 614. As a consequence, the binary tree and ternary tree splits may be applied to generate CUs having a variety of sizes.

[000131] Figs. 7A and 7B provide an example division 700 of a CTU 710 into a number of CUs. An example CU 712 is shown in Fig. 7A. Fig. 7A shows a spatial arrangement of CUs in the CTU 710. The example division 700 is also shown as a coding tree 720 in Fig. 7B.

[000132] At each non-leaf node in the CTU 710 of Fig. 7A, for example nodes 714, 716 and 718, the contained nodes (which may be further divided or may be CUs) are scanned or traversed in a‘Z-order’ to create lists of nodes, represented as columns in the coding tree 720. For a quad-tree split, the Z-order scanning results in top left to right followed by bottom left to right order. For horizontal and vertical splits, the Z-order scanning (traversal) simplifies to a top-to-bottom scan and a left-to-right scan, respectively. The coding tree 720 of Fig. 7B lists all nodes and CUs according to the applied scan order. Each split generates a list of two, three or four new nodes at the next level of the tree until a leaf node (CU) is reached.

[000133] Having decomposed the image into CTUs and further into CUs by the block partiti oner 310, and using the CUs to generate each residual block (324) as described with reference to Fig. 3, residual blocks are subject to forward transformation and quantisation by the video encoder 114. The resulting TBs 336 are subsequently scanned to form a sequential list of residual coefficients, as part of the operation of the entropy coding module 338. An equivalent process is performed in the video decoder 134 to obtain TBs from the bitstream 133.

[000134] Fig. 8A shows CUs of a CTU 800 with a vertical ternary split at the top level of the coding tree, and no further splits. The resulting CUs are CUs 802, 804, and 806, of size 32x 128, 64x 128, and 32x 128 respectively. The CUs 802, 804, and 806 are located within the CTU at offsets (0, 0), (32, 0), and (96, 0), respectively. For each CU a corresponding PU of the same size exists, and in the CTU 800 the corresponding PUs span multiple VPDUs. One or more TUs are also associated with each CU. When the CU size is equal to one of the transform sizes, one TU is associated with the CU and has a size equal to a transform of the corresponding size. The resulting TUs for each CU are as follows: CU 802 has two 32x64 TUs, CU 804 has two 64x64 TUs, and CU 806 has two 32x64 TUs. Due to the placement of the two 64x64 TUs of the CU 804 it is not possible to divide the CTU 800 into four VPDUs for processing in a pipelined manner. [000135] Fig. 8B shows a CTU 840 having an alternative arrangement of TUs associated with the CUs of the coding tree of Fig. 8 A. When the CU size is larger than any of the transform sizes, multiple TUs are arranged in a‘tiled’ manner to occupy the entirety of the CU. Tiling uses the largest available transform that‘fits’ within the CU, given width and height constraints. As with Fig. 8 A, a 32x 128 CU 860 and a 32x 128 CU 1046 use two 32x64 TUs in a tiled manner. Moreover, TUs are also prohibited from crossing a boundary 850 that divides the CTU 840 into four VPDUs. As a consequence, a 64x 128 CU 862 uses four 32x64 TUs arranged in a tiled manner, as 32x64 is the largest transform size available for the CU 862 that ‘fits’ inside the CU without crossing the boundary 850.

[000136] The arrangement of TUs of Fig. 8B enables processing of inter-predicted CUs to be performed in a pipelined manner (VPDU by VPDU) after entropy decoding. Pipelined processing may be performed because an inter-predicted CU may be processed piecemeal, since each portion of the CU depends on the reference pictures and the motion vector is common to the entire CU. For intra-predicted CUs, dependencies on neighbouring samples from previous CUs prohibit the VPDU-by-VPDU style of processing. The dependency is solved by only allowing intra-predicted CUs to exist when the top level split is a quadtree split. When the top level split is a quadtree split, each of the resulting four nodes occupies one VPDU. Each of the four nodes may be processed in its entirety before progressing to the next VPDU. Simulation results show that the restriction allowing intra-predicted CUs to exist when the top level split is a quadtree split has no impact when coding in‘all intra’ configuration under JVET common test conditions (CTC), and the lack of coding performance impact is due to the limited use for intra prediction in large blocks, i.e. blocks exceeding 64x64 samples. When coding typical video content, blocks exceeding 64x64 samples are encountered at low bit rates and use inter prediction. Consequently, only allowing intra-predicted CUs underneath a top level quadtree split confines the resulting CUs to within each VPDU and eliminates reference sample dependencies on future VPDUs, solving pipeline issues with negligible coding performance impact.

[000137] Another constraint that achieves the same effect is to only allow intra predicted CUs in regions where the top level split is binary in one direction and a second level split is binary in the opposite direction. Then, the two resulting regions are each one VPDU in size with one region fully processed before progressing to the next one. The restrictions on usage of intra prediction may be implemented as a conformance constraint. When implemented as a conformance constraint, the‘pred_mode_flag’ syntax element of each CU is able to express via its binarisation a selection of either intra (‘ G) or inter prediction (‘0’), however the allowed selection is constrained in only values of pred_mode that accord with the above criteria may be used. For example, pred_mode_flag may only be‘ in a CU underneath a quadtree split.

When implemented as a signalling constraint, the‘pred_mode_flag’ syntax element is only able to express (binarise) the selection of prediction modes that are allowed for the CU, given parent splits of the CU in the coding tree of the CTU. For example, in a CU that is not underneath a quadtree split pred_mode_flag may not be‘ G (intra), in which case pred_mode flag only has one possible value (‘0’ for inter) and may be omitted.

[000138] Fig. 9A is shows a coding tree division of a coding tree unit (CTU) 9100 into five coding units (CUs). The CUs are CU0 9110, of size 32x 128 and at location (0, 0) in the CTU 9100, CU1 9112, of size 64x32 and at location (32, 0) in the CTU 9100, CU2 9114, of size 64x64 and at location (32, 32) in the CTU 9100, CU3 9116, of size 64x32 and at location (32, 96) in the CTU 9100, and CU4 9118, of size 32x 128 and at location (96, 0) in the

CTU 9100. A boundary 9120 divides the CTU 9100 into VPDUs labelled VPDU0-3 in Fig. 9A. Regarding VPDU order, when the top-level split is a vertical split, as is the case in Fig. 9A, the VPDUs are ordered top-left, bottom-left, top-right, bottom-right. When the top-level split is a horizontal split or a quadtree split, the VPDUs within a CTU are ordered top-left, top-right, bottom-left, bottom-right. The two orders result in VPDU order that is closer to the underlying CU order. Notwithstanding the order, as shown in Fig. 9A, the CU order results in re-entry of earlier VPDUs For example, CU0 9110 spans VPDU0 and VPDU1, then CU1 9112 returns to VPDUO, while also spanning into VPDU2.

[000139] Due to various possible top-level splits of a coding tree versus the fixed arrangement of VPDUs in each CTU, a heuristic for determining the transition from one VPDU to the next is useful. For processing of samples in a given VPDU to commence, all associated information (coding tree and resulting block structure, prediction modes and residual coefficients) is used. Once the CU overlapping the lower right sample of the VPDU has been parsed, sufficient information to process the overall VPDU has been obtained by the video decoder 134. Thus, the transition from one VPDU to the next corresponds to the CUs overlapping the lower right samples of VPDUO-2, i.e. sample 9122 for VPDUO, sample 9124 for VPDUl, and sample 9126 for VPDU2. For VPDU3, completion of parsing the coding tree is sufficient to determine the boundary, as completion of the coding tree corresponds to the already known transition from one CTU to the next CTU.

[000140] Fig. 9B shows the transform units resulting from the coding trees of the CTU 9100 of Fig. 9A. For each CU in the CTU 9100 one or more TUs are inferred to exist based on the coding tree and the boundary 9120. The largest TU is used that occupies the CU without crossing the boundary 9120. Consequently, CUO 9110 has TU0 and TUI, each of size 32x64, CU1 9112 has TU2 and TU3, each of size 32x32, CU2 9114 has TU4-TU7, each of size 32x32, CU3 9116 has TU8 and TU9, each of size 32x32, and CU4 9118 has TU10 and TUI 1, each of size 32x64.

[000141] Fig. 9C shows a conventional coding order of a CTU 9300 having the same coding tree and thus the same transform units as shown in Fig. 9B. The CTU 9300 has VPDUs VPDU0 to VPDU3. The conventional coding order is to progress from one CU to another in an order resulting from traversal of the coding tree, i.e. a hierarchical Z-order scan of the coding tree. Within each CU, the TUs are traversed in a Z-order scan. The resulting order is shown as arrows in Fig. 9C (e.g. arrow 9310) and corresponds to the enumeration of the transform units TU0-TU11. The TU order of Fig. 9C results in re-entry of earlier VPDUs before the earlier VPDUs are fully processed. For example, TUI of VPDU1 is processed before TU2 and TU4 of VPDU0.

[000142] Fig. 9D shows a conventional coding order of the transform units of Fig. 9C in a bitstream portion 9400. The bitstream portion 9400 shows the relationship with the four VPDUs of the CTU 9300. In particular, a pipelined decoder that processes data on a VPDU basis receives some TUs before they are needed. For example, to process VPDU0, TU0, TU2, and TU4 need to be parsed, however in doing so TUI and TU3 are also parsed. The residual coefficients of TUI and TU3 thus need to be buffered in the entropy decoder 420 while processing VPDU0 for later use when processing VPDU1. The entropy decoder 420 can be considered as operating at the CTU level. However, downstream modules (inverse quantisation module 428, inverse transform module 444, inter prediction modules such as 376, 380) operate at the VPDU level. The order of the bitstream 9400 and/or operation of the decoder 420 at CTU level accordingly imply additional memory consumption in the entropy decoder 420 to hold the residual coefficients for later use.

[000143] Fig. 9E shows a coding order of the transform units of Fig. 9B in a CTU 9500. In the example of Fig. 9E, transform units TU0 to TUI 1 are coded in a consecutive order with respect to the four VPDUs (VPDU0-VPDU3) of the CTU. The ordering of parsing the TUs, as shown by arrows, e.g. an arrow 9510, no longer accords with the conventional Z-order traversal of TUs and Z-order traversal CUs in the coding tree. The order is such that all TUs in one VPDU are processed before advancing to the next VPDU. For example, TU0, TU2 and TU4 are processed prior to TUI or TU3. [000144] Fig. 9F shows a coding order of the transform units of Fig. 9E in a bitstream portion 9600. The transform units are divided such that the transform units of each VPDU of the CTU are coded adjacently. The CUs are coded according to a Z-order traversal of the coding tree, and thus are able to exploit all split options available. In particular, ternary splits are allowed at the top level of the coding tree and nested binary splits in the same direction are permitted from the top level down. The allowed splits are useful for achieving high

compression performance, especially for ultra high definition (UHD) content at low bitrates. In parsing each CU, the prediction mode is obtained. However, only the TUs within the same VPDU as the top-left sample of the CU are parsed immediately following the prediction mode being obtained. Any TUs in a CU that belong to a subsequent VPDU are deferred.

[000145] In Fig. 9F, deferred TUs are indicated using dashed lines. As a consequence of the order in the bitstream 9600, instead of buffering the residual coefficients of TUs that are to be used in a subsequent VPDU, coding of such TUs is delayed until the subsequent VPDU is reached. Therefore the buffering of the TUs for use in a subsequent VPDU in the decoder 420 is not needed. Instead, buffering of the size and location of the deferred TUs for each VPDU is needed (‘TU metadata’).

[000146] One example is shown as TU metadata 9610, indicating TUI, determined from parsing CU0 prediction mode but deferred until VPDUl. The TU metadata 9610 requires a relatively small amount of memory compared to buffering the residual coefficients themselves. Each TU width and height is one of 4, 8, 16, 32, or 64, the five values require a maximum of three bits. Accounting for both with width and height gives a total of six bits. Each TU may be located at any point on a 4><4 grid, so within a 64x64 VPDU, eight bits are needed to hold the TU location. Thus, the metadata of one TU can be held in two bytes.

[000147] Fig. 9G shows a conventional order of two coding units and corresponding transform units in a CTU 9700. The coding tree of the CTU 9700 includes a top level horizontal binary split that results in two 128x64 regions. The upper 128x64 is further split with a horizontal binary split resulting in two 128x32 regions. The upper 128x32 region is not further split, resulting in CU0 with two TUs, identified as CU0 (TU0) 9710 and CU0 (TUI) 9712. The lower 128x32 region is split with a vertical binary split resulting in a left 64x32 region that is not further split and occupied by CU1 (TU0) 9714, with the right 64x32 region not further considered. The CTU 9700 has VDPUs VDPUO to VDPU3. Since the top-level split of the coding tree of the CTU 9700 is a horizontal binary split, the VPDUs VPDUO-3 are ordered as top-left, top-right, bottom-left, and bottom-right. [000148] In a conventional coding order of the CTU 9700, a bitstream contains residual coefficients for TUs in the following order: CU0 (TU0) 9710, CU0 (TUI) 9712, CU1

(TU0) 9714. A decoder processing the CTU 9700 on a VPDU-basis parses CU0 (TUI) 9712 while processing VPDU0 9702 and needs to buffer the associated residual coefficients before progressing to VPDU1 9704. In the example of Fig. 9G, the WC standard only requires the upper-left 32x32 residual coefficients of the 64x32 coefficients of CU0 (TUI) 9712 to be coded. However, with coding tree structures resulting in blocks that do not exceed 32 samples in width or height, all coefficients of the resulting TUs need to be coded. CUs and associated TUs in the remainder of the CTU 9700, i .e. in VPDU2 9706 and VPDU3 9708, are also processed in the same conventional order.

[000149] Fig. 9H shows a coding order of two coding units and corresponding transform units in a CTU according to the arrangements described. In Fig. 9H, the transform units are coded in the order of VPDUs of the CTU 9800. Similarly to the CTU 9700, the CTU has VPDUs VPDU0 to VPDU3. The CTU 9800 has the same coding tree as the CTU 9700. However, in Fig. 9H the resulting TUs are coded in an order aligned to the VPDU processing order. Since the top-level split of the coding tree of the CTU 9800 is a horizontal binary split, VPDU0-3 are ordered as top-left, top-right, bottom-left, and bottom-right. When CU0 is determined as being 128x32 in size and located at (0, 0) in the CTU 9800 during processing of VPDU0 9802, CU0 (TU0) 9810 is parsed immediately whereas parsing of CU0 (TUI) 9814 is deferred. The CU1 is determined as being 64x32 in size and located at (0, 32) in the CTU 9800, after which CU1 (TU0) 9812 is parsed. The samples of VPDU0 9802 are able to be determined as the prediction modes of all CUs or portions thereof in this VPDU and all residual coefficients are available. Processing progresses from VPDU0 9802 to VPDU1 9804. The residual coefficients of CU0 (TUI) 9814 are parsed upon progression to VPDU1 9804.

[000150] In deferring parsing of the CU0 (TUI) 9814 until after CU1 (TU0) 9812, the need to simultaneously buffer the associated residual coefficients of each TU is avoided. A buffer size of one VPDU of residual coefficients would traditionally have been needed firstly for residual coefficients of VPDU0 9802, and then for residual coefficients of VPDU1 9804. Similar deferral of TU parsing is applied for VPDU2 9806 and VPDU3 9808. Deferral of TUs is possible where a CU can be partially processed and revisited later. Partial processing of CUs can be implemented when inter prediction is in use, as the required data for producing predicted samples of the CU is the reference frame and not samples from the current frame. When intra prediction is in use, samples neighbouring the current CU are used, which may not be available when partial processing takes place. One implementation allows intra prediction to be used if partial processing of CUs is not performed, such as when a top-level quadtree split occurs in the coding tree of a CTU. A top-level quadtree split divides the CTU into four 64x64 regions.

Each of the four 64x64 regions corresponds with one VPDU and each of of the 64x64 regions are fully processed before progressing from one VPDU to the next VPDU.

[000151] Worst case scenarios in terms of TU metadata occur when the coding tree leads to 4x4 size CUs in VPDUs after the present VPDU. However, due to the CU processing order in the coding tree, a subsequent VPDU cannot be fully occupied with CUs of size 4x4 before processing progresses to that VPDU.

[000152] Fig. 91 shows an example CTU 9900 with a top-level split being a binary split (horizontal direction). The horizontal binary split divides the CTU 9900 into a top region 9910, processed first, and a bottom region 9912, processed second. Each of the top region 9910 and the bottom region 9912 may be further divided into one or more CUs. TU buffering occurs from a VPDU0 (9902) to a VPDU1 (9904) and a VPDU2 (9906) to a VPDU3 (9908). TU buffering does not occur from VPDU1 (9904) to VPDU2 (9906) as the binary split partitions the CTU 9900 into two independent regions. The worst case of buffering future TUs occurs when the largest area of the CUs determined by the video decoder 134 overlaps subsequent VPDUs. When a ternary split occurs underneath and in the same direction as the parent top- level binary split with no further splits, the resulting CUs are CU0 9920, CU1 9922, and CU2 9924. TU metadata of the TUs s of CU0 9920 and CU2 9922 that occupy a dotted region 9930 is buffered to control subsequent parsing in the video decoder 134. From the area of the dotted region 9930, the saving in residual coefficient buffering is equal to three quarters the area of one VPDU.

[000153] The buffering requirement would be further increased if the region corresponding to CU2 9924 were subject to additional horizontal binary splits, resulting in additional CUs of width 128, that is spanning two VPDUs. With a maximum TU width of 64, the coding tree of the CTU 9900 results in relatively few TUs, that is, two TUs per CU as each CU has a width of 128. The worst case in terms of number of TUs to be postponed for later parsing occurs when the smallest TU size occurs in the later VPDU. Were the region corresponding to each CU subject to a vertical binary split and then the right side subject to numerous splits, such that the right side of each region was decomposed down to CUs of size 4x4, the worst case in terms of TU count would be realised. Thus, the buffering requirement in the example scenario is 4x4 TU metadata covering a 48x64 area of a subsequent VPDU, or 12x 16 or 192 TUs, or 384 bytes. With additional splits of the region corresponding to the CU2 9924, the buffering requirement approaches that of 4x44 TUs requirement to cover one VPDU, that is 256 TUs, or 512 bytes.

[000154] Another example of a severe case can be derived from the case shown in Fig. 9A. If CU1 9112, CU2 9114, and CU3 9116 are each split with a vertical binary split, and the right half result from each split is further split into CUs of size 4x4, then an area equivalent to one VPDU of TU metadata needs buffering, i.e. metadata for 256 TUs or 512 bytes. An increase in severity occurs if CU0 9110 is split using a horizontal binary split and the lower resulting half split into CUs of size 4x4, leading to an additional half a VPDU of buffered TU metadata, or 128 TUs metadata or an extra 256 bytes. Additional vertical binary splits within CU4 9116 would result in TU metadata buffering, however once CU4 is reached, the memory for buffering TU metadata associated with CU0 9110, CU1 9112, and CU2 9114 has been released, so no increase in worst-case memory requirement for metadata is seen. With 768 bytes

(512+256) of metadata for 384 TUs of size 4x4, the TUs can be arranged in a bitstream such that the residual coefficients are parsed in time for use by inverse quantisation and inverse transformation of a VPDU-based pipelined decoder architecture.

[000155] In contrast, if each TU was buffered, including consideration of the chroma channels, each TU would require storage for twenty-four (16+4+4=24) residual coefficients. Using two bytes per residual coefficient, or forty-eight (48) bytes per TU, a total of 768x48 would be required. Accordingly, the total cost would be in the region of 37Kbytes. Although a coding tree with fewer splits would result in fewer CUs and thus fewer TUs, the resulting TUs would be accordingly larger. Given the top level coding tree splits resulting in a worst case, there would be no reduction in the overall memory requirement for residual coefficient buffering due to tree structure due to further split combinations in the coding tree. When TU metadata is buffered, the amount of memory required for TU metadata does vary according to the number of TUs, however the worst case is unchanged. Overall, a memory saving of approximately 36Kbytes is achieved by reordering TUs according to VPDU order.

[000156] Fig. 10 shows a method 1000 of determining and encoding a coding tree of a CTU into the bitstream 115. In the method 1000, transform sizes are selected such that the resulting bitstream 115 may be decoded and processed in VPDU-based regions and with residual coefficients grouped in the bitstream 115 according to the designated VPDU. In the method 1000, a transform size is selected such that each transform unit can be processed in the entirely within a region defined according to a processing grid, i.e. a VPDU. Also, in the method 1000, TUs associated with a CU in a first VPDU but located within a subsequent VPDU are stored with other TUs from the subsequent VPDU in the bitstream 115.

[000157] The method 1000 may be embodied by apparatus such as a configured FPGA, an ASIC, or an ASSP. Additionally, the method 1000 may be performed by video encoder 114 under execution of the processor 205. As such, the method 1000 may be stored on computer- readable storage medium and/or in the memory 206. The frame data 113 is divided into CTU- sized regions and the method 1000 is performed on each resulting CTU-sized region.

[000158] The method 1000 commences with the processor 205 at a determine coding tree step 1005.

[000159] At the determine coding tree step 1005 the block partitioner 310, under execution of the processor 205, determines the coding tree of a CTU. The coding tree decomposes a CTU into one or more CUs 312 according to a series of splits, as described with reference to Figs. 5 & 6, and using the examples of Figs. 7A and 7B. The block partitioner 310 tests different combinations of splits in order to arrive at a particular coding tree that enables the CTU to be coded with a relatively high compression ratio, while maintaining fidelity of the decoded image, as described with reference to Fig. 3. The step 1005 thus determines a size of each coding unit (CU) by determining the coding tree. The step 1005 also determines the prediction mode of each CU of the determined coding tree, with an intra prediction mode determined when intra prediction is to be used and a motion vector determined when inter prediction is to be used. Control in the processor 205 progresses from the step 1005 to a determine VPDU ordering step 1010.

[000160] At the determine VPDU ordering step 1010 the video encoder 114, under execution of the processor 205, determines a processing order for the VPDUs in a CTU. The VPDU processing order is based on the top-level split of the coding tree of the CTU as follows:

[000161] Top-level split: No split (510), Quadtree split (512), Horizontal binary split (514), or horizontal ternary split (518) mean that the VPDU processing order is: Top-left, top-right, bottom-left, bottom-right.

[000162] Vertical Top-level split: binary split (516) or vertical horizontal split (520) mean that the VPDU processing order is: Top-left, bottom-left, top-right, bottom-right. [000163] Lists of TU metadata are initialised as empty for VPDU1-VPDU3 at step 1010. Control in the processor 205 progresses from the step 1010 to an encode coding tree step 1015.

[000164] At the encode coding tree step 1015 the video encoder 114, under execution of the processor 205, commences a Z-order recursion through the coding tree of the step 1005 in a depth-first manner, as described with reference to Fig. 6. The recursion continues until a leaf node (a coding unit) is encountered in the coding tree,. When a leaf node is reached control in the processor progresses from the step 1015 to a select coding unit step 1020. The leaf node is retained in the memory 206 so that recursion may continue from the leaf node.

[000165] At the select coding unit step 1020 the video encoder 114, under execution of the processor 205, determines the size and location of the current coding unit resulting from the step 1015. For example, CU0 of Fig. 9G is determined to have a size of 128x32 and a location of (0, 0). Control in the processor 205 progresses from the step 1020 to a VPDU boundary test step 1025.

[000166] At the VPDU boundary test step 1025 the video encoder 114, under execution of the processor 205, determines if the current coding unit delineates or overlaps a boundary between one VPDU and the next. The boundary is deemed to be the point in the bitstream 115 at which prediction modes of all blocks overlapping a given VPDU become known. By virtue of the Z- order traversal of the coding tree of the step 1015, the test performed is that the current coding unit overlaps the lowermost rightmost sample of a given VPDU. If a CU overlaps the lowermost rightmost sample of a given VPDU, all of the CUs covering the given VPDU have been decoded, at least to the extent of determining their prediction mode and associated information (intra prediction mode or motion vector). In the example of Fig. 9A, the sample 9122 delineates VPDU0 from VPDUl, the sample 9124 delineates VPDUl from VPDU2, and a sample 9126 delineates VPDU2 from VPDU3. If the current CU overlaps with one or more of the lowermost rightmost samples of the VPDUs (“Yes” at step 1025), control in the processor 205 progresses from the step 1025 to an encode buffered TUs step 1030.

Otherwise, the step 1025 returns“No” and control in the processor 205 progresses from the step 1025 to a generate TUs of CU step 1035.

[000167] At the encode buffered TUs step 1030, the entropy encoder 338, under execution of the processor 205, encodes the residual coefficients into the bitstream 115 of any TU deferred from earlier VPDU(s) to the current VPDU(s) according to the current CU overlap with VPDU boundary points, e.g. 9122, 9124, and 9126. Where the current CU overlaps more than one VPDU boundary point the TUs for each boundary are encoded into the bitstream 115. Control in the processor 205 progresses from step 1030 to the generate TUs of CU step 1035.

[000168] At the generate TUs of CU step 1035 the video encoder 114, under execution of the processor 205, determines the TUs of the current CU. Each determined TU has a size and location and is assigned to one of the VPDUs of the CTU. The TUs are determined such that the largest available transform size (generally side length up to 64 is available) is used The determination of the TUs is further constrained so that the resulting TUs do not span boundaries between adjacent VPDUs. Fig. 9B shows an example of the generated TUs for CUs of a given coding tree. The step 1035 is further described with reference to a method 1200 of Fig. 12. Control in the processor 205 progresses from step 1035 to a quantise and apply forward transform step 1040.

[000169] At the apply forward transforms and quantise step 1040 the transform module 326 and the quantiser module 334, under execution of the processor 205, apply a transform to transform the difference 324. The application produces residual coefficients 336 for each TU of the step 1035. Control in the processor 205 progresses from step 1040 to an encode TU(s) step 1045.

[000170] Control in the processor 205 progresses from step 1040 to an encode TU(s) step 1045.

[000171] At the encode TU(s) step 1045 the entropy encoder 338, under execution of the processor 205, encodes the residual coefficients for TUs contained within the current VPDU into the bitstream 115. Firstly, a‘root coded block flag’ is coded indicating the presence of at least one significant residual coefficient resulting from the quantisation of the step 1250 for any of the TUs associated with the CU. The TUs associated with the CU include those located in the current VPDU and located in subsequent VPDUs. The root coded block flag is coded once for all TUs of the CU and signals significance for any of the transforms of the CU, across all colour channels, that is for any TB of any the TU of the CU. Provided at least one significant residual coefficient is present for any transform across any colour channel of the CU, within each colour channel a separate coded block flag is coded for each transform applied in the colour channel. Each coded block flag indicates the presence of at least one significant residual coefficient in the corresponding transform block. For transforms with at least one significant residual coefficient and located in the current VPDU, a significance map and magnitudes and signs of significant coefficients are also coded. Control in the processor 205 progresses from step 1045 to an add TU(s) to reorder buffer step 1050. [000172] At the add TU(s) to reorder buffer step 1050 the video encoder 114, under execution of the processor 205, determines the TUs generated at the step 1035. Metadata for each TU that is located in a subsequent VPDU is assigned to the TU reorder buffer for the VPDU to which the TU belongs. The metadata of a TU includes the size and location of the TU. In the video encoder 114, the residual coefficients of the TU are also buffered for use at the step 1030 in another iteration of the method 1000. Control in the processor 205 progresses from step 1050 to an intra mode test step 1060.

[000173] In one arrangement of the method 1000, at the step 1045, the coded block flag for each TU in the CU, i.e. including TUs belonging to subsequent VPDUs, is coded in the bitstream 115 at the point where the residual of TBs belonging to the current VPDU is coded in the bitstream 115, i.e., adjacent to or nearby the root coded block flag of the CU. Only TBs having at least one significant residual coefficient, i.e. a coded block flag value of one, are added to the reorder buffer at the step 1050.

[000174] Accordingly, steps 1030 to 1050 can operate to detect if a CU crosses a number of processing regions and to encode TUs in an order to allow pipelined processing, and a corresponding decrease in memory for decoding, based on the detection.

[000175] At the intra mode test 1060 the prediction mode of the selected CU is tested by the processor 205. If the prediction mode is intra prediction (“Yes” at step 1060), control in the processor 205 progresses to a perform intra prediction step 1065. Otherwise the prediction mode is inter prediction or current picture referencing (“No at step 1060”), control in the processor 205 progresses to a perform motion compensation step 1070.

[000176] At the perform intra prediction step 1065 the intra-frame prediction module 364, under execution of the processor 205, generates an intra predicted block of samples (366). The intra predicted block of samples 366 is generated using filtered reference samples 362 according to an intra prediction mode for each PB of the selected CU. When multiple TUs are associated with the CU due to the step 1045, the intra reconstruction step is applied at each TU boundary internal to the selected CU. The reference sample cache 356 is updated with the reconstructed samples at each TU boundary inside the CU, in addition to the reconstructed samples at each CU boundary. Reconstruction at TU boundaries inside the CU allows the residual of TUs above or left of a current TU inside the CU to contribute to the reference samples for generating the part of the PB collocated with the current TU, reducing distortion and improving compression efficiency. Control in the processor 205 then progresses from the step 1065 to a reconstruct CU step 1075.

[000177] At the perform motion compensation step 1070 the motion compensation module 380, under execution of the processor 205, produces a filtered block samples 382. The block if samples 382 is produced by fetching one or two blocks of samples 374 from the frame buffer 372. For each block of samples, the frame is selected according to a reference picture index and the spatial displacement relative to the selected CU is specified according to a motion vector. Where two blocks are used, the resulting filtered blocks are blended together. The reference picture indices and motion vector(s) are determined in a method 1100 described in relation to Fig. 11. Where the referenced block is from the current frame, i.e. current picture referencing is in use, the block is fetched from the reference sample cache 356. Control in the processor 205 progresses from the step 1070 to the reconstruct CU step 1075.

[000178] At the reconstruct CU step 1075 the summation module 352, under execution of the processor 205, produces the reconstructed samples 354 by adding the residual samples 350 and the PU 320 for inter-predicted or intra-predicted CUs. For skip mode CUs there is no residual and so the reconstructed samples 354 are derived from the PU 320. The reconstructed samples 354 are available for reference by subsequent intra predicted CUs in the current frame and are written to the frame buffer 372, after in-loop filtering is applied (that is, application of the in loop filters 368), for reference by inter predicted CUs in subsequent frames. The deblocking filtering of the in-loop filters 368 is applied to the interior boundaries of the CU. In other words, the filtering is applied to boundaries between TUs inside the CU, resulting from tiling due both the CU size and due to pipeline processing region boundaries. Control in the processor 205 progresses from step 1075 to a last CU test step 1085.

[000179] At the last CU test step 1085 the processor 205 tests if the selected CU is the last CU in the CTU. If not (“No” at step 1085), control in the processor 205 returns to the step 1015. If the selected CU is the last one in the CTU in the CU scan order (“Yes” at step 1085), that is the depth-first Z-order scan, the method 1000 terminates. After the method 1000 terminates, either the next CTU is encoded, or the video encoder 114 progresses to the next image frame of the video.

[000180] The VPDUs represent processing regions of the CTU. Accordingly, the determination that the CU crosses a boundary at step 1025 indicates that CU overlaps more than one processing region of the CTU. Further, the determination means that transform units of the CU are located in different processing regions of the CTU, for example first and second TUs of the CU may be first and second processing regions (VPDUs) respectively. Additional TUs of the CU may also be in one of the first or second processing regions, or a TU of another coding unit may be in the first (or second) processing region. The method 1000 operates firstly to encode TUs for the first processing region, then to iterate to step 1020 to encode TUs for the second processing region after the first processing region, once the selected coding unit of the step 1020 overlaps a VPDU boundary point, for example 9122, 9124, or 9126. Accordingly, in some implementations, encoding a CU will involve more than one iteration of steps 1020 to 1085.

[000181] Fig. 11 shows the method 1100 for decoding the CUs of a CTU from the

bitstream 133. In the method 1100, transform sizes are selected such that the method 1100 may be performed using a VPDU-based pipelined architecture. Moreover, the video decoder 134 only needs to buffer decoded residual coefficients for the VPDU currently being processed in the pipeline. Based on the frame dimensions and the CTU size the number of CTUs in a frame is determined by the video decoder 134, and the method 1100 is invoked to decode each CTU from the bitstream 115 to produce each output frame (i.e. 135). The method 1100 may be embodied by apparatus such as a configured FPGA, an ASIC, or an ASSP. Additionally, the method 1100 may be performed by video decoder 134 under execution of the processor 205.

As such, the method 1100 may be stored on a computer-readable storage medium and/or in the memory 206. The method 1100 commences with the processor 205 at a decode coding tree step 1110.

[000182] At the decode coding tree step 1110 the entropy decoder 420, under execution of the processor 205, begins decoding the coding tree of a CTU from the bitstream 133. The coding tree decomposes a CTU into one or more CUs according to a series of splits, as described with reference to Figs. 5 & 6, and using the example of Figs. 7A and 7B. The coding tree decoded from the bitstream 133 is the coding tree determined at the step 1005 of Fig. 10. Step 1110 effectively determines a size of each coding unit (CU) by decoding the CTU using the coding tree. Split flags of the coding tree are decoded until a leaf node, i.e. a coding unit, is encountered. Upon reaching a leaf node the present node in the coding tree is stored for resumption of traversing the coding tree at a later point. Traversal of the coding tree accords with the description of Fig. 6. Control in the processor 205 progresses from step 1310 to a determine VPDU ordering step 1115. [000183] At the determine VPDU ordering step 1115 the video decoder 134, under execution of the processor 205, determines a processing order for the VPDUs in a CTU, in accordance with the ordering determined at the step 1010 in the method 1000. The VPDU processing order is based on the top-level split of the coding tree of the CTU as follows:

[000184] No split (510), Quadtree split (512), Horizontal binary split (514), or horizontal ternary split (518) mean a processing order of top-left, top-right, bottom-left, bottom -right.

[000185] Vertical binary split (516) or vertical horizontal split (520) mean a processing or parsing order of top-left, bottom-left, top-right, bottom-right.

[000186] Accordingly, similarly to the step 1015, the processing order is determined based upon a split of the coding tree unit.

[000187] Lists of TU metadata are initialised as empty for VPDU1-VPDU3. A list of TU metadata is a list of TUs that are to be encoded for a given VPDU, based on CUs encountered in earlier VPDUs. As VPDU0 occurs at the start of the CTU, there are no earlier VPDUs and hence no associated TU metadata. Control in the processor 205 progresses from step 1115 to a select coding unit step 1120.

[000188] At the select CU step 1120 the video decoder 134, under execution of the

processor 205, selects one CU of the decoded coding tree. The CU is selected according to an iteration through the coding tree according to the order in which syntax associated with the coding tree is present in the bitstream 134, i.e. as described with reference to Fig. 6. The selected CU has a particular size and location in the image frame, and hence a location relative to the top-left corner of the containing CTU. Thus, the selected CU may be said to occupy a given area within the containing CTU. The location of the selected CU in the coding tree is stored in the memory 206 so that iteration of the coding tree can resume from the same point on a subsequent invocation of the step 1120. Control in the processor 205 progresses from step 1320 to a VPDU boundary test step 1125.

[000189] At the VPDU boundary test step 1125 the video decoder 134, under execution of the processor 205, determines if the current coding unit delineates or overlaps a boundary between one VPDU and the next, in the VPDU processing order. For example, step 1125 may involve determining that the CU overlaps first and second VDPUs (processing regions). In the arrangements described the boundary between a current and a next VPDU is deemed to be the point in the bitstream 133 at which prediction modes of all blocks overlapping the current VPDU become known. By virtue of the Z-order traversal of the coding tree of the step 1115, the test performed at step 1125 is whether the current coding unit overlaps the lowermost rightmost sample of a given VPDU. In the example of Fig. 9A, the sample 9122 delineates VPDUO (current) from VPDU1 (next), the sample 9124 delineates VPDU1 (current) from VPDU2 (next), and the sample 9126 delineates VPDU2 (current) from VPDU3 (next). If the current CU overlaps with one or more of the lowermost rightmost samples of the VPDUs (“Yes” at step 1125), control in the processor 205 progresses from the step 1025 to a decode buffered TUs step 1130. Otherwise (“No” at step 1125) control in the processor 205 progresses from the step 1125 to a generate TUs of CU step 1135.

[000190] At the decode buffered TUs step 1130, the entropy decoder 420, under execution of the processor 205, decodes the residual coefficients from the bitstream 133 of each deferred TU as indicated in the TU metadata for the current VPDU. The entropy decoder 420 parses TUs that were deferred from earlier VPDU(s) to the now-current VPDU(s) according to the current CU overlap with VPDU boundary points, e.g. 9122, 9124, and 9126. Where the current CU overlaps more than one VPDU boundary point the TUs for each boundary are encoded into the bitstream 115. Accordingly, step 1130 relate to decoding TUs that are located within a different processing region of the current TU, being a previously processed region. Control in the processor 205 progresses from step 1130 to a generate TUs of CU step 1135.

[000191] At the generate TUs of CU step 1135 the video decoder 134, under execution of the processor 205, determines the TUs of the current CU. Each determined TU has a size and location and is assigned to one of the VPDUs of the CTU. The TUs are determined such that the largest available transform size, independently in width and height, (generally side length up to 64 is available) is used. Determination of the TUs is further constrained in that the resulting TUs do not span boundaries between adjacent VPDUs. Fig. 9B provides an example of the generated TUs for CUs of a given coding tree. The step 1035 is further described with reference to the method 1200 of Fig. 12. Control in the processor 205 progresses from step 1135 to an inverse quantise and apply inverse transforms step 1140.

[000192] At the inverse quantise and apply inverse transforms step 1140 the dequantiser module 428 and the inverse transform module 444, under execution of the processor 205, inverse quantise residual coefficients to produce scaled transform coefficients 440. At step 1140 the selected transforms of either or both the step 1135 or the step 1130 are applied to transform the scaled transform coefficients 440 to produce residual samples 448. As with the step 1040, application of the transform is performed in a tiled manner according to the determined transform size. Moreover, by virtue of the transform size selected at the step 1135, individual transforms do not cover regions that span across two or more VPDUs. As with the method 1000, practical implementations, particularly hardware implementations utilising a pipeline architecture but also some software implementations, benefit from transforms being contained entirely within distinct VPDUs. An example software implementation that benefits for the arrangements described is a multi-core implementation that may use the same pipeline architecture for improved data locality. Control in the processor 205 progresses from step 1140 to a decode TU(s) step 1 145

[000193] At the decode TU(s) step 1145 the entropy decoder 420, under execution of the processor 205, decodes the residual coefficients for TUs contained within the current VPDU into the bitstream 133. Firstly, a‘root coded block flag’ is decoded. The root coded block flag indicates the presence of at least one significant residual coefficient resulting from the quantisation of the step 1040 for any of the TUs associated with the CU, i .e. the TUs located in the current VPDU and located in subsequent VPDUs. The root coded block flag is coded once for the CU and signals significance for any of the transforms of the CU, across all colour channels, that is for any TB of any the TU of the CU Provided at least one significant residual coefficient is present for any transform across any colour channel of the CU, within each colour channel a separate coded block flag is coded for each transform applied in the colour channel. Each coded block flag indicates the presence of at least one significant residual coefficient in the corresponding transform block. For transforms with at least one significant residual coefficient, a significance map and magnitudes and signs of significant coefficients are also decoded. Control in the processor 205 progresses from step 1145 to an add TU(s) to reorder buffer step 1150.

[000194] At the add TU(s) to reorder buffer step 1150 the video decoder 134, under execution of the processor 205, determines the TUs generated at the step 1135. Metadata for each of the TUs that is located in a subsequent VPDU is assigned to the TU reorder buffer for the VPDU to which it belongs. The metadata of a TU includes the size and location of the TU. The coefficients of the TUs located in a subsequent VPDU are not stored in the buffer. The metadata can be used at step 1130 in a further iteration of the method 1100. Accordingly, decoding TUs in iterations of the method 1100 can include generating metadata for a TU in a different processing region (VDPU). Control in the processor 205 progresses from step 1150 to an intra mode test step 1160. [000195] In one arrangement of the method 1100, at the step 1145, the coded block flag for each TU in the CU, i.e. including those belonging to subsequent VPDUs, is decoded from the bitstream 133 at the point where the residual of TBs belonging to the current VPDU is coded in the bitstream 133, i.e., adjacent to or nearby to the root coded block flag of the CU. Only TBs having at least one significant residual coefficient, i.e. having a coded block flag value of one, are added to the reorder buffer at the step 1150. When the video decoder 134 receives only zero-valued coded block flags for TBs belonging to subsequent VPDUs, processing of the entire CU may commence earlier, as there is no need to decode any residual coefficients to complete decoding of the CU. Moreover, when the CU is inter predicted and coded using a ‘skip mode’, the root coded block flag is known to be zero, and hence the coded block flags of each TU associated with the CU are also known to be zero, without any need to decode such coded block flags from the bitstream 133.

[000196] At the intra mode test 1160 the determined prediction mode of the selected CU is tested by the processor 205. If the prediction mode is intra prediction (“Yes” at step 1160), control in the processor 205 progresses to a perform intra prediction step 1 165. Otherwise the prediction mode is inter prediction (“No” at step 1160), control in the processor 205 progresses to a decode motion parameters step 1370.

[000197] At the perform intra prediction step 1165 the intra-frame prediction module 476, under execution of the processor 205, generates an intra predicted block of samples (480) using fdtered reference samples 472 according to an intra prediction mode for each PB of the selected CU. When multiple TUs are associated with the CU due to the step 1135, the intra

reconstruction process is applied at each TU boundary internal to the selected CU. The reconstructed sample cache 460 is updated with the reconstructed samples at each TU boundary inside the CU, in addition to the reconstructed samples at each CU boundary. Reconstruction at TU boundaries inside the CU allows the residual of TUs above or left of a current TU inside the CU to contribute to the reference samples for generating the part of the PB collocated with the current TU, reducing distortion and improving compression efficiency. As a consequence of confining intra predicted coding units to be within a VPDU and to avoid depending on samples of subsequent VPDUs, each intra predicted CU may be processed in the CU’s entirety using reference samples obtained from the current or previous VPDUs. Control in the processor 205 then progresses from the step 1165 to a reconstruct partial CU step 1180.

[000198] At the determine motion parameters step 1170 the entropy decoder 420, under execution of the processor 205, determines the motion vector(s) for the selected CU and for the CUs corresponding to the decoded buffered TUs of the step 1130. A list of candidate motion vectors is created (a‘merge list’) using spatially and temporally neighbouring blocks. A merge index is decoded from the bitstream 133 to select one of the candidates from the merge list. When the selected CU was coded using skip mode, the skip flag becomes the motion vector for the CU. When the selected CU was coded using inter prediction, a motion vector delta is decoded from the bitstream 133 and added to the candidate that was selected according to the decoded merge index. Control in the processor then progresses from the step 1170 to a perform motion compensation step 1175.

[000199] At the perform motion compensation step 1175 the motion compensation module 434, under execution of the processor 205, produces a filtered block samples 438, by fetching one or two blocks of samples 498 from the frame buffer 496. For each block of samples, the frame is selected according to a reference picture index and the spatial displacement relative to the selected CU is specified according to a motion vector. Where two blocks are used, the resulting filtered blocks are blended together. The reference picture indices and motion vector(s) are decoded from the bitstream 115 and were determined in the method 1100. Control in the processor 205 progresses from the step 1175 to the reconstruct CU step 1180.

[000200] The frame buffer outputs frame data 135, as generated using the prediction block and decoded residual samples of the coding units of the bitstream 133.

[000201] At the reconstruct CU step 1180 the summation module 352, under execution of the processor 205, produces the reconstructed samples 354 by adding the residual samples 350 and the PU 320 for inter-predicted or intra-predicted CUs. For skip mode CUs there is no residual and so the reconstructed samples 354 are derived from the PU 320. The reconstructed samples are available for reference by subsequent intra predicted CUs in the current frame and are written to the frame buffer 372, after in-loop filtering is applied (that is, application of the in loop filters 368), for reference by inter predicted CUs in subsequent frames. The deblocking filtering of the in-loop filters 368 is applied to the interior boundaries of the CU, that is, the boundaries between TUs inside the CU, resulting from tiling due both the CU size and due to VPDU boundaries. Control in the processor 205 progresses from step 1180 to a last CU test step 1185.

[000202] At the last CU test step 1185 the processor 205 tests if the selected CU is the last CU in the CTU in the CU scan order, being a depth-first Z-order scan. If not (“No” at step 1185), control in the processor 205 returns to the step 1110. If the selected CU is the last CU in the CTU (“Yes” at step 1185) the method 1100 terminates. After the method 1100 terminates, either the next CTU is decoded, or the video decoder 134 progresses to the next image frame of the bitstream.

[000203] The method 1100 is implemented for each CU in a CTU. Similarly to the method 1000, decoding a full CU can involve a number of iterations of the method 1100. In one example, first and second TUs corresponding to first and second CUs respectively and located in a first processing area are decoded. After the first and second TUs are decoded, a further TU of the first CU located in a second processing region (VPDU) is decoded. Accordingly, a number of iterations of the method 1100 may be required to decode a CU, dependent upon the location of TUs of the CU and the location and order of the processing regions (VPDUs).

[000204] Referring back to Fig. 10, in one arrangement of the method 1000, the step 1005 operates such that intra prediction is only used when there will be no instances of TUs being deferred to later VPDUs. Avoiding deferral of TUs to later VPDUs requires that all CUs in a current VPDU can be processed fully before any CU in the next VPDU is processed. TUs are not deferred when the top level split of the coding tree is a quadtree split, or when the CU is within a top-level binary split in one direction, followed by a binary split in the opposing direction, resulting in all CUs underneath the binary split in the opposing direction belonging to one of two VPDUs. Intra prediction depends on samples of neighbouring CUs, confining the block progression from one VPDU to the next, with no need to revisit earlier partially processed VPDUs. Accordingly, processing to operate on VPDU-sized regions is enabled for intra prediction. Thus, intra prediction modes are only searched for candidate CUs underneath a top- level quadtree split, or underneath a top-level binary split and a second binary split in the opposing direction, that is when a monotonic VPDU progression requirement is in effect.

[000205] In another arrangement of the methods 1000 and 1100, the steps 1015 and 1110 are modified such that intra prediction is only an available value of the‘pred_mode’ syntax element for a CU underneath a top-level quadtree split, or underneath a top-level binary split and a second binary split in the opposing direction. As a consequence, the bitstream 133 does not need to contain the overhead of allowing signalling of intra prediction for CUs where intra prediction is prohibited due to a monotonic VPDU progression requirement.

[000206] In yet another arrangement of the method 1000, the step 1005 operates such that searching regions underneath a top-level ternary split is restricted to a maximum of one additional split (the“restricted coding tree search”) the restriction to a maximum of one additional split means that each region can be either one CU or can be split only once more according to the splits shown in Fig. 5, into a set of CUs. Restricting the split depth underneath a top-level ternary split restricts the search space available to the video encoder 114 and hence reduces the time required to encode each CTU. In particular, the coding loss resulting from the restricted coding tree search is proportionately less than the encoder runtime reduction resulting from the restricted coding tree search.

[000207] In general, the reduction in runtime due to the restricted coding tree search is beneficial for non-realtime encoders, as realtime encoders are likely to already implement a variety of search-space reduction optimisations. Moreover, the entropy encoder 338 and the entropy decoder 420 may binarise the coding tree split (using an“mtt_type” syntax element) such that after one additional split underneath a top-level ternary split, the MT split 612 is inferred as‘O’, i.e. do not split. The binarized coding tree split results in generation of leaf nodes or CUs (i.e. instances of 622) in the subregions of each split underneath the top-level ternary split. Inference of a‘do not split’ option in encoding and decoding a bitstream results in a bitstream syntax that can only express available split options when the restricted coding tree search is in effect, thereby improving compression performance of the system 100.

[000208] Fig. 12 is a flow chart diagram of the method 1200 for generating a list of transform units for a coding unit, each transform units being associated with one VPDU of the CTU. The method 1200 may be embodied by apparatus such as a configured FPGA, an ASIC, or an ASSP Additionally, the method 1200 may be performed by video decoder 134 at step 1035 or the video encoder 114 at step 1135 under execution of the processor 205. As such, the

method 1200 may be stored on a computer-readable storage medium and/or in the memory 206. The method 1200 commences with the processor 205 at a determine coding unit area step 1210.

[000209] At the determine coding unit area step 1210, the video encoder 114 or the video decoder 134, under execution of the processor 205, determines the area occupied by a selected CU in the current CTU. Each node of a coding tree is said to occupy an area within a CTU, with a recursive subdivision of the areas occurs as the coding tree is decomposed, until leaf nodes are reached. Leaf nodes which correspond to coding units and each have a particular non overlapping area in the CTU. The decomposition of the coding tree into coding units is described with reference to Fig. 6 and an example shown in Figs. 7A and 7B. The area of a given coding unit may be described in terms of the Cartesian location of the top-left luma sample within the CU and the width and height of the CU in luma samples. Control in the processor 205 progresses from step 1210 to a VPDU boundary overlap test step 1220. [000210] At the VPDU boundary overlap test step 1220 the video encoder 114 or the video decoder 134, under execution of the processor 205, tests if the coding unit overlaps a VPDU boundary. VPDU boundaries in a 64x64 pipeline architecture are defined along a 64x64 grid, aligned to the top-left of the frame. Other boundary grids are also possible, such as 32x32, provided the boundary grid is smaller than the CTU size, which is typically 128 x 128. Were the boundary grid to be equal (or larger) in size than the CTU size, then there would be no need to reorder TUs as all the TUs inside a CTU could be processed at the granularity of the pipelined architecture.

[000211] If a CU spans two or more VPDUs, smaller TUs are used, such that each TU is contained within one VPDU for a pipeline implementation to process the CU. In such a case, the CU spanning multiple VPDUs is processed partially, i.e. the partial or whole CUs occupying a first VPDU are processed followed by further partial or while CUs occupying a subsequent VPDU, and so on until all VPDUs of the CTU have been processed. A CU is said to span the VPDU boundary if either the horizontal boundary between adjacent VPDUs or the vertical boundary between adjacent VPDUs is spanned. If the CU top-left luma sample, modulus the VPDU height, plus the CU height exceeds the VPDU height, the boundary between vertically adjacent VPDUs is spanned by the CU. If the CU top-left luma sample, modulus the VPDU width, plus the CU width exceeds the VPDU width, the boundary between horizontally adjacent VPDUs is spanned by the CU. the method for determining whether the CU spans a VDPU is expressed as pseudocode in Equation (1)

VPDU span = CUx % VPDUw + CU_W > VPDUw ? 1 : 0 OR CU_Y % VPDUH + CU_H >

VPDUH ? 1 : 0 (1)

[000212] In Equation (1)‘%’ represents the modulo operator. In Equation (1) a VPDU_span of ‘ 1’ indicates the CU spans a horizontal or vertical boundary and‘0’ indicates no spanning. VPDUW is the width of a VPDU, VPDUH is the height of a VPDU, CUx is the X location of the top-left luma sample of the CU in the frame, CU_Y is the Y location of the top-left luma sample of the CU in the frame, CUw is the width of the CU in luma samples and CU_H is the height of the CU in luma samples.

[000213] If the CU spans a horizontal or vertical VPDU boundary (“Yes” at step 1220) control in the processor 205 progresses from step 1220 to a VPDU-based transform size step 1240. Otherwise (“No” at step 1120) control in the processor 205 progress to a coding unit transform size 1230. [000214] At the coding unit transform size step 1230 the video encoder 114 or the video decoder 134, under execution of the processor 205, determines a transform size for the CU. As the CU is contained within one VPDU, if the VPDU size is 64x64 (also the size of the largest available transform), the transform size is set equal to the CU size. Control in the

processor 205 progresses from step 1230 to a generate TU list step 1250.

[000215] At the VPDU-based transform size step 1240 the video encoder 114 or the video decoder 134, under execution of the processor 205, determines a transform size for the CU.

The transform size is selected such that each resulting TU is contained within one VPDU. Each TU being contained within one VPDU is achieved by ensuring the boundary between TUs is aligned to the VPDU boundary. The transform size is determined according to the pseudocode of Equations (2) and (3)

TUw = VPDUw - CUx % VPDUw (2)

TUH = VPDUH - CUY % VPDUH (3)

[000216] In Equations (2) and (3)‘%’ represents the modulo operator.

[000217] Control in the processor 205 progresses from step 1240 to the generate TU list step 1250. At the generate TU list step 1250 the video encoder 114 or the video decoder 134, under execution of the processor 205, generates a list of TUs for the CU. All TUs have the same size, as set by the step 1230 or the step 1240, and the TUs are tiled to occupy the entirety of the CU. Moreover, each TU is assigned to the VPDU in which the TU is contained. The method 1200 terminates upon completion of step 1250.

[000218] The system 100, in utilising the methods 1000 and 1100, performs buffering of residual coefficients in the video encoder 114. The residual coefficients are buffered in the encoder 114 such that the residual coefficients are stored in a bitstream in the order in which the residual coefficients will be used in the video decoder 134 when processing each CTU as a division into multiple, for example four, VPDUs. Decoding each frame in units of VPDUs rather than in units of CTUs at each module of Fig. 4 allows, typically, a memory buffering reduction of 75%. Moreover, as a consequence of reordering the TUs in a bitstream, the video decoder 134 does not need to buffer residual coefficients in the entropy decoder 420 for use in subsequent modules that operate using VPDU-sized processing granularity. [000219] Fig. 13A shows a collection 13000 of reference areas in a current picture referencing according to a first ordering of VPDUs. Fig. 13B shows a collection 13100 of reference areas in a current picture referencing according to a second ordering of VPDUs. In the context of the present disclosure, a reference area relates to samples of previous CUs available for current picture referencing. Figs. 13A and 13B (and similarly 13C-13E) show frames comprising two- dimensional arrays of CTUs. Using CPR, each coding unit references coding units previously decoded from the bitstream.

[000220] Figs. 13A shows reference areas when a current CTU 13012 has a horizontal split or a quadtree split at the top level of the coding tree of a current CTU 13012. Fig. 13B shows reference areas when a current CTU 13112 has a vertical split at the top level of the coding tree of the current CTU 13112. Four cases are shown in each of Figs. 13A and 13B, corresponding to a division of the respective CTUs, i.e. 13012 and 13112, into four VPDUs of size 64x64. Each of Figs. 13 A and 13B includes VDPUs VDPUO to VDPU3.

[000221] Coding units using current picture referencing are permitted to reference blocks of samples from a previous CTU, i.e. 13010 (VDPUO 13002 of Fig. 13A) or 13110 (VDPU 13102 of Fig. 13B) in the coding order of CTUs in the frame data 113. The previous CTU, i.e. 13010 or 13110, is either located adjacently and to the left of the current CTU, i.e. 13012 (Fig. 13A) or 13112 (Fig. 13B), or is located at the rightmost location of the previous row of CTUs in the two-dimensional grid of CTUs that forms the frame data 113.

[000222] In the example of Fig. 13 A, the top-level split in the coding tree of the current CTU 13012 is a horizontal split, either binary or ternary, or a quadtree split. The VPDUs are ordered as VPDU0 13002, in the top-left quadrant of the current CTU 13012, VPDUl 13004, in the top-right quadrant of the current CTU 13012, VPDU2 13006, in the lower-left quadrant of the current CTU 13012, and VPDU3 13008, in the lower-right quadrant of the current

CTU 13012. In the example of Fig. 13B, the top-level split in the coding tree of the current CTU 13112 is a vertical split, either binary or ternary. The VPDUs are ordered as

VPDUO 13102, in the top-left quadrant of the current CTU 13112, VPDU 1 13104, in the lower- left quadrant of the current CTU 13112, VPDU2 13106, in the top-right quadrant of the current CTU 13112, and VPDU3 13108, in the lower-right quadrant of the current CTU 13112.

[000223] The video encoder 114 and the video decoder 134 may use a sample buffer size of one CTU, divided into VPDU-sized sections for the scenarios of Figs. 13A and 13B. When processing the current CTU commences, the sample buffer corresponding to VPDUO is used for storing samples of the current CTU and the sample buffer corresponding to VPDUsl-3 are available for use in CPR. As the VPDU processing progresses according to the processing order, corresponding portions of the sample buffer are no longer used for holding samples of the previous CTU and are instead used for holding samples of the current CTU. In Figs. 13 A and 13B, VPDU-sized regions of the previous and current CTU that are no longer available for use by CPR, given the top level split of the coding tree and the current VPDU being processed, are marked with an‘X’. VPDU-sized regions that are available for use by CPR are filled with a dot pattern. As shown in Figs. 13A and 13B, the total storage requirement is always three previous VPDUs in the VPDU processing order and the VPDU currently being processed, resulting in a memory requirement of one CTU of samples.

[000224] Examples involving frame edges, frames divided into tiles of CTUs, and partially coded CTUs are described with reference to Figs. 13C to 13E.

[000225] Fig. 13C shows reference areas in a frame 13200 of CTUs. In the frame 13200 the CTUs are grouped into four tiles, tileO to tile3, with each tile allowed to have partial CTUs only along the leftmost column of the leftmost tiles and the lowermost row of the lowermost tiles. The tiles are marked with thicker boundaries and the CTUs are numbered 0 to 27. Tile 0 (13235) contains CTUs 0-5, tile 1 contains CTUs 6-13, tile 2 contains CTUs 14-19, and tile 3 contains CTUs 20-27.

[000226] A coding unit 13210 in CTU #3 references VPDUs from a previous CTU 2. CTU 3 is the leftmost CTU in the second row of CTUs in the frame 13200 and CTU #2 is the rightmost CTU in the first row of CTUs in the frame 13200. Irrespective of the placement of CTUs 2 and 3 in the frame 13200, a block vector 13215 references a block 13220 of samples in the previous CTU #2 by referencing an area 13230 adjacent to and left of the current CTU (CTU #3). The vector 13215 references the block 13220, even though the location of the referenced CTU, i.e. CTU #2, is spatially further away from CTU 3 than the area 13230. The area 13230 provides a region which may be referenced by CUs in CTU 3 that appears to provide a duplicate (or ‘shadow’) of the samples contained in CTU #2, without incurring additional storage. CTU #2 is addressed by adding a second block vector 13240. The second block vector 13240 has a Y component equal to negative the CTU height of a CU 13225 in CTU 2, i.e. -128 luma samples and an X component equal to the tile or frame width in luma samples to the CU 13225. The second block vector is not included in encoding, decoding or cost evaluation of the block vector 13215. [000227] The availability of samples within the area 13230 and CTU #3 accords with the VPDU-based availability as described with reference to Figs. 13A and 13B. The use of the area 13230 as a referenceable‘shadow’ of CTU #2 ensures that block vectors of CPR-coded CUs in the CTU #3 have comparable costs to block vectors of other CTUs, such as a CU in CTU #4 referencing samples from CTU #3. The property of similar cost of block vectors in coding units avoids introducing a bias away from using CPR in coding units in CTUs aligned to the left of a frame or tile, such as CTU #3. The similar cost is achieved regardless of the containing CTU’s placement with respect to frame or tile edges. Bias is avoided because a direct reference from CTU #3 to CTU #2 would require a larger block vector. Larger block vectors typically incur a higher coding cost and are thus less likely to be selected compared to other prediction modes under a Lagrangian rate-distortion optimisation. The absence of a bias in using CPR for particular blocks may also provide a subjective benefit as the bias would otherwise influence the selection of prediction modes in CUs in a fixed set of CTUs for a frame. Bias away from selecting CPR in the fixed set of CTUs may result in use of prediction modes incurring higher distortion than would be present had CPR been used. Compression artefacts resulting from the bias would be present in the fixed set of CTUs only.

[000228] Implementing the approach of referencing the area 13230 requires detecting that a CTU occupies a boundary area and, if so, for each block vector, determining a suitable offset vector to locate the previously coded CTU (reference CTU) adjacent to and left of the current CTU. CTUs along the left edge of a frame or tile occupy a boundary area. The offset vector is determined by subtracting the CTU height from the Y component of the block vector and adding the tile or frame width to the X component of the block vector. For the frame 13200, CTUs using the shadow mechanism to reference samples from the previous CTU are CTUs #3, #17, #10, and #24. The CTUs of Fig. 13C numbered #0, #6, #14, and #20 are each one of tiles 0-3, respectively. Earlier CTUs, as numbered in Fig. 13C, need to be considered as unavailable for use in CPR to enable independent and parallel decoding of each tile.

[000229] Fig. 13D shows reference areas in a frame 13300 of CTUs. The CTUs are grouped into tiles. In the example of Fig. 13D, each tile is allowed to have partial CTUs only along the rightmost column of each tile and the lowermost row of each tile. The frame 13300 has four tiles, with tile 0 containing CTUs 0-7, tile 1 containing CTUs 8-15, tile 2 containing CTUs lb- 23, and tile 3 containing CTUs 24-31. The tiles of the frame 13300 are configured such that the left column of tiles, i.e. tiles 0 and 2, have a width that is a non-integer number of CTUs. Tiles 0 and 2 have a width of three and a half (3 ½) CTUs. Use of a width at a finer granularity than integer multiples of CTU width enables finer placement of the tile boundary. Supporting widths at a granularity of 64 luma samples, i.e. half (½) of a 128-sample wide CTU, means that tile boundaries are aligned to VPDU boundaries. Accordingly, the finer granularity of tile width compared to CTU width and height is achieved without introducing processing at a finer granularity than the VPDU size. The rightmost column of CTUs in the rightmost column of tiles, i.e. CTUs #11 and #15 in tile 1 and CTUs #27 and #31 in tile 3 of the frame 13300, have a width constrained by the overall width of the frame 13300, and consequently need not be divisible into an integer number of VPDUs. The bottom row of CTUs in the lower two tiles, i.e. CTUs #20-23 of tile 2 and CTUs #28-31 of tile 3 are split according to the frame height into areas that are not in integer number of (square) VPDUs.

[000230] In the example of Fig. 13D, a coding unit 13310 in CTU #4 uses CPR to access a reference block 13320 in an area 13330. The area 13330‘shadows’ CTU #3, that is, a block vector 13315 selects the reference block 13320 relative to the coding unit 13310 in the area 13330. However the samples are fetched from the reference block 13325 in CTU #3, addressed by adding a second block vector 13340. The second block vector 13340 has a Y component equal to negative the CTU height, i.e. -128 luma samples in the example of Fig.

13D and an X component equal to the tile or frame width in luma samples. The second block vector is not included in encoding, decoding or cost evaluation of the block vector 13315.

[000231] In the example of Fig. 13D, the CTUs #20-23 and #28-31 each have an implicit horizontal binary split, indicated by a boundary line 13360. The horizontal binary split is implicit because the split is the result of the height of CTUs #20-23 and #28-31 being less than the full height of a CTU, i.e. 128 luma samples. The area beneath the boundary line 13360 (for example 13370 of CTU 21) does not map into square-shaped VPDUs and thus requires special handling by a pipelined processing architecture. Although a pipelined architecture may employ non-square VPDUs, for example not exceeding the area of a 64*64 VPDU, non-square VPDUs are difficult to address using a block vector, as the non-square region has a different‘stride’ compared to square regions. A‘stride’ is a value to used to convert an (x, y) sample co ordinate to a memory address. A‘stride’ is generally of the form: address = x + (y * stride). Should access to both square and non-square VPDUs be supported by current picture referencing, additional addressing logic is needed to accommodate the different stride of the square regions being accessed compared to non-square regions. In particular, the area 13370 is not available for reference by CPR-coded CUs. The regions above the boundary line 13360 in CTUs #20-23 and #28-31 are able to reference samples from previous VPDUs for CUs using CPR. For example, a CU 13350 in VPDU0 of CTU 22 is shown referencing a block in VPDU1 of CTU #21. [000232] Fig. 13E shows a frame 13400 of CTUs with the CTUs grouped into tiles, with tiles allowed to have partial CTUs along the leftmost and the rightmost column and topmost row of each tile. The frame 13400 is shown in two separate components, 13490 and 13495, for ease of reference. The frame 13400 includes four tiles, with tile 0 containing CTU #0-7, tile 1 containing CTU #8-16, tile 2 containing CTU #16-23, and tile 3 containing CTU #24-31. A standard CTU (e.g. CTU 1 is 128x128 samples in size. The tile configuration of the

frame 13400 is such that tiles (other than those along the right edge or bottom edge of the frame 13400) are permitted to have width or height as a multiple of VPDUs, for example 64 luma samples rather than a multiple of CTU size, i.e. 128 luma samples. Where a tile, other than tiles along the right or bottom edge of the frame 13400, has a non-integer width or height in CTUs, the adjacent tile, i.e. the tile to the right or below the current tile, has a leftmost CTU column or topmost CTU row truncated to one VPDU in width or height. This constraint results in fixed CTU boundaries regardless of additional boundaries introduced by the chosen tile widths or heights. Tiles 0 and 2 have a width of 448 luma samples, or three and a half (3 ½) CTU width. Tiles 1 and 3 have a width of 416 luma samples, resulting in three and a quarter (3.25 = ½ + 2 + ¾) CTU widths in respective columns of CTUs within the tile.

[000233] The first CTU column of tiles 1 and 3 of Fig. 13E are one VPDU width or 64 samples in width due to the rightmost column of tiles 0 and 2 being at a one VPDU offset from the 128x 128-sized CTU grid. The rightmost CTU column of tiles 1 and 3 is 96 samples in width, due to the overall tile width of 416 luma samples. A width of 96 luma samples results in a coding trees for the corresponding CTUs (i.e. CTU #11, #15, #27, and #31) with one implicit vertical binary split. The implicit vertical binary splits divides the CTUs into a 64x 128 section, for which VPDU-based processing is applied and the availability of reference samples for CPR is constrained by VPDU-based invalidation as the subsequent CTU is processed, i.e. CTUs #12, #24, and #28. The implicit vertical binary split also results in a 32x 128 region in the CTUs #11, #15, #27, and #31, for example a region 13470. Regions such as the region 13470 are treated as a special case, for which division into square-shaped VPDUs no longer applies. Although the region 13470 has an area of 32x 128=16384 samples and is thus equal in area to a 64x64 VPDU, the different width of the region 13470 compared to the remainder of the CTU #11 complicates addressing, due to use of a different stride, i.e. 32, in the region 13470 compared to the remainder of the CTU.

[000234] To avoid the complexity of addressing samples in the region 13470, the second block vector 13440 has an X component that excludes access to the region 13470 by being set equal to the width of the tile rounded down to the nearest multiple of an integer number of VPDU width, i.e. multiple of 64. Tile 1 and 3, having a width of 416 luma samples, have the X component of the second block vector 13440 set as 6x64=384 luma samples.

[000235] The two columns of tiles, i.e. tiles 0 & 2 in 13490 and tiles 1 & 3 in 13495, are shown separated in Fig. 13E to facilitate illustration of an area 13430, corresponding to CTU #11 and referenceable by CPR-coded CUs in CTU #12, for example a coding unit 13410. For example, a CPR-coded coding unit 13410 addresses a reference block 13420 in the area 13430 according to a block vector 13415, with the underlying block 13425 accessed to provide reference samples, via addition of a second block vector 13440. As shown in Fig. 13E the area 13430 comprises a shadow of a portion of the previous CTU 11 based on VDPU size. The second block vector 13440 has a Y component equal to the CTU height of 128 samples and an X component equal to the number of contained VPDUs.

[000236] A case exists where a block vector references a block spanning the current and previous CTU, i.e. at the frame or tile edge in the case of boundary area CTUs. In the case, edges are likely to be discontinuous due to the placement of the leftmost edge of the current CTU and the rightmost edge of the previous CTU. For inter prediction, a sample extension process is applied where references to samples outside the frame are filled with samples resulting from a sample extension process. The sample extension process uses samples at the edge of the frame to provide values for samples reference outside of the frame area. The sample extension process fulfils sample accesses outside of the frame by using the nearest edge sample from the frame to provide the sample value. The reference sample extension process may be used for CPR where the block vector results in a reference block that includes at least one sample from the current CTU. The result of using the reference sample extension process for CPR is a behaviour analogous to that of inter prediction. When the block vector reference results in no access to the current CTU, remaining allowed references according to the defined reference area are to the previous CTU.

[000237] Fig. 14 shows a method 1400 for encoding a coding unit using current picture referencing to a CTU in the above row of CTUs. The method 1400 may be embodied by apparatus such as a configured FPGA, an ASIC, or an ASSP. Additionally, the method 1400 may be performed by video encoder 114 under execution of the processor 205. As such, the method 1400 may be stored on a computer-readable storage medium and/or in the memory 206. The method 1400 is performed to evaluate use of CPR for each CU in the coding tree of a CTU. A‘block vector is produced by operation of the method 1400 intended to improve compression performance. However other prediction modes, such as intra prediction or inter prediction, may be chosen instead should the other methods provide improved performance, as measured using a Lagrangian rate-distortion measurement. The method 1400 commences at a generate candidate block vectors step 1410.

[000238] At the generate candidate block vectors step 1410 the video encoder 114, under execution of the processor 205, generates candidate block vectors for evaluation of the CPR mode for prediction for a current CU. Candidate block vectors may only result in access of samples in allowed reference regions, for example as shown in Figs. 13A and 13B. Control in the processor 205 progresses from step 1410 to an iterate candidate block vector step 1415.

[000239] At the iterate candidate block vector step 1415 the video encoder 114, under execution of the processor 205, selects one of the candidate block vectors of the step 1410. On subsequent performances of the step 1415 a different block vector is selected, resulting in an iteration over the candidate block vectors. Control in the processor 205 progresses from step 1415 to a neighbour CTU location test step 1420.

[000240] At the neighbour CTU location test step 1420 the video encoder 114, under execution of the processor 205, determines if the current CTU is a boundary CTU, i.e. a CTU aligned to the left edge of a frame or tile. If the current CTU is a boundary CTU (“Yes” at step 1420), control in the processor 205 progresses to an add second block vector step 1430. Otherwise, if the current CTU is not a boundary CTU (“No” at step 1420) control in the processor 205 progresses to a fetch reference block step 1440.

[000241] At the add second block vector step 1430 the video encoder 114, under execution of the processor 205, determines an offset block vector for reference samples from at least one previously decoded block in a previously decoded CTU and adds the offset block vector to the candidate block vector to produce an access block vector such that the resultant“shadow’ block is adjacent to and left of the current CTU. The offset block vector, e.g. 13440, allows referencing the previous CTU when the previous CTU is in the above row of CTUs compared to the row of the current CTU, using a candidate block vector that addresses the previous CTU as if the previous CTU were to the left of the current CTU and adjacent the current CTU. The offset block vector has a Y component equal to negative of the CTU height, i.e. -128 luma samples, and an X component equal to the frame or tile width. Alternatively, the offset block vector can relate to absolute distances between the previous CTU and the current CTU or another mathematical method of locating the shadow block adjacent and left of the current block. Control in the processor 205 progresses from step 1430 to a fetch reference block step 1440.

[000242] At the fetch reference block step 1440 the video encoder 114, under execution of the processor 205, fetches a block of samples from the reference sample cache 356 according to the candidate block vector (for non-leftmost CTUs of a tile or frame) or the access block vector (for leftmost CTUs of a tile or frame). The fetched reference samples result in a candidate reference block. Control in the processor 205 progresses from step 1440 to a form coding unit step 1450.

[000243] At the form coding unit step 1450 the video encoder 114, under execution of the processor 205, determines a residual for the fetched reference samples. The residual is determined by applying quantisation, e.g. using the quantiser 334, and may involve either application or skipping application of a forward transform, e.g. using the forward transform 326. Control in the processor 205 progresses from step 1450 to a block vector evaluation test step 1460.

[000244] At the block vector evaluation test step 1460 the video encoder 114, under execution of the processor 205, evaluates the candidate block vector. Evaluation of a candidate block vector is performed by producing a rate-distortion measurement according to the coding cost of the candidate block vector, the coding cost of the associated residual and the resulting distortion. The resulting distortion relates to the difference of fetched reference block and inverse quantised and transformed residual to samples of the original frame data 113. A selected block vector is retained in the memory 206, the selection based the lowest cost measurement encountered in the present iteration over candidate block vectors. If the measurement indicates a cost that is low enough to satisfy a predetermined requirement or threshold (“Yes” at step 1460), the candidate block vector is selected. Control in the processor 205 progresses from step 1460 to an encode block vector step 1470. Otherwise, if further candidate block vectors are available and the cost does not satisfy the predetermined requirement (“No” at step 1460), control in the processor 205 progresses from step 1460 to the iterate candidate block vectors step 1415. At step further candidate block vectors may be tested until none are available. When no further candidate block vectors are available, the block vector encountered with lowest cost for the current CU is selected and control in the processor 205 progresses from step 1460 to the encode block vector step 1470.

[000245] At the encode block vector step 1470 the entropy encoder 338, under execution of the processor 205, encodes the selected block vector into the bitstream 115. Generally, the block vector is encoded in the bitstream as a motion vector referencing a picture in the reference picture list that is designated as the current picture. The method 1400 terminates on execution of step 1470 and the processor 205 progresses to the next coding unit.

[000246] Fig. 15 shows a method 1500 for decoding a coding unit using current picture referencing to a CTU in the above row of CTUs. The method 1500 may be embodied by apparatus such as a configured FPGA, an ASIC, or an ASSP. Additionally, the method 1500 may be performed by video decoder 134 under execution of the processor 205. As such, the method 1500 may be stored on a computer-readable storage medium and/or in the memory 206. The method 1500 operates in a similar manner to the method 1400. The method 1500 commences at a decode block vector step 1510.

[000247] At the decode block vector step 1510 the entropy decoder 420, under execution of the processor 205, decodes a block vector for a CU. The block vector is decoded using the motion vector coding syntax and identified as a block vector due to use of a particular reference picture in the reference picture list designated as the current picture. Control in the processor 205 progresses from step 1510 to a neighbour CTU location test step 1520.

[000248] At the neighbour CTU location test step 1520 the video decoder 134, under execution of the processor 205, determines if the current CTU is a boundary CTU. A boundary CTU is a CTU aligned to (located at) the left edge of a frame or tile. If the current CTU is a boundary CTU (“Yes” at step 1520), control in the processor 205 progresses to an add second block vector step 1530. Otherwise, if the current CTU is not a boundary CTU (“No” at step 1520) control in the processor 205 progresses to a fetch reference block step 1540.

[000249] At the add second block vector step 1530 the video decoder 134, under execution of the processor 205, determines an offset block vector and adds the offset block vector to the candidate block vector to produce an access block vector. The offset block vector is for reference samples from at least one previously decoded block in a previously decoded CTU. The offset block vector, e.g. 13440, allows referencing the previous CTU when the previous CTU is in the above row of CTUs compared to the row of the current CTU, using a candidate block vector that addresses the previous CTU as if the previous CTU were to the left of the current CTU. The offset block vector has a Y component equal to negative of the CTU height, i.e. -128 luma samples, and an X component equal to the frame or tile width. Alternatively, the offset block vector can relate to absolute distances between the previous CTU and the current CTU or another mathematical method of locating the shadow block adjacent and left of the current block. Control in the processor 205 progresses from step 1530 to a fetch reference block step 1540.

[000250] At the fetch reference block step 1540 the video decoder 134, under execution of the processor 205, fetches a block of samples from the reference sample cache 460. The block of samples is fetched according to the candidate block vector (for non-leftmost CTUs of a tile or frame) or the access block vector (for leftmost CTUs of a tile or frame). The fetched reference samples result in a candidate reference block. Control in the processor 205 progresses from step 1540 to a form coding unit step 1550.

[000251] At the form coding unit step 1550 the video decoder 134, under execution of the processor 205, decodes the coding unit by decoding a residual for the coding unit from the bitstream 133, performing inverse quantisation and inverse transform and adding the result to the fetched reference samples. The method 1500 terminates upon execution of step 1550 and the processor 205 progresses to the next coding unit to execute the method 1500 again.

[000252] If video encoder 114 and the video decoder 134 implement the methods 1400 and 1500 respectively, current picture referencing can be applied to coding units in CTUs along the left edge of a frame, and along the left edge of tiles within a frame. Accordingly, a prediction block can be generated to form the coding unit as described in relation to Fig. 3. Moreover, application of CPR is such that the previously encoded or decoded CTU in the same frame or tile is available for use as reference without excessive block vector magnitude. Finally, the available reference areas in the current and previous CTU are defined at a granularity of VPDUs, i.e. 64x64 regions. Restricting available reference areas in the current and previous CTU on a VPDU basis restricts the memory requirement of the reference sample cache 356 and the reconstructed sample cache 460 to one CTU of samples. Moreover, as encoding or decoding progresses, utilisation of the reference sample cache 356 and the reconstructed sample cache 460 remains constant at one CTU of samples as one VPDU from the previous CTU is invalidated each time processing a new VPDU in the current CTU commences.

INDUSTRIAL APPLICABILITY

[000253] The arrangements described are applicable to the computer and data processing industries and particularly for the digital signal processing for the encoding a decoding of signals such as video and image signals, achieving high compression efficiency without excessive cost in terms of memory consumption, silicon area due to affording the possibility of pipelined implementations with a processing region size smaller than the largest supported block size, or CTU size. In some implementations, the arrangements described are useful for the VVC standard, as use of the VPDU-level parsing (as implemented at steps 1030 and 1130 for example) allows pipeline processing of video encoding and decoding, thereby simplifying hardware requirements. . Additionally, memory requirements for decoding video data may be reduced. Further, as described above in relation to Figs. 14 and 15 generating an access block for left-most CTUs allows CPR to be used for the left-most CTUs while limiting vector block magnitude. Additionally, granularity and memory requirements may be improved.

[000254] The foregoing describes only some embodiments of the present invention, and modifications and/or changes can be made thereto without departing from the scope and spirit of the invention, the embodiments being illustrative and not restrictive.

Claims

1. A method of decoding, from a bitstream, a coding unit in a current coding tree unit (CTU) to produce a frame, the coding unit referencing coding units previously decoded from the bitstream and the frame including a two-dimensional array of CTUs, the method comprising:

decoding a block vector for the coding unit from the bitstream, the coding unit located at a CTU at a left edge of the two-dimensional array of CTUs;

determining an offset block vector, the offset block vector locating the previously coded CTU to be adjacent to and left of the current CTU;

producing a prediction block for the coding unit by fetching reference samples from the previous CTU according to the sum of the decoded block vector and the determined offset block vector; and

forming the coding unit using the prediction block and decoded residual samples of the coding unit to produce a frame.

2. The method according to claim 1, wherein the offset vector has a Y component equal to negative of a CTU height.

3. The method according to claim 1, wherein the offset vector has an X component equal to a width of the frame.

4. The method according to claim 1, wherein the previously coded CTU is in a different row to the coding unit.

5. The method according to claim 1, wherein the determined offset locates a portion of the previously coded CTU to be adjacent to and left of the current CTU.

6. A non-transitory computer readable medium having a computer program stored thereon to implement a method of decoding, from a bitstream, a coding unit in a current coding tree unit (CTU) to produce a frame, the coding unit referencing coding units previously decoded from the bitstream and the frame including a two-dimensional array of CTUs, the program comprising:

code for decoding a block vector for the coding unit from the bitstream, the coding unit located at a CTU at a left edge of the two-dimensional array of CTUs;

code for determining an offset block vector, the offset block vector locating the previously coded CTU to be adjacent to and left of the current CTU;

code for producing a prediction block for the coding unit by fetching reference samples from the previous CTU according to the sum of the decoded block vector and the determined offset block vector; and

code for forming the coding unit using the prediction block and decoded residual samples of the coding unit to produce a frame.

7. A system, comprising:

a memory; and

a processor, wherein the processor is configured to execute code stored on the memory for implementing a method of decoding, from a bitstream, a coding unit in a current coding tree unit (CTU) to produce a frame, the coding unit referencing coding units previously decoded from the bitstream and the frame including a two-dimensional array of CTUs, the method comprising:

8. A video decoder configured to decode, from a bitstream, a coding unit in a current coding tree unit (CTU) to produce a frame, the coding unit referencing coding units previously decoded from the bitstream and the frame including a two-dimensional array of CTUs, by implementing a method comprising: