WO2020000016A1 - Method, apparatus and system for encoding and decoding a transformed block of video samples - Google Patents


Info

Publication number
WO2020000016A1
Authority
WO
WIPO (PCT)
Prior art keywords
transform
nsst
block
transform block
bitstream
Prior art date
Application number
PCT/AU2019/050342
Other languages
French (fr)
Inventor
Christopher James ROSEWARNE
Andrew James Dorrell
Original Assignee
Canon Kabushiki Kaisha
Canon Australia Pty Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Canon Kabushiki Kaisha and Canon Australia Pty Limited
Publication of WO2020000016A1


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/17 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
    • H04N19/176 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a block, e.g. a macroblock
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/12 Selection from among a plurality of transforms or standards, e.g. selection between discrete cosine transform [DCT] and sub-band transform or selection between H.263 and H.264
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/134 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/136 Incoming video signal characteristics or properties
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/17 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
    • H04N19/172 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a picture, frame or field
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/70 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by syntax aspects related to video coding, e.g. related to compression standards
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/90 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using coding techniques not provided for in groups H04N19/10-H04N19/85, e.g. fractals
    • H04N19/96 Tree coding, e.g. quad-tree coding
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/117 Filters, e.g. for pre-processing or post-processing
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/119 Adaptive subdivision aspects, e.g. subdivision of a picture into rectangular or non-rectangular coding blocks
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/12 Selection from among a plurality of transforms or standards, e.g. selection between discrete cosine transform [DCT] and sub-band transform or selection between H.263 and H.264
    • H04N19/122 Selection of transform size, e.g. 8x8 or 2x4x8 DCT; Selection of sub-band transforms of varying structure or type
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/124 Quantisation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/134 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/146 Data rate or code amount at the encoder output
    • H04N19/147 Data rate or code amount at the encoder output according to rate distortion criteria
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/134 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/154 Measured or subjectively estimated visual quality after decoding, e.g. measurement of distortion
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/134 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/157 Assigned coding mode, i.e. the coding mode being predefined or preselected to be further used for selection of another element or parameter
    • H04N19/159 Prediction type, e.g. intra-frame, inter-frame or bidirectional frame prediction
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/18 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being a set of transform coefficients
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/182 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being a pixel
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/503 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N19/51 Motion estimation or motion compensation
    • H04N19/523 Motion estimation or motion compensation with sub-pixel accuracy
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/60 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding
    • H04N19/61 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding in combination with predictive coding
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/60 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding
    • H04N19/625 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding using discrete cosine transform [DCT]

Definitions

  • the present invention relates generally to digital video signal processing and, in particular, to a method, apparatus and system for encoding and decoding a transformed block of video samples.
  • the present invention also relates to a computer program product including a computer readable medium having recorded thereon a computer program for encoding and decoding a transformed block of video samples.
  • JVET Joint Video Experts Team
  • the Joint Video Experts Team includes members of Study Group 16, Question 6 (SG16/Q6) of the Telecommunication Standardisation Sector (ITU-T) of the International Telecommunication Union (ITU), known as the "Video Coding Experts Group" (VCEG), and members of the International Organisation for Standardisation / International Electrotechnical Commission Joint Technical Committee 1 / Subcommittee 29 / Working Group 11 (ISO/IEC JTC1/SC29/WG11), also known as the "Moving Picture Experts Group" (MPEG).
  • VVC versatile video coding
  • Video data includes a sequence of frames of image data, each of which includes one or more colour channels. Generally, there is one primary colour channel and two secondary colour channels. The primary colour channel is generally referred to as the 'luma' channel and the secondary colour channel(s) are generally referred to as the 'chroma' channels.
  • RGB red-green-blue
  • the video data representation seen by an encoder or a decoder often uses a colour space such as YCbCr.
  • YCbCr concentrates luma in a Y (primary) channel and chroma in Cb and Cr (secondary) channels.
  • the Cb and Cr channels may be sampled at a lower rate compared to the luma channel, for example half horizontally and half vertically - known as a '4:2:0 chroma format'.
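The 4:2:0 sampling described above halves each chroma plane in both dimensions. As a minimal sketch (the helper name and string-based format argument are illustrative, not part of any codec API), the chroma plane dimensions can be derived from the luma dimensions as follows:

```python
# Hypothetical helper illustrating chroma subsampling formats.
def chroma_plane_size(luma_width, luma_height, chroma_format="4:2:0"):
    """Return the (width, height) of each chroma (Cb or Cr) plane."""
    if chroma_format == "4:2:0":
        # Half the sampling rate both horizontally and vertically.
        return (luma_width // 2, luma_height // 2)
    if chroma_format == "4:2:2":
        # Half horizontally only.
        return (luma_width // 2, luma_height)
    # 4:4:4 - chroma sampled at the full luma rate.
    return (luma_width, luma_height)

# A 1920x1080 luma frame carries two 960x540 chroma planes under 4:2:0.
print(chroma_plane_size(1920, 1080))  # (960, 540)
```

Under 4:2:0, each chroma plane therefore holds only a quarter of the luma sample count, which is where much of the raw-data saving comes from.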
  • VVC is a 'block based' codec, in which frames are divided into blocks and the blocks are processed in a particular order. For each block a prediction of the contents of the block is generated, and a representation of the difference (or 'residual' in the spatial domain) between the prediction and the actual block contents seen as input to the encoder is formed.
  • the difference may be coded as a sequence of residual coefficients produced by applying a forward transform, such as a Discrete Cosine Transform (DCT) or other transform, to the block of residual values.
  • DCT Discrete Cosine Transform
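The effect of a forward transform such as the DCT is to concentrate the energy of a residual block into a few coefficients. The following is a naive one-dimensional DCT-II sketch for illustration only; real codecs use fast, normalised integer approximations:

```python
import math

def dct_ii(residual):
    """Unnormalised 1-D DCT-II of a list of residual values."""
    n = len(residual)
    return [sum(r * math.cos(math.pi * (i + 0.5) * k / n)
                for i, r in enumerate(residual))
            for k in range(n)]

# A flat residual concentrates all energy in the DC (k = 0) coefficient;
# the remaining coefficients are numerically zero.
print(dct_ii([5, 5, 5, 5]))  # approximately [20.0, 0, 0, 0]
```

The fewer significant coefficients a residual produces, the cheaper it is to code.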
  • a secondary transform may also be applied to achieve further compression efficiency.
  • the resulting coefficients are then quantised, resulting in a loss of precision in exchange for a reduction in the compressed size of the residual data.
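The precision loss from quantisation can be sketched as uniform scalar quantisation. The function names and fixed step size below are illustrative assumptions; actual codecs derive the step size from a quantisation parameter and use more elaborate rounding:

```python
def quantise(coefficients, step):
    """Divide each transform coefficient by the step size and round,
    trading precision for a smaller compressed representation."""
    return [round(c / step) for c in coefficients]

def dequantise(levels, step):
    """Decoder-side rescaling; the rounding loss is not recoverable."""
    return [level * step for level in levels]

coeffs = [103, -47, 12, -3]
levels = quantise(coeffs, 8)           # [13, -6, 2, 0]
print(dequantise(levels, 8))           # [104, -48, 16, 0] - close, not exact
```

Note that small coefficients (here -3) can quantise to zero entirely, which is a major source of the bit-rate saving.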
  • One aspect of the present disclosure provides a method of decoding a transform block in an image frame from a bitstream, the method comprising: determining an aspect ratio for a transform block of the bitstream; determining, in the bitstream, a presence of a non-separable secondary transform (NSST) selection for the transform block based on the determined aspect ratio; decoding, from the bitstream, the NSST selection for the transform block; and decoding the transform block in the image frame by applying the decoded NSST selection to the transform block.
  • NSST non-separable secondary transform
  • the presence of the NSST selection is determined if the aspect ratio is 2:1.
  • the presence of the NSST selection is determined if the aspect ratio is less than or equal to a predetermined threshold, the aspect ratio expressed as a longest side of the transform block to a shortest side of the transform block.
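Under the arrangements described above, the decoder can decide whether an NSST selection is present in the bitstream purely from the transform block's dimensions. A minimal sketch of that presence test follows; the function name and threshold parameter are hypothetical, not drawn from any standard text:

```python
def nsst_selection_present(width, height, max_ratio=2):
    """Return True if an NSST selection is expected in the bitstream for
    a transform block of the given size: present only when the aspect
    ratio (longest side over shortest side) does not exceed the
    threshold."""
    aspect_ratio = max(width, height) / min(width, height)
    return aspect_ratio <= max_ratio

print(nsst_selection_present(8, 8))    # True  - 1:1 square block
print(nsst_selection_present(16, 8))   # True  - 2:1
print(nsst_selection_present(32, 8))   # False - 4:1, no NSST index coded
```

When the test returns False, the decoder skips decoding the NSST selection and applies no secondary transform, so no bits are spent signalling it for such blocks.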
  • Another aspect of the present disclosure provides a non-transitory computer-readable medium having a computer program stored thereon to implement a method of decoding a transform block in an image frame from a bitstream, the program comprising: code for determining an aspect ratio for a transform block; code for determining, in the bitstream, a presence of a non-separable secondary transform (NSST) selection for the transform block based on the determined aspect ratio; code for decoding, from the bitstream, the NSST selection for the transform block; and code for decoding the transform block in the image frame by applying the decoded NSST selection to the transform block.
  • a system comprising: a memory; and a processor, wherein the processor is configured to execute code stored on the memory for implementing a method of decoding a transform block in an image frame from a bitstream, the method comprising: determining an aspect ratio for a transform block; determining, in the bitstream, a presence of a non-separable secondary transform (NSST) selection for the transform block based on the determined aspect ratio; decoding, from the bitstream, the NSST selection for the transform block; and decoding the transform block in the image frame by applying the decoded NSST selection to the transform block.
  • a video decoder configured to decode a transform block in an image frame from a bitstream, comprising: a memory; a processor configured to execute code stored on the memory to: determine an aspect ratio for a transform block; determine, in the bitstream, a presence of a non-separable secondary transform (NSST) selection for the transform block based on the determined aspect ratio; decode, from the bitstream, the NSST selection for the transform block; and decode the transform block in the image frame by applying the decoded NSST selection to the transform block.
  • Fig. 1 is a schematic block diagram showing a video encoding and decoding system;
  • Figs. 2A and 2B form a schematic block diagram of a general purpose computer system upon which one or both of the video encoder and video decoder of the system of Fig. 1 may be practiced;
  • Fig. 3 is a schematic block diagram showing functional modules of a video encoder;
  • Fig. 4 is a schematic block diagram showing functional modules of a video decoder;
  • Fig. 5 is a schematic block diagram showing the available divisions of a block into one or more blocks in the tree structure of versatile video coding;
  • Fig. 6 is a schematic illustration of a dataflow to achieve permitted divisions of a block into one or more blocks in a tree structure of versatile video coding;
  • Fig. 7 is an example division of a coding tree unit (CTU) into a number of coding units (CUs);
  • Fig. 8A is a diagram showing intra prediction modes;
  • Fig. 8B is a table showing a mapping from intra prediction modes to secondary transform sets for a transform block;
  • Fig. 9 is a schematic block diagram showing the inverse secondary transform module of the video encoder of Fig. 3 or the video decoder of Fig. 4;
  • Fig. 10 is a schematic block diagram showing the set of transform blocks available in the versatile video coding standard;
  • Fig. 11 is a flow chart diagram of a method for selectively applying a non-separable secondary transform to encode a transform block of residual coefficients into a bitstream; and
  • Fig. 12 is a flow chart diagram of a method for decoding a transform block of residual coefficients from a bitstream by selectively applying a non-separable secondary transform.
  • Fig. 1 is a schematic block diagram showing functional modules of a video encoding and decoding system 100.
  • the system 100 may utilise coefficient scanning methods to improve compression efficiency and/or achieve reduced implementation cost.
  • the system 100 includes a source device 110 and a destination device 130.
  • a communication channel 120 is used to communicate encoded video information from the source device 110 to the destination device 130.
  • the source device 110 and destination device 130 may either or both comprise respective mobile telephone handsets or "smartphones", in which case the communication channel 120 is a wireless channel.
  • the source device 110 and destination device 130 may comprise video conferencing equipment, in which case the communication channel 120 is typically a wired channel, such as an internet connection.
  • the source device 110 and the destination device 130 may comprise any of a wide range of devices, including devices supporting over-the-air television broadcasts, cable television applications, internet video applications (including streaming) and applications where encoded video data is captured on some computer-readable storage medium, such as hard disk drives in a file server.
  • the source device 110 includes a video source 112, a video encoder 114 and a transmitter 116.
  • the video source 112 typically comprises a source of captured video frame data (shown as 113), also referred to as an image frame, such as an imaging sensor, a previously captured video sequence stored on a non-transitory recording medium, or a video feed from a remote imaging sensor.
  • the video source 112 may also be the output of a computer graphics card, e.g. displaying the video output of an operating system and various applications executing upon a computing device, for example a tablet computer.
  • Examples of source devices 110 that may include an imaging sensor as the video source 112 include smart-phones, video camcorders, professional video cameras, and network video cameras.
  • the video encoder 114 converts (or‘encodes’) the captured frame data (indicated by an arrow 113) from the video source 112 into a bitstream (indicated by an arrow 115) as described further with reference to Fig. 3.
  • the bitstream 115 is transmitted by the transmitter 116 over the communication channel 120. Alternatively, the bitstream 115 may be stored in a non-transitory storage device 122, such as a "Flash" memory or a hard disk drive, until later being transmitted over the communication channel 120, or in lieu of transmission over the communication channel 120.
  • the destination device 130 includes a receiver 132, a video decoder 134 and a display device 136.
  • the receiver 132 receives encoded video data from the communication channel 120 and passes received video data to the video decoder 134 as a bitstream (indicated by an arrow 133).
  • the video decoder 134 then outputs decoded image frame data (indicated by an arrow 135) to the display device 136.
  • Examples of the display device 136 include a cathode ray tube, a liquid crystal display, such as in smart-phones, tablet computers, computer monitors or in stand-alone television sets. It is also possible for the functionality of each of the source device 110 and the destination device 130 to be embodied in a single device, examples of which include mobile telephone handsets and tablet computers.
  • each of the source device 110 and destination device 130 may be configured within a general purpose computing system, typically through a combination of hardware and software components.
  • Fig. 2A illustrates such a computer system 200, which includes: a computer module 201; input devices such as a keyboard 202, a mouse pointer device 203, a scanner 226, a camera 227, which may be configured as the video source 112, and a microphone 280; and output devices including a printer 215, a display device 214, which may be configured as the display device 136, and loudspeakers 217.
  • An external Modulator-Demodulator (Modem) transceiver device 216 may be used by the computer module 201 for communicating to and from a communications network 220 via a connection 221.
  • the communications network 220, which may represent the communication channel 120, may be a wide-area network (WAN), such as the Internet, a cellular telecommunications network, or a private WAN.
  • the modem 216 may be a traditional "dial-up" modem.
  • the modem 216 may be a broadband modem.
  • a wireless modem may also be used for wireless connection to the communications network 220.
  • the transceiver device 216 may provide the functionality of the transmitter 116 and the receiver 132 and the communication channel 120 may be embodied in the connection 221.
  • the computer module 201 typically includes at least one processor unit 205, and a memory unit 206.
  • the memory unit 206 may have semiconductor random access memory (RAM) and semiconductor read only memory (ROM).
  • the computer module 201 also includes a number of input/output (I/O) interfaces including: an audio-video interface 207 that couples to the video display 214, loudspeakers 217 and microphone 280; an I/O interface 213 that couples to the keyboard 202, mouse 203, scanner 226, camera 227 and optionally a joystick or other human interface device (not illustrated); and an interface 208 for the external modem 216 and printer 215.
  • the signal from the audio-video interface 207 to the computer monitor 214 is generally the output of a computer graphics card.
  • the modem 216 may be incorporated within the computer module 201, for example within the interface 208.
  • the computer module 201 also has a local network interface 211, which permits coupling of the computer system 200 via a connection 223 to a local-area communications network 222, known as a Local Area Network (LAN).
  • LAN Local Area Network
  • the local communications network 222 may also couple to the wide network 220 via a connection 224, which would typically include a so-called "firewall" device or device of similar functionality.
  • the local network interface 211 may comprise an Ethernet™ circuit card, a Bluetooth™ wireless arrangement or an IEEE 802.11 wireless arrangement; however, numerous other types of interfaces may be practiced for the interface 211.
  • the local network interface 211 may also provide the functionality of the transmitter 116 and the receiver 132 and communication channel 120 may also be embodied in the local communications network 222.
  • the I/O interfaces 208 and 213 may afford either or both of serial and parallel connectivity, the former typically being implemented according to the Universal Serial Bus (USB) standards and having corresponding USB connectors (not illustrated).
  • Storage devices 209 are provided and typically include a hard disk drive (HDD) 210. Other storage devices such as a floppy disk drive and a magnetic tape drive (not illustrated) may also be used.
  • An optical disk drive 212 is typically provided to act as a non-volatile source of data.
  • Portable memory devices, such as optical disks, may also be used as sources of data to the computer system 200.
  • any of the HDD 210, optical drive 212, networks 220 and 222 may also be configured to operate as the video source 112, or as a destination for decoded video data to be stored for reproduction via the display 214.
  • either or both of the source device 110 and the destination device 130 of the system 100 may be embodied in the computer system 200.
  • the components 205 to 213 of the computer module 201 typically communicate via an interconnected bus 204 and in a manner that results in a conventional mode of operation of the computer system 200 known to those in the relevant art.
  • the processor 205 is coupled to the system bus 204 using a connection 218.
  • the memory 206 and optical disk drive 212 are coupled to the system bus 204 by connections 219. Examples of computers on which the described arrangements can be practised include IBM-PCs and compatibles, Sun SPARC stations, Apple Mac™ or similar computer systems.
  • the video encoder 114 and the video decoder 134 may be implemented using the computer system 200, wherein the video encoder 114, the video decoder 134 and methods to be described may be implemented as one or more software application programs 233 executable within the computer system 200.
  • the software instructions 231 may be formed as one or more code modules, each for performing one or more particular tasks.
  • the software may also be divided into two separate parts, in which a first part and the corresponding code modules performs the described methods and a second part and the corresponding code modules manage a user interface between the first part and the user.
  • the software may be stored in a computer readable medium, including the storage devices described below, for example.
  • the software is loaded into the computer system 200 from the computer readable medium, and then executed by the computer system 200.
  • a computer readable medium having such software or computer program recorded on the computer readable medium is a computer program product.
  • the use of the computer program product in the computer system 200 preferably effects an advantageous apparatus for implementing the video encoder 114, the video decoder 134 and the described methods.
  • the software 233 is typically stored in the HDD 210 or the memory 206.
  • the software is loaded into the computer system 200 from a computer readable medium, and executed by the computer system 200.
  • the software 233 may be stored on an optically readable disk storage medium (e.g., CD-ROM) 225 that is read by the optical disk drive 212.
  • the application programs 233 may be supplied to the user encoded on one or more CD-ROMs 225 and read via the corresponding drive 212, or alternatively may be read by the user from the networks 220 or 222. Still further, the software can also be loaded into the computer system 200 from other computer readable media.
  • Computer readable storage media refers to any non-transitory tangible storage medium that provides recorded instructions and/or data to the computer system 200 for execution and/or processing.
  • Examples of such storage media include floppy disks, magnetic tape, CD-ROM, DVD, Blu-ray Disc™, a hard disk drive, a ROM or integrated circuit, USB memory, a magneto-optical disk, or a computer readable card such as a PCMCIA card and the like, whether or not such devices are internal or external of the computer module 201.
  • Examples of transitory or non-tangible computer readable transmission media that may also participate in the provision of the software, application programs, instructions and/or video data or encoded video data to the computer module 201 include radio or infra-red transmission channels as well as a network connection to another computer or networked device, and the Internet or Intranets including e-mail transmissions and information recorded on Websites and the like.
  • The second part of the software and the corresponding code modules may implement one or more graphical user interfaces (GUIs).
  • a user of the computer system 200 and the application may manipulate the interface in a functionally adaptable manner to provide controlling commands and/or input to the applications associated with the GUI(s).
  • Other forms of functionally adaptable user interfaces may also be implemented, such as an audio interface utilizing speech prompts output via the loudspeakers 217 and user voice commands input via the microphone 280.
  • FIG. 2B is a detailed schematic block diagram of the processor 205 and a 'memory' 234.
  • the memory 234 represents a logical aggregation of all the memory modules (including the HDD 209 and semiconductor memory 206) that can be accessed by the computer module 201 in Fig. 2A.
  • a power-on self-test (POST) program 250 executes.
  • the POST program 250 is typically stored in a ROM 249 of the semiconductor memory 206 of Fig. 2A.
  • a hardware device such as the ROM 249 storing software is sometimes referred to as firmware.
  • the POST program 250 examines hardware within the computer module 201 to ensure proper functioning and typically checks the processor 205, the memory 234 (209, 206), and a basic input-output systems software (BIOS) module 251, also typically stored in the ROM 249, for correct operation. Once the POST program 250 has run successfully, the BIOS 251 activates the hard disk drive 210 of Fig. 2A.
  • Activation of the hard disk drive 210 causes a bootstrap loader program 252 that is resident on the hard disk drive 210 to execute via the processor 205.
  • the operating system 253 is a system level application, executable by the processor 205, to fulfil various high level functions, including processor management, memory management, device management, storage management, software application interface, and generic user interface.
  • the operating system 253 manages the memory 234 (209, 206) to ensure that each process or application running on the computer module 201 has sufficient memory in which to execute without colliding with memory allocated to another process. Furthermore, the different types of memory available in the computer system 200 of Fig. 2A must be used properly so that each process can run effectively. Accordingly, the aggregated memory 234 is not intended to illustrate how particular segments of memory are allocated (unless otherwise stated), but rather to provide a general view of the memory accessible by the computer system 200 and how such is used.
  • the processor 205 includes a number of functional modules including a control unit 239, an arithmetic logic unit (ALU) 240, and a local or internal memory 248, sometimes called a cache memory.
  • the cache memory 248 typically includes a number of storage registers 244-246 in a register section.
  • One or more internal busses 241 functionally interconnect these functional modules.
  • the processor 205 typically also has one or more interfaces 242 for communicating with external devices via the system bus 204, using a connection 218.
  • the memory 234 is coupled to the bus 204 using a connection 219.
  • the application program 233 includes a sequence of instructions 231 that may include conditional branch and loop instructions.
  • the program 233 may also include data 232 which is used in execution of the program 233.
  • the instructions 231 and the data 232 are stored in memory locations 228, 229, 230 and 235, 236, 237, respectively.
  • a particular instruction may be stored in a single memory location as depicted by the instruction shown in the memory location 230.
  • an instruction may be segmented into a number of parts each of which is stored in a separate memory location, as depicted by the instruction segments shown in the memory locations 228 and 229.
  • the processor 205 is given a set of instructions which are executed therein.
  • the processor 205 waits for a subsequent input, to which the processor 205 reacts to by executing another set of instructions.
  • Each input may be provided from one or more of a number of sources, including data generated by one or more of the input devices 202, 203, data received from an external source across one of the networks 220, 222, data retrieved from one of the storage devices 206, 209 or data retrieved from a storage medium 225 inserted into the corresponding reader 212, all depicted in Fig. 2A.
  • the execution of a set of the instructions may in some cases result in output of data. Execution may also involve storing data or variables to the memory 234.
  • the video encoder 114, the video decoder 134 and the described methods may use input variables 254, which are stored in the memory 234 in corresponding memory locations.
  • the video encoder 114, the video decoder 134 and the described methods produce output variables 261, which are stored in the memory 234 in corresponding memory locations 262, 263, 264.
  • Intermediate variables 258 may be stored in memory locations 259, 260, 266 and 267.
  • each fetch, decode, and execute cycle comprises: a fetch operation, which fetches or reads an instruction 231 from a memory location 228, 229, 230; a decode operation in which the control unit 239 determines which instruction has been fetched; and an execute operation in which the control unit 239 and/or the ALU 240 execute the instruction.
  • a further fetch, decode, and execute cycle for the next instruction may be executed.
  • a store cycle may be performed by which the control unit 239 stores or writes a value to a memory location 232.
  • Each step or sub-process in the methods of Figs. 12 and 13, to be described, is associated with one or more segments of the program 233 and is typically performed by the register section 244, 245, 247, the ALU 240, and the control unit 239 in the processor 205 working together to perform the fetch, decode, and execute cycles for every instruction in the instruction set for the noted segments of the program 233.
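The fetch, decode and execute cycle described above can be sketched as a toy interpreter loop. The instruction set, program and register names below are hypothetical illustrations of the control flow only, not a model of the actual processor 205:

```python
# Toy sketch of a fetch, decode, execute cycle. The instruction set and
# "memory" layout here are invented for illustration.

def run(program, memory):
    """Repeatedly fetch, decode and execute until a HALT instruction."""
    pc = 0  # program counter
    while True:
        instr = program[pc]            # fetch the next instruction
        op, *args = instr              # decode: determine the operation
        if op == "LOAD":               # execute: read a memory location
            dst, addr = args
            memory[dst] = memory[addr]
        elif op == "ADD":              # execute: arithmetic via the "ALU"
            dst, a, b = args
            memory[dst] = memory[a] + memory[b]
        elif op == "STORE":            # store cycle: write a value back
            addr, src = args
            memory[addr] = memory[src]
        elif op == "HALT":
            return memory
        pc += 1                        # proceed to the next instruction

memory = {"r0": 0, "r1": 0, "x": 2, "y": 3, "out": 0}
program = [("LOAD", "r0", "x"), ("LOAD", "r1", "y"),
           ("ADD", "r0", "r0", "r1"), ("STORE", "out", "r0"), ("HALT",)]
```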
  • FIG. 3 is a schematic block diagram showing functional modules of the video encoder 114.
  • Fig. 4 is a schematic block diagram showing functional modules of the video decoder 134.
  • data passes between functional modules within the video encoder 114 and the video decoder 134 in groups of samples or coefficients, such as divisions of blocks into sub-blocks of a fixed size, or as arrays.
  • the video encoder 114 and video decoder 134 may be implemented using a general-purpose computer system 200, as shown in Figs. 2A and 2B.
  • the various functional modules may be implemented by dedicated hardware within the computer system 200, or by software executable within the computer system 200, such as one or more software code modules of the software application program 233 resident on the hard disk drive 210 and being controlled in its execution by the processor 205.
  • the video encoder 114 and video decoder 134 may be implemented by a combination of dedicated hardware and software executable within the computer system 200.
  • the video encoder 114, the video decoder 134 and the described methods may alternatively be implemented in dedicated hardware, such as one or more integrated circuits performing the functions or sub functions of the described methods.
  • Such dedicated hardware may include graphic processing units (GPUs), digital signal processors (DSPs), application-specific standard products (ASSPs), application-specific integrated circuits (ASICs), field programmable gate arrays (FPGAs) or one or more microprocessors and associated memories.
  • the video encoder 114 comprises modules 322-386 and the video decoder 134 comprises modules 420-496 which may each be implemented as one or more software code modules of the software application program 233.
  • Although the video encoder 114 of Fig. 3 is an example of a versatile video coding (VVC) video encoding pipeline, other video codecs may also be used to perform the processing stages described herein.
  • the video encoder 114 receives captured image frame data 113, such as a series of frames, each frame including one or more colour channels.
  • a block partitioner 310 firstly divides the image frame data 113 into regions generally referred to as 'coding tree units' (CTUs), generally square in shape and configured such that a particular size for the CTUs is used.
  • the size of the coding tree units may be 64x64, 128x128, or 256x256 luma samples for example.
  • the block partitioner 310 further divides each CTU into one or more coding units (CUs), with the CUs having a variety of sizes, which may include both square and non-square aspect ratios.
  • a current block 312, referred to as a 'coding unit' (CU), is output from the block partitioner 310, progressing in accordance with an iteration over the one or more blocks of the CTU.
  • the concept of a CU is not limited to the block partitioning resulting from the block partitioner 310.
  • the video decoder 134 may also be said to produce CUs which, due to use of lossy compression techniques, are typically an approximation of the blocks from the block partitioner 310.
  • the video encoder 114 produces CUs having the same approximation as seen in the video decoder 134, enabling exact knowledge of the sample data available to block prediction methods in the video decoder 134.
  • the options for partitioning CTUs into CUs are further described below with reference to Figs. 5 and 6.
  • the coding tree units (CTUs) resulting from the first division of the image frame data 113 may be scanned in raster scan order and are grouped into one or more 'slices'.
  • the frame data 113 typically includes multiple colour channels
  • the CTUs and CUs are associated with the samples from all colour channels that overlap with the block area defined from operation of the block partitioner 310.
  • a CU may be said to comprise one or more coding blocks (CBs), with each CB occupying the same block area as the CU but being associated with each one of the colour channels of the frame data 113.
  • CBs for chroma channels may differ from those of CBs for luma channels.
  • CBs of chroma channels of a CU have dimensions of half of the width and height of the CB for the luma channel of the CU.
  • the video encoder 114 produces a 'prediction unit' (PU), indicated by an arrow 320, for each block, for example the block 312.
  • the PU 320 is a prediction of the contents of the associated CU 312.
  • a subtracter module 322 produces a difference, indicated as 324 (or 'residual', referring to the difference being in the spatial domain), between the PU 320 and the CU 312.
  • the difference 324 is a block-sized difference between corresponding samples in the PU 320 and the CU 312.
  • the difference 324 is transformed, quantised and represented as a transform unit (TU), indicated by an arrow 336.
  • the PU 320 is typically chosen as the 'best' resulting one of many possible candidate PUs.
  • a candidate PU is a PU resulting from one of the prediction modes available to the video encoder 114. Each candidate PU results in a corresponding transform unit.
  • the transform unit 336 is a quantised and transformed representation of the difference 324. When combined with the predicted PU in the video decoder 134, the transform unit 336 reduces the difference between decoded CUs and the original blocks 312 at the expense of additional signalling in a bitstream.
  • Each candidate PU thus has an associated coding cost (rate) and an associated difference (or 'distortion').
  • the coding rate (cost) is typically measured in bits.
  • the coding distortion of a block is typically estimated as a difference in sample values, such as a sum of absolute differences (SAD) or a sum of squared differences (SSD).
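The two distortion measures named above can be sketched directly. This is a minimal illustration computed between an original block and a candidate prediction, not encoder code from the document:

```python
# Sum of absolute differences (SAD) and sum of squared differences (SSD)
# between two equally sized 2-D sample blocks.

def sad(block, prediction):
    """Sum of absolute differences between corresponding samples."""
    return sum(abs(a - b) for row_a, row_b in zip(block, prediction)
               for a, b in zip(row_a, row_b))

def ssd(block, prediction):
    """Sum of squared differences between corresponding samples."""
    return sum((a - b) ** 2 for row_a, row_b in zip(block, prediction)
               for a, b in zip(row_a, row_b))

original = [[10, 12], [14, 16]]
candidate = [[11, 12], [13, 19]]
```

SSD penalises large individual errors more heavily than SAD, which is why the two estimates can rank candidates differently.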
  • the estimate resulting from each candidate PU is determined by a mode selector 386 using the difference 324 to determine an intra prediction mode (represented by an arrow 388).
  • Estimation of the coding costs associated with each candidate prediction mode and corresponding residual coding can be performed at significantly lower cost than entropy coding of the residual. Accordingly, a number of candidate modes can be evaluated to determine an optimum mode in a rate-distortion sense.
  • Determining an optimum mode is typically achieved using a variation of Lagrangian optimisation.
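The Lagrangian optimisation mentioned above trades distortion against rate by minimising a combined cost J = D + λ·R over the candidate modes. The candidate names and values below are invented purely for illustration:

```python
# Sketch of rate-distortion mode selection: pick the candidate that
# minimises the Lagrangian cost J = D + lambda * R.

def best_mode(candidates, lmbda):
    """Return the candidate with the lowest cost J = D + lambda * R."""
    return min(candidates, key=lambda c: c["D"] + lmbda * c["R"])

# Hypothetical candidates: distortion D (e.g. SSD) and rate R in bits.
candidates = [
    {"mode": "planar", "D": 120.0, "R": 10},
    {"mode": "dc",     "D": 150.0, "R": 6},
    {"mode": "ang_18", "D": 90.0,  "R": 22},
]
```

A larger λ favours cheaper modes (fewer bits); a smaller λ favours lower distortion, so the chosen mode changes with λ.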
  • Selection of the intra prediction mode 388 typically involves determining a coding cost for the residual data resulting from application of a particular intra prediction mode.
  • the coding cost may be approximated by using a 'sum of transformed differences' whereby a relatively simple transform, such as a Hadamard transform, is used to obtain an estimated transformed residual cost.
  • the simplified estimation method may be used to generate a list of best candidates.
  • the list of best candidates may be of an arbitrary number.
  • a more complete search may be performed using the best candidates to establish optimal mode choices for coding the residual data for each of the candidates, allowing a final selection of the intra prediction mode along with other mode decisions.
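The 'sum of transformed differences' estimate described above can be sketched for a 4x4 block with a Hadamard transform. Real encoders use optimised integer arithmetic; this readability-first version only illustrates the idea:

```python
# Hadamard-based cost estimate (SATD): transform the residual with a 4x4
# Hadamard matrix applied on both sides, then sum absolute coefficients.

H4 = [[1,  1,  1,  1],
      [1, -1,  1, -1],
      [1,  1, -1, -1],
      [1, -1, -1,  1]]

def matmul(a, b):
    """4x4 matrix product (sufficient for this sketch)."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def satd4x4(block, prediction):
    """Hadamard-transform the 4x4 residual and sum absolute coefficients."""
    diff = [[block[i][j] - prediction[i][j] for j in range(4)]
            for i in range(4)]
    t = matmul(matmul(H4, diff), H4)   # H * D * H (H4 is symmetric)
    return sum(abs(v) for row in t for v in row)
```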
  • the other mode decisions include an ability to skip the primary and secondary transforms, known as 'transform skip'. Skipping the transforms is suited to residual data that lacks sufficient correlation to benefit from expression as transform basis functions. Certain types of content, such as relatively simple computer-generated graphics, may exhibit similar behaviour.
  • Another mode decision associated with the mode selector module 386 is selection of an NSST (Non-Separable Secondary Transform) index.
  • the NSST index is represented by an arrow 390. Presence of the NSST index 390 in the bitstream indicates a selection of a particular NSST for the corresponding transform block.
  • the NSST index 390 has four possible values. A value of zero for the NSST index 390 indicates that the NSST is bypassed (i.e. not performed).
  • the NSST index 390 having a value from one to three indicates which one of three possible NSST transform matrices is to be applied. Selection of the NSST index is a costly operation and is performed as described with reference to Fig
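The NSST index semantics described above (zero bypasses the secondary transform, one to three select a transform matrix) can be sketched as follows. The matrix placeholders are hypothetical, standing in for the actual non-separable transform cores:

```python
# Sketch of NSST index semantics: index 0 bypasses the secondary
# transform; indices 1-3 select one of three candidate matrices.
# The "matrices" here are placeholder names, not real NSST cores.

NSST_MATRICES = {1: "matrix_A", 2: "matrix_B", 3: "matrix_C"}

def apply_nsst(coefficients, nsst_index):
    """Return (coefficients, matrix used); matrix is None when bypassed."""
    if nsst_index == 0:
        return coefficients, None          # NSST bypassed
    matrix = NSST_MATRICES[nsst_index]     # one of three candidates
    # A real implementation would multiply the flattened coefficient
    # sub-block by the selected non-separable transform matrix here.
    return coefficients, matrix
```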
  • Lagrangian or similar optimisation processing can be employed to both select an optimal partitioning of a CTU into CUs (by the block partitioner 310) as well as the selection of a best prediction mode from a plurality of possibilities.
  • In a Lagrangian optimisation process of the candidate modes in the mode selector module 386, the intra prediction mode with the lowest cost measurement is selected as the best mode.
  • the best mode is the selected intra prediction mode 388 and is also encoded in the bitstream 115 by an entropy encoder 338.
  • the selection of the intra prediction mode 388 by operation of the selector module 386 extends to operation of the block partitioner 310.
  • candidates for selection of the intra prediction mode 388 may include modes applicable to a given block and additionally modes applicable to multiple smaller blocks that collectively are collocated with the given block.
  • the process of selection of candidates implicitly is also a process of determining the best hierarchical decomposition of the CTU into CUs.
  • the entropy encoder 338 supports both variable-length coding of syntax elements and arithmetic coding of syntax elements. Arithmetic coding is supported using a context-adaptive binary arithmetic coding process. Arithmetically coded syntax elements consist of sequences of one or more 'bins'. Bins, like bits, have a value of '0' or '1'. However, bins are not encoded in the bitstream 115 as discrete bits. Bins have an associated likely value and an associated probability, known as a 'context'. When the actual bin to be coded matches the likely value, a 'most probable symbol' (MPS) is coded. Coding a most probable symbol is relatively inexpensive in terms of consumed bits.
  • When the actual bin to be coded does not match the likely value, a 'least probable symbol' (LPS) is coded.
  • Coding a least probable symbol has a relatively high cost in terms of consumed bits.
  • the bin coding techniques enable efficient coding of bins where the probability of a '0' versus a '1' is skewed. For a syntax element with two possible values (i.e., a 'flag'), a single bin is adequate. For syntax elements with many possible values, a sequence of bins is needed. Then, the presence of later bins in the sequence may be determined based on the value of earlier bins in the sequence.
  • each bin may be associated with more than one context, with the selection of a particular context dependent on earlier bins in the syntax element, the bin values of neighbouring syntax elements (i.e. those from neighbouring blocks) and the like.
  • the context is updated to adapt to the new bin value.
  • the binary arithmetic coding scheme is said to be adaptive.
  • Bypass bins are coded assuming an equiprobable distribution between a‘0’ and a‘ 1’.
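The context adaptation described above can be sketched with a toy probability model: each context tracks a most probable symbol and its estimated probability, and is nudged toward every bin it codes. The update rule and constants below are illustrative only, not the actual CABAC state tables:

```python
# Simplified sketch of an adaptive context: the estimate converges
# toward the observed bin statistics, swapping MPS/LPS roles as needed.

class Context:
    def __init__(self):
        self.mps = 0          # current most probable symbol
        self.p_mps = 0.5      # estimated probability of the MPS

    def code_bin(self, bin_value, rate=0.05):
        """Update the estimate after coding one bin; report MPS or LPS."""
        if bin_value == self.mps:
            self.p_mps += rate * (1.0 - self.p_mps)   # reinforce the MPS
            return "MPS"
        self.p_mps -= rate * self.p_mps               # weaken the MPS
        if self.p_mps < 0.5:                          # symbols swap roles
            self.mps = 1 - self.mps
            self.p_mps = 1.0 - self.p_mps
        return "LPS"
```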
  • the entropy encoder 338 encodes the intra prediction mode 388 using a combination of context-coded and (optionally) bypass-coded bins.
  • a list of 'most probable modes' is generated in the video encoder 114.
  • the list of most probable modes is typically of a fixed length, such as three or six modes, and may include modes encountered in earlier blocks.
  • a context-coded bin encodes a flag indicating if the intra prediction mode is one of the most probable modes. If the intra prediction mode 388 is one of the most probable modes, further signalling is encoded indicative of which most probable mode corresponds with the intra prediction mode 388.
  • Otherwise, the intra prediction mode 388 is encoded as a 'remaining mode', using an alternative syntax to express intra prediction modes other than those present in the most probable mode list.
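The most-probable-mode signalling above amounts to a flag plus either an index into the MPM list or a 'remaining mode' value. The sketch below assumes 67 intra prediction modes and ignores the actual binarisation, so treat the encoding details as illustrative:

```python
# Sketch of MPM-based intra mode signalling: a flag selects between an
# index into the MPM list and a 'remaining mode' index.

NUM_MODES = 67  # assumed mode count, for illustration

def encode_intra_mode(mode, mpm_list):
    """Return (mpm_flag, payload) describing the chosen intra mode."""
    if mode in mpm_list:
        return 1, mpm_list.index(mode)          # position in the MPM list
    remaining = [m for m in range(NUM_MODES) if m not in mpm_list]
    return 0, remaining.index(mode)             # 'remaining mode' index

def decode_intra_mode(mpm_flag, payload, mpm_list):
    """Invert encode_intra_mode given the same MPM list."""
    if mpm_flag:
        return mpm_list[payload]
    remaining = [m for m in range(NUM_MODES) if m not in mpm_list]
    return remaining[payload]
```

Both sides must derive the same MPM list (e.g. from neighbouring blocks) for the round trip to work.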
  • the entropy encoder 338 also encodes the NSST index 390 for particular coding units or transform blocks, as described with reference to Figs. 10 and 11.
  • a multiplexer module 384 outputs the PU 320 according to the determined best prediction mode 388, selecting from the tested candidate prediction modes.
  • the candidate prediction modes need not include every conceivable prediction mode supported by the video encoder 114.
  • Prediction modes fall broadly into two categories.
  • a first category is 'intra-frame prediction' (or 'intra prediction'). In intra-frame prediction, a prediction for a block is produced using other samples drawn from the current frame.
  • the second category is 'inter-frame prediction' (or 'inter prediction'). In inter-frame prediction, a prediction for a block is produced using samples from a frame preceding the current frame in the order of coding frames in the bitstream (which may differ from the order of the frames when captured or displayed).
  • Within each category (i.e., intra- and inter-prediction), different techniques may be applied to generate the PU.
  • intra-prediction may use values from adjacent rows and columns of previously reconstructed samples, in combination with a direction to generate a PU according to a prescribed filtering process.
  • the PU may be described using a small number of parameters.
  • Inter-prediction methods may vary in the number of motion parameters and their precision.
  • Motion parameters typically comprise a reference frame offset plus a translation for one or two reference frames, but may include more frames, special frames, or complex affine parameters such as scaling and rotation.
  • a pre-determined motion refinement process may be applied to generate dense motion estimates based on referenced sample blocks.
  • a transform module 326 applies a first transform to the difference 324, converting the difference 324 to the frequency domain and producing intermediate transform coefficients represented by an arrow 328.
  • the first transform is typically separable, transforming each block as a set of rows and then as a set of columns by applying a transform such as a one-dimensional (1D) discrete cosine transform (DCT).
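The separable transform described above applies a 1-D transform to every row and then to every column. A minimal sketch using an orthonormal DCT-II (floating point, for clarity; codecs use integer approximations):

```python
# Sketch of a separable 2-D transform: 1-D DCT-II on rows, then columns.

import math

def dct1d(v):
    """Orthonormal 1-D DCT-II of a list of samples."""
    n = len(v)
    out = []
    for k in range(n):
        s = sum(v[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def dct2d(block):
    """Apply the 1-D DCT to every row, then to every column."""
    rows = [dct1d(row) for row in block]
    cols = [dct1d(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]
```

A constant block transforms to a single DC coefficient, illustrating how correlated residuals compact their energy into few coefficients.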
  • the secondary transform module 330 operates on a subset of the intermediate transform coefficients 328, such as the intermediate transform coefficients occupying the upper left 4x4 or 8x8 area of the overall block. Other transform coefficients in the intermediate transform coefficients 328 are passed through the module 330 unchanged.
  • the secondary transform module 330 applies one of a variety of transforms on the subset of the intermediate transform coefficients 328 to produce transform coefficients represented by an arrow 332.
  • the secondary transform module 330 applies a forward secondary transform selected in a manner analogous to that of an inverse secondary transform module 344, to be further described with reference to Fig. 9.
  • the intra prediction mode 388 and the non-separable secondary transform index 390 are used to select a particular secondary transform.
  • the transforms available to the secondary transform module 330 are typically non-separable and thus cannot be performed in two stages (i.e. rows and columns) as is the case for the transform module 326.
  • the selection of the applied transform may depend, at least in part, on the prediction mode, for example, as described with reference to Fig. 8B. Additionally, the video encoder 114 may consider the selection of the applied transform at the module 330 as a test of further candidates for selection based on rate/distortion cost evaluation.
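The non-separable behaviour described above can be sketched for the upper-left 4x4 case: the sub-block is flattened to a 16-vector and multiplied by a single 16x16 matrix, while all other coefficients pass through unchanged. The identity matrix below stands in for a real NSST core, which this sketch does not reproduce:

```python
# Sketch of a non-separable transform applied only to the top-left 4x4
# coefficients of a larger block.

def apply_secondary(block, matrix):
    """Transform the top-left 4x4 of `block` with a 16x16 `matrix`."""
    flat = [block[i][j] for i in range(4) for j in range(4)]  # flatten 4x4
    out = [sum(matrix[r][c] * flat[c] for c in range(16)) for r in range(16)]
    result = [row[:] for row in block]          # copy: rest passes through
    for i in range(4):
        for j in range(4):
            result[i][j] = out[4 * i + j]       # write back transformed 4x4
    return result

# Placeholder cores for illustration (not real NSST matrices).
identity16 = [[1 if r == c else 0 for c in range(16)] for r in range(16)]
```

Because the matrix mixes all 16 inputs into each output, the operation cannot be split into separate row and column passes.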
  • the transform coefficients 332 are passed to a quantiser module 334.
  • quantisation in accordance with a 'quantisation parameter' is performed to produce residual coefficients, represented by the arrow 336.
  • the quantisation parameter is constant for a given transform block and thus results in a uniform scaling for the production of residual coefficients for a transform block.
  • a non-uniform scaling is also possible by application of a ‘quantisation matrix’, whereby the scaling factor applied for each residual coefficient is derived from a combination of the quantisation parameter and the corresponding entry in a scaling matrix, typically having a size equal to that of the transform block.
  • the quantisation matrix is costly to signal and thus is coded only infrequently (if at all) in the bitstream 115.
  • Coding of the quantisation matrix requires converting the two-dimensional matrix of scaling factors into a list of scaling factors to be entropy encoded into the bitstream 115.
  • the existing Z-order scan may be reused for converting the matrix of scaling factors, avoiding the overhead associated with supporting an additional scan pattern for the infrequently performed operation of encoding a quantisation matrix.
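The uniform versus matrix-based quantisation described above can be sketched as a scalar step size derived from the quantisation parameter, optionally modulated per coefficient by a scaling matrix. The QP-to-step mapping below uses the conventional step = 2^((QP-4)/6) relationship and a nominal matrix value of 16; both are illustrative assumptions, not values taken from this document:

```python
# Sketch of quantisation: uniform scaling from a quantisation parameter,
# or per-coefficient scaling when a quantisation matrix is supplied.

def quantise(coefficients, qp, scaling_matrix=None):
    """Quantise a 2-D coefficient block; uniform unless a matrix is given."""
    step = 2.0 ** ((qp - 4) / 6.0)     # step size doubles every 6 QP
    out = []
    for i, row in enumerate(coefficients):
        out_row = []
        for j, c in enumerate(row):
            scale = step
            if scaling_matrix is not None:
                # matrix entry modulates the step (16 = nominal/unchanged)
                scale = step * scaling_matrix[i][j] / 16.0
            out_row.append(int(round(c / scale)))
        out.append(out_row)
    return out
```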
  • the residual coefficients 336 are supplied to an entropy encoder 338 for encoding in the bitstream 115.
  • the residual coefficients of a transform block are scanned to produce an ordered list of values, according to a scan pattern.
  • the scan pattern generally scans the transform block as a sequence of 4x4 'sub-blocks', providing a regular scanning operation at the granularity of 4x4 sets of residual coefficients, with the arrangement of sub-blocks dependent on the size of the transform block.
  • the prediction mode and the corresponding block partitioning are also encoded in the bitstream 115.
  • the video encoder 114 needs access to a frame representation corresponding to the frame representation seen in the video decoder 134.
  • the residual coefficients 336 are also inverse quantised by a dequantiser module 340 to produce inverse transform coefficients, represented by an arrow 342.
  • the inverse transform coefficients 342 are passed through an inverse secondary transform module 344.
  • the inverse secondary transform module 344 applies the selected secondary transform to produce intermediate inverse transform coefficients, as represented by an arrow 346.
  • the intermediate inverse transform coefficients 346 are supplied to an inverse transform module 348 to produce residual samples, represented by an arrow 350, of the transform unit.
  • a summation module 352 adds the residual samples 350 and the PU 320 to produce reconstructed samples (indicated by an arrow 354) of the CU.
  • the reconstructed samples 354 are passed to a reference sample cache 356 and an in-loop filters module 368.
  • the reference sample cache 356, typically implemented using static RAM on an ASIC (thus avoiding costly off-chip memory access), provides the minimal sample storage needed to satisfy the dependencies for generating intra-frame prediction blocks for subsequent CUs in the frame.
  • the minimal dependencies typically include a 'line buffer' of samples along the bottom of a row of CTUs, for use by the next row of CTUs, and column buffering, the extent of which is set by the height of the CTU.
  • the reference sample cache 356 supplies reference samples to a reference sample filter 360.
  • the sample filter 360 applies a smoothing operation to produce filtered reference samples (indicated by an arrow 362).
  • the filtered reference samples 362 are used by an intra-frame prediction module 364 to produce an intra-predicted block of samples, represented by an arrow 366. For each candidate intra prediction mode the intra-frame prediction module 364 produces a block of samples, i.e. 366.
  • the in-loop filters module 368 applies several filtering stages to the reconstructed samples 354.
  • the filtering stages include a 'deblocking filter' (DBF) which applies smoothing aligned to the CU boundaries to reduce artefacts resulting from discontinuities.
  • Another filtering stage present in the in-loop filters module 368 is the 'adaptive loop filter' (ALF), which applies a Wiener-based adaptive filter to further reduce distortion.
  • a further available filtering stage in the in-loop filters module 368 is the 'sample adaptive offset' (SAO) filter.
  • the SAO filter operates by firstly classifying reconstructed samples into one or multiple categories and, according to the allocated category, applying an offset at the sample level.
  • Filtered samples 370 are output from the in-loop filters module 368.
  • the filtered samples 370 are stored in a frame buffer 372.
  • the frame buffer 372 typically has the capacity to store several (e.g., up to 16) pictures and thus is stored in the memory 206. As such, access to the frame buffer 372 is costly in terms of memory bandwidth.
  • the frame buffer 372 provides reference frames to a motion estimation module 376 and a motion compensation module 380.
  • the motion estimation module 376 estimates a number of‘motion vectors’ (indicated as 378), each being a Cartesian spatial offset from the location of the present CU, referencing a block in one of the reference frames in the frame buffer 372.
  • a filtered block of reference samples (represented as 382) is produced for each motion vector.
  • the filtered reference samples 382 form further candidate modes available for potential selection by the mode selector 386.
  • the PU 320 may be formed using one reference block ('uni-predicted') or may be formed using two reference blocks ('bi-predicted').
  • the motion compensation module 380 produces the PU 320 in accordance with a filtering process supportive of sub-pixel accuracy in the motion vectors.
  • the motion estimation module 376 (which operates on many candidate motion vectors) may conceivably perform a simplified filtering process compared to that of the motion compensation module 380 (which operates on the selected candidate only) to achieve reduced computational complexity.
  • Although the video encoder 114 of Fig. 3 is described with reference to versatile video coding (VVC), other video coding standards or implementations may also employ the processing stages of modules 310-386.
  • the frame data 113 (and bitstream 115) may also be read from (or written to) memory 206, the hard disk drive 210, a CD-ROM, a Blu-ray disk™ or other computer readable storage medium. Additionally, the frame data 113 (and bitstream 115) may be received from (or transmitted to) an external source, such as a server connected to the communications network 220 or a radio-frequency receiver.
  • the video decoder 134 is shown in Fig. 4. Although the video decoder 134 of Fig. 4 is an example of a versatile video coding (VVC) video decoding pipeline, other video codecs may also be used to perform the processing stages described herein. As seen in Fig. 4, the bitstream 133 is input to the video decoder 134.
  • the bitstream 133 may be read from memory 206, the hard disk drive 210, a CD-ROM, a Blu-ray disk™ or other non-transitory computer readable storage medium. Alternatively, the bitstream 133 may be received from an external source such as a server connected to the communications network 220 or a radio-frequency receiver.
  • the bitstream 133 contains encoded syntax elements representing the captured frame data to be decoded.
  • the bitstream 133 is input to an entropy decoder module 420.
  • the entropy decoder module 420 extracts syntax elements from the bitstream 133 and passes the values of the syntax elements to other modules in the video decoder 134.
  • the entropy decoder module 420 applies a CABAC algorithm to decode syntax elements from the bitstream 133.
  • the decoded syntax elements are used to reconstruct parameters within the video decoder 134. Parameters include residual coefficients (represented by an arrow 424) and mode selection information such as an intra prediction mode 458 and an NSST index 454.
  • the mode selection information also includes information such as motion vectors, and the partitioning of each CTU into one or more CUs. Parameters are used to generate PUs, typically in combination with sample data from previously decoded CUs.
  • the residual coefficients 424 are input to a dequantiser module 428.
  • the dequantiser module 428 performs inverse scaling on the residual coefficients 424 to create reconstructed intermediate transform coefficients (represented by an arrow 432) according to a quantisation parameter.
  • the video decoder 134 reads a quantisation matrix from the bitstream 133 as a sequence of scaling factors and arranges the scaling factors into a matrix according to the Z-order scan used for coding of residual coefficients. Then, the inverse scaling uses the quantisation matrix in combination with the quantisation parameter to create the reconstructed intermediate transform coefficients.
  • the reconstructed intermediate transform coefficients 432 are passed to an inverse secondary transform module 436.
  • the inverse secondary transform module 436 performs a 'secondary inverse transform' to produce reconstructed transform coefficients, represented by an arrow 440.
  • the secondary transform is performed according to a determined transform block size, as described with reference to Fig. 9.
  • the reconstructed transform coefficients 440 are passed to an inverse transform module 444.
  • the module 444 transforms the coefficients from the frequency domain back to the spatial domain.
  • the transform block is effectively based on both significant residual coefficients and non-significant (zero-valued) residual coefficient values.
  • the result of operation of the module 444 is a block of residual samples, represented by an arrow 448.
  • the residual samples 448 are equal in size to the corresponding CU.
  • the residual samples 448 are supplied to a summation module 450.
  • the residual samples 448 are added to a decoded PU 452 to produce a block of reconstructed samples, represented by an arrow 456.
  • the reconstructed samples 456 are supplied to a reconstructed sample cache 460 and an in-loop filtering module 488.
  • the in-loop filtering module 488 produces reconstructed blocks of frame samples, represented as 492.
  • the frame samples 492 are written to a frame buffer 496.
  • a reconstructed sample cache 460 operates similarly to the reconstructed sample cache 356 of the video encoder 114.
  • the reconstructed sample cache 460 provides storage for reconstructed samples needed to intra predict subsequent CUs without accessing the memory 206 (e.g., by using the data 232 instead, which is typically on-chip memory).
  • Reference samples, represented by an arrow 464, are obtained from the reconstructed sample cache 460 and supplied to a reference sample filter 468 to produce filtered reference samples, indicated by an arrow 472.
  • the filtered reference samples 472 are supplied to an intra-frame prediction module 476.
  • the module 476 produces a block of intra-predicted samples, represented by an arrow 480, in accordance with an intra prediction mode parameter 458 signalled in the bitstream 133 and decoded by the entropy decoder 420.
  • the intra-predicted samples 480 form the decoded PU 452 via a multiplexor module 484.
  • a motion compensation module 434 produces a block of inter-predicted samples 438 using a motion vector and reference frame index to select and filter a block of samples from a frame buffer 496.
  • the block of samples 498 is obtained from a previously decoded frame stored in the frame buffer 496. For bi-prediction, two blocks of samples are produced and blended together to produce samples for the decoded PU 452.
  • the frame buffer 496 is populated with filtered block data 492 from an in-loop filtering module 488.
  • the in-loop filtering module 488 applies any one, some, or all of the DBF, the ALF and the SAO filtering operations.
  • the in-loop filtering module 488 produces the filtered block data 492 from the reconstructed samples 456.
  • Fig. 5 is a schematic block diagram showing a collection 500 of available divisions or splits of a block into one or more blocks in the tree structure of versatile video coding.
  • the divisions shown in the collection 500 are available to the block partitioner 310 of the encoder 114 to divide each CTU into one or more CUs according to the Lagrangian optimisation, as described with reference to Fig. 3.
  • although the collection 500 shows only square blocks being divided into other, possibly non-square blocks, it should be understood that the diagram 500 shows the potential divisions without constraining the containing block to be square. If the containing block is non-square, the dimensions of the blocks resulting from the division are scaled according to the aspect ratio of the containing block.
  • the particular subdivision of a CTU into one or more CUs by the block partitioner 310 is referred to as the ‘coding tree’ of the CTU.
  • a leaf node is a node at which the process of subdivision terminates. The process of subdivision must terminate when the region corresponding to the leaf node is equal to a minimum coding unit size. Leaf nodes resulting in coding units of the minimum size exist at the deepest level of decomposition of the coding tree. The process of subdivision may also terminate prior to the deepest level of decomposition, resulting in coding units larger than the minimum coding unit size.
  • at the leaf nodes of the coding tree exist CUs, with no further subdivision.
  • a leaf node 510 contains one CU.
  • At the non-leaf nodes of the coding tree exist either a split into two or more further nodes, each of which could either contain one CU or contain further splits into smaller regions.
  • a quad-tree split 512 divides the containing region into four equal-size regions as shown in Fig. 5.
  • versatile video coding achieves additional flexibility with the addition of a horizontal binary split 514 and a vertical binary split 516.
  • Each of the splits 514 and 516 divides the containing region into two equal-size regions. The division is either along a horizontal boundary (514) or a vertical boundary (516) within the containing block.
  • a ternary horizontal split 518 and a ternary vertical split 520 divide the block into three regions, bounded either horizontally (518) or vertically (520) along ¼ and ¾ of the containing region width or height.
  • the combination of the quad tree, binary tree, and ternary tree is referred to as ‘QTBTTT’ or alternatively as a multi-tree (MT).
  • compared to HEVC, which supports only the quad tree and thus only square blocks, QTBTTT results in many more possible CU sizes, especially considering possible recursive application of binary tree and/or ternary tree splits.
  • the potential for unusual (for example, non-square) block sizes may be reduced by constraining split options to eliminate splits that would result in a block width or height either being less than four samples or in not being a multiple of four samples.
  • the constraint would apply in considering luma samples.
  • the constraint may also apply separately to the blocks for the chroma channels, potentially resulting in differing minimum block sizes for luma versus chroma, for example when the frame data is in the 4:2:0 chroma format.
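The split constraint described in the preceding points can be sketched as a simple validity test over the dimensions a candidate split would produce; the function name and the list-of-dimensions interface are illustrative assumptions:

```python
def dims_valid(child_dims):
    """Return True if every resulting block dimension is at least four
    samples and a multiple of four, per the constraint described above.

    child_dims is a list of (width, height) pairs that a candidate split
    would produce for one channel (luma, or chroma after subsampling)."""
    return all(w >= 4 and h >= 4 and w % 4 == 0 and h % 4 == 0
               for w, h in child_dims)
```

A split whose children fail the test would be eliminated from the options considered by the block partitioner.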
  • Fig. 6 is a schematic flow diagram illustrating a data flow 600 of a QTBTTT (or ‘coding tree’) structure used in versatile video coding.
  • the QTBTTT structure is used for each CTU to define a division of the CTU into one or more CUs.
  • the QTBTTT structure of each CTU is determined by the block partitioner 310 in the video encoder 114 and encoded into the bitstream 115 or decoded from the bitstream 133 by the entropy decoder 420 in the video decoder 134.
  • the data flow 600 further characterises the permissible combinations available to the block partitioner 310 for dividing a CTU into one or more CUs, according to the divisions shown in Fig. 5.
  • Quad-tree (QT) split decision 610 is made by the block partitioner 310.
  • the decision at 610 returning a ‘1’ symbol indicates a decision to split the current node into four sub-nodes according to the quad-tree split 512.
  • the result is the generation of four new nodes, such as at 620, and for each new node, recursing back to the QT split decision 610.
  • Each new node is considered in raster (or Z-scan) order.
  • if the decision at 610 returns a ‘0’ symbol, quad-tree partitioning ceases and multi-tree (MT) splits are subsequently considered.
  • an MT split decision 612 is made by the block partitioner 310.
  • the MT split decision 612 indicates whether an MT split is to be performed. Returning a ‘0’ symbol at decision 612 indicates that no further splitting of the node into sub-nodes is to be performed. If no further splitting of a node is to be performed, then the node is a leaf node of the coding tree and corresponds to a coding unit (CU). The leaf node is output at 622.
  • if the MT split decision 612 indicates a decision to perform an MT split (returns a ‘1’ symbol), the block partitioner 310 proceeds to a direction decision 614.
  • the direction decision 614 indicates the direction of the MT split as either horizontal (‘H’ or ‘0’) or vertical (‘V’ or ‘1’).
  • the block partitioner 310 proceeds to a decision 616 if the decision 614 returns a ‘0’ indicating a horizontal direction.
  • the block partitioner 310 proceeds to a decision 618 if the decision 614 returns a ‘1’ indicating a vertical direction.
  • the number of partitions for the MT split is indicated as either two (binary split or‘BT’ node) or three (ternary split or‘TT’) at the BT/TT split. That is, a BT/TT split decision 616 is made by the block partitioner 310 when the indicated direction from 614 is horizontal and a BT/TT split decision 618 is made by the block partitioner 310 when the indicated direction from 614 is vertical.
  • the BT/TT split decision 616 indicates whether the horizontal split is the binary split 514, indicated by returning a ‘0’, or the ternary split 518, indicated by returning a ‘1’.
  • if the BT/TT split decision 616 indicates a binary split, two nodes are generated by the block partitioner 310, according to the binary horizontal split 514.
  • if the BT/TT split 616 indicates a ternary split, three nodes are generated by the block partitioner 310, according to the ternary horizontal split 518.
  • the BT/TT split decision 618 indicates whether the vertical split is the binary split 516, indicated by returning a ‘0’, or the ternary split 520, indicated by returning a ‘1’.
  • if the BT/TT split 618 indicates a binary split, at a generate VBT CTU nodes step 627 two nodes are generated by the block partitioner 310, according to the vertical binary split 516.
  • otherwise (a ternary split), at a generate VTT CTU nodes step 628 three nodes are generated by the block partitioner 310, according to the vertical ternary split 520.
  • recursion of the data flow 600 back to the MT split decision 612 is applied, in a left-to-right or top-to-bottom order, depending on the direction 614.
  • the binary tree and ternary tree splits may be applied to generate CUs having a variety of sizes.
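The child-region sizes produced by each of the splits of Fig. 5 may be sketched as follows; the function and split names are illustrative, and the containing region is w x h samples:

```python
def child_sizes(w, h, split):
    """Dimensions of the regions generated by one split of a w x h region."""
    if split == 'QT':      # quad-tree split 512: four equal quadrants
        return [(w // 2, h // 2)] * 4
    if split == 'BT_H':    # horizontal binary split 514
        return [(w, h // 2)] * 2
    if split == 'BT_V':    # vertical binary split 516
        return [(w // 2, h)] * 2
    if split == 'TT_H':    # ternary horizontal split 518: 1/4, 1/2, 1/4 of height
        return [(w, h // 4), (w, h // 2), (w, h // 4)]
    if split == 'TT_V':    # ternary vertical split 520: 1/4, 1/2, 1/4 of width
        return [(w // 4, h), (w // 2, h), (w // 4, h)]
    raise ValueError(split)
```

Recursive application of these splits to each child yields the variety of CU sizes noted above.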
  • Figs. 7A and 7B provide an example division 700 of a CTU 710 into a number of coding units (CUs).
  • An example CU 712 is shown in Fig. 7A.
  • Fig. 7A shows a spatial arrangement of CUs in the CTU 710.
  • the example division 700 is also shown as a coding tree 720 in Fig. 7B.
  • the contained nodes are scanned or traversed in a ‘Z-order’ to create lists of nodes, represented as columns in the coding tree 720.
  • the coding tree 720 of Fig. 7B lists all nodes and CUs according to the applied scan order. Each split generates a list of 2, 3 or 4 new nodes at the next level of the tree until a leaf node (CU) is reached.
  • Fig. 8A shows a set 800 of intra prediction modes for a transform block that can be indicated using the intra prediction modes 388 and 458.
  • 67 intra prediction modes are defined.
  • Mode 0 is a ‘planar’ intra prediction mode
  • mode 1 is a ‘DC’ intra prediction mode
  • modes 2-66 are ‘angular’ intra prediction modes.
  • the planar intra prediction mode (mode 0) populates a prediction block with samples according to a plane, i.e. having an offset and a gradient horizontally and vertically. The plane parameters are obtained from neighbouring reference samples, where available.
  • the DC intra prediction mode (mode 1) populates the prediction block with an offset, also using neighbouring reference samples (where available).
  • the angular intra prediction modes populate the block by producing a texture aligned to one of 65 directions, or ‘angles’.
  • For clarity, only a subset of the 65 angles is shown in Fig. 8A, being modes 2, 18, 34, 50, and 66.
  • neighbouring reference samples are used to produce a texture that populates the prediction block in the direction indicated by the arrow for the angular intra prediction mode.
  • Additional angles, not explicitly shown in Fig. 8A, are at intermediate positions (i.e. modes 3-17, 19-33, 35-49, and 51-65).
  • a first symmetry is evident from Fig. 8A along angular mode 34 and in a diagonal direction from the top-left to the bottom right of the prediction block. From the first symmetry, modes 2 to 34 are shown to correspond to modes 66 down to 34, with a transposition along this diagonal axis.
  • Fig. 8B shows a table 850 describing a mapping 840 from intra prediction modes to secondary transform sets for a transform block.
  • the intra prediction mode (388 or 458) selects one mode out of sixty-seven possible modes.
  • Mode 0 is the ‘planar mode’ of Fig. 8A.
  • In planar mode, a prediction block of samples is a plane, having an offset and horizontal and vertical gradients established from neighbouring reference samples from previously reconstructed coding units.
  • Mode 1 is the ‘DC’ mode, in which a prediction block of samples is set to a single value, also being derived from neighbouring reference samples from previously reconstructed coding units. Where neighbouring reference samples are not available (e.g. as the current prediction block abuts the frame boundary) the reference samples are either not used or a substitute value is used.
  • Modes 2-66 are the ‘angular’ intra prediction modes.
  • Angular intra prediction modes populate the prediction block by projecting samples derived from the neighbouring reference samples onto the prediction block according to one of sixty-five angles (or directions).
  • a different number of angular modes may be used, for example the high efficiency video coding standard used thirty-three angular modes.
  • the mapping 840 reduces the intra prediction mode, one of 67 discrete values, to a set value, one of 35 discrete values.
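A minimal sketch of such a 67-to-35 mapping, exploiting the symmetry of the angular modes around mode 34 noted with reference to Fig. 8A; the exact table 840 of the working draft may differ from this folding:

```python
def intra_mode_to_set(mode):
    """Fold intra prediction mode 0..66 onto a set index 0..34."""
    if mode <= 34:
        return mode        # planar (0), DC (1) and angular modes 2..34 map directly
    return 68 - mode       # modes 35..66 reuse the sets of modes 33 down to 2
```

Under this folding, the 67 modes collapse onto exactly 35 distinct set values.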
  • the residual samples for the prediction block typically also have a directional bias.
  • Previous attempts to exploit the directional bias tended towards replacing the primary transform with an alternative transform, having basis functions that were dependent upon the intra prediction mode, and attempted to further compress the residual in a mode-dependent manner.
  • the previous approaches tended to be prohibitively complex, especially as the approaches typically entailed replacing the separable primary transform with a non-separable primary transform. As such, complexity was greatly increased, especially for larger block sizes.
  • the non-separable secondary transform limits the complexity increase by reducing the area of application of the non-separable transform to a region not exceeding the upper-left 8x8 samples in each transform block.
  • the upper left 8x8 region tends to contain a large proportion of significant residual coefficients of greater magnitude. Accordingly, the upper left 8x8 region also benefits disproportionately from transforming a subregion of the entire transform block in a mode- dependent manner compared to transforming the entire transform block in a mode-dependent manner.
  • the secondary transform operates upon residual coefficients that are then transformed via the primary transform (from the decoder perspective). Accordingly, the basis functions of the secondary transform are somewhat different to those typical of other transforms, such as a discrete cosine transform for example.
  • the set index is one parameter contributing to the selection of a particular transform matrix used to perform the secondary transform.
  • Fig. 9 is a schematic block diagram showing an inverse secondary transform module 900.
  • the inverse secondary transform module 900 represents the inverse secondary transform module 344 of the video encoder 114 of Fig. 3 or the inverse secondary transform module 436 of the video decoder 134 of Fig. 4.
  • the inverse secondary transform module 900 operates identically in the video encoder 114 and the video decoder 134.
  • the forward secondary transform module 330 operates in accordance with the inverse secondary transform module 344.
  • In execution of the module 900, an intra prediction mode 908 (corresponding to the intra prediction mode 388 or 458) is input to a set table module 910.
  • the set table module 910 determines a set index, represented by an arrow 914, for the intra prediction mode 908 according to the mapping 840 of Fig. 8B.
  • the set index 914 is a value from 0 to 34, although different ranges may be used.
  • Also input to the secondary transform module is an NSST index 304 (corresponding to the NSST index 390 or 454).
  • the NSST index 304 is a value from 0 to 3, with a value of 0 indicating that the non-separable secondary transform is to be bypassed.
  • a block size, indicated by an arrow 918, selects between NSSTs of size 4x4 or NSSTs of size 8x8. The selection between the two sizes is dependent on the transform block size and is described with reference to Fig. 10.
  • the NSST index 304, along with the set index 914 and the block size 918 are supplied to a matrix coefficient table module 920 to select matrix coefficients, represented as an arrow 925.
  • the matrix coefficient table module selects the matrix coefficients 925 from 3x35 sets of matrix coefficients stored in the matrix coefficient table 920.
  • the matrix coefficients 925 are supplied to a matrix multiply module 930.
  • the matrix multiply module 930 operates on incoming coefficients 942 (corresponding to coefficients 342 or 432) to produce coefficients after the secondary transform.
  • the coefficients generated by the module 930, indicated by an arrow 946, correspond to the coefficients 346 or 440.
  • matrix multiplication may be used to implement the transform module 900, the matrix multiplication being provided with a set of multiplication coefficients.
  • Alternative formulations of the secondary transform may be implemented using hypercubic Givens rotations, resulting in a transform structure of lower complexity, by virtue of the butterfly structure.
  • Complexity associated with multiplication is further restrained by limiting the application of the secondary transform, as described with reference to Fig. 10. Notwithstanding the attempts at complexity reduction, the need for the video encoder 114 to select a value of NSST index imposes a complexity burden.
  • the complexity burden is present even in implementations using a two-stage search process, for example an approximate search followed by a precise search.
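The data flow of the module 900 may be sketched as follows; the nested table layout, the helper names, and the folding of the intra prediction mode onto a set index are assumptions made for illustration, not the exact structure of the working draft:

```python
def inverse_secondary_transform(coeffs, intra_mode, nsst_index, use_8x8, tables):
    """Apply the inverse NSST to a flat list of incoming coefficients.

    tables[use_8x8][set_index][nsst_index - 1] is assumed to hold one
    NxN matrix (N = 16 for the 4x4 NSST, 64 for the 8x8 NSST)."""
    if nsst_index == 0:                 # NSST bypassed; coefficients pass through
        return coeffs
    # fold 67 intra modes onto 35 set indices (symmetry around mode 34)
    set_index = intra_mode if intra_mode <= 34 else 68 - intra_mode
    matrix = tables[use_8x8][set_index][nsst_index - 1]
    n = len(coeffs)
    # matrix multiply, as performed by the matrix multiply module 930
    return [sum(matrix[r][c] * coeffs[c] for c in range(n)) for r in range(n)]
```

With an identity matrix in the table, the sketch leaves the coefficients unchanged, which is convenient for testing.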
  • Fig. 10 shows a set 1000 of transform blocks available in the versatile video coding standard. Fig. 10 also shows the application of the secondary transform to a subset of residual coefficients from a transform block. A variety of transform block sizes are shown in Fig. 10. However, one transform block size available in the VVC standard that is not shown in Fig. 10 is 64x64.
  • a 4x4 transform block 1010 has a 4x4 NSST 1052 (shown with dark shading) applied to all 16 residual coefficients.
  • the 4x4 NSST 1052 is applied to regions, as far as possible, in the upper-left 8x8 coefficients of a transform block in a tiled manner (i.e. at each 4x4 sub block) for transforms having a width or height of 4 samples.
  • the 4x4 NSST 1052 is applied to a 4x8 transform block 1020 and an 8x4 transform block 1012, i.e. in sub-blocks having dark shading, with a dashed outline.
  • an 8x8 NSST 1050 (shown with light shading) is available for application to residual coefficients in the upper-left 8x8 region of the transform block.
  • the 8x8 NSST 1050 is applied to an 8x8 transform block 1022, a 16x8 transform block 1024, an 8x16 transform block 1032, a 16x16 transform block 1034, a 32x16 transform block 1036, a 16x32 transform block 1044, a 32x32 transform block 1046, in each case in the region shown with light shading and a dashed outline.
  • the 8x8 NSST 1050 is also applied in the upper-left 8x8 region of a 64x64 transform block (not shown in Fig. 10). Other sizes of transform block, for which the 4x4 NSST 1052 is applied, were described above.
  • a restriction is placed on the application of NSST to prohibit application to highly elongated blocks: blocks for which the aspect ratio (as measured from the longest side relative to the shorter side) exceeds a predetermined threshold do not use the secondary transform.
  • the predetermined threshold is typically an aspect ratio of 2:1 (longest side relative to shortest side).
  • use of the secondary transform is prohibited for transform blocks of size 4x16 (i.e. 1030), 4x32 (i.e. 1040), 8x32 (i.e. 1042), and 16x4, among others; these transform blocks form the NSST prohibited set 1060. Shading corresponding to the NSST 1050 is accordingly not shown in the blocks 1014 and 1016.
  • the aspect ratio is measured from the longest side relative to the shortest side, and use of the NSST is determined based on whether the aspect ratio satisfies (is less than or equal to) the threshold.
  • alternatively, the aspect ratio can be measured from the shortest side relative to the longest side, and the threshold would be of the form 1:2.
  • the threshold is typically determined based upon factors such as computational capability of the module 201, expected processing time, expected throughput and the like.
  • Transforms belonging to the NSST prohibited set 1060 are deemed to receive insufficient compression gain from application of the NSST to justify the added complexity in the mode search performed in the video encoder 114.
  • Transform sizes shown in Fig. 10 other than those in the NSST prohibited set 1060 are the transform block sizes for which NSST can be applied.
  • for transform blocks in the NSST prohibited set 1060, the NSST index is equal to zero and the value need not be coded in the bitstream.
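The elongation test and the selection between the 4x4 and 8x8 NSST sizes described above can be sketched as follows; the function names are illustrative:

```python
def nsst_available(width, height):
    """True if the aspect ratio (longest over shortest side) is at most 2:1,
    i.e. the block is not in the NSST prohibited set."""
    return max(width, height) <= 2 * min(width, height)

def nsst_size(width, height):
    """4x4 NSST when either dimension is 4 samples; otherwise the 8x8 NSST
    applied to the upper-left 8x8 region."""
    return 4 if min(width, height) == 4 else 8
```

For example, an 8x4 block uses the 4x4 NSST, while a 4x16 block is excluded altogether.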
  • Fig. 11 is a flow chart diagram of a method 1100 for selectively applying a non-separable secondary transform (NSST) to encode a transform block of residual coefficients into the bitstream 115.
  • the method 1100 may be embodied by apparatus such as a configured FPGA, an ASIC, or an ASSP. Additionally, the method 1100 may be performed by video encoder 114 under execution of the processor 205. As such, the method 1100 may be stored on computer-readable storage medium and/or in the memory 206. The method 1100 commences with the processor 205 at a determine block structure step 1110.
  • the block partitioner 310 under execution of the processor 205, firstly divides the frame data 113 into a sequence of coding tree units having a fixed size. Each coding tree unit is further divided into one or more coding units in a manner adaptive to the samples in the coding tree unit. As described with reference to Fig. 3, as a result of the operation of the block partitioner 310, a set of coding units 312 is determined for a coding tree unit. Each of coding units 312 is the result of hierarchical splits in a coding tree, as described with reference to Figs. 5 and 6, and exemplified in Fig. 7.
  • Each coding unit has an associated prediction unit 320, and the prediction unit 320 in turn has an associated intra prediction mode 388. Moreover, each coding unit 312 also has an associated transform unit 336. The transform unit 336 in turn has an associated transform block for each colour channel of the frame data 113. Control in the processor 205 progresses from the step 1110 to a determine transform block size step 1120.
  • the block partitioner 310, under execution of the processor 205, determines the size of a transform block for each colour channel of the frame data 113 according to a current coding unit. If the coding unit size in luma samples corresponds to one of the transform block sizes of Fig. 10 (including a 64x64 transform block size), a single transform unit is associated with the coding unit. If none of the transform block sizes of Fig. 10 (including the 64x64 transform block size) match the coding unit size in luma samples, multiple transform units are arranged in a tiled manner to occupy the coding unit. The largest available transform block is used to minimise the number of ‘tiled’ transform units.
  • a 128x128 coding unit may be represented using four 64x64 transform units.
  • a transform block exists of the same size as the transform unit in luma samples.
  • a transform block exists of half the width and height of the luma channel transform block, the reduction in size being a consequence of the chroma subsampling.
  • a 32x16 transform unit is associated with one 32x16 luma transform block and two 16x8 chroma transform blocks.
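The tiling and chroma sizing rules above can be sketched as follows, assuming a maximum transform size of 64x64 luma samples and the 4:2:0 chroma format; names are illustrative:

```python
MAX_TB = 64  # assumed largest available transform size, in luma samples

def transform_units(cu_w, cu_h):
    """Tile a coding unit with the largest available transform units."""
    tu_w, tu_h = min(cu_w, MAX_TB), min(cu_h, MAX_TB)
    return [(tu_w, tu_h)] * ((cu_w // tu_w) * (cu_h // tu_h))

def chroma_block(tu_w, tu_h):
    """4:2:0 chroma transform block: half the width and half the height of
    the luma transform block, due to chroma subsampling."""
    return tu_w // 2, tu_h // 2
```

This reproduces the examples given above: a 128x128 coding unit is tiled by four 64x64 transform units, and a 32x16 transform unit has 16x8 chroma transform blocks.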
  • Control in the processor 205 progresses from the step 1120 to a determine intra prediction mode step 1125.
  • the mode selector 386 under control of the processor 205, selects an intra prediction mode 388 for the prediction unit associated with the current coding unit.
  • the intra prediction mode 388 is typically determined using the modes of Fig. 8A and Fig. 8B as candidate modes. The selection is typically performed in two passes. In a first pass, all intra prediction modes for the luma prediction block are tested. For each intra prediction mode, the residual cost is approximated, for example using a ‘sum of absolute transformed differences’ method (test), such as a Hadamard transform. From the executed method a list of ‘best’ (lowest distortion) candidate prediction modes is derived. A full test of the residual coding is performed on the list of candidate prediction modes. This full test of residual coding is further described in relation to steps 1130 to 1190. Control in the processor 205 progresses from the step 1125 to an apply primary transform step 1130.
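A ‘sum of absolute transformed differences’ cost of the kind used in the first pass may be sketched as below for a 4x4 block of prediction differences; the normalisation factor is an illustrative choice:

```python
# 4x4 Hadamard matrix (entries +/-1)
H4 = [[1,  1,  1,  1],
      [1, -1,  1, -1],
      [1,  1, -1, -1],
      [1, -1, -1,  1]]

def matmul(a, b):
    """Plain 4x4 matrix multiply."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def satd4x4(residual):
    """Approximate cost: sum of absolute Hadamard-transformed differences."""
    t = matmul(matmul(H4, residual), H4)   # 2-D Hadamard transform
    return sum(abs(v) for row in t for v in row) // 4
```

Lower SATD indicates a candidate intra prediction mode whose residual is cheaper to code, so the modes with the lowest SATD survive to the full rate-distortion test.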
  • the transform module 326, under execution of the processor 205, performs a separable transform (a set of one-dimensional transforms performed horizontally and vertically) spanning the entire transform block, producing the intermediate transform coefficients 328.
  • Control in the processor 205 progresses from the step 1130 to an NSST applied test step 1140.
  • the dimensions of the transform block are used to determine an aspect ratio of the transform unit, and the aspect ratio is used to determine if an NSST is to be applied to the intermediate transform coefficients 328.
  • the aspect ratio is expressed in terms of the longest side dimension to the shorter (or equal) side dimension.
  • block sizes of 4x8 and 8x4 are both considered to have an aspect ratio of 2:1 (rather than 1:2 and 2:1 respectively)
  • the determined measure of aspect ratio is used to determine the degree of ‘elongation’ of the transform block. Blocks having an aspect ratio exceeding 2:1 are deemed to be ‘highly elongated’, and belong to the set 1060.
  • if the transform block is highly elongated, step 1140 returns “No”. Upon returning “No” at step 1140, no indication of a selection for NSST is encoded into the bitstream 115 and control in the processor 205 progresses to a quantise coefficients step 1180. Otherwise, if step 1140 returns “Yes” (the aspect ratio is 2:1 or less), an indication of a selection for NSST will be encoded into the bitstream 115 and control in the processor 205 progresses to a determine NSST index step 1150.
  • At the determine NSST index step 1150 the mode selector 386, under control of the processor 205, sets different NSST index values (zero to three) for the current transform block.
  • the step 1150 executes to set four different index values for the current transform block, so that the full range of potential NSST indices is set.
  • the method 1100 continues from step 1150 to a determine NSST matrix step 1160.
  • Step 1160 operates to determine a corresponding NSST matrix for each of the NSST indices set at step 1150. A value of zero for the NSST index 390 indicates that the NSST is bypassed, and no NSST matrix is set.
  • Values from one to three indicate selection of one of three possible transform matrices for the transform block.
  • a corresponding NSST matrix is selected for each of the NSST index values at the determine NSST matrix step 1160 for the intra prediction mode 388, as described with reference to Fig. 9.
  • the method 1100 continues from step 1160 to an apply secondary transform step 1170.
  • using the NSST matrix determined at step 1160 for each NSST index, an evaluation of the secondary transform is applied at the step 1170 to produce the transform coefficients 332.
  • if the intra prediction mode is greater than 34 (mode 34 being a 45 degree diagonal down-left mode and modes 35-66 being symmetrical with modes 33 down to 2), the block of coefficients is transposed before and after application of the secondary transform at step 1170.
  • the transposition exploits the symmetry of the angles of each intra prediction mode around mode 34 to reduce the number of NSST matrices.
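The transposition for modes above 34 can be sketched as follows; the callable `secondary_transform` stands in for the matrix multiply of Fig. 9 and is an illustrative interface:

```python
def transpose(block):
    """Transpose a 2-D block of coefficients."""
    return [list(row) for row in zip(*block)]

def apply_with_symmetry(block, intra_mode, secondary_transform):
    """Transpose before and after the secondary transform for modes > 34,
    so one NSST matrix serves both symmetric angular modes."""
    if intra_mode > 34:
        block = transpose(block)
    block = secondary_transform(block)
    if intra_mode > 34:
        block = transpose(block)
    return block
```

With an identity transform, the two transpositions cancel, leaving the block unchanged for any mode.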
  • the NSST index resulting in the lowest coding cost of the transform coefficients 332 is selected as the NSST index 390 for the current transform block.
  • step 1170 also considers the coding cost (“rate”) of the associated residual and resulting error (“distortion”) and is known as a ‘rate-distortion’ check. For reduced complexity, approximations of either or both of the rate and the distortion may be used.
  • An encode NSST index step 1195 is also flagged for subsequent execution by the processor 205. Control in the processor 205 progresses from step 1170 to the quantise coefficients step 1180.
  • the quantiser module 334 under execution of the processor 205, performs quantisation.
  • the transform coefficients 332 are quantised at step 1180 if the determination of the NSST applied test step 1140 indicated application of the NSST.
  • the intermediate transform coefficients 328 are quantised at step 1180 if the determination of the NSST applied test step 1140 indicates no application of the NSST.
  • a quantisation parameter is applied at execution of step 1180 to produce residual coefficients 336, with the quantisation parameter representing a step size.
  • a larger quantisation parameter results in residual coefficients of smaller magnitude, which may be compressed into fewer bits. Larger quantisation parameters result in more residual coefficients having a value of zero (being ‘insignificant’).
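The effect of the quantisation parameter can be illustrated with a simplified quantiser; the step-size law (doubling every six QP steps) and the truncating division are assumptions for illustration, not the rounding behaviour of the working draft:

```python
def quantise(coeffs, qp):
    """Divide transform coefficients by a QP-derived step size; larger QP
    gives smaller magnitudes and more zero ('insignificant') coefficients."""
    step = 2 ** (qp / 6.0)  # hypothetical step-size law
    return [int(c / step) for c in coeffs]
```

At QP 12 the step size is four, while at QP 24 it is sixteen, so small coefficients vanish entirely at the higher QP.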
  • Control in the processor 205 progresses from step 1180 to an encode residual coefficients step 1190.
  • the entropy encoder 338 under execution of the processor 205, encodes the residual coefficients 336 into the bitstream 115.
  • the residual coefficients 336 are scanned into a list of coefficients, and coded into the bitstream 115 in an order from the last significant residual coefficient back to the top-left residual coefficient.
  • the scan order may partition the transform block into sub-blocks, with a flag used to indicate the presence of significant residual coefficients at the sub-block level. Further flags are used to indicate the presence of significant residual coefficients within each sub-block.
  • the flags may use context coded bins to achieve higher compression efficiency, given the bias towards insignificant residual coefficients, especially at higher quantisation parameter values.
  • Bypass coded bins may be used to further encode the magnitude of residual coefficients.
  • a truncated Golomb Rice coding scheme with a Rice parameter adaptive to residual coefficient magnitude may be employed.
  • the intra prediction mode 388 is also encoded into the bitstream.
  • the intra prediction mode is typically encoded either as the selection of one mode out of several modes populating a ‘most probable mode list’ or as a ‘remaining mode’ coding scheme, enabling coding of modes not included in the most probable mode list.
  • the method 1100 progresses under control of the processor 205 from step 1190 to a check NSST encoding step 1193.
  • the step 1193 operates to check if the encode NSST index step 1195 is flagged. If the encode NSST index step 1195 was flagged for execution at step 1170 (“Yes” at step 1193), control in the processor 205 progresses from step 1193 to step 1195. If the encode NSST index step 1195 was not flagged for execution (“No” at step 1193), the method 1100 terminates after execution of step 1190.
  • at the encode NSST index step 1195, the entropy encoder 338, under control of the processor 205, encodes the NSST index determined at the step 1150 into the bitstream 115.
  • a truncated unary (TU) bin string is used, as NSST indices of zero are most likely and thus use the shortest bin string.
  • Methods of using a truncated unary bin string to encode the NSST index 390 in the bitstream are known in relation to the VVC standard.
  • if the intra prediction mode 388 is DC or planar, the NSST index 390 is restricted in value to zero to two (the value three is prohibited); otherwise the NSST index 390 is in the range of zero to three.
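The truncated unary binarisation and the DC/planar restriction described above can be sketched as follows; the bin polarity is an illustrative choice:

```python
def nsst_index_bins(index, intra_mode):
    """Truncated unary bin string for the NSST index: index zero gets the
    shortest string, and the terminating bin is omitted at the maximum."""
    max_index = 2 if intra_mode in (0, 1) else 3  # DC/planar restricted to 0..2
    assert 0 <= index <= max_index
    bins = '1' * index              # unary prefix
    if index < max_index:
        bins += '0'                 # terminating bin, truncated at the maximum
    return bins
```

Index zero, the most likely value, thus costs a single bin, matching the rationale given above for using a truncated unary bin string.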
  • Fig. 12 is a flow chart diagram of a method 1200 for decoding a transform block of residual coefficients from the bitstream 133 by selectively applying a non-separable secondary transform.
  • the method 1200 may be embodied by apparatus such as a configured FPGA, an ASIC, or an ASSP. Additionally, the method 1200 may be performed by the video decoder 134 under execution of the processor 205. As such, the method 1200 may be stored on a computer-readable storage medium and/or in the memory 206.
  • the method 1200 commences with the processor 205 at a decode block structure step 1210.
  • the computational complexity of the video encoder 114 and the video decoder 134 is reduced, without excessive penalty in terms of compression efficiency.
  • the entropy decoder 420 under execution of the processor 205, decodes a coding tree from the bitstream 133 for a coding tree unit, thereby decoding the block structure.
  • the coding tree unit is determined by the block partitioner module 310 at the time of encoding the bitstream.
  • the coding tree unit is divided into one or more coding units, as described with reference to Figs. 5 and 6. Control in the processor 205 progresses from step 1210 to a determine transform block size step 1220.
  • the video decoder 134 determines the size of transform blocks for a coding unit (with an iteration over all coding units of the coding tree unit in accordance with the coding tree of the step 1210). If the size of the coding unit in luma samples is equal to one of the transform block sizes of Fig. 10, or 64x64 luma samples, one transform unit is associated with the coding unit. If the size of the coding unit in luma samples is larger than any of the transform block sizes of Fig. 10, or 64x64 luma samples, transform units of the largest available size are tiled to occupy the coding unit.
  • a 128x128 coding unit is occupied by four 64x64 transform units.
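The tiling rule described above can be sketched as follows (illustrative Python; `max_tu = 64` reflects the largest transform size mentioned in the text):

```python
def tile_transform_units(cu_w: int, cu_h: int, max_tu: int = 64):
    """Return (x, y, w, h) tuples of the transform units covering a
    coding unit. A CU no larger than max_tu in both dimensions is
    covered by a single TU; larger CUs are tiled with TUs of the
    largest available size."""
    tu_w, tu_h = min(cu_w, max_tu), min(cu_h, max_tu)
    return [(x, y, tu_w, tu_h)
            for y in range(0, cu_h, tu_h)
            for x in range(0, cu_w, tu_w)]
```

For a 128x128 coding unit this yields the four 64x64 transform units of the example above; with the 4:2:0 chroma format the corresponding chroma transform blocks would have half the width and half the height.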
  • for the luma channel, a transform block size equal to the transform unit size is determined.
  • for the chroma channels, a transform block size of half the width and half the height of the transform unit size is determined.
  • the halving in side dimensions for the chroma blocks is a consequence of use of the 4:2:0 chroma format.
  • Control in the processor 205 progresses from step 1220 to a determine intra prediction mode step 1225. At the determine intra prediction mode step 1225, an intra prediction mode (458) is determined for a prediction block.
  • Each coding unit is associated with a prediction unit and each prediction unit is associated with one prediction block per colour channel.
  • a luma intra prediction mode is decoded for the luma prediction block and a chroma intra prediction mode is decoded for application to both of the chroma prediction blocks.
  • a list of most probable modes is determined by the entropy decoder 420 and a context coded bin is decoded from the bitstream 133 indicative of use of a mode from the most probable mode list. Either an index into the most probable mode list or an indication of the selection of one of the remaining modes is decoded from the bitstream 133.
  • a chroma intra prediction mode is also decoded from the bitstream 133.
  • a syntax element specifies either the use of one of a subset of the 67 available intra prediction modes (Figs. 8A and 8B), or the use of the luma intra prediction mode of the collocated luma prediction block (“DM CHROMA”).
  • the coding tree for chroma coding units may differ from the coding tree for luma coding units. Generally, the coding tree for chroma coding units may be as deep as, but not deeper than, the coding tree for luma coding units. In arrangements using different coding trees for luma and chroma coding units, the chroma prediction block may be collocated with several luma prediction blocks. If so, a list of candidate modes for DM CHROMA is prepared, from luma prediction blocks that may be collocated with various locations within the chroma prediction block. For example, the four corners and centre of the chroma prediction block may serve as locations for the generation of up to five distinct prediction modes. In arrangements using a list of candidate modes, an index is further decoded from the bitstream 133 when DM CHROMA is indicated to select which one of the luma candidate intra prediction modes is to be applied to the chroma prediction blocks.
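The candidate locations for gathering collocated luma modes can be sketched as follows. The exact sample positions used by the standard are an assumption here; the sketch simply realises the "four corners plus centre" description above.

```python
def dm_chroma_candidate_positions(x0: int, y0: int, w: int, h: int):
    """Sample positions inside a chroma prediction block at which the
    collocated luma intra modes are gathered: the four corners plus the
    centre, giving up to five distinct candidate modes."""
    return [
        (x0, y0),                      # top-left corner
        (x0 + w - 1, y0),              # top-right corner
        (x0, y0 + h - 1),              # bottom-left corner
        (x0 + w - 1, y0 + h - 1),      # bottom-right corner
        (x0 + w // 2, y0 + h // 2),    # centre
    ]
```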
  • separate intra prediction modes may be independently coded for each chroma channel. Separate control of the prediction mode in each chroma channel can enable the block partitioner 310 to less frequently split the chroma coding tree, resulting in fewer chroma transform blocks and hence fewer associated sets of residual coefficients in the bitstream 133.
  • control in the processor 205 progresses from step 1225 to a decode residual coefficients step 1230.
  • the entropy decoder 420 under execution of the processor 205, decodes residual coefficients 424 from the bitstream 133 for a transform block. Decoding residual coefficients is performed for each transform block associated with a transform unit, there generally being one transform unit associated with each coding unit.
  • a sequence of residual coefficients is decoded and assembled into a two-dimensional array (i.e. 424) according to a scan pattern. Scanning generally progresses from the last significant residual coefficient back to the top-left residual coefficient, with the scan grouping coefficients in each 4x4 sub-block of the array. Control in the processor 205 then progresses from step 1230 to an inverse quantise step 1240.
  • the dequantiser 428 under execution of the processor 205, applies a quantisation parameter to convert the residual coefficients 424 into the reconstructed intermediate transform coefficients 432.
  • Control in the processor 205 progresses from step 1240 to an NSST applied test step 1250.
  • the video decoder 134 uses the determined transform block size from the step 1220 to determine if an inverse non-separable transform is to be applied.
  • a block aspect ratio is derived in execution of step 1250, as the ratio of the longer side to the shorter (or equal length) side. The block aspect ratio thus forms a measure of the degree of elongation of the transform block, regardless of the orientation of the transform block. For example, transform blocks of size 8x32 and 32x8 both have a block aspect ratio of 4:1. If the transform block has a block aspect ratio exceeding a predetermined threshold, typically 2:1 (in other words, the transform block size is one of the prohibited sizes 1060 of Fig. 10), the step 1250 returns “No”, and control in the processor 205 progresses to an apply primary transform step 1290. Otherwise (if the aspect ratio does not exceed 2:1), the step 1250 returns “Yes”, determining a presence of a NSST selection for the transform block in the bitstream 133. If the step 1250 returns “Yes”, control in the processor 205 progresses to a decode NSST index step 1260. Accordingly, presence of a NSST selection for the transform block, implemented via the NSST index 390, is determined based upon the determined aspect ratio.
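The orientation-independent aspect ratio test of step 1250 can be sketched as follows (illustrative Python; the 2:1 default mirrors the typical threshold named above):

```python
def nsst_selection_present(tb_w: int, tb_h: int, max_ratio: int = 2) -> bool:
    """Step 1250 sketch: an NSST index is expected in the bitstream only
    when the transform block's elongation does not exceed the threshold.
    The ratio is longest side over shortest side, so 8x32 and 32x8 are
    treated identically (both 4:1)."""
    ratio = max(tb_w, tb_h) // min(tb_w, tb_h)
    return ratio <= max_ratio
```

Raising `max_ratio` to 4 or lowering it to 1 reproduces the alternative arrangements discussed later, which trade compression efficiency against complexity reduction.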
  • the entropy decoder 420 under execution of the processor 205, decodes the NSST index 454 from the bitstream 133.
  • the NSST index 454 is a value in the range of zero to two.
  • the intra prediction mode of the associated prediction block is an angular intra prediction mode
  • the NSST index is a value in the range of zero to three.
  • the NSST index 454 is decoded by parsing a truncated unary string, with separate contexts for each bin of the truncated unary binarisation and separate sets of contexts for each of the two NSST index ranges. Decoding the NSST index 454 from the bitstream is known in relation to the VVC standard. Control in the processor 205 progresses from step 1260 to a determine NSST matrix step 1270.
  • the module 436 under execution of the processor 205, determines which matrix is to be used to implement the non-separable secondary transform.
  • the intra prediction mode 458 is mapped via the set table 910 to a set index 914. If the NSST index 454 is non-zero, the set index 914 is used in conjunction with the NSST index 454 to select one set of matrix multiplication coefficients 925 from the matrix coefficient table 920. Steps 1260 and 1270 effectively operate to decode the NSST selection for the transform block from the bitstream 133.
  • Control in the processor 205 progresses from step 1270 to an apply secondary inverse transform step 1280.
  • the matrix multiplication coefficients 925 are used in conjunction with the reconstructed intermediate transform coefficients 432 to produce the reconstructed transform coefficients 440. If the NSST index 454 has a value of zero, the matrix multiplication 930 is bypassed.
  • while the secondary transform has been described with reference to a matrix multiplication performed by the processor 205, the secondary transform may also be implemented as a butterfly structure using Givens rotations for reduced complexity at step 1280.
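A single stage of such a butterfly can be sketched as a plane (Givens) rotation of one coefficient pair. This is illustrative only: the rotation angles and the wiring of a full NSST factorisation are not reproduced here.

```python
import math

def givens_rotate(a: float, b: float, theta: float):
    """Rotate the coefficient pair (a, b) by angle theta. A layered
    network of such 2-point butterflies can approximate a full
    non-separable matrix multiplication with far fewer multiplies,
    while preserving the energy of each coefficient pair."""
    c, s = math.cos(theta), math.sin(theta)
    return c * a + s * b, -s * a + c * b
```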
  • Application of the NSST matrix operates to assist in decoding the transform block.
  • the block of coefficients is transposed before and after the secondary inverse transform is applied in execution of the apply secondary transform step 1280.
  • the secondary transform selection is also dependent upon the transform block size, with separate transforms for 4x4 and 8x8 groups of coefficients defined, and applied to transform blocks of sizes as described with reference to Fig. 10. Control in the processor 205 progresses from step 1280 to the apply primary transform step 1290.
  • the inverse transform module 444 under execution of the processor 205, performs a primary inverse transform on the reconstructed transform coefficients 440 to produce the residual samples 448.
  • the primary inverse transform is applied as a separable transform, applied vertically and horizontally over the entire transform block.
  • the residual samples 448 are available for summing with samples from the prediction block (i.e. 452) to form reconstructed samples 456.
  • the reconstructed samples 456 are further used as reference samples for intra prediction for future prediction blocks (i.e. via caching in the module 460) and output after in-loop filtering.
  • the steps 1280 and 1290 operate to decode the transform block.
  • the method 1200 terminates upon execution of the step 1290.
  • the NSST index (390 or 454) is present in the bitstream (133 or 115 respectively) after the residual coefficients.
  • the NSST index may be coded once for a coding unit or once for each transform block within a coding unit.
  • the processor 205 evaluates the NSST index 390 across all transform blocks associated with the coding unit. The cost of evaluation of the NSST index may be reduced by considering application only when more than one residual coefficient in the coding unit is significant. When this is not the case (i.e. the coding unit contains only zero or one significant residual coefficient), the NSST is not applied (and the steps 1140 and 1250 operate accordingly).
  • the secondary transform is also bypassed.
  • the NSST index is not signalled, as there is no transform block for which a secondary transform will be applied.
  • the determine NSST index step 1150 introduces relatively high complexity owing to evaluation of each possible NSST index for each transform block, the evaluation involving performing the non-separable secondary transform for each non-zero NSST index 390.
  • the determine NSST index step 1150 may execute at increased speed via early search termination methods. In one example, if after testing a transform block at a given NSST index value there are no significant residual coefficients (i.e. application of the NSST resulted in eliminating all significant coefficients from the applied region, and none were present in any unapplied region of the transform block), searching for further NSST index values is aborted.
  • the initial candidate intra prediction modes i.e.
  • the aspect ratio beyond which NSST application is prohibited is set to 2:1.
  • the aspect ratio threshold may be set to 4:1. Using the aspect ratio threshold of 4:1 results in prohibited transform block sizes of 4x32 and 32x4. The less severe restriction of the 4:1 aspect ratio results in less compression efficiency loss compared to having no prohibition; however, a smaller complexity reduction is realised.
  • the threshold aspect ratio beyond which NSST application is prohibited is set to 1:1.
  • NSST is only implemented in the bitstream for square blocks such as the transform blocks 1010, 1022, 1034, and 1046 shown in Fig. 10, and the 64x64 transform block not shown in Fig. 10.
  • Arrangements that only allow NSST for square blocks have a lower encoder complexity compared to an arrangement which also allows NSST for at least some non-square block sizes. However, the absence of NSST for at least some non-square block sizes reduces the coding gain that can be achieved.
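The effect of the different thresholds on the prohibited set can be enumerated with a short sketch (illustrative Python, over the block side lengths of Fig. 10):

```python
def prohibited_tb_sizes(threshold: int, sides=(4, 8, 16, 32)):
    """Transform block sizes whose longest-to-shortest side ratio
    exceeds the threshold, and for which NSST is therefore prohibited.
    threshold=2 gives six sizes (the set 1060), threshold=4 gives only
    4x32 and 32x4, and threshold=1 prohibits every non-square size."""
    return [(w, h) for w in sides for h in sides
            if max(w, h) // min(w, h) > threshold]
```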
  • the NSST index may be signalled once per transform unit, with one transform unit corresponding to one coding unit, rather than once per transform block.
  • the determine NSST index step 1150 considers the coding cost on each transform block (i.e. the transform block associated with each colour channel) of the transform unit when determining the NSST index. Signalling the NSST index once per transform unit leads to a reduction in the signalling overhead of coding NSST indices, at the expense of restricting the ability to independently choose different secondary transforms for different colour channels.
  • the use of a single NSST index for all transform blocks in the transform unit does not imply that each colour channel uses the same secondary transform, as the intra prediction mode for each chroma channel may differ from the intra prediction mode for the luma channel.
  • the NSST index (390 or 454) is determined, signalled, and applied only to the luma colour channel.
  • Application of secondary transforms to only the luma channel excludes the consideration of chroma channels for use of secondary transforms.
  • the residual coefficients in the chroma channels tend to have smaller magnitude than the residual signal of the luma channel, owing at least in part to the decorrelative property of the use of colour spaces such as YCbCr. Colour channels whose residual coefficients have smaller magnitude also achieve less benefit from the application of the secondary transforms.
  • Exclusion of the chroma channels from secondary transformation reduces complexity both in the video encoder 114, due to reduced searching, and in the video decoder 134, due to the absence of secondary transform logic for the chroma channels.
  • the reduction in complexity corresponds with lower compression efficiency compared to application to all colour channels.
  • the luma channel still receives the compression advantage of use of secondary transforms, where the presence of more residual energy, in the form of higher magnitude residual coefficients, allows the use of secondary transforms to achieve higher compression gain compared to the gain achievable from application to the chroma channels.
  • a smaller quantisation step size is applied to residual coefficients subject to secondary transformation.
  • Application of a smaller quantisation step size is achieved by the steps 1180 and 1240 applying a modified quantisation parameter.
  • the quantisation parameter is decremented by one for the residual coefficients for which NSST is applied, such as 1052 and 1050 in Fig. 10.
  • the smaller step size of the modified quantisation parameter increases the precision (reduces the quality loss from quantisation) of the reconstructed transform coefficients (346) for which the quantisation parameter is applied.
  • the increase in precision is at the expense of an increase in coding cost of the associated residual coefficients.
  • the benefit of the secondary transform appears primarily as a reduction in distortion in the decoded video, rather than a reduction in the coding cost or bit rate.
  • the quantisation parameter is an integer value, where each increment by one results in a relatively large increase in the quantisation step size.
  • An alternative to modifying the quantisation parameter directly is to modify the translation of the quantisation parameter into the quantisation step size.
  • the quantisation parameter has an exponential relationship to the quantisation step size such that every six increments of the quantisation parameter results in a doubling of the quantisation step size.
  • the intermediate quantisation steps are derived from a look-up table comprising integers, approximating the exponential relationship between each doubling of the quantisation step size. The precision of the integers of the look-up table is less than what would be required had floating point arithmetic been used.
  • the modified quantisation may be implemented using a modified look-up table, enabling a step size change smaller than one quantisation parameter decrement to be achieved.
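The exponential QP-to-step-size mapping and its integer look-up table can be sketched as follows. The six-entry table uses HEVC-style values as an assumption (the VVC table may differ), and a finer-grained table would realise the sub-QP step change described above.

```python
# Integer approximations of 64 * 2^(k/6) for k = 0..5 (assumed,
# HEVC-style values).
LEVEL_SCALE = [40, 45, 51, 57, 64, 72]

def quant_step(qp: int) -> float:
    """Quantisation step size for a given quantisation parameter:
    doubles every six QP increments, with the intermediate points
    taken from the integer look-up table."""
    return LEVEL_SCALE[qp % 6] * (1 << (qp // 6)) / 64.0
```

With these values, every increment of the quantisation parameter scales the step size by roughly the sixth root of two, so decrementing QP by one (as in the arrangement above) shrinks the step size by about 11%.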
  • the modified quantisation is only applied when the secondary transform (NSST) is applied, i.e. for NSST index values (390 or 454) greater than zero.
  • NSST secondary transform
  • the NSST index signalling is unaffected by the prohibition on applying the non-separable secondary transform to particular transform block sizes (i.e. 1060).
  • the NSST index is signalled for all transform block sizes, aside from other restrictions on signalling resulting from residual coefficient count and transform skip usage.
  • the video encoder 114 may still realise a complexity reduction by still applying the restriction on searching non-zero NSST index values for the prohibited NSST transform blocks 1060.
  • a conformance constraint may be said to exist for the bitstream (i.e. 115, 133) whereby the NSST index value is required to be zero when the transform block size is one of the NSST prohibited transform block sizes 1060.
  • the video decoder 134 may realise a complexity reduction by not implementing support for performing the non-separable secondary transform for transform blocks of specific sizes, i.e. transform blocks of the set 1060.


Abstract

A system and method of decoding a transform block in an image frame from a bitstream. The method comprises determining an aspect ratio for a transform block of the bitstream and determining, in the bitstream, a presence of a non-separable secondary transform (NSST) selection for the transform block based on the determined aspect ratio. The method further comprises decoding, from the bitstream, the NSST selection for the transform block; and decoding the transform block in the image frame by applying the decoded NSST selection to the transform block.

Description

METHOD, APPARATUS AND SYSTEM FOR ENCODING AND DECODING A TRANSFORMED BLOCK OF VIDEO SAMPLES
REFERENCE TO RELATED APPLICATION(S)
[0001] This application claims the benefit of priority from Australian Patent Application No. 2018204775, filed on 29 June 2018, hereby incorporated by reference in its entirety as if fully set forth herein.
TECHNICAL FIELD
[0002] The present invention relates generally to digital video signal processing and, in particular, to a method, apparatus and system for encoding and decoding a transformed block of video samples. The present invention also relates to a computer program product including a computer readable medium having recorded thereon a computer program for encoding and decoding a transformed block of video samples.
BACKGROUND
[0003] Many applications for video coding currently exist, including applications for transmission and storage of video data. Many video coding standards have also been developed and others are currently in development. Recent developments in video coding standardisation have led to the formation of a group called the “Joint Video Experts Team” (JVET). The Joint Video Experts Team (JVET) includes members of Study Group 16, Question 6 (SG16/Q6) of the Telecommunication Standardisation Sector (ITU-T) of the International Telecommunication Union (ITU), known as the “Video Coding Experts Group” (VCEG), and members of the International Organisation for Standardisation / International Electrotechnical Commission Joint Technical Committee 1 / Subcommittee 29 / Working Group 11 (ISO/IEC JTC1/SC29/WG11), also known as the “Moving Picture Experts Group” (MPEG).
[0004] The Joint Video Experts Team (JVET) issued a Call for Proposals (CfP), with responses analysed at its 10th meeting in San Diego, USA. The submitted responses demonstrated video compression capability significantly outperforming that of the current state-of-the-art video compression standard, i.e. “high efficiency video coding” (HEVC). On the basis of this outperformance it was decided to commence a project to develop a new video compression standard, to be named ‘versatile video coding’ (VVC). VVC is anticipated to address ongoing demand for ever-higher compression performance, especially as video formats increase in capability (e.g., with higher resolution and higher frame rate) and address increasing market demand for service delivery over WANs, where bandwidth costs are relatively high. At the same time, VVC must be implementable in contemporary silicon processes and offer an acceptable trade-off between the achieved performance versus the implementation cost (for example, in terms of silicon area, CPU processor load, memory utilisation and bandwidth).
[0005] Video data includes a sequence of frames of image data, each of which include one or more colour channels. Generally, there is one primary colour channel and two secondary colour channels. The primary colour channel is generally referred to as the ‘luma’ channel and the secondary colour channel(s) are generally referred to as the ‘chroma’ channels. Although video data is typically displayed in an RGB (red-green-blue) colour space, the colour space has a high degree of correlation between the three respective components. The video data representation seen by an encoder or a decoder often uses a colour space such as YCbCr. YCbCr concentrates luma in a Y (primary) channel and chroma in Cb and Cr (secondary) channels. Moreover, the Cb and Cr channels may be sampled at a lower rate compared to the luma channel, for example half horizontally and half vertically - known as a ‘4:2:0 chroma format’.
[0006] VVC is a ‘block based’ codec, in which frames are divided into blocks and the blocks are processed in a particular order. For each block a prediction of the contents of the block is generated and a representation of the difference (or ‘residual’ in the spatial domain) between the prediction and the actual block contents seen as input to the encoder is formed. The difference may be coded as a sequence of residual coefficients, resulting from the application of a forward transform, such as a Discrete Cosine Transform (DCT) or other transform, to the block of residual values. Moreover, a secondary transform may also be applied to achieve further compression efficiency. The resulting coefficients are quantised, resulting in a loss of precision in exchange for a reduction in the compressed size of the residual data. Application of the secondary transform, while desirable from the perspective of further reducing the coding cost of the residual coefficients, introduces complexity into the design, both with respect to the secondary transform itself and due to the increased complexity in the encoder to select one from multiple available transform matrices to apply as the secondary transform.

SUMMARY
[0007] It is an object of the present invention to substantially overcome, or at least ameliorate, one or more disadvantages of existing arrangements.
[0008] One aspect of the present disclosure provides a method of decoding a transform block in an image frame from a bitstream, the method comprising: determining an aspect ratio for a transform block of the bitstream; determining, in the bitstream, a presence of a non-separable secondary transform (NSST) selection for the transform block based on the determined aspect ratio; decoding, from the bitstream, the NSST selection for the transform block; and decoding the transform block in the image frame by applying the decoded NSST selection to the transform block.
[0009] According to another aspect, the presence of the NSST selection is determined if the aspect ratio is 2:1.
[00010] According to another aspect, the presence of the NSST selection is determined if the aspect ratio is less than or equal to a predetermined threshold, the aspect ratio expressed as a longest side of the transform block to a shortest side of the transform block.
[00011] Another aspect of the present disclosure provides a non-transitory computer-readable medium having a computer program stored thereon to implement a method of decoding a transform block in an image frame from a bitstream, the program comprising: code for determining an aspect ratio for a transform block; code for determining, in the bitstream, a presence of a non-separable secondary transform (NSST) selection for the transform block based on the determined aspect ratio; code for decoding, from the bitstream, the NSST selection for the transform block; and code for decoding the transform block in the image frame by applying the decoded NSST selection to the transform block.
[00012] Another aspect of the present disclosure provides a system, comprising: a memory; and a processor, wherein the processor is configured to execute code stored on the memory for implementing a method of decoding a transform block in an image frame from a bitstream, the method comprising: determining an aspect ratio for a transform block; determining, in the bitstream, a presence of a non-separable secondary transform (NSST) selection for the transform block based on the determined aspect ratio; decoding, from the bitstream, the NSST selection for the transform block; and decoding the transform block in the image frame by applying the decoded NSST selection to the transform block.
[00013] Another aspect of the present disclosure provides a video encoder configured to decode a transform block in an image frame from a bitstream, comprising: a memory; a processor configured to execute code stored on the memory to: determine an aspect ratio for a transform block; determine, in the bitstream, a presence of a non-separable secondary transform (NSST) selection for the transform block based on the determined aspect ratio; decode, from the bitstream, the NSST selection for the transform block; and decode the transform block in the image frame by applying the decoded NSST selection to the transform block.
[00014] Other aspects are also disclosed.
BRIEF DESCRIPTION OF THE DRAWINGS
[00015] At least one example embodiment of the present invention will now be described with reference to the following drawings, in which:
[00016] Fig. 1 is a schematic block diagram showing a video encoding and decoding system;

[00017] Figs. 2A and 2B form a schematic block diagram of a general purpose computer system upon which one or both of the video encoding and decoding system of Fig. 1 may be practiced;

[00018] Fig. 3 is a schematic block diagram showing functional modules of a video encoder;
[00019] Fig. 4 is a schematic block diagram showing functional modules of a video decoder;
[00020] Fig. 5 is a schematic block diagram showing the available divisions of a block into one or more blocks in the tree structure of versatile video coding;
[00021] Fig. 6 is a schematic illustration of a dataflow to achieve permitted divisions of a block into one or more blocks in a tree structure of versatile video coding;
[00022] Fig. 7 is an example division of a coding tree unit (CTU) into a number of coding units (CUs);
[00023] Fig. 8A is a diagram showing intra prediction modes;

[00024] Fig. 8B is a table showing a mapping from intra prediction modes to secondary transform sets for a transform block;
[00025] Fig. 9 is a schematic block diagram showing the inverse secondary transform module of the video encoder of Fig. 3 or the video decoder of Fig. 4;
[00026] Fig. 10 is a schematic block diagram showing the set of transform blocks available in the versatile video coding standard;
[00027] Fig. 11 is a flow chart diagram of a method for selectively applying a non-separable secondary transform to encode a transform block of residual coefficients into a bitstream; and
[00028] Fig. 12 is a flow chart diagram of a method for decoding a transform block of residual coefficients from a bitstream by selectively applying a non-separable secondary transform.
DETAILED DESCRIPTION INCLUDING BEST MODE
[00029] Where reference is made in any one or more of the accompanying drawings to steps and/or features, which have the same reference numerals, those steps and/or features have for the purposes of this description the same function(s) or operation(s), unless the contrary intention appears.
[00030] Fig. 1 is a schematic block diagram showing functional modules of a video encoding and decoding system 100. The system 100 may utilise coefficient scanning methods to improve compression efficiency and/or achieve reduced implementation cost. The system 100 includes a source device 110 and a destination device 130. A communication channel 120 is used to communicate encoded video information from the source device 110 to the destination device 130. In some arrangements, the source device 110 and destination device 130 may either or both comprise respective mobile telephone handsets or “smartphones”, in which case the communication channel 120 is a wireless channel. In other arrangements, the source device 110 and destination device 130 may comprise video conferencing equipment, in which case the communication channel 120 is typically a wired channel, such as an internet connection. Moreover, the source device 110 and the destination device 130 may comprise any of a wide range of devices, including devices supporting over-the-air television broadcasts, cable television applications, internet video applications (including streaming) and applications where encoded video data is captured on some computer-readable storage medium, such as hard disk drives in a file server.
[00031] As shown in Fig. 1, the source device 110 includes a video source 112, a video encoder 114 and a transmitter 116. The video source 112 typically comprises a source of captured video frame data (shown as 113), also referred to as an image frame, such as an imaging sensor, a previously captured video sequence stored on a non-transitory recording medium, or a video feed from a remote imaging sensor. The video source 112 may also be the output of a computer graphics card, e.g. displaying the video output of an operating system and various applications executing upon a computing device, for example a tablet computer.
Examples of source devices 110 that may include an imaging sensor as the video source 112 include smart-phones, video camcorders, professional video cameras, and network video cameras. The video encoder 114 converts (or ‘encodes’) the captured frame data (indicated by an arrow 113) from the video source 112 into a bitstream (indicated by an arrow 115) as described further with reference to Fig. 3. The bitstream 115 is transmitted by the
transmitter 116 over the communication channel 120 as encoded video data (or “encoded video information”). It is also possible for the bitstream 115 to be stored in a non-transitory storage device 122, such as a “Flash” memory or a hard disk drive, until later being transmitted over the communication channel 120, or in-lieu of transmission over the communication channel 120.
[00032] The destination device 130 includes a receiver 132, a video decoder 134 and a display device 136. The receiver 132 receives encoded video data from the communication channel 120 and passes received video data to the video decoder 134 as a bitstream (indicated by an arrow 133). The video decoder 134 then outputs decoded image frame data (indicated by an arrow 135) to the display device 136. Examples of the display device 136 include a cathode ray tube and liquid crystal displays, such as those in smart-phones, tablet computers, computer monitors and stand-alone television sets. It is also possible for the functionality of each of the source device 110 and the destination device 130 to be embodied in a single device, examples of which include mobile telephone handsets and tablet computers.
[00033] Notwithstanding the example devices mentioned above, each of the source device 110 and destination device 130 may be configured within a general purpose computing system, typically through a combination of hardware and software components. Fig. 2A illustrates such a computer system 200, which includes: a computer module 201; input devices such as a keyboard 202, a mouse pointer device 203, a scanner 226, a camera 227, which may be configured as the video source 112, and a microphone 280; and output devices including a printer 215, a display device 214, which may be configured as the display device 136, and loudspeakers 217. An external Modulator-Demodulator (Modem) transceiver device 216 may be used by the computer module 201 for communicating to and from a communications network 220 via a connection 221. The communications network 220, which may represent the communication channel 120, may be a wide-area network (WAN), such as the Internet, a cellular telecommunications network, or a private WAN. Where the connection 221 is a telephone line, the modem 216 may be a traditional “dial-up” modem. Alternatively, where the connection 221 is a high capacity (e.g., cable or optical) connection, the modem 216 may be a broadband modem. A wireless modem may also be used for wireless connection to the communications network 220. The transceiver device 216 may provide the functionality of the transmitter 116 and the receiver 132 and the communication channel 120 may be embodied in the connection 221.
[00034] The computer module 201 typically includes at least one processor unit 205, and a memory unit 206. For example, the memory unit 206 may have semiconductor random access memory (RAM) and semiconductor read only memory (ROM). The computer module 201 also includes a number of input/output (I/O) interfaces including: an audio-video interface 207 that couples to the video display 214, loudspeakers 217 and microphone 280; an I/O interface 213 that couples to the keyboard 202, mouse 203, scanner 226, camera 227 and optionally a joystick or other human interface device (not illustrated); and an interface 208 for the external modem 216 and printer 215. The signal from the audio-video interface 207 to the computer monitor 214 is generally the output of a computer graphics card. In some implementations, the modem 216 may be incorporated within the computer module 201, for example within the interface 208. The computer module 201 also has a local network interface 211, which permits coupling of the computer system 200 via a connection 223 to a local-area communications network 222, known as a Local Area Network (LAN). As illustrated in Fig. 2A, the local communications network 222 may also couple to the wide network 220 via a connection 224, which would typically include a so-called “firewall” device or device of similar functionality. The local network interface 211 may comprise an Ethernet™ circuit card, a Bluetooth™ wireless arrangement or an IEEE 802.11 wireless arrangement; however, numerous other types of interfaces may be practiced for the interface 211. The local network interface 211 may also provide the functionality of the transmitter 116 and the receiver 132 and the communication channel 120 may also be embodied in the local communications network 222.
[00035] The I/O interfaces 208 and 213 may afford either or both of serial and parallel connectivity, the former typically being implemented according to the Universal Serial Bus (USB) standards and having corresponding USB connectors (not illustrated). Storage devices 209 are provided and typically include a hard disk drive (HDD) 210. Other storage devices such as a floppy disk drive and a magnetic tape drive (not illustrated) may also be used. An optical disk drive 212 is typically provided to act as a non-volatile source of data. Portable memory devices, such as optical disks (e.g. CD-ROM, DVD, Blu-ray Disc™), USB-RAM, portable external hard drives, and floppy disks, for example, may be used as appropriate sources of data to the computer system 200. Typically, any of the HDD 210, optical drive 212, networks 220 and 222 may also be configured to operate as the video source 112, or as a destination for decoded video data to be stored for reproduction via the display 214. Either or both of the source device 110 and the destination device 130 of the system 100 may be embodied in the computer system 200.
[00036] The components 205 to 213 of the computer module 201 typically communicate via an interconnected bus 204 and in a manner that results in a conventional mode of operation of the computer system 200 known to those in the relevant art. For example, the processor 205 is coupled to the system bus 204 using a connection 218. Likewise, the memory 206 and optical disk drive 212 are coupled to the system bus 204 by connections 219. Examples of computers on which the described arrangements can be practised include IBM-PCs and compatibles, Sun SPARC stations, Apple Mac™ or similar computer systems.
[00037] Where appropriate or desired, the video encoder 114, the video decoder 134 and the methods described below may be implemented using the computer system 200, in which case the video encoder 114, the video decoder 134 and the methods to be described may be implemented as one or more software application programs 233 executable within the computer system 200. In particular, the video encoder 114, the video decoder 134 and the steps of the described methods are effected by instructions 231 (see Fig. 2B) in the software 233 that are carried out within the computer system 200. The software instructions 231 may be formed as one or more code modules, each for performing one or more particular tasks. The software may also be divided into two separate parts, in which a first part and the corresponding code modules performs the described methods and a second part and the corresponding code modules manage a user interface between the first part and the user.

[00038] The software may be stored in a computer readable medium, including the storage devices described below, for example. The software is loaded into the computer system 200 from the computer readable medium, and then executed by the computer system 200. A computer readable medium having such software or computer program recorded on the computer readable medium is a computer program product. The use of the computer program product in the computer system 200 preferably effects an advantageous apparatus for implementing the video encoder 114, the video decoder 134 and the described methods.
[00039] The software 233 is typically stored in the HDD 210 or the memory 206. The software is loaded into the computer system 200 from a computer readable medium, and executed by the computer system 200. Thus, for example, the software 233 may be stored on an optically readable disk storage medium (e.g., CD-ROM) 225 that is read by the optical disk drive 212.
[00040] In some instances, the application programs 233 may be supplied to the user encoded on one or more CD-ROMs 225 and read via the corresponding drive 212, or alternatively may be read by the user from the networks 220 or 222. Still further, the software can also be loaded into the computer system 200 from other computer readable media. Computer readable storage media refers to any non-transitory tangible storage medium that provides recorded instructions and/or data to the computer system 200 for execution and/or processing. Examples of such storage media include floppy disks, magnetic tape, CD-ROM, DVD, Blu-ray Disc™, a hard disk drive, a ROM or integrated circuit, USB memory, a magneto-optical disk, or a computer readable card such as a PCMCIA card and the like, whether or not such devices are internal or external of the computer module 201. Examples of transitory or non-tangible computer readable transmission media that may also participate in the provision of the software, application programs, instructions and/or video data or encoded video data to the computer module 201 include radio or infra-red transmission channels as well as a network connection to another computer or networked device, and the Internet or Intranets including e-mail transmissions and information recorded on Websites and the like.
[00041] The second part of the application programs 233 and the corresponding code modules mentioned above may be executed to implement one or more graphical user interfaces (GUIs) to be rendered or otherwise represented upon the display 214. Through manipulation of typically the keyboard 202 and the mouse 203, a user of the computer system 200 and the application may manipulate the interface in a functionally adaptable manner to provide controlling commands and/or input to the applications associated with the GUI(s). Other forms of functionally adaptable user interfaces may also be implemented, such as an audio interface utilizing speech prompts output via the loudspeakers 217 and user voice commands input via the microphone 280.
[00042] Fig. 2B is a detailed schematic block diagram of the processor 205 and a “memory” 234. The memory 234 represents a logical aggregation of all the memory modules (including the HDD 210 and semiconductor memory 206) that can be accessed by the computer module 201 in Fig. 2A.
[00043] When the computer module 201 is initially powered up, a power-on self-test (POST) program 250 executes. The POST program 250 is typically stored in a ROM 249 of the semiconductor memory 206 of Fig. 2A. A hardware device such as the ROM 249 storing software is sometimes referred to as firmware. The POST program 250 examines hardware within the computer module 201 to ensure proper functioning and typically checks the processor 205, the memory 234 (209, 206), and a basic input-output system software (BIOS) module 251, also typically stored in the ROM 249, for correct operation. Once the POST program 250 has run successfully, the BIOS 251 activates the hard disk drive 210 of Fig. 2A. Activation of the hard disk drive 210 causes a bootstrap loader program 252 that is resident on the hard disk drive 210 to execute via the processor 205. This loads an operating system 253 into the RAM 206, upon which the operating system 253 commences operation. The operating system 253 is a system level application, executable by the processor 205, to fulfil various high level functions, including processor management, memory management, device management, storage management, software application interface, and generic user interface.
[00044] The operating system 253 manages the memory 234 (209, 206) to ensure that each process or application running on the computer module 201 has sufficient memory in which to execute without colliding with memory allocated to another process. Furthermore, the different types of memory available in the computer system 200 of Fig. 2A must be used properly so that each process can run effectively. Accordingly, the aggregated memory 234 is not intended to illustrate how particular segments of memory are allocated (unless otherwise stated), but rather to provide a general view of the memory accessible by the computer system 200 and how such is used.
[00045] As shown in Fig. 2B, the processor 205 includes a number of functional modules including a control unit 239, an arithmetic logic unit (ALU) 240, and a local or internal memory 248, sometimes called a cache memory. The cache memory 248 typically includes a number of storage registers 244-246 in a register section. One or more internal busses 241 functionally interconnect these functional modules. The processor 205 typically also has one or more interfaces 242 for communicating with external devices via the system bus 204, using a connection 218. The memory 234 is coupled to the bus 204 using a connection 219.
[00046] The application program 233 includes a sequence of instructions 231 that may include conditional branch and loop instructions. The program 233 may also include data 232 which is used in execution of the program 233. The instructions 231 and the data 232 are stored in memory locations 228, 229, 230 and 235, 236, 237, respectively. Depending upon the relative size of the instructions 231 and the memory locations 228-230, a particular instruction may be stored in a single memory location as depicted by the instruction shown in the memory location 230. Alternately, an instruction may be segmented into a number of parts each of which is stored in a separate memory location, as depicted by the instruction segments shown in the memory locations 228 and 229.
[00047] In general, the processor 205 is given a set of instructions which are executed therein. The processor 205 waits for a subsequent input, to which the processor 205 reacts by executing another set of instructions. Each input may be provided from one or more of a number of sources, including data generated by one or more of the input devices 202, 203, data received from an external source across one of the networks 220, 222, data retrieved from one of the storage devices 206, 209 or data retrieved from a storage medium 225 inserted into the corresponding reader 212, all depicted in Fig. 2A. The execution of a set of the instructions may in some cases result in output of data. Execution may also involve storing data or variables to the memory 234.
[00048] The video encoder 114, the video decoder 134 and the described methods may use input variables 254, which are stored in the memory 234 in corresponding memory
locations 255, 256, 257. The video encoder 114, the video decoder 134 and the described methods produce output variables 261, which are stored in the memory 234 in corresponding memory locations 262, 263, 264. Intermediate variables 258 may be stored in memory locations 259, 260, 266 and 267.
[00049] Referring to the processor 205 of Fig. 2B, the registers 244, 245, 246, the arithmetic logic unit (ALU) 240, and the control unit 239 work together to perform sequences of micro-operations needed to perform “fetch, decode, and execute” cycles for every instruction in the instruction set making up the program 233. Each fetch, decode, and execute cycle comprises:
(a) a fetch operation, which fetches or reads an instruction 231 from a memory
location 228, 229, 230;
(b) a decode operation in which the control unit 239 determines which instruction has been fetched; and
(c) an execute operation in which the control unit 239 and/or the ALU 240 execute the instruction.
[00050] Thereafter, a further fetch, decode, and execute cycle for the next instruction may be executed. Similarly, a store cycle may be performed by which the control unit 239 stores or writes a value to a memory location 232.
[00051] Each step or sub-process in the methods of Figs. 12 and 13, to be described, is associated with one or more segments of the program 233 and is typically performed by the register section 244, 245, 246, the ALU 240, and the control unit 239 in the processor 205 working together to perform the fetch, decode, and execute cycles for every instruction in the instruction set for the noted segments of the program 233.
[00052] Fig. 3 is a schematic block diagram showing functional modules of the video encoder 114. Fig. 4 is a schematic block diagram showing functional modules of the video decoder 134. Generally, data passes between functional modules within the video encoder 114 and the video decoder 134 in groups of samples or coefficients, such as divisions of blocks into sub-blocks of a fixed size, or as arrays. The video encoder 114 and video decoder 134 may be implemented using a general-purpose computer system 200, as shown in Figs. 2A and 2B, where the various functional modules may be implemented by dedicated hardware within the computer system 200, or by software executable within the computer system 200 such as one or more software code modules of the software application program 233 resident on the hard disk drive 210 and being controlled in its execution by the processor 205. Alternatively, the video encoder 114 and video decoder 134 may be implemented by a combination of dedicated hardware and software executable within the computer system 200. The video encoder 114, the video decoder 134 and the described methods may alternatively be implemented in dedicated hardware, such as one or more integrated circuits performing the functions or sub-functions of the described methods. Such dedicated hardware may include graphic processing units (GPUs), digital signal processors (DSPs), application-specific standard products (ASSPs), application-specific integrated circuits (ASICs), field programmable gate arrays (FPGAs) or one or more microprocessors and associated memories. In particular, the video encoder 114 comprises modules 322-386 and the video decoder 134 comprises modules 420-496, which may each be implemented as one or more software code modules of the software application program 233.
[00053] Although the video encoder 114 of Fig. 3 is an example of a versatile video coding (VVC) video encoding pipeline, other video codecs may also be used to perform the processing stages described herein. The video encoder 114 receives captured image frame data 113, such as a series of frames, each frame including one or more colour channels. A block
partitioner 310 firstly divides the image frame data 113 into regions generally referred to as ‘coding tree units’ (CTUs), generally square in shape and configured such that a particular size is used for the CTUs. The size of the coding tree units may be 64x64, 128x128, or 256x256 luma samples, for example. The block partitioner 310 further divides each CTU into one or more coding units (CUs), with the CUs having a variety of sizes, which may include both square and non-square aspect ratios. Thus, a current block 312, a ‘coding unit’ (CU), is output from the block partitioner 310, progressing in accordance with an iteration over the one or more blocks of the CTU. However, the concept of a CU is not limited to the block partitioning resulting from the block partitioner 310. The video decoder 134 may also be said to produce CUs which, due to use of lossy compression techniques, are typically an approximation of the blocks from the block partitioner 310. Moreover, the video encoder 114 produces CUs having the same approximation as seen in the video decoder 134, enabling exact knowledge of the sample data available to block prediction methods in the video decoder 134. The options for partitioning CTUs into CUs are further described below with reference to Figs. 5 and 6.
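The recursive division of a CTU into CUs described above can be sketched as follows. This is an illustrative quadtree-only sketch, not the partitioner 310 itself: real VVC trees also allow binary and ternary splits, and the `should_split` decision function stands in for the encoder's rate-distortion search.

```python
# Minimal quadtree partitioner sketch: a CTU at (x, y) is either kept as one
# CU or split into four half-size squares, recursively, down to a minimum
# size. The split criterion is supplied by the caller.

def partition_ctu(x, y, size, min_size, should_split):
    """Recursively divide a square region into CUs; returns (x, y, size) tuples."""
    if size > min_size and should_split(x, y, size):
        half = size // 2
        cus = []
        for dy in (0, half):               # four quadrants of the region
            for dx in (0, half):
                cus.extend(partition_ctu(x + dx, y + dy, half,
                                         min_size, should_split))
        return cus
    return [(x, y, size)]

# Example: split a 128x128 CTU fully down to 32x32 CUs.
cus = partition_ctu(0, 0, 128, 32, lambda x, y, s: True)
```

With the always-split criterion above, a 128x128 CTU yields sixteen 32x32 CUs; an encoder would instead split only where doing so lowers the rate-distortion cost.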
[00054] The coding tree units (CTUs) resulting from the first division of the image frame data 113 may be scanned in raster scan order and are grouped into one or more ‘slices’. As the frame data 113 typically includes multiple colour channels, the CTUs and CUs are associated with the samples from all colour channels that overlap with the block area defined from operation of the block partitioner 310. A CU may be said to comprise one or more coding blocks (CBs), with each CB occupying the same block area as the CU but being associated with each one of the colour channels of the frame data 113. Due to the potentially differing sampling rate of the chroma channels compared to the luma channel, the dimensions of CBs for chroma channels may differ from those of CBs for luma channels. When using the 4:2:0 chroma format, CBs of chroma channels of a CU have dimensions of half of the width and height of the CB for the luma channel of the CU.
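The chroma-block dimension relationship above can be written out directly. The function below is a hedged sketch (its name and the string-keyed format argument are illustrative, not from any codec source); it simply encodes the subsampling factors of the common chroma formats.

```python
# Chroma CB dimensions as a function of the luma CB dimensions: 4:2:0
# subsamples chroma in both directions, 4:2:2 horizontally only, and
# 4:4:4 not at all.

def chroma_cb_size(luma_w, luma_h, chroma_format="4:2:0"):
    if chroma_format == "4:2:0":
        return luma_w // 2, luma_h // 2    # half width, half height
    if chroma_format == "4:2:2":
        return luma_w // 2, luma_h         # half width only
    return luma_w, luma_h                  # 4:4:4: same dimensions

# A 16x8 luma CB pairs with an 8x4 chroma CB in 4:2:0.
cb = chroma_cb_size(16, 8)
```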
[00055] In iterating over all CUs resulting from the block partitioner 310, the video encoder 114 produces a ‘prediction unit’ (PU), indicated by an arrow 320, for each block, for example the block 312. The PU 320 is a prediction of the contents of the associated CU 312. A subtracter module 322 produces a difference, indicated as 324 (or ‘residual’, referring to the difference being in the spatial domain), between the PU 320 and the CU 312. The difference 324 is a block-size difference between corresponding samples in the PU 320 and the CU 312. The difference 324 is transformed, quantised and represented as a transform unit (TU), indicated by an arrow 336. The PU 320 is typically chosen as the ‘best’ resulting one of many possible candidate PUs. A candidate PU is a PU resulting from one of the prediction modes available to the video encoder 114. Each candidate PU results in a corresponding transform unit. The transform unit 336 is a quantised and transformed representation of the difference 324. When combined with the predicted PU in the video decoder 134, the transform unit 336 reduces the difference between decoded CUs and the original blocks 312 at the expense of additional signalling in a bitstream.
[00056] Each candidate PU thus has an associated coding cost (rate) and an associated difference (or ‘distortion’). The coding rate (cost) is typically measured in bits. The coding distortion of a block is typically estimated as a difference in sample values, such as a sum of absolute differences (SAD) or a sum of squared differences (SSD). A mode selector 386 uses the difference 324 to determine the estimate for each candidate PU and thereby to determine an intra prediction mode (represented by an arrow 388). Estimation of the coding costs associated with each candidate prediction mode and corresponding residual coding can be performed at significantly lower cost than entropy coding of the residual. Accordingly, a number of candidate modes can be evaluated to determine an optimum mode in a rate-distortion sense.
[00057] Determining an optimum mode is typically achieved using a variation of Lagrangian optimisation. Selection of the intra prediction mode 388 typically involves determining a coding cost for the residual data resulting from application of a particular intra prediction mode. The coding cost may be approximated by using a ‘sum of transformed differences’ whereby a relatively simple transform, such as a Hadamard transform, is used to obtain an estimated transformed residual cost. For implementations using relatively simple transforms, provided the costs resulting from the simplified estimation method are monotonically related to the actual costs that would otherwise be determined from a full evaluation, the simplified estimation method may be used to make the same decision (i.e. intra prediction mode) with a reduction in complexity in the video encoder 114. To allow for possible non-monotonicity in the relationship between estimated and actual costs, possibly resulting from further mode decisions available for the coding of residual data, the simplified estimation method may be used to generate a list of best candidates. The list of best candidates may be of an arbitrary number. A more complete search may be performed using the best candidates to establish optimal mode choices for coding the residual data for each of the candidates, allowing a final selection of the intra prediction mode along with other mode decisions.
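The Lagrangian selection described above can be sketched in a few lines: each candidate mode carries a distortion D (here SAD, as mentioned earlier) and a rate R in bits, and the mode minimising J = D + λ·R wins. The candidate list, the mode names and the λ value below are all illustrative placeholders, not values from the encoder 114.

```python
# Rate-distortion mode selection sketch: J = D + lambda * R.

def sad(block_a, block_b):
    """Sum of absolute differences as a simple distortion measure."""
    return sum(abs(a - b) for a, b in zip(block_a, block_b))

def select_best_mode(candidates, lam):
    """candidates: (mode_name, distortion, rate_bits) tuples; return best name."""
    return min(candidates, key=lambda c: c[1] + lam * c[2])[0]

# Hypothetical candidates: cheap-but-inaccurate vs accurate-but-costly modes.
candidates = [("planar", 120, 10), ("dc", 100, 25), ("angular_45", 90, 60)]
best = select_best_mode(candidates, lam=1.0)
```

With λ = 1.0 the costs are 130, 125 and 150, so the "dc" candidate is selected; a larger λ penalises rate more heavily and pushes the choice towards cheaper modes.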
[00058] The other mode decisions include an ability to skip the primary and secondary transforms, known as ‘transform skip’. Skipping the transforms is suited to residual data that lacks adequate correlation for reduced coding cost via expression as transform basis functions. Certain types of content, such as relatively simple computer-generated graphics, may exhibit similar behaviour. Another mode decision associated with the mode selector module 386 is selection of an NSST (non-separable secondary transform) index. The NSST index is represented by an arrow 390. Presence of the NSST index 390 in the bitstream indicates a selection of a particular NSST for the corresponding transform block. The NSST index 390 has four possible values. A value of zero for the NSST index 390 indicates that the NSST is bypassed (i.e. not performed). The NSST index 390 having a value from one to three indicates which one of three possible NSST transform matrices is to be applied. Selection of the NSST index is a costly operation and is performed as described with reference to Fig. 10.
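The NSST index semantics above (0 = bypass; 1 to 3 = select one of three kernels) can be sketched as a dispatch. The "matrices" below are scalar placeholders purely to make the selection visible; actual NSST kernels are matrices chosen per intra prediction mode, which this sketch does not model.

```python
# Sketch of NSST index handling: index 0 bypasses the secondary transform;
# indices 1-3 select one of three candidate kernels.

def apply_nsst(coeffs, nsst_index, matrices):
    """Return coefficients after the (possibly bypassed) secondary transform."""
    if nsst_index == 0:
        return coeffs                      # NSST bypassed: pass-through
    kernel = matrices[nsst_index - 1]      # indices 1..3 select a kernel
    # Placeholder "transform": scale by the selected kernel value.
    return [kernel * c for c in coeffs]

matrices = [2, 3, 5]                        # stand-ins for three NSST matrices
out = apply_nsst([1, 2], 2, matrices)
```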
[00059] Lagrangian or similar optimisation processing can be employed to both select an optimal partitioning of a CTU into CUs (by the block partitioner 310) as well as the selection of a best prediction mode from a plurality of possibilities. Through application of a Lagrangian optimisation process of the candidate modes in a mode selector module 386, the intra prediction mode with the lowest cost measurement is selected as the best mode. The best mode is the selected intra prediction mode 388 and is also encoded in the bitstream 115 by an entropy encoder 338. The selection of the intra prediction mode 388 by operation of the selector module 386 extends to operation of the block partitioner 310. For example, candidates for selection of the intra prediction mode 388 may include modes applicable to a given block and additionally modes applicable to multiple smaller blocks that collectively are collocated with the given block. In such cases, the process of selection of candidates implicitly is also a process of determining the best hierarchical decomposition of the CTU into CUs.
[00060] The entropy encoder 338 supports both variable-length coding of syntax elements and arithmetic coding of syntax elements. Arithmetic coding is supported using a context-adaptive binary arithmetic coding process. Arithmetically coded syntax elements consist of sequences of one or more ‘bins’. Bins, like bits, have a value of ‘0’ or ‘1’. However, bins are not encoded in the bitstream 115 as discrete bits. Bins have an associated likely value and an associated probability, known as a ‘context’. When the actual bin to be coded matches the likely value, a ‘most probable symbol’ (MPS) is coded. Coding a most probable symbol is relatively inexpensive in terms of consumed bits. When the actual bin to be coded mismatches the likely value, a ‘least probable symbol’ (LPS) is coded. Coding a least probable symbol has a relatively high cost in terms of consumed bits. The bin coding techniques enable efficient coding of bins where the probability of a ‘0’ versus a ‘1’ is skewed. For a syntax element with two possible values (i.e., a ‘flag’), a single bin is adequate. For syntax elements with many possible values, a sequence of bins is needed. Then, the presence of later bins in the sequence may be determined based on the value of earlier bins in the sequence. Additionally, each bin may be associated with more than one context, with the selection of a particular context dependent on earlier bins in the syntax element, the bin values of neighbouring syntax elements (i.e. those from neighbouring blocks) and the like. Each time a bin is coded, the context is updated to adapt to the new bin value. As such, the binary arithmetic coding scheme is said to be adaptive.
[00061] Also supported by the encoder 114 are bins that lack a context (‘bypass bins’). Bypass bins are coded assuming an equiprobable distribution between a ‘0’ and a ‘1’. The absence of a context saves memory and reduces complexity, and thus bypass bins are used where the distribution of values for the particular bin is not skewed. One example of an entropy coder employing context and adaption is known in the art as CABAC (context adaptive binary arithmetic coder) and many variants of this coder have been employed in video coding.
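The MPS/LPS adaptation described above can be illustrated with a toy context model. This is not the actual CABAC state machine (which uses finite-state probability tables rather than the floating-point nudging below); it only demonstrates the mechanism: the model drifts towards whichever bin value it keeps seeing, and swaps its MPS when the LPS becomes the likelier symbol.

```python
# Toy adaptive bin context: tracks a most-probable-symbol (MPS) value and a
# probability estimate updated after every coded bin.

class BinContext:
    def __init__(self, mps=0, p_mps=0.5, step=0.05):
        self.mps, self.p_mps, self.step = mps, p_mps, step

    def code_bin(self, bin_value):
        """Update the model for one bin; return True if the MPS was coded."""
        is_mps = (bin_value == self.mps)
        if is_mps:
            self.p_mps = min(0.99, self.p_mps + self.step)
        else:
            self.p_mps = max(0.01, self.p_mps - self.step)
            if self.p_mps < 0.5:           # swap MPS when the LPS wins out
                self.mps, self.p_mps = 1 - self.mps, 1.0 - self.p_mps
        return is_mps

ctx = BinContext()
for b in [0, 0, 0, 0]:                      # a skewed run of zeros
    ctx.code_bin(b)
```

After the skewed run, the context assigns probability 0.7 to the MPS value 0, so further zeros would be coded cheaply; bypass bins simply skip this machinery and assume a fixed 50/50 split.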
[00062] The entropy encoder 338 encodes the intra prediction mode 388 using a combination of context-coded and (optionally) bypass-coded bins. Typically, a list of ‘most probable modes’ is generated in the video encoder 114. The list of most probable modes is typically of a fixed length, such as three or six modes, and may include modes encountered in earlier blocks. A context-coded bin encodes a flag indicating if the intra prediction mode is one of the most probable modes. If the intra prediction mode 388 is one of the most probable modes, further signalling is encoded indicative of which most probable mode corresponds with the intra prediction mode 388. Otherwise, the intra prediction mode 388 is encoded as a ‘remaining mode’, using an alternative syntax to express intra prediction modes other than those present in the most probable mode list. The entropy encoder 338 also encodes the NSST index 390 for particular coding units or transform blocks, as described with reference to Figs. 10 and 11.
[00063] A multiplexer module 384 outputs the PU 320 according to the determined best prediction mode 388, selecting from the tested candidate prediction modes. The candidate prediction modes need not include every conceivable prediction mode supported by the video encoder 114. Prediction modes fall broadly into two categories. A first category is ‘intra-frame prediction’ (or ‘intra prediction’). In intra-frame prediction a prediction for a block is produced using other samples drawn from the current frame. The second category is ‘inter-frame prediction’ (or ‘inter prediction’). In inter-frame prediction a prediction for a block is produced using samples from a frame preceding the current frame in the order of coding frames in the bitstream (which may differ from the order of the frames when captured or displayed). Within each category (i.e. intra- and inter-prediction), different techniques may be applied to generate the PU. For example, intra-prediction may use values from adjacent rows and columns of previously reconstructed samples, in combination with a direction to generate a PU according to a prescribed filtering process. Alternatively, the PU may be described using a small number of parameters. Inter-prediction methods may vary in the number of motion parameters and their precision. Motion parameters typically comprise a reference frame offset plus a translation for one or two reference frames, but may include more frames, special frames, or complex affine parameters such as scaling and rotation. In addition, a pre-determined motion refinement process may be applied to generate dense motion estimates based on referenced sample blocks.
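The use of adjacent rows and columns of reconstructed samples for intra prediction, mentioned above, is easiest to see in DC mode: the whole PU is filled with the average of the reference samples. This is a hedged sketch of that one mode only (real codecs add many angular modes, reference-sample filtering and edge handling, none of which is modelled here).

```python
# DC intra prediction sketch: fill a width x height PU with the rounded mean
# of the previously reconstructed samples above and to the left of the block.

def dc_intra_predict(above, left, width, height):
    refs = list(above) + list(left)
    dc = (sum(refs) + len(refs) // 2) // len(refs)   # integer mean, rounded
    return [[dc] * width for _ in range(height)]

pu = dc_intra_predict(above=[100, 102, 98, 100],
                      left=[101, 99, 100, 101],
                      width=4, height=4)
```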
[00064] Having determined and selected a best PU 320, and subtracted the PU 320 from the original sample block at the subtractor 322, a residual 324 with lowest coding cost is obtained and subjected to lossy compression. The lossy compression process comprises the steps of transformation, quantisation and entropy coding. A transform module 326 applies a first transform to the difference 324, converting the difference 324 to the frequency domain and producing intermediate transform coefficients represented by an arrow 328. The first transform is typically separable, transforming each block as a set of rows and then as a set of columns by applying a transform such as a one-dimensional (1D) discrete cosine transform (DCT). The intermediate transform coefficients 328 are passed to a secondary transform module 330.

[00065] The secondary transform module 330 operates on a subset of the intermediate transform coefficients 328, such as the intermediate transform coefficients occupying the upper left 4x4 or 8x8 area of the overall block. Other transform coefficients in the intermediate transform coefficients 328 are passed through the module 330 unchanged. The secondary transform module 330 applies one of a variety of transforms on the subset of the intermediate transform coefficients 328 to produce transform coefficients represented by an arrow 332. The secondary transform module 330 applies a forward secondary transform selected in a manner analogous to that of an inverse secondary transform module 344, to be further described with reference to Fig. 9. In particular, the intra prediction mode 388 and the non-separable secondary transform index 390 are used to select a particular secondary transform. The transforms available to the secondary transform module 330 are typically non-separable and thus cannot be performed in two stages (i.e. rows and columns) as is the case for the transform module 326.
The selection of the applied transform may depend, at least in part, on the prediction mode, for example, as described with reference to Fig. 8B. Additionally, the video encoder 114 may consider the selection of the applied transform at the module 330 as a test of further candidates for selection based on rate/distortion cost evaluation.
[00066] The transform coefficients 332 are passed to a quantiser module 334. At the module 334, quantisation in accordance with a ‘quantisation parameter’ is performed to produce residual coefficients, represented by the arrow 336. The quantisation parameter is constant for a given transform block and thus results in a uniform scaling for the production of residual coefficients for a transform block. A non-uniform scaling is also possible by application of a ‘quantisation matrix’, whereby the scaling factor applied for each residual coefficient is derived from a combination of the quantisation parameter and the corresponding entry in a scaling matrix, typically having a size equal to that of the transform block. The quantisation matrix is costly to signal and thus is coded only infrequently (if at all) in the bitstream 115. Coding of the quantisation matrix requires converting the two-dimensional matrix of scaling factors into a list of scaling factors to be entropy encoded into the bitstream 115. The existing Z-order scan may be reused for converting the matrix of scaling factors, avoiding the overhead associated with supporting an additional scan pattern for the infrequently performed operation of encoding a quantisation matrix. The residual coefficients 336 are supplied to an entropy encoder 338 for encoding in the bitstream 115. Typically, the residual coefficients of a transform block are scanned to produce an ordered list of values, according to a scan pattern. The scan pattern generally scans the transform block as a sequence of 4x4 ‘sub-blocks’, providing a regular scanning operation at the granularity of 4x4 sets of residual coefficients, with the arrangement of sub-blocks dependent on the size of the transform block. Additionally, the prediction mode and the corresponding block partitioning are also encoded in the bitstream 115.
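The uniform and non-uniform scaling just described can be sketched as follows. This is a minimal illustration only, not the integer arithmetic of the VVC specification: the step-size formula (doubling every six quantisation-parameter steps) and the neutral scaling-matrix entry of 16 are assumptions borrowed from HEVC-style codecs.

```python
def quantise_block(coeffs, qp, scaling_matrix=None):
    # Hypothetical sketch: the quantisation step size doubles every 6 QP
    # steps; a scaling-matrix entry of 16 is assumed to be neutral.
    step = 2.0 ** ((qp - 4) / 6.0)
    result = []
    for r, row in enumerate(coeffs):
        out_row = []
        for c, value in enumerate(row):
            # Uniform scaling unless a quantisation matrix is supplied.
            scale = scaling_matrix[r][c] / 16.0 if scaling_matrix else 1.0
            out_row.append(int(value / (step * scale)))  # truncating division
        result.append(out_row)
    return result
```

With `qp = 4` the step size is 1 and the coefficients pass through unscaled; supplying a scaling matrix attenuates individual coefficients non-uniformly.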
As described above, the video encoder 114 needs access to a frame representation corresponding to the frame representation seen in the video decoder 134. Thus, the residual coefficients 336 are also inverse quantised by a dequantiser module 340 to produce inverse transform coefficients, represented by an arrow 342. The inverse transform coefficients 342 are passed through an inverse secondary transform module 344. The inverse secondary transform module 344 applies the selected secondary transform to produce intermediate inverse transform coefficients, as represented by an arrow 346. The intermediate inverse transform coefficients 346 are supplied to an inverse transform module 348 to produce residual samples, represented by an arrow 350, of the transform unit. A summation module 352 adds the residual samples 350 and the PU 320 to produce reconstructed samples (indicated by an arrow 354) of the CU. The reconstructed samples 354 are passed to a reference sample cache 356 and an in-loop filters module 368. The reference sample cache 356, typically implemented using static RAM on an ASIC (thus avoiding costly off-chip memory access) provides minimal sample storage needed to satisfy the dependencies for generating intra-frame prediction blocks for subsequent CUs in the frame.
The minimal dependencies typically include a ‘line buffer’ of samples along the bottom of a row of CTUs, for use by the next row of CTUs, and column buffering, the extent of which is set by the height of the CTU. The reference sample cache 356 supplies reference
samples (represented by an arrow 358) to a reference sample filter 360. The sample filter 360 applies a smoothing operation to produce filtered reference samples (indicated by an arrow 362). The filtered reference samples 362 are used by an intra-frame prediction module 364 to produce an intra-predicted block of samples, represented by an arrow 366. For each candidate intra prediction mode the intra-frame prediction module 364 produces a block of samples, i.e. 366.
[00067] The in-loop filters module 368 applies several filtering stages to the reconstructed samples 354. The filtering stages include a ‘deblocking filter’ (DBF) which applies smoothing aligned to the CU boundaries to reduce artefacts resulting from discontinuities. Another filtering stage present in the in-loop filters module 368 is the ‘adaptive loop filter’ (ALF), which applies a Wiener-based adaptive filter to further reduce distortion. A further available filtering stage in the in-loop filters module 368 is the ‘sample adaptive offset’ (SAO) filter. The SAO filter operates by firstly classifying reconstructed samples into one or multiple categories and, according to the allocated category, applying an offset at the sample level. Filtered samples 370 are output from the in-loop filters module 368. The filtered samples 370 are stored in a frame buffer 372. The frame buffer 372 typically has the capacity to store several (e.g. up to 16) pictures and thus is stored in the memory 206. As such, access to the frame buffer 372 is costly in terms of memory bandwidth. The frame buffer 372 provides reference
frames (represented by an arrow 374) to a motion estimation module 376 and a motion compensation module 380.
[00068] The motion estimation module 376 estimates a number of ‘motion vectors’ (indicated as 378), each being a Cartesian spatial offset from the location of the present CU, referencing a block in one of the reference frames in the frame buffer 372. A filtered block of reference samples (represented as 382) is produced for each motion vector. The filtered reference samples 382 form further candidate modes available for potential selection by the mode selector 386. Moreover, for a given CU, the PU 320 may be formed using one reference block (‘uni-predicted’) or may be formed using two reference blocks (‘bi-predicted’). For the selected motion vector, the motion compensation module 380 produces the PU 320 in accordance with a filtering process supportive of sub-pixel accuracy in the motion vectors. As such, the motion estimation module 376 (which operates on many candidate motion vectors) may conceivably perform a simplified filtering process compared to that of the motion compensation module 380 (which operates on the selected candidate only) to achieve reduced computational complexity.
[00069] Although the video encoder 114 of Fig. 3 is described with reference to versatile video coding (VVC), other video coding standards or implementations may also employ the processing stages of modules 310-386. The frame data 113 (and bitstream 115) may also be read from (or written to) memory 206, the hard disk drive 210, a CD-ROM, a Blu-ray disk™ or other computer readable storage medium. Additionally, the frame data 113 (and bitstream 115) may be received from (or transmitted to) an external source, such as a server connected to the communications network 220 or a radio-frequency receiver.
[00070] The video decoder 134 is shown in Fig. 4. Although the video decoder 134 of Fig. 4 is an example of a versatile video coding (VVC) video decoding pipeline, other video codecs may also be used to perform the processing stages described herein. As seen in Fig. 4, the bitstream 133 is input to the video decoder 134. The bitstream 133 may be read from memory 206, the hard disk drive 210, a CD-ROM, a Blu-ray disk™ or other non-transitory computer readable storage medium. Alternatively, the bitstream 133 may be received from an external source such as a server connected to the communications network 220 or a radio-frequency receiver. The bitstream 133 contains encoded syntax elements representing the captured frame data to be decoded.
[00071] The bitstream 133 is input to an entropy decoder module 420. The entropy decoder module 420 extracts syntax elements from the bitstream 133 and passes the values of the syntax elements to other modules in the video decoder 134. The entropy decoder module 420 applies a CABAC algorithm to decode syntax elements from the bitstream 133. The decoded syntax elements are used to reconstruct parameters within the video decoder 134. Parameters include residual coefficients (represented by an arrow 424) and mode selection information such as an intra prediction mode 458 and an NSST index 454. The mode selection information also includes information such as motion vectors, and the partitioning of each CTU into one or more CUs. Parameters are used to generate PUs, typically in combination with sample data from previously decoded CUs.
[00072] The residual coefficients 424 are input to a dequantiser module 428. The dequantiser module 428 performs inverse scaling on the residual coefficients 424 to create reconstructed intermediate transform coefficients (represented by an arrow 432) according to a quantisation parameter. Should use of a non-uniform inverse quantisation matrix be indicated in the bitstream 133, the video decoder 134 reads a quantisation matrix from the bitstream 133 as a sequence of scaling factors and arranges the scaling factors into a matrix according to the Z-order scan used for coding of residual coefficients. Then, the inverse scaling uses the quantisation matrix in combination with the quantisation parameter to create the reconstructed intermediate transform coefficients. Use of the Z-order scan for residual coefficients to also scan quantisation matrix scaling factors avoids the presence of an additional scan pattern and associated memory and complexity burden for a scan that is infrequently performed. The reconstructed intermediate transform coefficients 432 are passed to an inverse secondary transform module 436. The inverse secondary transform module 436 performs a ‘secondary inverse transform’ to produce reconstructed transform coefficients, represented by an arrow 440. The secondary transform is performed according to a determined transform block size, as described with reference to Fig. 9. The reconstructed transform coefficients 440 are passed to an inverse transform module 444. The module 444 transforms the coefficients from the frequency domain back to the spatial domain. The transform block is effectively based on significant residual coefficients and non-significant residual coefficient values. The result of operation of the module 444 is a block of residual samples, represented by an arrow 448. The residual samples 448 are equal in size to the corresponding CU. The residual samples 448 are supplied to a summation module 450.
At the summation module 450 the residual samples 448 are added to a decoded PU 452 to produce a block of reconstructed samples, represented by an arrow 456.
The reconstructed samples 456 are supplied to a reconstructed sample cache 460 and an in-loop filtering module 488. The in-loop filtering module 488 produces reconstructed blocks of frame samples, represented as 492. The frame samples 492 are written to a frame buffer 496.
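The arrangement of decoded scaling factors into a quantisation matrix by the Z-order scan, as described above for the dequantiser module 428, can be sketched as below. The bit-interleaving formulation is an illustrative assumption and is shown for power-of-two sizes only.

```python
def z_order_positions(size):
    # (row, column) positions of a size x size block visited in Z-order
    # (Morton order); size is assumed to be a power of two.
    positions = []
    for i in range(size * size):
        row = col = 0
        for b in range(size.bit_length()):
            col |= ((i >> (2 * b)) & 1) << b       # even bits -> column
            row |= ((i >> (2 * b + 1)) & 1) << b   # odd bits  -> row
        positions.append((row, col))
    return positions

def scaling_list_to_matrix(scaling_list, size):
    # Arrange the entropy-decoded list of scaling factors into a matrix
    # using the same Z-order scan employed for residual coefficients.
    matrix = [[0] * size for _ in range(size)]
    for index, (row, col) in enumerate(z_order_positions(size)):
        matrix[row][col] = scaling_list[index]
    return matrix
```

Reusing one scan for both residual coefficients and scaling factors is what avoids a second scan table in memory.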
[00073] A reconstructed sample cache 460 operates similarly to the reconstructed sample cache 356 of the video encoder 114. The reconstructed sample cache 460 provides storage for reconstructed samples needed to intra predict subsequent CUs without recourse to the memory 206 (e.g., by using the data 232 instead, which is typically on-chip memory). Reference samples, represented by an arrow 464, are obtained from the reconstructed sample cache 460 and supplied to a reference sample filter 468 to produce filtered reference samples indicated by arrow 472. The filtered reference samples 472 are supplied to an intra-frame prediction module 476. The module 476 produces a block of intra-predicted samples, represented by an arrow 480, in accordance with an intra prediction mode parameter 458 signalled in the bitstream 133 and decoded by the entropy decoder 420.
[00074] When intra prediction is indicated in the bitstream 133 for the current CU, the intra- predicted samples 480 form the decoded PU 452 via a multiplexor module 484.
[00075] When inter prediction is indicated in the bitstream 133 for the current CU, a motion compensation module 434 produces a block of inter-predicted samples 438 using a motion vector and reference frame index to select and filter a block of samples from a frame buffer 496. The block of samples 498 is obtained from a previously decoded frame stored in the frame buffer 496. For bi-prediction, two blocks of samples are produced and blended together to produce samples for the decoded PU 452. The frame buffer 496 is populated with filtered block data 492 from an in-loop filtering module 488. As with the in-loop filtering module 368 of the video encoder 114, the in-loop filtering module 488 applies any, some, or all of the DBF, the ALF and the SAO filtering operations. The in-loop filtering module 488 produces the filtered block data 492 from the reconstructed samples 456.
[00076] Fig. 5 is a schematic block diagram showing a collection 500 of available divisions or splits of a block into one or more blocks in the tree structure of versatile video coding. The divisions shown in the collection 500 are available to the block partitioner 310 of the encoder 114 to divide each CTU into one or more CUs according to the Lagrangian optimisation, as described with reference to Fig. 3.
[00077] Although the collection 500 shows only square blocks being divided into other, possibly non-square blocks, it should be understood that the diagram 500 shows the potential divisions but does not constrain the containing block to be square. If the containing block is non-square, the dimensions of the blocks resulting from the division are scaled according to the aspect ratio of the containing block. The particular subdivision of a CTU into one or more CUs by the block partitioner 310 is referred to as the ‘coding tree’ of the CTU. In the context of the present disclosure, a leaf node is a node at which the process of subdivision terminates. The process of subdivision must terminate when the region corresponding to the leaf node is equal to a minimum coding unit size. Leaf nodes resulting in coding units of the minimum size exist at the deepest level of decomposition of the coding tree. The process of subdivision may also terminate prior to the deepest level of decomposition, resulting in coding units larger than the minimum coding unit size.
[00078] At the leaf nodes of the coding tree exist CUs, with no further subdivision. For example, a leaf node 510 contains one CU. At each non-leaf node of the coding tree exists a split into two or more further nodes, each of which could either contain one CU or contain further splits into smaller regions.
[00079] A quad-tree split 512 divides the containing region into four equal-size regions as shown in Fig. 5. Compared to HEVC, versatile video coding achieves additional flexibility with the addition of a horizontal binary split 514 and a vertical binary split 516. Each of the splits 514 and 516 divides the containing region into two equal-size regions. The division is either along a horizontal boundary (514) or a vertical boundary (516) within the containing block.
[00080] Further flexibility is achieved in versatile video coding with the addition of a ternary horizontal split 518 and a ternary vertical split 520. The ternary splits 518 and 520 divide the block into three regions, bounded either horizontally (518) or vertically (520) along ¼ and ¾ of the containing region width or height. The combination of the quad tree, binary tree, and ternary tree is referred to as ‘QTBTTT’ or alternatively as a multi-tree (MT).

[00081] Compared to HEVC, which supports only the quad tree and thus only supports square blocks, the QTBTTT results in many more possible CU sizes, especially considering possible recursive application of binary tree and/or ternary tree splits. The potential for unusual (for example, non-square) block sizes may be reduced by constraining split options to eliminate splits that would result in a block width or height either being less than four samples or in not being a multiple of four samples. Generally, the constraint would apply in considering luma samples. However, the constraint may also apply separately to the blocks for the chroma channels, potentially resulting in differing minimum block sizes for luma versus chroma, for example when the frame data is in the 4:2:0 chroma format.
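The split constraint just described can be sketched with a hypothetical helper that enumerates the sub-block dimensions each split of Fig. 5 would produce and rejects splits yielding a dimension below four samples or not a multiple of four. The split names are labels for this illustration, not syntax from the standard.

```python
def resulting_sizes(width, height, split):
    # Dimensions of the sub-blocks produced by each split of Fig. 5.
    if split == 'QT':
        return [(width // 2, height // 2)] * 4
    if split == 'BT_H':
        return [(width, height // 2)] * 2
    if split == 'BT_V':
        return [(width // 2, height)] * 2
    if split == 'TT_H':  # bounded at 1/4 and 3/4 of the height
        return [(width, height // 4), (width, height // 2), (width, height // 4)]
    if split == 'TT_V':  # bounded at 1/4 and 3/4 of the width
        return [(width // 4, height), (width // 2, height), (width // 4, height)]
    raise ValueError(split)

def split_permitted(width, height, split):
    # Eliminate splits yielding a block width or height below four
    # samples or not a multiple of four (considering luma samples).
    return all(w >= 4 and h >= 4 and w % 4 == 0 and h % 4 == 0
               for w, h in resulting_sizes(width, height, split))
```

For example, a ternary vertical split of an 8x8 region would produce 2-sample-wide outer blocks and so is eliminated, while the same split of a 16x16 region is permitted.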
[00082] Fig. 6 is a schematic flow diagram illustrating a data flow 600 of a QTBTTT (or ‘coding tree’) structure used in versatile video coding. The QTBTTT structure is used for each CTU to define a division of the CTU into one or more CUs. The QTBTTT structure of each CTU is determined by the block partitioner 310 in the video encoder 114 and encoded into the bitstream 115 or decoded from the bitstream 133 by the entropy decoder 420 in the video decoder 134. The data flow 600 further characterises the permissible combinations available to the block partitioner 310 for dividing a CTU into one or more CUs, according to the divisions shown in Fig. 5.
[00083] Starting from the top level of the hierarchy, that is at the CTU, zero or more quad-tree divisions are first performed. Specifically, a Quad-tree (QT) split decision 610 is made by the block partitioner 310. The decision at 610 returning a ‘1’ symbol indicates a decision to split the current node into four sub-nodes according to the quad-tree split 512. The result is the generation of four new nodes, such as at 620, and for each new node, recursing back to the QT split decision 610. Each new node is considered in raster (or Z-scan) order. Alternatively, if the QT split decision 610 indicates that no further split is to be performed (returns a ‘0’ symbol), quad-tree partitioning ceases and multi-tree (MT) splits are subsequently considered.
[00084] Firstly, an MT split decision 612 is made by the block partitioner 310. At 612, a decision to perform an MT split is indicated. Returning a ‘0’ symbol at decision 612 indicates that no further splitting of the node into sub-nodes is to be performed. If no further splitting of a node is to be performed, then the node is a leaf node of the coding tree and corresponds to a coding unit (CU). The leaf node is output at 622. Alternatively, if the MT split 612 indicates a decision to perform an MT split (returns a ‘1’ symbol), the block partitioner 310 proceeds to a direction decision 614.

[00085] The direction decision 614 indicates the direction of the MT split as either horizontal (‘H’ or ‘0’) or vertical (‘V’ or ‘1’). The block partitioner 310 proceeds to a decision 616 if the decision 614 returns a ‘0’ indicating a horizontal direction. The block partitioner 310 proceeds to a decision 618 if the decision 614 returns a ‘1’ indicating a vertical direction.
[00086] At each of the decisions 616 and 618, the number of partitions for the MT split is indicated as either two (binary split or ‘BT’ node) or three (ternary split or ‘TT’) at the BT/TT split. That is, a BT/TT split decision 616 is made by the block partitioner 310 when the indicated direction from 614 is horizontal and a BT/TT split decision 618 is made by the block partitioner 310 when the indicated direction from 614 is vertical.
[00087] The BT/TT split decision 616 indicates whether the horizontal split is the binary split 514, indicated by returning a ‘0’, or the ternary split 518, indicated by returning a ‘1’. When the BT/TT split decision 616 indicates a binary split, at a generate HBT CTU nodes step 625 two nodes are generated by the block partitioner 310, according to the binary horizontal split 514. When the BT/TT split 616 indicates a ternary split, at a generate HTT CTU nodes step 626 three nodes are generated by the block partitioner 310, according to the ternary horizontal split 518.
[00088] The BT/TT split decision 618 indicates whether the vertical split is the binary split 516, indicated by returning a ‘0’, or the ternary split 520, indicated by returning a ‘1’. When the BT/TT split 618 indicates a binary split, at a generate VBT CTU nodes step 627 two nodes are generated by the block partitioner 310, according to the vertical binary split 516. When the BT/TT split 618 indicates a ternary split, at a generate VTT CTU nodes step 628 three nodes are generated by the block partitioner 310, according to the vertical ternary split 520. For each node resulting from steps 625-628 recursion of the data flow 600 back to the MT split decision 612 is applied, in a left-to-right or top-to-bottom order, depending on the direction 614. As a consequence, the binary tree and ternary tree splits may be applied to generate CUs having a variety of sizes.
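The decision flow of steps 610-628 can be sketched as a recursive parse. Here `read_flag()` stands in for entropy-decoded binary symbols, and the returned dictionary is an illustrative representation of the coding tree, not a structure defined by the standard.

```python
def parse_coding_tree(read_flag, allow_qt=True):
    # QT split decision 610: only reachable before any MT split.
    if allow_qt and read_flag():
        return {'split': 'QT',
                'children': [parse_coding_tree(read_flag) for _ in range(4)]}
    # MT split decision 612: '0' means a leaf node (a CU) is output at 622.
    if not read_flag():
        return {'split': None}
    direction = 'V' if read_flag() else 'H'   # direction decision 614
    ternary = read_flag()                     # BT/TT decision 616 or 618
    count = 3 if ternary else 2
    # Children recurse back to the MT split decision 612 only.
    return {'split': ('TT' if ternary else 'BT') + '_' + direction,
            'children': [parse_coding_tree(read_flag, allow_qt=False)
                         for _ in range(count)]}
```

Feeding a symbol sequence to the parser reproduces the recursion order of the data flow 600: quad-tree decisions first, then multi-tree decisions per node.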
[00089] Figs. 7A and 7B provide an example division 700 of a CTU 710 into a number of coding units (CUs). An example CU 712 is shown in Fig. 7A. Fig. 7A shows a spatial arrangement of CUs in the CTU 710. The example division 700 is also shown as a coding tree 720 in Fig. 7B.

[00090] At each non-leaf node in the CTU 710 of Fig. 7A, for example nodes 714, 716 and 718, the contained nodes (which may be further divided or may be CUs) are scanned or traversed in a ‘Z-order’ to create lists of nodes, represented as columns in the coding tree 720. For a quad-tree split, the Z-order scanning results in top left to right followed by bottom left to right order. For horizontal and vertical splits, the Z-order scanning (traversal) simplifies to a top to bottom and left to right scan, respectively. The coding tree 720 of Fig. 7B lists all nodes and CUs according to the applied scan order. Each split generates a list of 2, 3 or 4 new nodes at the next level of the tree until a leaf node (CU) is reached.
[00091] Having decomposed the image into CTUs and further into CUs by the block partitioner 310, and using the CUs to generate each residual block (324) as described with reference to Fig. 3, residual blocks are subject to forward transformation and quantisation by the encoder 114. The resulting transform blocks (TBs) 336 are subsequently scanned to form a sequential list of residual coefficients, as part of the operation of the entropy coding module 338. An equivalent process is performed in the video decoder 134 to obtain transform blocks from the bitstream 133.
[00092] Fig. 8A shows a set 800 of intra prediction modes for a transform block that can be indicated using the intra prediction modes 388 and 458. In Fig. 8A, 67 intra prediction modes are defined. Mode 0 is a ‘planar’ intra prediction mode, mode 1 is a ‘DC’ intra prediction mode and modes 2-66 are ‘angular’ intra prediction modes. The planar intra prediction mode (mode 0) populates a prediction block with samples according to a plane, i.e. having an offset and a gradient horizontally and vertically. The plane parameters are obtained from neighbouring reference samples, where available. Likewise, the DC intra prediction mode (mode 1) populates the prediction block with an offset, also using neighbouring reference samples (where available).
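As an illustration of the DC mode just described, the sketch below fills a prediction block with the average of the neighbouring reference samples. The rounding and the substitute value of 128 used when no neighbours are available are assumptions for this illustration, not values taken from the standard.

```python
def dc_prediction(top_refs, left_refs, width, height, substitute=128):
    # DC intra prediction (mode 1): a single offset derived from the
    # neighbouring reference samples populates the whole block.
    refs = top_refs + left_refs
    if refs:
        dc = (sum(refs) + len(refs) // 2) // len(refs)  # rounded average
    else:
        dc = substitute  # no neighbours (e.g. block abuts the frame boundary)
    return [[dc] * width for _ in range(height)]
```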
[00093] The angular intra prediction modes (modes 2 to 66) populate the block by producing a texture aligned to one of 65 directions, or ‘angles’. For clarity, only a subset of the 65 angles are shown in Fig. 8A, being modes 2, 18, 34, 50, and 66. For each mode, neighbouring reference samples are used to produce a texture that populates the prediction block in the direction indicated by the arrow for the angular intra prediction mode. Additional angles, not explicitly shown in Fig. 8A, are at intermediate positions (i.e. modes 3-17, 19-33, 35-49, and 51-65). A first symmetry is evident from Fig. 8A along angular mode 34 and in a diagonal direction from the top-left to the bottom right of the prediction block. From the first symmetry, modes 2 to 34 are shown to correspond to modes 66 down to 34, with a transposition along this diagonal axis.
[00094] Fig. 8B shows a table 850 describing a mapping 840 from intra prediction modes to secondary transform sets for a transform block. The intra prediction mode (388 or 458) selects one mode out of sixty-seven possible modes. Mode 0 is the ‘planar mode’ of Fig. 8A. In planar mode, a prediction block of samples is a plane, having offset and horizontal and vertical gradient established from neighbouring reference samples from previously reconstructed coding units. Mode 1 is the ‘DC’ mode, in which a prediction block of samples is set to a single value, also being derived from neighbouring reference samples from previously reconstructed coding units. Where neighbouring reference samples are not available (e.g. as the current prediction block abuts the frame boundary) the reference samples are either not used or a substitute value is used.
[00095] Modes 2-66 are the ‘angular’ intra prediction modes. Angular intra prediction modes populate the prediction block by projecting samples derived from the neighbouring reference samples onto the prediction block according to one of sixty-five angles (or directions). A different number of angular modes may be used, for example the high efficiency video coding standard used thirty-three angular modes. The angles possess a symmetry, as described with reference to Fig. 8A, along the middle angle (intra prediction mode 34), which corresponds to a diagonal down-right direction. As such, the mapping 840 reduces the 67 discrete values of the intra prediction mode to a set index having 35 discrete values.
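The table 840 itself is not reproduced here, but a reduction from 67 modes to 35 set values that is consistent with the symmetry about mode 34 can be sketched as follows; the exact fold-back formula is an assumption for illustration.

```python
def nsst_set_index(intra_mode):
    # Planar (0), DC (1) and angular modes up to the diagonal mode 34
    # map directly; modes beyond 34 fold back via the symmetry about 34.
    if intra_mode <= 34:
        return intra_mode
    return 68 - intra_mode
```

Under this mapping modes 35 to 66 reuse the sets of modes 33 down to 2, giving 35 distinct set values in total.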
[00096] As a consequence of the projection of reference samples onto the prediction block according to an angle associated with a given angular intra prediction mode, the residual samples for the prediction block typically also have a directional bias. Previous attempts to exploit the directional bias tended towards replacing the primary transform with an alternative transform, having basis functions that were dependent upon the intra prediction mode, and attempted to further compress the residual in a mode-dependent manner. The previous approaches tended to be prohibitively complex, especially as the approaches typically entailed replacing the separable primary transform with a non-separable primary transform. As such, complexity was greatly increased, especially for larger block sizes. The use of non-separable transforms as a secondary transform stage limits the complexity increase by reducing the area of application of the non-separable transform to a region not exceeding the upper-left 8x8 samples in each transform block. The upper left 8x8 region tends to contain a large proportion of significant residual coefficients of greater magnitude. Accordingly, the upper left 8x8 region also benefits disproportionately from transforming a subregion of the entire transform block in a mode-dependent manner compared to transforming the entire transform block in a mode-dependent manner. The secondary transform operates upon residual coefficients that are then transformed via the primary transform (from the decoder perspective). Accordingly, the basis functions of the secondary transform are somewhat different to those typical of other transforms such as a discrete cosine transform for example. The set index is one parameter contributing to the selection of a particular transform matrix for use to perform the secondary transform.
[00097] Fig. 9 is a schematic block diagram showing an inverse secondary transform module 900. The inverse secondary transform module 900 represents the inverse secondary transform module 344 of the video encoder 114 of Fig. 3 or the inverse secondary transform module 436 of the video decoder 134 of Fig. 4. The inverse secondary transform module 900 operates identically in the video encoder 114 and the video decoder 134. Moreover, the forward secondary transform module 330 operates in accordance with the inverse secondary transform module 344. In execution of the module 900, an intra prediction mode 908
(corresponding to the intra prediction mode 388 or 458) is provided to a set table module 910. The set table module 910 determines a set index, represented by an arrow 914, for the intra prediction mode 908 according to the mapping 840 of Fig. 8B. Typically, the set index 914 is a value from 0 to 34, although different ranges may be used.
[00098] Also input to the secondary transform module is an NSST index 904 (corresponding to the NSST index 390 or 454). The NSST index 904 is a value from 0 to 3, with a value of 0 indicating that the non-separable secondary transform is to be bypassed. A block size, indicated by an arrow 918, selects between NSSTs of size 4x4 or NSSTs of size 8x8. The selection between the two sizes is dependent on the transform block size and described with reference to Fig. 10. The NSST index 904, along with the set index 914 and the block size 918, are supplied to a matrix coefficient table module 920 to select matrix coefficients, represented as an arrow 925. The matrix coefficient table module 920 selects the matrix coefficients 925 from 3x35 sets of matrix coefficients stored in the matrix coefficient table 920. The matrix coefficients 925 are supplied to a matrix multiply module 930. The matrix multiply module 930 operates on incoming coefficients 942 (corresponding to coefficients 342 or 432) to produce coefficients after the secondary transform. The coefficients generated by the module 930, indicated by an arrow 946, correspond to the coefficients 346 or 440.

[00099] In its conceptually simplest form, matrix multiplication may be used to implement the transform module 900, the matrix multiplication being provided with a set of multiplication coefficients. Alternative formulations of the secondary transform may be implemented using hypercubic Givens rotations, resulting in a transform structure of lower complexity, by virtue of the butterfly structure. Complexity associated with multiplication is further restrained by limiting the application of the secondary transform, as described with reference to Fig. 10. Notwithstanding the attempts at complexity reduction, the need for the video encoder 114 to select a value of NSST index imposes a complexity burden.
The complexity burden is present even in implementations using a two-stage search process, for example an approximate search followed by a precise search.
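A minimal sketch of the matrix-multiplication form of the module 900 follows. The table layout and the function signature are assumptions for illustration; an NSST index of 0 bypasses the transform as described above.

```python
def inverse_secondary_transform(coeffs, nsst_index, set_index, use_8x8, tables):
    # coeffs: flat list of 16 (4x4) or 64 (8x8) incoming coefficients.
    if nsst_index == 0:
        return list(coeffs)  # NSST bypassed
    n = 64 if use_8x8 else 16
    # One of 3 x 35 matrices per size, indexed by set and NSST index 1..3.
    matrix = tables[use_8x8][set_index][nsst_index - 1]
    # Dense matrix-vector product: the conceptually simplest realisation.
    return [sum(matrix[r][c] * coeffs[c] for c in range(n)) for r in range(n)]
```

A butterfly (Givens-rotation) realisation would compute the same product with fewer multiplications; the dense form above is shown only for clarity.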
[000100] Fig. 10 shows a set 1000 of transform blocks available in the versatile video coding standard. Fig. 10 also shows the application of the secondary transform to a subset of residual coefficients from a transform block. A variety of transform block sizes are shown in Fig. 10. However, one transform block size available in the VVC standard that is not shown in Fig. 10 is 64x64.
[000101] A 4x4 transform block 1010 has a 4x4 NSST 1052 (shown with dark shading) applied to all 16 residual coefficients. The 4x4 NSST 1052 is applied to regions, as far as possible, in the upper-left 8x8 coefficients of a transform block in a tiled manner (i.e. at each 4x4 sub-block) for transforms having a width or height of 4 samples. Thus, the 4x4 NSST 1052 is applied to a 4x8 transform block 1020 and an 8x4 transform block 1012, i.e. in sub-blocks having dark shading, with a dashed outline.
[000102] For larger transform sizes an 8x8 NSST 1050 (shown with light shading) is available for application to residual coefficients in the upper-left 8x8 region of the transform block. The 8x8 NSST 1050 is applied to an 8x8 transform block 1022, a 16x8 transform block 1024, an 8x16 transform block 1032, a 16x16 transform block 1034, a 32x16 transform block 1036, a 16x32 transform block 1044, and a 32x32 transform block 1046, in each case in the region shown with light shading and a dashed outline. The 8x8 NSST 1050 is also applied in the upper-left 8x8 region of a 64x64 transform block (not shown in Fig. 10). Other sizes of transform block are described below with reference to the NSST prohibited set 1060.
[000103] According to the methods described, a restriction is placed on the application of the NSST to prohibit application to more elongated blocks: blocks for which the aspect ratio (as measured from the longest side relative to the shorter side) exceeds a predetermined threshold do not use the secondary transform. The predetermined threshold is typically an aspect ratio of 2:1 (longest side relative to shortest side). As such, use of the secondary transform is prohibited for transform blocks of size 4x16 (i.e. 1030), 4x32 (i.e. 1040), 8x32 (i.e. 1042), 16x4
(i.e. 1014), 32x4 (i.e. 1016), and 32x8 (i.e. 1026). Collectively, the prohibited sizes form an NSST prohibited set 1060. Shading for the 4x4 NSST 1052 is shown in the blocks 1014, 1016, 1030 and 1040. Shading for the 8x8 NSST 1050 is shown in the blocks 1026 and 1042. However, in the arrangements described in relation to Fig. 11, the blocks 1014, 1016, 1030, 1040, 1042 and 1026 are not processed by the non-separable secondary transform module 330 and an NSST index 390 is not determined or encoded into the bitstream 115. The arrangements described accordingly remove the computational overhead that would otherwise be associated with applying a non-separable secondary transform to the blocks 1014, 1016, 1030, 1040, 1042 and 1026.
[000104] In the arrangements described the aspect ratio is measured from the longest side relative to the shortest side and application of the NSST is determined based on whether the aspect ratio satisfies (is less than or equal to) the threshold. In other arrangements, the aspect ratio can be measured from the shortest side relative to the longest side, and the threshold would be of the form 1:2. The threshold is typically determined based upon factors such as computational capability of the module 201, expected processing time, expected throughput and the like.
[000105] Transforms belonging to the NSST prohibited set 1060 are deemed to receive insufficient compression gain from application of the NSST to justify the added complexity in the mode search performed in the video encoder 114. Transform sizes shown in Fig. 10 other than those in the NSST prohibited set 1060 (and including the 64x64 transform block size not shown in Fig. 10) are the transform block sizes for which NSST can be applied. When the NSST is not applied the NSST index is equal to zero and the value need not be coded in the bitstream.
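As a sketch, the eligibility test implied by the prohibited set 1060 reduces to a one-line aspect ratio check (the function name and parameterisation are illustrative, not from the specification):

```python
def nsst_allowed(width, height, max_ratio=2):
    """True if the NSST may be applied to a width x height transform block:
    the longest side must not exceed max_ratio times the shortest side."""
    return max(width, height) <= max_ratio * min(width, height)
```

With max_ratio=2 this admits sizes such as 8x16 and 32x16 but rejects every member of the set 1060 (4x16, 4x32, 8x32, 16x4, 32x4 and 32x8).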
[000106] Fig. 11 is a flow chart diagram of a method 1100 for selectively applying a non-separable secondary transform (NSST) to encode a transform block of residual coefficients into the bitstream 115. The method 1100 may be embodied by apparatus such as a configured FPGA, an ASIC, or an ASSP. Additionally, the method 1100 may be performed by the video encoder 114 under execution of the processor 205. As such, the method 1100 may be stored on a computer-readable storage medium and/or in the memory 206. The method 1100 commences with the processor 205 at a determine block structure step 1110. [000107] At the determine block structure step 1110 the block partitioner 310, under execution of the processor 205, firstly divides the frame data 113 into a sequence of coding tree units having a fixed size. Each coding tree unit is further divided into one or more coding units in a manner adaptive to the samples in the coding tree unit. As described with reference to Fig. 3, as a result of the operation of the block partitioner 310, a set of coding units 312 is determined for a coding tree unit. Each of the coding units 312 is the result of hierarchical splits in a coding tree, as described with reference to Figs. 5 and 6, and exemplified in Fig. 7. Each coding unit has an associated prediction unit 320, and the prediction unit 320 in turn has an associated intra prediction mode 388. Moreover, each coding unit 312 also has an associated transform unit 336. The transform unit 336 in turn has an associated transform block for each colour channel of the frame data 113. Control in the processor 205 progresses from the step 1110 to a determine transform block size step 1120.
[000108] At the determine transform block size step 1120 the block partitioner 310, under execution of the processor 205, determines the size of a transform block for each colour channel of the frame data 113 according to a current coding unit. If the coding unit size in luma samples corresponds to one of the transform block sizes of Fig. 10 (including a 64x64 transform block size), a single transform unit is associated with the coding unit. If none of the transform block sizes of Fig. 10 (including the 64x64 transform block size) match the coding unit size in luma samples, multiple transform units are arranged in a tiled manner to occupy the coding unit. The largest available transform block is used to minimise the number of ‘tiled’ transform units. For example, a 128x128 coding unit may be represented using four 64x64 transform units. For the luma channel, a transform block exists of the same size as the transform unit in luma samples. For each chroma channel, when a 4:2:0 chroma format is in use, a transform block exists of half the width and height of the luma channel transform block, the reduction in size being a consequence of the chroma subsampling. For example, a 32x16 transform unit is associated with one 32x16 luma transform block and two 16x8 chroma transform blocks.
Control in the processor 205 progresses from the step 1120 to a determine intra prediction mode step 1125.
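The tiling and chroma halving of step 1120 can be illustrated with a small sketch (assuming a 64-sample maximum transform dimension and the 4:2:0 format; the helper name and return shape are hypothetical):

```python
def transform_blocks(cu_w, cu_h, max_tb=64):
    """Tile a coding unit into the fewest transform units (each side capped
    at max_tb), and derive the luma and 4:2:0 chroma block sizes per unit."""
    tu_w, tu_h = min(cu_w, max_tb), min(cu_h, max_tb)
    origins = [(x, y) for y in range(0, cu_h, tu_h) for x in range(0, cu_w, tu_w)]
    luma = (tu_w, tu_h)              # luma block matches the transform unit
    chroma = (tu_w // 2, tu_h // 2)  # halved per side for 4:2:0 subsampling
    return origins, luma, chroma
```

A 128x128 coding unit yields four 64x64 transform units; a 32x16 coding unit yields one unit with a 32x16 luma block and 16x8 chroma blocks, matching the example above.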
[000109] At the determine intra prediction mode step 1125 the mode selector 386, under control of the processor 205, selects an intra prediction mode 388 for the prediction unit associated with the current coding unit. The intra prediction mode 388 is typically determined using the modes of Fig. 8A and Fig. 8B as candidate modes. The selection is typically performed in two passes. In a first pass, all intra prediction modes for the luma prediction block are tested. For each intra prediction mode, the residual cost is approximated, for example using a ‘sum of absolute transformed differences’ (SATD) test, such as a Hadamard transform. From the executed test a list of ‘best’ (lowest distortion) candidate prediction modes is derived. A full test of the residual coding is performed on the list of candidate prediction modes. This full test of residual coding is further described in relation to steps 1130 to 1190. Control in the processor 205 progresses from the step 1125 to an apply primary transform step 1130.
[000110] At the apply primary transform step 1130 the transform module 326, under execution of the processor 205, performs a separable transform (a set of one-dimensional transforms performed horizontally and vertically) spanning the entire transform block on the
difference 324, to produce the intermediate transform coefficients 328. Control in the processor 205 progresses from the step 1130 to an NSST applied test step 1140.
[000111] At the NSST applied test step 1140 the dimensions of the transform block are used to determine an aspect ratio of the transform unit, and the aspect ratio is used to determine if an NSST is to be applied to the intermediate transform coefficients 328. The aspect ratio is expressed in terms of the longest side dimension to the shorter (or equal) side dimension. For example, block sizes of 4x8 and 8x4 are both considered to have an aspect ratio of 2:1 (rather than 1:2 and 2:1 respectively). The determined measure of aspect ratio is used to determine the degree of ‘elongation’ of the transform block. Blocks having an aspect ratio exceeding 2:1 are deemed to be ‘highly elongated’, and belong to the set 1060. Should the current transform block have an aspect ratio of more than 2:1, thereby belonging to the set 1060, the step 1140 returns “No”. Upon returning “No” at step 1140, no indication of a selection for NSST is encoded into the bitstream 115 and control in the processor 205 progresses to a quantise coefficients step 1180. Otherwise, if step 1140 returns “Yes” (the aspect ratio is 2:1 or less), an indication of a selection for NSST will be encoded into the bitstream 115 and control in the processor 205 progresses to a determine NSST index step 1150.
[000112] At the determine NSST index step 1150 the mode selector 386, under control of the processor 205, sets different NSST index values (zero to three) for the current transform block. The step 1150 executes to set four different index values for the current transform block, so that the full range of potential NSST indices is set. The method 1100 continues from step 1150 to a determine NSST matrix step 1160. [000113] Step 1160 operates to determine a corresponding NSST matrix for each of the NSST indices set at step 1150. A value of zero for the NSST index 390 indicates that the NSST is bypassed, and no NSST matrix is set. Values from one to three indicate selection of one of three possible transform matrices for the transform block. A corresponding NSST matrix is selected for each of the NSST index values at the determine NSST matrix step 1160 for the intra prediction mode 388, as described with reference to Fig. 9. The method 1100 continues from step 1160 to an apply secondary transform step 1170.
[000114] Using the NSST matrix determined at step 1160 for each NSST index, an evaluation of the secondary transform is applied at the step 1170 to produce the transform coefficients 332. When the intra prediction mode is greater than 34 (mode 34 being a 45 degree diagonal down-left mode and modes 35-66 being symmetrical with modes 33 down to 2), the block of coefficients is transposed before and after application of the secondary transform at step 1170. The transposition exploits the symmetry of the angles of each intra prediction mode around mode 34 to reduce the number of NSST matrices. Of the NSST indices under test in steps 1150 to 1170, the NSST index resulting in the lowest coding cost of the transform coefficients 332 is selected as the NSST index 390 for the current transform block. Execution of step 1170 also considers the coding cost (“rate”) of the associated residual and resulting error (“distortion”) and is known as a ‘rate-distortion’ check. For reduced complexity, approximations of either or both of the rate and the distortion may be used. An encode NSST index step 1195 is also flagged for subsequent execution by the processor 205. Control in the processor 205 progresses from step 1170 to the quantise coefficients step 1180.
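The transposition around mode 34 can be sketched as a wrapper over a core transform routine (here nsst_fn is a stand-in for the matrix multiplication, which is not reproduced; the wrapper name is mine):

```python
def apply_nsst_with_symmetry(coeffs, intra_mode, nsst_fn):
    """Transpose the coefficient block before and after the secondary
    transform for intra modes above the 45-degree diagonal (mode 34), so one
    NSST matrix serves each mirrored pair of angular modes."""
    transpose = lambda m: [list(row) for row in zip(*m)]
    if intra_mode > 34:
        return transpose(nsst_fn(transpose(coeffs)))
    return nsst_fn(coeffs)
```

The same wrapper applies on the decoder side at step 1280, where the inverse secondary transform is substituted for nsst_fn.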
[000115] At the quantise coefficients step 1180 the quantiser module 334, under execution of the processor 205, performs quantisation. The transform coefficients 332 are quantised at step 1180 if the determination of the NSST applied test step 1140 indicated application of the NSST. The intermediate transform coefficients 328 are quantised at step 1180 if the determination of the NSST applied test step 1140 indicates no application of the NSST. A quantisation parameter is applied at execution of step 1180 to produce residual coefficients 336, with the quantisation parameter representing a step size. A larger quantisation parameter results in residual coefficients of smaller magnitude, which may be compressed into fewer bits. Larger quantisation parameters result in more residual coefficients having a value of zero (being ‘insignificant’). Control in the processor 205 progresses from step 1180 to an encode residual coefficients step 1190. [000116] At the encode residual coefficients step 1190 the entropy encoder 338, under execution of the processor 205, encodes the residual coefficients 336 into the bitstream 115. The residual coefficients 336 are scanned into a list of coefficients, and coded into the bitstream 115 in an order from the last significant residual coefficient back to the top-left residual coefficient. The scan order may partition the transform block into sub-blocks, with a flag used to indicate the presence of significant residual coefficients at the sub-block level. Further flags are used to indicate the presence of significant residual coefficients within each sub-block. The flags may use context coded bins to achieve higher compression efficiency, given the bias towards insignificant residual coefficients, especially at higher quantisation parameter values. Bypass coded bins may be used to further encode the magnitude of residual coefficients.
In particular, a truncated Golomb Rice coding scheme, with a Rice parameter adaptive to residual coefficient magnitude, may be employed. The intra prediction mode 388 is also encoded into the bitstream. The intra prediction mode is typically encoded either as the selection of one mode out of several modes populating a ‘most probable mode list’ or as a ‘remaining mode’ coding scheme, enabling coding of modes not included in the most probable mode list.
[000117] The method 1100 progresses under control of the processor 205 from step 1190 to a check NSST encoding step 1193. The step 1193 operates to check if the encode NSST index step 1195 is flagged. If the encode NSST index step 1195 was flagged for execution at step 1170 (“Yes” at step 1193), control in the processor 205 progresses from step 1193 to step 1195. If the encode NSST index step 1195 was not flagged for execution (“No” at step 1193), the method 1100 terminates after execution of step 1190.
[000118] At the encode NSST index step 1195 the entropy encoder 338, under control of the processor 205, encodes the NSST index determined from the step 1150 into the bitstream 115. Typically, a truncated unary (TU) bin string is used, as NSST indices of zero are most likely and thus use the shortest bin string. Methods of using a truncated unary bin string to encode the NSST index 390 in the bitstream are known in relation to the VVC standard. When the intra prediction mode 388 is DC or planar, the NSST index 390 is restricted in value to zero to two (the value three is prohibited), otherwise the NSST index 390 is in the range of zero to three. The truncated unary binarisation is truncated according to this range constraint. Moreover, each bin in the bin string uses a separate context and separate sets of contexts are used based on the intra prediction mode 388 being DC or planar vs being an angular mode. The method 1100 terminates after execution of step 1195. [000119] Fig. 12 is a flow chart diagram of a method 1200 for decoding a transform block of residual coefficients from the bitstream 133 by selectively applying a non-separable secondary transform. The method 1200 may be embodied by apparatus such as a configured FPGA, an ASIC, or an ASSP. Additionally, the method 1200 may be performed by the video decoder 134 under execution of the processor 205. As such, the method 1200 may be stored on a computer-readable storage medium and/or in the memory 206.
[000120] The method 1200 commences with the processor 205 at a decode block structure step 1210. By virtue of limiting application of the non-separable secondary transform to a subset of transform block sizes, in particular, to sizes for which selection of a non-zero NSST index is more likely, the computational complexity of the video encoder 114 and the video decoder 134 is reduced, without excessive penalty in terms of compression efficiency.
[000121] At the decode block structure step 1210 the entropy decoder 420, under execution of the processor 205, decodes a coding tree from the bitstream 133 for a coding tree unit, thereby decoding the block structure. The coding tree unit is determined by the block partitioner module 310 at the time of encoding the bitstream. As a result of decoding the coding tree, the coding tree unit is divided into one or more coding units, as described with reference to Figs. 5 and 6. Control in the processor 205 progresses from step 1210 to a determine transform block size step 1220.
[000122] At the determine transform block size step 1220 the video decoder 134, under execution of the processor 205, determines the size of transform blocks for a coding unit (with an iteration over all coding units of the coding tree unit in accordance with the coding tree of the step 1210). If the size of the coding unit in luma samples is equal to one of the transform block sizes of Fig. 10, or 64x64 luma samples, one transform unit is associated with the coding unit. If the size of the coding unit in luma samples is larger than any of the transform block sizes of Fig. 10, or 64x64 luma samples, transform units of the largest available size are tiled to occupy the coding unit. For example, a 128x128 coding unit is occupied by four 64x64 transform units. Further, for the luma channel a transform block size equal to the transform unit size is determined. For each of the chroma channels, a transform block size of half the width and half the height of the transform unit size is determined. The halving in side dimensions for the chroma blocks is a consequence of use of the 4:2:0 chroma format. Control in the processor 205 progresses from step 1220 to a determine intra prediction mode step 1225. [000123] At the determine intra prediction mode step 1225, an intra prediction mode (458) is determined for a prediction block. Each coding unit is associated with a prediction unit and each prediction unit is associated with one prediction block per colour channel. A luma intra prediction mode is decoded for the luma prediction block and a chroma intra prediction mode is decoded for application to both of the chroma prediction blocks. A list of most probable modes is determined by the entropy decoder 420 and a context coded bin is decoded from the bitstream 133 indicative of use of a mode from the most probable mode list. Either an index into the most probable mode list or an indication of the selection of one of the remaining modes is decoded from the bitstream 133. 
A chroma intra prediction mode is also decoded from the bitstream 133. For the chroma intra prediction mode a syntax element specifies either the use of one of a subset of the 67 available intra prediction modes (Figs. 8A and 8B), or the use of the luma intra prediction mode of the collocated luma prediction block (“DM CHROMA”).
[000124] In some arrangements the coding tree for chroma coding units may differ from the coding tree for luma coding units. Generally, the coding tree for chroma coding units may be as deep as, but not deeper than, the coding tree for luma coding units. In arrangements using different coding trees for luma and chroma coding units, the chroma prediction block may be collocated with several luma prediction blocks. If so, a list of candidate modes for
DM CHROMA is prepared, from luma prediction blocks that may be collocated with various locations within the chroma prediction block. For example, the four corners and centre of the chroma prediction block may serve as locations for the generation of up to five distinct prediction modes. In arrangements using a list of candidate modes, an index is further decoded from the bitstream 133 when DM CHROMA is indicated to select which one of the luma candidate intra prediction modes is to be applied to the chroma prediction blocks.
[000125] Further, separate intra prediction modes may be independently coded for each chroma channel. Separate control of the prediction mode in each chroma channel can enable the block partitioner 310 to less frequently split the chroma coding tree, resulting in fewer chroma transform blocks and hence fewer associated sets of residual coefficients in the bitstream 133.
In particular, for frame data 113 represented in decorrelated colour spaces such as ‘YCbCr’, separate control of the intra prediction mode of each chroma channel may offer greater adaptation to the contents of the particular chroma channel. For frame data 113 represented in more correlated colour spaces such as ‘RGB’, such independent control is less likely to be beneficial. Control in the processor 205 progresses from step 1225 to a decode residual coefficients step 1230. [000126] At the decode residual coefficients step 1230 the entropy decoder 420, under execution of the processor 205, decodes residual coefficients 424 from the bitstream 133 for a transform block. Decoding residual coefficients is performed for each transform block associated with a transform unit, there generally being one transform unit associated with each coding unit. For each transform block a sequence of residual coefficients is decoded and assembled into a two-dimensional array (i.e. 424) according to a scan pattern. Scanning generally progresses from the last significant residual coefficient back to the top-left residual coefficient, with the scan grouping coefficients in each 4x4 sub-block of the array. Control in the processor 205 then progresses from step 1230 to an inverse quantise step 1240.
[000127] At the inverse quantise step 1240 the dequantiser 428, under execution of the processor 205, applies a quantisation parameter to convert the residual coefficients 424 into the reconstructed intermediate transform coefficients 432. Control in the processor 205 progresses from step 1240 to an NSST applied test step 1250.
[000128] At the NSST applied test step 1250 the video decoder 134, under execution of the processor 205, uses the determined transform block size from the step 1220 to determine if an inverse non-separable transform is to be applied. A block aspect ratio is derived in execution of step 1250 based on the longest side to the shorter (or equal length) side. The block aspect ratio thus forms a measure of the degree of elongation of the transform block, regardless of the orientation of the transform block. For example, transform blocks of size 8x32 and 32x8 both have a block aspect ratio of 4:1. If the transform block has a block aspect ratio exceeding a predetermined threshold, typically 2:1 (in other words the transform block size is one of the prohibited sizes 1060 of Fig. 10), a presence of an NSST selection in the bitstream 133 is not found. The step 1250 returns “No”, and control in the processor 205 progresses to an apply primary transform step 1290. Otherwise (if the aspect ratio does not exceed 2:1), the step 1250 returns “Yes”, determining a presence of an NSST selection for the transform block in the bitstream 133. If the step 1250 returns “Yes”, control in the processor 205 progresses to a decode NSST index step 1260. Accordingly, presence of an NSST selection for the transform block, implemented via the NSST index 390, is determined based upon the determined aspect ratio.
[000129] At the decode NSST index step 1260 the entropy decoder 420, under execution of the processor 205, decodes the NSST index 454 from the bitstream 133. When the intra prediction mode 458 of the associated prediction block is planar or DC (mode 0 or 1 respectively), the NSST index 454 is a value in the range of zero to two. When the intra prediction mode of the associated prediction block is an angular intra prediction mode, the NSST index is a value in the range of zero to three. In each case, the NSST index 454 is decoded by parsing a truncated unary string, with separate contexts for each bin of the truncated unary binarisation and separate sets of contexts for each of the two NSST index ranges. Decoding the NSST index 454 from the bitstream is known in relation to the VVC standard. Control in the processor 205 progresses from step 1260 to a determine NSST matrix step 1270.
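The parsing of the truncated unary NSST index can be sketched as follows (a simplified model that ignores the per-bin contexts of the arithmetic coder; the function name is mine):

```python
def decode_nsst_index_tu(bins, intra_mode):
    """Parse a truncated-unary-coded NSST index from a sequence of bins.
    The maximum value is 2 for planar/DC (modes 0 and 1) and 3 for angular
    modes; at the maximum value, the terminating '0' bin is omitted."""
    max_val = 2 if intra_mode in (0, 1) else 3
    it = iter(bins)
    index = 0
    # Count leading '1' bins, stopping at a '0' or at the maximum value.
    while index < max_val and next(it) == 1:
        index += 1
    return index
```

For example, the bin string 1 1 decodes to index 2 for a planar-predicted block (no terminator needed), while 1 0 decodes to index 1 for an angular mode.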
[000130] At the determine NSST matrix step 1270 the inverse secondary transform
module 436, under execution of the processor 205, determines which matrix is to be used to implement the non-separable secondary transform. The intra prediction mode 458 is mapped via the set table 910 to a set index 914. If the NSST index 454 is non-zero, the set index 914 is used in conjunction with the NSST index 454 to select one set of matrix multiplication coefficients 925 from the matrix coefficient table 920. Steps 1260 and 1270 effectively operate to decode the NSST selection for the transform block from the bitstream 133.
[000131] Control in the processor 205 progresses from step 1270 to an apply secondary inverse transform step 1280. The matrix multiplication coefficients 925 are used in conjunction with the reconstructed intermediate transform coefficients 432 to produce the reconstructed transform coefficients 440. If the NSST index 454 has a value of zero, the matrix multiplication 930 is bypassed. The secondary transform has been described with reference to a matrix
multiplication operation, i.e. at the step 1280 the processor performs the matrix
multiplication 930. Alternatively, the secondary transform may also be implemented as a butterfly structure using Givens rotations for reduced complexity at step 1280. Application of the NSST matrix operates to assist in decoding the transform block.
[000132] If the intra prediction mode is greater than 34 (diagonal mode), the block of coefficients is transposed before and after the secondary inverse transform is applied in execution of the apply secondary transform step 1280. The secondary transform selection is also dependent upon the transform block size, with separate transforms for 4x4 and 8x8 groups of coefficients defined, and applied to transform blocks of sizes as described with reference to Fig. 10. Control in the processor 205 progresses from step 1280 to the apply primary transform step 1290.
[000133] At the apply primary transform step 1290 the inverse transform module 444, under execution of the processor 205, performs a primary inverse transform on the reconstructed transform coefficients 440 to produce the residual samples 448. The primary inverse transform is applied as a separable transform, applied vertically and horizontally over the entire transform block. As a result of the step 1290 the residual samples 448 are available for summing with samples from the prediction block (i.e. 452) to form reconstructed samples 456. The reconstructed samples 456 are further used as reference samples for intra prediction for future prediction blocks (i.e. via caching in the module 460) and output after in-loop filtering (i.e. 488), such as deblocking, is applied. The steps 1280 and 1290 operate to decode the transform block. The method 1200 terminates upon execution of the step 1290.
[000134] As described in the methods 1100 and 1200, the NSST index (390 or 454) is present in the bitstream (115 or 133 respectively) after the residual coefficients. The NSST index may be coded once for a coding unit or once for each transform block within a coding unit. When the NSST index 390 is coded once for a coding unit, at the step 1150 the processor 205 evaluates the NSST index 390 across all transform blocks associated with the coding unit. The cost of evaluation of the NSST index may be reduced by considering application only when more than one residual coefficient in the coding unit is significant. When this is not the case (i.e. the coding unit contains only zero or one significant residual coefficient), the NSST is not applied (and the steps 1140 and 1250 operate accordingly). For any transform block coded using a ‘transform skip’ mode, whereby the primary transform is bypassed, the secondary transform is also bypassed. For a coding unit where all transform blocks are coded using the transform skip mode, the NSST index is not signalled, as there is no transform block for which a secondary transform will be applied.
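The signalling conditions of this paragraph can be collected into a small predicate (a sketch; the pair representation of each transform block is an assumption of this illustration, not part of the specification):

```python
def nsst_index_signalled(blocks):
    """Decide whether the NSST index is coded for a coding unit, given a
    list of (transform_skip, num_significant_coeffs) pairs, one per
    transform block (a hypothetical representation)."""
    if all(skip for skip, _ in blocks):
        return False  # every block bypasses the primary transform
    total_significant = sum(n for _, n in blocks)
    # More than one significant coefficient is required across the unit.
    return total_significant > 1
```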
[000135] The determine NSST index step 1150 introduces relatively high complexity owing to evaluation of each possible NSST index for each transform block, the evaluation involving performing the non-separable secondary transform for each non-zero NSST index 390. The determine NSST index step 1150 may execute at increased speed via early search termination methods. In one example, if after testing a transform block at a given NSST index value there are no significant residual coefficients (i.e. application of the NSST resulted in eliminating all significant coefficients from the applied region, and none were present in any unapplied region of the transform block), searching for further NSST index values is aborted. When determining the initial candidate intra prediction modes (i.e. the first pass of the step 1125), instead of testing all NSST index values, just the value zero and one non-zero value (for example, one) may be used. This is based on the expectation that different NSST index values should produce the same result when measuring distortion under the simplified SATD approximation. [000136] In an arrangement of the methods 1100 and 1200, the aspect ratio beyond which NSST application is prohibited (at steps 1140 and 1250) is set to 2:1. Alternatively, the aspect ratio threshold may be set to 4:1. Using the aspect ratio threshold of 4:1 results in prohibited transform block sizes of 4x32 and 32x4. The less severe restriction on the prohibition of applying NSST to transform blocks using the 4:1 aspect ratio results in less compression efficiency loss compared to having no prohibition, however a smaller complexity reduction is realised.
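The early-termination search of step 1150 described above might be sketched as follows, where apply_nsst and cost_fn stand in for the secondary transform and the rate-distortion cost (both hypothetical placeholders):

```python
def search_nsst_index(coeffs, candidates, apply_nsst, cost_fn):
    """Pick the NSST index with the lowest cost, aborting the search as soon
    as an index leaves no significant coefficients anywhere in the block."""
    best_index, best_cost = 0, cost_fn(coeffs)  # index 0 means NSST bypassed
    for idx in candidates:
        transformed = apply_nsst(coeffs, idx)
        cost = cost_fn(transformed)
        if cost < best_cost:
            best_index, best_cost = idx, cost
        if not any(any(row) for row in transformed):
            break  # no significant coefficients remain: abort the search
    return best_index
```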
[000137] In another arrangement of the methods 1100 and 1200, the threshold aspect ratio beyond which NSST application is prohibited (at steps 1140 and 1250) is set to 1:1.
Accordingly, NSST is only implemented in the bitstream for square blocks such as the transform blocks 1010, 1022, 1034, and 1046 shown in Fig. 10, and the 64x64 transform block not shown in Fig. 10. Arrangements that only allow NSST for square blocks have a lower encoder complexity compared to an arrangement which also allows NSST for at least some non-square block sizes. However, the absence of NSST for at least some non-square block sizes reduces the coding gain that can be achieved.
[000138] In yet another arrangement of the methods 1100 and 1200, the NSST index (390 or 454) may be signalled once per transform unit, with one transform unit corresponding to one coding unit, rather than once per transform block. In arrangements where the NSST index is signalled once per transform unit, the determine NSST index step 1150 considers the coding cost on each transform block (i.e. the transform block associated with each colour channel) of the transform unit when determining the NSST index. Signalling the NSST index once per transform unit leads to a reduction in the signalling overhead of coding NSST indices, at the expense of restricting the ability to independently choose different secondary transforms for different colour channels. The use of a single NSST index for all transform blocks in the transform unit does not imply that each colour channel uses the same secondary transform, as the intra prediction mode for each chroma channel may differ from the intra prediction mode for the luma channel.
[000139] In yet another arrangement of the methods 1100 and 1200, the NSST index (390 or 454) is determined, signalled, and applied only to the luma colour channel. Application of secondary transforms to only the luma channel excludes the consideration of chroma channels for use of secondary transforms. The residual coefficients in chroma channels tend to have smaller magnitude than the residual signal of the luma channel, owing at least in part to the decorrelative property of the use of colour spaces such as YCbCr. Colour channels whose residual coefficients have smaller magnitude also achieve less benefit from the application of the secondary transforms. Exclusion of the chroma channels from secondary transformation reduces complexity both in the video encoder 114, due to reduced searching, and in the video decoder 134, due to the absence of secondary transform logic for the chroma channels. The reduction in complexity corresponds with lower compression efficiency compared to application to all colour channels. However, the luma channel still receives the compression advantage of use of secondary transforms, where the presence of more residual energy, in the form of higher magnitude residual coefficients, allows the use of secondary transforms to achieve higher compression gain compared to the gain achievable from application to the chroma channels.
[000140] In yet another arrangement of the methods 1100 and 1200, a smaller quantisation step size is applied to residual coefficients subject to secondary transformation. Application of a smaller quantisation step size is achieved by the steps 1180 and 1240 applying a modified quantisation parameter. For example, the quantisation parameter is decremented by one for the residual coefficients to which NSST is applied, such as 1052 and 1050 in Fig. 10. The smaller step size of the modified quantisation parameter increases the precision (reduces the quality loss from quantisation) of the reconstructed transform coefficients (346) for which the quantisation parameter is applied. The increase in precision is at the expense of an increase in coding cost of the associated residual coefficients. As a consequence of using the smaller step size, the benefit of the secondary transform appears primarily as a reduction in distortion in the decoded video, rather than a reduction in the coding cost or bit rate.
[000141] The quantisation parameter is an integer value, where each increment by one results in a relatively large increase in the quantisation step size. An alternative to modifying the quantisation parameter directly is to modify the translation of the quantisation parameter into the quantisation step size. In particular, the quantisation parameter has an exponential relationship to the quantisation step size, such that every six increments of the quantisation parameter result in a doubling of the quantisation step size. The intermediate quantisation steps are derived from a look-up table of integers approximating the exponential relationship between each doubling of the quantisation step size. The precision of the integers in the look-up table is less than would be required had floating point arithmetic been used; however, it is still ample to express quantisation step sizes quite close to those that would result from a floating point implementation. The modified quantisation may be implemented using a modified look-up table, enabling a step size change smaller than one quantisation parameter decrement. The modified quantisation is only applied when the secondary transform (NSST) is applied, i.e. for NSST index values (390 or 454) greater than zero.
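The translation described above can be sketched with an HEVC-style six-entry level-scale table, whose integers approximate 64 × 2^(k/6) for k = 0..5. The NSST-specific table below is purely illustrative of how a step-size reduction finer than one full QP decrement could be achieved; its values are assumptions, not taken from any specification.

```python
# Six integers approximating 64 * 2**(k/6) for k = 0..5, so every six
# QP increments double the quantisation step size (HEVC-style).
LEVEL_SCALE = [40, 45, 51, 57, 64, 72]

# Hypothetical modified table for NSST blocks: each entry scaled down
# slightly, giving a reduction of roughly half a QP step.
NSST_LEVEL_SCALE = [38, 42, 48, 54, 60, 68]

def quant_step(qp, nsst_index=0):
    """Quantisation step size for a given QP; quant_step(4) == 1.0.

    When the secondary transform is in use (nsst_index > 0), the
    modified look-up table yields a slightly smaller step size.
    """
    table = NSST_LEVEL_SCALE if nsst_index > 0 else LEVEL_SCALE
    return table[qp % 6] * (1 << (qp // 6)) / 64.0
```

As expected from the exponential relationship, quant_step(10) is exactly twice quant_step(4): the same table entry is used, shifted up by one doubling.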
[000142] In yet another arrangement of the methods 1100 and 1200, the NSST index signalling is unaffected by the prohibition on applying the non-separable secondary transform to particular transform block sizes (i.e. 1060). As such, the NSST index is signalled for all transform block sizes, aside from other restrictions on signalling resulting from residual coefficient count and transform skip usage. The video encoder 114 may still realise a complexity reduction by applying the restriction on searching non-zero NSST index values for the prohibited NSST transform blocks 1060. Moreover, a conformance constraint may be said to exist for the bitstream (i.e. 115, 133) whereby the NSST index value is required to be zero when the transform block size is one of the NSST prohibited transform block sizes 1060. When a bitstream conformance constraint is in effect, the video decoder 134 may realise a complexity reduction by not implementing support for performing the non-separable secondary transform for transform blocks of specific sizes, i.e. transform blocks of the set 1060.
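The conformance constraint described above can be sketched as a decoder-side check; the prohibited-size set here is a placeholder standing in for the set 1060, whose actual membership depends on the arrangement in use.

```python
# Placeholder for the prohibited transform block sizes (the set 1060);
# the actual sizes are arrangement-specific.
PROHIBITED_NSST_SIZES = {(4, 8), (8, 4)}

def check_nsst_conformance(width, height, nsst_index):
    """Enforce the bitstream conformance constraint: the signalled
    NSST index must be zero for prohibited transform block sizes.

    A decoder relying on this constraint need not implement the
    secondary transform for those block sizes at all.
    """
    if (width, height) in PROHIBITED_NSST_SIZES and nsst_index != 0:
        raise ValueError(
            "non-conforming bitstream: non-zero NSST index for a "
            "prohibited transform block size")
    return nsst_index
```

The index is still parsed for every block size; only its permitted value is constrained, which is what lets the decoder omit the transform logic rather than the parsing logic.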
INDUSTRIAL APPLICABILITY
[000143] The arrangements described are applicable to the computer and data processing industries and particularly to digital signal processing for the encoding and decoding of signals such as video and image signals, achieving high compression efficiency without excessive cost in terms of memory bandwidth due to non-localised scanning of residual coefficients.
[000144] The foregoing describes only some embodiments of the present invention, and modifications and/or changes can be made thereto without departing from the scope and spirit of the invention, the embodiments being illustrative and not restrictive.
[000145] (Australia only) In the context of this specification, the word "comprising" means "including principally but not necessarily solely" or "having" or "including", and not "consisting only of". Variations of the word "comprising", such as "comprise" and "comprises", have correspondingly varied meanings.

Claims

1. A method of decoding a transform block in an image frame from a bitstream, the method comprising: determining an aspect ratio for a transform block of the bitstream; determining, in the bitstream, a presence of a non-separable secondary transform (NSST) selection for the transform block based on the determined aspect ratio; decoding, from the bitstream, the NSST selection for the transform block; and decoding the transform block in the image frame by applying the decoded NSST selection to the transform block.
2. A method according to claim 1, wherein the presence of the NSST selection is determined if the aspect ratio is 2:1.
3. A method according to claim 1, wherein the presence of the NSST selection is determined if the aspect ratio is less than or equal to a predetermined threshold, the aspect ratio expressed as a longest side of the transform block to a shortest side of the transform block.
4. A non-transitory computer-readable medium having a computer program stored thereon to implement a method of decoding a transform block in an image frame from a bitstream, the program comprising: code for determining an aspect ratio for a transform block; code for determining, in the bitstream, a presence of a non-separable secondary transform (NSST) selection for the transform block based on the determined aspect ratio; code for decoding, from the bitstream, the NSST selection for the transform block; and code for decoding the transform block in the image frame by applying the decoded NSST selection to the transform block.
5. A system, comprising: a memory; and a processor, wherein the processor is configured to execute code stored on the memory for implementing a method of decoding a transform block in an image frame from a bitstream, the method comprising: determining an aspect ratio for a transform block; determining, in the bitstream, a presence of a non-separable secondary transform (NSST) selection for the transform block based on the determined aspect ratio; decoding, from the bitstream, the NSST selection for the transform block; and decoding the transform block in the image frame by applying the decoded NSST selection to the transform block.
6. A video decoder configured to decode a transform block in an image frame from a bitstream, comprising: a memory; a processor configured to execute code stored on the memory to: determine an aspect ratio for a transform block; determine, in the bitstream, a presence of a non-separable secondary transform (NSST) selection for the transform block based on the determined aspect ratio; decode, from the bitstream, the NSST selection for the transform block; and decode the transform block in the image frame by applying the decoded NSST selection to the transform block.
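The aspect-ratio test of claims 1 to 3 can be sketched as follows; the default threshold of 2 corresponds to the 2:1 ratio of claim 2 and is otherwise an assumption, as is the function name.

```python
def nsst_index_present(width, height, max_ratio=2):
    """True if an NSST selection is signalled in the bitstream for a
    transform block of the given dimensions, based on the ratio of
    its longest side to its shortest side (cf. claims 2 and 3)."""
    ratio = max(width, height) / min(width, height)
    return ratio <= max_ratio
```

For a 16x8 block the ratio is 2, so the NSST index is present in the bitstream; for a 16x4 block the ratio is 4, the index is absent, and no secondary transform is applied.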
PCT/AU2019/050342 2018-06-29 2019-04-17 Method, apparatus and system for encoding and decoding a transformed block of video samples WO2020000016A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
AU2018204775 2018-06-29
AU2018204775A AU2018204775A1 (en) 2018-06-29 2018-06-29 Method, apparatus and system for encoding and decoding a transformed block of video samples

Publications (1)

Publication Number Publication Date
WO2020000016A1 2020-01-02

Family

ID=68984395

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/AU2019/050342 WO2020000016A1 (en) 2018-06-29 2019-04-17 Method, apparatus and system for encoding and decoding a transformed block of video samples

Country Status (3)

Country Link
AU (1) AU2018204775A1 (en)
TW (1) TW202002653A (en)
WO (1) WO2020000016A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111163316A (en) * 2020-01-08 2020-05-15 东电创新(北京)科技发展股份有限公司 High-definition video transmission method and system based on low code stream
CN114286097A (en) * 2021-12-22 2022-04-05 南通朝辉信息技术有限公司 Coding block quantization increment parameter optimization method in secondary video coding rate control

Families Citing this family (1)

Publication number Priority date Publication date Assignee Title
AU2020257796B2 (en) * 2019-04-16 2023-11-02 Lg Electronics Inc. Transform for matrix-based intra-prediction in image coding

Citations (2)

Publication number Priority date Publication date Assignee Title
US20150131739A1 (en) * 2011-10-18 2015-05-14 Kt Corporation Method for encoding image, method for decoding image, image encoder, and image decoder
US20170094314A1 (en) * 2015-09-29 2017-03-30 Qualcomm Incorporated Non-separable secondary transform for video coding with reorganizing


Non-Patent Citations (1)

Title
ZHAO, X. et al.: "NSST: Non-Separable Secondary Transforms for Next Generation Video Coding", 2016 Picture Coding Symposium (PCS), Nuremberg, Germany, 4 December 2016, pages 1-5, XP033086901, DOI: 10.1109/PCS.2016.7906344 *


Also Published As

Publication number Publication date
AU2018204775A1 (en) 2020-01-16
TW202002653A (en) 2020-01-01

Similar Documents

Publication Publication Date Title
US11445191B2 (en) Method, apparatus and system for encoding and decoding a transformed block of video samples
EP3854091A1 (en) Method, apparatus and system for encoding and decoding a tree of blocks of video samples
AU2020201753B2 (en) Method, apparatus and system for encoding and decoding a block of video samples
US20230037302A1 (en) Method, apparatus and system for encoding and decoding a coding tree unit
AU2019275552B2 (en) Method, apparatus and system for encoding and decoding a coding tree unit
WO2020033991A1 (en) Method, apparatus and system for encoding and decoding a transformed block of video samples
AU2022204353B2 (en) Method, apparatus and system for encoding and decoding a block of video samples
AU2022203416B2 (en) Method, apparatus and system for encoding and decoding a block of video samples
WO2020000016A1 (en) Method, apparatus and system for encoding and decoding a transformed block of video samples
AU2021273633B2 (en) Method, apparatus and system for encoding and decoding a block of video samples
WO2021127723A1 (en) Method, apparatus and system for encoding and decoding a block of video samples
WO2020033992A1 (en) Method, apparatus and system for encoding and decoding a transformed block of video samples
AU2020202285A1 (en) Method, apparatus and system for encoding and decoding a block of video samples
AU2020202057A1 (en) Method, apparatus and system for encoding and decoding a block of video samples
AU2019203981A1 (en) Method, apparatus and system for encoding and decoding a block of video samples
AU2018203893A1 (en) Method, apparatus and system for encoding and decoding a transformed block of video samples

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19824739

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19824739

Country of ref document: EP

Kind code of ref document: A1