EP3987808A1 - A method, an apparatus and a computer program product for video encoding and video decoding - Google Patents
- Publication number
- EP3987808A1 (from application EP20826119.8A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- block
- syntax elements
- probability
- order
- coding
- Prior art date
- Legal status (an assumption, not a legal conclusion): Withdrawn
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/102—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
- H04N19/103—Selection of coding mode or of prediction mode
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/134—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
- H04N19/146—Data rate or code amount at the encoder output
- H04N19/149—Data rate or code amount at the encoder output by estimating the code amount by means of a model, e.g. mathematical model or statistical model
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/169—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
- H04N19/17—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
- H04N19/176—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a block, e.g. a macroblock
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/46—Embedding additional information in the video signal during the compression process
- H04N19/463—Embedding additional information in the video signal during the compression process by compressing encoding parameters before transmission
Definitions
- The present solution generally relates to video encoding and video decoding.
Background
- a video coding system may comprise an encoder that transforms an input video into a compressed representation suited for storage/transmission, and a decoder that can uncompress the compressed video representation back into a viewable form.
- the encoder may discard some information in the original video sequence in order to represent the video in a more compact form, for example, to enable the storage/transmission of the video information at a lower bitrate than otherwise might be needed.
- Hybrid video codecs may encode the video information in two phases. Firstly, sample values (i.e. pixel values) in a certain picture area are predicted, e.g., by motion compensation means or by spatial means. Secondly, the prediction error, i.e. the difference between the prediction block of samples and the original block of samples, is coded.
- The video decoder, on the other hand, reconstructs the output video by applying prediction means similar to those of the encoder, to form a prediction representation of the sample blocks, and prediction error decoding. After applying the prediction and prediction error decoding means, the decoder sums up the prediction and prediction error signals to form the output video frame.
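The two-phase structure described above can be sketched in a few lines. The names `encode_block` and `decode_block` are illustrative, and the residual is left untransformed for clarity; a real codec would transform, quantize and entropy-code it:

```python
import numpy as np

def encode_block(original, prediction):
    """Phase two of a hybrid codec: form the prediction error (residual)."""
    return original.astype(np.int16) - prediction.astype(np.int16)

def decode_block(prediction, residual):
    """Decoder side: sum the prediction and prediction error signals."""
    return np.clip(prediction.astype(np.int16) + residual, 0, 255).astype(np.uint8)

original = np.array([[120, 121], [119, 122]], dtype=np.uint8)
prediction = np.array([[118, 120], [119, 121]], dtype=np.uint8)
reconstructed = decode_block(prediction, encode_block(original, prediction))
```

With a lossless residual, as here, the reconstruction equals the original; quantizing the residual would trade that exactness for bitrate.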
- the probability of usage of each coding tool is determined according to a learning process.
- The learning process is based on an already coded group of pictures.
- the probability of usage is determined according to one or more of the following: a block size, a quality of the current block compared to a reference block, temporal distance of a block compared to the reference block.
- the probability model is different for blocks having different sizes.
- the probability model is different for different regions of the source picture.
- The order for syntax elements is defined based on a pre-encoding and decoding process modelling the determined probabilities.
- An apparatus comprising at least one processor and memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following: receive a source picture; partition the source picture into a set of non-overlapping blocks; for a first block being coded, code syntax elements of various coding tools according to a pre-defined order into the bitstream; for any one or more subsequent blocks, determine a probability of a usage of each of the various coding tools in any one or more previous blocks; derive a probability model according to the determined probabilities; define an order for syntax elements for a block according to the probability model; and encode syntax elements of the block into a bitstream according to the defined order.
- the apparatus further comprising computer program code to cause the apparatus to determine the probability of usage of each coding tool according to a learning process.
- The learning process is based on an already coded group of pictures.
- the probability of usage is determined according to one or more of the following: a block size, a quality of the current block compared to a reference block, temporal distance of a block compared to the reference block.
- the probability model is different for blocks having different sizes.
- the probability model is different for different regions of the source picture.
- the apparatus comprises means for defining the order for syntax elements based on a pre-encoding and decoding process modelling the determined probabilities.
- A computer program product comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to: receive a source picture; partition the source picture into a set of non-overlapping blocks; for a first block being coded, code syntax elements of various coding tools according to a pre-defined order into the bitstream; for any one or more subsequent blocks, determine a probability of a usage of each of the various coding tools in any one or more previous blocks; derive a probability model according to the determined probabilities; define an order for syntax elements for a block according to the probability model; and encode syntax elements of the block into a bitstream according to the defined order.
- The computer program product is embodied on a non-transitory computer readable medium.
- FIG. 1 shows an encoding process according to an embodiment
- FIG. 2 shows a decoding process according to an embodiment
- Fig. 3 shows syntax element signaling order of Merge tools in VVC
- Fig. 4 shows an example of previously coded blocks that are used for ordering the syntax elements of the current block
- Fig. 5 shows an adaptive syntax element ordering procedure according to an embodiment
- Fig. 6 shows an example of the present embodiments applied for different block sizes
- Fig. 7 shows an example of an offline training
- Fig. 8 shows an example of an online training and online fine-tuning neural network
- Fig. 9 shows an example of offline training and online fine-tuning neural network
- FIG. 10 is a flowchart illustrating a method according to an embodiment
- FIG. 11 shows an apparatus according to an embodiment.
Description of Example Embodiments
- the invention is not limited to this particular arrangement.
- The invention may be applicable to video coding systems like streaming systems, DVD (Digital Versatile Disc) players, digital television receivers, personal video recorders, systems and computer programs on personal computers, handheld computers and communication devices, as well as network elements such as transcoders and cloud computing arrangements where video data is handled.
- The Advanced Video Coding standard (which may be abbreviated AVC or H.264/AVC) was developed by the Joint Video Team (JVT) of the Video Coding Experts Group (VCEG) of the Telecommunications Standardization Sector of the International Telecommunication Union (ITU-T) and the Moving Picture Experts Group (MPEG) of the International Organization for Standardization (ISO) / International Electrotechnical Commission (IEC).
- The H.264/AVC standard is published by both parent standardization organizations, and it is referred to as ITU-T Recommendation H.264 and ISO/IEC International Standard 14496-10, also known as MPEG-4 Part 10 Advanced Video Coding (AVC).
- High Efficiency Video Coding standard (which may be abbreviated HEVC or H.265/HEVC) was developed by the Joint Collaborative Team - Video Coding (JCT-VC) of VCEG and MPEG.
- The standard is published by both parent standardization organizations, and it is referred to as ITU-T Recommendation H.265 and ISO/IEC International Standard 23008-2, also known as MPEG-H Part 2 High Efficiency Video Coding (HEVC).
- Extensions to H.265/HEVC include scalable, multiview, three-dimensional, and fidelity range extensions, which may be referred to as SHVC, MV-HEVC, 3D-HEVC, and REXT, respectively.
- Some key definitions, bitstream and coding structures, and concepts of H.264/AVC and HEVC and some of their extensions are described in this section as an example of a video encoder, decoder, encoding method, decoding method, and a bitstream structure, wherein the embodiments may be implemented.
- Some of the key definitions, bitstream and coding structures, and concepts of H.264/AVC are the same as in HEVC standard - hence, they are described below jointly.
- The aspects of various embodiments are not limited to H.264/AVC or HEVC or their extensions, but rather the description is given for one possible basis on top of which the present embodiments may be partly or fully realized.
- a syntax element may be defined as an element of data represented in the bitstream.
- a syntax structure may be defined as zero or more syntax elements present together in the bitstream in a specified order.
- bitstream syntax and semantics as well as the decoding process for error-free bitstreams are specified in H.264/AVC and HEVC.
- the encoding process is not specified, but encoders must generate conforming bitstreams.
- Bitstream and decoder conformance can be verified with the Hypothetical Reference Decoder (HRD).
- the standards contain coding tools that help in coping with transmission errors and losses, but the use of the tools in encoding is optional and no decoding process has been specified for erroneous bitstreams.
- the elementary unit for the input to an H.264/AVC or HEVC encoder and the output of an H.264/AVC or HEVC decoder, respectively, is a picture.
- a picture given as an input to an encoder may also be referred to as a source picture, and a picture decoded by a decoder may be referred to as a decoded picture.
- the source and decoded pictures may each be comprised of one or more sample arrays, such as one of the following sets of sample arrays, wherein each of the samples may represent one color component:
- Green, Blue and Red (GBR, also known as RGB)
- these arrays may be referred to as luma (or L or Y) and chroma, where the two chroma arrays may be referred to as Cb and Cr; regardless of the actual color representation method in use.
- the actual color representation method in use may be indicated e.g. in a coded bitstream e.g. using the Video Usability Information (VUI) syntax of H.264/AVC and/or HEVC.
- A component may be defined as an array or a single sample from one of the three sample arrays (luma and two chroma), or the array or a single sample of the array that composes a picture in monochrome format.
- a picture may either be a frame or a field.
- a frame comprises a matrix of luma samples and possibly the corresponding chroma samples.
- a field is a set of alternate sample rows of a frame. Fields may be used as encoder input for example when the source signal is interlaced.
- Chroma sample arrays may be absent (and hence monochrome sampling may be in use) or may be subsampled when compared to luma sample arrays.
- Some chroma formats may be summarized as follows:
- In monochrome sampling there is only one sample array, which may be nominally considered the luma array.
- In 4:2:0 sampling, each of the two chroma arrays has half the height and half the width of the luma array.
- In 4:2:2 sampling, each of the two chroma arrays has the same height and half the width of the luma array.
- In 4:4:4 sampling, each of the two chroma arrays has the same height and width as the luma array.
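The relative chroma array sizes listed above can be computed with a small helper; the function name and format labels are illustrative only, not part of any codec API:

```python
# Illustrative helper giving the chroma array dimensions relative to a
# width x height luma array for the common chroma formats.
def chroma_dims(width, height, fmt):
    if fmt == "monochrome":
        return None                       # no chroma arrays present
    if fmt == "4:2:0":
        return (width // 2, height // 2)  # half width, half height
    if fmt == "4:2:2":
        return (width // 2, height)       # half width, same height
    if fmt == "4:4:4":
        return (width, height)            # same width and height
    raise ValueError("unknown chroma format: " + fmt)

cb_cr_size = chroma_dims(1920, 1080, "4:2:0")  # (960, 540)
```

For a 1920x1080 luma array in 4:2:0 sampling, each of the Cb and Cr arrays is thus 960x540.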
- the location of chroma samples with respect to luma samples may be determined in the encoder side (e.g. as pre-processing step or as part of encoding).
- the chroma sample positions with respect to luma sample positions may be pre-defined for example in a coding standard, such as H.264/AVC or HEVC, or may be indicated in the bitstream for example as part of VUI of H.264/AVC or HEVC.
- the source video sequence(s) provided as input for encoding may either represent interlaced source content or progressive source content. Fields of opposite parity have been captured at different times for interlaced source content. Progressive source content contains captured frames.
- An encoder may encode fields of interlaced source content in two ways: a pair of interlaced fields may be coded into a coded frame or a field may be coded as a coded field.
- an encoder may encode frames of progressive source content in two ways: a frame of progressive source content may be coded into a coded frame or a pair of coded fields.
- A field pair or a complementary field pair may be defined as two fields next to each other in decoding and/or output order, having opposite parity (i.e. one being a top field and another being a bottom field).
- Some video coding standards or schemes allow mixing of coded frames and coded fields in the same coded video sequence.
- predicting a coded field from a field in a coded frame and/or predicting a coded frame for a complementary field pair may be enabled in encoding and/or decoding.
- a partitioning may be defined as a division of a set into subsets such that each element of the set is in exactly one of the subsets.
- a picture partitioning may be defined as a division of a picture into smaller non-overlapping units.
- a block partitioning may be defined as a division of a block into smaller non-overlapping units, such as sub-blocks.
- The term block partitioning may be considered to cover multiple levels of partitioning, for example partitioning of a picture into slices, and partitioning of each slice into smaller units, such as macroblocks of H.264/AVC. It is noted that the same unit, such as a picture, may have more than one partitioning. For example, a coding unit of HEVC may be partitioned into prediction units and separately by another quadtree into transform units.
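A quadtree block partitioning of the kind used in HEVC can be sketched as a simple recursion. Here `split_decision` is a hypothetical callback standing in for the encoder's rate-distortion choice:

```python
# Illustrative recursive quadtree partitioning of a block into smaller
# non-overlapping units, in the spirit of HEVC coding-unit splitting.
def quadtree_partition(x, y, size, min_size, split_decision):
    """Return the leaf blocks as (x, y, size) tuples."""
    if size > min_size and split_decision(x, y, size):
        half = size // 2
        leaves = []
        for dy in (0, half):          # recurse into the four quadrants
            for dx in (0, half):
                leaves += quadtree_partition(x + dx, y + dy, half,
                                             min_size, split_decision)
        return leaves
    return [(x, y, size)]             # leaf: no further split

# splitting everything down to 8x8: a 16x16 block yields four 8x8 leaves
leaves = quadtree_partition(0, 0, 16, 8, lambda x, y, s: True)
```

Note that each leaf covers a disjoint region and together they tile the original block, matching the definition of a partitioning above.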
- Hybrid video codecs may encode the video information in two phases. At first, pixel values in a certain picture area (or "block") are predicted, for example, by motion compensation means or by spatial means. In the first phase, predictive coding may be applied, for example, as so-called sample prediction and/or so-called syntax prediction.
- In sample prediction, pixel or sample values in a certain picture area or "block" are predicted. These pixel or sample values can be predicted, for example, using one or more of the following ways:
- Motion compensation mechanisms (which may also be referred to as inter prediction, temporal prediction or motion-compensation temporal prediction or motion-compensated prediction or MCP) involve finding and indicating an area in one of the previously encoded video frames that corresponds closely to the block being coded. Inter prediction may reduce temporal redundancy.
- Intra prediction mechanisms, where pixel or sample values are predicted by spatial means, involve finding and indicating a spatial region relationship. Intra prediction can be performed in the spatial or transform domain, i.e., either sample values or transform coefficients can be predicted. Intra prediction may be exploited in intra coding, where no inter prediction is applied.
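As a concrete, much simplified instance of spatial sample prediction, DC intra prediction fills the block with the mean of the neighbouring reconstructed samples; the function name and interface below are assumptions for illustration, not from any standard:

```python
import numpy as np

def dc_intra_prediction(left, top):
    """Predict every sample of the block as the rounded mean of the
    reconstructed neighbouring samples to the left and above (DC mode)."""
    dc = int(round((sum(left) + sum(top)) / (len(left) + len(top))))
    return np.full((len(left), len(top)), dc, dtype=np.uint8)

# flat neighbourhoods around value ~102 give a flat DC prediction block
pred = dc_intra_prediction(left=[100, 100, 100, 100],
                           top=[104, 104, 104, 104])
```

Only the residual between this prediction and the original block then needs to be coded.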
- In syntax prediction, which may also be referred to as parameter prediction, syntax elements and/or syntax element values and/or variables derived from syntax elements are predicted from syntax elements (de)coded earlier and/or variables derived earlier.
- Examples of syntax prediction are provided below:
- In motion vector prediction, motion vectors, e.g. for inter and/or inter-view prediction, may be coded differentially with respect to a block-specific predicted motion vector.
- The predicted motion vectors are created in a predefined way, for example by calculating the median of the encoded or decoded motion vectors of the adjacent blocks.
- Another way to create motion vector predictions is to generate a list of candidate predictions from adjacent blocks and/or co-located blocks in temporal reference pictures and signaling the chosen candidate as the motion vector predictor.
- In addition, the reference index of a previously coded/decoded picture can be predicted. Differential coding of motion vectors may be disabled across slice boundaries.
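The median-based motion vector predictor and differential coding described above can be sketched as follows; the component-wise median over three invented neighbour vectors is a simplification of the actual standard-specific derivation:

```python
def median_mv_predictor(neighbor_mvs):
    """Component-wise median of the neighbouring blocks' motion vectors."""
    xs = sorted(mv[0] for mv in neighbor_mvs)
    ys = sorted(mv[1] for mv in neighbor_mvs)
    mid = len(neighbor_mvs) // 2
    return (xs[mid], ys[mid])

# illustrative neighbour motion vectors (left, above, above-right)
mvp = median_mv_predictor([(4, 0), (6, -2), (5, 1)])
mv = (6, 0)                              # actual MV of the current block
mvd = (mv[0] - mvp[0], mv[1] - mvp[1])   # only this difference is signalled
```

Because neighbouring blocks tend to move similarly, the difference `mvd` is usually small and cheap to entropy-code.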
- In block partitioning prediction, the block partitioning, e.g. from CTU to CUs and down to PUs, may be predicted.
- In filter parameter prediction, the filtering parameters, e.g. for sample adaptive offset, may be predicted.
- Prediction approaches using image information from a previously coded image can also be called inter prediction methods, which may also be referred to as temporal prediction and motion compensation.
- Prediction approaches using image information within the same image can also be called intra prediction methods.
- the second phase is coding the error between the prediction block of samples and the original block of samples. This may be accomplished by transforming the difference in sample values using a specified transform. This transform may be e.g. a Discrete Cosine Transform (DCT) or a variant thereof. After transforming the difference, the transformed difference is quantized, and entropy coded.
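A minimal sketch of this second phase, assuming an orthonormal DCT-II and a single scalar quantization step `qstep` (real codecs use integer transform approximations and quantization matrices; entropy coding is omitted):

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix, built directly from the definition."""
    k = np.arange(n)
    m = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    m[0] /= np.sqrt(2)
    return m * np.sqrt(2 / n)

def transform_quantize(residual, qstep):
    d = dct_matrix(residual.shape[0])
    return np.round(d @ residual @ d.T / qstep)   # quantization: the lossy step

def dequantize_inverse(levels, qstep):
    d = dct_matrix(levels.shape[0])
    return d.T @ (levels * qstep) @ d             # inverse quantization + IDCT

residual = np.arange(16, dtype=float).reshape(4, 4)
levels = transform_quantize(residual, qstep=0.1)
reconstructed = dequantize_inverse(levels, qstep=0.1)
```

A larger `qstep` produces fewer, coarser levels (smaller bitstream) at the cost of a larger reconstruction error, which is exactly the quality/bitrate balance discussed below.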
- The encoder can control the balance between the accuracy of the pixel representation (picture quality) and the size of the resulting coded video representation (file size or transmission bitrate).
- the decoder reconstructs the output video by applying a prediction mechanism similar to that used by the encoder in order to form a prediction representation of the sample blocks (using the motion or spatial information created by the encoder and included in the compressed representation of the image) and prediction error decoding (the inverse operation of the prediction error coding to recover the quantized error signal in the spatial domain).
- After applying sample prediction and error decoding processes, the decoder combines the prediction and the prediction error signals (the sample values) to form the output video frame.
- FIG. 1 illustrates an image to be encoded (I_n); a prediction representation of an image block (P'_n); a prediction error signal (D_n); a reconstructed prediction error signal (D'_n); a preliminary reconstructed image (I'_n); a final reconstructed image (R'_n); a transform (T) and inverse transform (T^-1); a quantization (Q) and inverse quantization (Q^-1); entropy encoding (E); a reference frame memory (RFM); inter prediction (P_inter); intra prediction (P_intra); mode selection (MS) and filtering (F).
- FIG. 2 illustrates a prediction representation of an image block (P'_n); a reconstructed prediction error signal (D'_n); a preliminary reconstructed image (I'_n); a final reconstructed image (R'_n); an inverse transform (T^-1); an inverse quantization (Q^-1); an entropy decoding (E^-1); a reference frame memory (RFM); a prediction (either inter or intra) (P); and filtering (F).
- the decoder may also apply additional filtering processes in order to improve the quality of the output video before passing it for display and/or storing as a prediction reference for the forthcoming pictures in the video sequence.
- motion information is indicated by motion vectors associated with each motion compensated image block.
- Each of these motion vectors represents the displacement between the image block in the picture to be coded (in the encoder) or decoded (at the decoder) and the prediction source block in one of the previously coded or decoded images (or pictures).
- In H.264/AVC and HEVC, as in many other video compression standards, a picture is divided into a mesh of rectangles, for each of which a similar block in one of the reference pictures is indicated for inter prediction. The location of the prediction block is coded as a motion vector that indicates the position of the prediction block relative to the block being coded.
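The search for a closely corresponding area can be illustrated with an exhaustive block-matching search minimising the sum of absolute differences (SAD); real encoders use much faster search strategies, and the interface here is an assumption for illustration:

```python
import numpy as np

def full_search(cur_block, ref_frame, bx, by, search_range):
    """Exhaustive motion search: return the (dx, dy) minimising SAD."""
    h, w = cur_block.shape
    best_mv, best_sad = (0, 0), float("inf")
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = by + dy, bx + dx
            if y < 0 or x < 0 or y + h > ref_frame.shape[0] or x + w > ref_frame.shape[1]:
                continue  # candidate block falls outside the reference picture
            sad = int(np.abs(cur_block.astype(int)
                             - ref_frame[y:y + h, x:x + w].astype(int)).sum())
            if sad < best_sad:
                best_sad, best_mv = sad, (dx, dy)
    return best_mv, best_sad

# a block copied from the reference at a (2, 0) displacement is found exactly
ref = np.random.default_rng(0).integers(0, 256, (32, 32), dtype=np.uint8)
mv, sad = full_search(ref[10:14, 12:16], ref, bx=10, by=10, search_range=4)
```

The returned `mv` is exactly the motion vector that would be coded (differentially) into the bitstream for this block.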
- a neural network is a computation graph that may comprise several layers of successive computation.
- A layer may comprise one or more units ("neurons") performing an elementary computation.
- A unit may be connected to one or more other units, and the connection may have an associated weight. The weight may be used for scaling the signal passing through the associated connection. Weights may be learnable parameters, i.e., values which can be learned from training data.
- Each layer is configured to extract information from the input data.
- Data from lower layers may represent low-level semantics, whereas higher layers represent higher-level semantics.
- Feed-forward and recurrent neural networks represent examples of neural network architectures.
- Feed-forward neural networks do not have a feedback loop: each layer takes input from one or more of the preceding layers and provides its output to one or more of the subsequent layers. Also, units inside a certain layer may take input from units in one or more of the preceding layers, and provide output to units in one or more of the following layers.
- an untrained neural network model has to go through a training phase.
- the training phase is the development phase, where the neural network learns to perform the final task.
- a training data set that is used for training a neural network is supposed to be representative of the data on which the neural network will be used.
- The neural network uses the examples in the training dataset to modify its learnable parameters (e.g., its connections' weights) in order to achieve the desired task. Training may happen by minimizing or decreasing the output's error, also referred to as the loss. Examples of losses are mean squared error, cross-entropy, etc.
- training is an iterative process, where at each iteration, the algorithm modifies the weights of the neural network to make a gradual improvement of the network’s output, i.e., to gradually decrease the loss.
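The iterative weight-update loop can be illustrated with plain gradient descent on a single linear unit; the learning rate, iteration count and training pairs below are arbitrary example values:

```python
# Minimal sketch of iterative training: gradient descent on the mean squared
# error loss for one linear unit with one weight and one bias.
def train(data, lr=0.1, iterations=200):
    w, b = 0.0, 0.0                      # learnable parameters
    for _ in range(iterations):
        # gradients of the MSE loss over the training set
        gw = sum(2 * (w * x + b - y) * x for x, y in data) / len(data)
        gb = sum(2 * (w * x + b - y) for x, y in data) / len(data)
        w -= lr * gw                     # each step gradually decreases the loss
        b -= lr * gb
    return w, b

# supervised training from input-output pairs sampled from y = 2x + 1
w, b = train([(0, 1), (1, 3), (2, 5)])
```

After a few hundred iterations the parameters converge close to the generating values w = 2, b = 1, mirroring how repeated small updates gradually decrease the loss.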
- The trained neural network model is applied to new data during an inference phase, in which the neural network performs the desired task for which it has been trained.
- the inference phase is also known as a testing phase.
- the neural network provides an output which is a result of the inference on the new data.
- the result can be a classification result or a recognition result.
- Training can be performed in several ways. The main ones are supervised, unsupervised, and reinforcement training.
- In supervised training, the neural network model is provided with input-output pairs, where the output may be a label.
- In unsupervised training, the neural network is provided only with input data (and also with raw output data in the case of self-supervised training).
- In reinforcement learning, the supervision is sparser and less precise; instead of input-output pairs, the neural network gets input data and, sometimes, delayed rewards in the form of scores (e.g., -1, 0, or +1).
- the learning is a result of a training algorithm, or of a meta-level neural network providing the training signal.
- the training algorithm comprises changing some properties of the neural network so that its output is as close as possible to a desired output.
- the syntax elements (SEs) in a category of coding tools are coded and transmitted in the bitstream in a certain and fixed order.
- the ordering may be done based on the probability of the usage of each tool in the entire sequence. However, such ordering is sub-optimal since the probability of usage of a certain coding tool depends on different properties such as: content in each image or coding block, temporal distance of each frame to the reference frame, quality of each frame or block relative to its reference, size of the block, etc.
- FIG. 3 illustrates the syntax element order of merge coding tools in the current version of the Versatile Video Coding standard (H.266/VVC).
- FIG. 3 shows merge coding tools 310 which are encoded or decoded in a fixed order. These merge coding tools comprise a normal merge mode 320, a merge with motion vector difference (MMVD) mode 330, a sub-block merge mode 340, a Combined Intra-Inter Prediction (CIIP) mode 350 and a triangle mode 360.
- The decoder has to decode the normal merge flag 320, MMVD flag 330 and sub-block flag 340 syntax elements before parsing and decoding the CIIP syntax element 350.
- If the usage of the CIIP merge mode 350 for the content is higher than the usage of the other tools in the category, then the bitrate will increase, since these syntax elements are coded into the bitstream at the block level. The same issue applies to the other syntax elements in this category, and in general to the entire set of video coding tools.
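The bitrate effect of flag ordering can be quantified with a toy model; the mode names and usage probabilities below are invented for illustration and are not measured VVC statistics:

```python
# Expected number of mode flags signalled per block when modes are tested in a
# fixed order: choosing the i-th mode costs i+1 flags, except the last mode,
# which is inferred once all earlier flags are zero.
def expected_flags(order, p):
    n = len(order)
    return sum(p[mode] * (i + 1 if i < n - 1 else n - 1)
               for i, mode in enumerate(order))

# hypothetical per-block usage probabilities with CIIP dominating
p = {"normal": 0.1, "mmvd": 0.1, "subblock": 0.1, "ciip": 0.6, "triangle": 0.1}
fixed = expected_flags(["normal", "mmvd", "subblock", "ciip", "triangle"], p)
adaptive = expected_flags(sorted(p, key=p.get, reverse=True), p)
```

With these illustrative numbers the fixed order costs 3.4 flags per block on average versus 1.9 when the most probable mode is signalled first, which is the saving the adaptive re-ordering below aims for.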
- the present embodiments are targeted to solve the issue of having a fixed order for coding the syntax elements (SEs) in video coding, and propose an adaptive re-ordering of syntax elements in order to provide an efficient model for coding syntax elements.
- the adaptive order for the syntax elements according to the embodiments may be based on a learning process from the video content.
- the learning process may be implemented according to one or more of the following options:
- a probability of the usage of each coding tool in a coding category in each frame can be determined based on a learning process from the previously encoded/decoded blocks in the picture(s).
- the probability of the usage of each coding tool may be determined based on a certain property. For example, the probability of the usage can be determined based on a block size, the quality of the current block compared to the reference block, the temporal distance of a block inside the picture compared to the reference block, etc.
- the learning may be based on already coded group of pictures (GOP).
- GOP group of pictures
- at least one previously coded picture's coding information is used for learning the syntax element of each block in the current picture. This is adaptively improved when more frames are involved.
- encoding in two steps may be performed to the video.
- a video may be partially or fully coded in order to derive the picture/block level information of the coding tools for the learning process.
- the syntax elements are re-ordered based on the information that is extracted from the first round of encoding.
- the syntax elements (SEs) of a group of coding tools can be implemented according to various embodiments. For example, there can be N prediction tools with distinct syntax elements (SE1, SE2, ..., SEN) for coding a block in the codec.
- these syntax elements may be adaptively re-ordered based on their probability of usage in the codec. This means that the more probably a certain tool is used, the earlier its syntax element is placed in the order of syntax elements. The probability of the usage may be estimated based on a learning process from the previously coded blocks.
- Figure 4 illustrates an example of the current coding block and the previously coded blocks.
- a probability model may be derived, which probability model defines the ordering of the syntax elements.
- the model may be derived based on at least one previously coded block. However, in order to have a more precise probability model, the coding information of more blocks is needed, e.g. one or more CTU rows/columns of the picture.
- the order of syntax elements in a group of coding tools can be modelled based on the probability of the corresponding syntax elements in previous blocks.
- Figure 5 illustrates a procedure for re-ordering syntax elements in the encoding and decoding process, according to an embodiment.
- syntax elements of an input block 510 may be coded in a different order than other blocks.
- the syntax elements may be coded based on a pre-defined and fixed syntax elements order 520.
- its coding information, i.e. the syntax element information, is fed to the syntax element modelling process 540 that models the order based on the probability of the usage of the coding tools in previous block(s).
- the syntax elements modelling 540 is then used for updating 550 the syntax elements order. Consequently, the syntax elements are re-ordered based on the coding information from the previous block(s).
- the re-ordered syntax elements are then used for coding the syntax elements of the next block(s).
- the modelling can be done by calculating the probability of usage of each syntax element or coding tool in previous blocks and by re-ordering the syntax elements from highest probable SE to the lowest probable SE.
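The count-based modelling described above can be sketched as follows. This is an illustrative implementation under the assumption that encoder and decoder keep identical counts of tool usage in previously (de)coded blocks, so both derive the same order without extra signalling; class and method names are hypothetical.

```python
from collections import Counter

class SyntaxOrderModel:
    """Sketch of count-based syntax element re-ordering: tool usage in
    previously (de)coded blocks is counted, and syntax elements are
    sorted from the most to the least frequently used tool."""

    def __init__(self, default_order):
        self.default_order = list(default_order)  # fixed order for the first block(s)
        self.counts = Counter()

    def update(self, chosen_tool):
        # Called after each block is (de)coded, with the tool actually used.
        self.counts[chosen_tool] += 1

    def current_order(self):
        if not self.counts:
            return list(self.default_order)
        # Stable sort: ties keep the pre-defined order.
        return sorted(self.default_order, key=lambda t: -self.counts[t])

model = SyntaxOrderModel(["normal", "mmvd", "subblock", "ciip", "triangle"])
for tool in ["ciip", "ciip", "normal", "ciip", "triangle"]:
    model.update(tool)
print(model.current_order())  # ['ciip', 'normal', 'triangle', 'mmvd', 'subblock']
```

Because the sort is stable, tools that have not yet been observed stay in their pre-defined relative order, matching the fixed fallback used for the first block.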
- the modelling can be done by using a set of neural networks (NNs) for the learning process and deciding the order of SEs based on the previously coded block(s) SE information.
- NNs neural networks
- the neural network can be trained according to one of the following ways:
- Figure 7 shows an example of offline training of a neural network.
- the NN is trained (separately outside of the coding chain with the same or different data as training set) prior to encoding and then it is used for syntax element re-ordering.
- the input block(s) 710 are provided for neural network 720.
- the neural network 720 models the syntax element order 730 for each block based on the probability of the usage of the coding tools in previous block(s). Consequently, the syntax elements are re-ordered based on the coding information from the neural network.
- the syntax elements are then (de)coded according to the computed syntax elements order.
- Figure 8 shows an example of online training and online fine-tuning of neural network.
- the NN is trained in the encoding/decoding process (i.e., online learning) and then used for SE re-ordering.
- more than one block’s coding information may be needed.
- the NN may use the coding information of at least one CTU row or column for the training operation.
- the model is used for SE re-ordering.
- the NN may be later fine-tuned based on the next blocks’ coding information.
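The online training and fine-tuning described above can be illustrated with a deliberately tiny model. The following is a minimal pure-Python softmax classifier used as a stand-in for the NN in the text; the features (log2 block size, temporal ID), layer shape, learning rate and training data are all hypothetical choices for the sketch.

```python
import math, random

class TinyToolRanker:
    """Minimal online-trainable softmax model: block features are mapped
    to per-tool scores, and after each block the model is fine-tuned on
    the tool that was actually chosen (one SGD step on cross-entropy)."""

    def __init__(self, tools, n_features, lr=0.1, seed=0):
        rng = random.Random(seed)
        self.tools = list(tools)
        self.lr = lr
        self.w = [[rng.uniform(-0.01, 0.01) for _ in range(n_features)]
                  for _ in tools]
        self.b = [0.0 for _ in tools]

    def probs(self, x):
        scores = [sum(wi * xi for wi, xi in zip(row, x)) + b
                  for row, b in zip(self.w, self.b)]
        m = max(scores)
        e = [math.exp(s - m) for s in scores]
        z = sum(e)
        return [v / z for v in e]

    def order(self, x):
        # Syntax element order: most probable tool first.
        p = self.probs(x)
        return [t for _, t in sorted(zip(p, self.tools), key=lambda pt: -pt[0])]

    def fine_tune(self, x, chosen):
        # One SGD step on the cross-entropy loss for the observed tool.
        p = self.probs(x)
        k = self.tools.index(chosen)
        for i in range(len(self.tools)):
            g = p[i] - (1.0 if i == k else 0.0)
            self.b[i] -= self.lr * g
            for j in range(len(x)):
                self.w[i][j] -= self.lr * g * x[j]

ranker = TinyToolRanker(["normal", "mmvd", "ciip"], n_features=2)
# Hypothetical training stream: features are [log2 block size, temporal ID].
for _ in range(300):
    ranker.fine_tune([5.0, 2.0], "ciip")    # large blocks tend to pick CIIP
    ranker.fine_tune([3.0, 0.0], "normal")  # small blocks tend to pick normal merge
print(ranker.order([5.0, 2.0])[0])  # 'ciip' ranked first for large blocks
print(ranker.order([3.0, 0.0])[0])  # 'normal' ranked first for small blocks
```

A production codec would use a proper neural network and deterministic, bit-exact inference so that encoder and decoder derive identical orders; the sketch only shows the online fine-tuning loop.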
- Figure 9 shows an example of offline training and online fine-tuning of neural network.
- the NN is trained similarly to the example of Figure 7 (i.e., outside of the coding chain), but the coding information of the blocks in the encoding/decoding process is used for fine-tuning the NN.
- for the fine-tuning process, another NN may be used in the coding chain.
- the process of modelling the SEs’ order may be done differently for different block sizes.
- a distinct SEs ordering model may be defined and the order may be updated based on the previously coded blocks with the same size. This is illustrated in Figure 6.
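Keeping a distinct ordering model per block size can be sketched as a small extension of the count-based model: one counter table per size, each updated only from previously coded blocks of the same size. The class and the example sizes below are illustrative assumptions.

```python
from collections import Counter, defaultdict

class PerSizeOrderModels:
    """Illustrative sketch: one count-based ordering model per block
    size, updated only from previously coded blocks of the same size."""

    def __init__(self, default_order):
        self.default_order = list(default_order)
        self.counts = defaultdict(Counter)  # block size -> tool usage counts

    def update(self, block_size, chosen_tool):
        self.counts[block_size][chosen_tool] += 1

    def order_for(self, block_size):
        c = self.counts[block_size]
        # Stable sort: unseen tools keep the pre-defined order.
        return sorted(self.default_order, key=lambda t: -c[t])

models = PerSizeOrderModels(["normal", "mmvd", "ciip"])
models.update(8, "normal"); models.update(8, "normal")    # small blocks
models.update(64, "ciip"); models.update(64, "ciip")      # large blocks
print(models.order_for(8))   # ['normal', 'mmvd', 'ciip']
print(models.order_for(64))  # ['ciip', 'normal', 'mmvd']
```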
- the syntax elements modelling and ordering may be done based on the coding information from the previous frame(s).
- a syntax elements model may be derived from the coding information of the blocks in the frame.
- the syntax elements model may then be used for ordering the syntax elements of the next frame(s). It is appreciated that there may be more than one frame used for modelling the syntax elements order.
- a separate syntax elements model may be derived.
- the syntax elements of this block are ordered based on the syntax elements model of the reference picture.
- the ordering of syntax elements may be done based on a pre-encoding and decoding process. For example, one or more frames of the video may be pre-encoded and decoded prior to the final encoding/decoding. Accordingly, the syntax elements modelling can be done with coding information and probabilities of the coding tools in the pre-encoding/decoding process. At least one syntax elements order may be derived and stored in the codec for coding the syntax elements in the final encoding/decoding process.
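The two-step scheme above can be sketched as follows. Here the first pass is stubbed out as a list of the tools each block would choose; the function name and the data are illustrative assumptions, not the patent's normative procedure.

```python
from collections import Counter

def two_pass_order(first_pass_tools, default_order):
    """Two-step encoding sketch: a pre-encoding pass records which tool
    each block chose, and the order derived from those statistics is
    stored for use in the final encoding/decoding pass."""
    # Pass 1: pre-encode with the fixed order, collecting tool statistics.
    stats = Counter(first_pass_tools)
    # Derive (and store in the codec) the order for the final pass.
    return sorted(default_order, key=lambda t: -stats[t])

order = two_pass_order(["ciip", "ciip", "triangle", "normal"],
                       ["normal", "mmvd", "ciip", "triangle"])
print(order)  # ['ciip', 'normal', 'triangle', 'mmvd']
```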
- there may be more than one syntax elements order derived for coding the syntax elements of the blocks in the final encoding/decoding process.
- the order may be derived based on different aspects as below:
- the syntax elements order may be derived based on the size of the blocks.
- the quality level of each frame may be different in the video sequence. Moreover, the quality of each block may also differ inside each picture. Hence, the syntax elements order may be derived based on the quality level of each block.
- the syntax elements order may be derived based on the temporal ID of the frame in the group of pictures (GOP).
- the codec may decide to use the one with proper properties, e.g. based on a certain block size, temporal ID or quality.
- the encoder may select the proper order based on a rate-distortion optimization by using all the available syntax elements orders for a certain group of coding tools. In such case, there may be a signaling of the index of the corresponding syntax elements order to the decoder.
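The encoder-side selection among several available orders can be sketched as below. Since the chosen order only affects the rate (the reconstruction is unchanged), the rate-distortion comparison reduces here to expected flag bits; the one-flag-per-tool cost model, candidate orders and probabilities are hypothetical.

```python
def select_order_index(candidate_orders, probs):
    """Sketch of encoder-side selection: evaluate every available
    syntax-element order and return the index of the cheapest one,
    which would then be signalled to the decoder."""
    def expected_bits(order):
        n = len(order)
        # Idealized flag cascade: tool at position i costs i+1 flags,
        # except the last two tools, which share the final flag.
        return sum(probs[t] * min(i + 1, n - 1) for i, t in enumerate(order))
    return min(range(len(candidate_orders)),
               key=lambda i: expected_bits(candidate_orders[i]))

orders = [["normal", "mmvd", "ciip"],   # e.g. an order derived for small blocks
          ["ciip", "normal", "mmvd"]]   # e.g. an order derived for large blocks
probs = {"normal": 0.2, "mmvd": 0.1, "ciip": 0.7}
print(select_order_index(orders, probs))  # 1: the CIIP-first order is cheaper
```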
- the syntax elements order modelling can be done for different regions of the image/video differently.
- the syntax elements modelling and re-ordering may be done differently for each area.
- at least one syntax elements modelling may be applied to polar areas and at least one to the equator areas.
- syntax elements modelling may be done in encoder side and/or decoder side.
- FIG. 10 is a flowchart illustrating a method according to an embodiment.
- a method comprises receiving 1010 a source picture; partitioning 1020 the source picture into a set of non-overlapping blocks; for a first block being coded, coding 1030 syntax elements of various coding tools according to a pre-defined order into the bitstream; for any one or more subsequent blocks, determining 1040 a probability of a usage of each of the various coding tools in any one or more previous blocks; deriving 1050 a probability model according to the determined probabilities; defining 1060 an order for syntax elements for a block according to the probability model; and encoding 1070 syntax elements of the block into a bitstream according to the defined order.
- An apparatus comprises means for receiving a source picture; means for partitioning the source picture into a set of non-overlapping blocks; for a first block being coded, means for coding syntax elements of various coding tools according to a pre-defined order into the bitstream; for any one or more subsequent blocks, means for determining a probability of a usage of each of the various coding tools in any one or more previous blocks; means for deriving a probability model according to the determined probabilities; means for defining an order for syntax elements for a block according to the probability model; and means for encoding syntax elements of the block into a bitstream according to the defined order.
- the means comprises at least one processor, and a memory including a computer program code, wherein the processor may further comprise processor circuitry.
- the memory and the computer program code are configured to, with the at least one processor, cause the apparatus to perform the method of Figure 10 according to various embodiments.
- An apparatus according to an embodiment is illustrated in Figure 11.
- The apparatus of this embodiment is a camera having multiple lenses and imaging sensors, but other types of cameras may also be used to capture wide view images and/or wide view video.
- wide view image and wide view video mean an image and a video, respectively, which comprise visual information having a relatively large viewing angle, larger than 100 degrees.
- a so-called 360 panorama image/video as well as images/videos captured by using a fisheye lens may also be called a wide view image/video in this specification.
- the wide view image/video may mean an image/video in which some kind of projection distortion may occur when a direction of view changes between successive images or frames of the video, so that a transform may be needed to find out co-located samples from a reference image or a reference frame. This will be described in more detail later in this specification.
- the camera 2700 of Figure 11 comprises two or more camera units 2701 and is capable of capturing wide view images and/or wide view video.
- Each camera unit 2701 is located at a different location in the multi-camera system and may have a different orientation with respect to the other camera units 2701.
- the camera units 2701 may have an omnidirectional constellation so that the camera has a 360-degree viewing angle in a 3D space. In other words, such a camera 2700 may be able to see each direction of a scene so that each spot of the scene around the camera 2700 can be viewed by at least one camera unit 2701.
- the camera 2700 of Figure 11 may also comprise a processor 2704 for controlling the operations of the camera 2700. There may also be a memory 2706 for storing data and computer code to be executed by the processor 2704, and a transceiver 2708 for communicating with, for example, a communication network and/or other devices in a wireless and/or wired manner.
- the camera 2700 may further comprise a user interface (UI) 2710 for displaying information to the user, for generating audible signals and/or for receiving user input.
- UI user interface
- the camera 2700 need not comprise each feature mentioned above or may comprise other features as well. For example, there may be electric and/or mechanical elements for adjusting and/or controlling optics of the camera units 2701 (not shown).
- Figure 11 also illustrates some operational elements which may be implemented, for example, as a computer code in the software of the processor, in a hardware, or both.
- a focus control element 2714 may perform operations related to adjustment of the optical system of camera unit or units to obtain focus meeting target specifications or some other predetermined criteria.
- An optics adjustment element 2716 may perform movements of the optical system or one or more parts of it according to instructions provided by the focus control element 2714. It should be noted here that the actual adjustment of the optical system need not be performed by the apparatus, but it may be performed manually, wherein the focus control element 2714 may provide information for the user interface 2710 to indicate to a user of the device how to adjust the optical system.
- a device may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the device to carry out the features of an embodiment.
- a network device like a server may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the network device to carry out the features of an embodiment.
- the computer program code comprises one or more operational characteristics.
- Said operational characteristics are being defined through configuration by said computer based on the type of said processor, wherein a system is connectable to said processor by a bus, wherein a programmable operational characteristic of the system comprises receiving a source picture; partitioning the source picture into a set of non-overlapping blocks; for a first block being coded, coding syntax elements of various coding tools according to a pre-defined order into the bitstream; for any one or more subsequent block, determining a probability of a usage of each of the various coding tools in any one or more previous blocks; deriving a probability model according to the determined probabilities; defining an order for syntax elements for a block according to the probability model; and encoding syntax elements of the block into a bitstream according to the defined order.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201962863490P | 2019-06-19 | 2019-06-19 | |
PCT/FI2020/050421 WO2020254723A1 (en) | 2019-06-19 | 2020-06-12 | A method, an apparatus and a computer program product for video encoding and video decoding |
Publications (2)
Publication Number | Publication Date |
---|---|
EP3987808A1 true EP3987808A1 (en) | 2022-04-27 |
EP3987808A4 EP3987808A4 (en) | 2023-07-05 |
Family
ID=74036977
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP20826119.8A Withdrawn EP3987808A4 (en) | 2019-06-19 | 2020-06-12 | A method, an apparatus and a computer program product for video encoding and video decoding |
Country Status (2)
Country | Link |
---|---|
EP (1) | EP3987808A4 (en) |
WO (1) | WO2020254723A1 (en) |
Also Published As
Publication number | Publication date |
---|---|
EP3987808A4 (en) | 2023-07-05 |
WO2020254723A1 (en) | 2020-12-24 |