CN113994685A - Exchanging information in scalable video coding - Google Patents

Exchanging information in scalable video coding

Info

Publication number
CN113994685A
Authority
CN
China
Prior art keywords
signal
encoded
information
encoding
decoded
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202080028272.6A
Other languages
Chinese (zh)
Inventor
Guido Meardi
Ivan Damjanovic
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
V-Nova Ltd
Original Assignee
V-Nova Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by V-Nova Ltd
Publication of CN113994685A


Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/46: Embedding additional information in the video signal during the compression process
    • H04N 19/10: ... using adaptive coding
    • H04N 19/102: ... characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N 19/124: Quantisation
    • H04N 19/134: ... characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N 19/136: Incoming video signal characteristics or properties
    • H04N 19/169: ... characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N 19/187: ... the unit being a scalable video layer
    • H04N 19/30: ... using hierarchical techniques, e.g. scalability
    • H04N 19/33: ... scalability in the spatial domain
    • H04N 19/42: ... characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
    • H04N 19/423: ... characterised by memory arrangements

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

According to an aspect of the present invention, there may be provided a method of encoding a data stream, the method comprising: receiving an input signal; applying a first encoding operation to the input signal using a first codec to produce a first encoded stream; and applying a second encoding operation to the input signal to produce a second encoded stream, wherein the first and second encoded streams are for combining at a decoder; and wherein the method further comprises exchanging information between the first encoding operation and the second encoding operation. A method of decoding, an encoder, a decoder, and a computer-readable medium are also provided.

Description

Exchanging information in scalable video coding
Background
Recent improvements in video coding techniques include the concept of scalable video coding. Examples include VC-6, which is being standardized at SMPTE as ST 2117, and LCEVC, which is being standardized at MPEG as MPEG-5 Part 2. Typically, these hierarchical encoding schemes use multiple resolution levels and an encoder (or encoding module) associated with each resolution level.
Examples of hierarchical encoding techniques include patent publications WO 2013/171173, WO 2014/170819, WO 2018/046940, and WO 2019/111004, the contents of which are incorporated herein by reference.
In these new coding schemes, efficiencies and optimizations are sought to reduce data size and/or processing requirements while improving picture quality of the final reconstructed image.
Disclosure of Invention
According to a first aspect of the present invention, there may be provided a method of encoding a signal, the method comprising: receiving an input signal; applying a first encoding operation to the input signal using a first codec to produce a first encoded stream; and applying a second encoding operation to the input signal to produce a second encoded stream, wherein the first and second encoded streams are for combining at a decoder; and wherein the method further comprises exchanging information between the first encoding operation and the second encoding operation. Depending on the information exchanged, this exchange of useful information between encoding operations allows for improved data quality after reconstruction, reduced data size, and/or improved processing efficiency. The signal may be a data stream.
Preferably, the method may further comprise adapting the first or second encoding operation or both based on the information. With such information exchange, the adaptation may be coordinated to improve overall efficiency or quality or to provide a balance in the overall encoding operation. More preferably, the first and second encoding operations may be encoding operations of a hierarchical encoding scheme. Since each level of the hierarchy provides a particular benefit or impact, by adapting the parameters at each level, an overall improvement may be made.
In certain examples, the hierarchical encoding scheme may comprise: generating a base encoded signal by feeding a downsampled version of an input signal to an encoder; generating a first residual signal by: obtaining a decoded version of the base encoded signal; and generating the first residual signal using a difference between the decoded version of the base encoded signal and the downsampled version of the input signal; and encoding the first residual signal to generate a first encoded residual signal. The method may further comprise: generating a second residual signal by: decoding the first encoded residual signal to generate a first decoded residual signal; correcting the decoded version of the base encoded signal using the first decoded residual signal to produce a corrected decoded version; upsampling the corrected decoded version; and generating the second residual signal using a difference between the upsampled version and the input signal; wherein the method further comprises: encoding the second residual signal to generate a second encoded residual signal, wherein the base encoded signal, the first encoded residual signal, and the second encoded residual signal comprise an encoding of the input signal. The first encoding operation may be the encoder or a step of encoding the first residual signal. The second encoding operation may be a step of encoding the first residual signal or a step of encoding the second residual signal.
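By way of illustration only, the following sketch outlines this two-level structure in Python. The base_codec and residual_codec objects and their encode/decode methods are hypothetical placeholders, and the simple averaging and nearest-neighbour resamplers merely stand in for the scheme's actual down-sampling and up-sampling filters; this is a minimal sketch, not the standardised process.

```python
import numpy as np

def downsample(frame):
    # 2x2 averaging; a stand-in for the scheme's actual down-sampler.
    h, w = frame.shape
    return frame.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(frame):
    # Nearest-neighbour; a stand-in for the scheme's actual up-sampler.
    return frame.repeat(2, axis=0).repeat(2, axis=1)

def encode_two_level(frame, base_codec, residual_codec):
    """Two-level hierarchical encoding as set out above. The codec arguments
    are hypothetical objects exposing encode()/decode() methods."""
    down = downsample(frame)
    base_encoded = base_codec.encode(down)            # base encoded signal
    base_decoded = base_codec.decode(base_encoded)    # decoded version of base
    r1 = down - base_decoded                          # first residual signal
    r1_encoded = residual_codec.encode(r1)            # first encoded residual signal
    corrected = base_decoded + residual_codec.decode(r1_encoded)
    r2 = frame - upsample(corrected)                  # second residual signal
    r2_encoded = residual_codec.encode(r2)            # second encoded residual signal
    return base_encoded, r1_encoded, r2_encoded
```

The three returned streams together comprise the encoding of the input signal, matching the description above.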
In certain other examples, the hierarchical encoding scheme may comprise: generating a base encoded signal by feeding a downsampled version of an input signal to an encoder, said downsampled version having undergone one or more downsampling operations; generating one or more residual signals by: upsampling an output of each downsampling operation to produce one or more upsampled signals; and generating the one or more residual signals using a difference between each upsampled signal and an input to a respective downsampling operation; and encoding the one or more residual signals to generate one or more encoded residual signals. The first and second encoding operations may correspond to the step of encoding any two of the one or more residual signals.
Alternatively, the hierarchical encoding scheme may comprise: generating a base encoded signal by feeding a downsampled version of an input signal to an encoder, said downsampled version having undergone a plurality of sequential downsampling operations; and generating an encoded first residual signal by: upsampling the downsampled version of the input signal; generating the first residual signal using a difference between an upsampled version of the downsampled version and an input to a last downsampling operation of the plurality of sequential downsampling operations; and encoding the first residual signal; generating a second residual signal by: upsampling a sum of the first residual signal and the output of the downsampling operation that precedes the last downsampling operation of the plurality of sequential downsampling operations; generating the second residual signal using a difference between the upsampled sum and the input to that preceding downsampling operation; and encoding the second residual signal. The first encoding operation may be the encoder or a step of encoding the first residual signal. The second encoding operation may be a step of encoding the first residual signal or a step of encoding the second residual signal.
According to a second aspect of the present invention, there may be provided a method of decoding a signal, the method comprising: receiving a first encoded signal and a second encoded signal; applying a first decoding operation to the first encoded signal to generate a first output signal; applying a second decoding operation to the second encoded signal to generate a second output signal; and combining the first output signal and the second output signal to reconstruct an input signal, wherein the method further comprises exchanging information between the first decoding operation and the second decoding operation.
Preferably, the method may further comprise adapting the first or second decoding operation or both based on the information. More preferably, the first and second decoding operations may be decoding operations of a hierarchical encoding scheme.
In certain examples, the hierarchical encoding scheme may comprise: receiving a base encoded signal and instructing decoding of the base encoded signal to produce a base decoded signal; receiving a first encoded residual signal and decoding the first encoded residual signal to generate a first decoded residual signal; correcting the base decoded signal using the first decoded residual signal to generate a corrected version of the base decoded signal; upsampling the corrected version of the base decoded signal to produce an upsampled signal; receiving a second encoded residual signal and decoding the second encoded residual signal to generate a second decoded residual signal; and combining the upsampled signal and the second decoded residual signal to generate a reconstructed version of the input signal.
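A corresponding decoding sketch, under the same assumptions as the encoding sketch above (hypothetical codec objects and a nearest-neighbour up-sampler standing in for the actual filter), may look as follows.

```python
def upsample(frame):
    # Nearest-neighbour; a stand-in for the scheme's actual up-sampler.
    return frame.repeat(2, axis=0).repeat(2, axis=1)

def decode_two_level(base_encoded, r1_encoded, r2_encoded,
                     base_codec, residual_codec):
    """Two-level hierarchical decoding as set out above. The codec arguments
    are hypothetical objects exposing a decode() method."""
    base_decoded = base_codec.decode(base_encoded)
    corrected = base_decoded + residual_codec.decode(r1_encoded)  # corrected version
    predicted = upsample(corrected)                               # upsampled signal
    return predicted + residual_codec.decode(r2_encoded)          # reconstructed input
```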
In certain examples, the first encoded signal and the second encoded signal may comprise first and second component sets, respectively, the first component set corresponding to a lower image resolution than the second component set, the method comprising: for each of the first and second component sets, decoding the set of components to obtain a decoded set; upgrading the decoded first component set to increase its corresponding image resolution to be equal to the corresponding image resolution of the decoded second component set; and combining the decoded first and second component sets together to produce a reconstructed set. The method may further comprise receiving one or more further component sets, wherein each of the one or more further component sets corresponds to a higher image resolution than the second component set, and wherein each of the one or more further component sets corresponds to a progressively higher image resolution, the method comprising, for each of the one or more further component sets, decoding the set of components so as to obtain a decoded set, the method further comprising, for each of the one or more further component sets, in increasing order of corresponding image resolution: upgrading the reconstructed set having the highest corresponding image resolution so as to increase its corresponding image resolution to equal that of the further component set, and combining the reconstructed set and the further component set together so as to produce a further reconstructed set.
Optionally, the step of exchanging information may comprise sending the information as metadata in a stream. Alternatively, the step of exchanging information may comprise embedding the information in a stream. Further alternatively, the step of exchanging information may comprise sending the information using an Application Programming Interface (API). Further alternatively, the step of exchanging information may comprise sending a pointer to a shared memory space. The step of exchanging information may further comprise exchanging the information using Supplemental Enhancement Information (SEI) messages.
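Purely as an illustration of the API or shared-memory style of exchange, the following sketch shows a minimal in-process channel. The class and the adaptation rule at the end are hypothetical and not part of any standard; SEI messages, in-stream metadata or a shared memory pointer could equally serve as the transport.

```python
class InformationExchange:
    """Hypothetical in-process channel through which two encoding (or
    decoding) operations can exchange information."""

    def __init__(self):
        self._info = {}

    def publish(self, key, value):
        self._info[key] = value

    def read(self, key, default=None):
        return self._info.get(key, default)

# Illustrative use: the base encoder publishes its per-frame quantization
# level and the enhancement encoder adapts its own quantization step.
exchange = InformationExchange()
exchange.publish("base_qp", 28)          # published by the base encoding operation
base_qp = exchange.read("base_qp")       # read by the enhancement encoding operation
enhancement_step = max(1, base_qp - 12)  # illustrative adaptation rule only
```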
The exchanged information may include encoding or decoding parameters for modifying encoding or decoding operations, respectively.
The exchanged information may comprise one or more selected from the group comprising: user input information; metadata describing the content of the input video; host device information; content analysis information; perceptual information describing regions of an image where coding artefacts are less noticeable given the Human Visual System (HVS); motion information; the complexity of the frame to be encoded; frame entropy; an estimated number of bits required; frame information; decisions taken during encoding or decoding; the type of prediction used for a frame to be decoded/encoded; quantization levels for a frame to be decoded/encoded, or decisions taken per group of pixels; statistical data information; target video quality metrics; and rate control information.
The signal may be a video, and the step of exchanging information may be performed per video, per group of pictures, per slice, per picture, per group of pixels, or per pixel.
According to another aspect of the present invention, there may be provided a method of encoding a data stream, the method comprising: receiving an input video; applying a base encoding operation to the input video using a base codec to produce a base encoded stream; applying another encoding operation to the input video to produce an enhancement stream; and exchanging information between the base encoding operation and the other encoding operation.
According to another aspect of the present invention, there may be provided a method of decoding a data stream, the method comprising: receiving a base encoded stream and enhancement stream data; applying a base decoding operation to the base encoded stream to produce a first output video; applying another decoding operation to the enhancement stream data to produce a set of residuals; and combining the first output video and the set of residuals to reconstruct the input video, wherein the method further comprises exchanging information between the base decoding operation and the other decoding operation.
According to another aspect, an apparatus may be provided for encoding a data set into an encoded data set comprising a header and a payload. The apparatus is configured to encode the input video according to the above steps. The apparatus may include a processor configured to carry out the method of any of the above aspects.
According to another aspect, an apparatus may be provided for decoding an encoded data set, comprising a header and a payload, into a reconstructed video. The apparatus is configured to decode the data set according to the above steps. The apparatus may include a processor configured to carry out the method of any of the above aspects.
An encoding apparatus and a decoding apparatus may also be provided.
According to further aspects of the invention, a computer-readable medium may be provided comprising instructions which, when executed by a processor, cause the processor to perform any of the methods of the above aspects.
Drawings
FIG. 1 shows a high-level schematic diagram of a hierarchical encoding and decoding process;
FIG. 2 shows a high-level schematic diagram of a hierarchical deconstruction process;
FIG. 3 shows an alternative high-level schematic diagram of a hierarchical deconstruction process;
FIG. 4 shows a high-level schematic diagram of an encoding process suitable for encoding a residual of a layered output;
FIG. 5 shows a high level schematic diagram of a hierarchical decoding process suitable for decoding each output level from FIG. 4;
FIG. 6 shows a high-level schematic diagram of an encoding process of a hierarchical encoding technique; and
FIG. 7 shows a high-level schematic diagram of a decoding process suitable for decoding the output of FIG. 6.
Detailed Description
The present invention relates to a method. In particular, the invention relates to methods for encoding and decoding signals. Processing data may include, but is not limited to, obtaining, deriving, outputting, receiving, and reconstructing data. The present invention relates to the exchange of useful information between two (or more) encoders that encode the same content (or portions or representations thereof), whether in the form of a joint module or of a module implemented in one of the encoders that produces information which may be useful to the other. Similarly, each of the concepts described herein also applies to decoder stages, in which multiple decoding stages may exchange information with one another.
In a preferred example, the encoder or decoder is part of a hierarchical coding scheme. In a more preferred example, the encoder or decoder utilizes techniques incorporated in the VC-6 or LCEVC encoding schemes, although the concepts illustrated herein are not limited to these particular hierarchical encoding schemes.
Fig. 1 shows very generally a hierarchical coding scheme. Data to be encoded 101 is retrieved by a scalable encoder 102 which outputs encoded data 103. Encoded data 103 is then received by a scalable decoder 104 that decodes the data and outputs decoded data 105.
In general, the hierarchical encoding schemes used in the examples herein generate a base or core level, which represents the original data at a lower level of quality, and one or more residual levels, which may be used to recreate the original data at a higher level of quality using a decoded version of the base level data. In general, the term "residual" as used herein refers to the difference between a value of a reference array or reference frame and an actual array or frame of data. The array may be a one-dimensional or two-dimensional array representing a coding unit. For example, a coding unit may be a 2×2 or 4×4 set of residual values corresponding to a similarly sized area of an input video frame.
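As a minimal illustration of these definitions, the following sketch computes residuals as element-wise differences and splits a residual frame into coding units; the function names are illustrative only.

```python
import numpy as np

def residuals(reference, actual):
    # A residual is the difference between an element of the reference
    # array/frame and the corresponding element of the actual data.
    return actual - reference

def coding_units(residual_frame, size=2):
    # Yield size x size coding units (e.g. 2x2 or 4x4) of a residual frame.
    h, w = residual_frame.shape
    for y in range(0, h, size):
        for x in range(0, w, size):
            yield residual_frame[y:y + size, x:x + size]
```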
It should be noted that the generalized examples are agnostic as to the nature of the input signal. Reference to "residual data" as used herein refers to data derived from a set of residuals, e.g., the set of residuals themselves or the output of a set of data processing operations performed on the set of residuals. Throughout this specification, in general, a set of residuals comprises a plurality of residuals or residual elements, each corresponding to a signal element, i.e., an element of the signal or original data.
In a particular example, the data may be an image or video. In these examples, the set of residuals corresponds to an image or frame of the video, each residual being associated with a pixel of the signal, the pixel being the signal element.
The methods described herein are applicable to so-called data planes that reflect different color components of a video signal. For example, the method may be applied to different planes reflecting YUV or RGB data for different color channels. Different color channels may be processed in parallel. The components of each stream may be aligned in any logical order.
A hierarchical coding scheme in which the concepts of the present invention may be deployed will now be described. The scheme is conceptually illustrated in figs. 2 to 5 and corresponds generally to VC-6 described above. In such encoding techniques, residual data is used in progressively higher levels of quality. In the technique proposed here, a core layer represents the image at a first resolution, and subsequent layers in the hierarchical structure are residual data or adjustment layers necessary for the decoding side to reconstruct the image at a higher resolution. Each layer or level may be referred to as an echelon index, such that the residual data is the data required to correct low-quality information present in a lower echelon index. Each layer or echelon index in this hierarchical technique, particularly each residual layer, is often a comparatively sparse data set having many zero-value elements.
When reference is made to an echelon index, it refers collectively to all echelons or component sets at that level, e.g., all subsets arising from the transform step performed at that level of quality.
In this particular hierarchical manner, the described data structure removes any requirement for, or reliance on, preceding or subsequent levels of quality. A level of quality may be encoded and decoded separately, without reference to any other layer. In contrast to many known hierarchical encoding schemes, in which the lowest level of quality must be decoded in order to decode any higher level of quality, the described methodology does not require the decoding of any other layer. Nevertheless, the principles of exchanging information described below may also be applied to other hierarchical encoding schemes.
As shown in fig. 2, the encoded data represents a set of layers or levels, referred to generally herein as echelon indices. The base or core level represents the original data frame 210, albeit at the lowest level of quality or resolution, and the subsequent echelons of residual data can be combined with the data at the core echelon index to recreate the original image at progressively higher resolutions.
To create the core echelon index, the input data frame 210 may be down-sampled using a number of down-sampling operations 201 corresponding to the number of levels or echelon indices to be used in the hierarchical encoding operation. One fewer down-sampling operation 201 than the number of levels in the hierarchy is required. In all examples illustrated herein there are four levels or echelon indices of output encoded data and accordingly three down-sampling operations, although it will of course be understood that these are merely for illustration. Where n indicates the number of levels, the number of down-samplers is n - 1. The core level R(1-n) is the output of the third down-sampling operation. As indicated above, the core level R(1-n) corresponds to a representation of the input data frame at the lowest level of quality.
To distinguish the down-sampling operations 201, each will be referred to in the order in which the operation is performed on the input data 210 or on the data output by it. For example, the third down-sampling operation 201(1-n) in the example may also be referred to as the core down-sampler, as its output generates the core echelon index or echelon(1-n), that is, the index of all echelons at this level is 1 - n. Thus, in this example, the first down-sampling operation 201-1 corresponds to the R-1 down-sampler, the second down-sampling operation 201-2 corresponds to the R-2 down-sampler, and the third down-sampling operation 201(1-n) corresponds to the core or R-3 down-sampler.
As shown in fig. 2, the signal representing the core level of quality R(1-n) undergoes an up-sampling operation 202(1-n), referred to herein as the core up-sampler. The difference 203-2 between the output of the second down-sampling operation 201-2 (the R-2 down-sampler, i.e., the input to the core down-sampler) and the output of the core up-sampler 202(1-n) is output as the first residual data R-2. This first residual data R-2 accordingly represents the error between the core level R-3 and the signal that was used to create that level. Since the signal itself has undergone two down-sampling operations in this example, the first residual data R-2 is an adjustment layer which can be used to recreate the original signal at a higher level of quality than the core level of quality, but still at a lower level than that of the input data frame 210.
Variations in how the residual data representing higher levels of quality is created are conceptually illustrated in figs. 2 and 3.
In fig. 2, the output of the second down-sampling operation 201-2 (the R-2 down-sampler, i.e., the signal used to create the first residual data R-2) is up-sampled 202-2, and the difference 203-1 between the input to the second down-sampling operation 201-2 (i.e., the output of the R-1 down-sampler) and this up-sampled signal is calculated in the same manner in which the first residual data R-2 was created. This difference is accordingly the second residual data R-1 and represents an adjustment layer which can be used to recreate the original signal at a higher level of quality using data from the lower levels.
In the variation of fig. 3, however, the output of the second down-sampling operation 201-2 (the R-2 down-sampler) is combined or summed 304-2 with the first residual data R-2 to recreate the output of the core up-sampler 202(1-n). In this variation it is this recreated data, rather than the down-sampled data, which is up-sampled 202-2. The up-sampled data is similarly compared 203-1 with the input to the second down-sampling operation (the R-2 down-sampler, i.e., the output of the R-1 down-sampler) to create the second residual data R-1.
The variation between the implementations of figs. 2 and 3 results in slightly different residual data between the two. The arrangement of fig. 2 benefits from greater potential for parallelization.
The process or cycle is repeated to create the third residual R0. In the examples of figs. 2 and 3, the output residual data R0 (i.e., the third residual data) corresponds to the highest level and is used at the decoder to recreate the input data frame. At this level, the difference operation is based on the input data frame, which is the same as the input to the first down-sampling operation.
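A minimal sketch of the fig. 2 style of deconstruction, assuming simple 2×2 averaging and nearest-neighbour resampling as stand-ins for the actual filters, may look as follows.

```python
import numpy as np

def downsample(x):
    # 2x2 averaging; a stand-in for the scheme's actual down-sampler.
    return x.reshape(x.shape[0] // 2, 2, x.shape[1] // 2, 2).mean(axis=(1, 3))

def upsample(x):
    # Nearest-neighbour; a stand-in for the scheme's actual up-sampler.
    return x.repeat(2, axis=0).repeat(2, axis=1)

def deconstruct(frame, n_levels=4):
    """Fig. 2-style deconstruction: a core level plus n-1 residual echelons."""
    inputs = [frame]
    for _ in range(n_levels - 1):              # n - 1 down-sampling operations
        inputs.append(downsample(inputs[-1]))
    core = inputs[-1]                          # core level R(1-n)
    residuals = []                             # R-2, R-1, R0 for n_levels = 4
    for level in range(n_levels - 2, -1, -1):
        residuals.append(inputs[level] - upsample(inputs[level + 1]))
    return core, residuals
```

In this fig. 2 variant each residual is computed from the down-sampled data directly, which is what permits the parallelization noted above.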
Fig. 4 shows an example encoding process 401 for encoding each of the levels or echelon indices of data to produce a set of encoded echelons of data having an echelon index. This encoding process is merely an example of a suitable encoding process for encoding each of the levels, but it will be understood that any suitable encoding process may be used. The input to the process is a respective level of residual data output from fig. 2 or 3, and the output is a set of echelons of encoded residual data which together hierarchically represent the encoded data.
In a first step, a transform 402 is performed. The transform may be a directional decomposition transform as described in WO 2013/171173, or a wavelet or discrete cosine transform. If a directional decomposition transform is used, a set of four components may be output. When reference is made to an echelon index, it refers collectively to all directions (A, H, V, D), i.e., four echelons. The component set is then quantized 403 before entropy encoding. In this example, the entropy encoding operation 404 is coupled to a sparsification step 405, which takes advantage of the sparseness of the residual data to reduce the overall data size, and involves mapping data elements to an ordered quadtree. This coupling of entropy coding and sparsification is described further in WO 2019/111004, but the precise details of that process are not relevant to an understanding of the present invention. Each array of residuals may be thought of as an echelon.
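The per-echelon pipeline may be sketched as below; each stage is passed in as a callable because the precise transform, quantization and sparsified entropy-encoding operations are those defined by the scheme in use, not by this sketch.

```python
def encode_echelon(residual_plane, transform, quantise, entropy_encode):
    # The per-echelon pipeline of fig. 4: transform 402, quantization 403,
    # then entropy encoding 404 (coupled with sparsification 405).
    components = transform(residual_plane)    # e.g. the A, H, V, D components
    return [entropy_encode(quantise(c)) for c in components]
```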
The process set out above corresponds to an encoding process suitable for encoding data for reconstruction according to SMPTE ST 2117, the VC-6 multi-planar picture format. VC-6 is a flexible, multi-resolution, intra-only bitstream format capable of compressing any ordered set of integer-element grids, each of independent size, and is designed for picture compression. It employs data-agnostic techniques for compression and is capable of compressing pictures of low or high bit depth. The header of the bitstream may contain a variety of metadata about the picture.
As will be appreciated, each echelon or echelon index may be implemented using a separate encoder or encoding operation. Similarly, an encoding module may be divided into the steps of down-sampling and comparing, to produce the residual data, and encoding the residuals; alternatively, the steps for each echelon may be implemented in a combined encoding module. Thus, the process may be implemented, for example, using four encoders (one for each echelon index), one encoder with multiple encoding modules operating in parallel or in series, or one encoder operating repeatedly on different data sets.
The following describes an example process for reconstructing an original data frame which has been encoded using the above exemplary process. This reconstruction process may be referred to as pyramidal reconstruction. Advantageously, the method provides an efficient technique for reconstructing an image encoded in a received data set, which may be received by way of a data stream, for example, by separately decoding different component sets corresponding to different image sizes or resolution levels, and combining the image detail from one decoded component set with the upgraded decoded image data from a lower-resolution component set. Thus, by performing this process for two or more component sets, structure or detail within the image may be reconstructed at progressively higher resolutions or greater numbers of pixels, without requiring the full or complete image detail of the highest-resolution component set to be received. Rather, the method facilitates the gradual addition of increasingly higher-resolution details while reconstructing an image from a lower-resolution component set, in a staged, hierarchical manner.
Furthermore, decoding of each component set separately facilitates parallel processing of the received component sets, thus improving reconstruction speed and efficiency in implementations where multiple processes are available.
Each resolution level corresponds to a level of quality or echelon index. This is a collective term, associated with a plane (in this example, a representation of a grid of integer-value elements), that describes all new inputs or received component sets and the output reconstructed image for a cycle of index m. For instance, the reconstructed image at echelon index zero is the output of the final cycle of pyramidal reconstruction.
Pyramidal reconstruction may be a process of reconstructing an inverted pyramid, starting with the initial echelon index and using cycles with new residuals to derive higher echelon indices up to the maximum quality, quality zero, at echelon index zero. A cycle may be thought of as a step in such a pyramidal reconstruction, the step being identified by an index m. The step typically comprises up-sampling data output from a possible previous step, for instance upgrading the decoded first component set, and takes new residual data as further inputs in order to obtain output data to be up-sampled in a possible following step. Where only the first and second component sets are received, the number of echelon indices will be two and there is no possible following step. However, in examples in which the number of component sets or echelon indices is three or greater, the output data may be progressively up-sampled in the following steps.
The first component set typically corresponds to the initial echelon index, which may be denoted by echelon index 1-N, where N is the number of echelon indices in the plane.
Typically, the upgrading of the decoded first component set comprises applying an up-sampler to the output of the decoding procedure for the initial echelon index. In examples, this involves bringing the resolution of a reconstructed picture, output from the decoding of the initial echelon index component set, into conformity with the resolution of the second component set, corresponding to echelon index 2-N. Typically, the upgraded output from the lower echelon index component set corresponds to a predicted image at the resolution of the higher echelon index. Owing to the low-resolution initial echelon index image and the up-sampling process, the predicted image typically corresponds to a smoothed or blurred picture.
Adding to this predicted picture the higher-resolution details from the echelon index above provides a combined, reconstructed image set. Advantageously, where the received component sets for one or more higher echelon indices comprise residual image data, or data indicating the pixel-value differences between upgraded predicted pictures and the original, uncompressed or pre-encoding images, the amount of received data required in order to reconstruct an image or data set of a given resolution or quality may be considerably less than the amount or rate of data that would be required to receive a same-quality image using other techniques. Thus, by combining low-detail image data received at lower resolutions with progressively greater-detail image data received at increasingly higher resolutions in accordance with the method, data rate requirements are reduced.
Typically, the encoded data set comprises one or more further sets of components, wherein each of the one or more further sets of components corresponds to a higher image resolution than the second set of components, and wherein each of the one or more further sets of components corresponds to a progressively higher image resolution, the method comprising, for each of the one or more further sets of components, decoding the set of components so as to obtain a decoded set, the method further comprising, for each of the one or more further sets of components, in increasing order of corresponding image resolutions: the reconstructed set having the highest corresponding image resolution is upgraded to increase the corresponding image resolution of the reconstructed set to equal a corresponding image resolution of another set of components, and the reconstructed set and the another set of components are combined together to produce another reconstructed set.
In this way, the method may involve taking the reconstructed image output of a given component set level or echelon index, upgrading the reconstructed set, and combining it with the decoded output of the component set or echelon index above, to produce a new, higher-resolution reconstructed picture. It will be understood that this may be performed repeatedly, for progressively higher echelon indices, depending on the total number of component sets in the received set.
In typical examples, each of the component sets corresponds to a progressively higher image resolution, each progressively higher image resolution corresponding to a factor-of-four increase in the number of pixels in the corresponding image. Thus, typically, the image size corresponding to a given component set is four times the size, or number of pixels, of the image corresponding to the component set below, that is, the component set whose echelon index is one less than the echelon index in question; equivalently, twice the height and twice the width. A received component set in which the linear size of each corresponding image is double that of the image below may, for example, facilitate simpler upgrading operations.
In the example shown, the number of further component sets is two. Thus, the total number of component sets in the received set is four. This corresponds to the initial echelon index being echelon-3.
The first component set may correspond to image data, and the second and any further component sets correspond to residual image data. As noted above, the method provides a particularly advantageous reduction in the data rate requirement for a given image size where the lowest echelon index, that is, the first component set, contains a low-resolution, or down-sampled, version of the image being transmitted. In this way, with each cycle of reconstruction, beginning with a low-resolution image, that image is upgraded so as to produce a higher-resolution, albeit smoothed, version, and that image is then improved by way of adding the difference between that upgraded predicted picture and the actual image to be transmitted at that resolution, and this additive improvement may be repeated for each cycle. Therefore, each component set above that of the initial echelon index need only contain residual data in order to reintroduce the information which may have been lost in down-sampling the original image to the lowest echelon index.
The method provides a way of obtaining image data, which may be residual data, upon receipt of a set containing data that has been compressed, for example, by way of decomposition, quantization, entropy encoding and sparsification.
The sparsification step is particularly advantageous when used in connection with sets in which the original or pre-transmission data, which may typically correspond to residual image data, is sparse. A residual may be the difference between elements of a first image and elements of a second image, the two images typically being co-located. Such residual image data may typically have a high degree of sparseness. This may be thought of as corresponding to an image wherein areas of detail are sparsely distributed amongst areas in which detail is minimal, negligible, or absent. Such sparse data may be described as an array of data wherein the data are organised in at least a two-dimensional structure (e.g., a grid), and wherein a large portion of the data so organised are zero (logically or numerically) or are considered to be below a certain threshold. Residual data are just one example. Additionally, metadata may be sparse, and so be reduced in size to a significant degree by this process. Sending data which has been sparsified allows a significant reduction in required data rate to be achieved by way of omitting to send such sparse areas, and instead reintroducing them at appropriate locations within a received byte set at the decoder.
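The ordered-quadtree mapping of WO 2019/111004 is beyond the scope of a short example, but the following simplified coordinate-list sketch conveys the principle of omitting the sparse areas and re-populating them at the decoder.

```python
import numpy as np

def sparsify(plane, threshold=0):
    # Keep only the values above the threshold, with their coordinates;
    # the zero (or near-zero) bulk of the array is simply not sent.
    ys, xs = np.nonzero(np.abs(plane) > threshold)
    entries = list(zip(ys.tolist(), xs.tolist(), plane[ys, xs].tolist()))
    return plane.shape, entries

def de_sparsify(shape, entries):
    # Re-populate the omitted zero values at the appropriate locations
    # (cf. step 507 of the decoding example below).
    plane = np.zeros(shape)
    for y, x, value in entries:
        plane[y, x] = value
    return plane
```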
Typically, the entropy-decoding, de-quantization, and directional-composition transform steps are performed in accordance with parameters defined by the encoder or by a node from which the received set of encoded data is sent. For each echelon index, or component set, the steps serve to decode image data so as to arrive at a set which may be combined with different echelon indices as per the technique disclosed above, while allowing the set for each level to be transmitted in a data-efficient manner.
There may also be provided a method of reconstructing a set of encoded data according to the method disclosed above, wherein the decoding of each of the first and second component sets is performed according to the method disclosed above. Thus, the advantageous decoding method of the present disclosure may be utilised for each component set, or echelon index, in a received set of image data, and reconstruction performed accordingly.
With reference to fig. 5, a decoding example is now described. A set of encoded data 501 is received, wherein the set comprises four echelon indices, each echelon index comprising four echelons: from echelon 0, the highest resolution or level of quality, to echelon-3, the initial echelon. The data carried in the echelon-3 component set corresponds to image data, and the other component sets contain residual data for the transmitted image. While each of the levels may output data that can be considered as residuals, the residuals in the initial echelon level, that is echelon-3, effectively correspond to the actual reconstructed image. At stage 503, each of the component sets is processed in parallel so as to decode the encoded set.
With reference to the initial echelon index, or the core echelon index, the following decoding steps are carried out for each component set, echelon-3 to echelon 0.
At step 507, the component set is de-sparsified. De-sparsification causes a sparse two-dimensional array to be recreated from the encoded byte set received at each echelon. Zero values grouped at locations within the two-dimensional array which were not received (owing to their omission from the transmitted byte set in order to reduce the quantity of data transmitted) are re-populated by this process. Non-zero values in the array retain their correct values and positions within the recreated two-dimensional array, with the de-sparsification step re-populating the transmitted zero values at the appropriate locations or groups of locations in between them.
At step 509, a range decoder, the configured parameters of which correspond to those with which the transmitted data was encoded prior to transmission, is applied to the de-sparsified set at each echelon in order to substitute the encoded symbols within the array with pixel values. The encoded symbols in the received set are substituted for pixel values according to an approximation of the pixel-value distribution for the image. The use of an approximation of the distribution, that is, the relative frequency of each value across every pixel value in the image, rather than the true distribution, permits a reduction in the amount of data required to decode the set, since the distribution information is required by the range decoder in order to carry out this step. As described in the present disclosure, the steps of de-sparsification and range decoding are interdependent, rather than sequential. This is indicated by the loop formed by the arrows in the flow diagram.
At step 511, the array of values is de-quantized. This process is again performed in accordance with the parameters with which the decomposed image was quantized prior to transmission.
Following de-quantization, the set is transformed at step 513 by a composition transform which comprises applying an inverse directional decomposition operation to the de-quantized array. This causes the directional filtering, according to a 2×2 operator comprising average, horizontal, vertical and diagonal operators, to be reversed, such that the resulting array is image data for echelon-3 and residual data for echelons-2 to 0.
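Assuming a plain 2×2 Hadamard basis (the normative operator definitions and normalization are those of the cited specifications, so this is a sketch only), the inverse directional decomposition may look as follows; the matching forward sketch is given later in this document.

```python
import numpy as np

def inverse_directional_decomposition(A, H, V, D):
    # Recombines the average (A), horizontal (H), vertical (V) and diagonal
    # (D) surfaces into a plane of twice the height and width (cf. step 513).
    out = np.empty((A.shape[0] * 2, A.shape[1] * 2))
    out[0::2, 0::2] = (A + H + V + D) / 4
    out[0::2, 1::2] = (A - H + V - D) / 4
    out[1::2, 0::2] = (A + H - V - D) / 4
    out[1::2, 1::2] = (A - H - V + D) / 4
    return out
```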
Stage 505 illustrates the several cycles involved in the reconstruction, using the output of the composition transform for each of the echelon component sets 501.
Stage 515 indicates the reconstructed image data output from the decoder 503 for the initial echelon. In the example, the reconstructed picture 515 has a resolution of 64×64. At 516, this reconstructed picture is up-sampled so as to increase its constituent number of pixels by a factor of four, thereby producing a predicted picture 517 having a resolution of 128×128. At stage 520, the predicted picture 517 is added to the decoded residuals 518 from the output of the decoder at echelon-2. The addition of these two 128×128-size images produces a 128×128-size reconstructed image, containing the smoothed image detail from the initial echelon enhanced by the higher-resolution detail of the residuals from echelon-2. This resulting reconstructed picture 519 may be output or displayed if the desired output resolution is that corresponding to echelon-2. In the present example, the reconstructed picture 519 is used for a further cycle.
At step 512, the reconstructed picture 519 is up-sampled in the same manner as at step 516, so as to produce a 256×256-size predicted picture 524. This is then combined at step 528 with the decoded echelon-1 output 526, thereby producing a 256×256-size reconstructed picture 527, which is an upgraded version of the prediction 519 enhanced with the higher-resolution details of the residuals 526. At 530, this process is repeated a final time, and the reconstructed picture 527 is upgraded to a resolution of 512×512, for combination with the echelon 0 residuals at stage 532. Thereby a 512×512 reconstructed picture 531 is obtained.
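The cycle structure of this walkthrough may be summarized in a short sketch; the nearest-neighbour up-sampler is again only a stand-in for the scheme's actual up-sampling filter.

```python
def pyramid_reconstruct(core_picture, residual_echelons):
    # The cycles of fig. 5: repeatedly up-sample the current reconstruction
    # and add the next echelon's residuals, e.g. 64x64 -> 128x128 -> 256x256
    # -> 512x512 for echelons -3, -2, -1 and 0.
    def upsample(x):                       # nearest-neighbour stand-in
        return x.repeat(2, axis=0).repeat(2, axis=1)

    picture = core_picture
    for residuals in residual_echelons:    # in increasing order of resolution
        picture = upsample(picture) + residuals
    return picture
```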
Another hierarchical encoding technique with which the principles of the present invention may be utilized is illustrated in figs. 6 and 7. This technique is a flexible, adaptable, highly efficient and computationally inexpensive coding format which combines a different video coding format, a base codec (e.g., AVC, HEVC, or any other present or future codec), with at least two enhancement levels of coded data.
A general structure of the encoding scheme uses a downsampled source signal encoded with a base codec, adds a first level of correction data to the decoded output of the base codec to generate a corrected picture, and then adds another level of enhancement data to an upsampled version of the corrected picture.
Thus, the streams are considered as a base stream and an enhancement stream. It is noted that it is generally contemplated that the base stream may be decoded by a hardware decoder, while the enhancement stream is contemplated to be suitable for a software processing implementation with suitable power consumption.
This structure creates multiple degrees of freedom, which allows great flexibility and adaptability to many scenarios, making the encoding format suitable for many use cases, including OTT transmission, live streaming, live UHD broadcasting, etc.
Although the decoded output of the base codec is not intended for viewing, it is fully decoded video at a lower resolution, making the output compatible with existing decoders and also usable as a lower resolution output if deemed appropriate.
In this example, and in other contemplated examples, each or both enhancement streams may be encapsulated into one or more enhancement bitstreams using a set of Network Abstraction Layer Units (NALUs). The NALUs are meant to encapsulate the enhancement bitstream in order to apply the enhancement to the correct base reconstructed frame. A NALU may, for example, contain a reference index to the NALU containing the base decoder reconstructed frame bitstream to which the enhancement has to be applied. In this way, the enhancement can be synchronised to the base stream, and the frames of each bitstream combined to produce the decoded output video (i.e., the residuals of each frame of enhancement level are combined with the frame of the base decoded stream). A group of pictures may represent multiple NALUs.
Returning to the process described above, where a base stream is provided along with two levels (or sub-levels) of enhancement within an enhancement stream, an example of a generalized encoding process is depicted in the block diagram of fig. 6. An input full-resolution video 600 is processed to generate various encoded streams 601, 602, 603. A first encoded stream (the encoded base stream) is produced by feeding a base codec (e.g., AVC, HEVC, or any other codec) with a down-sampled version of the input video. The encoded base stream may be referred to as the base layer or base level. A second encoded stream (the encoded level 1 stream) is produced by processing the residuals obtained by taking the difference between the reconstructed base codec video and the down-sampled version of the input video. A third encoded stream (the encoded level 2 stream) is produced by processing the residuals obtained by taking the difference between an up-sampled version of a corrected version of the reconstructed base coded video and the input video. In certain cases, the components of fig. 6 may provide a general low complexity encoder. In certain cases, the enhancement streams may be generated by encoding processes that form part of the low complexity encoder, and the low complexity encoder may be configured to control an independent base encoder and decoder (e.g., packaged as a base codec). In other cases, the base encoder and decoder may be supplied as part of the low complexity encoder. In one case, the low complexity encoder of fig. 6 may be seen as a form of wrapper for the base codec, where the functionality of the base codec may be hidden from an entity implementing the low complexity encoder.
A down-sampling operation illustrated by down-sampling component 605 may be applied to the input video to produce a down-sampled video to be encoded by a base encoder 613 of a base codec. The down-sampling can be done both in the vertical and horizontal directions, or alternatively only in the horizontal direction. The base encoder 613 and a base decoder 614 may be implemented by a base codec (e.g., as different functions of a common codec). The base codec, and/or one or more of the base encoder 613 and the base decoder 614, may comprise suitably configured electronic circuitry (e.g., a hardware encoder/decoder) and/or computer program code that is executed by a processor.
Each enhancement stream encoding process may not necessarily include an up-sampling step. In fig. 6, for example, the first enhancement stream is conceptually a correction stream, while the second enhancement stream is up-sampled to provide a level of enhancement.
Referring in more detail to the process of generating the enhancement stream, to generate the encoded level 1 stream, the encoded base stream is decoded by base decoder 614 (i.e., a decoding operation is applied to the encoded base stream to generate a decoded base stream). Decoding may be performed by a decoding function or mode of the base codec. The difference between the decoded base stream and the downsampled input video is then generated at the level 1 comparator 610 (i.e., a subtraction operation is applied to the downsampled input video and the decoded base stream to generate a first set of residues). The output of comparator 610 may be referred to as a first set of residuals, e.g., a surface or frame of residual data, where the residual values are determined for each picture element at the resolution of the output of base encoder 613, base decoder 614, and downsample block 605.
The difference is then encoded by a first encoder 615 (i.e., a level 1 encoder) to produce an encoded level 1 stream 602 (i.e., an encoding operation is applied to the first set of residues to produce a first enhancement stream).
As described above, the enhancement stream may include a first enhancement layer 602 and a second enhancement layer 603. The first enhancement level 602 may be considered a corrected stream, e.g., a stream that provides a level of correction to the base encoded/decoded video signal at a lower resolution than the input video 600. The second enhancement level 603 may be considered as converting the corrected stream to another enhancement level of the original input video 600, e.g., it applies an enhancement or correction level to the signal reconstructed from the corrected stream.
In the example of fig. 6, the second level of enhancement 603 is created by encoding a further set of residuals. The further set of residuals is generated by a level 2 comparator 619. The level 2 comparator 619 determines a difference between an up-sampled version of a decoded level 1 stream, e.g., the output of an up-sampling component 617, and the input video 600. The input to the up-sampling component 617 is generated by applying a first decoder (i.e., a level 1 decoder) to the output of the first encoder 615. This generates a set of decoded level 1 residuals. These are then combined with the output of the base decoder 614 at summing component 620. This effectively applies the level 1 residuals to the output of the base decoder 614. It allows for losses in the level 1 encoding and decoding process to be corrected by the level 2 residuals. The output of the summing component 620 may be seen as a simulated signal that represents the output of applying level 1 processing to the encoded base stream 601 and the encoded level 1 stream 602 at a decoder.
As mentioned, the upsampled stream is compared to the input video, which forms another set of residues (i.e., a difference operation is applied to the regenerated upsampled stream to generate another set of residues). The other set of residues is then encoded by a second encoder 621 (i.e., a 2-level encoder) into an encoded 2-level enhancement stream (i.e., an encoding operation is then applied to the other set of residues to generate another encoded enhancement stream).
Thus, as shown in fig. 6 and described above, the output of the encoding process is a base stream 601 and one or more enhancement streams 602, 603, which preferably include a first enhancement level and another enhancement level. The three streams 601, 602, and 603 may be combined, with or without additional information such as a control header, to generate a combined stream that is used to represent the video coding architecture of the input video 600. It should be noted that the components shown in fig. 6 may operate on blocks or coding units of data, e.g., corresponding to 2 x 2 or 4 x 4 portions of a frame at a particular resolution level. The components operate without any inter-block dependencies, so they may be applied in parallel to multiple blocks or coding units within a frame. This is different from comparative video coding schemes in which there are dependencies (e.g., spatial or temporal dependencies) between blocks. The dependencies of the compared video coding schemes limit the level of parallelism and require much higher complexity.
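As an illustration of this block independence (and not of any normative threading model), the coding units of a frame can be fanned out to a thread pool; the helper and block size below are illustrative only.

```python
from concurrent.futures import ThreadPoolExecutor

def process_coding_units(plane, process_block, size=4):
    # Because the level 1 and level 2 operations have no inter-block
    # dependencies, each size x size coding unit of a frame can be
    # processed independently, here distributed across a thread pool.
    h, w = plane.shape
    blocks = [plane[y:y + size, x:x + size]
              for y in range(0, h, size)
              for x in range(0, w, size)]
    with ThreadPoolExecutor() as pool:
        return list(pool.map(process_block, blocks))
```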
A corresponding generalized decoding process is depicted in the block diagram of fig. 7. Fig. 7 may be said to show a low complexity decoder that corresponds to the low complexity encoder of fig. 6. The low complexity decoder receives the three streams 601, 602, 603 generated by the low complexity encoder, together with headers 704 containing further decoding information. The encoded base stream 601 is decoded by a base decoder 710 corresponding to the base codec used in the low complexity encoder. The encoded level 1 stream 602 is received by a first decoder 711 (i.e., a level 1 decoder), which decodes a first set of residuals as encoded by the first encoder 615 of fig. 6. At a first summing component 712, the output of the base decoder 710 is combined with the decoded residuals obtained from the first decoder 711. The combined video, which may be said to be a level 1 reconstructed video signal, is up-sampled by up-sampling component 713. The encoded level 2 stream 603 is received by a second decoder 714 (i.e., a level 2 decoder). The second decoder 714 decodes a second set of residuals as encoded by the second encoder 621 of fig. 6. Although the headers 704 are shown in fig. 7 as being used by the second decoder 714, they may also be used by the first decoder 711 as well as the base decoder 710. The output of the second decoder 714 is a second set of decoded residuals. These may be at a higher resolution than the first set of residuals and the input to the up-sampling component 713. At a second summing component 715, the second set of residuals from the second decoder 714 is combined with the output of the up-sampling component 713, i.e., an up-sampled reconstructed level 1 signal, to reconstruct the decoded video 750.
As with the low complexity encoder, the low complexity decoder of fig. 7 may operate in parallel on different blocks or coding units of a given frame of the video signal. In addition, decoding by two or more of the base decoder 710, the first decoder 711, and the second decoder 714 may be performed in parallel. This is possible because there are no inter-block dependencies.
In the decoding process, the decoder may parse the headers 704 (which may contain global configuration information, picture or frame configuration information, and data block configuration information) and configure the low complexity decoder based on those headers. To recreate the input video, the low complexity decoder may decode each of the base stream, the first enhancement stream, and the further or second enhancement stream. The frames of the streams may be synchronized and then combined to derive the decoded video 750. Depending on the configuration of the low complexity encoder and decoder, the decoded video 750 may be a lossy or lossless reconstruction of the original input video 600. In many cases, the decoded video 750 may be a lossy reconstruction of the original input video 600, where the losses have a reduced or minimal effect on the perception of the decoded video 750.
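A matching sketch of the fig. 7 decoder, reusing the upsample helper from the encoder sketch above; base_decode, dec_l1 and dec_l2 are again hypothetical stand-ins for the base codec and the level 1/level 2 residual decoders:

```python
def decode_two_level(base_stream, l1_stream, l2_stream,
                     base_decode, dec_l1, dec_l2):
    """Mirror of fig. 7: reconstructs the decoded video 750."""
    base_rec = base_decode(base_stream)    # base decoder 710
    level1 = base_rec + dec_l1(l1_stream)  # first summing component 712
    upsampled = upsample(level1)           # upsampling component 713
    return upsampled + dec_l2(l2_stream)   # second summing component 715
```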
In each of figs. 6 and 7, the level 2 and level 1 encoding operations may include the steps of transformation, quantization, and entropy encoding (e.g., in that order). They may also include residual scaling, weighting, and filtering. Similarly, at the decoding stage, the residuals may be passed through an entropy decoder, a dequantizer, and an inverse transform module (e.g., in that order). Any suitable encoding and corresponding decoding operations may be used. Preferably, however, the level 2 and level 1 encoding steps may be performed in software (e.g., as executed by one or more central or graphics processing units in the encoding device).
The transform as described herein may be a directional decomposition transform, such as a Hadamard-based transform. Both may comprise a small kernel or matrix that is applied to flattened coding units of residuals (i.e., 2 x 2 or 4 x 4 blocks of residuals). Further details regarding the transform can be found, for example, in patent applications PCT/EP2013/059847 or PCT/GB2017/052632, which are incorporated herein by reference. The encoder may select between the different transforms to be used, for example between the sizes of kernel to be applied.
The transform may transform the residual information into four surfaces. For example, the transform may produce the following components: average, vertical, horizontal, and diagonal. As mentioned earlier in this disclosure, these components output by the transform may be taken in such embodiments as the coefficients to be quantized in accordance with the described methods.
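As an illustration, a 2 x 2 Hadamard-style directional decomposition might look as follows; the exact matrix and the ordering of the average (A), horizontal (H), vertical (V) and diagonal (D) rows are assumptions for the sketch, not a normative definition:

```python
import numpy as np

# 4 x 4 Hadamard-style matrix applied to a flattened 2 x 2 residual block
# [r00, r01, r10, r11].
H4 = np.array([[1,  1,  1,  1],    # average
               [1, -1,  1, -1],    # horizontal
               [1,  1, -1, -1],    # vertical
               [1, -1, -1,  1]])   # diagonal

def dd_transform(block_2x2):
    coeffs = H4 @ block_2x2.reshape(4)
    return dict(zip("AHVD", coeffs))

def dd_inverse(coeffs):
    vec = np.array([coeffs[k] for k in "AHVD"])
    # H4 is orthogonal up to a factor of 4, so the inverse is H4.T / 4.
    return (H4.T @ vec / 4).reshape(2, 2)
```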
A quantization scheme may be used to convert the residual signals into quanta, such that certain variables can assume only certain discrete magnitudes.
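A minimal uniform quantizer illustrating this idea (step_width is a hypothetical tuning parameter; practical schemes may add a dead zone around 0):

```python
def quantize(value, step_width):
    # Map a transformed coefficient onto a discrete quantum.
    return int(round(value / step_width))

def dequantize(quantum, step_width):
    # Reconstructed values can only assume multiples of the step width.
    return quantum * step_width
```

For example, quantize(13.7, 5) yields 3, which dequantizes to 15.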
Entropy encoding in this example may include run-length encoding (RLE), followed by processing the encoded output with a Huffman encoder. In certain cases, only one of these schemes may be used when entropy encoding is desired.
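A sketch of both stages, under the assumption that the input is a flat list of quantized values; a real entropy coder would emit an actual bitstream rather than code lengths:

```python
from collections import Counter
import heapq

def run_length_encode(values):
    # Collapse runs of identical values; zeros dominate quantized residuals.
    if not values:
        return []
    runs, prev, count = [], values[0], 1
    for v in values[1:]:
        if v == prev:
            count += 1
        else:
            runs.append((prev, count))
            prev, count = v, 1
    runs.append((prev, count))
    return runs

def huffman_code_lengths(symbols):
    # Derive Huffman code lengths for the RLE output symbols.
    freq = Counter(symbols)
    if len(freq) == 1:
        return {next(iter(freq)): 1}
    heap = [(w, i, [s]) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    lengths = {s: 0 for s in freq}
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:
            lengths[s] += 1
        heapq.heappush(heap, (w1 + w2, next_id, s1 + s2))
        next_id += 1
    return lengths
```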
In summary, the methods and apparatus herein are based on an overall approach that is built over an existing encoding and/or decoding algorithm (e.g., MPEG standards such as AVC/H.264 and HEVC/H.265, as well as non-standard algorithms such as VP9 and AV1), which works as a baseline for an enhancement layer that operates according to a different encoding and/or decoding approach. An example idea behind the overall approach is to encode/decode the video frames hierarchically, as opposed to using the block-based approaches of the MPEG family of algorithms. Encoding a frame hierarchically includes generating residuals for the full frame, then generating residuals for a decimated frame, and so on.
The video compression residual data for a full-sized video frame may be referred to as LoQ-2 (e.g., 1920 x 1080 for an HD video frame, or higher for UHD frames), while the video compression residual data for a decimated frame may be referred to as LoQ-x, where x denotes the number corresponding to the hierarchical decimation. In the depicted examples of figs. 1 and 2, the variable x may take the values 1 and 2, representing the first and second enhancement streams. Hence, there are two hierarchical levels for which compressed residuals will be generated. Other naming schemes for the hierarchy may also be applied without any change in functionality (e.g., the level 1 and level 2 enhancement streams described herein may alternatively be referred to as LoQ-1 and LoQ-2 streams, representing a countdown from the highest resolution).
As noted above, the processes may be applied in parallel to coding units or blocks of the color components of a frame, as there are no inter-block dependencies. The encoding of each color component within a set of color components may also be performed in parallel (e.g., such that the operations are duplicated according to (number of frames) x (number of color components) x (number of coding units per frame)). It should also be noted that different color components may have a different number of coding units per frame, e.g., a luma (e.g., Y) component may be processed at a higher resolution than a set of chroma (e.g., U or V) components, as a change in luminance is more detectable by human vision than a change in color.
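A sketch of this parallelism, assuming a numpy color plane whose dimensions are multiples of the block size; encode_block is a hypothetical per-coding-unit encoder:

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_blocks(plane, size=4):
    # Split a color plane into independent size x size coding units.
    h, w = plane.shape
    return [plane[y:y + size, x:x + size]
            for y in range(0, h, size)
            for x in range(0, w, size)]

def encode_plane(plane, encode_block):
    # With no inter-block dependencies, coding units can be encoded in any
    # order, here concurrently across a thread pool.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(encode_block, split_into_blocks(plane)))
```

The same pattern extends across color components and frames, since those are independent as well.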
Thus, as shown and described above, the output of the decoding process is an (optional) base reconstruction, and an original signal reconstruction at a higher level. This example is particularly well-suited to creating encoded and decoded video at different frame resolutions. For example, the input signal 30 may be an HD video signal comprising frames at 1920 x 1080 resolution. In certain cases, both the base reconstruction and the level 2 reconstruction may be used by a display device. For example, in cases of network traffic congestion, the level 2 stream may be disrupted more than the level 1 and base streams (as it may contain up to 4x the amount of data, where downsampling reduces the dimensionality in each direction by 2). In this case, when traffic occurs, the display device may revert to displaying the base reconstruction while the level 2 stream is disrupted (e.g., while a level 2 reconstruction is unavailable), and then return to displaying the level 2 reconstruction when network conditions improve. A similar approach may be applied when a decoding device suffers from resource constraints, e.g., a set-top box performing a system update may have the base decoder 220 operational to output the base reconstruction, but may not have the processing capacity to compute the level 2 reconstruction.
The encoding arrangement also enables a video distributor to distribute video to a heterogeneous set of devices; devices having only the base decoder 720 may view the base reconstruction, while those with the enhancement level may view the higher-quality level 2 reconstruction. In comparative cases, two full video streams at separate resolutions were required to serve both sets of devices. As the level 2 and level 1 enhancement streams encode residual data, they may be encoded more efficiently, e.g., the distribution of residual data is typically massed around 0 in magnitude (i.e., no difference) and typically takes on a small range of values around 0. This may especially be the case following quantization. In contrast, full video streams at different resolutions will have different distributions with non-zero means or medians that require a higher bit rate for transmission to the decoder.
In the examples described herein, residuals are encoded by an encoding pipeline. This may include transformation, quantization, and entropy encoding operations. It may also include residual scaling, weighting, and filtering. The residuals are then transmitted to a decoder, e.g., as L-1 and L-2 enhancement streams, which may be combined with the base stream as a hybrid stream (or transmitted separately). In one case, a bit rate is set for a hybrid data stream comprising the base stream and the two enhancement streams, and then different adaptive bit rates are applied to the individual streams based on the data being processed to meet the set bit rate (e.g., perceptually high-quality video with a low level of artifacts may be constructed by adaptively assigning bit rates to the different individual streams, even at a frame-by-frame level, such that the constrained data is used by the individual streams with the greatest perceptual impact, which may change as the image data changes).
A set of residual values as described herein may be treated as sparse data, e.g., in many cases there is no difference for a given pixel or region, and the resulting residual value is zero. When looking at the distribution of the residuals, much of the probability mass is allocated to small residual values located near zero, e.g., for certain videos, values of -2, -1, 0, 1, 2, etc., occur most frequently. In certain cases, the distribution of residual values is symmetric or approximately symmetric about 0. For certain test video cases, the distribution of residual values was found to take a shape similar to a logarithmic or exponential distribution (e.g., symmetrically or approximately symmetrically) about 0. The exact distribution of the residual values may depend on the content of the input video stream.
The residuals may themselves be treated as a two-dimensional image, e.g., a delta image of differences. Seen in this way, the sparsity of the data may be seen to relate to features like "dots", small "lines", "edges", "corners", etc. that are visible in the residual images. It has been found that these features are typically not fully correlated (e.g., in space and/or in time). They have characteristics that differ from the characteristics of the image data they are derived from (e.g., the pixel characteristics of the original video signal).
As the characteristics of the residuals differ from the characteristics of the image data from which they originate, it is generally not possible to apply standard encoding approaches, such as those found in conventional Moving Picture Experts Group (MPEG) encoding and decoding standards. For example, many comparative schemes use large transforms (e.g., transforms of large areas of pixels in a normal video frame). Using these relatively large transforms on residual images would be extremely inefficient due to the characteristics of the residuals, e.g., as described above. For example, it would be very hard to encode a small dot in a residual image using a large block designed for areas of a normal image.
Certain examples described herein address these issues by instead using smaller and simpler transform kernels (e.g., the 2 x 2 or 4 x 4 kernels presented herein, namely the directional decomposition and the directional decomposition squared). The transforms described herein may be applied using Hadamard matrices (e.g., a 4 x 4 matrix for a flattened 2 x 2 coding block, or a 16 x 16 matrix for a flattened 4 x 4 coding block). This moves in a different direction from comparative video coding approaches. Applying these new approaches to blocks of residuals generates compression efficiency. For example, certain transforms generate uncorrelated coefficients (e.g., in space) that may be efficiently compressed. While correlations between coefficients may be exploited, e.g., for lines in residual images, these can lead to coding complexity that is difficult to implement on legacy and low-resource devices, and they often generate other complex artifacts that need to be corrected. Pre-processing the residuals by setting certain residual values to 0 (i.e., not forwarding these for processing) may provide a controllable and flexible way to manage bit rates and stream bandwidths, as well as resource use.
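A sketch of such pre-processing, where threshold is a hypothetical tuning parameter trading fidelity for bit rate:

```python
import numpy as np

def preprocess_residuals(residuals, threshold):
    # Zero residuals below the magnitude threshold so they are not forwarded
    # for transform and quantization, increasing sparsity.
    out = residuals.copy()
    out[np.abs(out) < threshold] = 0
    return out
```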
Exchanging information
As mentioned above, the present invention considers the principle of exchanging useful information between two (or more) encoders encoding the same content (or portions thereof), in the form of a joint module, or of a module implemented in one of the encoders but producing information that may be useful to the other. For example, the encoders may be the base encoder and the enhancement-level encoders of the architecture of fig. 6, or the encoders for each of the steps or step indices (or levels of quality) of fig. 3.
It will of course be appreciated that there are a variety of possible mechanisms to exchange information between multiple encoders or decoders.
In an example, information may be passed in the form of a data structure, using a common API, or as a pointer to a memory location in a shared memory space where the information is stored. Information can also be exchanged between two encoder or decoder modules via Supplemental Enhancement Information (SEI) messages. Another mechanism may be to carry the information in the bitstream as metadata, or to pass it via an API.
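As an illustration of the shared-data-structure option, the following sketch uses an in-process queue to stand in for shared memory, SEI messages or an API channel; the message fields are illustrative, not part of any standard:

```python
from dataclasses import dataclass, field
from queue import Queue

@dataclass
class ExchangeMessage:
    granularity: str                 # e.g. "gop", "picture", "block", "pixel"
    payload: dict = field(default_factory=dict)

channel: Queue = Queue()             # stand-in for the shared memory space

def send_info(granularity, **payload):
    channel.put(ExchangeMessage(granularity, payload))

def poll_info():
    return None if channel.empty() else channel.get()
```

For example, an enhancement encoder might call send_info("picture", correction_bits=90000) and the base encoder might poll the channel once per frame.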
Depending on how the information exchange (e.g., via the modules above) is implemented, on the level of integration of the encoders, and on the allowed latency or the storage and memory bandwidth, this information may be communicated per group of pictures, per picture, per group of pixels, or per pixel. Similarly, in the example of fig. 3, information may be communicated per step, per step index, per plane, or per image.
Based on the information exchanged between the encoders or decoders, each encoder or decoder may adapt its encoding parameters to improve the encoding operation. With such an exchange of information, the adaptations may be coordinated to improve overall efficiency or quality, or to provide a balance across the levels or layers of the overall hierarchy. It can be seen that each level of the hierarchy provides specific benefits, and thus, by balancing the parameters at each level, an overall improvement can be made.
In an example implementation using the principles of fig. 6, the base encoder works first and sends information to the enhancement encoder so that the enhancement encoder can adapt to it, although the same applies to the ladder of fig. 3. Similarly, the enhancement encoder may send feedback information to the base encoder so that it can adapt its parameters for the next frame. For example, the enhancement encoder may signal that, although the quantization level on the correction layer is set very high, it is spending too many bits on correction. In response, the base encoder may decide to lower its quantization level, or to apply less aggressive perceptual optimizations.
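A hypothetical version of this feedback rule, with all names and thresholds chosen purely for illustration:

```python
def adapt_base_quantization(base_qp, correction_bits, correction_budget,
                            qp_step=1, qp_min=0):
    # If the enhancement encoder signals that the correction layer is
    # overspending its budget, lower the base quantization level so the
    # base layer leaves fewer errors to correct on the next frame.
    if correction_bits > correction_budget:
        return max(qp_min, base_qp - qp_step)
    return base_qp
```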
In another example, where the base encoder operates on the input video and the enhancement layer takes the output of the base encoder and operates on that output (in combination with the input video, e.g., by upsampling or reconstructing the video and generating sets of residuals), the base encoder may include a message in the stream that the enhancement layer picks up, or may send a separate message to the enhancement layer, for the enhancement layer to act upon (i.e., the message comprises the information or a pointer to that information). The message may also contain information allowing the enhancement layer to associate the exchanged information with a stream/frame or the like, depending on the granularity of the information.
Examples of information that may be exchanged between multiple encoding or decoding operations are provided below.
In a first example, the information exchanged may be input by a user, e.g., information regarding the type of content. For example, if the encoded data represents real-time sports content, a large amount of temporal activity may be expected, and it may be determined to provide more bits to the base encoder, as lower-frequency artifacts will be more apparent. In another example, if the content contains graphics, high frequencies become more important and the enhancement layer may be given more bits. Such coordination between the layers provides improved quality in the reconstruction.
In certain instances, a user may insert the information to be exchanged via a user interface. This information may be received by an overall controller, which sends it to both the base and enhancement encoders/decoders simultaneously, or the entire encoder may be controlled by the base encoder, in which case the information is passed from the base encoder to the enhancement layer.
In another example, where an encoder comprises multiple enhancement layers, such as correction and enhancement layers, if the content is temporally demanding and it can be determined that the base layer may struggle, it may be decided to place more bits in the correction layer, especially as high-frequency details are less noticeable in fast-moving content. Otherwise, if there is a large amount of graphics, it may be decided to allocate more bits to the enhancement layer to preserve more high-frequency detail.
Information collected from a host device may also be exchanged or passed between the encoders. Examples include information from device cameras, motion sensors, temperature sensors, and the like. This information may be well suited to configuring both the base and enhancement encoders. The processing of this information may be implemented as a separate module that then passes the information to the encoders, or it may already be implemented in one of the encoders, in which case the hosting encoder may pass the information to the enhancement encoder.
A content analysis module may also be used to provide the information to be exchanged. Various pre-processing modules may be present to analyze the picture to be encoded, either as separate modules or as an integral part of one of the encoders. Examples include: perceptual information describing the regions of an image where coding artifacts are less noticeable given the Human Visual System (HVS); motion information, including the temporal activity of the frame, per-group-of-pixels (block) information such as motion estimation in the form of motion vectors, or per-pixel information (e.g., optical flow); and any other information, in any other form, regarding the estimated complexity of the frame to be encoded, the entropy of the frame, an estimate of the number of bits required, and the like.
In another example, the information exchanged between the encoders may include coding decisions taken by any of the encoders. The encoders may pass each other any information regarding the decisions they make during the encoding process. This may include the type of prediction used for a frame (e.g., I, P or B frames in H.264, i.e., a standalone frame, a frame predicted from a previously displayed frame, or a frame predicted from both previous and subsequent frames). The decisions may also include the quantization levels chosen for the frame, or decisions made per group of pixels (motion vectors, quantization steps used, mode decisions, etc.). This information may be highly beneficial to the receiving encoder, as it may be used, for example, to decide how many bits to use for correcting errors of the base encoder and how many to use for enhancement, what type of upsampler to use, and/or how to spatially distribute the bit budget.
In a final example, the information exchanged between the encoders may include statistics regarding the encoded content. An encoder may exchange any information collected regarding a group of pictures, a picture, a group of pixels, or a pixel that it has encoded. For example, how many bits are spent per picture in a particular group of pictures may aid the decisions of a rate control module of the encoder that receives the information. As another example of statistics, a target video quality metric computed in one of the encoders may help the other encoders with bit budget decisions and with estimating the final picture quality. Any statistics on the modes used per group of pixels (intra- or inter-coded) may be used to assist the receiving encoder in its prediction mode decisions.
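A sketch of what such per-picture statistics might look like, and how they could be aggregated for a group of pictures; the field names are assumptions, not a defined interface:

```python
from dataclasses import dataclass

@dataclass
class PictureStats:
    bits_spent: int
    quality_metric: float       # e.g. a target video quality score
    intra_block_ratio: float    # fraction of pixel groups coded intra

def summarize_group_of_pictures(stats):
    # Aggregated statistics are cheap to transfer and can feed the peer
    # encoder's rate control and mode decisions.
    n = len(stats)
    return {
        "avg_bits": sum(s.bits_spent for s in stats) / n,
        "avg_quality": sum(s.quality_metric for s in stats) / n,
        "avg_intra_ratio": sum(s.intra_block_ratio for s in stats) / n,
    }
```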
As noted above, the information exchanged between the encoders may be exchanged at different levels of granularity, for example: information regarding the entire video; information per group of pictures, i.e., the segments of the video to be encoded; information regarding a picture to be encoded; information regarding a group of pixels to be encoded (a block of pixels, a tile); and/or per-pixel information.
The granularity selected may provide particular benefits, e.g., the possibility of exchanging useful information that helps give more bits to particular portions of an image. Where the information exchanged comes from a device camera, it may represent focus and exposure information, indicating which parts of the image contain detail. Modern cameras also have depth-of-field information, which may be used to identify the areas of the image that carry detail. In addition, algorithms that may be implemented in the camera or the base encoder may provide useful information to be exchanged. For example, face detection, region-of-interest extraction, and background extraction can all be used to understand which details are important and which are not, and in this way can help in deciding where to place more bits. Thus, the encoders (and their parameters) may adapt based on the algorithms used and the information passed between the levels of the hierarchy. Coordination between the levels may be provided by the shared information; the algorithms themselves are not shared, but the adaptations are based on the information they provide.
Different types of information will be able to be exchanged depending on how the encoders have been implemented (dedicated hardware, software on a CPU, a GPU) and on the available resources (memory management, processing speed, etc.). For example, if shared memory is limited and one encoder is implemented in hardware while the other is implemented in software, per-pixel or even per-group-of-pixels information may not be exchangeable, but per-picture statistics may be collected, which would be easier to transfer from one encoder to the other. Sending statistics instead of the raw per-pixel information is one way to address these challenges. While the accuracy may be lower, it may be sufficient in many cases.
The methods and processes described herein may be embodied as code (e.g., software code) and/or data at both an encoder and a decoder, e.g., as implemented in a streaming media server or in a client device that decodes from a data store. The encoder and decoder may be implemented in hardware or software as is well known in the art of data compression. For example, hardware acceleration using a specially programmed Graphics Processing Unit (GPU) or a specially designed Field Programmable Gate Array (FPGA) may provide certain efficiencies. For completeness, such code and data can be stored on one or more computer-readable media, which may include any device or medium that can store code and/or data for use by a computer system. When a computer system reads and executes the code and/or data stored on a computer-readable medium, the computer system performs the methods and processes embodied as data structures and code stored within the computer-readable storage medium. In certain embodiments, one or more of the steps of the methods and processes described herein may be performed by a processor (e.g., a processor of a computer system or of a data storage system).
In general, any of the functionality described in this text or illustrated in the figures can be implemented using software, firmware (e.g., fixed logic circuitry), programmable or non-programmable hardware, or a combination of these implementations. The terms "component" or "function" as used herein generally refer to software, firmware, hardware, or a combination of these. For instance, in the case of a software implementation, the terms "component" or "function" may refer to program code that performs specified tasks when executed on one or more processing devices. The illustrated separation of components and functions into distinct units may reflect any actual or conceptual physical grouping and allocation of such software and/or hardware and tasks.

Claims (32)

1. A method of encoding a signal, the method comprising:
receiving an input signal;
applying a first encoding operation to the input signal using a first codec to produce a first encoded stream; and
applying a second encoding operation to the input signal to produce a second encoded stream, wherein the first and second encoded streams are for combining at a decoder;
and wherein the method further comprises exchanging information between the first encoding operation and the second encoding operation.
2. The method of claim 1, further comprising adapting the first or second encoding operations, or both, based on the information.
3. The method of claim 1 or 2, wherein the first and second encoding operations are encoding operations of a hierarchical encoding scheme.
4. The method of claim 3, wherein the hierarchical encoding scheme comprises:
generating a base encoded signal by feeding a downsampled version of an input signal to an encoder;
generating a first residual signal by:
obtaining a decoded version of the base encoded signal; and
generating the first residual signal using a difference between the decoded version of the base encoded signal and the downsampled version of the input signal; and
encoding the first residual signal to generate a first encoded residual signal.
5. The method of claim 4, further comprising:
generating a second residual signal by:
decoding the first encoded residual signal to generate a first decoded residual signal;
correcting the decoded version of the base encoded signal using the first decoded residual signal to produce a corrected decoded version;
upsampling the corrected decoded version; and
generating the second residual signal using a difference between the upsampled version and the input signal;
wherein the method further comprises:
encoding the second residual signal to generate a second encoded residual signal,
wherein the base encoded signal, the first encoded residual signal and the second encoded residual signal comprise an encoding of the input signal.
6. The method of claim 3, wherein the hierarchical encoding scheme comprises:
generating a base encoded signal by feeding a downsampled version of an input signal to an encoder, said downsampled version having undergone one or more downsampling operations;
generating one or more residual signals by:
upsampling an output of each downsampling operation to produce one or more upsampled signals; and
generating the one or more residual signals using a difference between each upsampled signal and an input to a respective downsampling operation; and
encoding the one or more residual signals to generate one or more encoded residual signals.
7. The method of claim 3, wherein the hierarchical encoding scheme comprises:
generating a base encoded signal by feeding a downsampled version of an input signal to an encoder, said downsampled version having undergone a plurality of sequential downsampling operations; and
generating an encoded first residual signal by:
upsampling the downsampled version of the input signal;
generating the first residual signal using a difference between an upsampled version of the downsampled version and an input to a last downsampling operation of the plurality of sequential downsampling operations; and
encoding the first residual signal;
generating a second residual signal by:
upsampling a sum of the first residual signal and an output of the downsampling operation preceding the last downsampling operation of the plurality of sequential downsampling operations;
generating the second residual signal using a difference between the upsampled sum and the input to the preceding downsampling operation; and
encoding the second residual signal.
8. The method of any preceding claim, wherein the step of exchanging information comprises sending information with metadata in a stream.
9. The method according to any of claims 1 to 7, wherein said step of exchanging information comprises embedding information in a stream.
10. The method of any of claims 1-7, wherein the step of exchanging information comprises sending information using an Application Programming Interface (API).
11. The method of any of claims 1-7, wherein the step of exchanging information comprises sending pointers to a shared memory space.
12. The method of any of claims 1-7, wherein the step of exchanging information comprises exchanging information using Supplemental Enhancement Information (SEI).
13. The method of any preceding claim, wherein the information exchanged comprises encoding parameters for modifying the encoding operation.
14. The method of any of claims 1-12, wherein the information exchanged comprises one or more selected from the group comprising:
inputting information by a user;
metadata describing the content of the input video;
host device information;
content analysis information;
perceptual information describing a region of an image, wherein coding artifacts are less noticeable considering the Human Visual System (HVS);
motion information;
the complexity of the frame to be encoded;
frame entropy;
the estimated required number of bits;
frame information;
decisions taken during encoding or decoding;
a type of prediction for a frame to be decoded/encoded;
quantization levels for frames to be decoded/encoded;
decisions made per group of pixels;
statistical data information;
a target video quality metric; and
rate control information.
15. A method according to any preceding claim, wherein the signal is video and wherein the step of exchanging information is performed in accordance with one or more of per video, per group of pictures, per slice, per picture, per group of pixels and per pixel.
16. A method of decoding a signal, the method comprising:
receiving a first encoded signal and a second encoded signal;
applying a first decoding operation to the first encoded signal to generate a first output signal;
applying a second decoding operation to the second encoded signal to generate a second output signal; and
combining the first output signal and the second output signal to reconstruct an input signal,
wherein the method further comprises exchanging information between the first decoding operation and the second decoding operation.
17. The method of claim 16, further comprising adapting the first or second decoding operations, or both, based on the information.
18. The method of claim 16 or 17, wherein the first and second decoding operations are decoding operations of a hierarchical encoding scheme.
19. The method of claim 18, wherein the hierarchical encoding scheme comprises:
receiving a base encoded signal and instructing a decoding of the base encoded signal to produce a base decoded signal;
receiving a first encoded residual signal and decoding the first encoded residual signal to generate a first decoded residual signal;
correcting the base decoded signal using the first decoded residual signal to generate a corrected version of the base decoded signal;
upsampling the corrected version of the base decoded signal to produce an upsampled signal;
receiving a second encoded residual signal and decoding the second encoded residual signal to generate a second decoded residual signal; and
combining the upsampled signal and the second decoded residual signal to generate a reconstructed version of the input signal.
20. The method of claim 18, wherein the first and second encoded signals comprise first and second sets of components, respectively, the first set of components corresponding to a lower image resolution than the second set of components, the method comprising:
for each of the first and second component sets:
decoding the set of components to obtain a decoded set,
the method further comprises:
upscaling a decoded first set of components so as to increase a corresponding image resolution of the decoded first set of components to be equal to a corresponding image resolution of a decoded second set of components, and
combining the decoded first and second component sets together to produce a reconstructed set.
21. The method of claim 20, wherein the method further comprises receiving one or more additional sets of components, wherein each of the one or more additional sets of components corresponds to a higher image resolution than the second set of components, and wherein each of the one or more additional sets of components corresponds to an increasingly higher image resolution, the method comprising, for each of the one or more additional sets of components, decoding the set of components to obtain a decoded set, the method further comprising, for each of the one or more additional sets of components, in increasing order of corresponding image resolutions:
upscaling the reconstructed set having the highest corresponding image resolution to increase the corresponding image resolution of the reconstructed set to equal a corresponding image resolution of another component set, and
combining the reconstructed set and the other component set together to generate another reconstructed set.
22. The method according to any of claims 16 to 21, wherein the step of exchanging information comprises sending information with metadata in a stream.
23. The method of any of claims 16 to 21, wherein the step of exchanging information comprises embedding information in a stream.
24. The method of any of claims 16 to 21, wherein the step of exchanging information comprises sending information using an API.
25. The method of any of claims 16 to 21, wherein the step of exchanging information comprises sending pointers to a shared memory space.
26. The method of any of claims 16-21, wherein the step of exchanging information comprises exchanging information using Supplemental Enhancement Information (SEI).
27. The method of any of claims 16-26, wherein the information exchanged comprises decoding parameters for modifying the decoding operation.
28. The method of any of claims 16-27, wherein the information exchanged comprises one or more selected from the group comprising:
inputting information by a user;
metadata describing the content of the input video;
host device information;
content analysis information;
perceptual information describing a region of an image, wherein coding artifacts are less noticeable considering the Human Visual System (HVS);
motion information;
the complexity of the frame to be encoded;
frame entropy;
the estimated required number of bits;
frame information;
decisions taken during encoding or decoding;
a type of prediction for a frame to be decoded/encoded;
quantization levels for frames to be decoded/encoded;
decisions made per group of pixels;
statistical data information;
a target video quality metric; and
rate control information.
29. The method according to any one of claims 16 to 21, wherein the signal is a video and wherein the step of exchanging information is performed in accordance with one or more of per video, per group of pictures, per slice, per picture, per group of pixels and per pixel.
30. An encoding device configured to perform the method of any one of claims 1 to 15.
31. A decoding device configured to perform the method of any of claims 16 to 29.
32. A computer-readable medium comprising instructions that, when executed by a processor, cause the processor to carry out the method according to any one of claims 1 to 29.
CN202080028272.6A 2019-04-16 2020-04-16 Exchanging information in scalable video coding Pending CN113994685A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
GBGB1905400.6A GB201905400D0 (en) 2019-04-16 2019-04-16 Video coding technology
GB1905400.6 2019-04-16
PCT/GB2020/050961 WO2020212701A1 (en) 2019-04-16 2020-04-16 Exchanging information in hierarchical video coding

Publications (1)

Publication Number Publication Date
CN113994685A true CN113994685A (en) 2022-01-28

Family

ID=66809807

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202080028272.6A Pending CN113994685A (en) 2019-04-16 2020-04-16 Exchanging information in scalable video coding

Country Status (4)

Country Link
US (1) US20220182654A1 (en)
CN (1) CN113994685A (en)
GB (2) GB201905400D0 (en)
WO (1) WO2020212701A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024016106A1 (en) * 2022-07-18 2024-01-25 Intel Corporation Low-complexity enhancement video coding using multiple reference frames

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2599341A (en) * 2020-07-28 2022-04-06 V Nova Int Ltd Management system for multilayer encoders and decoders and method thereof
GB2601362B (en) * 2020-11-27 2023-08-16 V Nova Int Ltd Video encoding using pre-processing

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102185765B1 * 2010-08-11 2020-12-03 GE Video Compression, LLC Multi-view signal codec
CN104620583A 2012-05-14 2015-05-13 Luca Rossato Encoding and reconstruction of residual data based on support information
TWI649999B * 2012-07-09 2019-02-01 Vid Scale, Inc. Video coding method and video coding system
US11284133B2 (en) * 2012-07-10 2022-03-22 Avago Technologies International Sales Pte. Limited Real-time video coding system of multiple temporally scaled video and of multiple profile and standards based on shared video coding information
PL2987325T3 (en) 2013-04-15 2019-04-30 V Nova Int Ltd Hybrid backward-compatible signal encoding and decoding
US9992502B2 (en) * 2016-01-29 2018-06-05 Gopro, Inc. Apparatus and methods for video compression using multi-resolution scalable coding
KR102287703B1 * 2016-02-16 2021-08-10 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Efficient adaptive streaming
EP3461131A4 (en) * 2016-05-20 2019-03-27 Panasonic Intellectual Property Corporation of America Coding device, decoding device, coding method and decoding method
GB2553556B (en) 2016-09-08 2022-06-29 V Nova Int Ltd Data processing apparatuses, methods, computer programs and computer-readable media
KR102492286B1 * 2016-11-24 2023-01-27 Electronics and Telecommunications Research Institute Method and apparatus for processing scalable video
CN117811585A 2017-12-06 2024-04-02 V-Nova International Ltd Byte stream, method and apparatus for hierarchical encoding and decoding thereof, and medium
CA3133774A1 (en) * 2019-03-20 2020-09-24 V-Nova International Ltd Processing of residuals in video coding


Also Published As

Publication number Publication date
GB2600025B (en) 2024-03-20
GB201905400D0 (en) 2019-05-29
US20220182654A1 (en) 2022-06-09
GB2600025A (en) 2022-04-20
GB202116455D0 (en) 2021-12-29
WO2020212701A1 (en) 2020-10-22


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination