GB2624686A - Improvements to audio coding - Google Patents


Info

Publication number: GB2624686A
Authority: GB (United Kingdom)
Prior art keywords: data, audio, blocks, block, sample
Legal status: Pending (the legal status is an assumption, not a legal conclusion; Google has not performed a legal analysis)
Application number: GB2217747.1A
Other versions: GB202217747D0 (en)
Inventors: Malcolm Law, Peter Graham Craven, John Robert Stuart
Current assignee: Lenbrook Industries Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Original assignee: Lenbrook Industries Ltd
Application filed by Lenbrook Industries Ltd
Priority to GB2217747.1A
Publication of GB202217747D0
Priority to PCT/GB2023/053071 (published as WO2024110766A1)
Publication of GB2624686A

Classifications

    (All within G10L19/00: Speech or audio signal analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; coding or decoding of speech or audio signals using source filter models or psychoacoustic analysis.)
    • G10L19/002: Dynamic bit allocation
    • G10L19/032: Quantisation or dequantisation of spectral components
    • G10L19/22: Mode decision, i.e. based on audio signal content versus external parameters
    • G10L19/24: Variable rate codecs, e.g. for generating different qualities using a scalable representation such as hierarchical encoding or layered encoding
    • G10L19/0017: Lossless audio signal coding; perfect reconstruction of coded audio signal by transmission of coding error
    • G10L19/008: Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
    • G10L19/08: Determination or coding of the excitation function; determination or coding of the long-term prediction parameters


Abstract

Input blocks of audio are encoded to data packets. A quantisation step size is determined for each of one or more audio channels for each block, in dependence on a rate control mechanism, and an offset is determined for each sample in the input blocks using a seeded pseudorandom sequence. Prequantised blocks are determined such that each sample is equivalent to its pseudorandom offset modulo the quantisation step size. The prequantised blocks are then losslessly encoded using an injection mapping. The losslessly encoded blocks are buffered and used to generate data packets for transmission, with some packets also including data representing the seed used for the pseudorandom offsetting. Also disclosed is a prequantization stage where input blocks are quantised with noise shaping so that, below around 15kHz, the transfer function of the noise shaping approximates a curve for equal loudness of noise. Also disclosed is a method for reducing an audible transient (e.g. a click or artifact) on stopping noise shaping by jointly quantising samples.

Description

IMPROVEMENTS TO AUDIO CODING
Field of Invention
The present invention relates to methods and devices for improved encoding and decoding of audio signals.
Background to the Invention
Audio codecs exploit several properties of audio to reduce data rate, commonly:
* Spectrum: power density typically decreases with frequency
* Tonality: signal power often concentrates into narrow bandwidths
* Dynamic range: volume varies, being quieter at times
* Channel similarity
Additionally, they may reduce data rate by approximation. Some approximation error can be tolerated, the amount varying with time, frequency and desired quality level. A codec is deemed lossless if it does not use approximation and the decoded audio is an exact replica of the audio supplied to the encoder.
Linear Predictive Coding can be used to exploit the audio spectrum. A model of the audio spectrum is used to predict each sample of the audio from prior values and the prediction error, which is usually smaller, is communicated across the transmission channel.
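The prediction step can be sketched in a few lines of Python. The fixed second-order coefficients in the usage below are illustrative (similar to one of FLAC's fixed predictors); a real codec adapts the model to the audio spectrum.

```python
import numpy as np

def lpc_residual(x, coeffs):
    """Predict each sample from prior samples and return the prediction error.

    coeffs[k] multiplies x[n-1-k]; the residual is what a codec would
    actually transmit, and is usually much smaller than the signal.
    """
    x = np.asarray(x, dtype=float)
    order = len(coeffs)
    residual = x.copy()
    for n in range(order, len(x)):
        prediction = sum(c * x[n - 1 - k] for k, c in enumerate(coeffs))
        residual[n] = x[n] - prediction
    return residual
```

For a slowly varying signal such as a low-frequency sinusoid, the residual of the predictor `[2, -1]` (a second difference) is orders of magnitude smaller than the signal itself.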
In adaptive pulse code modulation (ADPCM), the level of this prediction error is modelled and used to normalise the prediction error. This normalised prediction error is observed to have a reasonably stable distribution and so can be entropy coded. The open-source codec FLAC (free lossless audio codec) operates this way, with a constant modelled level for each block of audio.
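Entropy coding of such a stable residual distribution is commonly done with Rice codes. The following sketch illustrates the idea (the exact coding details in FLAC differ; the zig-zag signed-to-unsigned mapping and the parameter k here are standard textbook choices):

```python
def zigzag(v):
    """Map signed integers to unsigned: 0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4."""
    return (v << 1) if v >= 0 else -(v << 1) - 1

def rice_encode(v, k):
    """Rice code with parameter k: quotient in unary, then a k-bit
    binary remainder. Small residuals get short codewords."""
    u = zigzag(v)
    q, r = u >> k, u & ((1 << k) - 1)
    remainder = format(r, "0{}b".format(k)) if k else ""
    return "1" * q + "0" + remainder
```

Because the normalised residual distribution is stable, a single k per block performs close to the entropy of the source.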
Additionally, the normalised prediction error can be quantised to discard information and yield a reasonably stable data rate. This quantisation can be noise shaped to distribute the approximation error across the spectrum for reduced audibility.
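A minimal sketch of noise-shaped quantisation follows, using first-order error feedback (practical shapers use higher-order filters matched to perceptual curves; the coefficient b here is an illustrative choice):

```python
def quantise_noise_shaped(x, step, b=1.0):
    """Quantise to multiples of `step` with first-order error feedback.

    The previous quantisation error, scaled by b, is added to the next
    input before rounding, so the error spectrum is shaped (high-pass
    for b=1), pushing noise away from low frequencies.
    """
    out, err = [], 0.0
    for s in x:
        target = s + b * err
        q = step * round(target / step)
        err = target - q
        out.append(q)
    return out
```

Each output is an exact multiple of the step size, and the per-sample deviation from the input stays bounded by one step.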
Modelling of parameters can either be performed in the encoder and communicated to the decoder in the bitstream (forwards adaptive), or both encoder and decoder can apply the same methods to synchronously adapt their models to the audio (backwards adaptive).
Another strategy for an audio codec is to separate out the approximation stage. An initial prequantization stage reduces the information content of the audio, typically by quantising it more coarsely in conjunction with noise shaping to reduce the audibility of the quantisation. This reduced information audio is then transmitted with a lossless codec. The advantages of this technique are that cascadability without further loss of quality is a natural consequence, and that separating the concerns of reducing information content and efficiently coding the reduced information audio helps both to be well implemented.
Generically, codecs that operate sample by sample are termed time domain codecs; they have found application in speech and telecoms codecs and in applications where low latency is important. Time domain techniques are also effective for lossless audio codecs (e.g. FLAC).
But for general wide bandwidth audio use, the dominant approach is to start off with a time-frequency transform. Instead of each sample representing a short timespan but wide bandwidth (e.g. ~21µs × 24kHz), the transformed samples represent a narrow bandwidth over a long time span (e.g. a 1024-point transform converting to ~21ms × 24Hz).
The rationale for this transform is that often much of the signal energy concentrates into a few of these samples, and so we can obtain a reasonable impression of the audio from those few values and their coding can be designed to exploit their sparsity.
This approach works well at the data rates it was designed for. But quality aspirations increase at higher data rates, and forcing values to zero to create sparsity is too crude an operation. Without sparsity, much of the advantage of working in a transformed domain disappears, leaving several disadvantages:
* Operating a large transform to obtain fine frequency resolution requires a large block size. Overall encode-decode latency is typically several blocks, imposing a large minimum delay and making the codec inappropriate for many real time applications.
* A codec will be based around a certain fixed size transform, reducing customisability since the block size cannot be matched to application requirements.
* The transform has implementation costs.
* The audible effects of operating on a block naturally spread over the window the block decodes to. This can move energy forward from a transient event, flagging its approach to the listener.
* Varying noise floor with frequency requires communicating scale factors to the decoder, costing data rate and constraining the shape to match a given model (eg one scale factor per critical band).
Higher data rates are becoming more widespread, and there is therefore a need for improved time domain audio codec techniques for use at high sample rates (>=44.1kHz) and data rates (>=256kbps for 2 channels), achieving superior audio quality to prior art frequency domain codecs whilst enjoying lower latency and computational requirements.
Real life data channels often have variable capacity, for example a radio channel may intermittently suffer from interference. There is also a need for audio codecs to be able to seamlessly cope with reductions in channel capacity, degrading quality as required but without gaps, clicks or failure.
Therefore, as will be appreciated, there is a need for improved methods of encoding and decoding of audio signals and the associated encoders, decoders and codecs.
References
[1] M.A. Gerzon and P.G. Craven, "Lossless coding method for waveform data", WO1996037048A2.
[2] P.G. Craven and J.R. Stuart, "Cascadable Lossy Data Compression Using a Lossless Kernel", preprint 4416, 102nd AES convention, 1997.
[3] L.G. Roberts, "Picture Coding Using Pseudo-Random Noise", IRE Trans. Inform. Theory, vol. IT-8, pp. 145-154, 1962.
[4] M.A. Gerzon and P.G. Craven, "Optimal noise shaping and dither of digital signals", preprint 2822, 87th AES convention, 1989.
[5] M.A. Gerzon and P.G. Craven, "Compatible Improvement of 16-Bit Systems Using Subtractive Dither", preprint 3356, 93rd AES convention, 1992.
[6] J.R. Stuart, "Noise: Methods for Estimating Detectability and Threshold", JAES, vol. 42, no. 3, pp. 124-140, March 1994.
Summary of the Invention
According to a first aspect of the present invention, there is provided a method for encoding input blocks of audio to packets of data, the input blocks containing one or more channels of audio samples, the method comprising the steps of: receiving input blocks of audio; determining a quantisation step size Δ for each audio channel in each block in dependence on a rate control mechanism; determining a pseudorandom offset for each sample in the input blocks, the pseudorandom offsets for each channel being a pseudorandom sequence having a seed; quantizing with noise shaping each sample in the input blocks to produce prequantised blocks, wherein each sample value in the prequantised blocks is equivalent modulo Δ to the corresponding pseudorandom offset; losslessly encoding the prequantised blocks in dependence on Δ to produce blocks of losslessly encoded data, wherein the dependence on Δ is such that a smaller value of Δ would cause the losslessly encoded block to be larger, and wherein the lossless encoding is an injection mapping such that, for any prequantised block, losslessly encoding a different prequantised block that was also equivalent modulo Δ to the corresponding pseudorandom offsets would necessarily produce a different block of losslessly encoded data; buffering the losslessly encoded blocks of data in a buffer; and generating packets of data for onward transmission in dependence on the buffered data, wherein at least some of the packets of data comprise data representing the seed of the pseudorandom sequence.
In this way the rate control mechanism can adjust the level of noise through the stream, and optionally direct noise to louder channels that better hide it. The pseudorandom offset beneficially avoids quantisation distortion whilst also avoiding the increase in approximation error associated with additive dither.
Furthermore, the injection constraint requires that what we label "lossless encoding" actually is lossless: we insist that the step of losslessly encoding does not destroy information, that is, no two distinct well-formed inputs can produce identical outputs. Moreover, we require that the lossless encoder exploits Δ for compression gain. A lossless encoder that was not adapted for pseudorandom offsets could not exploit Δ for compression gain, because its input is apparently too high resolution regardless of Δ. Finally, the method of encoding is such that the decoder is equipped to replicate the identical pseudorandom sequence to that used by the prequantiser. Data representing the seed may be as straightforward as a block count index (modulo a power of 2), as that is sufficient to allow the decoder to fast-forward through a standardised pseudorandom sequence.
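As a rough illustration of the mod-Δ constraint, the sketch below quantises each sample to the nearest point of a grid shifted by a seeded pseudorandom offset. The uniform [0, Δ) offset distribution is an assumption for illustration; the essential property is only that encoder and decoder regenerate identical offsets from the seed.

```python
import random

def prequantise_with_offsets(samples, delta, seed):
    """Quantise so each output is congruent to its pseudorandom
    offset modulo delta. A decoder seeded identically can regenerate
    the exact quantisation grid used here."""
    rng = random.Random(seed)
    out = []
    for s in samples:
        offset = rng.random() * delta       # pseudorandom offset in [0, delta)
        q = offset + delta * round((s - offset) / delta)
        out.append(q)
    return out
```

The approximation error never exceeds Δ/2, and, unlike additive dither, no extra error is introduced on top of the rounding.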
Preferably, the rate control mechanism receives information about the buffer and the quantisation step size Δ is determined in dependence on the fullness of the buffer. In this way the buffer can be servoed to stabilise its occupancy and match the losslessly encoded data rate to that of the channel.
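A hypothetical servo law is sketched below. The exponential mapping from buffer occupancy to step size is an illustrative assumption (the source does not prescribe a control law); the point is only that the step size grows monotonically with occupancy, pulling the lossless data rate back toward channel capacity as the buffer fills.

```python
def step_size_from_buffer(fullness, capacity, base_step=1.0, gain=4.0):
    """Map buffer occupancy (0..capacity) to a quantisation step size.

    Each `1/gain` of occupancy doubles the step, coarsening the
    quantisation, and hence shrinking the losslessly encoded blocks,
    as the buffer approaches full.
    """
    occupancy = fullness / capacity
    return base_step * (2.0 ** (gain * occupancy))
```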
In some embodiments the method further comprises the step of separating the losslessly encoded data in each block into a first portion and a second portion which are buffered separately in the step of buffering, wherein the first portion comprises coarse data and the second portion comprises touchup data, such that the coarse data can be decoded without the touchup data to produce a coarse approximation of the prequantised block; and wherein the packets of data are generated such that each packet comprises an integer number of coarse data blocks and is filled up to available capacity with touchup data.
In this way, if the decoder has a problem recovering buffered data, it can still produce a coarse approximation to the audio instead of nothing. And yet buffering is still available to decouple the variable data rate of the lossless encoding from the data channel characteristics.
Preferably, the touchup data is stored in a first-in-first-out (FIFO) buffer and the packets of data are generated from one end with coarse data blocks and from the other end with FIFO buffered touchup data.
In this way the decoder can access touchup data and decode the first block in the packet before it has parsed the coarse data for all the blocks in the packet. This can be accomplished without spending data rate on a length field indicating the total amount of coarse data.
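The two-ended packing can be sketched as below. This is a simplification of the described packet format: whole coarse blocks fill from the front, and touchup bytes fill the remaining space at the back (the actual format consumes touchup in reverse order and without regard for block boundaries).

```python
def build_packet(coarse_blocks, touchup_fifo, capacity):
    """Fill a packet from the front with whole coarse blocks and from
    the back with as many FIFO touchup bytes as fit.

    touchup_fifo is a bytearray consumed from its front (FIFO order).
    Returns the packet and the number of coarse blocks included.
    """
    packet = bytearray(capacity)
    pos = 0
    n_blocks = 0
    for blk in coarse_blocks:
        if pos + len(blk) > capacity:
            break                       # next coarse block does not fit
        packet[pos:pos + len(blk)] = blk
        pos += len(blk)
        n_blocks += 1
    room = capacity - pos               # leftover space for touchup
    take = min(room, len(touchup_fifo))
    if take:
        packet[capacity - take:] = touchup_fifo[:take]
        del touchup_fifo[:take]
    return bytes(packet), n_blocks
```

If the FIFO holds less touchup than the leftover space, an unused hole is left between the two regions, matching the underflow behaviour the document describes for Figure 14.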
In some embodiments the method further comprises the step of analysing samples in the input blocks, wherein the quantisation step size Δ is further determined in dependence on the analysis of the samples. Preferably, the quantisation step size Δ is increased if the analysis suggests that the buffer might otherwise overflow. In this way the encoder can anticipate and avoid buffer overflow, and consequently the codec can safely operate with less buffering and a shorter codec latency.
According to a second aspect of the present invention, there is provided an encoder adapted to encode input blocks of audio to packets of data using the method of the first aspect.
According to a third aspect of the present invention, there is provided computer readable medium comprising instructions that, when executed by one or more processors, cause said one or more processors to perform the method of the first aspect.
According to a fourth aspect of the present invention, there is provided a method for decoding packets of data to output blocks of audio containing one or more channels of output audio samples, the method comprising the steps of: receiving packets of data; extracting information indicating a quantisation step size Δ and a seed for each channel and block in dependence on the data; determining an offset for each sample in a block, wherein the offsets for each channel are a pseudorandom sequence dependent on the corresponding seed; decoding the data to produce a prediction residual for each sample in the block in dependence on the data; filtering the prediction residuals with quantisation to produce a filtered sample for each sample in the block in dependence on the corresponding prediction residual, wherein each filtered sample is equivalent modulo Δ to the corresponding offset; and generating output blocks of audio in dependence on the filtered samples.
In this way the decoder establishes the quantisation characteristics of the audio presented to the lossless encoder by extracting Δ and the seed, thus allowing it to ensure its output conforms to those characteristics. Moreover, the decoder expands those quantisation characteristics into a specification for each sample by generating the pseudorandom sequence. (This might not apply to all channels in all blocks, as the stream may specify that some channels in some blocks do not use pseudorandom offsets.) Finally, the decoder ensures each filtered sample conforms to the quantisation specification. As we set out the architecture of such a lossless decoder, the filtering step is neither the first nor the last operation, which is why we precede it with a step of decoding prediction residuals and couple it to the output.
In some embodiments a first portion of each packet of data is decoded without a delay and a second portion of each packet of data is buffered and delayed prior to decoding. In this way the decoder applies complementary delays to those applied by corresponding encoder embodiments and is still able to decode a coarse approximation to the audio instead of nothing if there is a problem recovering buffered data.
According to a fifth aspect of the present invention, there is provided a decoder adapted to decode packets of data to output blocks of audio using the method of the fourth aspect.
According to a sixth aspect of the present invention, there is provided a computer readable medium comprising instructions that, when executed by one or more processors, cause said one or more processors to perform the method of the fourth aspect.
According to a seventh aspect of the present invention, there is provided a codec comprising an encoder according to the second aspect in combination with a decoder according to the fifth aspect.
According to an eighth aspect of the present invention, there is provided a method for encoding audio to data, the method comprising: receiving input blocks of audio, each input block comprising one or more channels of audio samples quantised to an input audio precision; determining a prequantization precision for each channel in each block, there being at least one channel in one block where the prequantization precision is coarser than the input audio precision; producing prequantised blocks by, where the prequantization precision is coarser than the input audio precision, quantizing each sample in the input blocks to the prequantization precision with noise shaping, wherein below a threshold frequency of approximately 15kHz the transfer function of the noise shaping approximates a curve for equal loudness of noise; and losslessly encoding the prequantised blocks to produce blocks of losslessly encoded data.
In this way the noise introduced by the quantisation operation below the threshold frequency is shaped to a benign curve which draws no attention to itself, giving no more perceptual weight to one frequency region than another. Above the threshold frequency, equal loudness curves rise sharply and it is not beneficial to follow this rise too far.
According to a ninth aspect of the present invention, there is provided an encoder adapted to encode audio to data using the method of the eighth aspect.
According to a tenth aspect of the present invention, there is provided a computer readable medium comprising instructions that, when executed by one or more processors, cause said one or more processors to perform the method of the eighth aspect.
According to an eleventh aspect of the present invention, there is provided a method for reducing an audible transient on stopping noise shaping of an audio signal, the method comprising altering the next n quantised sample values by: multiplying state variables of the noise shaping and/or a difference between one or more previous outputs and corresponding inputs of the noise shaping by a precomputed matrix to yield an intermediate representation containing n or fewer values; quantising the n or fewer values in the intermediate representation, either directly or with back substitution, to produce n or fewer quantised intermediate values; multiplying the n or fewer quantised intermediate values by a precomputed integer-valued matrix to produce n alterations for quantised sample values; and applying the n alterations to the quantised sample values.
In this way we implement a good solution to a tricky joint rounding problem which reduces a potentially audible defect. The difficult linear algebra aspects of the problem for a specified frequency weighting are precomputed allowing the real time solution for a particular instance of quenching a noise shaper to be performed by straightforward matrix operations.
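The run-time portion alone might look like the following sketch. Here `M` and `M_int` stand in for the design-time precomputed matrices, back substitution is omitted, and the identity matrices in the usage are purely illustrative; real matrices would come from the least squares setup for the chosen frequency weighting.

```python
import numpy as np

def quench_alterations(state, M, M_int):
    """Run-time quenching step: project the noise shaper state into an
    intermediate representation, quantise it, then map back to integer
    alterations for the next n sample values."""
    intermediate = M @ state            # precomputed design-time matrix
    q = np.round(intermediate)          # quantise intermediate values
    return (M_int @ q).astype(int)      # integer-valued alterations
```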
According to a twelfth aspect of the present invention, there is provided a device adapted to reduce an audible transient on stopping noise shaping of an audio signal using the method of the eleventh aspect.
According to a thirteenth aspect of the present invention, there is provided a computer readable medium comprising instructions that, when executed by one or more processors, cause said one or more processors to perform the method of the eleventh aspect.
As will be appreciated by those skilled in the art, the present invention is capable of various implementations according to the application, as will be apparent from the following discussion.
Brief Description of the Figures
Embodiments of the invention will now be described by way of example with reference to the accompanying figures, in which:
Figure 1 shows the main components of an audio encoder according to the invention and how the various components might connect together;
Figure 2 illustrates the operation of an audio encoder according to the invention in flowchart form. Packets of data produced by the audio encoder are not constrained to contain a fixed number of blocks of audio, so presentation of a block of audio is shown asynchronously to extraction of a data packet, these operations being coupled by data buffering;
Figure 3 shows an overview of the main components of an audio decoder according to the invention. Incoming packets divide into two sections, termed coarse and touchup. The coarse data passes directly for decoding; the touchup data is delayed in a FIFO buffer before being used for decoding. The output of the lossless decoder is optionally upsampled in order to restore the sample rate to that presented to the encoder's prequantiser;
Figure 4 shows two equivalent architectures for performing noise shaped quantisation to integer multiples of a step size Δ with a pseudorandom offset;
Figure 5a shows how the prior art proposal of encoding audio by prequantising it followed by a lossless codec can be altered by employing subtractive dither;
Figure 5b shows how the prior art proposal of encoding audio by prequantising it followed by a lossless codec can be improved by employing pseudorandom offsets;
Figure 6 shows various noise shaping transfer functions useful for the prequantisation operation, with amplitude in dB plotted against frequency in Hz. Below a threshold frequency they all have a similar shape, following an equal loudness contour adjusted to be appropriate for noise;
Figure 7 illustrates the conceptual model used to set up a least squares model for minimising the audibility of artifacts when stopping a noise shaping operation;
Figure 8 is a flowchart setting out a sequence of steps for minimising the audibility of artifacts when stopping a noise shaping operation. Some of these are performed at design time, processing a desired spectral weighting of errors into matrices. The others are performed at run time, using these precomputed matrices to minimise audibility on a specific occasion;
Figure 9 is a flowchart showing how a block of audio can be analysed to estimate how the encoded bit rate varies depending on prequantization configuration;
Figure 10 shows the main signal processing operations in a lossless encoder according to the invention and how data flows from one operation to another;
Figure 11 shows the main signal processing operations in a lossless decoder according to the invention and how data flows from one operation to another;
Figure 12 shows an example packet format for communicating between the encoder and decoder according to the invention. It contains a coarse description of an integer number of audio blocks, and the rest of the packet is filled up with buffered touchup information in reverse order. The touchup data is consumed without regard for block boundaries, so there are partial fragments at each end;
Figure 13 illustrates how a synchronisation field in the packet header can synchronise the decoder FIFO buffer. Packet t contains such a field in the packet header specifying the number of bits expected to be in the decoder FIFO just before arrival of this packet, after decode of block t-1;
Figure 14 illustrates how FIFO buffer underflow can be dealt with. Figure 14a shows how coarse blocks flow from the lossless encoder into a delay line and touchup data flows into a FIFO buffer. The encoder furnishes a packet containing two coarse blocks (numbered t and t+1) from the delay line. It attempts to fill the remainder of the packet with touchup data from the FIFO, but there is insufficient, so an unused hole is left in the packet. Figure 14b shows the state of the decoder buffer, and which packet various data arrived in, at a later point in time;
Figure 15 shows how quantisation with noise shaping can be viewed as jointly quantising a pair of samples. Figure 15a shows plain quantisation of each sample as a 2-dimensional operation. Figure 15b shows how the skew transformation implied by the noise shaping transforms each Voronoi region, and Figure 15c tiles the transformed Voronoi regions onto the plane;
Figure 16 echoes Figure 15 but with a quantisation of half the density (across 2 dimensions). Figure 16a shows the 2-dimensional rhombic quantisation with no noise shaping. Figure 16b shows how the skew transformation implied by the noise shaping transforms each Voronoi region, and Figure 16c tiles the transformed Voronoi regions onto the plane; and
Figure 17 shows a flow chart illustrating how the rate control servo can incorporate desirable audio considerations.
Detailed Description
In reference [1] p67-71, Gerzon and Craven propose constructing a lossy audio codec out of an initial prequantization stage to reduce the information content of the audio, followed by a lossless audio codec. Craven and Stuart also proposed this approach in reference [2].
Having reduced this concept to practice (which we are unaware of Gerzon, Craven or Stuart doing at the time, or indeed anyone else having done) we find that, with improvements as described below, superior audio quality to state-of-the-art audio codecs can be obtained at high sample rates (>=44.1kHz) and data rates (>=256kbps). Furthermore, this can be achieved at lower latency and computational load for both encoder and decoder. Also the resulting codec can have the ability to scale operation seamlessly between lossy operation at these data rates and lossless operation at suitably higher data rates.
The main advantage of dividing a lossy encoder into a prequantiser and lossless encoder is separation of concerns. The prequantiser can focus on reducing the precision (and hence information content) of the audio whilst paying great attention to ensuring the signal processing gives a high-quality outcome. The lossless codec presents no audio quality concerns by virtue of not altering the audio (in normal operation). Consequently, it can focus on coding the audio to a minimum amount of data with good computational efficiency.
A secondary advantage is cascadability. Since the decoded audio is an exact replica of the audio presented to the lossless encoder, the decoded audio can be recompressed to the same data rate without a second stage of prequantization and without further approximation error.
An interesting cascadability use case is streaming that is wirelessly retransmitted by a phone out to earbuds. The streaming could be at a data rate that the wireless channel can usually accommodate. But if wireless conditions deteriorate, the phone can requantise to a coarser resolution, lower quality rendition, returning to lossless retransmission when wireless conditions permit.
Nevertheless, although it is preferable to separate the prequantiser from the lossless encoder, it would be perfectly possible to reorganise the signal processing operations so as to integrate the data reducing quantisation into the lossless encoder operations making it a monolithic lossy encoder.
General encoder structure overview The general structure of the encoder is illustrated diagrammatically in Figure 1 and in flowchart form in Figure 2. We first describe the structure with reference to Figure 1.
Incoming digital audio representing one or more channels is presented in blocks, whose size is configurable but preferably represents around 1-2ms of audio. Smaller blocks allow greater flexibility in dynamically adjusting the degree of approximation error in response to the audio, but incur greater data overheads in the lossless encoded stream and also more computational cost since the encoder makes more frequent decisions.
Each block of audio is then prequantised. This is the stage where the information content of the audio is reduced to match the capabilities of the transmission channel. With sufficient channel capacity lossless operation may be possible in which case the prequantiser will pass the block of audio with some or all channels unaltered. But usually the audio is quantised to a suitable precision with a pseudorandom offset and noise shaping. The pseudorandom offset ensures the approximation error is noise like (as opposed to distortion) and the noise shaping adjusts the spectral shape of the approximation error to minimise audibility.
Preferably the prequantiser also has the capability to perform other signal processing operations to reduce information content, such as reduction in sample rate or even reduction of multiple independent audio channels to mono. These capabilities are useful to cover situations when the channel capacity might suddenly degrade. For example, it may be a radio link which encounters interference when another family member starts watching a high-resolution video.
The lossless encoder is responsible for turning each block of audio into a block of data from which a corresponding decoder can reconstruct an exact replica of the audio block. It is the lossless codec which exploits the known characteristics of audio to achieve compression gain. In reference [1] Gerzon and Craven anticipated using a general-purpose lossless audio codec, the design of which was the main topic of the document. However, in practice a prior art lossless codec (currently FLAC is the dominant example) is not suitable as there are many desirable specialisms to the lossless codec that are useful to achieve good performance of the whole system. In particular, the lossless encoder needs to be adapted to operate with pseudorandom offsets as otherwise the apparently high precision audio input would lead it to operate at an undesirably high data rate.
Encoded blocks are then passed on to a packetiser, which is responsible for producing actual packets for transmission across the communications channel.
Although formatting the encoded blocks into packets might reasonably be considered part of the lossless encoder, we separate it out as it has a distinctive role in the overall encoder. The size of encoded blocks will vary, especially in lossless operation. For some channels, such as file storage, this does not matter. For many real time communications channels however, it does matter and the packets emerging from the encoder should be at a fixed or peak limited data rate. Perhaps packet size is constant and there is a minimum period between packets. Or perhaps packets should be emitted to a fixed schedule and there is a maximum packet size.
The packetiser preferably comprises buffering which accommodates the conflict between the inherently variable data rate from the lossless encoder and the fixed or peak limited data rate of the channel. When the lossless encoder is producing blocks containing more data than the available data rate, the buffer will fill up and when it's producing shorter encoded blocks the buffer will empty.
In some embodiments the output may not be peak rate limited, for example a codec intended for file-to-file coding. In that case there is no short-term capacity constraint to require buffering and the buffering could be omitted.
The whole data stream could be buffered, but it is preferable to divide it into two portions. One of these (which we will call coarse) is capable of decoding on its own to a comparably crude representation of the audio, the other (which we will call touchup) contains the additional information that enables lossless reconstruction. The coarse data experiences a constant delay in the buffer (which we will call the latency), but the touchup data experiences a variable delay ranging between zero and the latency. This variable delay allows the data rate out of the lossless encoder to be decoupled from the communication channel capacity. On the communications channel, the touchup data is advanced with respect to the coarse data by a variable amount ranging between zero and the latency.
Preferably the buffer is instrumented to measure how full it is, which we term buffering stress, and this reading is passed onto a rate control servo. The rate control servo is responsible for closing a feedback loop. Quantising the audio finely (or losslessly) causes large encoded blocks from the lossless encoder, filling up the buffer and increasing buffering stress, whilst coarse quantisation causes small encoded blocks, draining the buffer and reducing buffering stress. Preferably the rate control servo adjusts the degree of quantisation performed by the prequantiser so as to keep buffering stress tolerable, whilst having regard to the audible consequences of altering quantisation precision.
Sometimes, when the codec is operated at low latency and there is little buffering available, the feedback mechanism is inadequate to prevent buffer overload. Audio exhibits large dynamic range and quiet, gentle, finely quantised audio could be immediately followed by a loud high information content block, such as a cymbal crash. If this block was finely quantised in line with the processing for previous blocks of audio then a very large amount of data would emerge from the lossless encoder, potentially overwhelming the buffering.
Preferably the incoming audio block is analysed to estimate its information content and consequently estimate the relationship between quantiser step size and the number of bits the block will encode to and this information is also considered by the rate control servo. We suspect many designers would choose to make analysis of the current block the main rate control mechanism, with feedback from buffer stress at most a secondary influence. For reasons discussed later, we believe better sounding results are obtained by focussing on buffer stress for choosing the degree of quantisation. Preferably the current block analysis is largely ignored, except when it suggests disaster would befall the buffering if immediate action was not taken.
The flowchart of Figure 2 presents a different perspective on the same general encoder organisation.
Preferably there does not have to be a fixed relationship between blocks of audio and the packets they are encoded into. This decouples the coding from the characteristics of the transmission channel which may have constraints around what sizes of packets are supported and when they can be transmitted.
Accordingly, Figure 2 treats receiving an audio block and receiving a request for a packet as separate, asynchronous events which are coupled by the buffering.
On receiving an audio block, preferably the encoder conducts an initial analysis of the block with a view to determining its information content and consequently what the relationship looks like between the precision of prequantization and how much data would be required to encode it.
The encoder decides what step size Δ should be used to prequantise the audio to reduce its information content. Optionally Δ might vary from channel to channel. As discussed in more detail in a later section, preferably the encoder makes this choice mainly on the basis of the current level of stress in the output buffering. The initial analysis above may alter this decision, especially if it looks like the buffering might overrun, but we are wary of difficult audio starting mid-block causing prequantiser noise to rise at the beginning of the block.
The encoder determines pseudorandom offsets for the block of audio. These are generated by pseudorandom number generators.
The prequantiser now quantises the audio to integer multiples of Δ offset by the pseudorandom offsets, which randomise how it performs its quantisation. We consider this to be different to subtractive dither (as discussed later) but it is numerically equivalent and so obtains the subtractive dither benefits of avoiding quantisation distortion while not increasing quantiser error.
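To make the arithmetic concrete, the offset quantisation step might be sketched as follows (illustrative code of our own, not part of the specification; the step size and sample values are arbitrary):

```python
import random

def quantise_with_offset(x: int, delta: int, r: int) -> int:
    """Quantise x to the offset lattice {k*delta + r}: subtract the
    pseudorandom offset r, round to the nearest multiple of delta,
    then add r back."""
    q = ((x - r + delta // 2) // delta) * delta  # round to nearest multiple
    return q + r

rng = random.Random(1)
delta = 48                           # illustrative non-power-of-two step size
for x in (-1000, -1, 0, 7, 12345):
    r = rng.randrange(delta)         # uniform pseudorandom offset in [0, delta)
    y = quantise_with_offset(x, delta, r)
    assert (y - r) % delta == 0      # output satisfies the modulo constraint
    assert abs(y - x) <= delta // 2  # error no worse than undithered rounding
```

Subtracting the offset before the quantiser and adding it back afterwards is arithmetically the same as subtractive dither around the quantiser, which is why the benefits carry over.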
The quantised audio is then losslessly encoded by the encoder component of a lossless codec which is adapted to operate with pseudorandom offsets.
Preferably the output of this lossless encoder divides into two components. In combination they are sufficient to enable the decoder to losslessly reproduce an exact replica of the prequantised audio supplied to the lossless encoder. But one of them, which we name the coarse data, can be used on its own to reconstruct a coarse representation of the audio. We call the other touchup because it improves the quality of reproduction.
The coarse data and the touchup are then pushed into buffering which decouples the variable data rate emerging from the lossless encoder from the characteristics of the transmission channel. Preferably, they are treated separately in the buffering. The coarse data is kept as an indivisible unit so we say it is pushed into a delay line. The touchup data is treated as a sequence of bits which are pushed into a FIFO buffer from which they will be pulled without regard to block boundaries.
Finally, we update a measure of buffer stress for use in choosing Δ for subsequent blocks. A sensible choice of buffer stress is the excess amount of encoded data in the buffer compared to the average channel data rate. We update this value by adding the total encoded size of the block and subtracting the expected channel capacity over the duration of a block.
Asynchronously, requests for packets are handled by pulling an integer number of blocks of coarse data out of the delay line, the number of blocks depending on the duration of audio the packet is desired to span. This number relates to the repetition period of packets on the channel. This leaves a variable amount of space in the packet which is filled by pulling touchup data from the FIFO buffer as a stream of bits without regard for block boundaries. Preferably this touchup data is flowed into the packet starting at the end and working back towards the beginning. This organisation allows the decoder to work with the touchup data in a packet before it has finished parsing the coarse data and hence discovered where the boundary between coarse and touchup is located.
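The packet assembly just described might be sketched as follows (a simplified byte-granular model of our own; a real implementation would work at bit granularity and add framing):

```python
from collections import deque

def build_packet(packet_size, coarse_delay_line, blocks_per_packet, touchup_fifo):
    """Assemble one packet: coarse blocks fill from the front, touchup
    bytes flow backwards from the end, and any shortfall of touchup
    leaves an unused hole in the middle."""
    packet = bytearray(packet_size)          # zero-filled; gap left is the 'hole'
    pos = 0
    # Pull an integer number of coarse blocks from the front of the delay line.
    for _ in range(blocks_per_packet):
        block = coarse_delay_line.popleft()
        packet[pos:pos + len(block)] = block
        pos += len(block)
    # Flow touchup in from the end, working back towards the beginning, so the
    # decoder can consume it before locating the coarse/touchup boundary.
    end = packet_size
    while touchup_fifo and end > pos:
        end -= 1
        packet[end] = touchup_fifo.popleft()
    return packet

pkt = build_packet(8, deque([b'AB', b'CD']), 2, deque([1, 2, 3]))
assert bytes(pkt) == b'ABCD\x00\x03\x02\x01'   # hole at index 4, touchup reversed at tail
```

Note that the touchup bytes appear reversed at the tail of the packet simply because they are written back-to-front; the decoder reads them in the same back-to-front order.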
Finally the measure of buffer stress is updated, to accommodate any discrepancy between the actual packet size and the size that is expected from the average data rate and the number of blocks it describes.
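The two stress-bookkeeping updates described above (per encoded block and per emitted packet) might be sketched as follows. Class and member names are our own, and we assume stress is measured in bits relative to the average channel rate:

```python
class BufferStress:
    """Tracks excess buffered data relative to the average channel rate."""

    def __init__(self, channel_bits_per_block: float):
        self.channel_bits_per_block = channel_bits_per_block  # expected capacity per block
        self.stress = 0.0   # bits of encoded data in excess of the average rate

    def on_block_encoded(self, encoded_bits: int) -> None:
        # Add the block's encoded size and subtract the expected channel
        # capacity over the duration of one block.
        self.stress += encoded_bits - self.channel_bits_per_block

    def on_packet_emitted(self, packet_bits: int, blocks_in_packet: int) -> None:
        # Reconcile the discrepancy between the actual packet size and the
        # size expected from the average rate and the number of blocks:
        # a larger-than-expected packet drains the buffer further.
        expected = blocks_in_packet * self.channel_bits_per_block
        self.stress += expected - packet_bits

bs = BufferStress(1000)
for bits in (1200, 800, 1100):
    bs.on_block_encoded(bits)
assert bs.stress == 100          # net excess after three blocks
bs.on_packet_emitted(2100, 2)    # packet 100 bits larger than expected
assert bs.stress == 0
```

With this sign convention, stress tracks actual buffer occupancy relative to its steady-state level: block arrivals raise it, packet departures lower it.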
General decoder structure overview Figure 3 shows the corresponding decoder structure.
Preferably an incoming packet is divided up into two portions, one of which (the coarse data) is unbuffered and passes directly to the lossless decoder, the other of which (the touchup data) is buffered and experiences a variable delay complementary to the touchup delay in the encoder. The net effect is that all data is delayed by a constant amount between the lossless encoder and the lossless decoder. For the coarse data, this delay is all in the encoder buffer. For the touchup data, a variable amount of this delay occurs in the encoder buffer and the remainder in the decoder buffer.
This arrangement is advantageous because sometimes buffered data may not be available to decode. For example, the decoder may be starting to decode in the middle of a stream and data sent in earlier packets is unavailable. Or a missing packet may have caused the FIFO buffer in the decoder to lose synchronisation. In these circumstances, the decoder can still decode the coarse data and produce an approximate rendition of the desired audio until the buffer is able to recover synchronisation and full lossless decoding can be restored.
The lossless decoder is adapted to decode data quantised with pseudorandom offsets. Accordingly pseudorandom offsets are computed which replicate corresponding offsets generated in the prequantiser. These pseudorandom offsets are supplied to the lossless decoder so that it can ensure its output satisfies the same modulo constraints that the prequantiser quantised to.
After lossless decode, the audio is optionally upsampled. Upsampling is done when the stream indicates that the prequantiser in the encoder reduced the sampling rate. This upsampling is done so that the decoder can output a consistent sample rate even as the prequantiser dynamically decides to switch decimation in or out in response to varying transmission channel conditions.
Preferably the decimation and upsampling are designed so as to minimise any audible artifacts on changing the sample-rate through the lossless codec.
PreQuantisation The prequantiser is responsible for reducing the information content of the audio in response to control instructions.
The main mechanism for doing so is noise shaped quantisation to a pseudorandom offset, as shown in Figure 4. Operation is governed by a parameter Δ which controls the precision of the quantisation.
Noise shaped quantisation is well known in the prior art and discussed as the prequantisation mechanism in reference [1] (particularly Fig 20b).
We assume the incoming audio signal to be presented as integer values. For example, a 24-bit audio signal will take integer values in the range [-2^23, +2^23). In Figure 4a the quantiser QΔ quantises its input to integer multiples of a step size Δ which is also an integer. However QΔ is preceded and followed by, respectively, subtraction and addition nodes, these three operations having the net effect of quantisation to integer multiples of Δ with a pseudorandom offset.
The error introduced by this operation is filtered by a filter A(z-1) and added to the audio input prior to quantisation. The overall error of the whole process is filtered by a filter B(z-1) and added to the audio input prior to quantisation. This has the effect of spectrally shaping the error introduced by the quantisation operation with a transfer function (1 + A(z-1))/(1 + B(z-1)) so as to reduce it in frequency regions where it might be more audible at the expense of boosting it in frequency regions where it might be less audible. Either of A(z-1) or B(z-1) may be chosen to be trivial and omitted with consequent simplifications.
The auxiliary quantiser box Q' is included in the diagram for a slightly pedantic reason. After adding in the error feedback, we have a high precision signal, which Q' quantises back to some specified precision, for example integer values. It is not required if filter A(z-1) is trivial; it is there to limit the precision of the signal supplied to the filter A(z-1) so that the filter can be implemented with fixed precision arithmetic. It benefits from incorporating normal additive dither.
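The Figure 4a structure can be sketched as follows, taking the simplifying case where B(z-1) is trivial so that the quantisation error is shaped by 1 + A(z-1). This is illustrative code of our own (names, coefficients and the omission of Q' are all our choices):

```python
import random

def prequantise(samples, delta, a_coeffs, rng):
    """Noise-shaped quantisation to integer multiples of delta with a
    pseudorandom offset per sample. With B(z-1) trivial, the overall
    error is shaped by 1 + A(z-1)."""
    hist = [0.0] * len(a_coeffs)          # recent quantiser errors, newest first
    out = []
    for x in samples:
        fb = sum(a * e for a, e in zip(a_coeffs, hist))   # A(z-1) applied to errors
        v = x + fb                        # filtered error added before quantising
        r = rng.randrange(delta)          # pseudorandom offset in [0, delta)
        q = int((v - r + delta / 2) // delta) * delta + r  # offset quantiser
        hist = [q - v] + hist[:-1]        # error introduced on this sample
        out.append(q)
    return out

rng = random.Random(0)
out = prequantise([100, -50, 3000, 7, 0, 250], 64, [-1.0], rng)
rng2 = random.Random(0)                   # replay the same offset sequence
for q in out:
    assert (q - rng2.randrange(64)) % 64 == 0   # each output on its offset lattice
```

The coefficient choice a_coeffs = [-1.0] gives the classic first-difference shaping 1 - z^-1, pushing quantisation noise towards high frequencies; the curves of Figure 6 would correspond to higher-order coefficient sets.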
Audio quantisation would normally be to a power of two step size, producing an output with an integer number of zeros as the least significant bits. However powers of two are too widely spaced for a step size in a prequantiser application, as they only allow noise levels to be chosen in increments of 6dB. A prequantiser needs greater precision for adjusting the level of quantisation noise so Δ needs to be able to take non power of two values. A codec would typically tabulate allowed integer values for Δ, perhaps increasing in ratios approximating 1.5dB, 2dB or 3dB.
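One possible way to build such a table of allowed integer step sizes, here with ratios approximating 2dB (the function and the resulting values are illustrative, not taken from the specification):

```python
def delta_table(max_delta: int, step_db: float = 2.0):
    """Tabulate integer step sizes increasing by roughly step_db decibels.
    Powers of two alone would only allow 6dB increments in noise level;
    this gives the finer spacing a prequantiser needs."""
    ratio = 10 ** (step_db / 20)          # ~1.26 for 2dB
    deltas, target = [], 1.0
    while target <= max_delta:
        d = round(target)
        if not deltas or d > deltas[-1]:  # skip duplicates at small values
            deltas.append(d)
        target *= ratio
    return deltas

print(delta_table(64))
# small integers cannot honour the 2dB ratio exactly, so spacing is coarse at first
```

At small values the integer constraint forces steps of 1, and the intended ratio is only approximated once Δ is large enough; a practical codec might instead standardise a hand-tuned table.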
Preferably the pseudorandom value subtracted and added is a uniformly distributed integer in the range [0,A). Figure 4a shows it generated by taking a pseudorandom sequence of values considered to lie in the range [0.0,1.0), multiplying by A and quantising to integer (typically by discarding the fractional component). However other derivations are possible, especially since the pseudorandom value is both subtracted and added and thus is only the remainder modulo A that affects operation. For example, a pseudorandom integer with whose range is substantially greater than A could be used directly since it will have a nearly uniform distribution modulo A. This offset can be applied in various ways. For example, instead of subtracting and adding it immediately around the quanfiser QA as per Figure 4a, Figure 4b shows the offset subtracted from the input signal to the whole noise shaped quantisation and added back to the output of the noise shaped quantisation.
Despite looking quite different, Figure 4a and Figure 4b are arithmetically identical.
Relationship to subtractive dither The concept of adding a pseudo-random value prior to quantisation and subsequently subtracting a synchronised replica of it has previously been proposed. Why do we use the descriptive term "pseudorandom offset" instead of the accepted term of art "subtractive dither"? We do so because subtractive dither is a different concept, and the difference does not lie in the arithmetic but in the location of operations.
In 1962 Roberts (reference [3]) proposed adding noise to pixels in a picture before quantising it for transmission and subtracting the same noise in the receiver.
In 1989 Gerzon and Craven (reference [4]) proposed the now accepted term "subtractive dither" for Roberts' technique and defined the term (p12) as "Subtractive dither, whereby the dither added at the quantiser is subtracted at the output of a digital transmission path".
The point is the remoteness (transmission path) between the addition and subtraction operations. It is the reduced width of the transmission path that creates the need for quantisation and the need for synchronised noise sources at both the transmit and receive side. For Roberts this is TV transmission, Gerzon & Craven subsequently proposed (reference [5]) using subtractive dither to quantise high precision audio to 16 bits for transmission on CD with subtraction in the CD player.
Without the reduced capacity channel, there's no need for a quantiser at all! If subtractive dither was to be added to the prequantiser+lossless codec proposals of Gerzon & Craven or Craven & Stuart it would produce a system like that shown in Figure 5a which adds dither in the prequantization and subtracts a synchronised version of it at the decode side after the lossless codec. If there were no noise shaping (or the noise shaping was fixed) then this would be a useful improvement on the prequantiser+lossless codec proposals in references [1] and [2] for all the well-known reasons why dither is beneficial and subtractive dither better.
However as also shown in Figure 5a the subtracted dither on the receive side needs to be filtered to match the noise shaping at the transmit side. In reference [5] this noise shaping was to be fixed and standardised. But in the prequantiser+lossless codec concept the noise shaping is variable and needs to be synchronised between the encoder and decoder. This requirement destroys a major advantage of the prequantiser+lossless codec concept, rendering the use of subtractive dither around the codec impractical.
In contrast our preferred improvement to the prior art prequantiser+lossless codec proposals is of the general form shown in Figure 5b. Here the dither is added and subtracted immediately around the quantiser and there is no requirement to synchronise the noise shaping in the receiver.
This process results in a wider wordwidth than the quantisation precision and on the face of it does not allow a lossless audio codec to operate at the desired reduced data rate. However, as taught herein, it turns out that it is actually possible to enjoy the desired reduced data rate if the lossless codec is suitably adapted to operate with known offsets.
Spectral shape of prequantiser noise The generally accepted view is that the audibility of codec noise depends on the spectral content of the signal masking it and consequently a lossy audio codec should concentrate its error into those spectral regions that are currently said to be masked by the audio signal.
In reference [1], Gerzon explains (p67-69 with reference to Fig 20a) how this applies to a prequantiser for a lossless audio codec, estimating an auditory masking curve from which noise shaping coefficients can be computed.
In contrast to this approach, we have found it preferable to design noise shaping filters on the basis of equal loudness curves. A selection of suitable noise shaping transfer functions are graphed in Figure 6. Two are drawn for 48kHz sampling rate, two for 96kHz.
Below about 15kHz the noise shaping transfer functions are shaped according to audibility thresholds. This exhibits a dip around 3-4kHz and a further dip around 12kHz. Above this frequency the threshold rises sharply. There is no noise shaping benefit in having prequantiser noise level exceed signal noise level in this region, so the curves drop beneath the equal loudness curve, for example plateauing up to the Nyquist frequency, or perhaps drooping at higher frequencies to reflect lower signal spectral density. The vertical line at 15kHz approximately (within 2-3kHz) marks the transition from one regime to the other.
The choice of curve shape below the knee around 15kHz does appear to be important to the sound imparted by the prequantiser. Less so the shape above the knee for which practical considerations of obtaining good noise shaping advantage without overwhelming the audio signal are relevant.
Data for equal loudness is readily available, for example ISO 226:2003 or ISO 389-7:2019. However, these data are for equally loud sine waves and need adjustment for use with noise as the variable integration bandwidth of the ear means that in different frequency ranges it takes a different noise spectral density to have equivalent loudness to a sine wave of a given sound pressure level. For more details, the topic is explained in reference [6].
Noise, shaped to an equal loudness (adjusted for noise) contour, is very pleasant and smooth sounding, drawing no attention to itself through emphasis on any particular frequency range. We believe it uniformly excites sensors across the span of the cochlea. The benefit of using such curves for noise shaping the prequantiser error is that to the extent that the added noise is perceivable it has a benign and stable character that slips into the background and is readily ignorable. In contrast, a noise spectrum based on masking theory might be imperceptible if the signal genuinely does completely mask it, but if the addition does actually alter perception even slightly then having the noise spectrum closely tied to the signal spectrum risks interpretation by the listener as signal distortion rather than background noise.
Consequently, we believe it is preferable to minimise the audibility and objectionability of the noise added in prequantization when viewed in isolation rather than try to exploit additional spectral regions which the signal is said to mask.
The shape of uniformly exciting noise curves does vary with level, and arguably it is preferable to use a curve appropriate to the actual loudness of the noise.
However, this is a small matter since the uniformly exciting noise curves are broadly parallel and also it is difficult to determine the correct curve to use since an audio codec typically does not know the acoustic gain of the replay system and consequently the actual SPL at the listener. The goal of a high-resolution codec is for the noise floor to be inaudible, so we propose using the curve for uniformly exciting noise at the threshold of audibility.
ISO 389-7:2019 gives thresholds for both free field and diffuse field listening conditions. Experimentally we find that noise shapers designed off the free field threshold sound preferable to those that attempt to integrate the diffuse field thresholds, and Figure 6 shows the resultant shapes in detail with the low frequency noise shaping advantage depending on the amount of noise boost in the plateaued region.
Dynamic prequantiser noise shaping Whether because the noise shaping follows a dynamically computed auditory masking threshold as taught by Gerzon or a variable high frequency boost or shape as taught above, it is desirable to change the noise shaping transfer function (1 + A(z-1))/(1 + B(z-1)) from time to time responsive to changing characteristics of the audio signal.
Indeed the ability to dynamically change the noise shaping transfer function is a key advantage to a prequantised codec. Crucially the decoder does not need to know anything about the noise shaping applied. In contrast a transform codec achieves a frequency dependent noise floor by means of band scale factors which need communicating to the decoder. This costs data rate, but it also means the format specification needs to standardise exactly what the set of possible spectral noise shapes are. In contrast in a prequantised codec, the lack of need for standardisation means the encoder has considerable freedom in how it produces its information-reduced version of the audio and there's great potential for later post standardisation improvement in technique.
Consequently, there is a need to consider how to change (1 + A(z-1))/(1 + B(z-1)) sensibly without troublesome artifacts. One possibility is to change the coefficients gradually, which is computationally expensive and requires a suitable coefficient trajectory to be provided. Another is to change them instantaneously at a block boundary (perhaps synchronously with a change in Δ), in which case consideration needs to be given to avoiding artifacts on the change. Preferably A(z-1) should be kept constant and only B(z-1) altered. This is because altering A(z-1) without carefully adjusting its filter history introduces a discontinuity to the impulse response which varies with delay. There is no such issue with altering B(z-1), whose filter history remains valid across a coefficient change as it is the total prequantiser alteration actually heard by the listener.
Reduced sample rate Preferably the prequantiser is able to dynamically decide to reduce sample rate, typically by a factor of 2 from around 96kHz to around 48kHz but other ratios could be implemented. To facilitate this the lossless codec has to be able to accommodate blocks containing half as many samples as usual, and the full sample rate block size should be constrained to be divisible by 2. Preferably still, the reduction in sample rate triggers a balancing upsampling on the output of the decoder.
Since this mode may be engaged or disengaged part way through a stream it is important to minimise any audio artifacts associated with the change. Preferably the operation of the decoder around the change is standardised so that the encoder can act to minimise artifacts in the knowledge of the full signal processing chain.
Even so, it is not desirable that the sample-rate should change frequently; it is better for it to stay reduced than to briefly increase.
Preferably sample-rate reduction is not performed in response to changes in the audio characteristics but in response to changes in transmission conditions causing the available data rate to be insufficient for satisfactory operation at the higher sampling rate. Preferably still, there is delay and hysteresis on the decision to restore the higher sample rate, so that full sample rate operation is only restored once higher capacity has been stably available for a reasonable period. This is to guard against the full sample rate being only transiently engaged.
Preferably the lossless codec appropriately adjusts internal state on the change.
For example, a predictor may carry the recent history of the audio across block boundaries for use in predicting the early samples of the next block. On a change of sample rate these history values would preferably be modified to represent plausible values for what they would have been had the previous block been coded at the new sample rate. The details of this modification need to be standardised so that both encoder and decoder perform the identical modification so as to keep the codec operating losslessly.
Reduction to mono Preferably the lossless encoder is able to code two identical channels to very little more data rate than one of the channels on its own. It is likely to do so by subtracting the first channel from the second channel and then, since the difference is identically zero, this modified channel should encode to very little data. This capability can be exploited by the prequantiser by converting such a pair of channels to carry identical audio (perhaps the average of the two channels), thus reducing the data rate. Of course this is quite a perceivable change and not likely to be compatible with a claim of high resolution reproduction. But it is still a useful strategy to extend codec operation to data rates below those where satisfactory operation with independent channels is possible.
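The mechanism can be illustrated with a small sketch (our own illustrative code; a real encoder would of course entropy-code the residuals rather than return raw lists):

```python
def to_identical_pair(left, right):
    """Replace two independent channels with two identical channels
    carrying their average (one possible choice of shared signal)."""
    mono = [(l + r) // 2 for l, r in zip(left, right)]
    return mono, list(mono)

def code_second_as_difference(ch1, ch2):
    """Code channel 2 as its difference from channel 1; for an identical
    pair the difference is identically zero and encodes to almost no data."""
    return ch1, [b - a for a, b in zip(ch1, ch2)]

m1, m2 = to_identical_pair([10, 20, 30], [14, 26, 34])
_, diff = code_second_as_difference(m1, m2)
assert diff == [0, 0, 0]   # an all-zero channel costs almost nothing to code
```

The averaging step is perceivable, as noted above, but the coding side is mechanical: once the two channels are identical the difference channel collapses to zero.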
As with sampling rate reduction, this is an operating mode that should preferably be engaged in response to poor transmission channel capacity rather than in response to characteristics of the supplied audio. Once again it should preferably be engaged or disengaged deliberately, not briefly and care should be taken to avoid artifacts from the disappearance or reappearance of the difference signal.
In particular, since the difference signal is noise shaped (by virtue of each channel individually being noise shaped), the methods of the section "Transition to Lossless" below will be beneficial in stopping a click arising from the cessation of noise shaping when the difference channel becomes identically zero.
We also point out that normally channels are quantised to integer multiples of Δ with a pseudorandom offset, and that pseudorandom offset should be a different pseudorandom sequence for each channel. Two channels being identical is a special case that differs from this general policy and the lossless encoder needs to be able to recognise and code this special case.
Transition to Lossless Having discussed various possible means by which the prequantiser might reduce the audio information content, there is also the important possibility that it might choose to leave the audio unmodified in which case the whole codec becomes lossless.
Having this operating mode available opens up the possibility of primarily lossless operation, but smoothly transitioning to lossy if channel capacity degrades or perhaps for the most difficult sections of the audio where the information content exceeds the channel capacity.
In lossless operation, the audio is unaltered by the prequantiser so the audio presented to the lossless encoder will have a zero offset rather than a pseudorandom offset. Consequently, the lossless codec needs to have the flexibility to operate on audio with or without a pseudorandom offset.
It is also important to be able to slip in and out of lossless mode without audible artifacts. Transitioning to lossy operation is straightforward, starting up noise shaped quantisation. But transitioning to lossless operation presents a problem.
Noise shaping operates on the assumption that error committed on this sample can have its audibility reduced (spectrally shaped) by making alterations to future samples. But if we go lossless then those future samples cannot be altered. The error committed on the last lossy sample cannot be shaped at all, the error on the previous lossy sample can only have very limited shaping et cetera. This causes a click at the point of stopping noise shaping.
Now if lossless means 16 or 24 bit audio then the click, whilst regrettable, may be quite hard to perceive. But if lossless means a cascading re-encode of a low rate stream prequantised according to the invention, the quantisation level will be rather higher, the click more of a problem and the need to mitigate the click more important.
To go lossless without introducing a click at the transition, we need a method to quantise and noise shape a finite set of samples, jointly quantising them so as to minimise the spectrally weighted error. We should transition from normal noise shaping to this technique for the last n lossy samples. Larger values for n allow better shaping of the quantisation errors but will be more computationally expensive. In practice even moderate values of n like 4 or 8 allow worthwhile reductions in the click and it is unlikely to be worth using n larger than 32.
This joint quantisation can be done by least squares. Our model for setting up the least squares problem is shown in Figure 7. The difference d_t between the quantised audio and the original audio is fed through a weighting filter W(z^-1) and we evaluate (and seek to minimise) the power of the resulting signal e_t. Since the noise shaping filter expresses our view about how important errors are in various spectral regions, a good choice of weighting filter is the inverse of the noise shaping transfer function, as follows: W(z^-1) = (1 + B(z^-1))/(1 + A(z^-1)). Let W(z^-1) have impulse response 1 + Σ_{i≥1} w_i z^-i, so e_t = d_t + Σ_{i≥1} w_i d_{t-i}. Normal noise shaping fits this model: at time t, the noise shaper evaluates Σ_{i≥1} w_i d_{t-i} and greedily chooses the permissible value of d_t that minimises |d_t + Σ_{i≥1} w_i d_{t-i}| and hence e_t². It disregards the influence of this choice on subsequent values of e since it will have freedom to choose later values of d to minimise them.
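The greedy noise shaper described above can be sketched as follows. This is our illustration rather than anything specified in the document: the step size `delta`, the filter tail `w` (holding w_1, w_2, ...) and the use of nearest-multiple rounding are all assumptions for the sketch.

```python
def noise_shape(samples, w, delta):
    """Greedy noise-shaped quantisation to multiples of delta.

    At each time t the shaper evaluates the weighted history
    sum(w[i] * d[t-i]) and picks the quantised output whose error d_t
    minimises |d_t + sum(w[i] * d[t-i])|, i.e. minimises e_t^2.
    """
    d_hist = [0.0] * len(w)          # d[t-1], d[t-2], ... (errors already committed)
    out = []
    for x in samples:
        feedback = sum(wi * di for wi, di in zip(w, d_hist))
        # permissible outputs are multiples of delta; the one nearest to
        # x - feedback minimises |d_t + feedback|
        q = delta * round((x - feedback) / delta)
        d = q - x                    # error committed on this sample
        out.append(q)
        d_hist = [d] + d_hist[:-1]
    return out
```

With `w = [-1.0]` the weighting has a zero at DC, pushing the quantisation error away from low frequencies.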
But when later values of d will be zero, because the quantiser will be operating losslessly, this assumption breaks down and those subsequent values of e need taking into account when choosing d_t.
Suppose (d_t | t < 0) are fixed, having previously been chosen by noise shaping, and (d_t | t ≥ n) will be 0 because the quantiser will be operating losslessly. The task is to choose permissible {d_0, d_1, ..., d_{n-1}} in order to minimise Σ_{t≥0} e_t². We will initially discuss how to find suitable {d_0, d_1, ..., d_{n-1}} ∈ Z^n and then how to modify the approach to account for quantiser step size and any pseudorandom offset applied to those n samples.
Expressing this in matrix terms, let H = (d_{-1}, d_{-2}, ...)^T (the historic error), X = (d_0, d_1, ..., d_{n-1})^T (the offsets to be chosen) and E = (e_0, e_1, e_2, ...)^T (the future output of the weighting filter). Then Σ_{t≥0} e_t² = E^T E = ||E||² and E = W1 H + W2 X where W1 & W2 are Toeplitz matrices containing coefficients from the impulse response of W(z^-1).
So we want to find X ∈ Z^n to minimise ||W1 H + W2 X||² where matrices W1 & W2 are known at design time and H is a vector of the recent quantisation error.
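The matrix formulation above can be checked numerically. The sketch below is our illustration: it builds the weighted error e_t both by direct convolution with the impulse response of W and via the Toeplitz row sums corresponding to E = W1 H + W2 X, and the two agree. The function names and vector layouts are our own choices.

```python
def w_impulse(w_tail, length):
    """Impulse response 1, w_1, w_2, ... truncated/zero-padded to `length`."""
    h = [1.0] + list(w_tail)
    return (h + [0.0] * length)[:length]

def weighted_error(d_hist, d_future, w_tail, T):
    """e_t for t = 0..T-1 by direct convolution; d_t = 0 for t >= n (lossless)."""
    h = w_impulse(w_tail, T + len(d_hist))
    d = list(reversed(d_hist)) + d_future      # ..., d_-2, d_-1, d_0, ..., d_{n-1}
    start = len(d_hist)                        # index of t = 0
    e = []
    for t in range(T):
        acc = 0.0
        for i, hi in enumerate(h):
            j = start + t - i
            if 0 <= j < len(d):
                acc += hi * d[j]
        e.append(acc)
    return e

def toeplitz_form(d_hist, d_future, w_tail, T):
    """Same e_t via explicit Toeplitz rows: E = W1 H + W2 X."""
    h = w_impulse(w_tail, T + len(d_hist))
    H, X = d_hist, d_future                    # H = (d_-1, d_-2, ...)
    e = []
    for t in range(T):
        # W1 row t, column k refers to d_{-(k+1)}, coefficient h[t+1+k]
        acc = sum(h[t + 1 + k] * H[k] for k in range(len(H)) if t + 1 + k < len(h))
        # W2 row t, column k refers to d_k, coefficient h[t-k]
        acc += sum(h[t - k] * X[k] for k in range(len(X)) if 0 <= t - k < len(h))
        e.append(acc)
    return e
```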
One method of solving is as follows. First, W1 & W2 both have a large (perhaps even countably infinite) number of rows and W1 has a large number of columns. It would be convenient to work with smaller matrices. Our first act is to decompose W2 = QR where Q is column orthogonal and R is upper triangular.
This reduces the problem to minimising ||(Q^T W1)H + RX||² where (Q^T W1) has n rows and R is n x n.
If R were diagonal, then solving this in integers would be as easy as solving it for real values. But that is typically far from true.
However, there are known lattice reduction techniques, such as LLL (Lenstra-Lenstra-Lovász), which allow us to find an integer valued unit determinant matrix V such that RV is nearly orthogonal.
Substituting Y = V^-1 X we can minimise ||(Q^T W1)H + (RV)Y||² for Y ∈ Z^n and then transform to a solution of the original problem by X = VY.
Because RV is "nearly" orthogonal, this is a far better behaved problem than our original one. We can be sloppy and leap ahead to an approximate solution at this point: minimising ||(RV)^-1 (Q^T W1)H + Y||² is a closely related problem and is easily solved by rounding each row of -(RV)^-1 (Q^T W1)H to yield Y. Hopefully such a solution is close to the minimum of ||(Q^T W1)H + (RV)Y||².
Better however is another round of QR decomposition, RV = Q1 R1, which transforms our problem to minimising ||Q1^T Q^T W1 H + R1 Y||². R1 is upper triangular, and thanks to RV being "nearly" orthogonal, R1 is nearly diagonal. We can produce a reasonable solution for Y by solving for each row in turn (starting with the last) with back substitution. This is not guaranteed to find the Y that achieves the actual global minimum, because R1 is not actually diagonal, but often does in practice. It gives better results than the sloppy method above and far better results than trying to solve ||Q^T W1 H + RX||² directly, ignoring the ill conditioned nature of the problem.
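The back substitution with rounding described above can be sketched as follows. This is our illustrative implementation: given an upper triangular, nearly diagonal R1 and the precomputed vector c = Q1^T Q^T W1 H, it chooses integer Y values from the last row upwards, rounding each residual to the nearest integer.

```python
def int_back_substitute(R1, c):
    """Integer back substitution for approximately minimising ||c + R1 @ Y||^2.

    R1: upper triangular matrix (list of rows), assumed nearly diagonal.
    c:  vector of the same length n.
    Solves rows from the last upwards, rounding each coefficient to the
    nearest integer given the Y values already fixed.
    """
    n = len(c)
    Y = [0] * n
    for i in range(n - 1, -1, -1):
        # residual of row i after accounting for already-chosen Y values
        s = c[i] + sum(R1[i][j] * Y[j] for j in range(i + 1, n))
        Y[i] = round(-s / R1[i][i])
    return Y
```

When R1 really is diagonal this recovers the exact integer minimiser; off-diagonal terms make it a good heuristic rather than a guarantee, as the text notes.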
Having produced a satisfactory value of Y, we can now transform to the desired variables X by computing X = VY. A key point is that the majority of this computation can be performed in advance at design time, leaving only a small amount to be performed in real time when the need arises to jointly quantise the final n samples before going lossless.
(Q1^T Q^T W1), R1 and V only depend on W (which specifies how we weight error in different spectral regions) and can be prepared ahead of time and tabulated for later use.
The run-time procedure then is to take recent values of quantiser error, premultiply them by a precomputed stored matrix (Q1^T Q^T W1) and then solve ||Q1^T Q^T W1 H + R1 Y||² by back substitution, where R1 is precomputed and stored. The resultant integer vector Y is then premultiplied by a third precomputed and stored matrix V (which is integer valued and unit determinant) to give the resultant n values for {d_0, d_1, ..., d_{n-1}}.
The sloppy approach takes recent values of quantiser error, premultiplies them by a precomputed stored matrix -(RV)^-1 (Q^T W1) and then rounds each row of the resultant column vector to give an integer vector Y. This is then premultiplied by a second precomputed and stored matrix V as before.
QR decomposition is not the only approach for solving least squares problems and there will be alternative ways of arranging some of the arithmetic. The key element is that the problem is solved for a transformed set of variables (Y) with respect to which the problem is better conditioned and transformed to the desired values by an integer valued and unit determinant matrix.
Summary of algorithm
The flowchart in Figure 8 summarises the steps involved in the above algorithm.
At design time, a desired frequency weighting filter (which may be the inverse of the noise shaping transfer function) is used to formulate a least squares problem in n variables.
The potentially large matrices in this least squares problem are initially reduced to n x n matrices describing the same minimisation problem.
The problem is probably ill conditioned so a lattice reduction algorithm, for example LLL, is used to find a different basis that can be transformed to the original one by an integer valued unit determinant matrix.
Matrices describing this better conditioned problem in a suitable form for easy solution are stored for run time use, along with the integer valued unit determinant matrix to transform a solution to the better conditioned problem into the original variables.
At runtime, the noise shaping filter state captures all the relevant information about the noise that needs to be quenched on stopping noise shaping. It is premultiplied by a pre-stored matrix to map it into the n dimensional minimisation problem.
The problem is then solved for integers in the better conditioned basis. This might simply involve rounding each coefficient for a quick and sloppy approach, or more accurately involve back substitution using a pre-stored upper triangular matrix.
The solution is then transformed back into the original basis by multiplying by the pre-stored integer valued unit determinant matrix.
Modification for step size Δ A non-unit step size Δ can be accommodated by dividing H by Δ, solving for integer valued X and then restoring the scale by multiplying X by Δ. Optionally the multiplication by Δ might be folded into the prestored matrix V, in which case the prestored matrix would have determinant Δ^n instead of 1.
Modification for pseudorandom offsets If the n values are to have pseudorandom offsets, then this can be accommodated by extending the vector H with n initial rows containing the negated offsets and similarly appending a copy of W2 on top of W1. Having solved for X the offsets can be added back.
Modification for different forms of noise shaping function As expressed above, we used the potentially infinite impulse response of W to express the weighted error signal in terms arising from prior errors as W1 H. If the noise shaping filter is all-pole, these prior errors are precisely its state variables. If it has some other form, then it will be operationally convenient to use its state variables for H instead of the prior errors, which may not be readily accessible or may need to be more numerous. This is easily done by altering W1 to suit so that W1 H is still the weighted error signal. The altered W1 probably will not be completely Toeplitz, but this is not a problem as the calculations do not make use of that property.
Modification for low influence vectors It may be the case that one or more diagonal elements of R1 are very small, corresponding to coefficients in Y that have very little influence on the metric.
Rather than allowing back substitution to choose large values for these coefficients to achieve minor reductions in the metric, it may be better to decide they will be set to zero. Eliminating these coefficients will reduce the size of the precomputed and stored matrices.
Computational cost Whilst there is considerable computational cost at design time in transforming noise specifications into suitable matrices to store, the runtime cost for solving a particular instance of the problem is small.
The initial multiplication computing (Q1^T Q^T W1)H is a similar operation to continuing to operate the noise shaping filter for another n samples.
Solving for Y by back substitution involves, for each value: subtracting the dot product of previously computed Y values with a precomputed vector from the corresponding element of (Q1^T Q^T W1)H and rounding the result. This is less than n multiply-accumulates per coefficient plus the quantisation that would have happened if the noise shaper were still in operation.
Premultiplying Y by V is once again n multiply-accumulates per coefficient.
So the incremental computational cost of the technique over a hypothetical alternative of continuing the noise shaping for the n samples is an insignificant 2n multiply-accumulates for each of n samples.
Commonality of step size across channels Different decisions can be taken about whether all channels should be constrained to have a common step size A, or whether channels should be allowed to have different step sizes. There are also intermediate possibilities, for example a 5.1 multichannel signal might sensibly have one step size for {L,R,C}, another for {Ls,Rs} and a third for {Lfe}.
Allowing different step sizes gives the prequantiser more flexibility, but is probably not useful for closely related channels like {L,R,C}. Step sizes need communicating to the decoder, so there is a data rate cost in increasing the number of values to communicate. It is also helpful for channels to have a common step size if they might be strongly correlated to help the lossless encoder take advantage of that correlation for data compression.
If the prequantiser might reduce sample-rate then channels constrained to the same step size would preferably also be constrained to operate at the same sample-rate.
Current block analysis Preferably the currently supplied block is analysed in order to estimate the amount of data it will losslessly encode to.
Figure 9 illustrates a sensible method of analysis.
Each channel of audio in the block is windowed and an ACF (autocorrelation function) of the windowed audio calculated. Preferably this ACF has one more term than the order of prediction filter that will be used in the lossless encoder.
For each prequantiser configuration of interest, we can perform the following operations on each channel's ACF: * Compute the ACF of the quantisation noise introduced by the quantiser. This is most easily done by precomputing and storing the ACF of the noise introduced by unit quantisation, and multiplying by Δ².
* Add the quantisation ACF to the signal ACF to give us an estimate of the prequantised ACF * Apply the Levinson Durbin algorithm to evaluate the residual power P after filtering with a well chosen FIR filter.
* Finally the encoded data rate per sample can be estimated as log2(P/blockSize) - log2(Δ) + K where blockSize is the number of samples in the block and K is a constant.
We could derive a value for K from the entropy of the normal distribution and the windowing function, but it is better to empirically measure it as this allows for inefficiencies in the lossless coding and the non-normal distribution of the prediction residual.
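The per-channel estimate described above can be sketched as follows. The Levinson-Durbin recursion is standard; the `estimate_bits_per_sample` wrapper, its argument names and the choice of constants are our illustration of the formula in the text, not a definitive implementation.

```python
import math

def levinson_durbin(acf, order):
    """Return residual power P after fitting `order` LPC coefficients to an ACF."""
    err = acf[0]
    a = [0.0] * (order + 1)
    a[0] = 1.0
    for m in range(1, order + 1):
        acc = sum(a[i] * acf[m - i] for i in range(m))
        k = -acc / err                      # reflection coefficient
        a_new = a[:]
        for i in range(1, m + 1):
            a_new[i] = a[i] + k * a[m - i]  # Levinson recursion update
        a = a_new
        err *= (1.0 - k * k)                # residual power shrinks each order
    return err

def estimate_bits_per_sample(acf, order, block_size, delta, K):
    """Estimated coded data rate per sample: log2(P/blockSize) - log2(delta) + K."""
    P = levinson_durbin(acf, order)
    return math.log2(P / block_size) - math.log2(delta) + K
```

For a white signal (ACF = [1, 0, 0, ...]) prediction gains nothing and P stays at the signal power; for a strongly correlated signal the residual power, and hence the bit estimate, drops.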
The estimate for losslessly encoding the whole block is then the sum of the channel estimates plus an allowance for bitstream overhead.
Optionally this could be extended to evaluate the benefit of exploiting correlation between channels by also performing the operation for channel difference signals and selecting the lower bit estimate between a channel and the corresponding difference signal.
This analysis comprises a reasonable amount of computation, but it is computation the lossless encoder would want to perform anyway in order to design its prediction filter. Preferably the analysis (including the noise ACF for the prequantiser configuration actually applied) is supplied to the lossless encoder to save it duplicating the work.
The discarded analysis work comprises evaluations of prequantiser configurations that do not end up being used. If desired, this can be minimised, with a slight loss in accuracy, by only using the early terms of the ACF. This is because, in practice, most spectral variation is exploitable by small (2nd order) prediction filters, with diminishing returns from increasing order.
Preferably, the measured ACF can also be used to guide choice of the noise shaping filter based on the broad spectral characteristics of the audio. As discussed above, it is preferable that such choices only affect the transfer function shape above a threshold frequency.
Optionally, such choices are made based on the ACF of earlier blocks rather than the current block, to avoid an audio event in mid-block causing a spectral change in the noise at the start of the block.
However the prequantization architecture allows for noise shaping to change mid-block so more sophisticated signal analysis could be used to investigate a change in audio characteristics and narrow down where to apply the noise shaping change.
Lossless Encoder signal processing Figure 10 shows an overview of the lossless encoder signal processing.
A block of multichannel audio is matrixed to exploit inter-channel redundancies.
The opportunity for compression gain here is less significant than one would hope, and so we do not advocate anything more sophisticated than conditionally subtracting one channel from another to create a difference channel. This suffices to exploit the situation where a pair of channels carries a mono, or near mono, signal.
Preferably, the potential for matrixing creates a constraint on the prequantiser that potentially matrixable channels should have a common step size Δ (and indeed sample rate). This ensures that the result of the subtraction still has a known remainder modulo Δ and avoids issues with further quantisation.
Subsequently, each channel is then processed independently.
This starts by predicting each sample value from prior ones and subtracting to create a prediction residual (linear predictive coding). An equivalent perspective is that the encoder filters the audio by a filter 1 + P(z^-1) whose leading impulse response coefficient is unity, the filter coefficients being chosen to whiten the spectrum of the resultant prediction residual. In Figure 10 we have chosen to use the latter perspective; the negated output of P(z^-1) forms a prediction.
The prediction residual is then quantised to a multiple of Δ (the prequantisation step size). This quantisation destroys no information since each range of Δ consecutive values for a sample contains only one possible quantised value.
Surprisingly there is no need to adjust operation for pseudorandom offsets at this point.
The quantised prediction residual can then be divided by Δ to yield an integer for further processing. (We separate the operations of quantising to a multiple of Δ and division by Δ to ease subsequent discussion; clearly an implementation can combine them into a single integer to integer division operation with suitable rounding behaviour.)
The step of quantising the prediction residual has some flexibility, for example choosing the highest multiple of Δ that does not exceed the input (Q_Δ(x) = Δ·floor(x/Δ)) or choosing the lowest multiple of Δ that is no less than the input (Q_Δ(x) = Δ·ceil(x/Δ)). Both of these choices however introduce a constant offset to the output, slightly shifting the distribution of the prediction residual. Depending on the entropy coding technique, it may be preferable to avoid this shift by, for example, calculating Q_Δ(x) = Δ·floor(x/Δ + 0.5). Alternatively, a compensating offset can be introduced into the prediction. Whatever choice of rounding behaviour is made, the choice needs to be standardised because, faced with a predictor output value, the decoder needs to be able to determine precisely what range of Δ possible input values could have given rise to that output value.
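The three rounding conventions just described can be sketched as follows (our illustration; each maps any input to a multiple of Δ, so each range of Δ consecutive values contains exactly one representative and no remainder information is destroyed):

```python
import math

def q_floor(x, delta):
    """Highest multiple of delta not exceeding x."""
    return delta * math.floor(x / delta)

def q_ceil(x, delta):
    """Lowest multiple of delta not less than x."""
    return delta * math.ceil(x / delta)

def q_round(x, delta):
    """Nearest multiple of delta, avoiding the half-step constant offset."""
    return delta * math.floor(x / delta + 0.5)
```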
Deferring discussion of the adjustment block, each sample value is then split into two. It turns out that the prediction residual has a pretty stable shape of distribution (something like a thick tailed normal distribution) but a variable standard deviation. So generically we want to divide the prediction residual by a scale factor (which we'll call level) to yield a deviate with a stable distribution for entropy coding.
Typical lossless coding practice is to constrain the scale factor to be a power of two, say 2^k, and strip off the k fractional bits after the division (which we'll call the touchup) to leave a value we'll call the msbs (as they are the most significant portion of the binary word). These msbs are entropy coded (often with a Rice code, a particular choice of Huffman code). The k stripped off fractional bits are approximately uniformly distributed so there is no benefit in entropy coding them and their verbatim value is appended to the Rice coded msbs to make a composite codeword.
However, the msbs have the property of being a coarse approximation to the prediction residual and a coarse approximation to the prediction residual is sufficient for the lossless decoder to decode a coarse approximation to the audio.
Preferably this property is exploited by packaging the touchup data for the block separately to the entropy coded msb data (henceforth the coarse data) for a block. Variable delay FIFO buffering is a key component of the prequantised codec, but it comes with hazards to the buffered data. Mid-stream startup and packet loss are two scenarios where unbuffered data can be accessed immediately but buffered data is not available for several blocks. We propose buffering the touchup but not the coarse data so that in circumstances when the buffered data is unavailable an approximate decode can still be performed from the coarse data.
Since there will be occasions when the approximate decode is heard, we are concerned to minimise the audibility of the approximation. This is the purpose of the adjustment block shown in Figure 10. By adding the previous value of the touchup to the current prediction residual it noise shapes the split with a transfer function having a zero at DC and thus reduces the audibility of the approximation error.
Some additional care is needed across block boundaries. If Δ changes from Δ1 to Δ2, then the delayed touchup value needs multiplying by Δ1/Δ2 to match the change in scale of the integer prediction residual.
There are other arithmetical rearrangements which achieve the same effect. For example, the adjustment could be multiplied by Δ and added before the division by Δ. With this rearrangement there would be no need to adjust the delayed value on a change of Δ. This adjustment to the split is actually slightly detrimental to the lossless encoder's compression efficiency because it increases the entropy of the msbs and hence the amount of coarse data. However the improvement in quality of the approximate decode more than justifies the slight increase in data rate.
The technique is not limited to a single zero. More complex adjustment could be performed to implement an arbitrary noise shaping transfer function but a single DC zero is probably the most sensible compromise.
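The adjustment described above can be sketched as follows. This is our reading of the Figure 10 adjustment, not the document's code: adding the previous touchup to the current residual before splitting means each touchup error enters one sample positively and the next negatively, i.e. is shaped by (1 - z^-1) with a zero at DC.

```python
def split_with_adjustment(residuals, k):
    """Split integer residuals into msbs and k touchup bits, with the previous
    touchup value fed forward so the split error has a zero at DC."""
    prev_touchup = 0
    msbs, touchups = [], []
    for r in residuals:
        adjusted = r + prev_touchup
        m, t = adjusted >> k, adjusted & ((1 << k) - 1)
        msbs.append(m)
        touchups.append(t)
        prev_touchup = t
    return msbs, touchups
```

A decoder holding both streams inverts the adjustment exactly: each residual is (m << k) + t minus the previous touchup, so the round trip is lossless.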
Lossless Decoder buffering and signal processing Figure 11 shows an overview of the lossless decoder buffering and signal processing.
Processes generally match those in the encoder, but with inverse effect and undertaken in reverse order.
Data representing the msbs and touchup is read from the incoming packet. The msbs data is entropy decoded, inverting the entropy encoding in the encoder, whilst the packet's touchup data is pushed into a FIFO buffer.
For each sample, touchup data is pulled from the FIFO buffer and joined to the entropy decoded msbs inverting the Split operation in the encoder. To invert the encoder adjustment operation, the previous value of touchup is then subtracted and after multiplication by A, this produces a replica of the quantised prediction residual in the encoder.
Decoder prediction We now explain how the decoder prediction block inverts the encoder prediction block.
By an inductive hypothesis prior output values from the decoder prediction unit match prior input values to the encoder prediction unit and so the output from the decoder prediction filter replicates the output from the encoder prediction filter. We will call this common value p. We will also term the current input and output of the encoder predictor unit x and y respectively.
The lossless encoder encoded audio whose remainder modulo Δ was equivalent on this sample to some value d. The lossless decoder needs a replica of that value d. Figure 11 copies the dither generation means from Figure 4, but there are a couple of special cases where something different is needed. If the prequantiser was operating in lossless mode and not altering the signal, then d ≡ 0 modulo Δ rather than being derived from the pseudorandom generator. Also, if the channel is matrixed, and so carries the difference between two prequantised channels, then the remainder modulo Δ is equivalent to the difference between the pseudorandom deviates for the individual channels.
Invertibility follows from noting that in the encoder y = x + p + ε, where ε is the error introduced by the encoder quantiser, and x ≡ d modulo Δ. The input to the encoder quantiser is x + p and the input to the decoder quantiser is d + p. Since x ≡ d modulo Δ, both quantisers add the same error ε to their input so long as they are standardised to have identical rounding behaviour.
So the encoder output is y = x + p + ε and the decoder output is x' = y + d - (ε + d + p), which equals x as required, establishing lossless reproduction.
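The invertibility argument above can be checked numerically with this minimal sketch (our illustration; floor rounding is one permissible standardised choice, and the function names are our own):

```python
def quantise(v, delta):
    """Quantise to a multiple of delta. Encoder and decoder must use the
    identical rounding rule; Python's // is floor division for all signs."""
    return delta * (v // delta)

def encode_sample(x, p, delta):
    """Encoder: y = quantise(x + p) = x + p + eps."""
    return quantise(x + p, delta)

def decode_sample(y, p, d, delta):
    """Decoder: x' = y + d - quantise(d + p). Since x ≡ d (mod delta),
    both quantisers commit the same error eps and x' == x exactly."""
    return y + d - quantise(d + p, delta)
```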
It will be appreciated that there are many equivalent ways of arranging the computation and some of them have a negated signal running through the decoder quantiser compared to the encoder quantiser. In such cases the decoder quantisation behaviour would need to be standardised to be complementary to that of the encoder as noted in reference [1].
If the recent output of the decoder prediction unit does not correctly replicate the input to the encoder prediction unit, then the inductive hypothesis above does not hold and there is no reason to expect the next one to do so. However it sometimes happens that it does, and less frequently two samples happen to replicate the encoder prediction input, and even more occasionally sufficient output samples happen to attain the correct values to ensure that all future ones do. For small orders of prediction filter (eg 4), this stochastic mechanism is adequate to ensure lossless operation is acquired in an acceptable time.
Preferably the quantiser in both the encoder and decoder is noise shaped. Exact invertibility still holds subject to identical noise shaping in both encoder and decoder. Noise shaping can help to reduce the audibility of noise during the period while the decoder is acquiring matching state to the encoder. It can also accelerate this process of matching state if the noise shaping is chosen to reduce the excursion of the noise at the decoder predictor output.
Decoder matrixing Preferably the lossless codec has the capability to encode the difference between two channels instead of the channel itself. This allows it to reduce data rate by exploiting correlation between channels when it is present.
If a channel is matrixed then the decoder should undo the matrixing after the predictor by adding the other decoded channel to the difference channel.
However, matrixing also has implications for the pseudorandom offsets to be used on the difference channel. The pseudorandom sequence defines the offsets used at the output of the prequantiser, which is to be losslessly reproduced at the output of the decoder. However, in the decoder, the pseudorandom offsets are applied in the predictor, which is inside the matrixing operation. Consequently, the pseudorandom offsets to be applied in the predictor on a difference channel should be the difference of the pseudorandom sequences for each channel, so that when the other channel is added back the correct pseudorandom offset is restored. There is no need to calculate the difference modulo Δ, as it does not affect the predictor output.
Errors in touchup If the FIFO buffer is unable to deliver the correct touchup then the touchup signal will be incorrect. However, the adjustment means that each erroneous touchup value is added to one sample and subtracted from the next. The touchup error is thus filtered by (1 - z^-1) and then filtered by the Prediction unit whose frequency response roughly approximates the current spectrum of the audio. The inclusion of (1 - z^-1) in the transfer function reduces the audible impact of the error.
If the decoder knows the FIFO buffer is currently unable to deliver the correct touchup then it is also beneficial to minimise the touchup error by feeding a constant value to the Join unit instead of pulling incorrect data from the FIFO buffer.
Packet Structure Figure 12 illustrates a possible structure for an encoded packet. This example packet contains 3 blocks and 2 channels of audio.
The packet starts with a packet header, and then 3 blocks of audio are described each with a block header and then coarse data for each channel. We will term all of this the forward coded data.
The touchup data however is dealt with separately, reflecting the variable delay FIFO buffering it experiences in the encoder and decoder. The rest of the packet, however large or small it might be, is filled with touchup data pulled from the encoder FIFO buffer.
A decoder is likely to want to completely decode each block before it moves onto the next one, which involves pulling touchup data from its FIFO buffer. Sometimes the decoder FIFO buffer may be nearly empty when the packet arrives. In Figure 12 it held touchup for block t and some but not all of the touchup for block t+1. So the touchup data contained in this packet needs reading in order to decode block t+1 and t+2, and consequently it needs pushing into the FIFO buffer on receipt of the packet.
Preferably the touchup data fills the packet starting from the end of the packet working backwards in reverse order towards the end of the forward coded data.
The advantage of this layout is that the forward coded data is variable sized, so the decoder does not know where it ends until it has finished entropy decoding the third block. We wish to avoid wasting data space in the packet explicitly indicating where the forward data finishes and the touchup data starts. However, any decoder which has received a packet of data must know by some means or other how long the packet is. If we start the touchup data at the end of the packet, we can avoid requiring such a length field in the packet.
With the touchup data running from the end of the packet backwards, the decoder does not know where the touchup data finishes until it has finished decoding the whole forward data. But that is not a problem because it can push the whole packet into the FIFO buffer on receipt and later remove the forward data from the FIFO buffer after it has finished decoding the forward data but before receipt of the next packet.
We like to think of the packet as being a stream of bits, but bits are packaged up in computer systems into larger units like bytes and words and it is helpful if there is consistency in their endianness. If, for example, the endianness convention is least significant bit first then the forward data should be written and read least significant bit first. But as the touchup data runs backwards from the end of the packet, touchup words should be written and read the other way, most significant bit first.
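The bidirectional packing described above can be sketched as follows. The `Packet` class is our illustration (not part of the document): forward coded data is written LSB-first from the front, while touchup words are written so that they read MSB-first from the back of the packet, and no explicit boundary marker is needed so long as the two regions do not collide.

```python
class Packet:
    """Bit buffer written from both ends, as in the packet layout described above."""

    def __init__(self, size_bits):
        self.bits = [0] * size_bits
        self.front = 0                  # next free bit from the start
        self.back = size_bits           # start of the backward-written region

    def write_forward(self, value, nbits):
        """Forward coded data: least significant bit first."""
        for i in range(nbits):
            self.bits[self.front] = (value >> i) & 1
            self.front += 1

    def write_backward(self, value, nbits):
        """Touchup word: filled from the end so it reads MSB-first from self.back."""
        for i in range(nbits):          # place LSB at the outermost position
            self.back -= 1
            self.bits[self.back] = (value >> i) & 1
```

After writing, `front <= back` confirms the forward data and touchup data have not overrun one another; the decoder can recover the touchup region without knowing where the forward data ends.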
Flexible packetization Data is generally transported in packets, and for an audio codec that codes blocks of audio it would be typical to have a one to one relationship between encoded blocks and packets. If the resulting packets were not suitable for the transmission channel there might be a packet segmentation and reassembly layer, such as L2CAP over Bluetooth. Having a packet segmentation and reassembly layer has disadvantages. There is data overhead, consuming bandwidth that could have been used for better audio.
Overall delay can be increased, and errors in the transport layer, for example lost packets that cannot be redelivered in time, may cause two audio codec packets to be damaged instead of one.
Preferably our blocks are fairly short, perhaps 1-2ms. This keeps loop delay down in the encoder servo, enabling swift reaction to changes in lossless encoded data rate and allowing the noise floor to closely follow the audio events that give rise to it. An integer number of blocks is included in each packet; the packet in Figure 12 contained three. This integer may vary from packet to packet.
To support this, a packet header contains a field specifying how many blocks are contained in the packet (or alternatively each block header contains a flag specifying whether it is the last block in the packet). Preferably each block also has a sequential index associated with it and preferably the packet header also contains a field specifying low order bits of the block index for the first block in the packet. Thus, if a packet is corrupt or otherwise fails to be delivered, the decoder can deduce from the block index field in the next received packet how many blocks were described by the missing packet(s) and so decode that packet at the correct time after the correct amount of error concealment.
The benefit of having a variable integer number of blocks in each packet is that it decouples the block encoding from the packet characteristics required by the transmission channel without suffering the disadvantages of a packet segmentation and reassembly layer.
Buffering of the touchup data as described above is critical to this operation as it gives the flexibility to fill packets with slightly more or less touchup data to balance them containing slightly under or over the long-term average number of blocks.
As an illustrative example, suppose blocks describe 1ms of audio and the transmission channel provides 300 packets per second. Successive packets would contain 3, 3 and 4 blocks in sequence so that every 3 packets describe 10 blocks as required. The packets containing 3 blocks would have more space left to convey touchup data and the packets containing 4 blocks less. Generalising this, a desired average rate of p/q blocks per packet (in the above example p = 10 and q = 3) with one block of peak-to-peak jitter is implemented by putting ⌊(j + 1)p/q⌋ − ⌊jp/q⌋ blocks inside the jth packet.
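The floor-difference rule above is a one-liner; this sketch (function name ours) shows it reproducing the 3, 3, 4 pattern of the example:

```python
def blocks_in_packet(j, p, q):
    """Number of coarse blocks to place in the j-th packet (0-based) so
    that packets average p/q blocks with at most one block of
    peak-to-peak jitter: floor((j+1)p/q) - floor(jp/q)."""
    return (j + 1) * p // q - j * p // q
```

For p = 10, q = 3 this yields 3, 3, 4, repeating, and any q consecutive packets carry exactly p blocks; the 500-packets-per-second reflow of the later example corresponds to p/q = 2/1.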
Preferably the format supports all parameters that affect decoding (such as prediction coefficients, changes of prequantised step-size or sample rate, changes in entropy coding tables) changing at arbitrary block boundaries and does not constrain them to only change at packet boundaries. By this we mean that their value (if changed from the previous block) is conveyed in block headers, not by packet headers specifying values to use for the whole packet.
This is advantageous because of the buffering delay in the encoder. At the point when a block is presented to the encoder, prequantised and losslessly encoded, those encoding decisions can be made without committing to a decision about where the packet boundaries will lie. A firm decision on packet boundaries can be deferred until the encoded block emerges from the buffer for actual transmission.
Were the transmission channel capacity to degrade suddenly and unexpectedly, then the plan for where the packet boundaries lie can be expected to change. If there is timely computational capacity available to backtrack and revise prequantization and encoding of the buffered blocks then this will improve the audio outcome. But there may not be, especially in a real time environment. In this case the ability to revise the packetization strategy quickly without change to already encoded blocks is important.
Another advantage is that it allows the packetized encoded audio to be reflowed without reencoding if the data is to travel across another transmission channel with different characteristics. For example, the stream above, where successive packets cyclically contain 3, 3 and 4 blocks, could be reflowed onto another channel which had smaller packets but 500 of them per second by parsing the packets sufficiently to establish the boundaries of encoded blocks and touchup and then repacketizing them into new packets each of which contained 2 blocks.
Decoder buffer synchronisation
At the start of an encoded stream, the decoder knows its FIFO buffer is empty. If decode starts there and proceeds without errors, the decoder can pull the correct amounts out of the FIFO buffer, exactly matching the amounts the encoder put in to describe the touchup data. In such a situation, there is no need for synchronisation.
But it is desirable for a streaming audio format to support the decoder starting up mid-stream at an arbitrary packet boundary, or to recover from missing packets.
Preferably some packet headers include a field which allows the decoder to synchronise its FIFO buffer to contain the correct amount of data at the start of the packet. In this way, so long as at least that amount of touchup data has been delivered in previous packets since decode started (or restarted) the decoder can identify the correct touchup data to use for decoding the first block in the packet and subsequent blocks.
Even if insufficient data has been delivered, since the size of touchup data does not depend on its value, the decoder can synchronise its FIFO buffer to the correct size. In this way buffer occupancy is correctly synchronised and will remain synchronised. Consequently, although the correct data is not immediately available, correct data will be available as soon as the decoder is consuming data provided in the first available packet. Moreover, the decoder knows how much initial data is missing and preferably can avoid using the missing unknown data to adjust the audio.
Preferably this field is a simple count of how many bits are expected to be in the decoder FIFO, which will be a non-negative number with a format dependent maximum thus suited to being stored in a fixed length field.
Preferably this field is not included in every packet header since it costs data rate.
Increasing the frequency of its inclusion reduces the length of time reduced quality reproduction is experienced after mid-stream startup or a missing packet. However, there is a minimum achievable time for reduced quality experience corresponding to the duration touchup data spends in the decoder FIFO.
Buffer overflow
Ideally operation of the rate control servo will make this a rare event, but a strategy should be in place should it occur.
Encoder buffer overflow occurs if the lossless encoder is requiring greater capacity than the channel provides. If a packet contains the coarse coded data for a block, then all the touchup data relating to that block must be transmitted in that packet or earlier ones. Otherwise, the decoder buffer will underflow and lossless decode of that block cannot be performed.
If the encoder finds there is insufficient space in a packet to accommodate the required touchup data, then it could locally increase the data rate by enlarging the packet or reducing the number of coarse coded blocks it contains (thus increasing the local packet density).
Application requirements may however make a local data rate increase impractical, in which case the next best response (from an audio quality perspective) is to backtrack and revise the prequantization decisions for blocks whose touchup data has not yet been partly transmitted to the decoder in earlier packets.
Backtracking however requires computational resources which may not be immediately available. In this case, we have to accept that the decoder buffer will underrun, the decoder will not be able to perform lossless decode and we rely on touchup and predictor noise shaping (described in earlier sections) to ameliorate the consequent loss in audio quality. Having accepted that outcome, all of the touchup might as well be discarded from the block that can not be fully included (and the remaining blocks in the packet), slightly relieving the buffering stress.
Preferably the next packet uses the FIFO synchronisation field to ensure that correct touchup can restart at the earliest opportunity.
Preferably in the decoder, if there is insufficient data in the decoder FIFO to perform touchup for a block then the decoder knows the encoder buffer has overflowed and stops using touchup data to modify the audio until synchronisation is reset.
Buffer underflow
Encoder buffer underflow results if the channel is providing greater capacity than the lossless encoder is using and the packetiser finds itself with insufficient data to fill the packet. In situations like silent audio, the lossless encoder produces a low data rate and this is a likely situation.
Resolving buffer underflow requires dropping the data rate, either by reducing the packet size, or putting an extra block into the packet (thereby resulting in slightly fewer packets than planned) or leaving a hole in the packet (so not all the data rate is used for audio).
Figure 14 illustrates operation with a hole. In Figure 14a, the lossless encoder has encoded up to block t + 3 and pushed encoded coarse blocks into the buffer's delay line and touchup into the buffer's FIFO. The buffer's delay line can accommodate 4 coarse blocks and contains coarse blocks t, t + 1, t + 2 and t + 3. Corresponding touchup data has also been pushed into the FIFO but much of this has flowed out into previous packets leaving part of touchup t + 3 and all of touchup t + 4 in the buffer.
A packet is requested to contain 2 blocks, so coarse blocks t and t + 1 are pulled out of the delay line into the start of the packet. Touchup is now flowed into the remainder of the packet but there is insufficient available in the FIFO buffer to fill the packet, leaving a hole in the middle. The encoder has no audio data that can use the remainder of the packet, but there may be other data that is useful to convey to the decoder in the hole when opportunities like this arise. Album cover art would be an example.
In Figure 14b we show the decoder FIFO slightly later. The packet above has been received, as has the next packet containing coarse blocks t + 2 and t + 3 and touchup for blocks t + 4 and t + 5. Up to and including block t + 3 has been decoded, so the data in the hole is next to be pulled out of the decoder FIFO.
Having set the scene, the decoder needs to be able to identify the hole so as to discard it (or direct it to non-audio decode as, eg, cover art). Otherwise it would pull touchup data for block t + 4 from data in the hole instead of the correct area. It can do this by keeping track of what data came from which packets and being configured to know how many blocks the encoder buffers (in this case 4).
Preferably, the decoder tracks how old the data it buffers is, or more practically the place in each buffer where touchup data from a new packet starts and the first block index associated with that packet. Prior to decode of each block, the decoder discards any data that is at least as old as the number of buffered blocks. So before decoding block t + 4, data that arrived in packet (t + 4) − 4 = t is identified as the hole, discarded (or processed) as such, and the decoder continues to pull correct touchup data.
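The age-tracking rule above can be sketched as a FIFO whose segments are tagged with the first block index of the packet they arrived in; all names and the byte-oriented granularity are illustrative, not the format's:

```python
class TouchupFifo:
    """Sketch of a decoder-side touchup FIFO that discards data at least
    as old as the encoder's buffer depth (the 'hole') before decoding."""

    def __init__(self, encoder_buffer_blocks):
        self.depth = encoder_buffer_blocks
        self.segments = []     # list of (first_block_index, bytearray)

    def push_packet(self, first_block_index, touchup_bytes):
        self.segments.append((first_block_index, bytearray(touchup_bytes)))

    def pull_for_block(self, block_index, n):
        # Data from packets whose first block is at least `depth` blocks
        # old cannot be touchup for this block: discard (or hand to
        # non-audio decode, e.g. cover art).
        while self.segments and self.segments[0][0] <= block_index - self.depth:
            self.segments.pop(0)
        out = bytearray()
        while self.segments and len(out) < n:
            _, data = self.segments[0]
            take = min(n - len(out), len(data))
            out += data[:take]
            del data[:take]
            if not data:
                self.segments.pop(0)
        return bytes(out)
```

With a depth of 4, data that arrived tagged with block index t is dropped just before block t + 4 is decoded, matching the worked example above.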
Half bit touchup
We described above how typical lossless coding practice is for the encoder to split an integer number of bits off each sample of prediction residual, and entropy code the resulting most significant portion.
With this technique, the scaling of this most significant portion varies over a factor of 2, which is a fairly wide range. Multiple entropy coding tables could be used to cover this range, but the improvement in compression performance from doing so is minor. The bigger issue is that the data rate of the entropy coded prediction residual varies by a whole bit per sample over the range. At small data rates, this can leave very little data rate for the buffered touchup data, which greatly reduces the utility of any given duration of buffering. Operation at low data rates and low latencies benefits from the data rate allocated to the entropy coded msbs being both low and stable.
It is therefore helpful to split an odd (2k + 1) number of touchup bits off a pair of samples, so that the data rate variability of the entropy coded msbs is reduced to 0.5 bits per sample.
To do so, we want to view the action of splitting off lsbs as a 2D quantisation operation.
There is a sign convention difference between viewing the action as quantisation and as splitting off the lsbs. Splitting off the lsbs subtracts a non-negative value from the msbs. However, quantiser output is generally considered to be the input value plus quantiser error. Splitting off lsbs also always rounds down, whilst quantiser error takes signed values, since a quantiser will typically round to the nearest quantised value, sometimes up, sometimes down.
Figure 15a shows ordinary quantisation as a 2D operation. A point (x0, x1) in the plane is quantised to the regular grid of solid circles. The squares show the Voronoi regions which map to each output value. However, we want to apply noise shaping to this quantisation. Quantisation error in x1 from the previous 2D quantisation is subtracted from x0 prior to quantisation, just as if we were quantising each value individually in turn.
But the quantisation error from quantising x0 needs to be taken into account in quantising x1. This has the effect of the skew transformation shown in Figure 15b, and Figure 15c shows this expanded to the whole plane, with changed Voronoi regions to give a high pass characteristic to the 2D quantisation. We quantise x0 to implement the vertical lines and (x1 − err0), where err0 is the error added by quantising x0, to implement the diagonal lines. The error to feed forward to subtract from the next x0 is err1. The actual operations here are no different to 1D quantisation with first order noise shaping.
Alternatively, instead of quantising x0 & (x1 − err0), we could have implemented the diagonal lines by quantising (x1 − x0). This performs the same quantisation but with the 2D quantised output points differently parameterised.
Figure 16a shows a quantisation to half as many points, but still lying on (a subset of) the integer grid points. The Voronoi regions are now diamonds. Figure 16b shows the effect of the noise shaping skew transformation and Figure 16c shows how this tessellates to the whole plane. We quantise x1 to a multiple of 2^(k+1) with quantisation error err1 to implement the horizontal lines and quantise (x0 − 0.5 err1) (or (x0 − 0.5 x1)) to a multiple of 2^k with quantisation error err0 to implement the diagonal lines.
We now consider the error to feed forward to subtract from the next x0. The overall alteration to x0 is not err0 but (err0 − 0.5 err1). Consequently, the desired noise shaping of the deviation in x0 is responsible for −(err0 − 0.5 err1) of the alteration to x0, and the amount to carry over for future noise shaping is err1 + (err0 − 0.5 err1) = 0.5 err1 + err0.
This has discussed the large-scale behaviour of half-bit touchup with noise shaping where it is reasonable to draw Voronoi regions with straight line boundaries. But at low data rates k is small, and rounding details matter.
In particular 0.5 err1 may not be an integer. We will round it to an integer, but it turns up twice and we need to be careful that the rounding is dealt with consistently to ensure none of the quantisation error escapes noise shaping. Some of err1 is used to reduce x0 and some is fed forward to reduce the next x0. Preferably the two quantities sum to err1 so that all of err1 gets noise shaped. So if we quantise (x0 − ⌊0.5 err1⌋) then we carry over (⌈0.5 err1⌉ + err0). Or if we quantise (x0 − ⌈0.5 err1⌉) then we carry over (⌊0.5 err1⌋ + err0).
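A minimal sketch of this pairwise quantisation, assuming round-to-nearest quantisers and the output = input + error convention described above; the function names, the choice of the floor variant of the rounding split, and the integer test values are ours:

```python
def round_to_multiple(x, step):
    # Round to the nearest multiple of step (ties round up).
    return ((x + step // 2) // step) * step

def half_bit_quantise_pair(x0, x1, k, carry):
    """Split 2k+1 bits off a pair (x0, x1): x1 is quantised to a multiple
    of 2^(k+1), x0 to a multiple of 2^k, giving k + 0.5 bits per sample.
    `carry` is the noise-shaping feedback from the previous pair."""
    x0 = x0 - carry                           # first order feed-forward
    q1 = round_to_multiple(x1, 1 << (k + 1))
    err1 = q1 - x1                            # output = input + error
    half = err1 // 2                          # floor(0.5 err1)
    q0 = round_to_multiple(x0 - half, 1 << k)
    err0 = q0 - (x0 - half)
    next_carry = (err1 - half) + err0         # ceil(0.5 err1) + err0
    return q0, q1, next_carry
```

Note that half + (err1 − half) = err1 exactly, so the two rounded halves sum back to err1 and none of the quantisation error escapes the noise shaping, as required above.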
Pseudorandom offset synchronisation
For lossless reproduction the decoder needs to be able to furnish itself with a replica of the pseudorandom offsets used by the prequantiser.
To accomplish this, seed information needs to be conveyed in some (but probably not all) of the block or packet headers.
Preferably each channel is associated with a different pseudorandom sequence, which is chosen long enough that repeating the sequence will not cause audible patterning.
Good sounding pseudorandom generators have at least 32 bits of state, probably more. So it would be expensive to explicitly transmit the generator's state for each channel in order to seed the generators.
Preferably we maintain a sample count (modulo some repeat period) and the pseudorandom generation method is chosen so that state can be efficiently fast forwarded. The decoder seeds the generator for each channel with an initial standardised seed that is different for each channel, and then fast forwards the state by a sample index derived from the stream. The generators are then synchronised to generate pseudorandom offsets. When the sample count hits the repeat period, both encoder and decoder reset the generator seeds on all channels to the standardised values.
More preferably we maintain a block index count modulo a suitable power of 2 and the sample index count is the block index count times the number of samples in a block. Each packet header then contains low order bits of the block index count, with occasional packet headers carrying higher order bits.
The attraction of this approach is that it also satisfies another desirable system property. If a packet failed to be delivered then we might not know how many blocks the missing packet contained. However, when the next packet arrives, reading the low order bits of the block index in the packet header allows the number of missing blocks in the missing packet(s) to be deduced within limits. Consequently, the decoder knows how many samples are missing genuine data and need to be interpolated and also the correct timing for replaying the received packet.
There are many known pseudorandom generators which could be used and the choice of pseudorandom generator is beyond the scope of this document. However, we do want to explain what we mean by fast forwarding.
For example, linear congruential generators have state update equations of the form:

x(k+1) = (a·x(k) + c) modulo m

Consequently

x(k) = (a^k·x(0) + (a^k − 1)(a − 1)^(−1)·c) modulo m

Well known fast exponentiation algorithms efficiently calculate a^k modulo m in log2(k) time, so if (a − 1) has an inverse modulo m and we precompute and store it then we can efficiently calculate x(k) from the initial state x(0) and synchronise the decoder's pseudorandom generators to an arbitrary point in the stream.
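The closed form above makes fast forwarding a few modular operations; this sketch uses toy parameters chosen so that (a − 1) is invertible modulo m, not the parameters of any particular codec:

```python
def lcg_fast_forward(x0, k, a, c, m):
    """Jump a linear congruential generator k steps ahead in O(log k)
    multiplies, using x_k = a^k x_0 + (a^k - 1)(a - 1)^(-1) c (mod m).
    Requires gcd(a - 1, m) == 1; the inverse could be precomputed."""
    ak = pow(a, k, m)             # fast modular exponentiation
    inv = pow(a - 1, -1, m)       # (a - 1)^(-1) modulo m
    return (ak * x0 + (ak - 1) * inv * c) % m
```

Starting every channel from a standardised seed, a decoder joining mid-stream simply fast forwards by the sample index recovered from the block index in the packet header.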
If one of the prequantiser's information reduction strategies is to reduce the sample rate of the losslessly encoded audio then we need to ensure that both the prequantiser and lossless decoder consume a full block's worth of pseudorandom offsets even though the operations actually need fewer. This is to keep the pseudorandom generators' seeds synchronised with the sample index at block boundaries.
Entropy coding
Rice coding is a traditional approach to coding the coarse data in a lossless codec.
But it is not ideal for the application. It is a Huffman code tuned for a Laplacian distribution, which is an acceptable but not particularly close match to the distribution of prediction residuals. And it encodes to 3 bits per sample, which sets a limit to the ability to operate at lower data rates, and at slightly higher ones degrades the usefulness of buffering touchup data because there is little data rate allocated to it.
There are other well-known methods of entropy encoding, but particularly interesting is ANS (Asymmetric Numeral Systems) invented by Jarek Duda (eg arXiv:0902.0271 or arXiv:1311.2540). tANS (table driven ANS) is particularly appropriate but benefits from some adaptation to code coarse prediction residuals.
The issue to address is that tANS using k bits of state is inefficient for coding symbols with probability less than 2^(−k). We will limit the range of coarse prediction residuals (by splitting them more coarsely if they contain large outliers) but the extremal values will still have low probability.
Preferably this is addressed by coding pairs of coarse prediction residuals by the following process:
* List the pairs in order of decreasing probability.
* Partition the list into groups, each group (except the last) containing a power of 2 pairs.
* The alphabet for tANS coding is now the set of groups, with extra bits to specify which pair within the group is coded.
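The grouping step above can be sketched as follows; the doubling partition policy is one illustrative choice satisfying "each group (except the last) containing a power of 2 pairs", not necessarily the one a real encoder would use:

```python
def group_pairs(pairs_by_probability):
    """Partition a list of residual pairs, already sorted by decreasing
    probability, into groups whose sizes are powers of two (except
    possibly the last). tANS codes the group identity; an extra
    log2(len(group)) bits select the member within the group."""
    groups = []
    i, size = 0, 1
    while i < len(pairs_by_probability):
        groups.append(pairs_by_probability[i:i + size])
        i += size
        size *= 2          # illustrative: geometrically growing groups
    return groups
```

Because each group holds pairs of comparable probability, the tANS alphabet stays small and no symbol is rarer than the efficiency of the table can handle, while the in-group extra bits are essentially free to code.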
The idea of compiling groups of symbols, with extra bits to distinguish group members, is reminiscent of Huffman's recursive procedure of combining two symbols with similar probability into a composite symbol with a trailing bit in the code to distinguish them.
The process can also be understood as coding the pairs in polar coordinates. Each tANS symbol represents a group of pairs roughly forming an annular ring. Within the ring, pairs have comparable probability.
Coding pairs instead of single samples has the advantage that there are half as many entropy codings or decodings to perform per block. For all the computational efficiency of tANS coding, it still involves parsing a bitstream into variable length fields which is an awkward process that is not particularly cheap computationally.
We could code larger units than pairs, but pairs appear to be the sweet spot as implementations use lookup tables for mapping between pairs and tANS symbols and those tables would be inconveniently large for triples or 4-tuples.
tANS decode recovers the symbol directly from the decoding state without reading the bitstream. The bitstream is read after decode to reload the decoding state prior to the next tANS decode. This makes it easy to combine both the extra bits to resolve which pair within the tANS symbol should be decoded and the bits to reload tANS state into a single variable length read from the bitstream.
Servo Dynamics
In Figure 1, the rate control servo is responsible for taking information about buffer stress and the currently supplied block of audio and choosing the quantisation step size A for the prequantiser to use.
Loop control is a well-studied area and there is no need to discuss the topic in general. However, the choice of A has implications for how the level of prequantisation noise varies in response to the audio signal and there are audio considerations to take into account.
Firstly it is desirable that a transient event in the audio should not cause an increase in the noise level preceding that transient.
Secondly it is preferable for the level of the noise to be stable.
Figure 17 suggests a method for combining these considerations with the practical loop control considerations.
Preferably we avoid increases in A arising from analysis of the current block because this would increase the noise level at the start of the block, whilst an audio feature causing this block to contain more information than previous blocks probably starts somewhere mid-block.
Consequently, we provisionally choose A_LOOP based on feedback from previous blocks' encoded sizes and the resultant buffer stress.
Next we estimate from analysing the current block of audio how much data would be required to encode the current block at max(A, A_LOOP). If this is less than the channel capacity then we have no need to increase A. Even if the buffer is currently stressed enough to request an increase, it will be less stressed next block and we can defer any increase in the hope it may never happen. Consequently, we set A to min(A, A_LOOP) and finish.
Next we consider if we are at risk of approaching buffer overflow.
If we are safely away from risking buffer overflow then we can largely ignore what we know about the current block. However, since we know that buffer stress will be worse after encoding, we can defer any proposed decrease in A since it might be reverted soon. Consequently, we set A to max(A, A_LOOP) and finish.
Alternatively, if we are at risk of buffer overflow, then suffering buffer overflow is a worse outcome than allowing the noise to increase in advance of a transient. We must abandon that objective for now and make a decision to stabilise buffer stress based on all the information available including the current block of audio.
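The decision procedure of the preceding paragraphs can be sketched as below; all names are illustrative, `a_prev` standing for the previous block's step size, `estimate_bits` for the per-block size estimator, and `a_stabilise` for whatever value the stabilising fallback computes:

```python
def choose_step_size(a_prev, a_loop, estimate_bits, capacity_bits,
                     near_overflow, a_stabilise):
    """Sketch of the servo's step-size choice. a_loop is the provisional
    choice from loop feedback; estimate_bits(a) estimates the encoded
    size of the current block at step size a."""
    if estimate_bits(max(a_prev, a_loop)) < capacity_bits:
        return min(a_prev, a_loop)    # fits anyway: defer any increase
    if not near_overflow:
        return max(a_prev, a_loop)    # defer any proposed decrease
    return a_stabilise                # overflow risk: stabilise buffer stress
```

The first two branches embody the audio objectives (no pre-transient noise rise, stable noise level); only the last branch abandons them to protect the buffer.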

Claims (19)

  1. 1. A method for encoding input blocks of audio to packets of data, the input blocks containing one or more channels of audio samples, the method comprising the steps of: receiving input blocks of audio; determining a quantisation step size A for each audio channel in each block in dependence on a rate control mechanism; determining a pseudorandom offset for each sample in the input blocks, the pseudorandom offset for each channel being a pseudorandom sequence having a seed; quantizing with noise shaping each sample in the input blocks to produce prequantised blocks, wherein each sample value in the prequantised blocks is equivalent modulo A to the corresponding pseudorandom offset; losslessly encoding the prequantised blocks in dependence on A to produce blocks of losslessly encoded data, wherein the dependence on A is such that a smaller value of A would cause the losslessly encoded block to be larger and wherein the losslessly encoding is an injection mapping such that, for any prequantised block, losslessly encoding a different prequantised block that was also equivalent modulo A to the corresponding pseudorandom offset would necessarily produce a different block of losslessly encoded data; buffering the losslessly encoded blocks of data in a buffer; and generating packets of data for onward transmission in dependence on the buffered data, wherein at least some of the packets of data comprise data representing the seed of the pseudorandom sequence.
  2. 2. A method according to claim 1, wherein the rate control mechanism receives information about the buffer and the quantisation step size A is determined in dependence on the fullness of the buffer.
  3. 3. A method according to claim 1 or claim 2, further comprising the step of separating the losslessly encoded data in each block into a first portion and a second portion which are buffered separately in the step of buffering, wherein the first portion comprises coarse data and the second portion comprises touchup data such that the coarse data can be decoded without the touchup data to produce a coarse approximation of the prequantised block; and wherein the packets of data are generated such that each packet comprises an integer number of coarse data blocks and is filled up to available capacity with touchup data.
  4. 4. A method according to claim 3, wherein the touchup data is stored in a first-in-first-out (FIFO) buffer and the packets of data are generated from one end with coarse data blocks and from the other end with FIFO buffered touchup data.
  5. 5. A method according to any of claims 1 to 4, further comprising the step of analysing samples in the input blocks, wherein the quantisation stepsize A is further determined in dependence on the analysis of the samples.
  6. 6. A method according to claim 5, wherein the quantisation stepsize A is increased if the analysis suggests that the buffer might otherwise overflow.
  7. 7. An encoder adapted to encode input blocks of audio to packets of data using the method of any of claims 1 to 6.
  8. 8. A computer readable medium comprising instructions that, when executed by one or more processors, cause said one or more processors to perform the method of any of claims 1 to 6.
  9. 9. A method for decoding packets of data to output blocks of audio containing one or more channels of output audio samples, the method comprising the steps of: receiving packets of data; extracting information indicating a quantisation step size A and a seed for each channel and block dependent on the data; determining an offset for each sample in a block, wherein the offsets for each channel are a pseudorandom sequence dependent on the corresponding seed; decoding the data to produce a prediction residual for each sample in the block dependent on the data; filtering the prediction residuals with quantisation to produce a filtered sample for each sample in the block dependent on the corresponding prediction residual, wherein each filtered sample is equivalent modulo A to the corresponding offset; and generating output blocks of audio in dependence on the filtered samples.
  10. 10. A method according to claim 9, wherein a first portion of each packet of data is decoded without a delay and a second portion of each packet of data is buffered and delayed prior to decoding.
  11. 11. A decoder adapted to decode packets of data to output blocks of audio using the method of claim 9 or claim 10.
  12. 12. A computer readable medium comprising instructions that, when executed by one or more processors, cause said one or more processors to perform the method of claim 9 or claim 10.
  13. 13. A codec comprising an encoder according to claim 7 in combination with a decoder according to claim 11.
  14. 14. A method for encoding audio to data comprising: receiving input blocks of audio, each input block comprising one or more channels of audio samples quantised to an input audio precision; determining a prequantization precision for each channel in each block, there being at least one channel in one block where the prequantization precision is coarser than the input audio precision; producing prequantised blocks by, where the prequantization precision is coarser than the input audio precision, quantizing each sample in the input blocks to the prequantization precision with noise shaping, wherein below a threshold frequency of approximately 15kHz the transfer function of the noise shaping approximates a curve for equal loudness of noise; and losslessly encoding the prequantised blocks to produce blocks of losslessly encoded data.
  15. 15. An encoder adapted to encode audio to data using the method of claim 14.
  16. 16. A computer readable medium comprising instructions that, when executed by one or more processors, cause said one or more processors to perform the method of claim 14.
  17. 17. A method for reducing an audible transient on stopping noise shaping of an audio signal, the method comprising altering the next n quantised sample values by: multiplying state variables of the noise shaping and/or a difference between one or more previous outputs and corresponding inputs of the noise shaping by a precomputed matrix to yield an intermediate representation containing n or fewer values; quantising the n or fewer values in the intermediate representation, either directly or with back substitution, to produce n or fewer quantised intermediate values; multiplying the n or fewer quantised intermediate values by a precomputed integer valued matrix to produce n alterations for quantised sample values; and applying the n alterations for quantised sample values.
  18. 18. A device adapted to reduce an audible transient on stopping noise shaping of an audio signal using the method of claim 17.
  19. 19. A computer readable medium comprising instructions that, when executed by one or more processors, cause said one or more processors to perform the method of claim 17.
GB2217747.1A 2022-11-25 2022-11-25 Improvements to audio coding Pending GB2624686A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
GB2217747.1A GB2624686A (en) 2022-11-25 2022-11-25 Improvements to audio coding
PCT/GB2023/053071 WO2024110766A1 (en) 2022-11-25 2023-11-27 Improvements to audio coding


Publications (2)

Publication Number Publication Date
GB202217747D0 GB202217747D0 (en) 2023-01-11
GB2624686A true GB2624686A (en) 2024-05-29


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2274038A (en) * 1992-12-22 1994-07-06 Sony Broadcast & Communication Video signal compression using sub-band coding with pseudo-random scan
US6611212B1 (en) * 1999-04-07 2003-08-26 Dolby Laboratories Licensing Corp. Matrix improvements to lossless encoding and decoding
EP2981956B1 (en) * 2013-04-05 2022-11-30 Dolby International AB Audio processing system

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3191457B2 (en) * 1992-10-31 2001-07-23 ソニー株式会社 High efficiency coding apparatus, noise spectrum changing apparatus and method
JP3334419B2 (en) * 1995-04-20 2002-10-15 ソニー株式会社 Noise reduction method and noise reduction device
GB9509831D0 (en) 1995-05-15 1995-07-05 Gerzon Michael A Lossless coding method for waveform data
GB2323754B (en) * 1997-01-30 2002-03-20 Peter Graham Craven Lossless compression using iir prediction filters
US6023233A (en) * 1998-03-20 2000-02-08 Craven; Peter G. Data rate control for variable rate compression systems
KR100908114B1 (en) * 2002-03-09 2009-07-16 삼성전자주식회사 Scalable lossless audio encoding / decoding apparatus and method thereof
US7346113B2 (en) * 2002-07-26 2008-03-18 Texas Instruments Incorporated Method and circuit for stop of signals quantized using noise-shaping
US7383180B2 (en) * 2003-07-18 2008-06-03 Microsoft Corporation Constant bitrate media encoding techniques
EP1852849A1 (en) * 2006-05-05 2007-11-07 Deutsche Thomson-Brandt Gmbh Method and apparatus for lossless encoding of a source signal, using a lossy encoded data stream and a lossless extension data stream
EP1855271A1 (en) * 2006-05-12 2007-11-14 Deutsche Thomson-Brandt Gmbh Method and apparatus for re-encoding signals
WO2009096898A1 (en) * 2008-01-31 2009-08-06 Agency For Science, Technology And Research Method and device of bitrate distribution/truncation for scalable audio coding

Also Published As

Publication number Publication date
GB202217747D0 (en) 2023-01-11
WO2024110766A1 (en) 2024-05-30

Similar Documents

Publication Publication Date Title
US8311815B2 (en) Method, apparatus, and program for encoding digital signal, and method, apparatus, and program for decoding digital signal
EP1895511B1 (en) Audio encoding apparatus, audio decoding apparatus and audio encoding information transmitting apparatus
Valin et al. Definition of the Opus Audio Codec
JP4504414B2 (en) Redundancy reduction method
EP0826274B1 (en) Lossless coding method for waveform data
EP2022045B1 (en) Decoding of predictively coded data using buffer adaptation
US20050234714A1 (en) Apparatus for processing framed audio data for fade-in/fade-out effects
US20070168183A1 (en) Audio distribution system, an audio encoder, an audio decoder and methods of operation therefore
EP1854218B1 (en) Lossless encoding of information with guaranteed maximum bitrate
JP6728154B2 (en) Audio signal encoding and decoding
JP2006126826A (en) Audio signal coding/decoding method and its device
JPH07123242B2 (en) Audio signal decoding device
JP2003308098A (en) Method and apparatus for encoding/decoding digital information signal
WO2006011445A1 (en) Signal decoding apparatus
EP2702775A1 (en) Processing stereophonic audio signals
JP3811110B2 (en) Digital signal encoding method, decoding method, apparatus, program and recording medium
GB2624686A (en) Improvements to audio coding
JP4831120B2 (en) Speech encoding method and speech decoding method
JP4572755B2 (en) Decoding device, decoding method, and digital audio communication system
EP3264644A1 (en) Multiple source receiver
JP3649854B2 (en) Speech encoding device
JP4125520B2 (en) Decoding method for transform-coded data and decoding device for transform-coded data
JP2936576B2 (en) Encoding device
JP2006276877A (en) Decoding method for converted and encoded data and decoding device for converted and encoded data
JP2004085723A (en) Compressing and expanding method and compressing/expanding apparatus

Legal Events

Date Code Title Description
COOA Change in applicant's name or ownership of the application

Owner name: LENBROOK INDUSTRIES LIMITED

Free format text: FORMER OWNER: MQA LIMITED