WO2007035147A1 - Adaptive source signal encoding - Google Patents

Adaptive source signal encoding

Info

Publication number
WO2007035147A1
Authority
WO
WIPO (PCT)
Prior art keywords
network, signal, channel, granularity, scalable
Prior art date
Application number
PCT/SE2006/000340
Other languages
French (fr)
Inventor
Anisse Taleb
Jonas Svedberg
Original Assignee
Telefonaktiebolaget Lm Ericsson (Publ)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Telefonaktiebolaget Lm Ericsson (Publ) filed Critical Telefonaktiebolaget Lm Ericsson (Publ)
Publication of WO2007035147A1 publication Critical patent/WO2007035147A1/en

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/16Vocoder architecture
    • G10L19/18Vocoders using multiple modes
    • G10L19/24Variable rate codecs, e.g. for generating different qualities using a scalable representation such as hierarchical encoding or layered encoding
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L1/00Arrangements for detecting or preventing errors in the information received
    • H04L1/0001Systems modifying transmission characteristics according to link quality, e.g. power backoff
    • H04L1/0014Systems modifying transmission characteristics according to link quality, e.g. power backoff by adapting the source coding
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L1/00Arrangements for detecting or preventing errors in the information received
    • H04L1/0001Systems modifying transmission characteristics according to link quality, e.g. power backoff
    • H04L1/0015Systems modifying transmission characteristics according to link quality, e.g. power backoff characterised by the adaptation strategy
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs
    • H04N21/2343Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
    • H04N21/234327Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements by decomposing into layers, e.g. base layer and one or more enhancement layers
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/24Monitoring of processes or resources, e.g. monitoring of server load, available bandwidth, upstream requests
    • H04N21/2402Monitoring of the downstream path of the transmission network, e.g. bandwidth available
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/25Management operations performed by the server for facilitating the content distribution or administrating data related to end-users or client devices, e.g. end-user or client device authentication, learning user preferences for recommending movies
    • H04N21/266Channel or content management, e.g. generation and management of keys and entitlement messages in a conditional access system, merging a VOD unicast channel into a multicast channel
    • H04N21/2662Controlling the complexity of the video stream, e.g. by scaling the resolution or bitrate of the video stream based on the client capabilities
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/45Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
    • H04N21/462Content or additional data management, e.g. creating a master electronic program guide from data received from the Internet and a Head-end, controlling the complexity of a video stream by scaling the resolution or bit-rate based on the client capabilities
    • H04N21/4621Controlling the complexity of the content stream or additional data, e.g. lowering the resolution or bit-rate of the video stream for a mobile client with a small screen
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/60Network structure or processes for video distribution between server and client or between remote clients; Control signalling between clients, server and network components; Transmission of management data between server and client, e.g. sending from server to client commands for recording incoming content stream; Communication details between server and client 
    • H04N21/63Control signaling related to video distribution between client, server and network components; Network processes for video distribution between server and clients or between remote clients, e.g. transmitting basic layer and enhancement layers over different transmission paths, setting up a peer-to-peer communication via Internet between remote STB's; Communication protocols; Addressing
    • H04N21/637Control signals issued by the client directed to the server or network components
    • H04N21/6377Control signals issued by the client directed to the server or network components directed to server
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/60Network structure or processes for video distribution between server and client or between remote clients; Control signalling between clients, server and network components; Transmission of management data between server and client, e.g. sending from server to client commands for recording incoming content stream; Communication details between server and client 
    • H04N21/63Control signaling related to video distribution between client, server and network components; Network processes for video distribution between server and clients or between remote clients, e.g. transmitting basic layer and enhancement layers over different transmission paths, setting up a peer-to-peer communication via Internet between remote STB's; Communication protocols; Addressing
    • H04N21/643Communication protocols
    • H04N21/6437Real-time Transport Protocol [RTP]
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/60Network structure or processes for video distribution between server and client or between remote clients; Control signalling between clients, server and network components; Transmission of management data between server and client, e.g. sending from server to client commands for recording incoming content stream; Communication details between server and client 
    • H04N21/65Transmission of management data between client and server
    • H04N21/658Transmission by the client directed to the server
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845Structuring of content, e.g. decomposing content into time segments
    • H04N21/8451Structuring of content, e.g. decomposing content into time segments using Advanced Video Coding [AVC]
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L1/00Arrangements for detecting or preventing errors in the information received
    • H04L1/0001Systems modifying transmission characteristics according to link quality, e.g. power backoff
    • H04L1/0023Systems modifying transmission characteristics according to link quality, e.g. power backoff characterised by the signalling
    • H04L1/0026Transmission of channel quality indication

Definitions

  • the present invention relates generally to scalable signal encoding.
  • Adaptation at intermediate gateways If a part of the network becomes congested, or has a different service capability, a dedicated network entity performs a transcoding of the service. With scalable codecs this could be as simple as dropping or truncating media frames.
  • Audio coding (Non-conversational, streaming/download)
  • MPEG4-SLS provides progressive enhancements to the core Advanced Audio Coding/Bit Slice Arithmetic Coding (AAC/BSAC)
  • CfI Call for Information
  • Speech coding (conversational mono)
  • VMR-WB Variable-rate Multimode Wideband
  • In ITU-T, the Multirate G.722.1 audio/video conferencing codec has been extended with two new modes providing super wideband (14 kHz audio bandwidth, 32 kHz sampling) capability, operating at 24, 32 and 48 kbps.
  • Leung et al. show how SNR scalability can be employed for a basic CELP (VSELP) codec with an embedded Multi-Pulse SNR enhancement layer.
  • codecs that can increase bandwidth with increasing amounts of bits.
  • G.722 Sub band ADPCM
  • addition of a specific bandwidth layer increases the audio bandwidth of the synthesized signal from ~4 kHz to ~7 kHz.
  • Another example of a bandwidth scalable coder is the 16 kbps bandwidth scalable audio coder based on G.729 described by Koishida in [16].
  • MPEG4-CELP specifies an SNR scalable coding system for 8 and 16 kHz sampled input signals [25].
  • G.729 With respect to scalable conversational speech coding, the main standardization effort is taking place in ITU-T (Working Party 3, Study Group 16). There the requirements for a scalable extension of G.729 have been defined recently (Nov. 2004), and the qualification process ended in July 2005. This new G.729 extension will be scalable from 8 to 32 kbps with at least 2 kbps granularity steps from 12 kbps.
  • the main target application for the G.729 scalable extension is conversational speech over shared and bandwidth limited xDSL links, i.e. the scaling is likely to take place in a Digital Residential Gateway that passes the Voice over IP (VoIP) packets through specific controlled Voice channels (Vc's).
  • VoIP Voice over IP
  • ITU-T is also in the process of defining the requirements for a completely new scalable conversational codec in SG16/WP3/Question 9.
  • the requirements for the Q.9/Embedded Variable Rate (EV) codec were finalized in July 2005; currently the Q.9/EV requirements state a core rate of 8.0 kbps and a maximum rate of 32 kbps.
  • the Q.9/EV core is not restricted to narrowband (8 kHz sampling) like the G.729 extension will be, i.e. Q.9/EV may provide wideband (16 kHz sampling) from the core layer and onwards.
  • the Enhanced Variable Rate Codec (EVRC, 1995) transmits a delta Delay parameter, which is a partially redundant coded parameter, making it possible to reconstruct the Adaptive Codebook State after a channel erasure, and thus enhancing error recovery.
  • EVRC Enhanced Variable Rate Codec
  • McCree et al. show how one can construct various channel robustness schemes for a G.729-based CELP codec using redundancy coding and/or Multiple Description technology.
  • AMR-NB, a speech service specified for GSM networks, operates on a maximum source rate adaptation principle.
  • the trade-off between channel coding and source coding from a given gross bit rate is continuously monitored and adjusted by the GSM system, and the encoder source rate is adapted to provide the best quality possible.
  • the source rate may be varied from 4.75 kbps to 12.2 kbps.
  • the channel gross rate is either 22.8 kbps or 11.4 kbps.
  • the AMR Real-time Transport Protocol (RTP) payload format [17] allows for the retransmission of whole past packets, significantly increasing the robustness to random frame errors.
  • RTP Real-time Transport Protocol
  • Chen et al. describe a multimedia application that uses multi-rate audio capabilities to adapt the total rate, and also the compression schemes actually used, based on information from a slow (1 sec) feedback channel.
  • Chen et al. extend the audio application with a very low rate base layer that uses text as a redundant parameter, to be able to provide speech synthesis under very severe error conditions.
  • audio scalability can be achieved by:
  • Dropping audio channels (e.g., mono consists of 1 channel, stereo of 2 channels, surround of 5 channels). This is called spatial scalability.
  • AAC-BSAC fine-grained scalable audio codec
  • the quantized spectral values are grouped into frequency bands, each of these groups containing the quantized spectral values in their binary representation. Then the bits of the group are processed in slices according to their significance and spectral content. Thus, first all most significant bits (MSB) of the quantized values in the group are processed and the bits are processed from lower to higher frequencies within a given slice. These bit-slices are then encoded using a binary arithmetic coding scheme to obtain entropy coding with minimal redundancy.”
  • MSB most significant bits
  • scalability can be achieved in a two-dimensional space. Quality, corresponding to a certain signal bandwidth, can be enhanced by transmitting more LSBs, or the bandwidth of the signal can be extended by providing more bit-slices to the receiver. Moreover, a third dimension of scalability is available by adapting the number of channels available for decoding. For example, a surround audio (5 channels) could be scaled down to stereo (2 channels) which, on the other hand, can be scaled to mono (1 channel) if, e.g., transport conditions make it necessary.
  • H.264/MPEG-4 Advanced Video Codec is the current state-of-the-art in video coding [1].
  • the design of H.264/MPEG-4 AVC is based on the traditional concept of hybrid video coding using motion-compensated temporal and spatial prediction in conjunction with block-based residual transform coding.
  • H.264/MPEG-4 AVC contains a large number of innovative technical features, both in terms of improved coding efficiency and network friendliness [2].
  • Recently, a new standardization initiative has been launched by the Joint Video Team of ITU-T/VCEG and ISO/IEC MPEG with the objective of extending the H.264/MPEG-4 AVC standard towards scalability [3, 4].
  • a scalable bit-stream consists of a base or core layer and one or more nested enhancement layers.
  • video scalability can be achieved by:
  • the scalability enhancements of the H.264 video codec are described in [7, 8].
  • a scalability enhancement of the H.264 RTP header format is also proposed.
  • an RTP Payload format for the H.264 video codec is specified.
  • the RTP payload format allows for packetization of one or more Network Abstraction Layer Units (NALUs), produced by an H.264 video encoder, in each RTP payload.
  • NALUs are the basic transport entities of the H.264/AVC framework.
  • SVC Scalable Video Coding
  • the first three bits (L2, L1, L0) indicate a Layer. Layers are used to increase spatial resolution of a scalable stream. For example, slices corresponding to Layer-0 describe the scene at a certain resolution. If an additional set of Layer-1 slices is available, the scene can be decoded at a higher spatial resolution.
  • T1, T0 indicate a temporal resolution.
  • Slices assigned to temporal resolution 0 correspond to the lowest temporal resolution, that is, only I-frames are available. If TR-1 slices are also available, the frame-rate can be increased (temporal scalability).
  • the last two bits specify a quality level (QL). QL-0 corresponds to the lowest quality. If additional QL slices are available, the quality can be increased (SNR scalability).
  • network entities, e.g., routers, Radio Network Controllers (RNCs), Media Gateways (MGWs)
  • the adaptation engine has the role of bridging the gap between media format, terminal, network, and user characteristics.
  • Content providers permit access to multimedia content through various connections such as Internet, Ethernet, DSL, W-LAN, cable, satellite, and broadcast networks.
  • users with various terminals such as desktop computers, handheld devices, mobile phones, and TV-sets are allowed to access the content.
  • These large differences in content delivery to various users demand a system that resolves the complexity of service provisioning, service delivery, and service access.
  • three types of descriptions, namely a multimedia content description, a service provider environment description and a user environment description, are necessary. To allow for wide deployment and good interoperability these descriptors must follow a standardized form. While the MPEG-7 standard plays a key role in content description [11, 12], the MPEG-21 standard, especially Part 7 Digital Item Adaptation (DIA), provides, in addition to standardized descriptions, tools for adaptation engines as well [9, 10].
  • DIA Digital Item Adaptation (MPEG-21 Part 7)
  • DIA adaptation tools are divided into seven groups. In the following we highlight the most relevant groups.
  • the usage environment includes the description of user characteristics and preferences, terminal capabilities, network characteristics and limitations, and natural environment characteristics.
  • the standards provide means to specify the preferences of the user related to the type and content of the media. These can be used to specify the interests of the user, e.g., in sport events, or in movies of a certain actor. Based on the usage preference information a user agent can search for appropriate content or might call the attention of the user to relevant multimedia broadcast content.
  • the user can set the "Audio Presentation Preferences” and the "Display Presentation Preferences”. These descriptors specify certain properties of the media like, audio volume and color saturation, which reflect the preferred rendering of multimedia content.
  • the user can guide the adaptation process. For example, a user might be interested in high-resolution graphics even if this would require the loss of video contents. With “Focus of Attention” the user can specify the most interesting part of the multimedia content. For example, a user might be interested in the news in text form rather than in video form. In this way the text part might be rendered to a larger portion of the display while the video playback resolution is severely reduced or even neglected.
  • Bit-stream Syntax Description (BSD): The BSD describes the syntax (high level structure) of a binary media resource. Based on the description the adaptation engine can perform the necessary adaptation, as all required information about the bit-stream is available through the description. The description is based on the XML language. This way, the description is very flexible but, on the other hand, the result is a quite extensive specification.
  • Terminal and Network Quality of Service: descriptors are specified that aid the adaptation decisions at the adaptation engine.
  • the adaptation engine has the task of finding the best trade-off among network and terminal constraints, feasible adaptation operations satisfying these constraints, and the quality degradation associated with each adaptation operation.
  • the main constraints in media resource adaptation are bandwidth and computation time.
  • Adaptation methods include selective frame dropping and/or coefficient dropping, requantization, MPEG-4 Fine-Granular Scalability (FGS), wavelet reduction, and spatial size reduction.
  • the system is based on rate-distortion packet selection and organization.
  • the method used by the system consists of scanning the encoded scalable media and scoring each data unit based on a rate-distortion score. The scored data units are then organized from the highest to the lowest into network packets, which are transmitted to the receiver based on the available network bandwidth.
  • although this scheme prioritizes important data over non-important data, it has a drawback in that for proper operation it needs a back channel that signals the status of the network, which prevents its usage in broadcast scenarios.
  • the ordering of data units in a packet is done once and for all at the sender side. It is hence static in the sense that the ordering inside a certain packet is fixed during transmission.
  • FGS Fine Grain Scalability
  • the flexibility of the codec is improved, thus providing increased functionality; however, codec efficiency is reduced.
  • if the codec generates a bitstream with too low embedded scalability (the extreme case being no embedded scalability at all), the functionality of the bitstream is reduced, while the efficiency of the bitstream is increased.
  • An objective of the present invention is to more efficiently use a scalable codec.
  • the overall idea of the invention is to allow the coding to adaptively switch from highly functional modes to highly efficient modes depending on channel conditions. Briefly, this is accomplished by adapting the layering structure of the scalable encoder to network/channel conditions.
  • Fig. 1 is a diagram illustrating the relation between encoding distortion and encoder granularity
  • Fig. 2 is a simple block diagram of a real-time media streaming system
  • Fig. 3 is a diagram illustrating the mechanism for receiver reports from a receiver to a media server
  • Fig. 4 is a diagram illustrating the generation of a network granularity function from receiver reports obtained from a set of receivers
  • Fig. 5 illustrates a first example of a network granularity function and a corresponding frame format
  • Fig. 6 illustrates a second example of a network granularity function and a corresponding frame format
  • Fig. 7 illustrates a third example of a network granularity function and a corresponding frame format
  • Fig. 8 illustrates a tree structure for finding the optimum granularity configuration
  • Fig. 9 is a diagram illustrating the relationship between rate and distortion for a given path in the tree of Fig. 8;
  • Fig. 10 is a block diagram of a multistage vector quantizing encoder
  • Fig. 11 is a block diagram of a decoder corresponding to the encoder of Fig. 10
  • Fig. 12 is a block diagram of the encoder of Fig. 10 with jointly encoded indices
  • Fig. 13 is a block diagram of a typical prior art conversational service system
  • Fig. 14 is a block diagram of an embodiment of a conversational service system based on the principles of the present invention.
  • Fig. 15 is a diagram illustrating the relationship between frame error rate and decoded signal quality for a first layering structure
  • Fig. 16 is a diagram illustrating the relationship between frame error rate and decoded signal quality for a second layering structure
  • Fig. 17 is a block diagram of an embodiment of an encoding apparatus in accordance with the present invention.
  • Fig. 18 is a diagram illustrating the relationship between frame error rate and decoded signal quality for a third layering structure
  • Fig. 19 is a diagram illustrating the relationship between frame error rate and decoded signal quality for a fourth layering structure
  • Fig. 20 is a block diagram of an embodiment of a decoding apparatus in accordance with the present invention.
  • Fig. 21 is a diagram illustrating the relationship between decodability and received bit-rates for different encoding modes.
  • Fig. 1 shows a typical example of encoding a source signal with different granularities.
  • the large grained encoder achieves lower distortion; however, the resulting bitstream is less flexible, since the embedded scalability has large steps of 5 bits/sample, as compared to the 1 bit/sample steps of the fine-grained encoder.
  • a trade-off has to be made between the granularity of the embedded scalable bit-stream and encoder efficiency. In accordance with the present invention this is achieved by adapting the granularity of the encoder to the expected network conditions, as illustrated in Fig. 2.
  • a media server 10 receives real-time media, which is to be distributed over a network. The media content is first encoded in an encoder
  • a network-monitoring unit 16 in the media server 10 receives receiver reports (described with reference to Fig. 3) on feedback channels and builds averaged statistics of the available bit-rates in the network.
  • the histogram of these available bit-rates is called the "network granularity function" (described with reference to Fig. 4) .
  • the network granularity function informs about the effectively available bit-rates in the network and is used to adapt the encoder granularity to expected network conditions.
  • Fig. 3 illustrates the feedback of a receiver report to the media server 10 in Fig. 2. As illustrated in Fig. 3, possibly truncated packets P(n-1), P(n), P(n+1), etc. are received by a receiver.
  • the receiver reports how many packets of different lengths (corresponding to different bit-rates) have been received over a given time interval. This informs the server of where in the bit-stream truncations occur most often.
  • the receiver reports from several receivers are combined (after synchronization) by the network-monitoring unit 16 in Fig. 2 to form a histogram denoted the network granularity function, which is updated as new receiver reports arrive.
  • the network granularity function is a slow average of the status of the network and does not provide any information about the instantaneous bit-rate available through the network.
  • the encoder granularity is adapted such that it corresponds to the network granularity.
  • the optimal encoder granularity would be such that the core layer is allocated a rate R1, the first enhancement layer is allocated a rate R2-R1 and the second and last enhancement layer is allocated a rate R3-R2.
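As an illustration of the two preceding points, the following Python sketch shows how a network-monitoring unit might build the network granularity function as a normalized histogram of effectively received bit-rates, and derive the Fig. 5 layer allocation from its dominant rates. The report format and all names are assumptions made for illustration only.

```python
from collections import Counter

def network_granularity_function(receiver_reports):
    """Build the network granularity function (NGF): a normalized
    histogram of the bit-rates effectively received in the network.
    receiver_reports: iterable of (bitrate_kbps, packet_count) pairs,
    one entry per observed packet length in each receiver report."""
    counts = Counter()
    for rate, n in receiver_reports:
        counts[rate] += n
    total = sum(counts.values())
    return {rate: n / total for rate, n in counts.items()}

def layer_allocation(ngf):
    """Derive a layered frame format from the NGF rates: the core layer
    gets rate R1, enhancement layer k gets rate R(k+1) - R(k)."""
    rates = sorted(ngf)                      # e.g. [R1, R2, R3]
    core = rates[0]
    enhancements = [b - a for a, b in zip(rates, rates[1:])]
    return core, enhancements

# Example: three rates dominate in the network, as in Fig. 5.
reports = [(8, 120), (12, 60), (24, 20)]     # (kbps, packets received)
ngf = network_granularity_function(reports)
print(ngf)                    # {8: 0.6, 12: 0.3, 24: 0.1}
print(layer_allocation(ngf))  # (8, [4, 12]): 8 kbps core, 4 and 12 kbps layers
```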
  • Fig. 6 illustrates another example.
  • the network granularity function consists of two rates R1, R2 and a continuous region of rates [R3, R4].
  • the network granularity function is such that certain rates occur very seldom, as illustrated in Fig. 7.
  • the rate R2 occurs with very low probability.
  • an alternative is to ignore rates which occur with a probability below a predetermined probability threshold.
  • this probability threshold can be set to 0.05 (i.e. 5%) which is equivalent to ignoring a rate that serves only 5% of the total number of users.
  • the probability threshold can also be made adaptive.
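A minimal sketch of the thresholding just described, reusing the rate-to-probability representation from the previous sketch; the renormalization step is an assumption:

```python
def prune_ngf(ngf, threshold=0.05):
    """Ignore rates whose probability falls below the threshold
    (0.05 ignores a rate serving only 5% of the users), then
    renormalize the remaining probabilities."""
    kept = {rate: p for rate, p in ngf.items() if p >= threshold}
    total = sum(kept.values())
    return {rate: p / total for rate, p in kept.items()} if total else {}
```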
  • a model of the rate distortion of the source encoder is here taken in the general sense, i.e. the distortion may consist of objective distortion measures, such as signal-to-noise ratio, weighted signal-to-noise ratio, noise-to-mask ratios for audio coding, Peak Signal-to-Noise Ratio (PSNR) for video sequences, etc.
  • PSNR Peak Signal-to-Noise Ratio
  • the models considered here are step models, i.e. at each rate a number of delta-rate steps are available with the corresponding distortion measure.
  • This can best be visualized as a tree structure where the root is the zero-rate and infinite-distortion point, see Fig. 8.
  • Each tree node corresponds to the encoder rate and distortion for a given increase in the rate.
  • the result is that an encoder granularity configuration corresponds to a path in the tree and a corresponding rate/distortion curve, see Fig. 9.
  • the objective is therefore to find the path in the tree that leads to the best distortion for a given network granularity.
  • This optimization depends on how one defines the overall distortion for a given network. Since the network granularity function is a discrete probability distribution, one can analytically define several overall distortions.
  • An efficient way is to use the average user distortion, defined as:
  • AVERAGE_USER_DISTORTION = Σ_R NGF(R) × D(R), where NGF(R) denotes the network granularity function and D(R) denotes the distortion at rate R.
  • an optimization algorithm is used in order to find the best frame format, i.e. granularity configuration, which leads to a minimum AVERAGE_USER_DISTORTION.
  • a tree search algorithm can be used in order to explore the tree and find the optimal path.
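The following sketch illustrates this optimization under a deliberately simplified step model in which each layer option directly carries the distortion reached after adding that layer (in the text, distortion follows from the whole path). Exhaustive enumeration stands in for a real tree-search algorithm; all names and numbers are illustrative.

```python
import itertools

def average_user_distortion(path, ngf, d_unreachable=1e9):
    """path: list of (cumulative_rate, distortion) nodes with increasing
    rate and decreasing distortion. A user whose available rate is R
    decodes up to the last layer that fits within R."""
    total = 0.0
    for user_rate, prob in ngf.items():
        feasible = [d for r, d in path if r <= user_rate]
        total += prob * (feasible[-1] if feasible else d_unreachable)
    return total

def best_configuration(step_options, ngf):
    """step_options[k] lists the available (delta_rate, distortion)
    pairs for layer k (the step model). Returns the path, i.e. the
    granularity configuration, minimizing AVERAGE_USER_DISTORTION."""
    best_audd, best_path = float("inf"), None
    for choice in itertools.product(*step_options):
        rate, path = 0, []
        for delta_rate, distortion in choice:
            rate += delta_rate
            path.append((rate, distortion))
        audd = average_user_distortion(path, ngf)
        if audd < best_audd:
            best_audd, best_path = audd, path
    return best_audd, best_path

ngf = {8: 0.6, 12: 0.3, 24: 0.1}             # NGF as in the earlier sketch
steps = [[(8, 10.0)],                        # core layer options
         [(4, 7.0), (8, 6.0)],               # 1st enhancement options
         [(12, 3.0), (16, 2.5)]]             # 2nd enhancement options
print(best_configuration(steps, ngf))        # picks layer steps matching the NGF
```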
  • the encoder in Fig. 10 is typically operated in a fine-grain successive refinement mode. However, it is also possible to operate the encoder in a larger grained mode by jointly deriving the indices of two or more stages.
  • the finest grain is obtained by generating indices that minimize the reproduction error at each stage, i.e. each index i_k is chosen such that ||e_k||^2 is minimized; the grain size will therefore correspond to the size of the codebook at each stage.
  • Each successive stage will therefore add a refinement of the previous stages.
  • stages are grouped and indices are jointly optimized, for instance by grouping stage k-1 and stage k and jointly optimizing the indices i_(k-1) and i_k such that the resulting error ||e_k||^2 is minimized.
  • decoding will then require both the index i_(k-1) and the index i_k, and therefore the size of the additional refinement layer is increased.
  • An example of this procedure is illustrated in Fig. 12. This example corresponds to the frame structure of Fig. 5, with a core layer and two enhancement layers.
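The contrast between fine-grain per-stage quantization and joint optimization of grouped stages can be sketched as follows; codebooks and input are toy values, and the squared error plays the role of ||e_k||^2 in the text.

```python
def quantize_stage(x, codebook):
    """Pick the index minimizing the squared reproduction error of
    this stage, and return it together with the residual e_k."""
    errs = [sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in codebook]
    i = min(range(len(codebook)), key=errs.__getitem__)
    residual = [xi - ci for xi, ci in zip(x, codebook[i])]
    return i, residual

def msvq_sequential(x, codebooks):
    """Fine-grain mode: each stage index is chosen independently, so
    every stage forms its own small refinement layer."""
    indices = []
    for cb in codebooks:
        i, x = quantize_stage(x, cb)
        indices.append(i)
    return indices

def msvq_joint(x, cb1, cb2):
    """Larger-grain mode: the indices i_(k-1), i_k of two grouped
    stages are searched jointly, minimizing the error after both
    stages; decoding then needs both indices at once."""
    best = None
    for i, c1 in enumerate(cb1):
        r = [xi - ci for xi, ci in zip(x, c1)]
        for j, c2 in enumerate(cb2):
            e = sum((ri - ci) ** 2 for ri, ci in zip(r, c2))
            if best is None or e < best[0]:
                best = (e, i, j)
    return best

cb1, cb2 = [[0.0], [1.0]], [[0.6], [-0.6]]
print(msvq_sequential([0.45], [cb1, cb2]))  # greedy: [0, 0], error 0.0225
print(msvq_joint([0.45], cb1, cb2))         # joint: (0.0025, 1, 1), lower error
```

As the toy example shows, jointly derived indices can reach a lower distortion than the greedy per-stage choice, at the price of a coarser (two-index) refinement layer.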
  • Another, more computationally efficient way to implement the adaptive scalability scheme is to store the media directly in a pre-processed format.
  • This format is stored in a "master file”.
  • This pre-processed format may consist of all the steps required for encoding without the actual quantization/encoding steps.
  • the quantization and final encoding steps are performed in real time, as described above, and made dependent on the network granularity function.
  • the source signal is represented by the master file.
  • In practice, at the receiver side the decoder has to be informed about which frame format or granularity configuration is in use; this information also needs to be available at network nodes which perform packet operations such as truncation.
  • These granularity modes of a packet can for example be conveyed to the decoder in a mode field in the packet header.
  • the mode field will indicate the structure of the selected layers, including the size of the various layers and the fine grain granularity.
  • the actual transmitted information for the mode field may be a pointer to a receiver-acknowledged table of allowed granularity modes, or a general flexible description of the granularity mode, e.g. in the form of a list of layers, layer sizes and layer properties.
  • the signaling of the mode field may also be implicit if the data corresponding to the different encoding layers is transmitted in different transport packets.
  • the decoder can correctly assign and decode the layers by using, for instance, packet time-stamp information.
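A sketch of how the mode field might be used at a receiver or network node, under the assumption (illustrative, not from the text) that the field indexes a receiver-acknowledged table listing layer names and sizes:

```python
# Hypothetical table of acknowledged granularity modes; the mode field
# in the packet header is an index into this shared table.
MODE_TABLE = {
    0x0: [("core", 160)],                          # layer sizes in bits
    0x1: [("core", 160), ("enh1", 80)],
    0x2: [("core", 160), ("enh1", 80), ("enh2", 140)],
}

def split_layers(payload_bits, mode):
    """Assign a received (possibly truncated) payload to layers
    according to the signaled granularity mode; a packet truncated by
    the network simply yields fewer complete layers."""
    layers, pos = {}, 0
    for name, size in MODE_TABLE[mode]:
        if pos + size > len(payload_bits):
            break                       # this layer was truncated away
        layers[name] = payload_bits[pos:pos + size]
        pos += size
    return layers

print(list(split_layers("0" * 200, 0x2)))   # only ['core'] survives truncation
```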
  • This control situation is depicted in Fig. 13, where an encoder 20 with fixed layering outputs a scalably encoded signal to a rate limiter 22 controlled by a rate controller 28.
  • the deployment scenario is essentially an ad-hoc transport network which may or may not have media aware "intelligent" network nodes.
  • the conversational media stream will be subjected to scaling in media servers or media aware network nodes. It will also be subjected to packet loss in the network transport (e.g. air interface loss) and to late loss in the receiver-side de-jitter buffers due to late arrival of media packets.
  • a measurement unit 26 on the receiver side analyzes packet loss and jitter and feeds this information back to rate controller 28.
  • an advanced flexible layering structure, which is illustrated in Figs. 14-16, is proposed.
  • the layering structure of an encoder 30 is controlled by a scalability mode controller 34 based on feedback information received from a measurement unit 32 on the receiver side.
  • enhancement layer L2A does not increase the decoded signal quality at low frame error rates over the core (R1) layer quality, but assists in maintaining near core layer quality as the frame error rate increases. Maximum quality is not increased until enhancement layer L3A is added.
  • the layering structure B of Fig. 16 emphasizes quality over robustness when setting up the enhancement layers.
  • enhancement layer L2B increases the decoded signal quality at low frame error rates over the core layer quality. Robustness is increased when enhancement layer L3B is added.
  • Comparing Figs. 15 and 16, it is noted that the core layer (R1) is the same, but that the functionality (robustness or quality enhancement) of the enhancement layers has been altered; the total gross rate is the same in both layering structures.
  • a description of two different scalability layering structures A, B.
  • the feedback channel may be based on out-of-band receiver estimations/reports to the sender (RTCP) or on in-band channel control signaling between receiver and transmitter.
  • the Adaptive Multi Rate-Narrow Band (AMR-NB) standard of the GSM system is a multimode codec supporting rates of 12.2, 10.2, 7.95, 7.4, 6.7, 5.9, 5.15 and 4.75 kbps during voice activity, a Silence Insertion Descriptor (SID) frame (1.75 kbps) and silence intervals (0 kbps).
  • the AMR-NB codec may be transported using the RTP payload format described in RFC 3267. According to this RFC, the codec may transport redundant data (repeated old frames) as a part of the payload, resulting in a repetition code that enhances robustness considerably.
  • the AMR-NB codec can be extended to a wideband (WB) (50 Hz-7.5 kHz bandwidth) codec using the existing transport format, by assigning the NB modes to separately encode a downsampled higher band (HB, 4-7 kHz) signal.
  • WB wideband
  • HB higher band (4-7 kHz)
  • extending the coder with an additional layer coding the high band has been investigated in [13], where a coding concept called AMR-BWS (AMR Bandwidth Scalable) was proposed.
  • AMR-BWS AMR Bandwidth Scalable
  • An extended AMR-NB codec supporting SNR, Bandwidth and Robustness Scalability is described with reference to Figs. 17 and 18.
  • the core compression scheme is based on reusing the individual AMR-NB modes for SNR, Bandwidth and Channel Robustness purposes.
  • This Flexible SCalable AMR codec is called FSC-AMR in the following description.
  • Fig. 17 is a block diagram of an embodiment of an encoder 30 in accordance with the present invention.
  • a 16 kHz audio signal is forwarded to a band splitter 40, for example implemented by Quadrature Mirror Filters (QMF), which splits the signal into a narrow band (NB) signal occupying the frequency band 50 Hz-4 kHz and a higher band (HB) signal occupying the frequency band 4 kHz-7.5 kHz.
  • the NB signal is encoded into a core layer signal by a coder 42.
  • the HB signal is encoded into an enhancement layer signal by a coder 44. This enhancement signal is intended either for layer 2 or layer 3, as illustrated in Table 1 below.
  • the core layer signal and NB signal are forwarded to an NB residual coder 46, which determines and encodes the residual signal from encoder 42.
  • This coded signal may be used as a quality enhancement on layer 3 in scalability mode B, see Table 1.
  • the core layer signal is also forwarded to a robustness/redundancy encoding module 48, which outputs a channel robustness layer for the narrow band. This may be used as layer 2 in scalability mode A, see Table 1.
  • a scalability mode controller 50 controls the described encoders and a multiplexing module 52 to combine the layers into mode A or B.
  • mode A is primarily a robustness enhancing mode
  • mode B is a quality enhancement mode only.
  • the HB quality layer can be added either as layer 2 or layer 3, depending on the mode.
  • the two modes are also illustrated in the quality versus frame error rate diagrams in Figs. 18 and 19.
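Based on the layer roles just described (Table 1 itself is not reproduced here), a hypothetical multiplexer for the two scalability modes could look as follows; the layer ordering is inferred from the text:

```python
def assemble_frame(core, hb, nb_residual, robustness, mode):
    """Combine the encoded layers into one frame according to the
    scalability mode: mode A favors robustness (core + NB robustness
    layer + HB quality layer), mode B favors quality (core + HB
    quality layer + NB residual layer). Inputs are byte strings."""
    if mode == "A":
        layers = [core, robustness, hb]
    elif mode == "B":
        layers = [core, hb, nb_residual]
    else:
        raise ValueError("unknown scalability mode")
    return b"".join(layers)
```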
  • Fig. 20 is a block diagram of an embodiment of a decoding apparatus 36 in accordance with the present invention.
  • a demultiplexing module 60 receives the media data from the IP/UDP stack and separates the different layers to respective decoders.
  • a robustness/redundancy decoding module 62 determines and forwards a coded NB signal (core layer signal) to an NB decoder 64. As an example, if the core layer has been properly received, the core layer signal of the current frame is forwarded to decoder 64. Otherwise a robustness/redundancy layer based signal is forwarded to decoder 64 (if present).
  • an HB layer signal (if present) is forwarded to an HB decoder 66.
  • An NB residual layer signal (if present) is forwarded to an NB residual decoder 68.
  • the decoded residual signal is added to the decoded core layer signal in an adder 70, thereby forming a 50 Hz - 4 kHz NB audio signal.
  • This is combined with the decoded 4 kHz - 7.5 kHz HB audio signal from decoder 66 in a band merger 72, which is typically formed by Quadrature Mirror Filters (QMF).
  • a mode decomposition unit 74 controls the decoding modes of decoders 62, 64, 66, 68 depending on coding mode signals received from demultiplexing module 60.
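The core-layer selection performed by the robustness/redundancy decoding module 62 can be sketched with a hypothetical helper (names and the fallback order are assumptions consistent with the description above):

```python
def select_core_signal(core_received, core_frame, redundancy_frame):
    """Forward the coded NB signal to the core decoder: the current
    core layer if it was properly received, otherwise the redundancy
    copy from an earlier robustness layer, if present."""
    if core_received:
        return core_frame
    if redundancy_frame is not None:
        return redundancy_frame    # reconstruct from the robustness layer
    return None                    # fall back to normal loss concealment
```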
  • Unit 32 performs measurements and mode analysis on received frames to determine whether truncations have been performed by the network and/or frames have been lost, and reports the results back to the transmitter.
  • Each packet has an application header containing the Layering Mode information; in this embodiment a 4-bit field according to Table 2 is used.
  • the measurement and mode analysis module 32 can find out if the packet has been truncated by the network or not.
  • RTCP Real-time Transport Control Protocol
  • the loss rate statistics can then be used by the transmitter to enforce a more robust mode.
  • the transmitter will scale its packets by removing the BW(HB) layer to reduce the transmission rate further.
  • the transmitter may probe the channel by increasing the rate using a Robust mode with slightly higher bit rate, e.g. Mode C.
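A toy version of the sender-side adaptation policy described in the preceding paragraphs; all thresholds, and any mode name beyond A, B and C, are assumptions:

```python
def choose_mode(loss_rate, removed_kbps, current_mode):
    """Illustrative policy driven by RTCP loss statistics: enforce the
    robust layering when losses are high, additionally drop the BW(HB)
    layer when the network itself is scaling the stream down, and probe
    with the slightly higher-rate robust Mode C when conditions look
    stable in a robust mode."""
    if loss_rate > 0.05:
        if removed_kbps > 0.0:
            return "A_without_hb"   # robust layering, BW(HB) layer removed
        return "A"                  # robustness-oriented layering
    if loss_rate < 0.01 and current_mode == "A":
        return "C"                  # probe the channel at a higher rate
    return "B"                      # quality-oriented layering
```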
  • RTCP-XR RTCP Extended Reports
  • the receiver may additionally send an application specific report parameter showing how much bit-rate (kbps) capacity the media aware network has removed (scaled down).
  • kbps kilobits per second
  • RTCP-XR reports are described in [19].
  • the in-band channel may be relative to the transmitted mode or absolute. Possible relative in-band parameters are rate increase/decrease, audio bandwidth increase/decrease, channel robustness increase/decrease and SNR increase/decrease. For simplicity an absolute in-band channel is used in this embodiment (in a similar way as the rate control in-band channel used in AMR/RFC 3267).
  • the receiver may request a scalability mode based on any information available in the receiver, e.g. frame loss (%), late loss (%), jitter (ms), jitter buffer depth, frame error distribution (binary sequence), removed bandwidth (kbps), or external information about the transport channel.
  • the very same control as described for the out-of-band signaling may be performed on the receiving side, with the difference that the resulting mode is requested via the in-band channel.
  • the functionality of the various blocks of the encoder and decoder is typically achieved by one or several microprocessors or micro/signal processor combinations and corresponding software.
  • the encoder is able to maximize both the functionality of the bitstream and the coding efficiency, thus delivering a bitstream that is both robust and efficient.

Abstract

A flexible scalable encoder (30) includes a scalability mode controller (50) controlling various encoder stages (42, 44, 46, 48) to produce coding layers with different functionality depending on the measured network/channel conditions. This encoder is able to maximize both the functionality of the bitstream and the coding efficiency, thus delivering a bitstream that is both robust and efficient.

Description

ADAPTIVE SOURCE SIGNAL ENCODING
TECHNICAL FIELD
The present invention relates generally to scalable signal encoding.
BACKGROUND
The need for offering voice and video services over packet switched networks has been increasing dramatically and is today stronger than ever. Considerable effort at diverse standardization bodies is being mobilized to define efficient solutions for the delivery of delay sensitive content to the users. Noticeably, two major challenges still await solutions. First, the diversity of deployed networking technologies and user devices implies that the same service offered to different users may have different user-perceived quality due to the different properties of the transport networks. Hence, improved quality mechanisms are necessary to adapt services to the actual transport characteristics. Second, the properties of wireless links provide no favorable conditions for conversational and streaming services. Over a wireless link the available bandwidth, transmission and transport delay vary severely. These highly unpredictable variations make it difficult to deliver real-time traffic that demands a constantly available bandwidth and predictable delay. Adaptive service delivery seems to be an excellent approach to overcoming the above-mentioned issues.
Today, scalable audiovisual and, in general, media content codecs are available. In fact, scalability was one of the early design guidelines of MPEG (Moving Picture Experts Group). However, although these codecs are attractive due to their functionality, they lack the efficiency to operate at the low bit-rates that map to the current mass market wireless devices. With the high penetration of wireless communications, more sophisticated scalable codecs are needed. Despite the efforts being put into adaptive services and scalable codecs, scalable services will not happen unless more attention is given to the transport issues. Therefore, besides efficient codecs, an appropriate network architecture and transport framework must be considered as an enabling technology to fully utilize scalability in service delivery. Basically, three scenarios can be considered:
• Adaptation at the end-points. That is, if a lower transmission rate must be chosen, the sending side is informed and it performs scaling or codec changes.
• Adaptation at intermediate gateways. If a part of the network becomes congested, or has a different service capability, a dedicated network entity performs a transcoding of the service. With scalable codecs this could be as simple as dropping or truncating media frames.
• Adaptation inside the network. If a router or wireless interface becomes congested, adaptation is performed right at the place of the problem by dropping or truncating packets. This is a desirable solution for transient problems like handling of severe traffic bursts or channel quality variations of wireless links.
Recently proposed scalable codecs for speech, audio and video will be briefly described in the next few paragraphs. The concepts of the proposed methods for scalable service delivery will also be highlighted.
Audio coding (Non-conversational, streaming/download)
In general the current audio research trend is to improve the compression efficiency at low bit-rates (providing good enough stereo quality at bit-rates below 32 kbps). Recent low bit-rate audio improvements are the finalization of the Parametric Stereo (PS) tool development in MPEG and the standardization of a mixed Code Excited Linear Predictive (CELP)/transform codec, "Extended Adaptive Multi-rate Wideband" (a.k.a. AMR-WB+), in the 3rd Generation Partnership Project (3GPP). There is also an ongoing MPEG standardization activity around Spatial Audio Coding (Surround/5.1 content), where a first reference model (RM0) has been selected.
With respect to scalable audio coding, recent standardization efforts in MPEG have resulted in a scalable to lossless extension tool, MPEG4-SLS.
MPEG4-SLS provides progressive enhancements to the core Advanced Audio Coding/Bit Slice Arithmetic Coding (AAC/BSAC) all the way up to lossless coding, with a granularity step down to 0.4 kbps. Furthermore, within MPEG a Call for Information (CfI) was issued in January 2005 [6] targeting the area of scalable speech and audio coding. In the CfI the key issues addressed are scalability, consistent performance across content types (e.g. speech and music) and encoding quality at low bit rates (< 24 kbps).
Speech coding (conversational mono)
In general speech compression, the latest standardization effort is an extension of the 3GPP2 Variable-rate Multimode Wideband (VMR-WB) codec to also support operation at a maximum rate of 8.55 kbps. In ITU-T, the Multirate G.722.1 audio/video conferencing codec has been extended with two new modes providing super wideband (14 kHz audio bandwidth, 32 kHz sampling) capability, operating at 24, 32 and 48 kbps.
There are a number of scalable conversational codecs that can increase Signal to Noise Ratio (SNR) with increasing amounts of bits/layer. E.g. MPEG4-CELP [24] and G.727 (Embedded Adaptive Differential Pulse Code Modulation (ADPCM)) are SNR-scalable, where each additional layer increases the fidelity of the reconstructed signal. Recently Kövesi et al. have proposed a flexible SNR and bandwidth scalable codec [25] that achieves fine grain scalability from a certain core rate, enabling a fine granular optimization of the transport bandwidth, applicable for speech/audio conferencing servers or open loop network congestion control.
Further, in [23] Leung et al. show how SNR scalability can be employed for a basic CELP (VSELP) codec with an embedded Multi-Pulse SNR enhancement layer. There are also codecs that can increase bandwidth with increasing amounts of bits, e.g. G.722 (Sub-band ADPCM), a candidate in the 3GPP WB speech codec competition [15] and the academic AMR-BWS [13] codec. For these codecs the addition of a specific bandwidth layer increases the audio bandwidth of the synthesized signal from ~4 kHz to ~7 kHz. Another example of a bandwidth scalable coder is the 16 kbps bandwidth scalable audio coder based on G.729 described by Koishida in [16]. In addition, MPEG4-CELP specifies an SNR scalable coding system for 8 and 16 kHz sampled input signals [25].
With respect to scalable conversational speech coding, the main standardization effort is taking place in ITU-T (Working Party 3, Study Group 16). There the requirements for a scalable extension of G.729 have been defined recently (Nov. 2004), and the qualification process ended in July 2005. This new G.729 extension will be scalable from 8 to 32 kbps with at least 2 kbps granularity steps from 12 kbps. The main target application for the G.729 scalable extension is conversational speech over shared and bandwidth limited xDSL links, i.e. the scaling is likely to take place in a Digital Residential Gateway that passes the Voice over IP (VoIP) packets through specific controlled Voice channels (Vc's). ITU-T is also in the process of defining the requirements for a completely new scalable conversational codec in SG16/WP3/Question 9. The requirements for the Q.9/Embedded Variable Rate (EV) codec were finalized in July 2005; currently the Q.9/EV requirements state a core rate of 8.0 kbps and a maximum rate of 32 kbps. The Q.9/EV core is not restricted to narrowband (8 kHz sampling) like the G.729 extension will be, i.e. Q.9/EV may provide wideband (16 kHz sampling) from the core layer and onwards.
With regards to improving channel robustness of conversational codecs, this has been done in various ways for existing standards and codecs:
• The Enhanced Variable Rate Codec (EVRC, 1995) transmits a delta Delay parameter, which is a partially redundant coded parameter, making it possible to reconstruct the Adaptive Codebook State after a channel erasure, and thus enhancing error recovery. A detailed overview of EVRC is found in [27].
• In [14], McCree et al. show how one can construct various channel robustness schemes for a G.729-based CELP codec using redundancy coding and/or Multiple Description technology.
• In [20], Lefebvre et al. investigate and compare the efficiency of redundancy coding versus stateless coding in G.729 and iLBC, respectively.
• AMR-NB [28], a speech service specified for GSM networks, operates on a maximum source rate adaptation principle. The trade-off between channel coding and source coding from a given gross bit rate is continuously monitored and adjusted by the GSM system, and the encoder source rate is adapted to provide the best quality possible. The source rate may be varied from 4.75 kbps to 12.2 kbps. The channel gross rate is either 22.8 kbps or 11.4 kbps.
• The AMR Real-time Transport Protocol (RTP) payload format [17] allows for the retransmission of whole past packets, significantly increasing the robustness to random frame errors. In [26] a multimode adaptive AMR system using the full and partial redundancy concepts adaptively is described. Further, the RTP payload allows for interleaving of packets, thus enhancing the robustness for non-conversational applications.
• Multiple Description Coding in combination with AMR-WB is described in [21]; further, an adaptive codec mode selection scheme is proposed where AMR-WB is used for low error conditions and the described channel robust MD-AMR(WB) coder is used during severe error conditions.
• An alternative channel robustness technology to the redundant data transmission technique is to adjust the encoder analysis to reduce the dependency on states; this is done in the AMR 4.75 encoding mode. The application of a similar encoder side analysis technique for AMR-WB was described by Lefebvre et al. in [22].
• In [29] Chen et al. describe a multimedia application that uses multi-rate audio capabilities to adapt the total rate, and also the compression schemes actually used, based on information from a slow (1 sec) feedback channel. In addition, Chen et al. extend the audio application with a very low rate base layer that uses text as a redundant parameter, to be able to provide speech synthesis under very severe error conditions.
Basically, audio scalability can be achieved by:
• Changing the quantization of the signal, i.e. SNR-like scalability
• Extending or tightening the bandwidth of the signal,
• Dropping audio channels (e.g., mono consists of 1 channel, stereo of 2 channels, surround of 5 channels). This is called spatial scalability.
Currently available is the fine-grained scalable audio codec AAC-BSAC. It can be used for both audio and speech coding, and it also allows for bit-rate scalability in small increments. It produces a bit-stream which can be decoded even if certain parts of the stream are missing. There is a minimum requirement on the amount of data that must be available to permit decoding of the stream. This is referred to as the base layer. The remaining bits correspond to quality enhancements, hence their designation as enhancement layers. AAC-BSAC supports enhancement layers of around 1 kbps/channel or smaller for audio signals. Referring to [6]: "To obtain such fine grain scalability, a bit-slicing scheme is applied to the quantized spectral data. First the quantized spectral values are grouped into frequency bands, each of these groups containing the quantized spectral values in their binary representation. Then the bits of the group are processed in slices according to their significance and spectral content. Thus, first all most significant bits (MSB) of the quantized values in the group are processed and the bits are processed from lower to higher frequencies within a given slice. These bit-slices are then encoded using a binary arithmetic coding scheme to obtain entropy coding with minimal redundancy."
Furthermore, in [6] it is stated that: "With an increasing number of enhancement layers utilized by the decoder, providing more LSB information refines quantized spectral data. At the same time, providing bit-slices of spectral data in higher frequency bands increases the audio bandwidth. In this way, quasi-continuous scalability is achievable."
That is, scalability can be achieved in a two-dimensional space. Quality, corresponding to a certain signal bandwidth, can be enhanced by transmitting more LSBs, or the bandwidth of the signal can be extended by providing more bit-slices to the receiver. Moreover, a third dimension of scalability is available by adapting the number of channels available for decoding. For example, surround audio (5 channels) could be scaled down to stereo (2 channels), which in turn can be scaled down to mono (1 channel) if, e.g., transport conditions make it necessary.
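The bit-slicing scheme quoted from [6] above can be illustrated with a short sketch (entropy coding omitted; the data layout is an assumption):

```python
def bit_slices(bands, n_bits=4):
    """bands: list of frequency bands, each a list of non-negative
    quantized spectral values. Slices are emitted MSB first; within a
    slice, bits run from lower to higher frequency bands."""
    slices = []
    for bit in range(n_bits - 1, -1, -1):            # MSB ... LSB
        plane = []
        for band in bands:                           # low -> high frequency
            plane.extend((v >> bit) & 1 for v in band)
        slices.append(plane)
    return slices

# Two bands of quantized values; the first slice holds all the MSBs.
print(bit_slices([[5, 3], [2, 7]], n_bits=3))
# [[1, 0, 0, 1], [0, 1, 1, 1], [1, 1, 0, 1]]
```

Sending more slices of the low bands refines quality (more LSBs), while extending the slices to higher bands increases the audio bandwidth.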
Video coding
The H.264/MPEG-4 Advanced Video Codec (AVC) is the current state-of-the-art in video coding [1]. Technically, the design of H.264/MPEG-4 AVC is based on the traditional concept of hybrid video coding using motion-compensated temporal and spatial prediction in conjunction with block-based residual transform coding. Within this framework, H.264/MPEG-4 AVC contains a large number of innovative technical features, both in terms of improved coding efficiency and network friendliness [2]. Recently, a new standardization initiative has been launched by the Joint Video Team of ITU-T/VCEG and ISO/IEC MPEG with the objective of extending the H.264/MPEG-4 AVC standard towards scalability [3, 4]. The scalability extensions of H.264/MPEG-4 AVC are referred to as Scalable Video Coding (SVC). The targeted scalability functionality should allow the removal of parts of the bit-stream while achieving a reasonable coding efficiency of the decoded video at reduced SNR, temporal or spatial resolution. Conceptually, a scalable bit-stream consists of a base or core layer and one or more nested enhancement layers.
Basically, video scalability can be achieved by:
• Changing the quality of the video frames (SNR scalability).
• Reducing or increasing the frame-rate (temporal scalability).
• Changing the resolution of the sequence (spatial scalability).
The scalability enhancements of the H.264 video codec are described in [7, 8]. A scalability enhancement of the H.264 RTP header format is also proposed. In [5], an RTP payload format for the H.264 video codec is specified. The RTP payload format allows for packetization of one or more Network Abstraction Layer Units (NALUs), produced by an H.264 video encoder, in each RTP payload. NALUs are the basic transport entities of the H.264/AVC framework. With the introduction of the Scalable Video Coding (SVC) extension, a new NALU header extension is proposed. The first three bits (L2, L1, L0) indicate a layer. Layers are used to increase the spatial resolution of a scalable stream. For example, slices corresponding to Layer-0 describe the scene at a certain resolution. If an additional set of Layer-1 slices is available, the scene can be decoded at a higher spatial resolution. The next three bits (T2, T1, T0) indicate a temporal resolution. Slices assigned to temporal resolution 0 (TR-0) correspond to the lowest temporal resolution, that is, only I-frames are available. If TR-1 slices are also available, the frame-rate can be increased (temporal scalability). The last two bits (Q1, Q0) specify a quality level (QL). QL-0 corresponds to the lowest quality. If additional QL slices are available, the quality can be increased (SNR scalability).
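A hypothetical sketch of how a receiver or network node might unpack such a header byte is given below (Python; the exact bit packing order is an assumption for illustration, not the normative NALU syntax):

# Assumed layout, MSB to LSB: L2 L1 L0 | T2 T1 T0 | Q1 Q0.
def parse_svc_extension(byte):
    layer = (byte >> 5) & 0b111      # spatial layer (L2, L1, L0)
    temporal = (byte >> 2) & 0b111   # temporal resolution (T2, T1, T0)
    quality = byte & 0b11            # quality level (Q1, Q0)
    return layer, temporal, quality

print(parse_svc_extension(0b00101001))   # -> (1, 2, 1)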
Based on the information contained in the header, during congestion or unfavorable wireless channel conditions, network entities, e.g., routers, Radio Network Controllers (RNCs), Media Gateways (MGWs), etc., can discard packets in a way that affects user-perceived quality as little as possible.
In the past years, several methods have been developed for efficient and adaptive audiovisual content delivery. As a result, the necessary elements of a sophisticated content delivery infrastructure are available. However, there is still a need for a conceptual architecture that specifies, or at least considers, the relation and interoperation of the individual elements of the delivery chain. The aim of the MPEG-21 Media Adaptation Framework is to describe how these various elements fit together. "The vision for MPEG-21 is to define a multimedia framework to enable transparent and augmented use of multimedia resources across a wide range of networks and devices used by different communities.", see [9]. Today, the scope includes adaptation to terminals or networks. However, there is also a growing trend towards user-centric multimedia content adaptation. The key components of adaptation are: (i) the adaptation engine, and (ii) standardized descriptors for adaptation.
The adaptation engine has the role of bridging the gap between media format, terminal, network, and user characteristics. Content providers permit access to multimedia content through various connections such as Internet, Ethernet, DSL, W-LAN, cable, satellite, and broadcast networks. Moreover, users with various terminals such as desktop computers, handheld devices, mobile phones, and TV-sets are allowed to access the content. This high degree of diversity in content delivery to various users demands a system that resolves the complexity of service provisioning, service delivery, and service access. For adaptation of the content to the user, three types of descriptions are necessary: a multimedia content description, a service provider environment description and a user environment description. To allow for wide deployment and good interoperability these descriptors must follow a standardized form. While the MPEG-7 standard plays a key role in content description [11, 12], the MPEG-21 standard, especially Part 7 Digital Item Adaptation (DIA), provides tools for adaptation engines in addition to standardized descriptions [9, 10].
One goal of MPEG-21 DIA is to provide standardized descriptions and tools that can be used by adaptation engines. DIA adaptation tools are divided into seven groups. In the following we highlight the most relevant groups.
Usage Environment Description Tools: The usage environment includes the description of user characteristics and preferences, terminal capabilities, network characteristics and limitations, and natural environment characteristics.
The standard provides means to specify the preferences of the user related to the type and content of the media. These can be used to specify the interests of the user, e.g., in sports events or in movies with a certain actor. Based on the usage preference information, a user agent can search for appropriate content or might call the attention of the user to relevant multimedia broadcast content.
The user can set the "Audio Presentation Preferences" and the "Display Presentation Preferences". These descriptors specify certain properties of the media, like audio volume and color saturation, which reflect the preferred rendering of multimedia content.
With the "Conversion Preference" and "Presentation Priority Preference" descriptors the user can guide the adaptation process. For example, a user might be interested in high-resolution graphics even if this would require the loss of video contents. With "Focus of Attention" the user can specify the most interesting part of the multimedia content. For example, a user might be interested in the news in text form rather than in video form. In this way the text part might be rendered to a larger portion of the display while the video playback resolu- tion is severely reduced or even neglected.
Bit-stream Syntax Description (BSD): The BSD describes the syntax (high-level structure) of a binary media resource. Based on the description, the adaptation engine can perform the necessary adaptation, as all required information about the bit-stream is available through the description. The description is based on the XML language. This makes the description very flexible but, on the other hand, results in a quite extensive specification.
Terminal and Network Quality of Service: Descriptors are specified that aid the adaptation decisions at the adaptation engine. The adaptation engine has the task of finding the best trade-off among network and terminal constraints, feasible adaptation operations satisfying these constraints, and the quality degradation associated with each adaptation operation. The main constraints in media resource adaptation are bandwidth and computation time. Adaptation methods include selection of frame dropping and/or coefficient dropping, requantization, MPEG-4 Fine-Granular Scalability (FGS), wavelet reduction, and spatial size reduction.
The focus of the work carried out is on adaptation at dedicated network elements, where a sophisticated adaptation engine is located. On the other hand, transport networks, with their widespread use of wireless links, are becoming an increasingly important contributor to multimedia quality degradation. To ease the problem, methods within the transport network are necessary to handle transient resource problems. MPEG-21 and MPEG-7 only provide a general framework for adaptation operations.
Other Prior Art
In [32] a system and a method for delivery of scalable media is described.
The system is based on rate-distortion packet selection and organization. The method used by the system consists of scanning the encoded scalable media and scoring each data unit based on a rate-distortion score. The scored data units are then organized from the highest to the lowest into network packets, which are transmitted to the receiver based on the available network bandwidth. Although this scheme has a clear advantage in prioritizing important data over non-important data, it has a drawback in that for proper operation it needs a back channel that signals the status of the network, which prevents its usage in broadcast scenarios. Also, the ordering of data units in a packet is done once and for all at the sender side. It is hence static in the sense that the ordering inside a certain packet is fixed during transmission. This kind of static ordering prevents intelligent network nodes from performing a re-ordering depending on the actual in-place network conditions. Furthermore, the ordering of data units inherently suggests priority transmission of highly scored packets. This implies that most of the low-score packets corresponding to the same timed media, if transmitted, may arrive too late to be decoded and played back. This severely restricts the ability to reach an optimal compromise.
In the AES convention paper [30], a real-time audio streaming method that dynamically adjusts to time-varying network loads is described. In order to cope with time-varying networks, BSAC is used to encode the media and, because of its fine-grain scalability, it is possible to adjust the sent bit-rate depending on the network load. Network load and bandwidth are estimated by using a variant of the packet-pair algorithm. Although this approach successfully uses fine-grain scalability, it suffers from the lack of efficiency of the BSAC codec and, in general, of any fine-grain scalable codec, especially at low bit-rates.
There are problems with existing solutions with regard to how to efficiently use a scalable source codec. For example:
• If there is a large amount of embedded scalability (i.e. Fine Grain Scalability, FGS), the flexibility of the codec is improved, providing increased functionality; however, codec efficiency is reduced. On the other hand, if the codec generates a bitstream with very little embedded scalability (the extreme case being no embedded scalability at all), the functionality of the bitstream is reduced, while its efficiency is increased.
• One does not know the best enhancement layer setup at a given time (since network load and characteristics vary). For example, if robustness-enhancing scalability is provided, it is not certain that these bits will be useful, especially if the channel condition stabilizes into a good state after an initial startup phase with worse channel conditions.
SUMMARY
An objective of the present invention is to more efficiently use a scalable codec.
This objective is achieved in accordance with the attached claims.
The overall idea of the invention is to allow the coding to adaptively switch from highly functional modes to highly efficient modes dependent on channel conditions. Briefly, this is accomplished by adapting the layering structure of the scalable encoder to network/channel conditions.
BRIEF DESCRIPTION OF THE DRAWINGS
The invention, together with further objects and advantages thereof, may best be understood by making reference to the following description taken together with the accompanying drawings, in which: Fig. 1 is a diagram illustrating the relation between encoding distortion and encoder granularity;
Fig. 2 is a simple block diagram of a real-time media streaming system;
Fig. 3 is a diagram illustrating the mechanism for receiver reports from a receiver to a media server; Fig. 4 is a diagram illustrating the generation of a network granularity function from receiver reports obtained from a set of receivers;
Fig. 5 illustrates a first example of a network granularity function and a corresponding frame format;
Fig. 6 illustrates a second example of a network granularity function and a corresponding frame format;
Fig. 7 illustrates a third example of a network granularity function and a corresponding frame format; Fig. 8 illustrates a tree structure for finding the optimum granularity configuration;
Fig. 9 is a diagram illustrating the relationship between rate and distortion for a given path in the tree of Fig. 8;
Fig. 10 is a block diagram of a multistage vector quantizing encoder; Fig. 11 is a block diagram of a decoder corresponding to the encoder of
Fig. 10;
Fig. 12 is a block diagram of the encoder of Fig. 10 with jointly encoded indices;
Fig. 13 is a block diagram of a typical prior art conversational service system;
Fig. 14 is a block diagram of an embodiment of a conversational service system based on the principles of the present invention;
Fig. 15 is a diagram illustrating the relationship between frame error rate and decoded signal quality for a first layering structure; Fig. 16 is a diagram illustrating the relationship between frame error rate and decoded signal quality for a second layering structure;
Fig. 17 is a block diagram of an embodiment of an encoding apparatus in accordance with the present invention;
Fig. 18 is a diagram illustrating the relationship between frame error rate and decoded signal quality for a third layering structure;
Fig. 19 is a diagram illustrating the relationship between frame error rate and decoded signal quality for a fourth layering structure;
Fig. 20 is a block diagram of an embodiment of a decoding apparatus in accordance with the present invention;
Fig. 21 is a diagram illustrating the relationship between decodability and received bit-rates for different encoding modes.
DETAILED DESCRIPTION
The inventive concept of adapting the layering structure of a scalable source signal encoder to network/channel conditions will be described with reference to specific examples. One such example is adaptation of layer granularity, which will now be described in detail.
It is well known that the performance of a scalable codec in terms of coding efficiency depends on the granularity of the bit-stream. Fine grain scalable codecs are less efficient than large grain scalable codecs. This property is due to the low level quantizers, which are operated in a multistage fashion.
The granularity is therefore dependent on the size of the codebooks in each stage. Fig. 1 shows a typical example of encoding a source signal with different granularities. One clearly sees that the coding efficiency at a certain rate is lower for the fine-grained encoder. The large grained encoder achieves lower distortion; however, the resulting bitstream is less flexible, since its embedded scalability has large steps of 5 bits/sample, compared to 1 bit/sample for the fine-grained encoder.
In real-time media delivery, in order to maximize the end-user quality, a trade-off has to be made between the granularity of the embedded scalable bit-stream and encoder efficiency. In accordance with the present invention this is achieved by adapting the granularity of the encoder to the expected network conditions, as illustrated in Fig. 2. In a typical real-time media streaming application, a media server 10 receives real-time media, which is to be distributed over a network. The media content is first encoded in an encoder 12. Thereafter the resulting scalable bitstream is put into packets in a packetizer 14 and transmitted. Since the bitstream has embedded scalability, it is possible to truncate the bitstream at certain network nodes without losing the ability to decode the media at the receiver, albeit at a lower quality than that obtained without truncation. A network-monitoring unit 16 in the media server 10 receives receiver reports (described with reference to Fig. 3) on feedback channels and builds averaged statistics of the available bit-rates in the network. The histogram of these available bit-rates is called the "network granularity function" (described with reference to Fig. 4). The network granularity function informs about the effectively available bit-rates in the network and is used to adapt the encoder granularity to expected network conditions.
Fig. 3 illustrates the feedback of a receiver report to the media server 10 in Fig. 2. As illustrated in Fig. 3, possibly truncated packets P(n-1), P(n), P(n+1), etc. are received by a receiver. The receiver reports how many packets of different lengths (corresponding to different bit-rates) have been received over a given time interval. This informs the server of where in the bit-stream truncations occur most often.
As illustrated in Fig. 4, the receiver reports from several receivers are combined (after synchronization) by the network monitoring unit 16 in Fig. 2 to form a histogram denoted the network granularity function, which is updated as new receiver reports arrive. Because of latency, the network granularity function is a slow average of the status of the network and does not provide any information about the instantaneous bit-rate available through the network. After computation of the network granularity function, the encoder granularity is adapted such that it corresponds to the network granularity. Several mappings are possible, as illustrated by the examples in Fig. 5-7.
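As a minimal sketch (assuming, purely for illustration, that each receiver report is a list of received bit-rates in kbps), the network granularity function can be computed as a normalized histogram in Python:

from collections import Counter

def network_granularity_function(receiver_reports):
    # Combine all reported bit-rates and normalize into a probability mass.
    counts = Counter(rate for report in receiver_reports for rate in report)
    total = sum(counts.values())
    return {rate: n / total for rate, n in sorted(counts.items())}

ngf = network_granularity_function([[12, 24, 24], [12, 32], [24, 24, 32]])
print(ngf)   # {12: 0.25, 24: 0.5, 32: 0.25}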
If the network granularity function consists only of three rates R1, R2 and R3 as illustrated in Fig. 5, then, based on this information, the optimal encoder granularity would be such that the core layer is allocated a rate R1, the first enhancement layer is allocated a rate R2-R1, and the second and last enhancement layer is allocated a rate R3-R2.
Fig. 6 illustrates another example. Here the network granularity function consists of two rates R1, R2 and a continuous region of rates [R3-R4]. In this case, it is advantageous to have 3 large steps for scalability, i.e. similar to Fig. 5, and thereafter to have fine grain scalability in the range of rates [R3-R4].
In certain situations the network granularity function is such that certain rates occur very seldom, as illustrated in Fig. 7. In Fig. 7 the rate R2 occurs with very low probability. Then, instead of losing efficiency by introducing an enhancement layer that corresponds exactly to the network granularity, an alternative is to ignore rates which occur with a probability below a predetermined probability threshold. For example, this probability threshold can be set to 0.05 (i.e. 5%), which is equivalent to ignoring a rate that serves only 5% of the total number of users. The probability threshold can also be made adaptive.
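The mappings of Figs. 5 and 7 can be sketched as follows (illustrative Python; the threshold value and the simple differencing of consecutive rates follow the examples above, and are not the only possible mapping):

def layer_allocation(ngf, threshold=0.05):
    # Keep only rates whose probability meets the threshold (cf. Fig. 7),
    # then allocate the lowest rate to the core layer and each further
    # rate difference to an enhancement layer (cf. Fig. 5).
    rates = [r for r, p in sorted(ngf.items()) if p >= threshold]
    core = rates[0]
    enhancements = [hi - lo for lo, hi in zip(rates, rates[1:])]
    return core, enhancements

print(layer_allocation({12: 0.40, 18: 0.03, 24: 0.37, 32: 0.20}))
# -> (12, [12, 8]); the 18 kbps rate, serving only 3% of users, is ignored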
If a model of the rate-distortion of the source encoder is available, as for instance the one given in Fig. 1, it is possible to analytically optimize the granularity of the codec with respect to network/channel conditions.
A model of the rate-distortion behavior of the source encoder is here taken in the general sense, i.e. the distortion may consist of objective distortion measures, such as signal-to-noise ratio, weighted signal-to-noise ratio, noise-to-mask ratios for audio coding, Peak Signal-to-Noise Ratio (PSNR) for video sequences, etc. One could also consider subjective distortion measures obtained by extensive encoder characterization.
The models considered here are step models, i.e. at each rate a number of delta-rate steps are available with the corresponding distortion measure. This can best be visualized as a tree structure where the root is the zero-rate, infinite-distortion point, see Fig. 8. Each tree node corresponds to the encoder rate and distortion for a given increase in the rate. The result is that an encoder granularity configuration corresponds to a path in the tree and a corresponding rate/distortion curve, see Fig. 9. The objective is therefore to find the path in the tree that leads to the best distortion for a given network granularity. This optimization depends on how one defines the overall distortion for a given network. Since the network granularity function is a discrete probability distribution, one can analytically define several overall distortions. An efficient way is to use the average user distortion, defined as:
AVERAGE_USER_DISTORTION = Σ_R NGF(R) × D(R)

where NGF(R) denotes the network granularity function and D(R) denotes the distortion at rate R.
In practice, an optimization algorithm is used in order to find the best frame format, i.e. granularity configuration, which leads to a minimum AVERAGE_USER_DISTORTION. As an example, a tree search algorithm can be used in order to explore the tree and find the optimal path.
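A brute-force sketch of this optimization is shown below (Python; the rate-distortion model D and the candidate rates are toy assumptions, and a real implementation would use the tree search mentioned above rather than enumerating all combinations):

import itertools

def average_user_distortion(ngf, config, D):
    # A user receiving rate R decodes up to the highest cumulative layer
    # rate in the configuration that does not exceed R.
    def distortion_at(R):
        usable = [r for r in config if r <= R]
        return D(max(usable)) if usable else float("inf")
    return sum(p * distortion_at(R) for R, p in ngf.items())

def best_configuration(ngf, candidate_rates, num_layers, D):
    return min(itertools.combinations(sorted(candidate_rates), num_layers),
               key=lambda cfg: average_user_distortion(ngf, cfg, D))

ngf = {12: 0.3, 24: 0.5, 32: 0.2}
D = lambda r: 2.0 ** (-r / 6.0)   # toy exponential rate-distortion model
print(best_configuration(ngf, [8, 12, 16, 24, 32], 3, D))   # (12, 24, 32)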
In the examples above it has been assumed that the encoder is flexible enough to allow granularity switching. In general it is possible to design an encoder such that this functionality is available. An example will be described with reference to Fig. 10 and 11.
Consider an encoder in the form of a successively refinable multistage vector quantizer. The structure of the encoder is depicted in Fig. 10. At each stage the quantizers Q_k, k = 1, ..., n, generate the codebook indices i_k that are transmitted to the decoder. The decoder in Fig. 11, upon reception of the indices, generates the reproduction vector x.
The encoder in Fig. 10 is typically operated in a fine-grain successive refinement mode. However, it is also possible to operate the encoder in a larger grained mode by jointly deriving the indices of two or more stages. The finest grain is obtained by generating indices that minimize the reproduction error at each stage, i.e. each i_k is chosen such that ||e_k||² is minimized; the grain size will therefore correspond to the size of the codebook at each stage. Each successive stage will therefore add a refinement of the previous stages. For larger grains, stages are grouped and indices are jointly optimized, for instance by grouping stage k-1 and stage k and jointly optimizing the indices i_{k-1} and i_k such that ||e_k||² is minimized. Here decoding will require both the indices i_{k-1} and i_k, and therefore the size of the additional refinement layer is increased.
Since the indices are jointly optimized, grouping stages will lead to smaller approximation errors. Thus, the codec can be switched to any desired granularity. An example of this procedure is illustrated in Fig. 12. This example corresponds to the frame structure of Fig. 5, with a core layer and two enhancement layers.
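The effect of grouping stages can be illustrated with a toy multistage VQ in Python (random illustrative codebooks, not the patent's actual quantizers); the joint two-stage search can never do worse than the greedy per-stage search, since the greedy index pair is one of the candidates it considers:

import numpy as np

def greedy_stage(x, codebook):
    # Fine grain: pick the codeword minimizing this stage's error alone.
    i = int(np.argmin(np.linalg.norm(x - codebook, axis=1)))
    return i, x - codebook[i]

def joint_two_stages(x, cb1, cb2):
    # Larger grain: jointly pick (i1, i2) minimizing the combined error.
    pairs = [(i, j) for i in range(len(cb1)) for j in range(len(cb2))]
    i, j = min(pairs, key=lambda p: np.linalg.norm(x - cb1[p[0]] - cb2[p[1]]))
    return (i, j), x - cb1[i] - cb2[j]

rng = np.random.default_rng(0)
x = rng.standard_normal(2)
cb1, cb2 = rng.standard_normal((4, 2)), rng.standard_normal((4, 2))
_, r1 = greedy_stage(x, cb1)
_, r2 = greedy_stage(r1, cb2)
_, rj = joint_two_stages(x, cb1, cb2)
print(np.linalg.norm(rj) <= np.linalg.norm(r2))   # always True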
So far, only the case where media encoding is performed on a source signal in real time has been discussed. However, often the media is already stored in a certain "encoded" format. In order to implement the adaptive scalability scheme, one possibility is to decode the "encoded" media to an intermediary format; for example, if the media is music, a suitable intermediary format would be raw PCM data. From this intermediary format, the media is encoded with the adaptive granularity codec. In this case the source signal is represented in the intermediary format.
Another, more computationally efficient way to implement the adaptive scalability scheme is to store the media directly in a pre-processed format. This format is stored in a "master file". The pre-processed format may contain the results of all the processing steps required for encoding, except the actual quantization/encoding steps. The quantization and final encoding steps are performed in real time, as described above, and made dependent on the network granularity function. In this case the source signal is represented by the master file.
In practice, at the receiver side the decoder has to be informed about which frame format or granularity configuration is in use. This information also needs to be available at network nodes that perform packet operations such as truncation. The granularity mode of a packet can for example be conveyed to the decoder in a mode field in the packet header. The mode field will indicate the structure of the selected layers, including the size of the various layers and the fine grain granularity. The actual transmitted information for the mode field may be a pointer to a receiver-acknowledged table of allowed granularity modes, or a general flexible description of the granularity mode, e.g. in the form of a list of layers, layer sizes and layer properties. The signaling of the mode field may also be implicit if the data corresponding to the different encoding layers is transmitted in different transport packets. The decoder can then correctly assign and decode the layers by using, for instance, packet time-stamp information.
The description above has been based on an embodiment where the encoder granularity was adapted to network/channel conditions. The following embodiment will instead describe network/channel adaptive layering for a conversational service.
The state-of-the-art solutions (see, for example, [25]) for scalable applications have been able to drop layers on the application side based on feedback or on external bandwidth information available from, e.g., hardware limitations. This control situation is depicted in Fig. 13, where an encoder 20 with fixed layering outputs a scalably encoded signal to a rate limiter 22 controlled by a rate controller 28. The deployment scenario is essentially an ad-hoc transport network which may or may not have media-aware "intelligent" network nodes. The conversational media stream will be subjected to scaling in media servers or media-aware network nodes. It will also be subjected to packet loss in the network transport (e.g. air interface loss) and to late loss in the receiver-side de-jitter buffers due to late arrival of media packets. A measurement unit 26 on the receiver side analyzes packet loss and jitter and feeds this information back to rate controller 28.
However, the scalability concept above will not be optimal for all channel conditions and deployment scenarios, since the layering structure is fixed and only the maximum rate is scaled based on congestion concerns. To overcome this shortcoming, an advanced flexible layering structure, illustrated in Fig. 14-16, is proposed. Here the layering structure of an encoder 30 is controlled by a scalability mode controller 34 based on feedback information received from a measurement unit 32 on the receiver side.
The layering structure A of Fig. 15 emphasizes robustness over quality when setting up the enhancement layers. Thus, enhancement layer L2A does not increase the decoded signal quality at low frame error rates over the core (R1) layer quality, but assists in maintaining near core layer quality as the frame error rate increases. Maximum quality is not increased until enhancement layer L3A is added.
On the other hand, the layering structure B of Fig. 16 emphasizes quality over robustness when setting up the enhancement layers. Thus, enhancement layer L2B increases the decoded signal quality at low frame error rates over the core layer quality. Robustness is increased when enhancement layer L3B is added.
In Fig. 15 and 16 it is noted that the core layer (R1) is the same, but that the functionality (robustness or quality enhancement) of the enhancement layers has been altered; the total gross rate is the same in both layering structures.
An embodiment of a flexible SNR, Bandwidth and Channel Robustness scalable encoder taking advantage of this flexible layering structure will now be described. This embodiment is implemented with building blocks from existing compression standards in order to make the description readily understandable. However, it should be kept in mind that this may not be the optimal use of the new scalability concept.
This embodiment will focus on:
• A description of an SNR, Bandwidth and 'Channel Robustness' scalable codec. Spatial scalability is not described.
• A description of two different scalability layering structures (A, B).
• A description of how to adapt between the layering structures, using a feedback channel to switch between scalability layering structure A and scalability layering structure B. The feedback channel may be based on out-of-band receiver estimations/reports to the sender (RTCP) or on in-band channel control signaling between receiver and transmitter.
The Adaptive Multi Rate-Narrow Band (AMR-NB) standard of the GSM system is a multimode codec supporting active speech rates of 12.2, 10.2, 7.95, 7.4, 6.7, 5.9, 5.15 and 4.75 kbps, a Silence Insertion Descriptor (SID) frame (1.75 kbps) and silence intervals (0 kbps). The AMR-NB standard may be transported using the RTP payload format described in RFC 3267. According to this RFC, the codec may transport redundant data (repeated old frames) as part of the payload, resulting in a repetition code that enhances robustness considerably.
The AMR-NB codec can be extended to a wideband (WB, 50 Hz-7.5 kHz bandwidth) codec using the existing transport format, by assigning the NB modes to separately encode a downsampled higher band (HB, 4-7 kHz) signal. For example, extending the coder with an additional layer coding the high band has been investigated in [13], where a coding concept called AMR-BWS (AMR BandWidth Scalable) was proposed. AMR-BWS is described as a pure bandwidth scalable codec.
Below, an extended AMR-NB codec supporting SNR, Bandwidth and Robustness Scalability is described with reference to Fig. 17 and 18. The core compression scheme is based on reusing the individual AMR-NB modes for SNR, Bandwidth and Channel Robustness purposes. This Flexible SCalable AMR codec is called FSC-AMR in the following description.
Fig. 17 is a block diagram of an embodiment of an encoder 30 in accordance with the present invention. A 16 kHz audio signal is forwarded to a band splitter 40, for example implemented by Quadrature Mirror Filters (QMF), which splits the signal into a narrow band (NB) signal occupying the frequency band 50 Hz-4 kHz and a higher band (HB) signal occupying the frequency band 4 kHz-7.5 kHz. The NB signal is encoded into a core layer signal by a coder 42. The HB signal is encoded into an enhancement layer signal by a coder 44. This enhancement signal is intended either for layer 2 or layer 3, as illustrated in Table 1 below. The core layer signal and NB signal are forwarded to an NB residual coder 46, which determines and encodes the residual signal from encoder 42. This coded signal may be used as a quality enhancement on layer 3 in scalability mode B, see Table 1. The core layer signal is also forwarded to a robustness/redundancy encoding module 48, which outputs a channel robustness layer for the narrow band. This may be used as layer 2 in scalability mode A, see Table 1. A scalability mode controller 50 controls the described encoders and a multiplexing module 52, which combines the layers into mode A or B.
TABLE 1
[Table 1 is reproduced as an image in the original publication; it lists the layer assignments of scalability modes A and B described in the surrounding text.]
As can be seen from Table 1, mode A is primarily a robustness enhancing mode, while mode B is a quality enhancement mode only. It is especially noted that the HB quality layer can be added either as layer 2 or layer 3, depending on the mode. The two modes are also illustrated in the quality versus frame error rate diagrams in Fig. 18 and 19.
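Based on the description above, the mode-dependent multiplexing of scalability modes A and B can be sketched as follows (Python; the layer payloads are opaque byte strings here, and the exact layer ordering is inferred from the surrounding text rather than taken from Table 1 itself):

def multiplex(mode, core_nb, hb_layer, nb_residual, nb_redundancy):
    # Mode A: robustness first -- core, NB redundancy (layer 2), HB (layer 3).
    if mode == "A":
        return [core_nb, nb_redundancy, hb_layer]
    # Mode B: quality first -- core, HB (layer 2), NB residual (layer 3).
    if mode == "B":
        return [core_nb, hb_layer, nb_residual]
    raise ValueError("unknown scalability mode: %r" % mode)

layers = multiplex("A", b"core", b"hb", b"resid", b"redund")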
Fig. 20 is a block diagram of an embodiment of a decoding apparatus 36 in accordance with the present invention. A demultiplexing module 60 receives the media data from the IP/UDP stack and separates the different layers to the respective decoders. A robustness/redundancy decoding module 62 determines and forwards a coded NB signal (core layer signal) to an NB decoder 64. As an example, if the core layer has been properly received, the core layer signal of the current frame is forwarded to decoder 64; otherwise a robustness/redundancy layer based signal is forwarded to decoder 64 (if present). An HB layer signal (if present) is forwarded to an HB decoder 66. An NB residual layer signal (if present) is forwarded to an NB residual decoder 68. The decoded residual signal is added to the decoded core layer signal in an adder 70, thereby forming a 50 Hz-4 kHz NB audio signal. This is combined with the decoded 4 kHz-7.5 kHz HB audio signal from decoder 66 in a band merger 72, which is typically formed by Quadrature Mirror Filters (QMF). A mode decomposition unit 74 controls the decoding modes of decoders 62, 64, 66, 68 depending on coding mode signals received from demultiplexing module 60. Unit 32 performs measurements and mode analysis on received frames to determine whether truncations and/or frame losses have been caused by the network, and reports the results back to the transmitter. Each packet has an application header containing the layering mode information; in this embodiment a 4-bit field according to Table 2 is used. By analyzing the difference between the mode total payload and the RTP payload size, the measurement and mode analysis module 32 can find out whether the packet has been truncated by the network or not.
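A hedged sketch of the truncation check performed by measurement unit 32 is given below (Python; the mode-to-payload-size table is hypothetical, standing in for the gross rates of Table 2):

MODE_TOTAL_BYTES = {"A": 60, "B": 60, "C": 80, "D": 80}   # assumed sizes

def analyze_packet(mode, rtp_payload_len):
    # Compare the payload size implied by the layering mode field in the
    # application header against the actual RTP payload size.
    removed = max(0, MODE_TOTAL_BYTES[mode] - rtp_payload_len)
    return {"truncated": removed > 0, "bytes_removed": removed}

print(analyze_packet("B", 52))   # {'truncated': True, 'bytes_removed': 8}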
In Table 2 the above concepts have been extended to more than two modes. For example, in modes C and D the quality has been increased by increasing the bit-rate of the core layer and one of the enhancement layers. As another example, in mode O the functionality of the second enhancement layer has been switched to spatial enhancement instead of bandwidth enhancement.
TABLE 2
[Table 2 is reproduced as an image in the original publication; it defines the 4-bit layering-mode field, covering modes A and B as well as the extended modes discussed above.]
Using out-of-band signaling means that the receiver informs the transmitter regularly of the current channel conditions. For example, the Real Time Transport Control Protocol (RTCP) can be used to send the frame loss rate for the last second (50 frames). The loss rate statistics can then be used by the transmitter to enforce a more robust mode.
Below is an example scalability adaptation scheme based on the reported loss rate lr and the reported inter-arrival jitter value ij (a code sketch of these rules follows the list):
• An optimistic application starts up in the SNR-optimized Scalability Mode B.
• A loss rate lr larger than 2% for 10 measurement periods will make the transmitter use a more robust mode, e.g. Mode A.
• If the lr values are still higher than 2% after switching to Mode A, the transmitter will scale its packets by removing the BW(HB) layer to reduce the transmission rate further.
• When the loss rate has been lower than 1% for 30 measurement periods, the transmitter may probe the channel by increasing the rate using a robust mode with a slightly higher bit-rate, e.g. Mode C.
• Successful use of the higher-rate robust mode for a given measurement time with non-increasing jitter ij will enable the use of an SNR-optimized mode like Scalability Mode B.
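The rules above can be condensed into a small controller sketch (Python; the mode names follow the text, while the exact counter-reset behavior and the probing condition are assumptions made for illustration):

class ScalabilityModeController:
    def __init__(self):
        self.mode = "B"          # optimistic start in SNR-optimized mode
        self.high_loss_periods = 0
        self.low_loss_periods = 0

    def update(self, lr, jitter_increasing):
        # Track consecutive measurement periods above/below the thresholds.
        self.high_loss_periods = self.high_loss_periods + 1 if lr > 0.02 else 0
        self.low_loss_periods = self.low_loss_periods + 1 if lr < 0.01 else 0
        if self.mode == "B" and self.high_loss_periods >= 10:
            self.mode = "A"      # switch to the robust mode
        elif self.mode == "A" and self.high_loss_periods >= 10:
            self.mode = "A-noHB" # also drop the BW(HB) layer
        elif self.mode in ("A", "A-noHB") and self.low_loss_periods >= 30:
            self.mode = "C"      # probe with a higher-rate robust mode
        elif (self.mode == "C" and self.low_loss_periods >= 30
                and not jitter_increasing):
            self.mode = "B"      # return to the SNR-optimized mode
        return self.mode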
If RTCP Extended Reports (RTCP-XR) are used, the receiver may additionally send an application-specific report parameter showing how much bit-rate capacity (kbps) the media-aware network has removed (scaled down). RTCP operation is described in the RTP specification [2]. RTCP-XR reports are described in [19].
It is also possible to signal a scalability mode change using an in-band channel. Assume that the payload format for the flexible scalable codec includes a header with additional room for an in-band channel, as indicated by Fig. 21. The in-band channel may be relative to the transmitted mode or absolute. Possible relative in-band parameters are rate increase/decrease, audio bandwidth increase/decrease, channel robustness increase/decrease, and SNR increase/decrease. For simplicity an absolute in-band channel is used in this embodiment (in a similar way as the rate control in-band channel used in AMR/RFC 3267).
Assigning 4 bits (b4-b7) (4 bits/20 ms = 0.2 kbps) to the in-band channel according to Table 3 enables the receiver to request that the transmitter adjust its encoding properties. The receiver may base a scalability mode request on any information available in the receiver, e.g. frame loss (%), late loss (%), jitter (ms), jitter buffer depth, frame error distribution (binary sequence), removed bandwidth (kbps), or external information about the transport channel. As an example embodiment, the very same control as described for the out-of-band signaling may be performed on the receiving side, with the difference that the resulting mode is requested via the in-band channel.
TABLE 3
[Table 3 is reproduced as an image in the original publication; it defines the 4-bit in-band channel codes used to request scalability modes.]
In the previous sections, example methods for providing a flexible real-time coding scheme were described. To enable similar functionality for on-demand media delivery, one can encode the layers separately and let the media server multiplex them together at delivery time. E.g., in the case of a voice signal the media server would have the layers (core (R1), L2A, L2B=L3A, L3B, see Fig. 15 and 16) available in a pre-encoded repository. At the time of media delivery the media server would multiplex the correct scalability configuration to the receiver. This is analogous to the provision of an intermediate resolution format as described in the adaptive granularity section, but in this case the intermediate format would consist of the sum of all available layers in the pre-encoded configurations.
The functionality of the various blocks of the encoder and decoder is typically achieved by one or several microprocessors or micro/signal processor combinations and corresponding software.
With the described solutions the encoder is able to maximize both the functionality of the bitstream and its coding efficiency, thus delivering a bitstream that is both robust and efficient.
It will be understood by those skilled in the art that various modifications and changes may be made to the present invention without departure from the scope thereof, which is defined by the appended claims.

REFERENCES
[1] ITU-T and ISO/IEC JTC 1, "Advanced Video Coding for Generic Audiovisual Services", ITU-T Recommendation H.264 & ISO/IEC 14496-10 (MPEG-4 AVC), version 3, 2005.
[2] T. Wiegand et al., "Overview of the H.264/AVC video coding standard", IEEE Trans. CSVT, vol. 13 (7), pp. 560-576, July 2003.
[3] ISO/IEC JTC1/SC29/WG11, "MPEG Press Release", ISO/IEC JTC1/SC29/WG11 Doc. N6874, Hong Kong (China), Jan. 2005.
[4] ITU-T and ISO/IEC JTC 1, "Working Draft 1.0 of 14496-10:200x/AMD 1 Scalable Video Coding", Joint Video Team of ITU-T VCEG and ISO/IEC MPEG, JVT-N022, Hong Kong, Jan. 2005.
[5] S. Wenger, M. M. Hannuksela, T. Stockhammer, M. Westerlund, and D. Singer, "RTP Payload Format for H.264 Video", RFC 3984, February 2005.
[6] ISO/IEC JTC 1/SC 29/WG 11/M11657, "Performance and functionality of existing MPEG-4 technology in the context of CfI on Scalable Speech and Audio Coding", Jan. 2005.
[7] H. Schwarz, D. Marpe, and T. Wiegand, "MCTF and Scalability Extension of H.264/AVC", Proc. of PCS, San Francisco, CA, USA, Dec. 2004.
[8] H. Schwarz, D. Marpe, and T. Wiegand, "Combined Scalability Extension of H.264/AVC", submitted to ICIP'05.
[9] ISO/IEC TR 21000-1:2004, Information technology — Multimedia framework (MPEG-21) — Part 1: Vision, Technologies and Strategy.
[10] ISO/IEC 21000-7:2004, Information technology — Multimedia framework (MPEG-21) — Part 7: Digital Item Adaptation.
[11] ISO/IEC 15938-5, "Multimedia Content Description Interface — Part 5: MDS", 2002.
[12] ISO/IEC 15938-3, "Multimedia Content Description Interface — Part 3: Visual", 2002.
[13] H. Dong, J. D. Gibson, and M. G. Kokes, "SNR and bandwidth scalable speech coding", Circuits and Systems, ISCAS 2002.
[14] McCree et al., "Efficient, CELP-based diversity schemes for VoIP", ICASSP 2000.
[15] McCree et al., "An embedded adaptive multi-rate wideband speech coder", ICASSP 2001.
[16] Koishida et al., "A 16-kbit/s bandwidth scalable audio coder based on the G.729 standard", ICASSP 2000.
[17] Sjöberg et al., "Real-Time Transport Protocol (RTP) Payload Format and File Storage Format for the Adaptive Multi-Rate (AMR) and Adaptive Multi-Rate Wideband (AMR-WB) Audio Codecs", RFC 3267, IETF, June 2002.
[18] Schulzrinne et al., "RTP Profile for Audio and Video Conferences with Minimal Control", RFC 3551, IETF, July 2003.
[19] Friedman et al., "RTP Control Protocol Extended Reports", RFC 3611, IETF, November 2003.
[20] Lefebvre, "Design compromises for SPC", ICASSP 2004.
[21] H. Dong et al., "Multiple description speech coder based on AMR-WB for mobile ad-hoc networks", ICASSP 2004.
[22] M. Chibani, P. Gournay, and R. Lefebvre, "Increasing the robustness of CELP-based coders by constrained optimization", ICASSP 2005.
[23] Leung et al., "Speech coding over frame relay networks", IEEE WS on SPC, 1993.
[24] Herre, "Overview of MPEG-4 audio and its applications in mobile communications", ICCT 2000.
[25] Kovesi, "A scalable speech and audio coding scheme with continuous bit-rate flexibility", ICASSP 2004.
[26] Johansson et al., "Bandwidth Efficient AMR Operation for VoIP", IEEE WS on SPC, 2002.
[27] Recchione, "The Enhanced Variable Rate Coder: Toll Quality Speech for CDMA", Journal of Speech Technology, 1999.
[28] Uvliden et al., "Adaptive Multi-Rate — A speech service adapted to cellular radio network quality", Asilomar, 1998.
[29] Chen et al., "Experiments on QoS Adaptation for Improving End User Speech Perception Over Multi-hop Wireless Networks", ICC, 1999.
[30] Sang-Jo Lee, Sang-Wook Kim, and Eunmi Oh, "A real-time audio streaming method for time-varying network loads", 112th AES Convention, May 2002, Munich, Germany, Paper 5515.
[31] Kevin Lai and Mary Baker, "Measuring Bandwidth", Proceedings of IEEE INFOCOM '99, March 1999, Volume 1, pp. 235-245.
[32] US Patent No. 6,789,123.

Claims

1. A signal encoding method, including adapting the layering structure of a scalable source signal encoder to network/channel conditions.
2. The method of claim 1, including adapting the granularity of layers produced by said scalable source signal encoder to network/channel conditions.
3. The method of claim 2, including the step of forming a network granularity function from receiver reports describing network throughput, said network granularity function being used for adapting the granularity of layers produced by said scalable source signal encoder to network/channel conditions.
4. The method of claim 1, including adapting the functionality of layers produced by said scalable source signal encoder to network/channel conditions.
5. The method of claim 4, including the step of receiving network/channel condition information on a feedback channel, said information being used for adapting the functionality of layers produced by said scalable source signal encoder to network/channel conditions.
6. The method of claim 5, wherein said feedback channel is an out-of-band channel.
7. The method of claim 5, wherein said feedback channel is an in-band channel.
8. A signal decoding method, including the steps of receiving a signal with time-varying layering structure; receiving a time-varying control signal describing said time-varying layering structure; and using said control signal to adapt decoding of said received signal to the appropriate layering structure.
9. A signal encoding apparatus, including means (16, 50) for adapting the layering structure of a scalable source signal encoder (12, 26) to network/channel conditions.
10. The encoding apparatus of claim 9, including a network monitoring unit (16) for adapting the granularity of layers produced by said scalable source signal encoder (12) to network/channel conditions.
11. The encoding apparatus of claim 10, wherein said network monitoring unit (16) is adapted to form a network granularity function from receiver reports describing network throughput, said network granularity function being used for adapting the granularity of layers produced by said scalable source signal encoder (12) to network/channel conditions.
12. The encoding apparatus of claim 9, including a scalability mode controller (50) for adapting the functionality of layers produced by said scalable source signal encoder (30) to network/channel conditions.
13. The encoding apparatus of claim 12, including a scalability mode controller (34) for receiving network/channel condition information on a feedback channel and for adapting the functionality of layers produced by said scalable source signal encoder to network/channel conditions based on said received network/channel condition information.
14. A signal decoding apparatus, including means (60) for receiving a signal with time-varying layering structure; means (74) for receiving a time-varying control signal describing said time-varying layering structure; decoder means (62, 64, 66, 68, 70, 72) using said control signal to adapt decoding of said received signal to the appropriate layering structure.
15. A scalable media signal having a time-varying layering structure adapted to network/channel conditions.
16. The signal of claim 15, wherein the layer granularity is adapted to network/channel conditions.
17. The signal of claim 16, wherein the layer functionality is adapted to network/channel conditions.
PCT/SE2006/000340 2005-09-23 2006-03-16 Adaptive source signal encoding WO2007035147A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US71955305P 2005-09-23 2005-09-23
US60/719,553 2005-09-23

Publications (1)

Publication Number Publication Date
WO2007035147A1 true WO2007035147A1 (en) 2007-03-29

Family

ID=36293498

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/SE2006/000340 WO2007035147A1 (en) 2005-09-23 2006-03-16 Adaptive source signal encoding

Country Status (1)

Country Link
WO (1) WO2007035147A1 (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070291835A1 (en) * 2006-06-16 2007-12-20 Samsung Electronics Co., Ltd Encoder and decoder to encode signal into a scable codec and to decode scalable codec, and encoding and decoding methods of encoding signal into scable codec and decoding the scalable codec
WO2008114090A2 (en) * 2007-03-20 2008-09-25 Skype Limited Method of transmitting data in a communication system
EP2046041A1 (en) * 2007-10-02 2009-04-08 Alcatel Lucent Multicast router, distribution system,network and method of a content distribution
CN102572420A (en) * 2010-12-22 2012-07-11 北京大学 Method, system and device for dynamic packet loss control based on scalable video coding
KR20140036492A (en) * 2012-09-17 2014-03-26 에스케이플래닛 주식회사 Network connection control system and method for voice coding the same
WO2014046944A1 (en) * 2012-09-21 2014-03-27 Dolby Laboratories Licensing Corporation Methods and systems for selecting layers of encoded audio signals for teleconferencing
US8891619B2 (en) 2008-06-16 2014-11-18 Dolby Laboratories Licensing Corporation Rate control model adaptation based on slice dependencies for video coding
EP2617139A4 (en) * 2010-09-17 2017-11-15 Intel Corporation Cqi feedback mechanisms for distortion-aware link adaptation toward enhanced multimedia communications
US9882818B2 (en) 2013-09-30 2018-01-30 Apple Inc. Adjusting a jitter buffer based on inter arrival jitter
WO2021015749A1 (en) * 2019-07-23 2021-01-28 Google Llc Radio frequency condition aware audio buffering

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2002052860A1 (en) * 2000-12-22 2002-07-04 Pa Consulting Services Limited Video layer mapping
US20030195977A1 (en) * 2002-04-11 2003-10-16 Tianming Liu Streaming methods and systems
US6728775B1 (en) * 1997-03-17 2004-04-27 Microsoft Corporation Multiple multicasting of multimedia streams
WO2004040830A1 (en) * 2002-10-31 2004-05-13 Nokia Corporation Variable rate speech codec
WO2005043882A2 (en) * 2003-10-21 2005-05-12 Prismvideo, Inc Video source coding with side information

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6728775B1 (en) * 1997-03-17 2004-04-27 Microsoft Corporation Multiple multicasting of multimedia streams
WO2002052860A1 (en) * 2000-12-22 2002-07-04 Pa Consulting Services Limited Video layer mapping
US20030195977A1 (en) * 2002-04-11 2003-10-16 Tianming Liu Streaming methods and systems
WO2004040830A1 (en) * 2002-10-31 2004-05-13 Nokia Corporation Variable rate speech codec
WO2005043882A2 (en) * 2003-10-21 2005-05-12 Prismvideo, Inc Video source coding with side information

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
SAFLEKOS A ET AL: "A REGION-BASED CODING SCHEME FOR THE TRANSMISSION OF VIDEO SEQUENCES VIA CHANNELS OF VARYING VERY LOW BIT-RATE", PROCEEDINGS OF THE SPIE, SPIE, BELLINGHAM, VA, US, vol. 3024, 12 February 1997 (1997-02-12), pages 723 - 730, XP000783880, ISSN: 0277-786X *
WANG M., KUO G.S.: "A novel reliable MPEG4-based packetisation scheme over UMTS networks", MULTIMEDIA SYSTEMS AND APPLICATIONS VII, PROCEEDINGS OF SPIE, vol. 5600, 2004, Bellingham, WA, pages 137 - 148, XP002381359, Retrieved from the Internet <URL:http://spiedl.aip.org/getpdf/servlet/GetPDFServlet?filetype=pdf&id=PSISDG005600000001000137000001&idtype=cvips&prog=normal> [retrieved on 20060517] *
WEN JIN ET AL: "A scalable subband audio coding scheme for ATM environments", PROCEEDINGS IEEE SOUTHEASTCON 2001. ENGINEERING THE FUTURE. CLEMSON, SC, MARCH 30 - APRIL 1, 2001, IEEE SOUTHEASTCON, NEW YORK, NY : IEEE, US, 30 March 2001 (2001-03-30), pages 271 - 275, XP010542621, ISBN: 0-7803-6748-0 *

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9094662B2 (en) * 2006-06-16 2015-07-28 Samsung Electronics Co., Ltd. Encoder and decoder to encode signal into a scalable codec and to decode scalable codec, and encoding and decoding methods of encoding signal into scalable codec and decoding the scalable codec
US20070291835A1 (en) * 2006-06-16 2007-12-20 Samsung Electronics Co., Ltd Encoder and decoder to encode signal into a scable codec and to decode scalable codec, and encoding and decoding methods of encoding signal into scable codec and decoding the scalable codec
US8787490B2 (en) 2007-03-20 2014-07-22 Skype Transmitting data in a communication system
WO2008114090A2 (en) * 2007-03-20 2008-09-25 Skype Limited Method of transmitting data in a communication system
WO2008114090A3 (en) * 2007-03-20 2009-06-04 Skype Ltd Method of transmitting data in a communication system
US8279968B2 (en) 2007-03-20 2012-10-02 Skype Method of transmitting data in a communication system
EP2046041A1 (en) * 2007-10-02 2009-04-08 Alcatel Lucent Multicast router, distribution system,network and method of a content distribution
US8891619B2 (en) 2008-06-16 2014-11-18 Dolby Laboratories Licensing Corporation Rate control model adaptation based on slice dependencies for video coding
EP2617139A4 (en) * 2010-09-17 2017-11-15 Intel Corporation Cqi feedback mechanisms for distortion-aware link adaptation toward enhanced multimedia communications
CN102572420B (en) * 2010-12-22 2014-04-02 北京大学 Method, system and device for dynamic packet loss control based on scalable video coding
CN102572420A (en) * 2010-12-22 2012-07-11 北京大学 Method, system and device for dynamic packet loss control based on scalable video coding
KR20140036492A (en) * 2012-09-17 2014-03-26 에스케이플래닛 주식회사 Network connection control system and method for voice coding the same
KR101978291B1 (en) 2012-09-17 2019-05-14 에스케이플래닛 주식회사 Network connection control system and method for voice coding the same
WO2014046944A1 (en) * 2012-09-21 2014-03-27 Dolby Laboratories Licensing Corporation Methods and systems for selecting layers of encoded audio signals for teleconferencing
US9460729B2 (en) 2012-09-21 2016-10-04 Dolby Laboratories Licensing Corporation Layered approach to spatial audio coding
US9495970B2 (en) 2012-09-21 2016-11-15 Dolby Laboratories Licensing Corporation Audio coding with gain profile extraction and transmission for speech enhancement at the decoder
US9502046B2 (en) 2012-09-21 2016-11-22 Dolby Laboratories Licensing Corporation Coding of a sound field signal
US9858936B2 (en) 2012-09-21 2018-01-02 Dolby Laboratories Licensing Corporation Methods and systems for selecting layers of encoded audio signals for teleconferencing
US9882818B2 (en) 2013-09-30 2018-01-30 Apple Inc. Adjusting a jitter buffer based on inter arrival jitter
WO2021015749A1 (en) * 2019-07-23 2021-01-28 Google Llc Radio frequency condition aware audio buffering

Similar Documents

Publication Publication Date Title
WO2007035147A1 (en) Adaptive source signal encoding
US11489938B2 (en) Method and system for providing media content to a client
FI109393B (en) Method for encoding media stream, a scalable and a terminal
CA2372228C (en) Data transmission
US20060088094A1 (en) Rate adaptive video coding
US20110274180A1 (en) Method and apparatus for transmitting and receiving layered coded video
EP1594287A1 (en) Method, apparatus and medium for providing multimedia service considering terminal capability
US9818422B2 (en) Method and apparatus for layered compression of multimedia signals for storage and transmission over heterogeneous networks
US7994946B2 (en) Systems and methods for scalably encoding and decoding data
Hellerud et al. Spatial redundancy in Higher Order Ambisonics and its use for lowdelay lossless compression
TW201212006A (en) Full-band scalable audio codec
Järvinen et al. Media coding for the next generation mobile system LTE
KR100952185B1 (en) System and method for drift-free fractional multiple description channel coding of video using forward error correction codes
Kovesi et al. A scalable speech and audio coding scheme with continuous bitrate flexibility
US7532672B2 (en) Codecs providing multiple bit streams
Walker et al. Mobile video-streaming
WO2007035151A1 (en) Media stream scaling
Wang et al. A framework for robust and scalable audio streaming
Leslie et al. Packet loss resilient, scalable audio compression and streaming for IP networks
US20230047127A1 (en) Method and system for providing media content to a client
Wagner et al. Playback delay optimization in scalable video streaming
Wagner et al. Playback delay and buffering optimization in scalable video broadcasting
Fitzek et al. Application of Multiple Description Coding in 4G Wireless Communication Systems
Patrikakis et al. Network support mechanisms for scalable media streaming
Colonnese et al. Modeling of H. 264 video sources performing bitstream switching

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
DPE1 Request for preliminary examination filed after expiration of 19th month from priority date (pct application filed from 20040101)
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 06717025

Country of ref document: EP

Kind code of ref document: A1