EP1952631A1 - System and method for scalable, low-delay videoconferencing using scalable video coding - Google Patents

System and method for scalable, low-delay videoconferencing using scalable video coding

Info

Publication number
EP1952631A1
Authority
EP
European Patent Office
Prior art keywords
base layer
layer
terminal
layers
picture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP06788106A
Other languages
German (de)
English (en)
Other versions
EP1952631A4 (fr)
Inventor
Reha Civanlar
Alexandros Eleftheriadis
Danny Hong
Ofer Shapiro
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Vidyo Inc
Original Assignee
Vidyo Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Vidyo Inc filed Critical Vidyo Inc
Publication of EP1952631A1 (patent/EP1952631A1/fr)
Publication of EP1952631A4 (patent/EP1952631A4/fr)
Legal status: Withdrawn

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00 Television systems
    • H04N7/14 Systems for two-way working
    • H04N7/141 Systems for two-way working between two video terminals, e.g. videophone
    • H04N7/147 Communication arrangements, e.g. identifying the communication as a video-communication, intermediate storage of the signals
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/134 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/164 Feedback from the receiver or from the transmission channel
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/30 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using hierarchical techniques, e.g. scalability
    • H04N19/31 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using hierarchical techniques, e.g. scalability in the temporal domain
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/30 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using hierarchical techniques, e.g. scalability
    • H04N19/36 Scalability techniques involving formatting the layers as a function of picture distortion after decoding, e.g. signal-to-noise [SNR] scalability
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/30 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using hierarchical techniques, e.g. scalability
    • H04N19/37 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using hierarchical techniques, e.g. scalability with arrangements for assigning different transmission priorities to video input data or to video coded data
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/60 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding
    • H04N19/61 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding in combination with predictive coding
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234 Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/2343 Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
    • H04N21/234327 Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements by decomposing into layers, e.g. base layer and one or more enhancement layers
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/25 Management operations performed by the server for facilitating the content distribution or administrating data related to end-users or client devices, e.g. end-user or client device authentication, learning user preferences for recommending movies
    • H04N21/266 Channel or content management, e.g. generation and management of keys and entitlement messages in a conditional access system, merging a VOD unicast channel into a multicast channel
    • H04N21/2662 Controlling the complexity of the video stream, e.g. by scaling the resolution or bitrate of the video stream based on the client capabilities
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47 End-user applications
    • H04N21/478 Supplemental services, e.g. displaying phone caller identification, shopping application
    • H04N21/4788 Supplemental services, e.g. displaying phone caller identification, shopping application communicating with other users, e.g. chatting
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/60 Network structure or processes for video distribution between server and client or between remote clients; Control signalling between clients, server and network components; Transmission of management data between server and client, e.g. sending from server to client commands for recording incoming content stream; Communication details between server and client
    • H04N21/63 Control signaling related to video distribution between client, server and network components; Network processes for video distribution between server and clients or between remote clients, e.g. transmitting basic layer and enhancement layers over different transmission paths, setting up a peer-to-peer communication via Internet between remote STB's; Communication protocols; Addressing
    • H04N21/631 Multimode Transmission, e.g. transmitting basic layers and enhancement layers of the content over different transmission paths or transmitting with different error corrections, different keys or with different transmission protocols
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/60 Network structure or processes for video distribution between server and client or between remote clients; Control signalling between clients, server and network components; Transmission of management data between server and client, e.g. sending from server to client commands for recording incoming content stream; Communication details between server and client
    • H04N21/63 Control signaling related to video distribution between client, server and network components; Network processes for video distribution between server and clients or between remote clients, e.g. transmitting basic layer and enhancement layers over different transmission paths, setting up a peer-to-peer communication via Internet between remote STB's; Communication protocols; Addressing
    • H04N21/647 Control signaling between network components and server or clients; Network processes for video distribution between server and clients, e.g. controlling the quality of the video stream, by dropping packets, protecting content from unauthorised alteration within the network, monitoring of network load, bridging between two different networks, e.g. between IP and wireless
    • H04N21/64746 Control signals issued by the network directed to the server or the client
    • H04N21/64753 Control signals issued by the network directed to the server or the client directed to the client
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/60 Network structure or processes for video distribution between server and client or between remote clients; Control signalling between clients, server and network components; Transmission of management data between server and client, e.g. sending from server to client commands for recording incoming content stream; Communication details between server and client
    • H04N21/63 Control signaling related to video distribution between client, server and network components; Network processes for video distribution between server and clients or between remote clients, e.g. transmitting basic layer and enhancement layers over different transmission paths, setting up a peer-to-peer communication via Internet between remote STB's; Communication protocols; Addressing
    • H04N21/647 Control signaling between network components and server or clients; Network processes for video distribution between server and clients, e.g. controlling the quality of the video stream, by dropping packets, protecting content from unauthorised alteration within the network, monitoring of network load, bridging between two different networks, e.g. between IP and wireless
    • H04N21/64784 Data processing by the network
    • H04N21/64792 Controlling the complexity of the content stream, e.g. by dropping packets
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00 Television systems
    • H04N7/14 Systems for two-way working
    • H04N7/15 Conference systems

Definitions

  • the present invention relates to multimedia and telecommunications technology.
  • the invention relates to systems and methods for videoconferencing between user endpoints with diverse access equipment or terminals, and over inhomogeneous network links.
  • Videoconferencing systems allow two or more remote participants/endpoints to communicate with each other in real time using both audio and video.
  • direct transmission of communications over suitable electronic networks between the two endpoints can be used.
  • a Multipoint Conferencing Unit (MCU), or bridge, is commonly used to connect all the participants/endpoints.
  • the MCU mediates communications between the multiple participants/endpoints, which may be connected, for example, in a star configuration.
  • the participants/endpoints or terminals are equipped with suitable encoding and decoding devices.
  • An encoder formats local audio and video output at a transmitting endpoint into a coded form suitable for signal transmission over the electronic communication network.
  • a decoder processes a received signal, which has encoded audio and video information, into a decoded form suitable for audio playback or image display at a receiving endpoint.
  • an end-user's own image is also displayed on his/her screen to provide feedback (to ensure, for example, proper positioning of the person within the video window).
  • the quality of an interactive videoconference between remote participants is determined by end-to-end signal delays.
  • End-to-end delays of greater than 200ms prevent realistic live or natural interactions between the conferencing participants.
  • Such long end-to-end delays cause the videoconferencing participants to unnaturally restrain themselves from actively participating or responding in order to allow in-transit video and audio data from other participants to arrive at their endpoints.
  • the end-to-end signal delays include acquisition delays (e.g., the time it takes to fill up a buffer in an A/D converter), coding delays, transmission delays (the time it takes to submit a packet-full of data to the network interface controller of an endpoint), and transport delays (the time a packet travels in a communication network from endpoint to endpoint). Additionally, signal-processing times through mediating MCUs contribute to the total end-to-end delay in the given system.
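As a rough illustration of how these components add up against the 200 ms interactivity threshold discussed above, consider the following sketch. All numbers are hypothetical placeholders chosen for the example, not values stated in this document.

```python
# Illustrative end-to-end delay budget (all figures hypothetical).
delay_ms = {
    "acquisition": 20,     # e.g., filling a buffer in an A/D converter
    "coding": 30,          # encoder processing time
    "transmission": 10,    # handing a packet-full of data to the NIC
    "transport": 80,       # network travel time from endpoint to endpoint
    "mcu_processing": 40,  # decode/mix/re-encode at a conventional MCU
}

total = sum(delay_ms.values())
print(total)  # 180 (ms), already close to the 200 ms interactivity limit

# A zero-delay switch that forwards layers instead of transcoding would
# remove the MCU processing term from the budget.
total_without_mcu = total - delay_ms["mcu_processing"]
print(total_without_mcu)  # 140 (ms)
```

Even with these optimistic placeholder numbers, the MCU's signal-processing time is a significant fraction of the total, which motivates the transcoding-free server designs described later.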
  • An MCU's primary tasks are to mix the incoming audio signals so that a single audio stream is transmitted to all participants, and to mix video frames or pictures transmitted by individual participants/endpoints into a common composite video frame stream, which includes a picture of each participant.
  • the terms frame and picture are used interchangeably herein; further, coding of interlaced frames as individual fields or as combined frames (field-based or frame-based picture coding) can be incorporated, as is apparent to persons skilled in the art.
  • the MCUs deployed in conventional communication network systems offer only a single common resolution (e.g., CIF or QCIF) for all the individual pictures mixed into the common composite video frame distributed to all participants in a videoconferencing session.
  • conventional communication network systems do not readily provide customized videoconferencing functionality by which a participant can view other participants at different resolutions.
  • Such desirable functionality allows the participant, for example, to view another specific participant (e.g., a speaking participant) in CIF resolution and view other, silent participants in QCIF resolution.
  • MCUs can be configured to provide this desirable functionality by repeating the video mixing operation as many times as the number of participants in a videoconference.
  • the MCU operations introduce considerable end-to-end delay.
  • the MCU must have sufficient digital signal processing capability to decode multiple audio streams, mix, and re-encode them, and also to decode multiple video streams, composite them into a single frame (with appropriate scaling as needed), and re-encode them again into a single stream.
  • Video conferencing solutions (such as the systems commercially marketed by Polycom Inc., 4750 Willow Road, Pleasanton, CA 94588, and Tandberg, 200 Park Avenue, New York, NY 10166) must use dedicated hardware components to provide acceptable quality and performance levels. Both the performance and the quality delivered by a videoconferencing solution are also a strong function of the underlying communication network over which it operates. Videoconferencing solutions that use ITU H.261, H.263, and H.264 standard video codecs require a robust communication channel with little or no loss to deliver acceptable quality. The required communication channel transmission speeds or bitrates can range from 64 Kbps up to several Mbps.
  • the communication networks on which videoconferencing solutions are implemented can be categorized as providing two basic communication channel architectures.
  • a guaranteed quality of service (QoS) channel is provided via a dedicated direct or switched connection between two points (e.g., ISDN connections, T1 lines, and the like).
  • the communication channels do not guarantee QoS, but are only "best-effort" packet delivery channels such as those used in Internet Protocol (IP)-based networks (e.g., Ethernet LANs).
  • IP-based networks typically operate on a best-effort basis, i.e., there is no guarantee that packets will reach their destination, or that they will arrive in the order they were transmitted.
  • the techniques may include protocols such as DiffServ, for specifying and controlling network traffic by class so that certain types of traffic get precedence, and RSVP, for reserving network resources along a path. These protocols can ensure certain bandwidths and/or delays for portions of the available bandwidth.
  • Techniques such as forward error correction (FEC) and automatic repeat request (ARQ) mechanisms may also be used to improve recovery mechanisms for lost packet transmissions and to mitigate the effects of packet loss.
  • Standard video codecs, such as the H.261 and H.263 codecs designated for videoconferencing and the MPEG-1 and MPEG-2 Main Profile codecs designated for Video CDs and DVDs, respectively, are designed to provide a single bitstream ("single-layer") at a fixed bitrate. Some of these codecs may be deployed without rate control to provide a variable bitrate stream (e.g., MPEG-2, as used in DVDs). However, in practice, even without rate control, a target operating bitrate is established depending on the specific infrastructure.
  • the standard video codecs are based on "single-layer" coding techniques, which are inherently incapable of exploiting the differentiated QoS capabilities provided by modern communication networks.
  • An additional limitation of the single-layer coding techniques for video communications is that even if a lower spatial resolution display is required or desired in an application, a full resolution signal must be received and decoded with downscaling performed at a receiving endpoint or MCU. This wastes bandwidth and computational resources.
  • in contrast to the aforementioned single-layer video codecs, in scalable video codecs based on "multi-layer" coding techniques, two or more bitstreams are generated for a given source video signal: a base layer and one or more enhancement layers.
  • the base layer may be a basic representation of the source signal at a minimum quality level. The minimum quality representation may be reduced in the SNR (quality), spatial, or temporal resolution aspects or a combination of these aspects of the given source video signal.
  • the one or more enhancement layers correspond to information for increasing the quality of the SNR (quality), spatial, or temporal resolution aspects of the base layer.
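The base-plus-enhancements structure just described can be sketched as follows. The layer names, the QCIF base format, and the doubling factors are illustrative assumptions, not values taken from this document.

```python
from dataclasses import dataclass

# Hypothetical multi-layer representation: each enhancement layer refines
# one aspect (SNR, spatial, or temporal) of the base layer.
@dataclass
class Layer:
    name: str
    kind: str  # "base", "snr", "spatial", or "temporal"

def decoded_format(layers, base=(176, 144, 7.5)):
    """Width, height, fps after decoding the base plus given enhancements."""
    w, h, fps = base  # QCIF at 7.5 fps as an assumed minimum-quality base
    for layer in layers:
        if layer.kind == "spatial":
            w, h = 2 * w, 2 * h      # e.g., QCIF -> CIF
        elif layer.kind == "temporal":
            fps *= 2                 # doubles the frame rate
    return w, h, fps

stream = [Layer("L0", "base"), Layer("S0", "spatial"), Layer("T0", "temporal")]
print(decoded_format(stream[:1]))   # (176, 144, 7.5)
print(decoded_format(stream))       # (352, 288, 15.0)
```

A recipient decoding only the base layer gets the minimum-quality signal; each additional layer it receives improves one resolution aspect.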
  • Scalable video codecs have been developed in view of heterogeneous network environments and/or heterogeneous receivers.
  • the base layer can be transmitted using a reliable channel, i.e., a channel with guaranteed Quality of Service (QoS).
  • Enhancement layers can be transmitted with reduced or no QoS. The effect is that recipients are guaranteed to receive a signal with at least a minimum level of quality (the base layer signal).
  • a small picture size signal may be transmitted to, e.g., a portable device, and a full size picture may be transmitted to a system equipped with a large display.
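The channel-assignment idea above, base layer over a guaranteed-QoS channel and enhancement layers over best-effort transport, can be sketched in a few lines. The layer and channel names here are hypothetical labels for the example.

```python
# Base layer -> guaranteed channel; enhancements -> best-effort transport.
layers = ["base", "snr_enh_1", "temporal_enh_1"]

def assign_channel(layer_name):
    """Every recipient is then guaranteed at least the base-layer quality."""
    return "qos_guaranteed" if layer_name == "base" else "best_effort"

mapping = {name: assign_channel(name) for name in layers}
print(mapping)
```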
  • Desirable scalable codec solutions will offer improved bandwidth, temporal resolution, spatial quality, spatial resolution, and computational power scalability. Attention is in particular directed to developing scalable video codecs that are consistent with simplified MCU architectures for versatile videoconferencing applications. Desirable scalable codec solutions will enable zero-delay MCU architectures that allow cascading of MCUs in electronic networks with no or minimal end-to-end delay penalties.
  • the present invention provides scalable video coding (SVC) systems and methods (collectively, “solutions”) for point-to-point and multipoint conferencing applications.
  • SVC solutions provide a coded "layered" representation of a source video signal at multiple temporal, quality, and spatial resolutions. These resolutions are represented by distinct layer/bitstream components that are created by endpoint/terminal encoders.
  • the SVC solutions are designed to accommodate diversity in endpoint/receivers devices and in heterogeneous network characteristics, including, for example, the best-effort nature of networks such as those based on the Internet Protocol.
  • the scalable aspects of the video coding techniques employed allow conferencing applications to adapt to different network conditions, and also accommodate different end-user requirements (e.g., a user may elect to view another user at a high or low spatial resolution).
  • Scalable video codec designs allow error-resilient transmission of video in point-to-point and multipoint scenarios, and allow a conferencing bridge to provide continuous presence, rate matching, error localization, random entry and personal layout conferencing features, without decoding or recoding in-transit video streams and without any decrease in the error resilience of the stream.
  • An endpoint terminal which is designed for video communication with other endpoints, includes video encoders/decoders that can encode a video signal into one or more layers of a multi-layer scalable video format for transmission.
  • the video encoders/decoders can correspondingly decode received video signal layers, simultaneously or sequentially, in as many video streams as the number of participants in a videoconference.
  • the terminal may be implemented in hardware, software, or a combination thereof in a general-purpose PC or other network access device.
  • the scalable video codecs incorporated in the terminal may be based on coding methods and techniques that are consistent with or based on industry standard encoding methods such as H.264.
  • a scalable video codec creates a base layer that is based on standard H.264 AVC encoding.
  • the scalable video codec further creates a series of SNR enhancement layers by successively encoding, using again H.264 AVC, the difference between the original signal and the one coded at the previous layer with an appropriate offset.
  • DC values of the discrete cosine transform (DCT) coefficients are not coded in the enhancement layers, and further, a conventional deblocking filter is not used.
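The successive-refinement scheme described above, where each SNR enhancement layer codes the offset difference between the original signal and the previous layer's reconstruction, can be illustrated numerically. The crude uniform quantizer below is a stand-in for actual H.264 AVC coding, and the offset of 128 and all sample values are assumptions for the example.

```python
import numpy as np

OFFSET = 128  # assumed offset applied so the residual fits a sample range

def quantize(x, step):
    """Crude uniform quantizer, a stand-in for a real H.264 encode/decode."""
    return np.round(x / step) * step

original = np.array([100.0, 150.0, 90.0, 200.0])
base_recon = quantize(original, step=32)          # base layer: coarse (high QP)

residual = original - base_recon + OFFSET         # offset difference signal
enh_recon = quantize(residual, step=8) - OFFSET   # enhancement: finer (low QP)

refined = base_recon + enh_recon                  # two-layer reconstruction
print(np.abs(base_recon - original).max())        # base-only error
print(np.abs(refined - original).max())           # smaller after refinement
```

Each additional SNR layer repeats the same step against the latest reconstruction, shrinking the residual error further.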
  • in an SVC solution designed to use SNR scalability as a means of implementing spatial scalability, different quantization parameters (QP) are selected for the base and enhancement layers.
  • the base layer, which is encoded at a higher QP, is optionally low-pass filtered and downsampled for display at receiving endpoints/terminals.
  • the scalable video codec is designed as a spatially scalable encoder in which a reconstructed base layer H.264 low-resolution signal is upsampled at the encoder and subtracted from the original signal. The difference is fed to the standard encoder operating at high resolution, after being offset by a set value.
  • the upsampled H.264 low-resolution signal is used as an additional possible reference frame in the motion estimation process of a standards-based high-resolution encoder.
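The spatially scalable arrangement described in the two bullets above can be sketched as follows. The naive 2:1 decimation, nearest-neighbor upsampling, and lossless enhancement path are simplifying assumptions; a real system would use proper filters and lossy H.264 coding at both resolutions.

```python
import numpy as np

OFFSET = 128  # assumed offset applied before the high-resolution encoder

def downsample(x):
    return x[::2]              # naive 2:1 decimation (illustrative only)

def upsample(x):
    return np.repeat(x, 2)     # naive nearest-neighbor upsampling

original = np.array([10.0, 12.0, 40.0, 44.0, 90.0, 92.0, 20.0, 22.0])
base_recon = downsample(original)     # stand-in for the decoded low-res base
predicted = upsample(base_recon)      # upsampled base = high-res prediction

residual = original - predicted + OFFSET   # fed to the high-res encoder
reconstructed = predicted + (residual - OFFSET)
print(np.array_equal(reconstructed, original))
```

With lossless enhancement coding the reconstruction is exact; in practice the enhancement layer is itself quantized, and the upsampled base may additionally serve as a reference frame in the high-resolution encoder's motion estimation, as noted above.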
  • the SVC solutions may involve adjusting or changing threading modes or spatial scalability modes to dynamically respond to network conditions and participants' display preferences.
  • FIGS. 1A and 1B are schematic diagrams illustrating exemplary architectures of a videoconferencing system, in accordance with the principles of the present invention.
  • FIG. 2 is a block diagram illustrating an exemplary end-user terminal, in accordance with the principles of the present invention.
  • FIG. 3 is a block diagram illustrating an exemplary architecture of an encoder for the base and temporal enhancement layers (i.e., layers 0 through 2), in accordance with the principles of the present invention.
  • FIG. 4 is a block diagram illustrating an exemplary layered picture coding structure for the base, temporal enhancement, and SNR or spatial enhancement layers, in accordance with the principles of the present invention.
  • FIG. 5 is a block diagram illustrating the structure of an exemplary SNR enhancement layer encoder, in accordance with the principles of the present invention.
  • FIG. 6 is a block diagram illustrating the structure of an exemplary single-loop SNR video encoder, in accordance with the principles of the present invention.
  • FIG. 7 is a block diagram illustrating an exemplary structure of a base layer for a spatial scalability video encoder, in accordance with the principles of the present invention.
  • FIG. 8 is a block diagram illustrating an exemplary structure of a spatial scalability enhancement layer video encoder, in accordance with the principles of the present invention.
  • FIG. 9 is a block diagram illustrating an exemplary structure of a spatial scalability enhancement layer video encoder with inter-layer motion prediction, in accordance with the principles of the present invention.
  • FIGS. 10 and 11 are block diagrams illustrating exemplary base layer and SNR enhancement layer video decoders, respectively, in accordance with the principles of the present invention.
  • FIG. 12 is a block diagram illustrating an exemplary SNR enhancement layer, single-loop video decoder, in accordance with the principles of the present invention.
  • FIG. 13 is a block diagram illustrating an exemplary spatial scalability enhancement layer video decoder, in accordance with the principles of the present invention.
  • FIG. 14 is a block diagram illustrating an exemplary video decoder for spatial scalability enhancement layers with inter-layer motion prediction, in accordance with the principles of the present invention.
  • FIGS. 15 and 16 are block diagrams illustrating exemplary alternative layered picture coding structures and threading architectures, in accordance with the principles of the present invention.
  • FIG. 17 is a block diagram illustrating an exemplary Scalable Video Coding Server (SVCS), in accordance with the principles of the present invention.
  • FIG. 18 is a schematic diagram illustrating the operation of an SVCS switch, in accordance with the principles of the present invention.
  • FIGS. 19 and 20 are illustrations of exemplary SVCS Switch Layer and Network Layer Configuration Matrices, in accordance with the principles of the present invention.
  • the same reference numerals and characters are used to denote like features, elements, components or portions of the illustrated embodiments.
  • while the present invention will now be described in detail with reference to the figures, it is done so in connection with the illustrative embodiments.
  • the present invention provides systems and techniques for scalable video coding (SVC) of video data signals for multipoint and point-to-point video conferencing applications.
  • the SVC systems and techniques are designed to allow the tailoring or customization of delivered video data in response to different user participants/endpoints, network transmission capabilities, environments, or other requirements in a videoconference.
  • the inventive SVC solutions provide compressed video data in a multi-layer format, which can be readily switched layer-by-layer between conferencing participants using convenient zero- or low-algorithmic-delay switching mechanisms. Exemplary zero- or low-algorithmic-delay switching mechanisms, Scalable Video Coding Servers (SVCS), are described in co-filed U.S. Patent application No. [SVCS].
  • FIGS. 1A and 1B show exemplary videoconferencing system 100 arrangements based on the inventive SVC solutions.
  • Videoconferencing system 100 may be implemented in a heterogeneous electronic or computer network environment, for multipoint and point-to-point client conferencing applications.
  • System 100 uses one or more networked servers (e.g., an SVCS or MCU 110), to coordinate the delivery of customized data to conferencing participants or clients 120, 130, and 140.
  • MCU 110 may coordinate the delivery of a video stream 150 generated by endpoint 140 for transmission to other conference participants.
  • a video stream is first suitably coded or scaled down using the inventive SVC techniques into a multiplicity of data components or layers.
  • the multiple data layers may have differing characteristics or features (e.g., spatial resolutions, frame rates, picture quality, signal- to-noise ratio qualities (SNR), etc.).
  • the differing characteristics or features of the data layers may be suitably selected in consideration, for example, of the varying individual user requirements and infrastructure specifications in the electronic network environment (e.g., CPU capabilities, display size, user preferences, and bandwidths).
  • MCU 110 is suitably configured to select an appropriate amount of information (i.e., SVC layers) for each particular participant/recipient in the conference from a received data stream (e.g., SVC video stream 150), and to forward only the selected or requested amounts of information/layers to the respective participants/recipients 120-130.
  • MCU 110 may be configured to make the suitable selections in response to receiving-endpoint requests (e.g., the picture quality requested by individual conference participants) and upon consideration of network conditions and policies.
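The per-recipient selection performed by MCU 110 can be sketched as simple layer forwarding; no decoding or re-encoding is involved. The endpoint names, layer tags, and request format below are hypothetical illustrations.

```python
# Layers of the incoming SVC stream from the sending endpoint, base first.
incoming = ["base", "snr_1", "spatial_1"]

# Per-recipient requests, e.g. derived from display size, link bandwidth,
# and user preferences (format is an assumption for this sketch).
requests = {
    "endpoint_120": {"max_layers": 1},  # e.g., a handheld on a slow link
    "endpoint_130": {"max_layers": 3},  # e.g., a desktop on a fast LAN
}

def forward(stream_layers, recipients):
    """Select a per-recipient prefix of the layer list; packets are merely
    forwarded, never decoded, composited, or re-encoded."""
    return {r: stream_layers[: prefs["max_layers"]]
            for r, prefs in recipients.items()}

out = forward(incoming, requests)
print(out)
```

Because the server only drops or forwards layers, it adds essentially no algorithmic delay, in contrast to the decode/mix/re-encode pipeline of a conventional MCU.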
  • FIG. 1B, which is reproduced from the referenced patent application [SVCS], shows an exemplary internal structure of SVC video stream 150 that represents a media input of endpoint 140 to the conference.
  • the exemplary internal structure of SVC video stream 150 includes a "base" layer 150b, and one or more distinct "enhancement" layers 150a.
  • FIG. 2 shows an exemplary participant/endpoint terminal 140, which is designed for use with SVC-based videoconferencing systems (e.g., system 100).
  • Terminal 140 includes human interface input/output devices (e.g., a camera 210A, a microphone 210B, a video display 250C, a speaker 250D), and a network interface controller card (NIC) 230 coupled to input and output signal multiplexer and demultiplexer units (e.g., packet MUX 220A and packet DMUX 220B).
  • NIC 230 may be a standard hardware component, such as an Ethernet LAN adapter, or any other suitable network interface device.
  • Camera 210A and microphone 210B are designed to capture participant video and audio signals, respectively, for transmission to other conferencing participants.
  • video display 250C and speaker 250D are designed to display and play back video and audio signals received from other participants, respectively.
  • Video display 250C may also be configured to optionally display participant/terminal 140's own video.
  • Camera 210A and microphone 210B outputs are coupled to video and audio encoders 210G and 210H via analog-to-digital converters 210E and 210F, respectively.
  • Video and audio encoders 210G and 210H are designed to compress input video and audio digital signals in order to reduce the bandwidths necessary for transmission of the signals over the electronic communications network.
  • the input video signal may be live, or pre-recorded and stored video signals.
  • Video encoder 210G has multiple outputs connected directly to packet MUX 220A. Audio encoders 210H output is also connected directly to packet MUX 220A. The compressed and layered video and audio digital signals from encoders 210G and 210H are multiplexed by packet MUX 220A for transmission over the communications network via NIC 230. Conversely, compressed video and audio digital signals received over the communications network by NIC 230 are forwarded to packet DMUX 220B for demultiplexing and further processing in terminal 140 for display and playback over video display 250C and speaker 250D.
  • Captured audio signals may be encoded by audio encoder 210H using any suitable encoding techniques including known techniques, for example, G.711 and MPEG-I. In an implementation of videoconferencing system 100 and terminal 140, G.711 encoding is preferred for audio encoding. Captured video signals are encoded in a layered coding format by video encoder 210G using the SVC techniques described herein. Packet MUX 220A may be configured to multiplex the input video and audio signals using, for example, the RTP protocol or other suitable protocols. Packet MUX 220A also may be configured to implement any needed QoS-related protocol processing.
  • each stream of data from terminal 140 is transmitted in its own virtual channel (or port number in IP terminology) over the electronic communications network.
  • QoS may be provided via Differentiated Services (DiffServ) for specific virtual channels or by any other similar QoS-enabling technique.
  • Depending on the DiffServ or similar QoS-enabling technique used, layers may be transmitted over a high reliability channel (HRC) or a low reliability channel (LRC).
  • the endpoint may (i) proactively transmit the information over the HRC repeatedly (the actual number of repeated transmissions may depend on channel error conditions), or (ii) cache and retransmit information upon the request of a receiving endpoint or SVCS, for example, in instances where information loss in transmission is detected and reported immediately.
  • These methods of establishing an HRC can be applied in the client-to-MCU, MCU-to-client, or MCU-to-MCU connections individually or in any combination, depending on the available channel type and conditions.
  • terminal 140 is configured with one or more pairs of video and audio decoders (e.g., decoders 230A and 230B) designed to decode signals received from the conferencing participants who are to be seen or heard at terminal 140.
  • the pairs of decoders 230A and 230B may be designed to process signals individually participant-by-participant or to sequentially process a number of participant signals.
  • the configuration or combinations of pairs of video and audio decoders 230A and 230B included in terminal 140 may be suitably selected to process all participant signals received at terminal 140 with consideration of the parallel and/or sequential processing design features of the decoders.
  • packet DMUX 220B may be configured to receive packetized signals from the conferencing participants via NIC 230, and to forward the signals to appropriate pairs of video and audio decoders 230A and 230B for parallel and/or sequential processing.
  • audio decoder 230B outputs are connected to audio mixer 240 and a digital-to-analog converter (DAC) 250B, which drives speaker 250D to play back received audio signals.
  • Audio mixer 240 is designed to combine individual audio signals into a single signal for playback.
  • video decoder 230A outputs are combined in frame buffer 250A by a compositor 260.
  • a combined or composite video picture from frame buffer 250A is displayed on monitor 250C.
  • Compositor 260 may be suitably designed to position each decoded video picture at a corresponding designated position in the composite frame or displayed picture. For example, monitor 250C display may be split into four smaller areas.
  • Compositor 260 may obtain pixel data from each of video decoders 230A in terminal 140 and place the pixel data in an appropriate frame buffer 250A position (e.g., filling up the lower right picture). To avoid double buffering (e.g., once at the output of decoder 230A and once at frame buffer 250A), compositor 260 may, for example, be configured as an address generator that drives the placement of output pixels of decoder 230A. Alternative techniques for optimizing the placement of individual video decoder 230A outputs on display 250C may also be used to similar effect.
  • terminal 140 components shown in FIG. 2 may be implemented in any suitable combination of hardware and/or software components, which are suitably interfaced with each other.
  • the components may be distinct stand-alone units or integrated with a personal computer or other device having network access capabilities.
  • FIGS. 3-9 respectively show various scalable video encoders or codecs 300-900 that may be deployed in terminal 140.
  • FIG. 3 shows exemplary encoder architecture 300 for compressing input video signals in a layered coding format (e.g., layers L0, L1, and L2 in SVC terminology, where L0 is the lowest frame rate).
  • Encoder architecture 300 represents a motion-compensated, block-based transform codec based, for example, on a standard H.264/MPEG-4 AVC design or other suitable codec designs.
  • Encoder architecture 300 includes a FRAME BUFFERS block 310, an ENC REF CONTROL block 320, and a DeBlocking Filter block 360 in addition to conventional "text-book" variety video coding process blocks 330 for motion estimation (ME), motion compensation (MC), and other encoding functions.
  • the motion-compensated, block- based codec used in system 100/terminal 140 may be a single-layer temporally predictive codec, which has a regular structure of I, P, and B pictures.
  • a picture sequence (in display order) may, for example, be "IBBPBBP".
  • the 'P' pictures are predicted from the previous P or I picture
  • the B pictures are predicted using both the previous and next P or I picture.
  • Although the number of B pictures between successive I or P pictures can vary, as can the rate at which I pictures appear, it is not possible, for example, for a P picture to use as a reference for prediction another P picture that is earlier in time than the most recent one.
  • Standard H.264 coding advantageously provides an exception in that two reference picture lists are maintained by the encoder and decoder, respectively. This exception is exploited by the present invention to select which pictures are used as references and also which references are used for a particular picture that is to be coded.
  • FRAME BUFFERS block 310 represents memory for storing the reference picture list(s).
  • ENC REF CONTROL block 320 is designed to determine which reference picture is to be used for the current picture at the encoder side.
  • ENC REF CONTROL block 320 is placed in context further with reference to an exemplary layered picture coding "threading" or "prediction chain" structure shown in FIG. 4.
  • Codecs 300 utilized in implementations of the present invention may be configured to generate a set of separate picture "threads" (e.g., a set of three threads 410-430) in order to enable multiple levels of temporal scalability resolutions (e.g., L0-L2) and other enhancement resolutions (e.g., S0-S2).
  • a thread or prediction chain is defined as a sequence of pictures that are motion-compensated using pictures either from the same thread, or pictures from a lower level thread.
  • Threads 410-430 have a common source L0 but different targets and paths (e.g., targets L2, L2, and L0, respectively).
  • ENC REF CONTROL block 320 may use only P pictures as reference pictures.
  • B pictures also may be used with accompanying gains in overall compression efficiency. Using even a single B picture in the set of threads (e.g., by having L2 be coded as a B picture) can improve compression efficiency.
  • the use of B pictures with prediction from future pictures increases the coding delay and is therefore avoided.
  • the present invention allows the design of MCUs with practically zero processing delay.
  • encoder 300 output L0 is simply a set of P pictures spaced four pictures apart.
  • Output L1 has the same frame rate as L0, but only prediction based on the previous L0 picture is allowed.
  • Output L2 pictures are predicted from the most recent L0 or L1 picture.
  • Output L0 provides one fourth (1:4) of the full temporal resolution, L1 doubles the L0 frame rate (1:2), and L2 doubles the L0+L1 frame rate (1:1).
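  • The threading rules above can be sketched in code. The following is an illustrative sketch (not from the patent; the function names and frame-index arithmetic are assumptions) of how a frame's temporal layer and its prediction reference could be derived for the three-layer L0-L2 structure:

```python
def temporal_layer(frame_idx: int) -> int:
    """L0 every 4th frame, L1 halfway between L0 frames, L2 for the rest."""
    if frame_idx % 4 == 0:
        return 0
    if frame_idx % 2 == 0:
        return 1
    return 2

def reference_frame(frame_idx: int) -> int:
    """Prediction reference per the threading rules: L0 from the previous L0,
    L1 from the previous L0, L2 from the most recent L0 or L1 picture."""
    layer = temporal_layer(frame_idx)
    if layer == 0:
        return frame_idx - 4   # previous L0 picture, four frames back
    if layer == 1:
        return frame_idx - 2   # the previous L0 picture
    return frame_idx - 1       # most recent L0 or L1 picture
```

For example, frame 6 (an L1 picture) is predicted from frame 4 (an L0 picture), never from frame 5.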
  • A lesser number of layers (e.g., fewer than the three layers L0-L2) or an additional number of layers may be similarly constructed by encoder 300 to accommodate different bandwidth/scalability requirements or different specifications of implementations of the present invention.
  • each compressed temporal video layer may include or be associated with one or more additional components related to SNR quality scalability and/or spatial scalability.
  • FIG. 4 shows one additional enhancement layer (SNR or spatial). Note that this additional enhancement layer will have three different components (S0-S2), each corresponding to one of the three temporal layers (L0-L2).
  • FIGS. 5 and 6 show SNR scalability encoders 500 and 600, respectively.
  • FIGS. 7-9 show spatial scalability encoders 700-900, respectively. It will be understood that SNR scalability encoders 500 and 600 and spatial scalability encoders 700-900 are based on and may use the same processing blocks (e.g., blocks 330, 310 and 320) as encoder 300 (FIG. 3).
  • FIG. 5 shows the structure of an exemplary SNR enhancement encoder 500. The input to encoder 500 is the difference between the original picture (INPUT, FIG. 3) and the reconstructed coded picture (REF, FIG. 3) as recreated at the encoder.
  • FIG. 5 also shows use of encoder 500 based on H.264 for encoding the coding error of the previous layers.
  • Non-negative inputs are required for such encoding.
  • the input (INPUT-REF) to encoder 500 is offset by a positive bias (e.g., by OFFSET 340).
  • the positive bias is removed after decoding and prior to the addition of the enhancement layer to the base layer.
  • a deblocking filter that is typically used in H.264 codec implementations (e.g., Deblocking filter 360, FIG. 3) is not used in encoder 500.
  • The DC discrete cosine transform (DCT) coefficients in the enhancement layer may optionally be ignored or eliminated in encoder 500.
  • all motion prediction within a temporal layer and across temporal layers may be performed using the base layer streams only.
  • This feature is shown in FIG. 4 by the open arrowheads 415 indicating temporal prediction in the base layer block (L) rather than in the combination of L and S blocks.
  • all layers may be coded at CIF resolutions.
  • QCIF resolution pictures may be derived by decoding the base layer stream having a certain temporal resolution, and downsampling in each spatial dimension by a dyadic factor (2), using appropriate low-pass filtering. In this manner, SNR scalability can be used to also provide spatial scalability.
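  • The dyadic downsampling step just described can be sketched as follows (illustrative only; a real implementation would apply a proper low-pass filter rather than plain 2x2 averaging):

```python
def downsample_dyadic(frame):
    """Halve each spatial dimension by averaging 2x2 pixel blocks.
    `frame` is a list of rows of luma samples; dimensions must be even."""
    h, w = len(frame), len(frame[0])
    return [
        [(frame[2 * y][2 * x] + frame[2 * y][2 * x + 1]
          + frame[2 * y + 1][2 * x] + frame[2 * y + 1][2 * x + 1]) // 4
         for x in range(w // 2)]
        for y in range(h // 2)
    ]
```

Applied to a CIF luma plane (352x288), one such step yields QCIF (176x144).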
  • CIF/QCIF resolutions are referred to only for purposes of illustration.
  • Other resolutions (e.g., VGA/QVGA) may be used instead.
  • the codecs may also include traditional spatial scalability features in the same or similar manner as described above for the inclusion of the SNR scalability feature.
  • Techniques provided by MPEG-2 or H.264 Annex F may be used for including traditional spatial scalability features.
  • The architecture of codecs designed to decouple the SNR and temporal scalabilities, described above, allows frame rates in ratios of 1:4 (L0 only), 1:2 (L0 and L1), or 1:1 (all three layers).
  • a 100% bitrate increase is assumed for doubling the frame rate (base is 50% of total), and a 150% increase for adding the S layer at its scalability point (base is 40% of total).
  • the total stream may, for example, operate at 500 Kbps, with the base layer operating at 200 Kbps.
  • A 2.5:1 scalability (total vs. base) is thus available when the total stream and the base layer operate at 500 Kbps and 200 Kbps, respectively.
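  • The bitrate arithmetic implied by these assumptions can be worked out as follows (an illustrative sketch; the 2.5x and 2x multipliers come directly from the stated 150% and 100% increases):

```python
def layer_bitrates(total_kbps: float = 500.0) -> dict:
    """Per-combination bitrates, assuming doubling the frame rate doubles the
    bitrate and adding the S layer multiplies the base bitrate by 2.5."""
    base_total = total_kbps / 2.5        # temporal-only (base) stream
    return {
        "L0":       base_total / 4,      # 1:4 frame rate
        "L0+L1":    base_total / 2,      # 1:2 frame rate
        "L0+L1+L2": base_total,          # full frame rate
        "all+S":    total_kbps,          # base plus SNR enhancement
    }
```

With a 500 Kbps total this gives a 200 Kbps base layer, matching the example above.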
  • TABLE I shows examples of the different scalability options available when SNR scalability is used to provide spatial scalability.
  • FIG. 6 shows alternate SNR scalable encoder 600, which is based on a single encoding loop scheme.
  • the structure and operation of SNR scalable encoder 600 is based on that of encoder 300 (FIG. 3). Additionally in encoder 600, DCT coefficients that are quantized by QO are inverse-quantized and subtracted from the original unquantized coefficients to obtain the residual quantization error (QDIFF 610) of the DCT coefficients.
  • the residual quantization error information (QDIFF 610) is further quantized with a finer quantizer Ql (Block 620), entropy coded (VLC/BAC), and output as the SNR enhancement layer S.
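  • The single-loop QDIFF computation described above can be sketched as follows (uniform scalar quantizers stand in for the H.264 Q0/Q1 stages; the step sizes are illustrative assumptions):

```python
def quantize(coeffs, step):
    """Uniform scalar quantization of a list of DCT coefficients."""
    return [int(round(c / step)) for c in coeffs]

def dequantize(levels, step):
    """Inverse quantization back to coefficient values."""
    return [level * step for level in levels]

def snr_enhancement(coeffs, q0_step=16, q1_step=4):
    """Return (base_levels, enhancement_levels) for one block of DCT
    coefficients: the residual quantization error of Q0 (QDIFF) is
    re-quantized with the finer quantizer Q1 to form the S layer."""
    base = quantize(coeffs, q0_step)
    recon = dequantize(base, q0_step)
    qdiff = [c - r for c, r in zip(coeffs, recon)]  # residual quantization error
    return base, quantize(qdiff, q1_step)
```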
  • Terminal 140 video encoders may be configured to provide spatial scalability enhancement layers in addition to or instead of the SNR quality enhancement layers.
  • the input to the encoder is the difference between the original high-resolution picture and the upsampled reconstructed coded picture as created at the encoder.
  • the encoder operates on a downsampled version of the input signal.
  • FIG. 7 shows exemplary encoder 700 for encoding the base layer for spatial scalability.
  • Encoder 700 includes a downsampler 710 at the input of low-resolution base layer encoder 720.
  • base layer encoder 720 may, with suitable downsampling, operate at QCIF, HCIF (half CIF), or any other resolution lower than CIF.
  • base layer encoder 720 may operate at HCIF.
  • HCIF-mode operation requires downsampling of a CIF resolution input signal by a factor of about √2 in each dimension, which reduces the total number of pixels in a picture to about one-half of the original input. It is noted that in a video conferencing application, if a QCIF resolution is desired for display purposes, then the decoded base layer will have to be further downsampled from HCIF to QCIF.
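  • The ~√2 relationship can be checked numerically (dimensions here are rounded to even values; the text only fixes the approximate one-half pixel-count ratio, so the exact HCIF dimensions below are an assumption):

```python
import math

CIF_W, CIF_H = 352, 288  # CIF luma dimensions

def downscale_dims(w: int, h: int, factor: float = math.sqrt(2)):
    """Scale each dimension down by `factor`, rounding to even values."""
    return 2 * round(w / factor / 2), 2 * round(h / factor / 2)

hcif_w, hcif_h = downscale_dims(CIF_W, CIF_H)   # 248 x 204
ratio = (hcif_w * hcif_h) / (CIF_W * CIF_H)     # ~0.50 of the original pixels
```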
  • an inherent difficulty in optimizing the scalable video encoding process for video conferencing applications is that there are two or more resolutions of the video signal being transmitted. Improving the quality of one of the resolutions may result in corresponding degradation of the quality of the other resolution(s). This difficulty is particularly pronounced for spatially scalable coding, and in current art video conferencing systems in which the coded resolution and the display resolutions are identical.
  • the inventive technique of decoupling the coded signal resolution from the intended display resolution provides yet another tool in a codec designer's arsenal to achieve a better balance between the quality and bitrates associated with each of the resolutions.
  • the choice of coded resolution for a particular codec may be obtained by considering the rate-distortion (R-D) performance of the codec across different spatial resolutions, taking into account the total bandwidth available, the desired bandwidth partition across the different resolutions, and the desired quality differential that each additional layer should provide.
  • a signal may be coded at CIF and one-third CIF (1/3 CIF) resolutions.
  • CIF and HCIF resolution signals may be derived for display from the CIF-coded signal.
  • 1/3 CIF and QCIF resolution signals may similarly be derived for display from the 1/3 CIF-coded signal.
  • the CIF and 1/3 CIF resolution signals are available directly from the decoded signals, whereas the latter HCIF and QCIF resolution signals may be obtained upon appropriate downsampling of the decoded signals. Similar schemes may also be applied in the case of other target resolutions (e.g., VGA and one-third VGA, from which half VGA and quarter VGA can be derived).
  • the schemes of decoupling the coded signal resolution from the intended display resolution, together with the schemes for threading video signal layers (FIG. 4, and FIGS. 15 and 16), provide additional possibilities for obtaining target spatial resolutions with different bitrates, in accordance with the present invention.
  • spatial scalability may be used to encode the source signal at CIF and 1/3 CIF resolutions.
  • SNR and temporal scalabilities may be applied to the video signal as shown in FIG. 4.
  • the SNR encoding used may be a single loop or a double loop encoder (e.g., encoder 600 FIG. 6 or encoder 500 FIG. 5), or may be obtained by data partitioning (DP).
  • the double loop or DP encoding schemes will likely introduce drift whenever data is lost or removed.
  • the use of the layering structure will limit the propagation of the drift error until the next L0 picture, as long as the lost or removed data belongs to the L1, L2, S1, or S2 layers.
  • Since the perception of errors is reduced when the spatial resolution of the displayed video signal is reduced, it is possible to obtain a low bandwidth signal by eliminating or removing data from the L1, L2, S1, and S2 layers, decoding the 1/3 CIF resolution, and displaying it downsampled at a QCIF resolution.
  • FIG. 8 shows the structure of an exemplary spatially scalable enhancement layer encoder 800, which, like encoder 500, uses the same H.264 encoder structure for encoding the coding error of the previous layers but includes an upsampler block 810 on the reference (REF) signal. Since non-negative input is assumed for such an encoder, the input values are offset (e.g., by offset 340) prior to coding. Values that still remain negative are clipped to zero. The offset is removed after decoding and prior to the addition of the enhancement layer to the upsampled base layer.
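  • The offset-and-clip handling described for encoder 800 (and for encoder 500) can be sketched as follows; the bias value of 128 for 8-bit samples is an assumption, not taken from the text:

```python
BIAS = 128  # assumed positive bias for 8-bit residuals

def encode_residual(input_px, ref_px):
    """Bias the residual (INPUT - upsampled REF) into [0, 255]; values that
    remain negative after the offset are clipped to zero."""
    return [max(0, min(255, (i - r) + BIAS)) for i, r in zip(input_px, ref_px)]

def decode_residual(coded_px, ref_px):
    """Remove the bias after decoding and add the enhancement back to the
    (upsampled) base layer reference."""
    return [c - BIAS + r for c, r in zip(coded_px, ref_px)]
```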
  • For the spatial enhancement layer encoding, as for the SNR layer encoding (FIG. 6), it may be advantageous to use frequency weighting in the quantizers (Q) of the DCT coefficients.
  • coarser quantization can be used for the DC and its surrounding AC coefficients. For example, a doubling of the quantizer step size for the DC coefficient may be very effective.
  • FIG. 9 shows the exemplary structure of another spatially scalable video encoder 900.
  • encoder 900 unlike in encoder 800, the upsampled reconstructed base layer picture (REF) is not subtracted from the input, but instead serves as an additional possible reference picture in the motion estimation and mode selection blocks 330 of the enhancement layer encoder.
  • Encoder 900 can accordingly be configured to predict the current full resolution picture either from a previous coded full resolution picture (or future picture, for B pictures), or an upsampled version of the same picture coded at the lower spatial resolution (inter-layer prediction).
  • encoder 900 requires that the enhancement layer encoder's motion estimation (ME) block 330 be modified. It is also noted that enhancement layer encoder 900 operates on the regular pixel domain, rather than a differential domain.
  • This can be accomplished by modifying the B picture prediction reference for the high resolution signal so that the first picture is the regular or standard prior high resolution picture, and the second picture is the upsampled version of the base layer picture.
  • the encoder then performs prediction as if the second picture is a regular B picture, thus utilizing all the high-efficiency motion vector prediction and coding modes (e.g., spatial and temporal direct modes) of the encoder.
  • B picture coding stands for 'bi-predictive' rather than 'bi-directional', in the sense that the two reference pictures could both be past or future pictures of the picture being coded, whereas in traditional 'bi-directional' B picture coding (e.g., MPEG-2) one of the two reference pictures is a past picture and the other is a future picture.
  • the SNR and spatial scalability encoding modes may be combined in one encoder.
  • The video-threading structures (e.g., shown in two dimensions in FIG. 4) may be extended with an additional third scalability layer (SNR or spatial).
  • An implementation in which SNR scalability is added on the full resolution signal of a spatially scalable codec may be attractive in terms of range of available qualities and bitrates.
  • FIGS. 10-14 show exemplary architectures for a base layer decoder 1000, SNR enhancement layer decoder 1100, a single-loop SNR enhancement layer decoder 1200, a spatially scalable enhancement layer decoder 1300 and spatially scalable enhancement layer decoder 1400 with interlayer motion prediction, respectively.
  • These decoders complement encoders 300, 500, 600, 700, 800 and 900.
  • Decoders 1000, 1100, 1200, 1300, and 1400 may be included in terminal 140 decoders 230A as appropriate or needed.
  • the scalable video coding/decoding configurations of terminal 140 present a number of options for transmitting the resultant layers over the HRC and LRC in system 100.
  • (L0 and S0) layers or (L0, S0, and L1) layers may be transmitted over the HRC.
  • Alternate combinations also may be used as desired, upon due consideration of network conditions, and the bandwidths of high and low reliability channels.
  • the frequency of intra-mode coding, which does not involve prediction, may depend on network conditions or may be determined in response to losses reported by a receiving endpoint.
  • The S0 prediction chain may be refreshed in this manner.
  • FIGS. 15 and 16 show alternative threading or prediction chain architectures 1500 and 1600, which may be used in video communication or conferencing applications, in accordance with the present invention. Implementations of threading structures or prediction chains 1500 and 1600 do not require any substantial changes to the codec designs described above with reference to FIGS. 2-14.
  • In architecture 1500, an exemplary combination of layers (S0, L0, and L1) is transmitted over high reliability channel 170. It is noted that, as shown, L1 is part of the L0 prediction chain 430, but not of the S1 prediction chain. Architecture 1600 shows further examples of threading configurations, which also can achieve non-dyadic frame rate resolutions.
  • System 100 and terminal 140 codec designs described above are flexible and can be readily extended to incorporate alternative SVC schemes.
  • coding of the S layer may be accomplished according to the forthcoming ITU-T H.264 SVC FGS specification.
  • the S layer coding may be able to utilize arbitrary portions of an 'S' packet due to the embedded property of the produced bitstream. It may be possible to use portions of the FGS component to create the reference picture for the higher layers. Loss of the FGS component information in transmission over the communications network may introduce drift in the decoder.
  • the threading architecture employed in the present invention advantageously minimizes the effects of such loss. Error propagation may be limited to a small number of frames in a manner that is not noticeable to viewers.
  • the amount of FGS to include for reference picture creation may change dynamically.
  • A proposed feature of the H.264 SVC FGS specification is a leaky prediction technique in the FGS layer. See Y. Bao et al., Joint Video
  • the leaky prediction technique consists of using a normalized weighted average of the previous FGS enhancement layer picture and the current base layer picture.
  • the weighted average is controlled by a weight parameter alpha; if alpha is 1 then only the current base layer picture is used, whereas if it is 0 then only the previous FGS enhancement layer picture is used.
  • The case where alpha is 0 is identical to the use of motion estimation (ME 330, FIG. 5) for the SNR enhancement layer of the present invention, in the limiting case of using only zero motion vectors.
  • the leaky prediction technique can be used in conjunction with regular ME as described in this invention. Further, it is possible periodically to set alpha so that only the base layer picture is used, in order to break the prediction loop in the FGS layer and eliminate error drift.
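  • The leaky prediction rule can be stated in a few lines (an illustrative sketch; pixel lists stand in for pictures):

```python
def leaky_reference(base_px, prev_fgs_px, alpha: float):
    """Weighted-average FGS reference: alpha = 1 uses only the current base
    layer picture; alpha = 0 uses only the previous FGS enhancement picture."""
    return [alpha * b + (1.0 - alpha) * p for b, p in zip(base_px, prev_fgs_px)]
```

Predicting purely from the base layer picture breaks the FGS prediction loop, so drift cannot propagate past that point.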
  • FIG. 17 shows the switch structure of an exemplary MCU/SVCS 110 that is used in videoconferencing system 100 (FIG. 1).
  • MCU/SVCS 110 determines which packet from each of the possible sources (e.g., endpoints 120-140) is transmitted to which destination and over which channel (high reliability vs. low reliability) and switches signals accordingly.
  • the designs and the switching functions of MCU/SVCS 110 are described in co-filed U.S. Patent Application No. [SVCS], incorporated by reference herein. For brevity, only limited details of the switch structure and switching functions of MCU/SVCS 110 are described further herein.
  • FIG. 18 shows the operation of an exemplary embodiment of MCU/SVCS switch 110.
  • MCU/SVCS switch 110 maintains two data structures in its memory: an SVCS Switch Layer Configuration Matrix 110A and an SVCS Network Configuration Matrix 110B, examples of which are shown in FIGS. 19 and 20, respectively.
  • SVCS Switch Layer Configuration Matrix 110A (FIG. 19) provides information on how a particular data packet should be handled for each layer and for each pair of source and destination endpoints 120-140. For example, a matrix 110A element value of zero indicates that the packet should not be transmitted; a negative matrix element value indicates that the entire packet should be transmitted; and a positive matrix element value indicates that only the specified percentage of the packet's data should be transmitted. Transmission of a specified percentage of the packet's data may be relevant only when an FGS-type of technique is used to scalably code signals.
  • FIG. 18 also shows an algorithm 1800 in MCU/SVCS 110 for directing data packets utilizing Switch Layer Configuration Matrix 110A information.
  • MCU/SVCS 110 may examine received packet headers (e.g., NAL headers, assuming use of H.264).
  • MCU/SVCS 110 evaluates the value of relevant matrix 110A elements for source, destination, and layer combinations to establish processing instructions and designated destinations for the received packets. In applications using FGS coding, positive matrix element values indicate that the packet's payload must be reduced in size. Accordingly, at step 1806, the relevant length entry of the packet is changed and no data is copied. At step 1808, the relevant layers or combination of layers are switched to their designated destinations.
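  • The per-packet decision encoded in the Switch Layer Configuration Matrix can be sketched as follows (names and the dictionary layout are assumptions; only the zero / negative / positive-percentage semantics come from the text):

```python
def forward_packet(matrix: dict, src, dst, layer, payload: bytes):
    """Return the payload to forward to `dst`, or None to drop the packet.
    matrix[(src, dst, layer)]: 0 = drop, negative = forward whole packet,
    positive p = forward only the first p percent of the payload (FGS)."""
    entry = matrix.get((src, dst, layer), 0)
    if entry == 0:
        return None
    if entry < 0:
        return payload
    keep = len(payload) * entry // 100   # FGS-style payload truncation
    return payload[:keep]
```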
  • SVCS Network Configuration Matrix 110B tracks the port numbers for each participating endpoint.
  • MCU/SVCS 110 may use Matrix 110B information to transmit and receive data for each of the layers.
  • The switching performed by MCU/SVCS 110 based on processing Matrices 110A and 110B allows signal switching to occur with zero or minimal internal algorithmic delay, in contrast to traditional MCU operations.
  • Traditional MCUs have to compose incoming video to a new frame for transmission to the various participants. This composition requires full decoding of the incoming streams and recoding of the output stream. The decoding/recoding processing delay in such MCUs is significant, as is the computational power required.
  • MCU/SVCS 110 is required only to filter incoming packets to select the appropriate layer(s) for each recipient destination.
  • This design advantageously allows MCU/SVCS 110 to be implemented at very little cost, offers excellent scalability (in terms of the number of sessions that can be hosted simultaneously on a given device), and yields end-to-end delays that may be only slightly larger than the delays in a direct endpoint-to-endpoint connection.
  • Terminal 140 and MCU/SVCS 110 may be deployed in different network scenarios using different bitrates and stream combinations.
  • Terminal 140 and like configurations of the present invention allow scalable coding techniques to be exploited in the context of point-to-point and multipoint videoconferencing systems deployed over channels that can provide different QoS guarantees.
  • the selection of the scalable codecs described herein, the selection of a threading model, the choice of which layers to transmit over the high reliability or low reliability channel, and the selection of appropriate bitrates (or quantizer step sizes) for the various layers are relevant design parameters, which may vary with particular implementations of the present invention. Typically, such design choices may be made once and the parameters remain constant during the deployment of a videoconferencing system, or at least during a particular videoconferencing session. However, it will be understood that SVC configurations of the present invention offer the flexibility to dynamically adjust these parameters within a single videoconferencing session.
  • Dynamic adjustment of the parameters may be desirable, taking into account a participant's/endpoint's requirements (e.g., which other participants should be received, at what resolutions, etc.) and network conditions (e.g., loss rates, jitter, bandwidth availability for each participant, bandwidth partitioning between high and low reliability channels, etc.).
  • individual participants/endpoints may interactively be able to switch between different threading patterns (e.g., between the threading patterns shown in FIGS. 4, 15, and 16), elect to change how layers are assigned to the high and low reliability channels, elect to eliminate one or more layers, or change the bitrate of individual layers.
  • MCU/SVCS 110 may be configured to change how layers are assigned to the high and low reliability channels linking various participants, eliminate one or more layers, or scale the FGS/SNR enhancement layer for some participants.
  • a videoconference may have three participants, A, B, and C.
  • Participants A and B may have access to a high-speed 500 Kbps channel that can guarantee a continuous rate of 200 Kbps.
  • Participant C may have access to a 200 Kbps channel that can guarantee 100 Kbps.
  • Participant A may use a coding scheme that has the following layers: a base layer ("Base"), a temporal scalability layer ("Temporal") that provides 7.5 fps, 15 fps, and 30 fps video at CIF resolution, and an SNR enhancement layer ("FGS") that allows an increase of the spatial resolution at any of the three temporal frame rates.
  • the Base and Temporal components each require 100 Kbps, and FGS requires 300 Kbps for a total of 500 Kbps bandwidth.
  • Participant A can transmit all three Base, Temporal, and FGS components to MCU 110. Similarly, participant B can receive all three components. However, since only 200 Kbps are guaranteed to participant B in this scenario, the FGS component is transmitted through the non-guaranteed 300 Kbps channel segment. Participant C can receive only the Base and Temporal components, with the Base component guaranteed at 100 Kbps. If the available bandwidth (either guaranteed or total) changes, Participant A's encoder (e.g., Terminal 140) can dynamically change the target bitrate for any of the components in response. For example, if the guaranteed bandwidth is more than 200 Kbps, more bits may be allocated to the Base and Temporal components. Such changes can be implemented dynamically in real time because encoding occurs in real time (i.e., the video is not pre-coded).
  • participant A may elect to transmit only the Base component. Similarly, if participants B and C elect to view the received video only at QCIF resolution, participant A can respond by not transmitting the FGS component, since the additional quality enhancement offered by the FGS component would be lost when the received CIF video is downsampled to QCIF resolution. It will be noted that in some scenarios it may be appropriate to transmit a single-layer video stream (base layer or total video) and to avoid the use of scalability layers entirely.
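The layer-to-channel allocation in the three-participant example above can be sketched in code. The following Python fragment is purely illustrative: the layer names and bitrates (Base 100 Kbps, Temporal 100 Kbps, FGS 300 Kbps) come from the example, but the greedy `assign_layers` function is a hypothetical sketch, not a mechanism specified by the patent.

```python
# Hypothetical sketch: place SVC layers on the guaranteed (high-reliability)
# channel first, then on the best-effort channel; drop layers that do not fit.
# Layer bitrates follow the example in the text.
LAYERS = [("Base", 100), ("Temporal", 100), ("FGS", 300)]  # (name, Kbps)

def assign_layers(guaranteed_kbps, total_kbps):
    """Return (reliable, best_effort) layer lists for one recipient."""
    reliable, best_effort = [], []
    g_left, t_left = guaranteed_kbps, total_kbps
    for name, rate in LAYERS:
        if rate <= g_left:            # fits in the guaranteed portion
            reliable.append(name)
            g_left -= rate
            t_left -= rate
        elif rate <= t_left:          # fits only in the best-effort portion
            best_effort.append(name)
            t_left -= rate
        # otherwise the layer is not transmitted to this recipient

    return reliable, best_effort

# Participant B: 500 Kbps total, 200 Kbps guaranteed -> FGS goes best-effort
assert assign_layers(200, 500) == (["Base", "Temporal"], ["FGS"])
# Participant C: 200 Kbps total, 100 Kbps guaranteed -> only Base is guaranteed
assert assign_layers(100, 200) == (["Base"], ["Temporal"])
```

An MCU/SVCS could re-run such an allocation per recipient whenever reported bandwidth changes, which matches the dynamic-adjustment behavior described above.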
  • the quality difference between the base layer picture and the 'base plus enhancement layer' picture may be deliberately kept low by using suitable rate control techniques to increase the quality of the base layer itself.
  • One such rate control technique may be to encode all or some of the L0 pictures with a lower QP value (i.e., finer quantization). For example, every L0 picture may be encoded with a QP lowered by 3.
  • Such finer quantization may increase the quality of the base layer, thus minimizing any flickering effect or equivalent spatial artifacts caused by the loss of enhancement layer information.
  • the lower QP value may also be applied to every other L0 picture, or to every fourth L0 picture, with similar effectiveness in mitigating flickering and like artifacts.
  • the specific use of a combination of SNR and spatial scalability (e.g., using HCIF coding to represent the base layer carrying QCIF quality) is also possible.
  • the scalable codecs described herein may be implemented using any suitable combination of hardware and software.
  • the software (i.e., instructions) for implementing and operating the aforementioned scalable codecs can be provided on computer-readable media, which can include, without limitation, firmware, memory, storage devices, microcontrollers, microprocessors, integrated circuits, ASICs, online downloadable media, and other available media.
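The base-layer rate-control idea described above (encoding some or all L0 pictures with a lower QP) can be illustrated with a small sketch. The function name, the `period` parameter, and the QP delta of 3 are assumptions for illustration; only the underlying idea, lowering the QP of periodic L0 pictures to narrow the base/enhancement quality gap, comes from the text.

```python
# Hypothetical sketch of the described rate-control technique: L0 (base
# temporal layer) pictures at a chosen period are encoded with a finer
# quantizer (base_qp - qp_delta); all other pictures use base_qp.
def qp_for_picture(layer, base_qp, qp_delta=3, period=1, l0_index=0):
    """Return the QP to use for a picture.

    layer    -- temporal layer name, e.g. "L0", "L1", ...
    l0_index -- running index of the picture within the L0 layer
    period   -- 1 = every L0 picture, 2 = every other, 4 = every fourth
    """
    if layer == "L0" and l0_index % period == 0:
        return base_qp - qp_delta   # finer quantization for this L0 picture
    return base_qp

# Every L0 picture finer by 3, as in the example in the text:
assert qp_for_picture("L0", 30) == 27
# Every other L0 picture (period=2): odd indices keep the base QP
assert qp_for_picture("L0", 30, period=2, l0_index=1) == 30
assert qp_for_picture("L0", 30, period=2, l0_index=2) == 27
# Enhancement-layer pictures are unchanged:
assert qp_for_picture("L1", 30) == 30
```

A larger `period` trades a smaller bitrate overhead against somewhat less suppression of the flickering artifacts discussed above.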

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Computer Security & Cryptography (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

Scalable video codecs are disclosed for use in videoconferencing systems and applications hosted on heterogeneous endpoints/receivers and network environments. The scalable video codecs provide a coded representation of a source video signal at multiple temporal, quality, and spatial resolutions.
EP06788106A 2005-09-07 2006-07-21 System and method for scalable and low-delay videoconferencing using scalable video coding Withdrawn EP1952631A4 (fr)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US71474105P 2005-09-07 2005-09-07
US72339205P 2005-10-04 2005-10-04
PCT/US2006/028365 WO2008060262A1 (fr) 2005-09-07 2006-07-21 System and method for scalable and low-delay videoconferencing using scalable video coding

Publications (2)

Publication Number Publication Date
EP1952631A1 true EP1952631A1 (fr) 2008-08-06
EP1952631A4 EP1952631A4 (fr) 2012-11-21

Family

ID=39402785

Family Applications (1)

Application Number Title Priority Date Filing Date
EP06788106A 2005-09-07 2006-07-21 Withdrawn EP1952631A4 (fr) System and method for scalable and low-delay videoconferencing using scalable video coding

Country Status (3)

Country Link
EP (1) EP1952631A4 (fr)
JP (2) JP2009508454A (fr)
WO (1) WO2008060262A1 (fr)

Families Citing this family (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7733868B2 (en) 2005-01-26 2010-06-08 Internet Broadcasting Corp. Layered multicast and fair bandwidth allocation and packet prioritization
CN101371312B (zh) 2005-12-08 2015-12-02 维德约股份有限公司 System and method for error resilience and random access in video communication systems
EP1989877A4 (fr) 2006-02-16 2010-08-18 Vidyo Inc System and method for thinning scalable video coding bit-streams
US8320450B2 (en) 2006-03-29 2012-11-27 Vidyo, Inc. System and method for transcoding between scalable and non-scalable video codecs
CN101588252B (zh) 2008-05-23 2011-07-20 华为技术有限公司 Method and device for controlling a multipoint conference
WO2009154704A1 (fr) * 2008-06-17 2009-12-23 Thomson Licensing Methods and apparatus for dividing and combining scalable video coding transport streams
US8319820B2 (en) * 2008-06-23 2012-11-27 Radvision, Ltd. Systems, methods, and media for providing cascaded multi-point video conferencing units
EP2339853A4 (fr) 2008-10-22 2011-08-31 Nippon Telegraph & Telephone Scalable moving picture encoding method, scalable moving picture encoding apparatus, scalable moving picture encoding program, and computer-readable recording medium on which the program is recorded
JP5395621B2 (ja) * 2009-11-05 2014-01-22 株式会社メガチップス Image generation method and image reproduction method
JP5999873B2 (ja) 2010-02-24 2016-09-28 株式会社リコー Transmission system, transmission method, and program
US9172980B2 (en) 2010-03-25 2015-10-27 Mediatek Inc. Method for adaptively performing video decoding, and associated adaptive complexity video decoder and adaptive audio/video playback system
US8259808B2 (en) * 2010-03-25 2012-09-04 Mediatek Inc. Low complexity video decoder
US10063873B2 (en) 2010-03-25 2018-08-28 Mediatek Inc. Method for adaptively performing video decoding, and associated adaptive complexity video decoder and adaptive audio/video playback system
JP5718456B2 (ja) * 2010-05-25 2015-05-13 ヴィディオ・インコーポレーテッド System and method for scalable video communication using multiple cameras and multiple monitors
US20130148717A1 (en) * 2010-08-26 2013-06-13 Freescale Semiconductor, Inc. Video processing system and method for parallel processing of video data
JP5740969B2 (ja) * 2010-12-22 2015-07-01 株式会社リコー TV conference system
JP6079174B2 (ja) 2011-12-27 2017-02-15 株式会社リコー Communication management system, communication system, program, and maintenance system
KR102001415B1 (ko) * 2012-06-01 2019-07-18 삼성전자주식회사 Rate control method for multi-layer video coding, and video encoding apparatus and video signal processing system using the same
US9525895B2 (en) * 2012-08-27 2016-12-20 Sony Corporation Transmission device, transmission method, reception device, and reception method
EP2804374A1 (fr) 2013-02-22 2014-11-19 Methods for encoding and decoding a block of pictures, corresponding devices and data stream
EP2804375A1 (fr) 2013-02-22 2014-11-19 Methods for encoding and decoding a block of pictures, corresponding devices and data stream
CN105230017B (zh) * 2013-03-21 2019-08-06 索尼公司 Image encoding device and method, and image decoding device and method
JP2015192230A (ja) * 2014-03-27 2015-11-02 沖電気工業株式会社 Conference system, conference server, conference method, and conference program
JP6349997B2 (ja) * 2014-06-17 2018-07-04 株式会社リコー Communication device, communication system, communication control method, and program
JP6588801B2 (ja) * 2015-10-30 2019-10-09 キヤノン株式会社 Image processing apparatus, image processing method, and program
KR101770070B1 (ko) * 2016-08-16 2017-08-21 라인 가부시키가이샤 Method and system for providing a video stream for a video conference
CN115720257B (zh) * 2022-10-13 2023-06-23 华能信息技术有限公司 Communication security management method and system for a video conference system

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2003063505A1 (fr) * 2002-01-23 2003-07-31 Nokia Corporation Grouping of pictures for video coding

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH09116903A (ja) * 1995-10-16 1997-05-02 Nippon Telegr & Teleph Corp <Ntt> Layered encoding device and layered decoding device
JP2000295608A (ja) * 1999-04-06 2000-10-20 Matsushita Electric Ind Co Ltd Video signal hierarchical encoding transmission device, video signal hierarchical decoding reception device, and program recording medium
AU2003280512A1 (en) * 2002-07-01 2004-01-19 E G Technology Inc. Efficient compression and transport of video over a network
US20040252758A1 (en) * 2002-08-14 2004-12-16 Ioannis Katsavounidis Systems and methods for adaptively filtering discrete cosine transform (DCT) coefficients in a video encoder
US20040086041A1 (en) * 2002-10-30 2004-05-06 Koninklijke Philips Electronics N.V. System and method for advanced data partitioning for robust video transmission
TWI249356B (en) * 2002-11-06 2006-02-11 Nokia Corp Picture buffering for prediction references and display
EP1445958A1 (fr) * 2003-02-05 2004-08-11 STMicroelectronics S.r.l. Procédé et système de quantification, par example pour applications MPEG, et programme informatique correspondant
US20040218669A1 (en) * 2003-04-30 2004-11-04 Nokia Corporation Picture coding method
JP2005167962A (ja) * 2003-11-11 2005-06-23 Secom Co Ltd Coded signal separation device, coded signal synthesis device, and coded signal separation/synthesis system

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2003063505A1 (fr) * 2002-01-23 2003-07-31 Nokia Corporation Grouping of pictures for video coding

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
"H.263+(or H.263 version 2), VIDEO CODING FOR LOW BITRATE COMMUNICATION", 9.2.1998,, no. H.263, version 2, 9 February 1998 (1998-02-09), XP030001506, ISSN: 0000-0509 *
"Joint Scalable Video Model (JSVM) 3", 73. MPEG MEETING;25-07-2005 - 29-07-2005; POZNAN; (MOTION PICTUREEXPERT GROUP OR ISO/IEC JTC1/SC29/WG11),, no. n7311, 6 September 2005 (2005-09-06), XP030013911, ISSN: 0000-0345 *
ALEXANDROS ELEFTHERIADIS ET AL: "Multipoint videoconferencing with scalable video coding", JOURNAL OF ZHEJIANG UNIVERSITY SCIENCE A; AN INTERNATIONAL APPLIED PHYSICS & ENGINEERING JOURNAL, SPRINGER, BERLIN, DE, vol. 7, no. 5, 1 May 2006 (2006-05-01), pages 696-705, XP019385029, ISSN: 1862-1775, DOI: 10.1631/JZUS.2006.A0696 *
BLAKE S ET AL: "RFC 2475: An Architecture for Differentiated Services", INTERNET CITATION, December 1998 (1998-12), XP002220520, Retrieved from the Internet: URL:ftp://ftp.isi.edu/in-notes/rfc2475.txt [retrieved on 2002-11-11] *
IHOR KIRENKO ET AL: "Flexible Scalable Video Compression", 69. MPEG MEETING; 19-07-2004 - 23-07-2004; REDMOND; (MOTION PICTUREEXPERT GROUP OR ISO/IEC JTC1/SC29/WG11),, no. M10997, 14 July 2004 (2004-07-14), XP030039776, ISSN: 0000-0254 *
See also references of WO2008060262A1 *

Also Published As

Publication number Publication date
EP1952631A4 (fr) 2012-11-21
JP2013141284A (ja) 2013-07-18
JP2009508454A (ja) 2009-02-26
WO2008060262A1 (fr) 2008-05-22

Similar Documents

Publication Publication Date Title
US8289370B2 (en) System and method for scalable and low-delay videoconferencing using scalable video coding
EP1952631A1 (fr) System and method for scalable and low-delay videoconferencing using scalable video coding
US20160360155A1 (en) System and method for scalable and low-delay videoconferencing using scalable video coding
US9270939B2 (en) System and method for providing error resilience, random access and rate control in scalable video communications
CA2633366C (fr) System and method for videoconferencing using scalable video decoding and image-compositing videoconferencing servers
US8436889B2 (en) System and method for videoconferencing using scalable video coding and compositing scalable video conferencing servers
US8442120B2 (en) System and method for thinning of scalable video coding bit-streams
CA2796882A1 (fr) System and method for scalable and low-delay videoconferencing using scalable coding
JP6309463B2 (ja) System and method for providing error resilience, random access, and rate control in scalable video communications
CA2640246C (fr) System and method for thinning of scalable video coding bit-streams
WO2007103889A2 (fr) System and method for providing error resilience, random access and rate control in scalable video communications
Eleftheriadis et al. Multipoint videoconferencing with scalable video coding
CA2615346C (fr) System and method for scalable and low-delay videoconferencing using scalable coding
AU2011254031B2 (en) System and method for providing error resilience, random access and rate control in scalable video communications

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 20080209

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LI LT LU LV MC NL PL PT RO SE SI SK TR

AX Request for extension of the european patent

Extension state: AL BA HR MK RS

DAX Request for extension of the european patent (deleted)
A4 Supplementary search report drawn up and despatched

Effective date: 20121018

17Q First examination report despatched

Effective date: 20131219

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20140701