WO2007035606A2 - Broadcasting video content to devices having different video presentation capabilities - Google Patents

Info

Publication number
WO2007035606A2
Authority
WO
WIPO (PCT)
Prior art keywords
video
frame
devices
display
location
Application number
PCT/US2006/036236
Other languages
French (fr)
Other versions
WO2007035606A3 (en)
Inventor
Adam L. Berger
Gregory Schohn
Original Assignee
Penthera Technologies, Inc.
Application filed by Penthera Technologies, Inc.
Publication of WO2007035606A2
Publication of WO2007035606A3

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00 Television systems
    • H04N7/16 Analogue secrecy systems; Analogue subscription systems
    • H04N7/162 Authorising the user terminal, e.g. by paying; Registering the use of a subscription channel, e.g. billing
    • H04N7/163 Authorising the user terminal, e.g. by paying; Registering the use of a subscription channel, e.g. billing by receiver means only
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234 Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs
    • H04N21/2343 Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
    • H04N21/234318 Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements by decomposing into objects, e.g. MPEG-4 objects
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234 Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs
    • H04N21/2343 Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
    • H04N21/23439 Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements for generating different versions
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/25 Management operations performed by the server for facilitating the content distribution or administrating data related to end-users or client devices, e.g. end-user or client device authentication, learning user preferences for recommending movies
    • H04N21/258 Client or end-user data management, e.g. managing client capabilities, user preferences or demographics, processing of multiple end-users preferences to derive collaborative data
    • H04N21/25808 Management of client data
    • H04N21/25825 Management of client data involving client display capabilities, e.g. screen resolution of a mobile phone
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/41 Structure of client; Structure of client peripherals
    • H04N21/414 Specialised client platforms, e.g. receiver in car or embedded in a mobile appliance
    • H04N21/41407 Specialised client platforms, e.g. receiver in car or embedded in a mobile appliance embedded in a portable device, e.g. video client on a mobile phone, PDA, laptop
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47 End-user applications
    • H04N21/472 End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content
    • H04N21/4728 End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content for selecting a Region Of Interest [ROI], e.g. for requesting a higher resolution version of a selected region

Definitions

  • This description relates to broadcasting video content to devices having different content presentation capabilities.
  • the devices span a wide range from projection TVs to flip-phones and others. Their display capabilities vary in terms of screen size, density of pixels, and color depth, for example. The best way to present a given video item may differ for different target devices. For example, a ticker of stock prices or sports scores at the bottom of a news video may be legible on a conventional TV set but blurred on the screen of a cell phone or personal digital assistant (PDA).
  • PDA personal digital assistant
  • frames of broadcast video received at a device are cropped to produce rendered frames on a display of the device, the cropping being based on both (a) a pre-determined value indicating a location within each of at least some of the frames with respect to which the cropping is to be done and (b) a characteristic of the display.
  • Implementations may include one or more of the following features.
  • the location comprises a centroid of a region of interest of the input digital image.
  • the pre-determined value is received from a source external to the device.
  • the pre-determined value is received with the input broadcast video.
  • the cropping of each of the frames occurs in real time as the frames are received.
  • the pre-determined values are generated remotely from the device.
  • the device comprises a mobile device.
  • the mobile device comprises a telephone or a personal digital assistant or a vehicle-mounted display.
  • the characteristic of the display includes the aspect ratio of the display.
  • the characteristic of the display includes the pixel density of the display.
  • the rendered image includes text.
  • a single directive is generated based on the video, the single directive being usable at each of the devices to control the presentation of the frame in two different ways that accommodate the respective capabilities of the two devices.
  • Implementations may include one or more of the following features.
  • the directive is packaged with the video for delivery together to the two devices.
  • Each of the frames of the video is associated with a directive based on the video.
  • At least one of the devices comprises a mobile device.
  • Presenting the frame comprises displaying the frame to a user.
  • the directive identifies a location of a region of interest in the frame.
  • the capabilities of the device include a size of a display.
  • the capabilities of the device include a density of pixels of a display.
  • the directive is generated at a central location.
  • the central location comprises a head-end of a mobile video broadcasting system.
  • the mobile video broadcasting system comprises DVB-H.
  • the directive is generated in connection with encoding of the video.
  • a single directive is generated based on the video, the single directive being usable at each of the devices to control the presentation of the frame in two different ways that accommodate the respective capabilities of the two devices, and at each of the two different devices the frame is cropped based on both the directive and a characteristic of the display.
  • the directive comprises an indication of the location of a region of interest in at least one frame of the video.
  • a region of interest is determined in a frame of a video, and a location is determined representative of the region of interest, the location being sufficient to enable each of two different video playing devices having different display characteristics to render the frame in a manner suitable for the display characteristics of the device.
  • Implementations may include one or more of the following features.
  • the location comprises a centroid of the region of interest. Such a location is determined for each of additional frames of the video. Information about the location is embedded in the video. The location is used in forming a directive stream associated with the video.
  • the region of interest is based on foreground pixels of the image.
  • the region of interest is based on a moving object or region in the video.
  • the region of interest is based on pan and scan vectors of the video.
  • the region of interest is based on a face appearing in the video.
  • the region of interest is based on the existence of small-font text on a periphery of the frame.
  • aspects include other combinations of features and aspects recited above and other methods, apparatus, programs, and systems, and may include aspects and features in which modifications other than or in addition to cropping are performed, communication is other than or in addition to broadcast, and the content is other than or in addition to video.
  • FIGS. 1 through 5, 10, 11, and 16 are block diagrams.
  • FIGS. 6 through 9 and 12 through 15 show video frames.
  • video material for example, in the form of compressed digital video
  • target devices 16, 18 having a range of display capabilities.
  • broadcast we mean, for example, a one-to-many communication in which the bearer of the communication in the network is a technology, like satellite or terrestrial transmission, that allows direct receipt by any receiver in range of the transmitter. The communication is carried to all recipients over a single, unidirectional, bandwidth-limited channel.
  • the video is decoded and presented at each of the devices.
  • Such devices include, for example, small screen, mobile, video-enabled devices, such as cell phones, PDAs, handheld video players, laptop and tablet PCs, handheld game-playing devices, and other portable consumer electronic devices.
  • mobile device we mean, for example, a device that is portable: small enough to be carried in one's hand and battery-powered, thus obviating the need for a continuously-connected external power source.
  • Such devices come in many different configurations and with many different kinds of screens and display capabilities.
  • one way to serve the needs of different target devices would be for the broadcaster to transmit a video service (akin to a broadcast TV channel) in multiple versions, each version customized for a target class of device, for example: one version for 1.5"-wide cell phone screens, a second version for larger-screen PDAs, a third version for yet-larger handheld devices or those with a different form factor and/or aspect ratio, and so on.
  • Yet broadcasting multiple versions is normally infeasible since a broadcast system has limited bandwidth.
  • a broadcaster could in theory transmit four different versions of each video service, but in doing so, he would reduce the number of different services he could transmit by approximately three-quarters.
  • a 5MHz band at 1.6GHz can accommodate only about 6Mb/s of transmission, using the DVB-H transmission standard.
  • Other mobile digital broadcast standards (ISDB-T, DMB-T, and FLO, for example) have similar bandwidth constraints.
  • advanced video compression algorithms e.g., H.264 or WMV-9
  • a mobile broadcaster's typical available spectrum permits between ten and forty video services of high-quality video, each of 15-20 frames/second of QVGA resolution (320x240 pixels), each requiring about 300kb/s. Reducing the number of services by some factor is an unviable business proposition.
  • video content is ingested from a source, converted to a suitable form for mobile use (for example, the video is resized to a smaller format and encoded in a highly-compressed file format), and broadcast.
  • Each client decodes the received signal and presents the video to its user.
  • any customization of the generic video signal to an individual screen must be done on each target device after the generic video signal has been received.
  • target devices, especially mobile video players, have modest computational resources and severe power constraints that limit their ability to customize the received video.
  • staged optimization of the video signal 22 can be used to take advantage of this principle: although analyzing the video signal and generating so-called "optimization directives" may require a large amount of computational power during the packaging of the video signal for broadcast, only a small amount of computational power is needed at the target device. Using staged optimization, the broadcaster need only transmit a single generic video signal. Each target device customizes the display of the video signal to suit its own capability.
  • the video signal 22 is analyzed and a compact stream of optimization directives 27 is generated 26.
  • the directives are interleaved with the video stream 29 to form a composite stream 28.
  • the directives represent annotations that suggest to a target device how to customize the frames according to its own capabilities and the characteristics of its display.
  • the annotations of the stream are typically broadcast 30 in-band with the video itself, although they could possibly be delivered in other ways and separately from the video stream.
  • the received optimization directives are applied in real-time to the incoming video stream of the composite stream 28 during the decoding and customization process 31, 32.
  • the decoded and customized signals are then sent to the respective displays 38, 40.
  • the video stream may be in Microsoft's Advanced Streaming Format (ASF), a file format designed to store synchronized multimedia data (the specification of which is available from Microsoft, for example, at http://www.microsoft.com/windows/windowsmedia/format/asfspec.aspx).
  • ASF Microsoft's Advanced Streaming Format
  • This format allows arbitrary objects to be embedded at specific times in the multimedia stream — for example, hyperlinks.
  • Applications designed to interpret and play ASF files (e.g., the Windows Media Player)
  • an ASF file may be injected with optimization directives by interleaving data packets representing these directives within the ASF File Data Object.
  • the optimization directive could, for example, be a signal that the central region of interest has shifted to a pixel position of (140,190) within the frame.
  • an event is generated. This event will drive the player to act on this event asynchronously from its main processing of the video stream.
  • the event could be a module that crops this and all subsequent video frames (until a new directive appears) to the size of the screen, in this case using the position (140,190) as the centroid of the crop.
  • the directive-creation step requires more computational capacity than the directive-processing step, because the former requires analyzing the input video.
  • video content is created at a production source 23, e.g., a television studio.
  • the content is transmitted to a head end 20 using various mechanisms including satellite, high-speed dedicated fiber, or, in some cases (when the content is not intended for live broadcast) simply by commercial transport of recorded media, e.g. a DVD.
  • the head-end 20 is, for example, a site where multimedia services (e.g., live TV) are packaged and distributed for broadcast to mobile clients.
  • multimedia services e.g., live TV
  • head-end processing occurs at a central facility sometimes called a network operations center (NOC).
  • NOC network operations center
  • the NOC is equipped with redundant, reliable sources of power, space for rack-mount servers, security, and appropriate cooling facilities.
  • the head-end or portions of it as a packaging server.
  • the encoder prepares the input media 66 for broadcast.
  • the encoding steps include the following.
  • the media is unwrapped 70 from its original format 71 (e.g., MPEG-2 or NTSC) into a "raw" format 72 (in which each frame is represented as a two-dimensional bitmap of pixel values).
  • the raw format video is resized 74 to a smaller size 76 having fewer pixels, by down-sampling or cropping or both.
  • the resized media is converted 78 to a highly-compressed "codec" 80 (e.g., AVC or WMV-9, each of which can encapsulate video at a high frame rate, say, 15-20 frames/second, with high fidelity, using less than 300 kb/s).
  • codec is wrapped 82 in a file container 84 like ASF.
  • the wrapper specifies the structure of the underlying media stream and may contain meta-information 86 such as name of the program, actors, release date, and so on.
  • the ASF format may also contain arbitrary additional objects interleaved with the media stream.
  • the file may be encrypted 90 using a digital rights management (DRM) package (e.g., Windows Media DRM or OMA DRM).
  • DRM digital rights management
  • Encoding may be done using commercial media encoding products, such as those from Real Networks, Entera, Envivio, and Apple, which run on Windows, Macintosh, and Unix operating systems.
  • the media is intended for real-time delivery, that is, to be played at the target device, essentially as it is ingested at the head-end, aside from some minimal latency incurred by buffering and processing.
  • the packaging server must keep up with the ingested video. That is, the head-end system(s) cannot take longer than one second to process one second of input video and generate the corresponding one second of output.
  • multiple parallel video streams 22 from video sources 23 are handled by multiple encoders 40, 42, 44, each of which performs the encoding, analysis, generation of directives, and interleaving of the directives and the video frames for one of the streams.
  • the encoded (compressed) streams 28 are transmitted (typically via IPv4/v6 multicast) into an Internet protocol encapsulator (IPE) 46, which multiplexes them together into an MPEG2 TS (transport stream) 50 that includes multiple concurrent streams of video, audio, and other data 52.
  • IPE Internet protocol encapsulator
  • An electronic service guide (ESG) 56 provides programming information and a lookup table that allows the device to extract a user-selected service from the multiplex.
  • the ESG is shown multiplexed into the MPEG-2 TS, but it may be conveyed in other ways over or separate from the broadcast channel.
  • the ESG stream 73 that contains information for each of the services 74 is produced by an ESG generator 77 from programming information 79 provided manually by an operator, or automatically, for example, from an XMLTV/TMS feed.
  • the broadcast hardware receives the TS, selects a service by consulting the ESG, and plays out the corresponding stream(s).
  • a receiver/tuner 60 selects a broadcast frequency, receives 104 the broadcast digital data stream 106, and delivers an MPEG-2 transport stream 50 containing, in interleaved form, the various media services 52, including audio, video, and the ESG 56.
  • a service 108 is then selected 110 from the multiplexed transport stream.
  • a specific service may correspond to a number of separate logical streams 114, 116, 118 in the transport stream, for example, an audio stream, a video stream, an English closed-captioning stream, a Spanish closed-captioning stream, and so on. This is analogous to selecting a channel on a traditional television.
  • Decompression algorithms 120 extract the raw video signal 122, frame by frame and pixel by pixel, from the compressed video. Decompression algorithms 120 also extract the raw audio signal 126. A decryption algorithm 128 may also be applied to the IP packets based on a key 130 that is available only to those users who have subscribed and paid for the service.
  • the ESG stream 56 is decoded and presented 130 to the user in a TV-guide-like form: channels, schedule, and program descriptions.
  • the user is also provided a "graphical user interface" (GUI) that offers the user the ability to select a service and view it.
  • GUI graphical user interface
  • Shrinking the input frame to the required size, without altering the aspect ratio, may require generating a pixel value by averaging the value from a neighborhood of pixels.
  • Common techniques for averaging, such as bicubic, bilinear, and nearest neighbor algorithms, are supported by popular image-processing software tools, like Adobe Photoshop, and can also be executed in real-time on typical mobile video players.
  • Feature-length movies typically are recorded with an aspect ratio between 1.78:1 and 2.35:1, which is optimal for movie-theater viewing.
  • Viewing such a movie on a 4:3 (or even 16:9) TV produces letterboxing, but on a large enough TV screen, the effect is benign, because the blank lines at the top and bottom of the screen leave room for the wide-aspect-ratio video within the screen.
  • letterboxing may reduce the height of the video to 1.5" or less, rendering it unacceptably small.
  • A technique known as pan and scan is used to correct letterboxing. Movie producers often embed pan and scan directives in the digital stream of DVDs that instruct a TV how to extract a 4:3 rectangular viewing region from each frame. The director or a technician selects the optimal region on a frame-by-frame basis.
  • a pan and scan directive is a pair of (x,y) values that specify the corners of the exact bounding box of the frame that is to be rendered on the display of the target device.
  • the embedded pan and scan directives typically anticipate a fixed target form factor, e.g., NTSC 4:3 at 640x480 pixels; for a receiving device having a different form factor, the directives will not be suitable. And, because of the cost of creating pan and scan directives, they may be embedded in, for example, popular feature-length movies but not in a local news stream. In addition, as shown in figure 8, a panned and scanned frame 144 may drop important information that appeared in the original frame 142.
  • one strategy for rendering at the client is to crop around the centroid of the frame. This works well when the subject is near the center of the frame, but that is sometimes not the case.
  • two device screen sizes 146, 148 having a 16:9 aspect ratio and a 1:1 (square) aspect ratio are overlaid on the original frame 150.
  • centroid-based cropping shaves off the top of the subject's head and the cathedral in the background.
  • generating optimization directives 160 at the head end 20 can yield better presentation of video streams on client devices that have a variety of display formats and capabilities.
  • the uncompressed input video 23 is analyzed 162 (frame by frame, pixel by pixel, in raw uncompressed form) to determine an (x,y) point 164, 165 for each frame that best represents the centroid of the primary region of interest (ROI) in that frame.
  • This (x,y) information is compressed and embedded 166 as an optimization directive stream 160 into the broadcast stream 50.
  • the stream of (x,y) ROI locations 160 is used to generate, for each frame, a bounding box for cropping the frame and putting the (x,y) location at the center of each box.
  • optimization directive stream for each video stream in the broadcast. For example, if there are eight parallel broadcast services, there will be eight optimization directive streams.
  • the optimization directive for each frame is a pair of numbers, (x,y), representing the location of the pixel that lies at the centroid of the primary ROI in that frame.
  • the single (x,y) value provides information that allows a target device to create a bounding box around the centroid as appropriate for the resolution, size, and aspect ratio of the target device's display.
  • the resulting video stream and directives are decoded at the client to recover the uncompressed form, and the optimization directives guide the cropping of the frames performed by the client in a transform step.
  • (x,y) be the ROI for a frame, found in the optimization directive stream;
  • w_d be the width of the device's screen;
  • h_d be its height;
  • w_f be the width of the input video frame, and h_f be its height.
  • the transform i.e., the cropping of the input frame
  • This cropping operation is extremely simple to implement as a step in video rendering on a mobile device.
  • the device has a recent version (e.g. 5.0) of the Microsoft Windows Mobile operating system
  • the cropping Filter may be added to a processing chain (known as a graph in DirectShow) by invoking a GraphManager object and inserting the cropping Filter at the desired point in the flow.
  • a human operator manually inserts ROI information on a frame-by-frame basis, by using a computer workstation, an interactive application, and a pointing device such as a mouse.
  • a similar setup is employed to create pan-and-scan directives.
  • the ROI centroid can be computed simply as the centroid of all the foreground pixels.
  • Such partitioning is difficult and error-prone, but possible in the following cases, for example: Video that is recorded using multiple (stereo) cameras can be partitioned at high-quality and in real-time as explained in M. Harville, G. Gordon, J. Woodfill, "Foreground Segmentation Using Adaptive Mixture Models in Color and Depth", Proceedings of the IEEE Workshop on Detection and Recognition of Events in Video, (Vancouver, Canada), July 2001 and illustrated in figure 12.2.
  • Video that is encoded using a layered coding technique enables video to be encoded as objects that can be tracked as they move temporally across frames.
  • a frame-by-frame ROI location (x,y) can be calculated by averaging the (x1,y1) and (x2,y2) pan and scan vectors as follows: x = (x1 + x2)/2 and y = (y1 + y2)/2.
  • Faces can be identified and tracked in real-world video sequences, as described in P. Viola and M. Jones, "Robust Real-time Face Detection" in Int'l Journal of
  • a face-detection system may be used as the first stage in an ROI tracker followed by computing the center of mass of all the detected faces to be used as the frame's ROI. Face detection can eliminate head-cropping artifacts of the kind illustrated in figure 9.
  • MPEG motion vectors embedded in the compressed video stream
  • United States patent application 20050099541 describes a technique for generating pan-and-scan directives based on motion vectors. When it can be determined that a large convex region is moving monolithically and above a certain velocity threshold, the video is panned in the direction of motion.
  • Locating text within video has been discussed in, e.g., T. Sato, T. Kanade, E. Hughes, and M. Smith, "Video OCR for Digital News Archives", IEEE Workshop on Content-Based Access of Image and Video Databases (CAIVD'98), Bombay, India, pp. 52-60, January 1998, using a combination of heuristics to detect text. These include looking for high-contrast regions and sharp edges. The better-performing techniques typically integrate results over a sequence of frames and apply multiple predicates (detectors) to the image.
  • Figure 15 from IEEE Int'l Workshop on Content-based access of image and video databases (ICCV98) - M Smith and T Kanade shows the output of a joint text/face detector developed at Carnegie Mellon University.
  • the several analysis techniques described above can be combined in a process, illustrated in figure 16, the output of which is a stream of (x,y) ROI locations, one per frame.
  • the input video is searched 200 for the presence of pan and scan vectors. If found, the vectors are used to generate 202 the ROI values. If the input video contains layered coding 204, segmentation is performed 206 using the layer information. If the video is recorded in stereo 208, the stereo information is used 210. Otherwise, if none of the above, face detection and motion analysis 212, 214 can be applied. Text detection 216 can then be used to adjust the ROI value.
  • Pattern classification systems such as object detectors may produce a confidence level for the results that they generate. If the confidence level is low, the system may default to the centroid of the original frame, rather than accepting a "wild guess." (This concept is suggested in patent application 20050099541.)
  • the packaging process implements a hysteresis-type filter that dampens the speed of reaction to a moving ROI. Instead of jumping fifty pixels at once, for example, the ROI is moved incrementally towards the target position. If, in the next frame, the ROI is the same, the tracker will again move incrementally towards it.
  • the device owner is given the option to enable or disable the customization feature.
  • When customization is on, the optimization directives are applied.
  • When it is off, they are ignored and the device instead performs its default frame rendering, for example, down-sampling and cropping around the center.
  • the option can be chosen on a per-channel, per-program, and per-minute basis.
  • the system could be used to enable the display device to optimize the frames for any arbitrary geometry, for example, any arbitrary aspect ratio and even any arbitrary shape: oval-shaped, diamond-shaped, or any convex or other region.
  • All that is required is that the client know how to crop an input video frame to its own geometry.
  • the application of the system is not limited to handheld, portable, or mobile devices but could also extend to any small-screen device, including a device that could be mounted on a wall or dashboard or embedded in another device like a thermostat.
  • aspects and features of the system could be used for content that is other than video, for communication systems that are other than broadcast, and for techniques other than cropping.
  • the device to which the video is being sent could be other than a telephone, for example, any hand-held or non hand-held device, for example, a personal digital assistant or a display mounted in a car.

Abstract

Frames of broadcast video received at a device are cropped to produce rendered frames on a display of the device, the cropping being based on both (a) a pre-determined value indicating a location within each of at least some of the frames with respect to which the cropping is to be done and (b) a characteristic of the display.

Description

BROADCASTING VIDEO CONTENT TO DEVICES HAVING DIFFERENT VIDEO PRESENTATION CAPABILITIES
Background
This description relates to broadcasting video content to devices having different content presentation capabilities.
In the case of digital video content, the devices span a wide range from projection TVs to flip-phones and others. Their display capabilities vary in terms of screen size, density of pixels, and color depth, for example. The best way to present a given video item may differ for different target devices. For example, a ticker of stock prices or sports scores at the bottom of a news video may be legible on a conventional TV set but blurred on the screen of a cell phone or personal digital assistant (PDA).
Summary
In general, in one aspect, frames of broadcast video received at a device are cropped to produce rendered frames on a display of the device, the cropping being based on both (a) a pre-determined value indicating a location within each of at least some of the frames with respect to which the cropping is to be done and (b) a characteristic of the display.
Implementations may include one or more of the following features. The location comprises a centroid of a region of interest of the input digital image. The pre-determined value is received from a source external to the device. The pre-determined value is received with the input broadcast video. There is a one-to-one association of frames and pre-determined values. The cropping of each of the frames occurs in real time as the frames are received. The pre-determined values are generated remotely from the device. The device comprises a mobile device. The mobile device comprises a telephone or a personal digital assistant or a vehicle-mounted display. The characteristic of the display includes the aspect ratio of the display. The characteristic of the display includes the pixel density of the display. The rendered image includes text.
In general, in another aspect, for a frame of video to be broadcast to at least two different devices that have different capabilities for presenting the video to users, a single directive is generated based on the video, the single directive being usable at each of the devices to control the presentation of the frame in two different ways that accommodate the respective capabilities of the two devices.
Implementations may include one or more of the following features. The directive is packaged with the video for delivery together to the two devices. Each of the frames of the video is associated with a directive based on the video. At least one of the devices comprises a mobile device. Presenting the frame comprises displaying the frame to a user. The directive identifies a location of a region of interest in the frame. The capabilities of the device include a size of a display. The capabilities of the device include a density of pixels of a display. The directive is generated at a central location. The central location comprises a head-end of a mobile video broadcasting system. The mobile video broadcasting system comprises DVB-H. The directive is generated in connection with encoding of the video.
In general, in another aspect, for a frame of video to be delivered to at least two different devices that have different capabilities for presenting the frame to users, a single directive is generated based on the video, the single directive being usable at each of the devices to control the presentation of the frame in two different ways that accommodate the respective capabilities of the two devices, and at each of the two different devices the frame is cropped based on both the directive and a characteristic of the display.
Implementations may include one or more of the following features. The directive comprises an indication of the location of a region of interest in at least one frame of the video.
In general, in another aspect, a region of interest is determined in a frame of a video, and a location is determined representative of the region of interest, the location being sufficient to enable each of two different video playing devices having different display characteristics to render the frame in a manner suitable for the display characteristics of the device.
Implementations may include one or more of the following features. The location comprises a centroid of the region of interest. Such a location is determined for each of additional frames of the video. Information about the location is embedded in the video. The location is used in forming a directive stream associated with the video. The region of interest is based on foreground pixels of the image. The region of interest is based on a moving object or region in the video. The region of interest is based on pan and scan vectors of the video. The region of interest is based on a face appearing in the video. The region of interest is based on the existence of small-font text on a periphery of the frame.
Other aspects include other combinations of features and aspects recited above and other methods, apparatus, programs, and systems, and may include aspects and features in which modifications other than or in addition to cropping are performed, communication is other than or in addition to broadcast, and the content is other than or in addition to video.
Other advantages and features will become apparent from the following description and from the claims.
Description
Figures 1 through 5, 10, 11, and 16 are block diagrams.
Figures 6 through 9 and 12 through 15 show video frames.
As shown in figure 1, in the example of a broadcast video system 10, video material (for example, in the form of compressed digital video) 12 may be broadcast 14 to target devices 16, 18 having a range of display capabilities. By broadcast we mean, for example, a one-to-many communication in which the bearer of the communication in the network is a technology, like satellite or terrestrial transmission, that allows direct receipt by any receiver in range of the transmitter. The communication is carried to all recipients over a single, unidirectional, bandwidth-limited channel.
The video is decoded and presented at each of the devices. Such devices include, for example, small screen, mobile, video-enabled devices, such as cell phones, PDAs, handheld video players, laptop and tablet PCs, handheld game-playing devices, and other portable consumer electronic devices. By "mobile device" we mean, for example, a device that is portable: small enough to be carried in one's hand and battery-powered, thus obviating the need for a continuously-connected external power source. Such devices come in many different configurations and with many different kinds of screens and display capabilities.
In theory, one way to serve the needs of different target devices would be for the broadcaster to transmit a video service (akin to a broadcast TV channel) in multiple versions, each version customized for a target class of device, for example: one version for 1.5"-wide cell phone screens, a second version for larger-screen PDAs, a third version for yet-larger handheld devices or those with a different form factor and/or aspect ratio, and so on. Yet broadcasting multiple versions is normally infeasible since a broadcast system has limited bandwidth. A broadcaster could in theory transmit four different versions of each video service, but in doing so, he would reduce the number of different services he could transmit by approximately three-quarters. For example, a 5MHz band at 1.6GHz can accommodate only about 6Mb/s of transmission, using the DVB-H transmission standard. (Other mobile digital broadcast standards — ISDB-T, DMB-T, and FLO, for example — have similar bandwidth constraints.) Using advanced video compression algorithms (e.g., H.264 or WMV-9), a mobile broadcaster's typical available spectrum permits between ten and forty video services of high-quality video, each of 15-20 frames/second of QVGA resolution (320x240 pixels), each requiring about 300kb/s. Reducing the number of services by some factor is an unviable business proposition.
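The trade-off can be checked with a short calculation. The following is a minimal sketch that uses only the round figures quoted above (about 6 Mb/s of capacity and about 300 kb/s per service); the function name is hypothetical and the numbers are illustrative rather than measured.

```python
def services_per_band(band_kbps: int, service_kbps: int, versions: int = 1) -> int:
    """Number of distinct services that fit when each is sent in `versions` variants."""
    return band_kbps // (service_kbps * versions)


# Figures quoted in the text: ~6 Mb/s of capacity, ~300 kb/s per service.
single = services_per_band(6000, 300)         # one generic version per service
quadruple = services_per_band(6000, 300, 4)   # four device-specific versions

# Fraction of services lost by sending four versions instead of one.
reduction = 1 - quadruple / single
```

Under these assumptions, sending four versions of every service leaves room for a quarter as many services, which matches the "approximately three-quarters" reduction stated above.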
Thus a broadcaster can afford to create only a single, generic version of each video service for transmission to an entire target population of devices and users.
In a typical mobile video broadcast center, video content is ingested from a source, converted to a suitable form for mobile use (for example, the video is resized to a smaller format and encoded in a highly-compressed file format), and broadcast. Each client decodes the received signal and presents the video to its user. As discussed earlier, any customization of the generic video signal to an individual screen must be done on each target device after the generic video signal has been received. Yet target devices — especially mobile video players — have modest computational resources and severe power constraints that limit their ability to customize the received video.
As broadcasting is a one-to-many activity in which a single packaging or head-end facility creates content for transmission to a large client population, it is efficient to add computational resources at the single head-end rather than at the large number of clients. The computational power of the packaging process can be increased easily by upgrading the packaging server. A comparable upgrade of the computational power of the clients would require in-the-field updates of hardware and/or software for potentially millions of devices.
As shown in figure 2, a so-called staged optimization of the video signal 22 can be used to take advantage of this principle: although analyzing the video signal and generating so-called "optimization directives" may require a large amount of computational power during the packaging of the video signal for broadcast, only a small amount of computational power is needed at the target device. Using staged optimization, the broadcaster need only transmit a single generic video signal. Each target device customizes the display of the video signal to suit its own capability.
At the packaging server 20, the video signal 22 is analyzed and a compact stream of optimization directives 27 is generated 26. The directives are interleaved with the video stream 29 to form a composite stream 28. The directives represent annotations that suggest to a target device how to customize the frames according to its own capabilities and the characteristics of its display. The annotations of the stream are typically broadcast 30 in-band with the video itself, although they could possibly be delivered in other ways and separately from the video stream.
At the different kinds of clients 14, 18, the received optimization directives are applied in real-time to the incoming video stream of the composite stream 28 during the decoding and customization process 31, 32. The decoded and customized signals are then sent to the respective displays 38, 40.
In some implementations, the video stream may be in Microsoft's Advanced Streaming Format (ASF), a file format designed to store synchronized multimedia data (the specification of which is available from Microsoft, for example, at http://www.microsoft.com/windows/windowsmedia/format/asfspec.aspx). This format allows arbitrary objects to be embedded at specific times in the multimedia stream, for example, hyperlinks. Applications designed to interpret and play ASF files (e.g., the Windows Media Player) recognize these embedded objects and act upon them or, if an object is not recognized, pass it to an external module for processing. In this latter case, an ASF file may be injected with optimization directives by interleaving data packets representing these directives within the ASF File Data Object. The optimization directive could, for example, be a signal that the central region of interest has shifted to a pixel position of (140,190) within the frame. When a player receives an optimization directive packet, an event is generated. This event drives the player to act on the directive asynchronously from its main processing of the video stream. The event could be handled by a module that crops this and all subsequent video frames (until a new directive appears) to the size of the screen, in this case using the position (140,190) as the centroid of the crop.
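The "until a new directive appears" behavior can be sketched as follows. This is a minimal illustration, not the ASF packet format or any player's API: the composite stream is modeled as a hypothetical sequence of ("frame", payload) and ("directive", (x, y)) tuples, and the default ROI (the frame center) is an assumption.

```python
from typing import Iterable, Iterator, Tuple

Roi = Tuple[int, int]  # (x, y) centroid of the region of interest


def pair_frames_with_roi(stream: Iterable[tuple], default_roi: Roi) -> Iterator[tuple]:
    """Yield (frame, roi) pairs; each directive stays in force until replaced."""
    roi = default_roi
    for kind, payload in stream:
        if kind == "directive":
            roi = payload          # remember the most recent directive
        elif kind == "frame":
            yield payload, roi     # crop this frame around the current ROI


# Hypothetical composite stream for a 320x240 video (center = (160, 120)).
composite = [
    ("frame", "f0"),
    ("directive", (140, 190)),
    ("frame", "f1"),
    ("frame", "f2"),
    ("directive", (100, 120)),
    ("frame", "f3"),
]
paired = list(pair_frames_with_roi(composite, default_roi=(160, 120)))
```

Here frames f1 and f2 both use the (140,190) position, and f0, which precedes any directive, falls back to the assumed default.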
The directive-creation step requires more computational capacity than the directive-processing step, because the former requires analyzing the input video.
We turn now to a more detailed description of some implementations beginning with figure 3 in which video content is created at a production source 23, e.g., a television studio. The content is transmitted to a head end 20 using various mechanisms including satellite, high-speed dedicated fiber, or, in some cases (when the content is not intended for live broadcast) simply by commercial transport of recorded media, e.g. a DVD.
By analogy with typical cable TV network terminology, the head-end 20 is, for example, a site where multimedia services (e.g., live TV) are packaged and distributed for broadcast to mobile clients. For mobile video delivery, head-end processing occurs at a central facility sometimes called a network operations center (NOC). Typically, the NOC is equipped with redundant, reliable sources of power, space for rack-mount servers, security, and appropriate cooling facilities. We sometimes refer to the head-end or portions of it as a packaging server.
As shown in figure 4, after ingestion 68, the encoder prepares the input media 66 for broadcast. The encoding steps include the following. The media is unwrapped 70 from its original format 71 (e.g., MPEG-2 or NTSC) into a "raw" format 72 (in which each frame is represented as a two-dimensional bitmap of pixel values). The raw format video is resized 74 to a smaller size 76 having fewer pixels, by down-sampling or cropping or both. The resized media is converted 78 to a highly-compressed "codec" 80 (e.g., AVC or WMV-9, each of which can encapsulate video at a high frame rate, say, 15-20 frames/second, with high fidelity, using less than 300 kb/s). The codec is wrapped 82 in a file container 84 like ASF. The wrapper specifies the structure of the underlying media stream and may contain meta-information 86 such as name of the program, actors, release date, and so on. As we have discussed, the ASF format may also contain arbitrary additional objects interleaved with the media stream. Optionally, the file may be encrypted 90 using a digital rights management (DRM) package (e.g., Windows Media DRM or OMA DRM).
Encoding may be done using commercial media encoding products, such as those from Real Networks, Entera, Envivio, and Apple, which run on Windows, Macintosh, and Unix operating systems.
In some cases, the media is intended for real-time delivery, that is, to be played at the target device, essentially as it is ingested at the head-end, aside from some minimal latency incurred by buffering and processing. To stream content, the packaging server must keep up with the ingested video. That is, the head-end system(s) cannot take longer than one second to process one second of input video and generate the corresponding one second of output.
In some other contexts, there is no real-time constraint on encoding. For example, a documentary may not need to be viewed as it is produced. In this case, the packaging may be allowed to take considerably longer than real-time and may include more complex manipulations and analyses. For example, many commercial encoders (e.g. Windows Media Encoder) offer a two-pass encoding scheme for processing non-streaming content: during the first pass the video is analyzed and processed, and during the second pass the encoded (compressed) output is written. Two-pass encoding is not available in real-time streaming, because the content must be processed and written out as it is received.
As shown in more detail in figure 3, multiple parallel video streams 22 from video sources 23 (different TV channels, for instance) are handled by multiple encoders 40, 42, 44, each of which performs the encoding, analysis, generation of directives, and interleaving of the directives and the video frames for one of the streams. The encoded (compressed) streams 28 are transmitted (typically via IPv4/v6 multicast) into an Internet protocol encapsulator (IPE) 46, which multiplexes them together into an MPEG2 TS (transport stream) 50 that includes multiple concurrent streams of video, audio, and other data 52.
An electronic service guide (ESG) 56 provides programming information and a lookup table that allows the device to extract a user-selected service from the multiplex. The ESG is shown multiplexed into the MPEG-2 TS, but it may be conveyed in other ways over or separate from the broadcast channel. The ESG stream 73 that contains information for each of the services 74 is produced by an ESG generator 77 from programming information 79 provided manually by an operator, or automatically, for example, from an XMLTV/TMS feed. At the client 14, 18, the broadcast hardware receives the TS, selects a service by consulting the ESG, and plays out the corresponding stream(s).
Technologies for digital broadcast transmission to mobiles, for example, DVB-H, ISDB-T, DMB-T, and MediaFLO, transmit the IP packet stream (possibly with further packaging) from multiple terrestrial towers, typically using a reserved part of the RF spectrum. The differences among these technologies are mostly at the physical and network level, and not significant to the system described here. Our discussion here also will not include discussions of the following issues that are not relevant to an understanding of the system described here: physical network issues including power conservation, protecting against Doppler interference, supporting handover/roaming, preventing transmission errors using forward error correction (FEC), and broadcast network configuration including where to place transmitters.
Referring again to figure 3, and also to figure 5, handling of the broadcast signal at the client involves several steps. A receiver/tuner 60 selects a broadcast frequency, receives 104 the broadcast digital data stream 106, and delivers an MPEG-2 transport stream 50 containing, in interleaved form, the various media services 52, including audio, video, and the ESG 56.
A service 108 is then selected 110 from the multiplexed transport stream. A specific service may correspond to a number of separate logical streams 114, 116, 118 in the transport stream, for example, an audio stream, a video stream, an English closed-captioning stream, a Spanish closed-captioning stream, and so on. This is analogous to selecting a channel on a traditional television.
Decompression algorithms 120 extract the raw video signal 122, frame by frame and pixel by pixel, from the compressed video. Decompression algorithms 120 also extract the raw audio signal 126. A decryption algorithm 128 may also be applied to the IP packets based on a key 130 that is available only to those users who have subscribed and paid for the service.
The ESG stream 56 is decoded and presented 130 to the user in a TV-guide-like form: channels, schedule, and program descriptions. The user is also provided a "graphical user interface" (GUI) that offers the user the ability to select a service and view it.
We now focus on the video-resizing aspect of the system.
Shrinking the input frame to the required size, without altering the aspect ratio, may require generating a pixel value by averaging the value from a neighborhood of pixels. Common techniques for averaging, such as bicubic, bilinear, and nearest neighbor algorithms are supported by popular image-processing software tools, like Adobe Photoshop, and can also be executed in real-time on typical mobile video players.
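As an illustration of generating a pixel value from a neighborhood of pixels, the following minimal sketch uses a plain box average over non-overlapping blocks. This is a simpler stand-in for the bilinear and bicubic methods named above, not an implementation of them, and it assumes a grayscale frame stored as a list of rows and an integer shrink factor.

```python
from typing import List


def downsample_box(frame: List[List[int]], factor: int) -> List[List[int]]:
    """Shrink a grayscale frame by averaging each factor x factor block of pixels."""
    out_h = len(frame) // factor
    out_w = len(frame[0]) // factor
    out = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Collect the neighborhood that maps onto output pixel (r, c).
            block = [
                frame[r * factor + i][c * factor + j]
                for i in range(factor)
                for j in range(factor)
            ]
            row.append(sum(block) // len(block))  # integer mean of the block
        out.append(row)
    return out


small = downsample_box(
    [[0, 2, 10, 10],
     [2, 4, 10, 10],
     [8, 8, 1, 3],
     [8, 8, 5, 7]],
    factor=2,
)
```

Each output pixel here summarizes four input pixels, which is why fine detail such as small text tends to blur when a frame is shrunk this way.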
The use of simple down-sampling assumes that it is preferable to show an entire video frame at reduced size and resolution rather than a zoomed-in part of that frame at a higher size and resolution. This assumption is not always justified. As shown in figure 6, simply down-sampling a frame 130 to a small size frame 132 is unsatisfactory because the subject (the news anchor) is already small in the frame 130 and becomes almost invisible in the small frame 132. Reducing the frame size by cropping and zooming to frame 134 gives a superior viewer experience. As shown in figure 7, down-sampling can create artifacts such as blank borders 140 when the aspect ratio of the target device does not match that of the video signal. This artifact is known as letterboxing.
Feature-length movies typically are recorded with an aspect ratio between 1.78:1 and 2.35:1, which is optimal for movie-theater viewing. Viewing such a movie on a 4:3 (or even 16:9) TV (using a DVD, for example) produces letterboxing, but on a large enough TV screen, the effect is benign, because the blank lines at the top and bottom of the screen leave room for the wide-aspect-ratio video within the screen. Conversely, on the 2"-high screen of a cell phone, for example, letterboxing may reduce the height of the video to 1.5" or less, rendering it unacceptably small.
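The effect on a small screen can be worked through numerically. The sketch below assumes a 4:3 screen that is 2 inches high (the 4:3 shape is our assumption; the text gives only the height) and computes the height of the picture after a wide-aspect video is scaled to fit the screen width.

```python
def letterboxed_height(screen_w: float, screen_h: float, video_aspect: float) -> float:
    """Height of the picture when the video is scaled to fit the screen, aspect preserved."""
    fitted_h = screen_w / video_aspect   # height if the video fills the full width
    return min(screen_h, fitted_h)       # never taller than the screen itself


SCREEN_H = 2.0                # inches, from the text
SCREEN_W = SCREEN_H * 4 / 3   # assumed 4:3 screen

height_178 = letterboxed_height(SCREEN_W, SCREEN_H, 1.78)
height_235 = letterboxed_height(SCREEN_W, SCREEN_H, 2.35)
```

Under this assumption a 1.78:1 movie is shown about 1.5 inches high and a 2.35:1 movie a little over 1.1 inches high, consistent with the "1.5 inches or less" figure above.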
A technique known as pan and scan is used to correct letterboxing. Movie producers often embed pan and scan directives in the digital stream of DVDs that instruct a TV how to extract a 4:3 rectangular viewing region from each frame. The director or a technician selects the optimal region on a frame-by-frame basis. A pan and scan directive is a pair of (x,y) values that specify the corners of the exact bounding box of the frame that is to be rendered on the display of the target device.
Because the embedded pan and scan directives typically anticipate a fixed target form factor, e.g., NTSC 4:3 at 640x480 pixels, for a receiving device having a different form factor, the directives will not be suitable. And, because of the cost of creating pan and scan directives, they may be embedded in, for example, popular feature-length movies but not in a local news stream. In addition, as shown in figure 8, a panned and scanned frame 144 may drop important information that appeared in the original frame 142.
In the absence of useful pan and scan directives, or any directives at all, one strategy for rendering at the client is to crop around the centroid of the frame. This works well when the subject is near the center of the frame, but that is sometimes not the case. As shown in figure 9, two device screen sizes 146, 148 having a 16:9 aspect ratio and a 1:1 (square) aspect ratio are overlaid on the original frame 150. For both target devices, centroid-based cropping shaves off the top of the subject's head and the cathedral in the background.
As shown in figures 10 and 11, generating optimization directives 160 at the head end 20 can yield better presentation of video streams on client devices that have a variety of display formats and capabilities. At the head end, during video packaging, the uncompressed input video 23 is analyzed 162 (frame by frame, pixel by pixel, in raw uncompressed form) to determine an (x,y) point 164, 165 for each frame that best represents the centroid of the primary region of interest (ROI) in that frame. This (x,y) information is compressed and embedded 166 as an optimization directive stream 160 into the broadcast stream 50.
At the client, the stream of (x,y) ROI locations 160 is used to generate, for each frame, a bounding box for cropping the frame and putting the (x,y) location at the center of each box.
There is one optimization directive stream for each video stream in the broadcast. For example, if there are eight parallel broadcast services, there will be eight optimization directive streams.
The optimization directive for each frame is a pair of numbers, (x,y), representing the location of the pixel that lies at the centroid of the primary ROI in that frame. The single (x,y) value provides information that allows a target device to create a bounding box around the centroid as appropriate for the resolution, size, and aspect ratio of the target device's display.
The resulting video stream and directives are decoded at the client to recover the uncompressed form, and the optimization directives guide the cropping of the frames performed by the client in a transform step.
Let $(x, y)$ be the ROI for a frame, found in the optimization directive stream; $w_d$ be the width of the device's screen; $h_d$ be its height; $w_f$ be the width of the input video frame; and $h_f$ be its height.
Using this notation, the transform (i.e., the cropping of the input frame) is as follows:
In general, select a $(w_d, h_d)$ rectangle from within the input frame, where the centroid of the selected rectangle is the point $(x, y)$. However, to prevent the selected rectangle from containing blank space, adjust the step as follows to keep the selected rectangle fully contained within the frame:
if $y < h_d/2$ then $y = h_d/2$; if $y > h_f - h_d/2$ then $y = h_f - h_d/2$; if $x < w_d/2$ then $x = w_d/2$; and if $x > w_f - w_d/2$ then $x = w_f - w_d/2$.
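The clamped crop just described can be sketched in a few lines. This is a minimal illustration rather than a definitive implementation: the function name is hypothetical, the crop is assumed to be no larger than the frame, and the 176x144 display used in the example is an assumed device size (the 320x240 frame matches the QVGA resolution mentioned earlier).

```python
from typing import Tuple


def crop_box(x: float, y: float, w_d: int, h_d: int, w_f: int, h_f: int) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) of a w_d x h_d crop centered as near (x, y) as the frame allows."""
    if w_d > w_f or h_d > h_f:
        raise ValueError("display rectangle must fit inside the input frame")
    # Clamp the centroid so the rectangle stays fully inside the frame.
    x = min(max(x, w_d / 2), w_f - w_d / 2)
    y = min(max(y, h_d / 2), h_f - h_d / 2)
    left = int(x - w_d / 2)
    top = int(y - h_d / 2)
    return left, top, left + w_d, top + h_d


# ROI directive (140, 190) in a 320x240 frame, shown on an assumed 176x144 display.
box = crop_box(140, 190, 176, 144, 320, 240)
# An ROI near the top-left corner is pushed inward so no blank space appears.
corner_box = crop_box(10, 10, 176, 144, 320, 240)
```

In the first example the requested y position is too close to the bottom edge, so the rectangle is shifted up until its lower edge coincides with the bottom of the frame.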
This cropping operation is extremely simple to implement as a step in video rendering on a mobile device. For example, if the device has a recent version (e.g. 5.0) of the Microsoft Windows Mobile operating system, one may use the DirectShow multimedia feature and implement the cropping process as a Filter. The cropping Filter may be added to a processing chain (known as a graph in DirectShow) by invoking a GraphManager object and inserting the cropping Filter at the desired point in the flow.
Now we return to the analysis step performed at the head end to determine the ROI locations for the frames of an input video stream. We describe several approaches for generating ROI information and a technique for consolidating these approaches into a single system.
1. Manually: A human operator manually inserts ROI information on a frame-by-frame basis, by using a computer workstation, an interactive application, and a pointing device such as a mouse. A similar setup is employed to create pan-and-scan directives.
2. By partitioning a video frame, pixel by pixel, into foreground and background partitions, the ROI centroid can be computed simply as the centroid of all the foreground pixels. Such partitioning is difficult and error-prone, but possible in the following cases, for example:
  Video that is recorded using multiple (stereo) cameras can be partitioned at high quality and in real time, as explained in M. Harville, G. Gordon, J. Woodfill, "Foreground Segmentation Using Adaptive Mixture Models in Color and Depth", Proceedings of the IEEE Workshop on Detection and Recognition of Events in Video, (Vancouver, Canada), July 2001, and illustrated in figure 12.
  Video that is encoded using a layered coding technique (for example MPEG-4) enables video to be encoded as objects that can be tracked as they move temporally across frames.
3. When the input signal contains pan-and-scan vectors, a frame-by-frame ROI location $(x, y)$ can be calculated by averaging the $(x_1, y_1)$ and $(x_2, y_2)$ pan and scan vectors as follows:
$$x = \frac{x_1 + x_2}{2}$$
and
$$y = \frac{y_1 + y_2}{2}$$
4. Faces can be identified and tracked in real-world video sequences, as described in P. Viola and M. Jones, "Robust Real-time Face Detection" in Int'l Journal of Computer Vision 57(2):137-154, 2001, and illustrated in two examples in figure 13. A face-detection system may be used as the first stage in an ROI tracker followed by computing the center of mass of all the detected faces to be used as the frame's ROI. Face detection can eliminate head-cropping artifacts of the kind illustrated in figure 9.
5. MPEG motion vectors (embedded in the compressed video stream) may be analyzed to determine where the moving objects are in each frame, and then to assign those objects as the region of interest. United States patent application 20050099541 describes a technique for generating pan-and-scan directives based on motion vectors. When it can be determined that a large convex region is moving monolithically and above a certain velocity threshold, the video is panned in the direction of motion.
6. When synthetic computer-generated text is inserted in the video stream during production, if the video frame is proportionally down-sampled for small-screen devices, smaller-font text may become illegible, as shown in frame 180 of figure 14. During analysis, the head-end examines the input frames for small-font text that appears on the periphery of the frame, and uses that information to guide the ROI tracker. All else being equal, a small-font ticker on the bottom of the screen would shift the ROI further up in the video frame.
Locating text within video has been discussed in, e.g., T. Sato, T. Kanade, E. Hughes, and M. Smith, "Video OCR for Digital News Archives", IEEE Workshop on Content-Based Access of Image and Video Databases (CAIVD'98), Bombay, India, pp. 52-60, January 1998, using a combination of heuristics to detect text. These include looking for high-contrast regions and sharp edges. The better-performing techniques typically integrate results over a sequence of frames and apply multiple predicates (detectors) to the image. Figure 15, from IEEE Int'l Workshop on Content-based access of image and video databases (ICCV98) - M Smith and T Kanade, shows the output of a joint text/face detector developed at Carnegie Mellon University.
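Several of the approaches listed above reduce to a centroid computation once the underlying detection or segmentation is available. The following minimal sketch shows three of them under stated assumptions: the foreground mask is a list of rows of 0/1 values, each detected face is a hypothetical (left, top, width, height) box, and "center of mass" of the faces is read as the unweighted mean of the box centers. The detectors themselves are outside the scope of the sketch.

```python
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]


def roi_from_pan_scan(x1: float, y1: float, x2: float, y2: float) -> Point:
    """Midpoint of the two pan-and-scan corner points."""
    return ((x1 + x2) / 2, (y1 + y2) / 2)


def roi_from_foreground(mask: Sequence[Sequence[int]]) -> Optional[Point]:
    """Centroid of all foreground (non-zero) pixels, or None if there are none."""
    sum_x = sum_y = count = 0
    for row_index, row in enumerate(mask):
        for col_index, value in enumerate(row):
            if value:
                sum_x += col_index
                sum_y += row_index
                count += 1
    if count == 0:
        return None
    return (sum_x / count, sum_y / count)


def roi_from_faces(faces: Sequence[Tuple[float, float, float, float]]) -> Optional[Point]:
    """Mean of the centers of the detected face boxes, or None if no face was found."""
    if not faces:
        return None
    centers = [(left + w / 2, top + h / 2) for left, top, w, h in faces]
    return (
        sum(c[0] for c in centers) / len(centers),
        sum(c[1] for c in centers) / len(centers),
    )
```

For example, pan-and-scan corners (40, 0) and (280, 240) give an ROI of (160, 120), and two face boxes centered at (120, 70) and (210, 80) give an ROI of (165, 75).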
We turn now to discuss certain details of some implementations of the system.
The several analysis techniques described above can be combined in a process, illustrated in figure 16, the output of which is a stream of (x,y) ROI locations, one per frame. The input video is searched 200 for the presence of pan and scan vectors. If found, the vectors are used to generate 202 the ROI values. If the input video contains layered coding 204, segmentation is performed 206 using the layer information. If the video is recorded in stereo 208, the stereo information is used 210. Otherwise, if none of the above, face detection and motion analysis 212, 214 can be applied. Text detection 216 can then be used to adjust the ROI value.
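One reading of this process is a priority cascade followed by a text-based adjustment. The sketch below illustrates that reading only; it is not taken from figure 16. The FrameAnalysis container and its field names are hypothetical, each analysis result is assumed to have been computed elsewhere, and the final fallback to the frame center is our assumption.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Point = Tuple[float, float]


@dataclass
class FrameAnalysis:
    """Hypothetical per-frame analysis results; None means 'not available'."""
    pan_scan: Optional[Tuple[float, float, float, float]] = None  # (x1, y1, x2, y2)
    layer_roi: Optional[Point] = None        # from layered-coding segmentation
    stereo_roi: Optional[Point] = None       # from stereo segmentation
    face_motion_roi: Optional[Point] = None  # from face detection / motion analysis
    text_shift: Point = (0.0, 0.0)           # adjustment suggested by text detection


def select_roi(info: FrameAnalysis, frame_center: Point) -> Point:
    """Pick the first available ROI source in priority order, then apply the text shift."""
    if info.pan_scan is not None:
        x1, y1, x2, y2 = info.pan_scan
        roi = ((x1 + x2) / 2, (y1 + y2) / 2)
    elif info.layer_roi is not None:
        roi = info.layer_roi
    elif info.stereo_roi is not None:
        roi = info.stereo_roi
    elif info.face_motion_roi is not None:
        roi = info.face_motion_roi
    else:
        roi = frame_center  # assumed fallback when no analysis produced a result
    dx, dy = info.text_shift
    return (roi[0] + dx, roi[1] + dy)


# A bottom-of-screen ticker shifts the pan-and-scan midpoint 20 pixels up.
with_ticker = select_roi(
    FrameAnalysis(pan_scan=(40, 0, 280, 240), text_shift=(0, -20)),
    frame_center=(160, 120),
)
```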
We now describe certain implementation details.
Pattern classification systems such as object detectors may produce a confidence level for the results that they generate. If the confidence level is low, the system may default to the centroid of the original frame, rather than accepting a "wild guess." (This concept is suggested in patent application 20050099541.)
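The fallback can be expressed as a simple threshold test. In this minimal sketch the confidence is assumed to be a number between 0 and 1 and the threshold of 0.5 is an arbitrary illustrative value; neither is specified in the description.

```python
from typing import Tuple

Point = Tuple[float, float]


def roi_or_center(detected: Point, confidence: float,
                  frame_w: int, frame_h: int, threshold: float = 0.5) -> Point:
    """Use the detector's ROI only when its confidence meets the (assumed) threshold."""
    if confidence < threshold:
        return (frame_w / 2, frame_h / 2)  # default to the centroid of the frame
    return detected
```

With a 320x240 frame, a detection at (140, 190) reported with confidence 0.9 is kept, while the same detection reported with confidence 0.2 is replaced by (160, 120).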
Discontinuous jumping around of the ROI from frame to frame would create an unpleasant user experience. The packaging process implements a hysteresis-type filter that dampens the speed of reaction to a moving ROI. Instead of jumping fifty pixels at once, for example, the ROI is moved incrementally towards the target position. If, in the next frame, the target ROI is the same, the tracker will again move incrementally towards it.
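One simple way to realize this damping is a per-frame step limit, sketched below. This is an illustration under our own assumptions rather than the filter actually used: the limit of 10 pixels per frame is arbitrary, and the function name is hypothetical.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def step_toward(current: Point, target: Point, max_step: float) -> Point:
    """Move the ROI toward the target by at most max_step pixels per frame."""
    dx = target[0] - current[0]
    dy = target[1] - current[1]
    distance = math.hypot(dx, dy)
    if distance <= max_step:
        return target                      # close enough: land on the target
    scale = max_step / distance
    return (current[0] + dx * scale, current[1] + dy * scale)


# The target jumps fifty pixels to the right; the displayed ROI follows gradually.
roi = (100.0, 100.0)
target = (150.0, 100.0)
first = step_toward(roi, target, max_step=10)
for _ in range(6):
    roi = step_toward(roi, target, max_step=10)
```

With these values the ROI moves about ten pixels in the first frame and reaches the target within a handful of frames instead of jumping there at once.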
The device owner is given the option to enable or disable the customization feature. When customization is on, the optimization directives are applied. When off, they are ignored and the device instead performs its default frame rendering, for example, down-sampling and cropping around the center. The option can be chosen on a per-channel, per-program, and per-minute basis.
The system could be used to enable the display device to optimize the frames for any arbitrary geometry, for example, any arbitrary aspect ratio and even any arbitrary shape: oval-shaped, diamond-shaped, or any convex or other region. All that is required is that the client know how to crop an input video frame to its own geometry.
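For the rectangular case, client-side cropping around a received ROI location might look like the sketch below. It is illustrative only: the function name `crop_window` and the representation of the display's aspect ratio as two integers are assumptions, and any down-sampling to the display's pixel dimensions is left out.

```python
from typing import Tuple


def crop_window(roi: Tuple[int, int], frame_size: Tuple[int, int],
                aspect_w: int, aspect_h: int) -> Tuple[int, int, int, int]:
    """Return (left, top, width, height) of a crop window for one display.

    The window is the largest one with aspect ratio aspect_w:aspect_h that
    fits in the frame, centered on the ROI as closely as the borders allow.
    """
    frame_w, frame_h = frame_size
    roi_x, roi_y = roi

    # Largest window of the requested aspect ratio that fits in the frame.
    width = min(frame_w, frame_h * aspect_w // aspect_h)
    height = width * aspect_h // aspect_w

    # Center on the ROI, then clamp so the window stays inside the frame.
    left = min(max(roi_x - width // 2, 0), frame_w - width)
    top = min(max(roi_y - height // 2, 0), frame_h - height)
    return (left, top, width, height)
```

Given the same ROI of (600, 180) in a 640x360 frame, a 4:3 display would crop to `(160, 0, 480, 360)` and a square display to `(280, 0, 360, 360)`, so one location value serves devices with different geometries.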
Other implementations are within the scope of the following claims.
The application of the system is not limited to handheld, portable, or mobile devices but could also extend to any small-screen device, including a device that could be mounted on a wall or dashboard or embedded in another device like a thermostat.
Although the system has been described in the context of digital broadcast to mobile devices, it could be applied to a wide variety of other platforms and in other contexts, for example, internet-based multicasting which does not use terrestrial broadcast technology to deliver the content, but rather uses traditional Internet structures — routers, switches, and last-mile technology such as WiFi, DSL, cable modems. The intermediate routers between the head-end and client would be multicast-enabled.
It is possible that aspects and features of the system could be used for content that is other than video, for communication systems that are other than broadcast, and for techniques other than cropping.
The device to which the video is being sent could be other than a telephone, for example, any hand-held or non hand-held device, for example, a personal digital assistant or a display mounted in a car.

Claims

1. A method comprising
cropping frames of broadcast video received at a device to produce rendered frames on a display of the device, the cropping being based on both (a) a pre-determined value indicating a location within each of at least some of the frames with respect to which the cropping is to be done and (b) a characteristic of the display.
2. The method of claim 1 in which the location comprises a centroid of a region of interest of the frames.
3. The method of claim 1 in which the pre-determined value is received from a source external to the device.
4. The method of claim 1 in which the pre-determined value is received with the broadcast video.
5. The method of claim 1 in which there is a one-to-one association of frames and pre-determined values.
6. The method of claim 1 in which the cropping of each of the frames occurs in real time as the frames are received.
7. The method of claim 1 in which the pre-determined values are generated remotely from the device.
8. The method of claim 1 in which the device comprises a mobile device.
9. The method of claim 1 in which the mobile device comprises a telephone, a personal digital assistant or a vehicle-mounted display.
10. The method of claim 1 in which the characteristic of the display includes the aspect ratio of the display.
11. The method of claim 1 in which the characteristic of the display includes the pixel density of the display.
12. The method of claim 1 in which the rendered image includes text.
13. A method comprising
for a frame of video to be broadcast to at least two different devices that have different capabilities for presenting the video to users, generating a single directive based on the video, the single directive being usable at each of the devices to control the presentation of the frame in two different ways that accommodate the respective capabilities of the two devices.
14. The method of claim 13 also including packaging the directive with the video for delivery together to the two devices.
15. The method of claim 13 in which each frame of the video is associated with a directive based on the frame of the video.
16. The method of claim 13 in which at least one of the devices comprises a mobile device.
17. The method of claim 13 in which presenting the frame comprises displaying the frame to a user.
18. The method of claim 13 in which the directive identifies a location of a region of interest in the frame.
19. The method of claim 13 in which the capabilities of the device include a size of a display.
20. The method of claim 13 in which the capabilities of the device include a density of pixels of a display.
21. The method of claim 13 in which the directive is generated at a central location.
22. The method of claim 21 in which the central location comprises a head-end of a mobile video broadcasting system.
23. The method of claim 22 in which the mobile video broadcasting system comprises DVB-H.
24. The method of claim 13 in which the directive is generated in connection with encoding of the video.
25. A method comprising
for a frame of video to be delivered to at least two different devices that have different capabilities for presenting the frame to users, generating a single directive based on the video, the single directive being usable at each of the devices to control the presentation of the frame in two different ways that accommodate the respective capabilities of the two devices, and
at each of the two different devices, cropping the frame based on both the directive and a characteristic of the display.
26. The method of claim 1 in which the directive comprises an indication of the location of a region of interest in at least one frame of the video.
27. A method comprising
determining a region of interest in a frame of a video, and
determining a location representative of the region of interest, the location being sufficient to enable each of two different video playing devices having different display characteristics to render the frame in a manner suitable for the display characteristics of the device.
28. The method of claim 27 in which the location comprises a centroid of the region of interest.
29. The method of claim 27 in which such a location is determined for each of additional frames of the video.
30. The method of claim 27 also including embedding information about the location in the video.
31. The method of claim 27 in which the location is used in forming a directive stream associated with the video.
32. The method of claim 27 in which the region of interest is based on foreground pixels of the image.
33. The method of claim 27 in which the region of interest is based on a moving object or region in the video.
34. The method of claim 27 in which the region of interest is based on pan and scan vectors of the video.
35. The method of claim 27 in which the region of interest is based on a face appearing in the video.
36. The method of claim 27 in which the region of interest is based on the existence of small-font text on a periphery of the frame.
37. An apparatus comprising
in a presentation device, (a) a display, (b) an input to receive a pre-determined value indicating a location within a received frame of video with respect to which the frame is to be cropped, (c) information about a characteristic of the display, and (d) a processor to crop the received frame, based on the pre-determined value and the information about the characteristic to produce a rendered frame to be displayed on the display of the device.
38. A medium bearing instructions that enable a device to:
crop a broadcast video frame received at a device to produce a rendered frame on a display of the device, the cropping being based on both (a) a pre-determined value indicating a location within the frame with respect to which the cropping is to be done and (b) a characteristic of the display.
39. An apparatus comprising
in a server, (a) an input to receive a video frame to be broadcast to at least two different devices that have different capabilities for presenting the frame to users, (b) a processor to generate a single directive based on the video, and (c) an output to deliver the frame and the single directive to the two devices, the single directive being usable at each of the devices to control the presentation of the frame in two different ways that accommodate the respective capabilities of the two devices.
40. A medium bearing instructions to enable a device to:
generate, for a video frame to be broadcast to at least two different devices that have different capabilities for presenting the frame to users, a single directive based on the video, the single directive being usable at each of the devices to control the presentation of the frame in two different ways that accommodate the respective capabilities of the two devices.
41. An apparatus comprising
a processor to determine a region of interest in a frame of a video, and a location representative of the region of interest, the location being sufficient to enable each of two different video playing devices having different display characteristics to render the frame in a manner suitable for the display characteristics of the device.
42. A medium bearing instructions enabling a device to
determine a region of interest in a frame of a video, and
determine a location representative of the region of interest, the location being sufficient to enable each of two different video playing devices having different display characteristics to render the frame in a manner suitable for the display characteristics of the device.
PCT/US2006/036236 2005-09-15 2006-09-15 Broadcasting video content to devices having different video presentation capabilities WO2007035606A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US11/228,971 2005-09-15
US11/228,971 US8024768B2 (en) 2005-09-15 2005-09-15 Broadcasting video content to devices having different video presentation capabilities

Publications (2)

Publication Number Publication Date
WO2007035606A2 true WO2007035606A2 (en) 2007-03-29
WO2007035606A3 WO2007035606A3 (en) 2007-10-25

Family

ID=37856875

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2006/036236 WO2007035606A2 (en) 2005-09-15 2006-09-15 Broadcasting video content to devices having different video presentation capabilities

Country Status (2)

Country Link
US (1) US8024768B2 (en)
WO (1) WO2007035606A2 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8024768B2 (en) 2005-09-15 2011-09-20 Penthera Partners, Inc. Broadcasting video content to devices having different video presentation capabilities

Families Citing this family (92)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1943837A4 (en) * 2005-11-01 2010-08-04 Nokia Corp Identifying scope esg fragments and enabling hierarchy in the scope
DE102005059616A1 (en) * 2005-12-12 2007-06-14 Robert Bosch Gmbh Method, communication system, multimedia subscriber and gateway for transmitting MPEG-format multimedia data
CN102169415A (en) * 2005-12-30 2011-08-31 苹果公司 Portable electronic device with multi-touch input
US11477617B2 (en) * 2006-03-20 2022-10-18 Ericsson Evdo Inc. Unicasting and multicasting multimedia services
US20090080866A1 (en) * 2006-03-27 2009-03-26 Nds Limited Video Substitution System
US9680686B2 (en) * 2006-05-08 2017-06-13 Sandisk Technologies Llc Media with pluggable codec methods
US20070260615A1 (en) * 2006-05-08 2007-11-08 Eran Shen Media with Pluggable Codec
US20070268406A1 (en) * 2006-05-22 2007-11-22 Broadcom Corporation, A California Corporation Video processing system that generates sub-frame metadata
JP2008011510A (en) * 2006-05-31 2008-01-17 Toshiba Corp Local information broadcast system, and broadcast device and broadcast method thereof
US20080007651A1 (en) * 2006-06-23 2008-01-10 Broadcom Corporation, A California Corporation Sub-frame metadata distribution server
US20100034425A1 (en) * 2006-10-20 2010-02-11 Thomson Licensing Method, apparatus and system for generating regions of interest in video content
WO2009003885A2 (en) * 2007-06-29 2009-01-08 Thomson Licensing Video indexing method, and video indexing device
US8358705B2 (en) 2007-07-05 2013-01-22 Coherent Logix, Incorporated Transmission of multimedia streams to mobile devices with uncoded transport tunneling
US9203445B2 (en) 2007-08-31 2015-12-01 Iheartmedia Management Services, Inc. Mitigating media station interruptions
US8260230B2 (en) * 2007-08-31 2012-09-04 Clear Channel Management Services, Inc. Radio receiver and method for receiving and playing signals from multiple broadcast channels
US9240056B2 (en) * 2008-04-02 2016-01-19 Microsoft Technology Licensing, Llc Video retargeting
US8209733B2 (en) * 2008-05-28 2012-06-26 Broadcom Corporation Edge device that enables efficient delivery of video to handheld device
US20090300687A1 (en) * 2008-05-28 2009-12-03 Broadcom Corporation Edge device establishing and adjusting wireless link parameters in accordance with qos-desired video data rate
US8255962B2 (en) * 2008-05-28 2012-08-28 Broadcom Corporation Edge device reception verification/non-reception verification links to differing devices
US20100050221A1 (en) * 2008-06-20 2010-02-25 Mccutchen David J Image Delivery System with Image Quality Varying with Frame Rate
US20100104004A1 (en) * 2008-10-24 2010-04-29 Smita Wadhwa Video encoding for mobile devices
US20100312828A1 (en) * 2009-06-03 2010-12-09 Mobixell Networks Ltd. Server-controlled download of streaming media files
US9930344B2 (en) * 2009-08-11 2018-03-27 Nbcuniversal Media, Llc Digital content integration and delivery system and method
US20110069238A1 (en) * 2009-09-21 2011-03-24 Sony Corporation Embedded recycle circuit for harnessing light energy
WO2011039848A1 (en) * 2009-09-29 2011-04-07 株式会社 東芝 Region-of-interest extracting device and program
US8527649B2 (en) * 2010-03-09 2013-09-03 Mobixell Networks Ltd. Multi-stream bit rate adaptation
US8650591B2 (en) * 2010-03-09 2014-02-11 Yolanda Prieto Video enabled digital devices for embedding user data in interactive applications
US9894314B2 (en) 2010-06-15 2018-02-13 Dolby Laboratories Licensing Corporation Encoding, distributing and displaying video data containing customized video content versions
US10324605B2 (en) 2011-02-16 2019-06-18 Apple Inc. Media-editing application with novel editing tools
US8832709B2 (en) 2010-07-19 2014-09-09 Flash Networks Ltd. Network optimization
US9251855B2 (en) 2011-01-28 2016-02-02 Apple Inc. Efficient media processing
US11747972B2 (en) 2011-02-16 2023-09-05 Apple Inc. Media-editing application with novel editing tools
US9412414B2 (en) 2011-02-16 2016-08-09 Apple Inc. Spatial conform operation for a media-editing application
US8839110B2 (en) * 2011-02-16 2014-09-16 Apple Inc. Rate conform operation for a media-editing application
US9997196B2 (en) 2011-02-16 2018-06-12 Apple Inc. Retiming media presentations
US8688074B2 (en) 2011-02-28 2014-04-01 Mobixell Networks Ltd. Service classification of web traffic
US9547428B2 (en) 2011-03-01 2017-01-17 Apple Inc. System and method for touchscreen knob control
US20130031187A1 (en) * 2011-07-30 2013-01-31 Bhatia Rajesh Method and system for generating customized content from a live event
CN103108197A (en) * 2011-11-14 2013-05-15 辉达公司 Priority level compression method and priority level compression system for three-dimensional (3D) video wireless display
EP2621180A3 (en) * 2012-01-06 2014-01-22 Kabushiki Kaisha Toshiba Electronic device and audio output method
US9829715B2 (en) 2012-01-23 2017-11-28 Nvidia Corporation Eyewear device for transmitting signal and communication method thereof
US20140373082A1 (en) * 2012-02-03 2014-12-18 Sharp Kabushiki Kaisha Output system, control method of output system, control program, and recording medium
US8856815B2 (en) * 2012-04-27 2014-10-07 Intel Corporation Selective adjustment of picture quality features of a display
US9578224B2 (en) 2012-09-10 2017-02-21 Nvidia Corporation System and method for enhanced monoimaging
US9491494B2 (en) * 2012-09-20 2016-11-08 Google Technology Holdings LLC Distribution and use of video statistics for cloud-based video encoding
US9933921B2 (en) 2013-03-13 2018-04-03 Google Technology Holdings LLC System and method for navigating a field of view within an interactive media-content item
US9438947B2 (en) 2013-05-01 2016-09-06 Google Inc. Content annotation tool
US9589597B2 (en) 2013-07-19 2017-03-07 Google Technology Holdings LLC Small-screen movie-watching using a viewport
US9779480B2 (en) 2013-07-19 2017-10-03 Google Technology Holdings LLC View-driven consumption of frameless media
US9766786B2 (en) 2013-07-19 2017-09-19 Google Technology Holdings LLC Visual storytelling on a mobile media-consumption device
US8718445B1 (en) 2013-09-03 2014-05-06 Penthera Partners, Inc. Commercials on mobile devices
MX349609B (en) * 2013-09-13 2017-08-04 Arris Entpr Llc Content based video content segmentation.
US9271048B2 (en) 2013-12-13 2016-02-23 The Directv Group, Inc. Systems and methods for immersive viewing experience
US9589595B2 (en) 2013-12-20 2017-03-07 Qualcomm Incorporated Selection and tracking of objects for display partitioning and clustering of video frames
US10346465B2 (en) 2013-12-20 2019-07-09 Qualcomm Incorporated Systems, methods, and apparatus for digital composition and/or retrieval
US10935788B2 (en) 2014-01-24 2021-03-02 Nvidia Corporation Hybrid virtual 3D rendering approach to stereovision
KR101801592B1 (en) * 2014-04-27 2017-11-27 엘지전자 주식회사 Broadcast signal transmitting apparatus, broadcast signal receiving apparatus, method for transmitting broadcast signal, and method for receiving broadcast signal
US10140827B2 (en) 2014-07-07 2018-11-27 Google Llc Method and system for processing motion event notifications
US9501915B1 (en) 2014-07-07 2016-11-22 Google Inc. Systems and methods for analyzing a video stream
US9544636B2 (en) 2014-07-07 2017-01-10 Google Inc. Method and system for editing event categories
US9354794B2 (en) * 2014-07-07 2016-05-31 Google Inc. Method and system for performing client-side zooming of a remote video feed
US10127783B2 (en) 2014-07-07 2018-11-13 Google Llc Method and device for processing motion events
US9851868B2 (en) 2014-07-23 2017-12-26 Google Llc Multi-story visual experience
US10289856B2 (en) * 2014-10-17 2019-05-14 Spatial Digital Systems, Inc. Digital enveloping for digital right management and re-broadcasting
US10341731B2 (en) 2014-08-21 2019-07-02 Google Llc View-selection feedback for a visual experience
USD782495S1 (en) 2014-10-07 2017-03-28 Google Inc. Display screen or portion thereof with graphical user interface
US9361011B1 (en) 2015-06-14 2016-06-07 Google Inc. Methods and systems for presenting multiple live video feeds in a user interface
US10506198B2 (en) * 2015-12-04 2019-12-10 Livestream LLC Video stream encoding system with live crop editing and recording
KR102468763B1 (en) * 2016-02-05 2022-11-18 삼성전자 주식회사 Image processing apparatus and control method thereof
US9906981B2 (en) 2016-02-25 2018-02-27 Nvidia Corporation Method and system for dynamic regulation and control of Wi-Fi scans
US10506237B1 (en) 2016-05-27 2019-12-10 Google Llc Methods and devices for dynamic adaptation of encoding bitrate for video streaming
US10380429B2 (en) 2016-07-11 2019-08-13 Google Llc Methods and systems for person detection in a video feed
US10957171B2 (en) 2016-07-11 2021-03-23 Google Llc Methods and systems for providing event alerts
US10192415B2 (en) 2016-07-11 2019-01-29 Google Llc Methods and systems for providing intelligent alerts for events
KR20180028782A (en) * 2016-09-09 2018-03-19 삼성전자주식회사 Electronic apparatus and operating method for the same
US10599950B2 (en) 2017-05-30 2020-03-24 Google Llc Systems and methods for person recognition data management
US11783010B2 (en) 2017-05-30 2023-10-10 Google Llc Systems and methods of person recognition in video streams
US10664688B2 (en) 2017-09-20 2020-05-26 Google Llc Systems and methods of detecting and responding to a visitor to a smart home environment
US11134227B2 (en) 2017-09-20 2021-09-28 Google Llc Systems and methods of presenting appropriate actions for responding to a visitor to a smart home environment
US11140450B2 (en) * 2017-11-28 2021-10-05 Rovi Guides, Inc. Methods and systems for recommending content in context of a conversation
US10455399B2 (en) * 2017-11-30 2019-10-22 Enforcement Technology Group Inc. Portable modular crisis communication system
US10856041B2 (en) * 2019-03-18 2020-12-01 Disney Enterprises, Inc. Content promotion using a conversational agent
US10986308B2 (en) 2019-03-20 2021-04-20 Adobe Inc. Intelligent video reframing
US11164339B2 (en) * 2019-11-12 2021-11-02 Sony Interactive Entertainment Inc. Fast region of interest coding using multi-segment temporal resampling
US11893795B2 (en) 2019-12-09 2024-02-06 Google Llc Interacting with visitors of a connected home environment
JP2022002378A (en) * 2020-06-22 2022-01-06 キヤノン株式会社 Imaging apparatus, imaging system, method for controlling imaging apparatus, and program
US11438673B2 (en) 2020-09-11 2022-09-06 Penthera Partners, Inc. Presenting media items on a playing device
USD996493S1 (en) 2020-12-07 2023-08-22 Applied Materials, Inc. Live streaming camera
US20210120259A1 (en) * 2020-12-23 2021-04-22 Intel Corporation Technologies for memory-efficient video encoding and decoding
US11381853B1 (en) 2021-01-28 2022-07-05 Meta Platforms, Inc. Systems and methods for generating and distributing content for consumption surfaces
EP4072148A1 (en) * 2021-04-07 2022-10-12 Idomoo Ltd A system and method to adapting video size
US20230275947A1 (en) * 2022-02-25 2023-08-31 International Business Machines Corporation Optimized transmission and consumption of digital content

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020069218A1 (en) * 2000-07-24 2002-06-06 Sanghoon Sull System and method for indexing, searching, identifying, and editing portions of electronic multimedia files
US20030112354A1 (en) * 2001-12-13 2003-06-19 Ortiz Luis M. Wireless transmission of in-play camera views to hand held devices

Family Cites Families (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6522352B1 (en) * 1998-06-22 2003-02-18 Motorola, Inc. Self-contained wireless camera device, wireless camera system and method
US6357042B2 (en) * 1998-09-16 2002-03-12 Anand Srinivasan Method and apparatus for multiplexing separately-authored metadata for insertion into a video data stream
US6937773B1 (en) * 1999-10-20 2005-08-30 Canon Kabushiki Kaisha Image encoding method and apparatus
US7020336B2 (en) * 2001-11-13 2006-03-28 Koninklijke Philips Electronics N.V. Identification and evaluation of audience exposure to logos in a broadcast event
JP2004104452A (en) * 2002-09-10 2004-04-02 Fuji Photo Film Co Ltd Monitoring computer
JP3793142B2 (en) * 2002-11-15 2006-07-05 株式会社東芝 Moving image processing method and apparatus
JP4084991B2 (en) * 2002-11-29 2008-04-30 富士通株式会社 Video input device
JP2004200739A (en) * 2002-12-16 2004-07-15 Sanyo Electric Co Ltd Image processor
US7116833B2 (en) * 2002-12-23 2006-10-03 Eastman Kodak Company Method of transmitting selected regions of interest of digital video data at selected resolutions
US7802288B2 (en) * 2003-03-14 2010-09-21 Starz Entertainment, Llc Video aspect ratio manipulation
US20040227778A1 (en) * 2003-05-13 2004-11-18 Wen-Cheng Lin Method for reducing power consumption of multimedia data playback on a computer system
KR20050015506A (en) * 2003-08-06 2005-02-21 김영수 Preparation method of seaweed salts
DE102004015806A1 (en) * 2004-03-29 2005-10-27 Smiths Heimann Biometrics Gmbh Method and device for recording areas of interest of moving objects
GB0416496D0 (en) * 2004-07-23 2004-08-25 Council Of The Central Lab Of Imaging device
US7792190B2 (en) * 2004-09-09 2010-09-07 Media Tek Singapore Pte Ltd. Inserting a high resolution still image into a lower resolution video stream
US7609855B2 (en) * 2004-11-30 2009-10-27 Object Prediction Technologies, Llc Method of analyzing moving objects using a vanishing point algorithm
US20070035665A1 (en) * 2005-08-12 2007-02-15 Broadcom Corporation Method and system for communicating lighting effects with additional layering in a video stream
US8024768B2 (en) 2005-09-15 2011-09-20 Penthera Partners, Inc. Broadcasting video content to devices having different video presentation capabilities
US8723951B2 (en) * 2005-11-23 2014-05-13 Grandeye, Ltd. Interactive wide-angle video server

Also Published As

Publication number Publication date
US20070061862A1 (en) 2007-03-15
WO2007035606A3 (en) 2007-10-25
US8024768B2 (en) 2011-09-20

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 06803763

Country of ref document: EP

Kind code of ref document: A2