EP4315868A1 - A method, an apparatus and a computer program product for processing media data

A method, an apparatus and a computer program product for processing media data

Info

Publication number
EP4315868A1
Authority
EP
European Patent Office
Prior art keywords
metadata
parameters
processing
media
media stream
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
EP22779207.4A
Other languages
German (de)
French (fr)
Inventor
Yu You
Sujeet Shyamsundar Mate
Saba AHSAN
Emre Baris Aksu
Miska Matias Hannuksela
Igor Danilo Diego Curcio
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nokia Technologies Oy
Original Assignee
Nokia Technologies Oy
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nokia Technologies Oy filed Critical Nokia Technologies Oy
Publication of EP4315868A1

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/60Network structure or processes for video distribution between server and client or between remote clients; Control signalling between clients, server and network components; Transmission of management data between server and client, e.g. sending from server to client commands for recording incoming content stream; Communication details between server and client 
    • H04N21/65Transmission of management data between client and server
    • H04N21/658Transmission by the client directed to the server
    • H04N21/6587Control parameters, e.g. trick play commands, viewpoint selection
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/81Monomedia components thereof
    • H04N21/816Monomedia components thereof involving special video data, e.g 3D video
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/10Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N13/106Processing image signals
    • H04N13/172Processing image signals image signals comprising non-image signal components, e.g. headers or format information
    • H04N13/178Metadata, e.g. disparity information
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/235Processing of additional data, e.g. scrambling of additional data or processing content descriptors
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/60Network structure or processes for video distribution between server and client or between remote clients; Control signalling between clients, server and network components; Transmission of management data between server and client, e.g. sending from server to client commands for recording incoming content stream; Communication details between server and client 
    • H04N21/63Control signaling related to video distribution between client, server and network components; Network processes for video distribution between server and clients or between remote clients, e.g. transmitting basic layer and enhancement layers over different transmission paths, setting up a peer-to-peer communication via Internet between remote STB's; Communication protocols; Addressing
    • H04N21/633Control signals issued by server directed to the network components or client

Definitions

  • the present solution generally relates to processing media data.
  • the solution relates to processing of media data in a network-based environment.
  • new image and video capture devices are available. These devices are able to capture visual and audio content all around them, i.e. they can capture the whole angular field of view, sometimes referred to as 360 degrees field of view. More precisely, they can capture a spherical field of view (i.e., 360 degrees in all spatial directions).
  • new types of output technologies have been invented and produced, such as head-mounted displays. These devices allow a person to see visual content all around him/her, giving a feeling of being “immersed” into the scene captured by the 360 degrees camera.
  • the new capture and display paradigm, where the field of view is spherical is commonly referred to as virtual reality (VR) and is believed to be the common way people will experience media content in the future.
  • VR virtual reality
  • a method for media processing comprising receiving a media stream; retrieving a number of parameters to be used by a plurality of processing tasks of the media stream, wherein the plurality of processing tasks are carried out by a plurality of media processing entities of a network-based environment; incorporating the number of parameters to metadata; and delivering the media stream with the metadata through a processing workflow comprising the plurality of media processing entities.
  • an apparatus comprising at least means for receiving a media stream; means for retrieving a number of parameters to be used by a plurality of processing tasks of the media stream, wherein the plurality of processing tasks are carried out by a plurality of media processing entities of a network-based environment; means for incorporating the number of parameters to metadata; and means for delivering the media stream with the metadata through a processing workflow comprising the plurality of media processing entities.
  • an apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following: receive a media stream; retrieve a number of parameters to be used by a plurality of processing tasks of the media stream, wherein the plurality of processing tasks are carried out by a plurality of media processing entities of a network-based environment; incorporate the number of parameters to metadata; and deliver the media stream with the metadata through a processing workflow comprising the plurality of media processing entities.
  • a computer program product comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to receive a media stream; retrieve a number of parameters to be used by a plurality of processing tasks of the media stream, wherein the plurality of processing tasks are carried out by a plurality of media processing entities of a network-based environment; incorporate the number of parameters to metadata; and deliver the media stream with the metadata through a processing workflow comprising the plurality of media processing entities.
  • parameters of the metadata are updated from a media processing entity.
  • the metadata indicates a task allowed to use the parameters, and/or lifespan of the parameters.
  • the parameters comprise at least dynamically generated timed information needed in post-processing or rendering.
  • the parameters comprise information on the selection of targeted processing tasks.
  • the parameters comprise information on the dissemination of targeted processing tasks.
  • the metadata is delivered as an RTP format or an SEI message.
  • a parameter indicating that the media stream and the metadata are delivered in a same data stream is included into a bitstream.
  • a parameter indicating whether the media stream and the metadata are to be synchronized based on timestamps, or whether non-synchronized usage is allowed, is included into a bitstream.
  • the computer program product is embodied on a non-transitory computer readable medium.
  • FIG. 1 shows an NBMP Architecture according to an example
  • Fig. 2 shows a media processing workflow of a NBMP
  • Fig. 3 shows an example of event-triggered out-of-band parameter changes
  • Fig. 4 shows an apparatus according to an embodiment
  • Fig. 5 shows an implementation diagram according to an embodiment
  • Fig. 6 shows an example of out-of-sync or lagging in parameter passing
  • Fig. 7 shows an example of synchronized media data and metadata parameters
  • Fig. 8 shows an example of a viewport-dependent delivery
  • Fig. 9 shows an example of shaders and processors
  • Fig. 10 shows a flowchart for preparing shader processing in the viewport-dependent rendering
  • Fig. 11 shows an example of a viewport-dependent playing
  • Fig. 12 shows an example of a metadata being converted as RTP custom header extension
  • Fig. 13 is a flowchart illustrating a method according to an embodiment
  • Fig. 14 illustrates an example of media data and metadata streams transported between two NBMP media processing functions
  • Fig. 15 illustrates an example of NBMP compound mode enabled.
  • In the following, several embodiments will be described in the context of viewport-dependent delivery for omnidirectional conversational video.
  • the present embodiments are targeted to in-band metadata driven network-based media processing.
  • Immersive multimedia - such as omnidirectional content consumption - is more complex for the end user compared to the consumption of 2D content. This is due to the higher degree of freedom available to the end user (e.g., three degrees of freedom for yaw, pitch and roll). This freedom also results in more uncertainty.
  • Omnidirectional may refer to media content that has greater spatial extent than a field-of-view of a device rendering the content.
  • Omnidirectional content may for example cover substantially 360 degrees in the horizontal dimension and substantially 180 degrees in the vertical dimension, but omnidirectional may also refer to content covering less than 360 degree view in the horizontal direction and/or 180 degree view in the vertical direction.
  • the MPEG Omnidirectional Media Format (OMAF) v1 standardized the omnidirectional streaming of single three-degrees of freedom (3DoF) content (where the viewer is located at the centre of a unit sphere and has three degrees of freedom (Yaw-Pitch-Roll).
  • 3DoF three-degrees of freedom
  • ITT4RT stands for Support of Immersive Teleconferencing and Telepresence for Remote Terminals.
  • the objective of ITT4RT is to specify virtual reality (VR) support in MTSI (Multimedia Telephony Service for IMS) in 3GPP TS 26.114 and IMS-based (Internet Multimedia Subsystem) Telepresence in 3GPP TS 26.223 to enable support of an immersive experience for remote terminals joining teleconferencing and telepresence sessions.
  • VR virtual reality
  • MTSI Multimedia Telephony Service for IMS
  • IMS-based Internet Multimedia Subsystem
  • the work is expected to enable scenarios with two-way audio and one-way immersive video, e.g., a remote single user wearing an HMD (Head-Mounted Display) and participating in a conference will send audio and optionally two-dimensional (2D) video (e.g., of a presentation, screen sharing and/or a capture of the user itself), but receives stereo or immersive voice/audio and immersive video captured by an omnidirectional camera in a conference room connected to a fixed network.
  • HMD Head-Mounted Display
  • the RTP (Realtime Transport Protocol) payload format and SDP (Session Description Protocol) parameters to be developed under the 3GPP IVAS (Immersive Voice and Audio Services) WI will be considered to support use of the IVAS codec for immersive voice and audio.
  • the RTP payload format and SDP parameters for HEVC (High Efficiency Video Coding) will be considered to support immersive video.
  • SEI Supplemental Enhancement Information
  • NBMP Network-Based Media Processing
  • MPEG MPEG-I
  • NBMP defines the architecture, APIs, media and metadata formats for discovering media processing functions, describing media workflows, deploying media processing workflows, configuring the runtime tasks, monitoring and taking corrective measures in case of faulty behavior. Offering a single standardized cloud-abstraction layer to service providers will give service providers high flexibility and reduce their time to market through the portability that comes with NBMP.
  • the NBMP facilitates digital media production through workflows (also referred to as “pipelines”) that are designed to facilitate various types of media transformation tasks, e.g. transcoding, filtering, content understanding and enhancement etc.
  • the processing may take initial configuration (e.g. codec parameters) to guide the specific tasks.
  • the present embodiments propose a generic “in-band” metadata approach to deliver parameters for the tasks along with the media streams. This differentiates from the conventional “out-of-band” metadata approach.
  • the NBMP architecture comprises a NBMP Source 100 providing a workflow description document to a NBMP workflow manager 110.
  • the NBMP workflow manager 110 is configured to create a workflow based on the received workflow description document (WDD).
  • the workflow manager 110 selects and deploys the NBMP functions into selected Media Processing Entities (MPE) 120 and then performs the configuration of the tasks 125, 127 at the MPEs 120.
  • MPE Media Processing Entities
  • the Task 125 receives media flow from a media source 130, which media flow is further delivered to media sink 140.
  • MPEs are provisioned from the underlying cloud infrastructure provided by cloud providers (Cloud stack or environment 170), which is done through the Cloud management communication (Cloud mgmt) by the NBMP Workflow Manager 110 in Figure 1.
  • NBMP uses so-called Descriptors as the basic elements for all its resource documents such as the workflow documents, task documents, and function documents.
  • Descriptors are a group of NBMP parameters which describe a set of related characteristics of Workflow, Function or Task. Some key descriptors are General, Input, Output, Processing, Requirements, Configuration etc.
  • Figure 2 illustrates an NBMP workflow.
  • the workflow consists of one or more tasks 602, 604, 606, 608, 610, 612, 614, 616.
  • the arrows represent the data flow 617 from upstream tasks to downstream tasks.
  • the parameters may be provided by the workflow manager at the time of the workflow creation or initialization phase.
  • the parameters are defined and initialized explicitly.
  • the processing behavior changes either immediately or in a delayed mode, according to the needs.
  • This is referred to in the present disclosure as “out-of-band” parameter signalling, as parameters are provided through other data channels than the media data themselves.
  • the out-of-band mechanism has critical timing issues in ensuring that parameter changes are applied at the right moment to the correct media data or data chunks.
  • Figure 3 illustrates an example on this drawback.
  • the parameter change is triggered by an event (i.e. “ROI event”) occurring at one task 301 (for example, object detector for region of interest encoding in Task 1) of the pipeline.
  • the speed of the ROI data flow (arrows A) through the application or workflow manager 305 to the downstream tasks 303 (e.g. Task 3) differs from, and is unpredictable compared to, the speed of the media data flow (arrows B), due to the processing and external event transmission latency.
  • the media data may be divided into a series of non-overlapping data chunks or frames. Each chunk can contain a unique identifier and timing data, for example, a timestamp, which may be monotonically increasing.
  • the out-of-band approach requires extra time to check the time-sensitive parameters. It is challenging to synchronize media and external signalling channels without any buffering techniques. However, once buffers are used, the overall latency increases noticeably.
  • the present embodiments provide a processing pipeline, where an upstream task connects to one or more downstream tasks through relations.
  • the relation is uni-directional and represents the data flow.
  • a task is the upstream of all tasks connected directly or indirectly through the uni-directional relations; conversely, a task is downstream of all tasks whose outputs it consumes, directly or indirectly.
  • downstream tasks depend on the upstream task or tasks.
  • Example embodiments of the present solution provide systems and method for processing dynamic and real-time parameters stored as metadata in media data, thus providing an in-band parameter passing via metadata approach.
  • a number of parameters can be generated dynamically by upstream tasks, whereupon the parameters are signalled as the metadata which is delivered together with the media data to downstream tasks.
  • each task can read and parse the metadata attached to the media data, and use the parameters respectively.
  • Tasks which consume parameters can generate new parameters to be appended to the metadata and delivered further along the processing pipeline.
  • the data structure of the metadata indicates conditions about 1) by whom (e.g., which tasks) the parameters of parameter sets can be used; 2) the life-span of the parameters, e.g., used once versus Time to Live (TTL) value.
  • TTL Time to Live
  • the apparatus is a user equipment (also referred to as “client”) for the purposes of the present embodiments, i.e. for content authoring.
  • the apparatus 90 comprises a main processing unit 91, a memory 92, a user interface 94, a communication interface 93.
  • the apparatus according to an embodiment, shown in Figure 4 also comprises a camera module 95.
  • the apparatus may be configured to receive image and/or video data from an external camera device over a communication network.
  • the memory 92 stores data including computer program code in the apparatus 90.
  • the computer program code is configured to implement the method according to various embodiments by means of various computer modules.
  • the camera module 95 or the communication interface 93 receives data, in the form of images or a video stream, to be processed by the processor 91.
  • the communication interface 93 forwards processed data, i.e. the image file, for example to a display of another device, such as a virtual reality headset.
  • when the apparatus 90 is a video source comprising the camera module 95, user inputs may be received from the user interface.
  • An example of a device for content consumption i.e. an apparatus according to another embodiment, is a virtual reality headset, such as a head-mounted display (HMD) for stereo viewing.
  • the head-mounted display may comprise two screen sections or two screens for displaying the left and right eye images.
  • the displays are close to the eyes, and therefore lenses are used to make the images easily viewable and for spreading the images to cover as much as possible of the eyes’ field of view.
  • the device is attached to the head of the user so that it stays in place even when the user turns his head.
  • the device may have an orientation detecting module for determining the head movements and direction of the head.
  • the head- mounted display is able to show omnidirectional content (3DOF content) of the recorded/streamed image file to a user.
  • Examples for an application of omnidirectional video comprises a video call or a teleconference with video.
  • Such a call or conference with omnidirectional video has a benefit of providing greater immersion for a user, and a possibility to explore the captured space in different directions.
  • the apparatus of Figure 4 may comprise one or more computer modules to implement the method according to the present embodiments.
  • the metadata format can be proprietary or standard format depending on the payload types of the media transporting.
  • the timestamp for the metadata when added to the associated media data is derived from the media data (e.g., the timestamp for the rotation value for an ERP sub-frame is timestamp of the ERP frame).
  • the example embodiments provide a specific metadata structure called “parameter metadata” for media (e.g. video and audio) and other data bitstreams.
  • the parameter metadata can be implemented as any type of extension to any metadata formats depending on the data transmission/transport protocols.
  • the parameters may comprise timed information which is generated dynamically and which is needed by post-processing or rendering tasks for reacting according to the dynamic information represented by the metadata.
  • the high-level syntax of the parameter metadata is defined in the following table 1 :
  • the “type” information can be specified in a hierarchical manner with domain being the namespace, and the operations within that domain.
  • the “target” object can be defined as in the following table (Table 2) with rules and criteria for the selection of the processing tasks.
  • the “ttl” value indicates the lifespan of the parameter set. It can be defined as a counter or a date format. When a counter is used, it refers to the number of tasks, or the number of times the parameter set is consumed by tasks. Other TTL types may be applied to indicate the lifespan of the parameter. When a date number is used as the ttl value, for example a standard timestamp (ISO 8601), the value can be an addition to the current media timestamp being associated. In some cases, the “ttl” value can be used until a new parameter of the same type is generated. This can be used in the case of data which has sparse samples, holding the old value until a new sample arrives. In some cases, the value should be negative to indicate the relevance for any older data, e.g. video frames with certain orders (decoding order or presentation order) cached in the buffer.
  • the “parameter” object can be defined as in the following table (Table 3):
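  • Tables 1-3 themselves are not reproduced in this text. Purely as an illustration, the fields named above (“type”, “target”, “ttl” and “parameter”) could be carried in a structure such as the following C sketch; the member names and types are assumptions made for this example and are not the normative table definitions.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch only; the normative layout is given by Tables 1-3 of the
 * description. Names and types below are assumptions based on the text. */
typedef struct {
    char *domain;          /* namespace part of the hierarchical "type" */
    char *operation;       /* operation within that domain */
} ParamType;

typedef struct {
    char **task_ids;       /* selection rules/criteria for the targeted processing tasks */
    size_t num_task_ids;
} ParamTarget;

typedef struct {
    ParamType   type;
    ParamTarget target;
    int64_t     ttl;       /* lifespan: a consumption counter or a timestamp offset;
                            * may be negative to cover older (buffered) frames */
    uint8_t    *parameter; /* opaque, type-specific parameter payload */
    size_t      parameter_len;
} ParameterMetadata;
```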
  • the present embodiments can be implemented with any processing type, streaming or batch processing.
  • the design can be implemented in GStreamer framework.
  • GStreamer is a pipeline-based multimedia framework that links together a wide variety of media processing systems to complete complex workflows.
  • GStreamer can be used to build a system that reads files in one format, processes them, and exports them in another. The formats and processes can be changed in a plug and play fashion.
  • a shared library is designed to write and retrieve parameters to/from the GStreamer media buffer, together with other existing metadata like timing metadata for various times such as DTS (decoding timestamp), PTS (presentation timestamp), and duration of the data, and video metadata that contains information such as width, height, video data planes for planar formats and strides for padding.
  • the library may, at first, register the metadata type by a unique name, for example, “generic- parameter-metadata” to the system’s metadata registry for global lookup later by other tasks.
  • the library may also implement serialization and de-serialization methods to convert the parameter data between the internal parameter data structure and binary bit-stream, which can be concatenated into the media data buffer.
  • the timestamps such as DTS, PTS, and duration can be omitted from this parameter metadata to save space.
  • Figure 5 illustrates a generic parameter metadata library 500 with properties and key functions.
  • the “tags” are used for the framework to look up registered metadata by single or multiple tag names.
  • the “name” is the unique identifier of the metadata implementation, given the media streams may contain one or multiple different metadata blocks.
  • the “init function” and “free function” are used to initialize the “parameters” object by allocating and freeing memory.
  • the “serialization function” and “de-serialization function” are used to do the necessary low-level binary bitstream conversion.
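  • A minimal sketch of such a library is given below, assuming the GStreamer 1.x GstMeta API; the structure name GstGenericParameterMeta and the helper attach_parameters() are hypothetical, while gst_meta_api_type_register(), gst_meta_register() and gst_buffer_add_meta() are existing GStreamer calls.

```c
#include <gst/gst.h>
#include <string.h>

/* Hypothetical metadata block carried on each media buffer. */
typedef struct {
    GstMeta meta;
    guint8 *serialized_params;   /* serialized parameter metadata bit-stream */
    gsize   size;
} GstGenericParameterMeta;

static gboolean
generic_parameter_meta_init (GstMeta *meta, gpointer params, GstBuffer *buffer)
{
    GstGenericParameterMeta *pmeta = (GstGenericParameterMeta *) meta;
    pmeta->serialized_params = NULL;   /* "init function": clear the parameters object */
    pmeta->size = 0;
    return TRUE;
}

static void
generic_parameter_meta_free (GstMeta *meta, GstBuffer *buffer)
{
    GstGenericParameterMeta *pmeta = (GstGenericParameterMeta *) meta;
    g_free (pmeta->serialized_params); /* "free function": release the parameters object */
}

/* Register the metadata type under a unique name for global lookup by other tasks. */
static const GstMetaInfo *
generic_parameter_meta_get_info (void)
{
    static const GstMetaInfo *info = NULL;
    if (info == NULL) {
        static const gchar *tags[] = { "generic-parameter", NULL };
        GType api = gst_meta_api_type_register ("GstGenericParameterMetaAPI", tags);
        info = gst_meta_register (api, "generic-parameter-metadata",
                                  sizeof (GstGenericParameterMeta),
                                  generic_parameter_meta_init,
                                  generic_parameter_meta_free,
                                  NULL /* no transform function in this sketch */);
    }
    return info;
}

/* Attach an already serialized parameter block to a media buffer; PTS/DTS and
 * duration stay on the buffer itself, so they are not duplicated here. */
void
attach_parameters (GstBuffer *buf, const guint8 *bits, gsize size)
{
    GstGenericParameterMeta *pmeta = (GstGenericParameterMeta *)
        gst_buffer_add_meta (buf, generic_parameter_meta_get_info (), NULL);
    pmeta->serialized_params = g_malloc (size);
    memcpy (pmeta->serialized_params, bits, size);
    pmeta->size = size;
}
```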
  • the implementation can be done in the ISO/IEC 23090-8 MPEG-I Network-based Media Processing (NBMP).
  • the parameter set can be defined into the “metadata parameters” under Input and Output Descriptors.
  • the metadata parameters can be delivered out-of-band in NBMP along with the input and output metadata.
  • the metadata stream has properties such as “stream-id”, “protocol” and “caching-server-url”, which may make the metadata stream different from the correspondent I/O media streams. In order to support the in-band approach, those properties can have the same values as the correspondent I/O media stream.
  • NBMP can define new parameters in NBMP Descriptors to indicate the in-band carriage of the metadata stream with the media streams over the same protocol used by the media streams.
  • Real-time object detection is a computer-vision technique to detect, locate and track one or more objects from an image or a video.
  • the special attribute about object detection is that it identifies the class of object (person, table, chair, etc.) and their location-specific coordinates in the given image. The location is pointed out by drawing a bounding box as an overlay around the object.
  • Term “overlay” refers to a visual media, e.g. videos and/or images, that is rendered over 360-degree video content.
  • the coded overlaying video can be a separate stream or part of the bitstream of the currently rendered 360-degree video/image.
  • Figure 6 illustrates the object detection example with two detected objects: a person and a hat with their confidence values and bounding boxes.
  • Figure 6 illustrates an example, where an out-of-band approach is used.
  • the detected information is signalled at time (T1) to the application 630 and relayed to the overlay task 640 at time (T2), where T2 differs from T1 due to some delay, which causes the overlay to be drawn at an incorrect position onto the wrong frame 645 at time T2.
  • This kind of an overlay operation is a simple example to illustrate the potential delay caused by the parameter delivery over a separate transportation channel.
  • Figure 7 illustrates the in-band mechanism according to the present embodiments, applied to the example of Figure 6. The delay that appeared in Figure 6 does not exist, as the parameters (i.e. the object bounding boxes) are delivered together with the correct video frames 740. The overlay task 750 can, therefore, draw those boxes correctly on the right frame.
  • the sample metadata definition for the object detection is illustrated as follows:
  • Object-detection object structure (in C code syntax) can be as follows (it is to be noticed that other data representations like JSON/XML can also be used): struct _GstDetectedObjectInfo {
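  • The member list of the struct did not survive extraction here; the following completion is a hypothetical sketch whose fields are only inferred from the description above (object class, confidence value and bounding box).

```c
#include <glib.h>

/* Hypothetical completion of struct _GstDetectedObjectInfo shown above. */
typedef struct _GstDetectedObjectInfo {
    guint   id;             /* tracking identifier of the detected object */
    gchar  *label;          /* object class, e.g. "person", "hat" */
    gfloat  confidence;     /* detection confidence, 0.0 .. 1.0 */
    guint   x, y;           /* top-left corner of the bounding box, in pixels */
    guint   width, height;  /* bounding-box size, in pixels */
} GstDetectedObjectInfo;
```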
  • the binary format of the metadata bit-stream is implementation-specific.
  • a viewport may be defined as a region of omnidirectional image or video suitable for display and viewing by the user.
  • a current viewport (which may be sometimes referred simply as a viewport) may be defined as the part of the spherical video that is currently displayed and hence is viewable by the user(s).
  • a video rendered by an application on a head-mounted display renders a portion of the 360-degrees video, which is referred to as a viewport.
  • a viewport is a window on the 360-degree world represented in the omnidirectional video displayed via a rendering display.
  • a viewport may be characterized by a horizontal field-of-view (VHFoV) and a vertical field-of-view (VVFoV).
  • VHFoV horizontal field-of-view
  • VVFoV vertical field-of-view
  • the horizontal field-of-view of the viewport will be abbreviated with HFoV and, respectively, the vertical field-of-view of the viewport will be abbreviated with VFoV.
  • Figure 8 shows an example of a viewport-dependent delivery.
  • viewport parameter as well as other parameters (e.g. packed picture mapping etc.) need to be defined.
  • the operation involves tasks 810, e.g. determining a viewport region of the 360-degree video frame (e.g. picture); and rotating the projected picture to reorient the viewport region of the projected picture to a center of the projected picture for encoding.
  • rendering and/or consuming clients e.g. decode the bitstream, rotate the region back to its original location in the 360-degree projected picture, and finally project the region to the final rendering surface, e.g. a 2D display plane, e.g. by converting from equirectangular projection to rectilinear projection.
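  • Purely as an illustrative helper (not one of the described tasks), the mapping between a viewing direction and equirectangular (ERP) picture coordinates used in such rotation/projection steps can be sketched as follows; the sign conventions assume OMAF-style yaw/pitch angles.

```c
#include <math.h>

/* Maps a viewing direction (yaw, pitch in radians; yaw increases to the left,
 * pitch increases upwards) to pixel coordinates in an ERP picture of size
 * width x height. Illustrative only. */
static void
erp_pixel_from_direction (double yaw, double pitch,
                          int width, int height,
                          double *u, double *v)
{
    *u = (0.5 - yaw / (2.0 * M_PI)) * (double) width;
    *v = (0.5 - pitch / M_PI) * (double) height;
}
```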
  • Figure 8 illustrates the tasks 810, 820 according to an example.
  • the rotation and projection tasks 820 can be implemented in various ways. Such pixel-level processing is done with the help of the parallel processing capability of a GPU (Graphic Processor Unit).
  • GPU Graphic Processor Unit
  • Figure 9 shows an example of shaders in the GPU and other processors in the CPU. Shading, performed by shaders, is a common approach for this kind of work and is supported by almost all GPU vendors.
  • GLSL OpenGL shading language
  • Figure 9 illustrates the shaders and parameter passing between CPU and GPU. The media content needs to be copied as texture from CPU memory to GPU memory.
  • when shader code is executed by a GPU unit, some parameters/variables need to be defined/assigned from the CPU to the GPU threads. Those parameters are defined as uniforms 930.
  • a uniform 930 is a global shader variable declared with the “uniform” keyword.
  • “uniform” defines the viewport variables (parameters) for the vertex shaders 940 for the rotation and projection, respectively.
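  • A minimal CPU-side sketch of assigning such viewport values to shader uniforms is given below; it assumes an OpenGL ES / GLSL rendering path, and the uniform names (“u_rotation”, “u_viewport_fov”) as well as the ViewportParams structure are hypothetical.

```c
#include <GLES3/gl3.h>

/* Viewport values parsed from the in-band parameter metadata (hypothetical). */
typedef struct {
    float yaw, pitch, roll;   /* viewport rotation */
    float hfov, vfov;         /* viewport field of view */
} ViewportParams;

static void
upload_viewport_uniforms (GLuint program, const ViewportParams *p)
{
    /* Resolve the uniform locations declared in the vertex shader ... */
    GLint rot_loc = glGetUniformLocation (program, "u_rotation");
    GLint fov_loc = glGetUniformLocation (program, "u_viewport_fov");

    glUseProgram (program);
    /* ... and assign the CPU-side parameter values to the GPU-side uniforms. */
    glUniform3f (rot_loc, p->yaw, p->pitch, p->roll);
    glUniform2f (fov_loc, p->hfov, p->vfov);
}
```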
  • the sample metadata definition for viewport information is illustrated as follows:
  • the parameter object can be represented in one example using JSON Schema. Other formats can be supported, depending on the platform implementation.
  • Figure 10 shows example steps on how the tasks understand the metadata as GLSL uniforms and prepare for the shader processing in the viewport-dependent rendering.
  • the flowchart of Figure 10 comprises
  • - reading 1020 metadata from the incoming buffer (picture) and checking the parameter type (“uniforms”), and other properties such as ttl and target values;
  • the parameter as metadata needs to be supported when the media data is delivered over the network (e.g. IP network).
  • the transmission data may contain both a header and the actual data (payload) to be transmitted.
  • the header part can be extended to convey the parameter metadata following some standard general mechanism like RTP header extensions. It has to be understood that the RTP protocol is an example protocol, and the scope of this invention is not limited to such protocol. For instance, any transport protocol suitable for media or control data delivery could convey the information described in this invention.
  • Figure 11 illustrates a viewport-dependent playing example, where viewport and other rendering parameters are serialized as metadata over IP network.
  • the rendering variables are serialized into metadata and attached to the stream 1120.
  • shader rendering is controlled by the uniform variables.
  • the parameters can be re-written by the De-Packetizer to stream buffer as they are synchronized 1150.
  • the metadata needed for network-based media processing can use IETF’s RTP (Real-time Transport Protocol) format.
  • RTP is a network protocol for delivering audio and video over IP networks.
  • RTP is used in video communication and entertainment systems including the new WWW’s standard WebRTC.
  • the metadata can be converted as RTP custom header extensions by the last component before sending over the IP network.
  • An example is shown with local ID (1) 1210 and length (n) (in full 32-bit words) 1220.
  • the data is packed as opaque byte stream 1230.
  • the header extension local identifier (ID, 1210) can be signalled externally by other means, e.g., via SDP offer and answer, according to RFC 8285. This has been illustrated in Figure 12.
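  • A sketch of such a conversion is shown below, assuming the RFC 8285 one-byte-header form and a local ID of 1 negotiated out of band; the function name and the fragmentation policy are assumptions of this example.

```c
#include <stdint.h>
#include <string.h>

/* Packs the serialized parameter metadata as an RTP header extension block
 * (RFC 8285 one-byte-header form). Returns the total size in bytes, or 0 if
 * the metadata does not fit (one element carries at most 16 bytes here; a
 * real implementation would fragment or use the two-byte form). */
static size_t
pack_rtp_header_extension (const uint8_t *metadata, size_t len,
                           uint8_t *out, size_t out_size)
{
    if (len == 0 || len > 16)
        return 0;

    size_t payload = 1 + len;                    /* 1 byte of ID/length + data */
    size_t padded  = (payload + 3) & ~(size_t) 3;
    size_t total   = 4 + padded;                 /* 4-byte extension block header */
    if (total > out_size)
        return 0;

    out[0] = 0xBE;                               /* "defined by profile" = 0xBEDE */
    out[1] = 0xDE;
    out[2] = (uint8_t) ((padded / 4) >> 8);      /* length in full 32-bit words */
    out[3] = (uint8_t) (padded / 4);
    out[4] = (uint8_t) ((1u << 4) | (len - 1));  /* local ID = 1, length field = len - 1 */
    memcpy (out + 5, metadata, len);
    memset (out + 5 + len, 0, padded - payload); /* zero padding to a word boundary */
    return total;
}
```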
  • Figure 14 illustrates an example embodiment of the media data and metadata streams being transported between two NBMP media processing functions 1401 and 1430.
  • Their correspondent connections 1410 and 1420 are signaled by NBMP’s “connection-map” parameter specified in the Processing Description according to ISO NBMP standard.
  • the media connection 1410 connects the output port name 1440 of one function 1401 and the input port 1444 of another function 1430.
  • the metadata connection 1420 links the metadata output port 1442 and metadata input port 1446.
  • Figure 15 illustrates an example embodiment of the combined transferring mode (compound or synchronized transferring mode) where the media stream and metadata stream are combined into one stream data 1501 (referred to as media + metadata bitstream or simply media + metadata) using one transport protocol 1511 .
  • the single stream data format is generated by Function 1 1401 such that it can be understood by Function 2 1430.
  • a new parameter “compound mode” is specified in the “connection-map” object that specifies the connection between function 1401 and function 1430.
  • a new parameter “synchronized-mode” can also be included into the “connection-map” object. “synchronized-mode” can be used to indicate that the two streams (media and metadata) are to be synchronized, based on the timestamps, even when they have different data rates.
  • connection-map object in NBMP New parameters added to connection-map object in NBMP are defined in the following table 4:
  • the synchronized mode addresses the scenario where the media data and metadata need not be strictly synchronized. This allows for lowering the latency by not waiting for the corresponding metadata to arrive at the receiver. Consequently, the receiving function can process the media data or the metadata without waiting for the other by applying any suitable methods. For example, in case of low-latency VDD (viewport-dependent delivery), the receiver function can perform viewport warping without waiting for the rotation value, with the help of AI-based prediction.
  • VDD viewport-dependent delivery
  • the synchronized mode can be present when the compound mode is false.
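  • Table 4 is not reproduced in this text. Purely as an illustration, a receiving function could mirror the two connection-map additions described above with a structure such as the following; the names only echo the parameters in the text, and actual NBMP descriptors are JSON documents rather than C structures.

```c
#include <stdbool.h>

/* Illustrative mirror of the two new connection-map parameters. */
typedef struct {
    bool compound_mode;      /* true: media and metadata are combined into one
                              * stream over one transport protocol (Figure 15) */
    bool synchronized_mode;  /* evaluated when compound_mode is false;
                              * true: the two streams are synchronized on
                              * timestamps even with different data rates */
} ConnectionMapExtensions;
```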
  • the compound mode can be supported by the transport protocols 1511 with a mutually agreeable data format (specified by the workflow description), or by the function implementation natively. In the latter case, the sending function needs to indicate the compound support capability 1520 for generating, and the receiving function indicates the capability for parsing, the data comprising metadata and media data 1501.
  • Table 5 defines the new parameter in the context of ISO NBMP standard.
  • the payload data transferred over the transport protocol 1511 can support multiple or hybrid data types. That is, the metadata and media data can be sent together synchronously without being physically combined or muxed into one single bitstream; for example, the metadata can be placed in the header part of the packet, and the media data is the payload, as in the above-mentioned RTP protocol.
  • the metadata needed for network-based media processing (e.g., viewport-dependent streaming) can also be carried within the bitstreams used for transporting the media data (e.g. video streams), for example as SEI messages as described below.
  • SEI Supplemental Enhancement Information
  • the SEI mechanism and specific SEI messages have been specified as a part of video coding standards, such as H.264/AVC and H.265/HEVC. Lately, SEI messages have been specified in the Versatile Supplemental Enhancement Information (VSEI) standard (ITU-T H.274).
  • VSEI Versatile Supplemental Enhancement Information
  • SEI has been developed to contain various types of data that indicate the timing of the video pictures or describe various properties of the coded video or how it can be used or enhanced.
  • SEI messages are also defined that can contain arbitrary user-defined data. SEI messages do not affect the core decoding process, but can indicate how the video is recommended to be post-processed or displayed.
  • A custom SEI payload type can be used for signalling the opaque binary metadata, e.g. the “User Data Unregistered SEI Message” in AVC (H.264) and newer standards such as HEVC (H.265) and VVC (H.266).
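  • A minimal sketch of wrapping the opaque metadata into such a user data unregistered SEI payload (payloadType 5) is shown below; the 128-bit UUID value is hypothetical, and NAL unit encapsulation plus emulation prevention are omitted for brevity.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical 128-bit identifier (uuid_iso_iec_11578) for this metadata format. */
static const uint8_t kParamMetadataUuid[16] = {
    0x12, 0x34, 0x56, 0x78, 0x9a, 0xbc, 0xde, 0xf0,
    0x12, 0x34, 0x56, 0x78, 0x9a, 0xbc, 0xde, 0xf0
};

/* Writes a user data unregistered SEI message body (payloadType 5) carrying
 * the given metadata. Returns the number of bytes written, or 0 on overflow. */
static size_t
write_user_data_unregistered_sei (const uint8_t *metadata, size_t len,
                                  uint8_t *out, size_t out_size)
{
    size_t payload_size = 16 + len;   /* UUID + user data */
    size_t needed = 1 + payload_size / 255 + 1 + payload_size;
    size_t pos = 0;

    if (out_size < needed)
        return 0;

    out[pos++] = 5;                   /* payloadType = user data unregistered */
    size_t remaining = payload_size;  /* payloadSize, 0xFF run-length coded */
    while (remaining >= 255) {
        out[pos++] = 0xFF;
        remaining -= 255;
    }
    out[pos++] = (uint8_t) remaining;

    memcpy (out + pos, kParamMetadataUuid, 16);
    pos += 16;
    memcpy (out + pos, metadata, len);
    pos += len;
    return pos;                       /* rbsp trailing bits not included */
}
```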
  • According to an embodiment, an MPEG Timed Metadata track acts as the format for timed parameters.
  • the ISOBMFF contains different mechanisms, such as, timed metadata track and sample groups, for representing time varying metadata.
  • This information can be auxiliary information related to the media data and can be used for describing the operation to be performed with the media data.
  • Each sample of the timed metadata track corresponds to a time instance for which the sample carries the parameters or parameter sets.
  • the metadata for the network-based processing can use the timed metadata track along with the media data. This describes how the media data can be post-processed or rendered. Prior to insertion to the workflow or pipeline, the relevant sample is appended to the input buffer delivered to the next task (e.g., to the GPU processing task).
  • such a timed metadata track may be retrieved at a rate faster than the media data, so that prefetch and pre-processing of the metadata may be possible.
  • a derived visual track acts as the format for timed parameters.
  • Derived visual tracks are designed to enable defining a timed sequence of visual transformation operations to be applied to input still images and/or samples of timed sequences of images in the same presentation.
  • the derived visual track feature is built using tools defined in the ISO base media file format (ISO/IEC 14496-12).
  • a derived visual track describes a timed sequence of derived samples composed of an ordered list of derivation operations, each derivation operation applying a derivation transformation for the duration of the derived sample on an ordered list of inputs represented in the same presentation.
  • the method for media processing generally comprises receiving 1310 a media stream; retrieving 1320 a number of parameters to be used by a plurality of processing tasks of the media stream, wherein the plurality of processing tasks are carried out by a plurality of media processing entities of a network-based environment; incorporating 1330 the number of parameters to metadata; and delivering 1340 the media stream with the metadata through a processing workflow comprising the plurality of media processing entities.
  • Each of the steps can be implemented by a respective module of a computer system.
  • An apparatus comprises means for receiving a media stream; means for retrieving a number of parameters to be used by a plurality of processing tasks of the media stream, wherein the plurality of processing tasks are carried out by a plurality of media processing entities of a network-based environment; means for incorporating the number of parameters to metadata; and means for delivering the media stream with the metadata through a processing workflow comprising the plurality of media processing entities.
  • the means comprises at least one processor, and a memory including a computer program code, wherein the processor may further comprise processor circuitry.
  • the memory and the computer program code are configured to, with the at least one processor, cause the apparatus to perform the method of Figure 13 according to various embodiments.
  • a device may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the device to carry out the features of an embodiment.
  • a network device like a server may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the network device to carry out the features of an embodiment.
  • a computer program product according to an embodiment can be embodied on a non-transitory computer readable medium. According to another embodiment, the computer program product can be downloaded over a network in a data packet.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The embodiments relate to a method and technical equipment for implementing the method. The method comprises receiving a media stream; retrieving a number of parameters (740) to be used by a plurality of processing tasks (750) of the media stream, wherein the plurality of processing tasks are carried out by a plurality of media processing entities of a network-based environment; incorporating the number of parameters to metadata; and delivering the media stream with the metadata through a processing workflow comprising the plurality of media processing entities.

Description

A METHOD, AN APPARATUS AND A COMPUTER PROGRAM PRODUCT FOR PROCESSING MEDIA DATA
Technical Field
The present solution generally relates to processing media data. In particular, the solution relates to processing of media data in a network-based environment.
Background
Since the beginning of photography and cinematography, the most common type of image and video content has been captured by cameras with relatively narrow field of view and displayed as a rectangular scene on flat displays. The cameras are mainly directional, whereby they capture only a limited angular field of view (the field of view towards which they are directed).
More recently, new image and video capture devices are available. These devices are able to capture visual and audio content all around them, i.e. they can capture the whole angular field of view, sometimes referred to as 360 degrees field of view. More precisely, they can capture a spherical field of view (i.e., 360 degrees in all spatial directions). Furthermore, new types of output technologies have been invented and produced, such as head-mounted displays. These devices allow a person to see visual content all around him/her, giving a feeling of being “immersed” into the scene captured by the 360 degrees camera. The new capture and display paradigm, where the field of view is spherical, is commonly referred to as virtual reality (VR) and is believed to be the common way people will experience media content in the future.
Summary
The scope of protection sought for various example embodiments of the invention is set out by the independent claims. The example embodiments and features, if any, described in this specification that do not fall under the scope of the independent claims are to be interpreted as examples useful for understanding various example embodiments of the invention. Various aspects include a method, an apparatus and a non-transitory computer readable medium comprising a computer program, which are characterized by what is stated in the independent claims. Various details of the example embodiments are disclosed in the dependent claims and in the corresponding images and description.
According to a first aspect, there is provided a method for media processing, comprising receiving a media stream; retrieving a number of parameters to be used by a plurality of processing tasks of the media stream, wherein the plurality of processing tasks are carried out by a plurality of media processing entities of a network-based environment; incorporating the number of parameters to metadata; and delivering the media stream with the metadata through a processing workflow comprising the plurality of media processing entities.
According to a second aspect, there is provided an apparatus comprising at least means for receiving a media stream; means for retrieving a number of parameters to be used by a plurality of processing tasks of the media stream, wherein the plurality of processing tasks are carried out by a plurality of media processing entities of a network-based environment; means for incorporating the number of parameters to metadata; and means for delivering the media stream with the metadata through a processing workflow comprising the plurality of media processing entities.
According to a third aspect, there is provided an apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following: receive a media stream; retrieve a number of parameters to be used by a plurality of processing tasks of the media stream, wherein the plurality of processing tasks are carried out by a plurality of media processing entities of a network-based environment; incorporate the number of parameters to metadata; and deliver the media stream with the metadata through a processing workflow comprising the plurality of media processing entities.
According to a fourth aspect, there is provided a computer program product comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to receive a media stream; retrieve a number of parameters to be used by a plurality of processing tasks of the media stream, wherein the plurality of processing tasks are carried out by a plurality of media processing entities of a network-based environment; incorporate the number of parameters to metadata; and deliver the media stream with the metadata through a processing workflow comprising the plurality of media processing entities.
According to an embodiment, parameters of the metadata are updated from a media processing entity.
According to an embodiment, the metadata indicates a task allowed to use the parameters, and/or lifespan of the parameters.
According to an embodiment, the parameters comprise at least dynamically generated timed information needed in post-processing or rendering.
According to an embodiment, the parameters comprise information on the selection of targeted processing tasks.
According to an embodiment, the parameters comprise information on the dissemination of targeted processing tasks.
According to an embodiment, the metadata is delivered as an RTP format or an SEI message.
According to an embodiment, a parameter indicating that the media stream and the metadata are delivered in a same data stream is included into a bitstream.
According to an embodiment, a parameter indicating whether the media stream and the metadata are to be synchronized based on timestamps, or whether non-synchronized usage is allowed, is included into a bitstream.
According to an embodiment, the computer program product is embodied on a non-transitory computer readable medium.
Description of the Drawings
In the following, various embodiments will be described in more detail with reference to the appended drawings, in which Fig. 1 shows an NBMP Architecture according to an example;
Fig. 2 shows a media processing workflow of a NBMP;
Fig. 3 shows an example of event-triggered out-of-band parameter changes;
Fig. 4 shows an apparatus according to an embodiment;
Fig. 5 shows an implementation diagram according to an embodiment;
Fig. 6 shows an example of out-of-sync or lagging in parameter passing;
Fig. 7 shows an example of synchronized media data and metadata parameters;
Fig. 8 shows an example of a viewport-dependent delivery;
Fig. 9 shows an example of shaders and processors;
Fig. 10 shows a flowchart for preparing shader processing in the viewport-dependent rendering;
Fig. 11 shows an example of a viewport-dependent playing;
Fig. 12 shows an example of a metadata being converted as RTP custom header extension;
Fig. 13 is a flowchart illustrating a method according to an embodiment;
Fig. 14 illustrates an example of media data and metadata streams transported between two NBMP media processing functions; and
Fig. 15 illustrates an example of NBMP compound mode enabled.
Description of Example Embodiments
In the following, several embodiments will be described in the context of viewport-dependent delivery for omnidirectional conversational video. In particular, the present embodiments are targeted to in-band metadata driven network-based media processing.
The following description and drawings are illustrative and are not to be construed as unnecessarily limiting. The specific details are provided for a thorough understanding of the disclosure. However, in certain instances, well-known or conventional details are not described in order to avoid obscuring the description. References to one or an embodiment in the present disclosure can be, but not necessarily are, reference to the same embodiment and such references mean at least one of the embodiments.
Reference in this specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the disclosure.
Users may consume both videos and images as visual content. However, the consumption of videos and images has been independent of each other. The recent development of applications - such as immersive multimedia - has enabled new use cases where users consume both videos and images together. Immersive multimedia - such as omnidirectional content consumption - is more complex for the end user compared to the consumption of 2D content. This is due to the higher degree of freedom available to the end user (e.g., three degrees of freedom for yaw, pitch and roll). This freedom also results in more uncertainty.
As used herein the term “omnidirectional” may refer to media content that has greater spatial extent than a field-of-view of a device rendering the content. Omnidirectional content may for example cover substantially 360 degrees in the horizontal dimension and substantially 180 degrees in the vertical dimension, but omnidirectional may also refer to content covering less than 360 degree view in the horizontal direction and/or 180 degree view in the vertical direction.
The MPEG Omnidirectional Media Format (OMAF) v1 standardized the omnidirectional streaming of single three-degrees-of-freedom (3DoF) content (where the viewer is located at the centre of a unit sphere and has three degrees of freedom (Yaw-Pitch-Roll)). “Support of Immersive Teleconferencing and Telepresence for Remote Terminals” (ITT4RT) has been defined in 3GPP SP-180985. The objective of ITT4RT is to specify virtual reality (VR) support in MTSI (Multimedia Telephony Service for IMS) in 3GPP TS 26.114 and IMS-based (Internet Multimedia Subsystem) Telepresence in 3GPP TS 26.223 to enable support of an immersive experience for remote terminals joining teleconferencing and telepresence sessions. For MTSI, the work is expected to enable scenarios with two-way audio and one-way immersive video, e.g., a remote single user wearing an HMD (Head-Mounted Display) and participating in a conference will send audio and optionally two-dimensional (2D) video (e.g., of a presentation, screen sharing and/or a capture of the user itself), but receives stereo or immersive voice/audio and immersive video captured by an omnidirectional camera in a conference room connected to a fixed network.
The RTP (Realtime Transport Protocol) payload format and SDP (Session Description Protocol) parameters to be developed under the 3GPP IVAS (Immersive Voice and Audio Services) WI will be considered to support use of the IVAS codec for immersive voice and audio. The RTP payload format and SDP parameters for HEVC (High Efficiency Video Coding) will be considered to support immersive video. For video codec(s), use of omnidirectional video specific Supplemental Enhancement Information (SEI) messages for carriage of metadata required for rendering of the omnidirectional video has been considered.
NBMP (Network-Based Media Processing) is a standard (ISO/IEC 23090-8) belonging to MPEG (MPEG-I) providing a unified abstraction layer for composing, controlling, managing, and monitoring flexible media processing workflows, independent of the underlying cloud infrastructure provider(s). NBMP defines the architecture, APIs, media and metadata formats for discovering media processing functions, describing media workflows, deploying media processing workflows, configuring the runtime tasks, monitoring and taking corrective measures in case of faulty behavior. Offering a single standardized cloud-abstraction layer to service providers will give service providers high flexibility and reduce their time to market through the portability that comes with NBMP.
The NBMP facilitates digital media production through workflows (also referred to as “pipelines”) that are designed to facilitate various types of media transformation tasks, e.g. transcoding, filtering, content understanding and enhancement, etc. The processing may take an initial configuration (e.g. codec parameters) to guide the specific tasks. The present embodiments propose a generic “in-band” metadata approach to deliver parameters for the tasks along with the media streams. This differentiates from the conventional “out-of-band” metadata approach.
NBMP architecture according to the standard is illustrated in Figure 1. The NBMP architecture comprises a NBMP Source 100 providing a workflow description document to a NBMP workflow manager 110.
The NBMP workflow manager 110 is configured to create a workflow based on the received workflow description document (WDD). The workflow manager 110 selects and deploys the NBMP functions into selected Media Processing Entities (MPE) 120 and then performs the configuration of the tasks 125, 127 at the MPEs 120. The Task 125 receives media flow from a media source 130, which media flow is further delivered to media sink 140.
MPEs are provisioned from the underlying cloud infrastructure provided by cloud providers (Cloud stack or environment 170), which is done through the Cloud management communication (Cloud mgmt) by the NBMP Workflow Manager 110 in Figure 1.
NBMP uses so-called Descriptors as the basic elements for all its resource documents such as the workflow documents, task documents, and function documents. Descriptors are a group of NBMP parameters which describe a set of related characteristics of a Workflow, Function or Task. Some key descriptors are General, Input, Output, Processing, Requirements, Configuration, etc.
Figure 2 illustrates an NBMP workflow. The workflow consists of one or more tasks 602, 604, 606, 608, 610, 612, 614, 616. The arrows represent the data flow 617 from upstream tasks to downstream tasks.
The parameters (i.e. Descriptors) may be provided by the workflow manager at the time of the workflow creation or initialization phase. The parameters are defined and initialized explicitly. The processing behavior changes either immediately or in a delayed mode, according to the needs. This is referred to in the present disclosure as “out-of-band” parameter signalling, as parameters are provided through other data channels than the media data themselves. The out-of-band mechanism, however, has critical timing issues in ensuring that parameter changes are applied at the right moment to the correct media data or data chunks. Figure 3 illustrates an example of this drawback. In Figure 3 the parameter change is triggered by an event (i.e. “ROI event”) occurring at one task 301 (for example, object detector for region of interest encoding in Task 1) of the pipeline. The speed of the ROI data flow (arrows A) through the application or workflow manager 305 to the downstream tasks 303 (e.g. Task 3) differs from, and is unpredictable compared to, the speed of the media data flow (arrows B), due to the processing and external event transmission latency.
The media data may be divided into a series of non-overlapping data chunks or frames. Each chunk can contain a unique identifier and timing data, for example a timestamp, which may be monotonically increasing. The out-of-band approach requires extra time to check the time-sensitive parameters. It is challenging to synchronize the media and external signalling channels without any buffering techniques. However, once buffers are used, the overall latency increases noticeably.
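As an illustration only, such a chunk could be represented by a structure along the following lines (in C code syntax); the field names and types are assumptions of this sketch, not definitions of the present embodiments:

#include <stdint.h>
#include <stddef.h>

/* A media data chunk carrying a unique identifier and a monotonically
 * increasing timestamp, as described above. */
typedef struct MediaChunk {
  uint64_t chunk_id;        /* unique identifier of the chunk */
  uint64_t timestamp_us;    /* monotonically increasing timestamp, in microseconds */
  const uint8_t *data;      /* media payload */
  size_t size;              /* payload size in bytes */
} MediaChunk;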
The present embodiments provide a processing pipeline where an upstream task connects to one or more downstream tasks through relations. A relation is uni-directional and represents the data flow. A task is the upstream of all tasks connected to it directly or indirectly through the uni-directional relations; conversely, a task is downstream of all tasks whose outputs it consumes, directly or indirectly. In particular, downstream tasks depend on the upstream task or tasks.
Example embodiments of the present solution provide systems and methods for processing dynamic and real-time parameters stored as metadata in media data, thus providing an in-band parameter passing via metadata approach. According to one embodiment, a number of parameters can be generated dynamically by upstream tasks, whereupon the parameters are signalled as metadata which is delivered together with the media data to downstream tasks. When the media data is passed along the tasks, i.e. in the processing pipeline, each task can read and parse the metadata attached to the media data and use the parameters respectively. Tasks which consume parameters can generate new parameters to be appended to the metadata and delivered further along the processing pipeline.
Some parameters can be used only by specific tasks, and can be removed only when they have been consumed by those specific tasks. The data structure of the metadata indicates conditions about 1) by whom (e.g., which tasks) the parameters of the parameter sets can be used; and 2) the life-span of the parameters, e.g., used once versus a Time to Live (TTL) value.
An apparatus according to an embodiment is illustrated in Figure 4. The apparatus is a user equipment (also referred to as a “client”) for the purposes of the present embodiments, i.e. for content authoring. The apparatus 90 comprises a main processing unit 91, a memory 92, a user interface 94 and a communication interface 93. The apparatus according to an embodiment, shown in Figure 4, also comprises a camera module 95. Alternatively, the apparatus may be configured to receive image and/or video data from an external camera device over a communication network. The memory 92 stores data, including computer program code, in the apparatus 90. The computer program code is configured to implement the method according to various embodiments by means of various computer modules. The camera module 95 or the communication interface 93 receives data, in the form of images or a video stream, to be processed by the processor 91. The communication interface 93 forwards processed data, i.e. the image file, for example to a display of another device, such as a virtual reality headset. When the apparatus 90 is a video source comprising the camera module 95, user inputs may be received from the user interface.
An example of a device for content consumption, i.e. an apparatus according to another embodiment, is a virtual reality headset, such as a head-mounted display (HMD) for stereo viewing. The head-mounted display may comprise two screen sections or two screens for displaying the left and right eye images. The displays are close to the eyes, and therefore lenses are used to make the images easily viewable and to spread the images to cover as much as possible of the eyes’ field of view. The device is attached to the head of the user so that it stays in place even when the user turns his head. The device may have an orientation detecting module for determining the head movements and direction of the head. The head-mounted display is able to show omnidirectional content (3DOF content) of the recorded/streamed image file to a user.
Examples of an application of omnidirectional video comprise a video call or a teleconference with video. Such a call or conference with omnidirectional video has the benefit of providing greater immersion for a user, and the possibility to explore the captured space in different directions. According to an example embodiment, the apparatus of Figure 4 may comprise one or more computer modules to
- receive a number of parameter key-value pairs with other attributes related to optional target tasks by means of identifiers, types of parameters and TTL values;
- create appropriate metadata formats supporting variable-length formats and attach the created metadata to the media data as a new payload content; and
- update or delete one or more parameters.
The metadata format can be a proprietary or a standard format depending on the payload types of the media transport. The timestamp for the metadata, when added to the associated media data, is derived from the media data (e.g., the timestamp for the rotation value of an ERP sub-frame is the timestamp of the ERP frame).
The example embodiments provide a specific metadata structure called “parameter metadata” for media (e.g. video and audio) and other data bitstreams. The parameter metadata can be implemented as an extension to any metadata format, depending on the data transmission/transport protocols. The parameters may comprise timed information that is generated dynamically and that is needed by post-processing or rendering tasks for reacting according to the dynamic information represented by the metadata. The high-level syntax of the parameter metadata is defined in the following Table 1:
TABLE 1:
The “type” information can be specified in a hierarchical manner, with the domain being the namespace and the operations defined within that domain.
The “target” object can be defined as in the following table (Table 2), with rules and criteria for the selection of the processing tasks.
The “ttl” value indicates the lifespan of the parameter set. It can be defined as a counter or in a date format. When a counter is used, it refers to the number of tasks or the number of times the parameters are consumed by tasks. Other TTL types may be applied to indicate the lifespan of the parameters. When a date number is used as the ttl value, for example a standard timestamp (ISO 8601), the value can be an addition to the current media timestamp being associated. In some cases, the “ttl” value can be used until a new parameter of the same type is generated. This can be used in the case of data which has sparse samples, holding the old value until a new sample arrives. In some cases, the value should be negative to indicate the relevance for any older data, e.g. video frames with certain orders (decoding order or presentation order) cached in the buffer.
TABLE 2:
The “parameter” object can be defined as in the following table (Table 3):
TABLE 3:
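For illustration, the high-level structure outlined above (the “type”, “target”, “ttl” and “parameter” objects) could be expressed in C code syntax roughly as follows; the field names and types are assumptions of this sketch rather than normative definitions of Tables 1 to 3:

#include <stdint.h>
#include <stddef.h>

typedef struct ParameterTarget {
  char **task_ids;               /* tasks allowed to consume the parameter set */
  size_t task_count;
} ParameterTarget;

typedef struct ParameterEntry {
  char *key;                     /* parameter name */
  char *value;                   /* parameter value, serialized as text */
} ParameterEntry;

typedef struct ParameterMetadata {
  char *type;                    /* hierarchical type, e.g. "<domain>.<operation>" */
  ParameterTarget target;        /* selection rules for the processing tasks */
  int32_t ttl;                   /* lifespan: consumption counter or time offset */
  ParameterEntry *parameters;    /* the parameter key-value pairs */
  size_t parameter_count;
} ParameterMetadata;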
The present embodiments can be implemented with any processing type, streaming or batch processing. For example, the design can be implemented in the GStreamer framework. GStreamer is a pipeline-based multimedia framework that links together a wide variety of media processing systems to complete complex workflows. For instance, GStreamer can be used to build a system that reads files in one format, processes them, and exports them in another. The formats and processes can be changed in a plug-and-play fashion.
A shared library is designed to write and retrieve parameters to/from the GStreamer media buffer, together with other existing metadata, such as timing metadata for various times, e.g. DTS (decoding timestamp), PTS (presentation timestamp) and the duration of the data, and video metadata that contains information such as width, height, video data planes for planar formats, and strides for padding. The library may, at first, register the metadata type under a unique name, for example “generic-parameter-metadata”, in the system’s metadata registry for later global lookup by other tasks. The library may also implement serialization and de-serialization methods to convert the parameter data between the internal parameter data structure and a binary bit-stream, which can be concatenated into the media data buffer. When the binary metadata is added together with other available timing metadata, the timestamps such as DTS, PTS and duration can be omitted from this parameter metadata to save space.
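As a minimal sketch of such a shared library, the metadata could, for example, be registered and attached with GStreamer’s custom-meta helper API (available in GStreamer 1.20 and later); the use of this particular API, and the viewport parameter names, are assumptions of this example rather than requirements of the embodiments:

#include <gst/gst.h>

/* Register the metadata type once under the unique name used for lookup. */
static void
register_parameter_meta (void)
{
  static const gchar *tags[] = { "parameters", NULL };

  gst_meta_register_custom ("generic-parameter-metadata", tags,
      NULL /* transform */, NULL /* user_data */, NULL /* destroy */);
}

/* Attach a parameter set to a media buffer as in-band metadata. */
static void
attach_viewport_parameters (GstBuffer * buffer, gdouble azimuth,
    gdouble elevation, gdouble tilt)
{
  GstCustomMeta *meta =
      gst_buffer_add_custom_meta (buffer, "generic-parameter-metadata");
  GstStructure *s = gst_custom_meta_get_structure (meta);

  gst_structure_set (s,
      "azimuth", G_TYPE_DOUBLE, azimuth,
      "elevation", G_TYPE_DOUBLE, elevation,
      "tilt", G_TYPE_DOUBLE, tilt, NULL);
}

A downstream task can retrieve the same structure with gst_buffer_get_custom_meta() and read the values before processing the buffer.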
Figure 5 illustrates a generic parameter metadata library 500 with its properties and key functions. The “tags” are used by the framework to look up registered metadata by single or multiple tag names. The “name” is the unique identifier of the metadata implementation, given that the media streams may contain one or multiple different metadata blocks. The “init function” and “free function” are used to initialize the “parameters” object by allocating and freeing memory. The “serialization function” and “de-serialization function” perform the necessary low-level binary bitstream conversion.
According to one aspect of the present embodiments, the implementation can be done in ISO/IEC 23090-8 MPEG-I Network-Based Media Processing (NBMP). The parameter set can be defined in the “metadata parameters” under the Input and Output Descriptors. By default, the metadata parameters are delivered out-of-band in NBMP along with the input and output metadata. There are a few properties, such as “stream-id”, “protocol” and “caching-server-url”, which make the metadata stream different from the corresponding I/O media streams. In order to support the in-band approach, those properties can have the same value as the corresponding I/O media stream. Alternatively, NBMP can define new parameters in the NBMP Descriptors to indicate the in-band carriage of the metadata stream with the media streams over the same protocol used by the media streams.
The present embodiments are described in a more detailed manner with reference to various use cases: 1) Object detection parameters passing from an object detection task to an object overlay task; 2) Viewport parameter passing in 360-degree viewport-dependent video processing and streaming; 3) Parameter control over a network-based media processing pipeline (viewport parameters).
1) Object detection parameters passing from Object detection task to Object overlay task
Real-time object detection is a computer-vision technique to detect, locate and track one or more objects in an image or a video. A special attribute of object detection is that it identifies the class of the object (person, table, chair, etc.) and its location-specific coordinates in the given image. The location is indicated by drawing a bounding box as an overlay around the object. The term “overlay” refers to visual media, e.g. videos and/or images, that is rendered over 360-degree video content. The coded overlaying video can be a separate stream or part of the bitstream of the currently rendered 360-degree video/image.
Figure 6 illustrates the object detection example with two detected objects, a person and a hat, with their confidence values and bounding boxes. There are two tasks involved in this example: 1) the object detector task 620, and 2) the bounding box overlay task 640. Figure 6 illustrates an example where an out-of-band approach is used. In this example the detected information is signalled at time T1 to the application 630 and relayed to the overlay task 640 at time T2, where T2 differs from T1 due to some delay, which causes the drawing at time T2 to be positioned incorrectly onto the wrong frame 645. This kind of overlay operation is a simple example to illustrate the potential delay caused by the parameter delivery over a separate transportation channel.
Figure 7 illustrates the in-band mechanism according to the present embodiments applied to the example of Figure 6. It can be seen that the delay that appeared in Figure 6 does not exist, as the parameters (i.e. the object bounding boxes) are delivered together with the correct video frames 740. The overlay task 750 can therefore draw those boxes correctly on the right frame.
The sample metadata definition for the object detection is illustrated as follows:
Object-detection object structure (in C code syntax) can be as follows (it is to be noted that other data representations, such as JSON/XML, can be used):

typedef struct _GstDetectedObjectInfo {
  guint32 x, y, width, height;    /* bounding box */
  gfloat confidence;              /* detection confidence */
  gchar *class_name;              /* detected class, e.g. "person" */
  guint32 track_id;               /* tracking identifier */
} GstDetectedObjectInfo;

/* Array of detected objects */
typedef struct _GstDetectedObjectsInfoArray {
  GstDetectedObjectInfo *items;
  guint32 size;
} GstDetectedObjectsInfoArray;
The binary format of the metadata bit-stream is implementation-specific.
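As one illustration of the serialization function mentioned above, the detected-object array could be flattened into an implementation-specific byte stream as follows; the simple count-plus-fixed-size-records layout in host byte order is an assumption of this sketch, not a format defined by the embodiments:

#include <gst/gst.h>
#include <string.h>

static gsize
serialize_detected_objects (const GstDetectedObjectsInfoArray * array,
    guint8 * out, gsize out_size)
{
  /* 32-bit object count followed by five 32-bit fields per object
   * (bounding box and track identifier); confidence and class name are
   * omitted for brevity in this sketch. */
  gsize needed = 4 + (gsize) array->size * 5 * 4;
  guint8 *p = out;
  guint32 i;

  if (out_size < needed)
    return 0;

  memcpy (p, &array->size, 4); p += 4;
  for (i = 0; i < array->size; i++) {
    const GstDetectedObjectInfo *o = &array->items[i];
    memcpy (p, &o->x, 4); p += 4;
    memcpy (p, &o->y, 4); p += 4;
    memcpy (p, &o->width, 4); p += 4;
    memcpy (p, &o->height, 4); p += 4;
    memcpy (p, &o->track_id, 4); p += 4;
  }
  return needed;
}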
2) Viewport parameter passing in 360-degree viewport-dependent video processing and streaming
Delivery of omnidirectional conversational video can be viewport dependent. A viewport may be defined as a region of an omnidirectional image or video suitable for display and viewing by the user. A current viewport (which may sometimes be referred to simply as a viewport) may be defined as the part of the spherical video that is currently displayed and hence viewable by the user(s).
At any point in time, a video rendered by an application on a head-mounted display (HMD) renders a portion of the 360-degree video, which is referred to as a viewport. Likewise, when viewing a spatial part of the 360-degree content on a conventional display, the spatial part that is currently displayed is a viewport. A viewport is a window on the 360-degree world represented in the omnidirectional video displayed via a rendering display. A viewport may be characterized by a horizontal field-of-view (VHFoV) and a vertical field-of-view (VVFoV). In the following, the horizontal field-of-view of the viewport is abbreviated HFoV and, respectively, the vertical field-of-view of the viewport is abbreviated VFoV.
Figure 8 shows an example of viewport-dependent delivery. For viewport-dependent delivery of omnidirectional conversational video, the viewport parameter as well as other parameters (e.g. packed picture mapping, etc.) need to be defined. The operation involves tasks 810, e.g. determining a viewport region of the 360-degree video frame (e.g. picture), and rotating the projected picture to reorient the viewport region of the projected picture to the center of the projected picture for encoding. Upon receiving the encoded content, rendering and/or consuming clients perform tasks 820, e.g. decoding the bitstream, rotating the region to its original location in the 360-degree projected picture, and finally projecting the region onto the final rendering surface, e.g. a 2D display plane, for example by converting the equirectangular projection to a rectilinear projection. Figure 8 illustrates the tasks 810, 820 according to an example.
The rotation and projection tasks 820 can be implemented in various ways. Such pixel-level processing is done with the help of the parallel processing capability of a GPU (Graphics Processing Unit).
Figure 9 shows an example of shaders in the GPU and other processing in the CPU. A shader, or shading, is a common approach for this kind of work and is supported by almost all GPU vendors. GLSL (OpenGL Shading Language) is a high-level shading language based on the C programming language, specified by the Khronos Group. Figure 9 illustrates the shaders and the parameter passing between the CPU and the GPU. The media content needs to be copied as a texture from CPU memory to GPU memory. When shader code is executed by a GPU unit, some parameters/variables need to be defined/assigned from the CPU to the GPU threads. Those parameters are defined as uniforms 930. In GLSL, a uniform 930 is a global shader variable declared with the “uniform” keyword. In this disclosure, a “uniform” defines the viewport variables (parameters) for the vertex shaders 940 for the rotation and the projection, respectively.
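As a minimal sketch (an assumption of this description, not a normative API) of the uniform passing outlined above, the viewport parameters could be declared as GLSL uniforms and assigned from the CPU side with standard OpenGL calls; the uniform and function names are illustrative only:

#include <GLES2/gl2.h>   /* or the equivalent desktop OpenGL headers */

static const char *vertex_shader_source =
    "uniform vec3 u_viewport;  /* azimuth, elevation, tilt */    \n"
    "uniform vec2 u_fov;       /* horizontal and vertical FoV */ \n"
    /* ... remainder of the shader performing rotation and projection ... */
    ;

static void
update_viewport_uniforms (GLuint program, float azimuth, float elevation,
    float tilt, float hfov, float vfov)
{
  glUseProgram (program);
  glUniform3f (glGetUniformLocation (program, "u_viewport"),
      azimuth, elevation, tilt);
  glUniform2f (glGetUniformLocation (program, "u_fov"), hfov, vfov);
}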
The sample metadata definition for viewport information is illustrated as follows:
The parameter object can be represented in one example using JSON Schema. Other formats can be supported, depending on the platform implementation.

{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "type": "object",
  "properties": {
    "viewport": {
      "type": "object",
      "properties": {
        "azimuth": { "type": "number" },
        "elevation": { "type": "number" },
        "tilt": { "type": "number" }
      },
      "required": [ "azimuth", "elevation", "tilt" ]
    },
    "field_of_view": {
      "type": "object",
      "properties": {
        "width": { "type": "number" },
        "height": { "type": "number" }
      },
      "required": [ "width", "height" ]
    },
    "zoom": {
      "type": "number"
    }
  }
}
Figure 10 shows example steps of how the tasks interpret the metadata as GLSL uniforms and prepare for the shader processing in the viewport-dependent rendering; a code sketch following the list illustrates the conversion of the parameters into uniforms. The flowchart of Figure 10 comprises
- preparing OpenGL shaders 1010;
- reading 1020 the metadata from the incoming buffer (picture) and checking the parameter type (“uniforms”), as well as other properties such as the ttl and target values;
- if the target matches the underlying task, looping 1030 through the parameters object and getting all key-value pairs;
- constructing 1040 uniforms by converting the parameter key to the uniform name and the parameter value to the uniform value;
- compiling 1050 shader code and updating uniforms;
- after the rendering is done and the output buffer is ready, converting 1060 the uniform variables back to parameter metadata and concatenating it to the output buffer.
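A minimal sketch of steps 1020 to 1040 above, assuming the GStreamer custom-meta representation used earlier in this description and that all parameter values are numeric; the metadata name and the one-to-one key-to-uniform mapping are assumptions of this example:

#include <gst/gst.h>
#include <GLES2/gl2.h>

static void
parameters_to_uniforms (GstBuffer * buffer, GLuint program)
{
  GstCustomMeta *meta =
      gst_buffer_get_custom_meta (buffer, "generic-parameter-metadata");
  GstStructure *s;
  gint i;

  if (meta == NULL)
    return;                      /* no in-band parameters on this buffer */

  s = gst_custom_meta_get_structure (meta);
  glUseProgram (program);

  /* Loop through the key-value pairs and map each key to a uniform name. */
  for (i = 0; i < gst_structure_n_fields (s); i++) {
    const gchar *key = gst_structure_nth_field_name (s, (guint) i);
    gdouble value;

    if (gst_structure_get_double (s, key, &value))
      glUniform1f (glGetUniformLocation (program, key), (GLfloat) value);
  }
}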
It is worth noting that multiple shader tasks can be linked in the pipeline. The parameters carried as metadata need to be written back at each step. This is necessary because uniforms can be changed programmatically by the application or the workflow manager, so it is not possible to simply pass the metadata through from the input stream to the output stream. The metadata needs to be converted and updated from the shader’s uniform variables.
3) Parameter control over network-based media processing pipeline (viewport parameters)
According to one aspect of the present embodiments, the parameters as metadata need to be supported when the media data is delivered over a network (e.g. an IP network). The transmission data may contain both a header and the actual data (payload) to be transmitted. The header part can be extended to convey the parameter metadata following a standard general mechanism such as RTP header extensions. It has to be understood that the RTP protocol is an example protocol, and the scope of this invention is not limited to such a protocol. For instance, any transport protocol suitable for media or control data delivery could convey the information described in this invention.
Figure 11 illustrates a viewport-dependent playing example, where the viewport and other rendering parameters are serialized as metadata over an IP network. At the sender user equipment 1110, the rendering variables are serialized into metadata and attached to the stream 1120. At the receiver user equipment 1140, the shader rendering is controlled by the uniform variables. When carried in the RTP packet/header, the parameters can be re-written by the de-packetizer to the stream buffer as they are synchronized 1150.
According to an example, the metadata needed for network-based media processing (e.g., viewport-dependent streaming) over a network can use the IETF’s RTP (Real-time Transport Protocol) format. RTP is a network protocol for delivering audio and video over IP networks. RTP is used in video communication and entertainment systems, including the WebRTC web standard. The metadata can be converted into RTP custom header extensions by the last component before sending over the IP network. An example is shown with a local ID (1) 1210 and a length (n) (in full 32-bit words) 1220. The data is packed as an opaque byte stream 1230. The header extension local identifier (ID, 1210) can be signalled externally by other means, e.g. via SDP offer and answer, according to RFC 8285. This has been illustrated in Figure 12.
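As one illustration of the above, the opaque metadata bytes could be packed into the generic RTP header extension block of RFC 3550 (a 16-bit profile-defined identifier followed by a 16-bit length counted in full 32-bit words); the choice of the generic layout rather than the RFC 8285 one-byte/two-byte element forms, and the identifier value, are assumptions of this sketch:

#include <stdint.h>
#include <string.h>

static size_t
pack_rtp_header_extension (uint16_t profile_id, const uint8_t * metadata,
    size_t metadata_len, uint8_t * out, size_t out_size)
{
  size_t words = (metadata_len + 3) / 4;    /* pad up to full 32-bit words */
  size_t total = 4 + words * 4;

  if (out_size < total)
    return 0;

  out[0] = (uint8_t) (profile_id >> 8);     /* profile-defined identifier   */
  out[1] = (uint8_t) (profile_id & 0xff);
  out[2] = (uint8_t) (words >> 8);          /* length in 32-bit words       */
  out[3] = (uint8_t) (words & 0xff);
  memset (out + 4, 0, words * 4);           /* zero the padding bytes       */
  memcpy (out + 4, metadata, metadata_len); /* opaque parameter byte stream */
  return total;
}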
Figure 14 illustrates an example embodiment of the media data and metadata streams being transported between two NBMP media processing functions 1401 and 1430. Their corresponding connections 1410 and 1420 are signalled by NBMP’s “connection-map” parameter specified in the Processing Description according to the ISO NBMP standard. For instance, the media connection 1410 connects the output port 1440 of one function 1401 and the input port 1444 of another function 1430. Likewise, the metadata connection 1420 links the metadata output port 1442 and the metadata input port 1446.
Figure 15 illustrates an example embodiment of the combined transferring mode (compound or synchronized transferring mode), where the media stream and the metadata stream are combined into one stream data 1501 (referred to as the media + metadata bitstream, or simply media + metadata) using one transport protocol 1511. The single stream data format is generated by Function 1 1401 such that it can be understood by Function 2 1430. A new parameter, “compound mode”, is specified in the “connection-map” object describing the connection between function 1401 and function 1430. A new parameter, “synchronized-mode”, can also be included in the “connection-map” object. “synchronized-mode” can be used to indicate that the two streams (media and metadata) are to be synchronized, based on the timestamps, even when they have different data rates.
New parameters added to connection-map object in NBMP are defined in the following table 4:
TABLE 4:
The synchronized mode addresses the scenario where the media data and the metadata need not be strictly synchronized. This allows the latency to be lowered by not waiting for the corresponding metadata to arrive at the receiver. Consequently, the receiving function can process the media data or the metadata without waiting for the other, by applying any suitable methods. For example, in the case of low-latency VDD (viewport-dependent delivery), the receiver function can perform viewport warping without waiting for the rotation value, with the help of AI-based prediction.
The synchronized mode can be present when the compound mode is false.
The compound mode can be supported by the transport protocols 1511 with a mutually agreeable data format (specified by the workflow description), or by the function implementation natively. In the latter case, the sending function needs to indicate its capability 1520 to generate, and the receiving function its capability to parse, the data comprising metadata and media data 1501. The following Table 5 defines the new parameter in the context of the ISO NBMP standard.
TABLE 5:
In another embodiment, the payload data transferred over the transport protocol 1511 can support multiple or hybrid data types. That is, the metadata and the media data can be sent together synchronously without being physically combined or muxed into one single bitstream; for example, the metadata can be placed in the header part of the packet and the media data is the payload, as in the above-mentioned RTP protocol.
SEI NAL Unit as one realization of embedding custom metadata
In another embodiment, the metadata needed for network-based media processing (e.g., viewport-dependent streaming) over a network, or for the transport of conversational media data (e.g. video streams) over a network, can use the SEI (Supplemental Enhancement Information) format. The SEI mechanism and specific SEI messages have been specified as part of video coding standards, such as H.264/AVC and H.265/HEVC. Lately, SEI messages have been specified in the Versatile Supplemental Enhancement Information (VSEI) standard (ITU-T H.274 | ISO/IEC 23002-7) in a manner that is independent of the underlying codec. SEI has been developed to contain various types of data that indicate the timing of the video pictures or describe various properties of the coded video or how it can be used or enhanced. SEI messages are also defined that can contain arbitrary user-defined data. SEI messages do not affect the core decoding process, but can indicate how the video is recommended to be post-processed or displayed. A custom SEI payload type can be used for signalling the opaque binary metadata, e.g. the “User Data Unregistered SEI message” in AVC (H.264) and newer standards such as HEVC (H.265) and VVC (H.266).
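For illustration, the payload of such a “user data unregistered” SEI message (payload type 5) consists of a 16-byte UUID followed by the opaque user data; a minimal sketch of building that payload is shown below. NAL unit wrapping, payload size coding and emulation prevention are intentionally omitted, and the UUID value would be chosen by the implementation:

#include <stdint.h>
#include <string.h>

#define SEI_PAYLOAD_TYPE_USER_DATA_UNREGISTERED 5

static size_t
build_user_data_unregistered_payload (const uint8_t uuid[16],
    const uint8_t * metadata, size_t metadata_len,
    uint8_t * out, size_t out_size)
{
  if (out_size < 16 + metadata_len)
    return 0;

  memcpy (out, uuid, 16);                    /* uuid_iso_iec_11578       */
  memcpy (out + 16, metadata, metadata_len); /* user_data_payload_byte[] */
  return 16 + metadata_len;
}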
MPEG Timed Metadata track as the format for timed parameters
The ISOBMFF contains different mechanisms, such as the timed metadata track and sample groups, for representing time-varying metadata. This information can be auxiliary information related to the media data and can be used for describing the operation to be performed with the media data. Each sample of the timed metadata track corresponds to a time instance for which the sample carries the parameters or parameter sets.
In yet another embodiment, the metadata for the network-based processing can use the timed metadata track along with the media data. This describes how the media data can be post-processed or rendered. Prior to insertion into the workflow or pipeline, the relevant sample is appended to the input buffer delivered to the next task (e.g., to the GPU processing task).
In another embodiment, such a timed metadata track may be retrieved at a rate faster than the media data, so that prefetch and pre-processing of the metadata may be possible.
In another embodiment, a derived visual track, as specified in ISO/IEC 23001-16, acts as the format for timed parameters. Derived visual tracks are designed to enable defining a timed sequence of visual transformation operations to be applied to input still images and/or samples of timed sequences of images in the same presentation. The derived visual track feature is built using tools defined in the ISO base media file format (ISO/IEC 14496-12). A derived visual track describes a timed sequence of derived samples composed of an ordered list of derivation operations, each derivation operation applying a derivation transformation for the duration of the derived sample on an ordered list of inputs represented in the same presentation.
The method according to an embodiment is shown in Figure 13. The method for media processing generally comprises receiving 1310 a media stream; retrieving 1320 a number of parameters to be used by a plurality of processing tasks of the media stream, wherein the plurality of processing tasks are carried out by a plurality of media processing entities of a network-based environment; incorporating 1330 the number of parameters to metadata; and delivering 1340 the media stream with the metadata through a processing workflow comprising the plurality of media processing entities. Each of the steps can be implemented by a respective module of a computer system.
An apparatus according to an embodiment comprises means for receiving a media stream; means for retrieving a number of parameters to be used by a plurality of processing tasks of the media stream, wherein the plurality of processing tasks are carried out by a plurality of media processing entities of a network-based environment; means for incorporating the number of parameters to metadata; and means for delivering the media stream with the metadata through a processing workflow comprising the plurality of media processing entities. The means comprises at least one processor, and a memory including a computer program code, wherein the processor may further comprise processor circuitry. The memory and the computer program code are configured to, with the at least one processor, cause the apparatus to perform the method of Figure 13 according to various embodiments.
The various embodiments can be implemented with the help of computer program code that resides in a memory and causes the relevant apparatuses to carry out the method. For example, a device may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the device to carry out the features of an embodiment. Yet further, a network device like a server may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the network device to carry out the features of an embodiment.
A computer program product according to an embodiment can be embodied on a non-transitory computer readable medium. According to another embodiment, the computer program product can be downloaded over a network in a data packet.
If desired, the different functions discussed herein may be performed in a different order and/or concurrently with each other. Furthermore, if desired, one or more of the above-described functions and embodiments may be optional or may be combined.
Although various aspects of the embodiments are set out in the independent claims, other aspects comprise other combinations of features from the described embodiments and/or the dependent claims with the features of the independent claims, and not solely the combinations explicitly set out in the claims.
It is also noted herein that while the above describes example embodiments, these descriptions should not be viewed in a limiting sense. Rather, there are several variations and modifications which may be made without departing from the scope of the present disclosure as defined in the appended claims.

Claims:
1. A method for media processing, comprising:
- receiving a media stream;
- retrieving a number of parameters to be used by a plurality of processing tasks of the media stream, wherein the plurality of processing tasks are carried out by a plurality of media processing entities of a network-based environment;
- incorporating the number of parameters to metadata; and
- delivering the media stream with the metadata through a processing workflow comprising the plurality of media processing entities.
2. The method according to claim 1, further comprising updating parameters to metadata from a processing task.
3. The method according to claim 1 or 2, wherein the metadata indicates a task allowed to use the parameters, and/or lifespan of the parameters.
4. The method according to any of the claims 1 to 3, wherein the parameters comprise at least dynamically generated timed information needed in post-processing or rendering.
5. The method according to any of the claims 1 to 4, wherein the parameters comprise information on the selection of targeted processing tasks.
6. The method according to any of the claims 1 to 4, wherein the parameters comprise information on the dissemination of targeted processing tasks.
7. The method according to any of the claims 1 to 6, wherein the metadata is delivered as an RTP format or an SEI message.
8. The method according to any of the claims 1 to 7, further comprising including into a bitstream a parameter indicating that the media stream and the metadata are delivered in a same data stream.
9. The method according to any of the claims 1 to 8, further comprising including into a bitstream a parameter indicating whether the media stream and the metadata are to be synchronized based on timestamps or allow non-synchronized usage.
10. An apparatus comprising at least
- means for receiving a media stream;
- means for retrieving a number of parameters to be used by a plurality of processing tasks of the media stream, wherein the plurality of processing tasks are carried out by a plurality of media processing entities of a network-based environment;
- means for incorporating the number of parameters to metadata; and
- means for delivering the media stream with the metadata through a processing workflow comprising the plurality of media processing entities.
11. The apparatus according to claim 10, further comprising means for updating parameters to metadata from a processing task.
12. The apparatus according to claim 10 or 11, wherein the metadata indicates a task allowed to use the parameters, and/or lifespan of the parameters.
13. The apparatus according to any of the claims 10 to 12, wherein the parameters comprise at least dynamically generated timed information needed in post-processing or rendering.
14. The apparatus according to any of the claims 10 to 13, wherein the parameters comprise information on the selection of targeted processing tasks.
15. The apparatus according to any of the claims 10 to 14, wherein the parameters comprise information on the dissemination of targeted processing tasks.
16. The apparatus according to any of the claims 10 to 15, wherein the metadata is delivered as an RTP format or an SEI message.
17. The apparatus according to any of the claims 10 to 16, further comprising means for including into a bitstream a parameter indicating that the media stream and the metadata are delivered in a same data stream.
18. The apparatus according to any of the claims 10 to 17, further comprising means for including into a bitstream a parameter indicating whether the media stream and the metadata are to be synchronized based on timestamps or allow non-synchronized usage.
19. An apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following:
- receive a media stream;
- retrieve a number of parameters to be used by a plurality of processing tasks of the media stream, wherein the plurality of processing tasks are carried out by a plurality of media processing entities of a network-based environment;
- incorporate the number of parameters to metadata; and
- deliver the media stream with the metadata through a processing workflow comprising the plurality of media processing entities.
EP22779207.4A 2021-03-29 2022-02-14 A method, an apparatus and a computer program product for processing media data Pending EP4315868A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
FI20215360 2021-03-29
PCT/FI2022/050087 WO2022207962A1 (en) 2021-03-29 2022-02-14 A method, an apparatus and a computer program product for processing media data

Publications (1)

Publication Number Publication Date
EP4315868A1 true EP4315868A1 (en) 2024-02-07

Family

ID=83455636

Family Applications (1)

Application Number Title Priority Date Filing Date
EP22779207.4A Pending EP4315868A1 (en) 2021-03-29 2022-02-14 A method, an apparatus and a computer program product for processing media data

Country Status (2)

Country Link
EP (1) EP4315868A1 (en)
WO (1) WO2022207962A1 (en)


Also Published As

Publication number Publication date
WO2022207962A1 (en) 2022-10-06

