WO2022078148A1 - Media file encapsulation method, media file decapsulation method, and related device - Google Patents


Info

Publication number
WO2022078148A1
WO2022078148A1 · PCT/CN2021/118755 · CN2021118755W
Authority
WO
WIPO (PCT)
Prior art keywords
media
file
application scenario
type field
stream
Prior art date
Application number
PCT/CN2021/118755
Other languages
English (en)
French (fr)
Inventor
胡颖
Original Assignee
腾讯科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司 filed Critical 腾讯科技(深圳)有限公司
Priority to JP2022561600A priority Critical patent/JP7471731B2/ja
Priority to EP21879194.5A priority patent/EP4231609A4/en
Priority to KR1020227037594A priority patent/KR102661694B1/ko
Publication of WO2022078148A1 publication Critical patent/WO2022078148A1/zh
Priority to US17/954,134 priority patent/US20230034937A1/en

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/10Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N13/106Processing image signals
    • H04N13/172Processing image signals image signals comprising non-image signal components, e.g. headers or format information
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/60Network streaming of media packets
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/597Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding specially adapted for multi-view video sequence encoding
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/60Network streaming of media packets
    • H04L65/61Network streaming of media packets for supporting one-way streaming services, e.g. Internet radio
    • H04L65/612Network streaming of media packets for supporting one-way streaming services, e.g. Internet radio for unicast
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/60Network streaming of media packets
    • H04L65/70Media network packetisation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/60Network streaming of media packets
    • H04L65/75Media network packet handling
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/60Network streaming of media packets
    • H04L65/75Media network packet handling
    • H04L65/765Media network packet handling intermediate
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/10Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N13/106Processing image signals
    • H04N13/128Adjusting depth or disparity
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/10Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N13/106Processing image signals
    • H04N13/161Encoding, multiplexing or demultiplexing different image signal components
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/70Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by syntax aspects related to video coding, e.g. related to compression standards
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/21Server components or server architectures
    • H04N21/218Source of audio or video content, e.g. local disk arrays
    • H04N21/21805Source of audio or video content, e.g. local disk arrays enabling multiple viewpoints, e.g. using a plurality of cameras
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/236Assembling of a multiplex stream, e.g. transport stream, by combining a video stream with other content or additional data, e.g. inserting a URL [Uniform Resource Locator] into a video stream, multiplexing software data into a video stream; Remultiplexing of multiplex streams; Insertion of stuffing bits into the multiplex stream, e.g. to obtain a constant bit-rate; Assembling of a packetised elementary stream
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/236Assembling of a multiplex stream, e.g. transport stream, by combining a video stream with other content or additional data, e.g. inserting a URL [Uniform Resource Locator] into a video stream, multiplexing software data into a video stream; Remultiplexing of multiplex streams; Insertion of stuffing bits into the multiplex stream, e.g. to obtain a constant bit-rate; Assembling of a packetised elementary stream
    • H04N21/23605Creation or processing of packetized elementary streams [PES]
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/236Assembling of a multiplex stream, e.g. transport stream, by combining a video stream with other content or additional data, e.g. inserting a URL [Uniform Resource Locator] into a video stream, multiplexing software data into a video stream; Remultiplexing of multiplex streams; Insertion of stuffing bits into the multiplex stream, e.g. to obtain a constant bit-rate; Assembling of a packetised elementary stream
    • H04N21/23614Multiplexing of additional data and video streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/238Interfacing the downstream path of the transmission network, e.g. adapting the transmission rate of a video stream to network bandwidth; Processing of multiplex streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/25Management operations performed by the server for facilitating the content distribution or administrating data related to end-users or client devices, e.g. end-user or client device authentication, learning user preferences for recommending movies
    • H04N21/262Content or additional data distribution scheduling, e.g. sending additional data at off-peak times, updating software modules, calculating the carousel transmission frequency, delaying a video stream transmission, generating play-lists
    • H04N21/26258Content or additional data distribution scheduling, e.g. sending additional data at off-peak times, updating software modules, calculating the carousel transmission frequency, delaying a video stream transmission, generating play-lists for generating a list of items to be played back in a given order, e.g. playlist, or scheduling item distribution according to such list
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/81Monomedia components thereof
    • H04N21/816Monomedia components thereof involving special video data, e.g 3D video
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845Structuring of content, e.g. decomposing content into time segments
    • H04N21/8456Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/85Assembly of content; Generation of multimedia applications
    • H04N21/854Content authoring
    • H04N21/85406Content authoring involving a specific file format, e.g. MP4 format
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L2212/00Encapsulation of packets
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • The present application relates to the technical field of data processing, and in particular, to encapsulation and decapsulation technologies for media files.
  • Immersive media refers to media content that can bring users an immersive experience.
  • For example, media content that allows users to feel immersed in it through audio and video technologies is immersive media.
  • VR: Virtual Reality.
  • There are various application forms of immersive media. When the user side decapsulates, decodes, and renders immersive media of different application scenarios, the required operation steps and processing capabilities are different. However, related technologies currently cannot effectively distinguish the application scenarios corresponding to immersive media, which increases the difficulty of processing immersive media on the user side.
  • Embodiments of the present application provide a media file encapsulation method, a media file decapsulation method, a media file encapsulation apparatus, a media file decapsulation apparatus, an electronic device, and a computer-readable storage medium, which can distinguish different application scenarios corresponding to immersive media.
  • An embodiment of the present application provides a method for encapsulating a media file, which is executed by an electronic device.
  • The method includes: acquiring a media stream of target media content in its corresponding application scenario; encapsulating the media stream and generating an encapsulation file of the media stream, where the encapsulation file includes a first application scenario type field, and the first application scenario type field is used to indicate the application scenario corresponding to the media stream; and sending the encapsulation file to a first device, so that the first device determines the application scenario corresponding to the media stream according to the first application scenario type field, and determines at least one of a decoding mode and a rendering mode of the media stream according to the application scenario corresponding to the media stream.
  • An embodiment of the present application provides a method for decapsulating a media file, which is executed by an electronic device.
  • The method includes: receiving an encapsulation file of a media stream of target media content in its corresponding application scenario, where the encapsulation file includes a first application scenario type field, and the first application scenario type field is used to indicate the application scenario corresponding to the media stream; decapsulating the encapsulation file to obtain the first application scenario type field; determining the application scenario corresponding to the media stream according to the first application scenario type field; and determining at least one of a decoding mode and a rendering mode of the media stream according to the application scenario corresponding to the media stream.
  • An embodiment of the present application provides an apparatus for encapsulating a media file.
  • The apparatus includes: a media stream acquiring unit, configured to acquire a media stream of target media content in its corresponding application scenario; a media stream encapsulating unit, configured to encapsulate the media stream and generate an encapsulation file of the media stream, where the encapsulation file includes a first application scenario type field, and the first application scenario type field is used to indicate the application scenario corresponding to the media stream; and an encapsulation file sending unit, configured to send the encapsulation file to a first device, so that the first device determines the application scenario corresponding to the media stream according to the first application scenario type field, and determines at least one of a decoding mode and a rendering mode of the media stream according to the application scenario corresponding to the media stream.
  • An embodiment of the present application provides an apparatus for decapsulating a media file, the apparatus including: an encapsulation file receiving unit, configured to receive an encapsulation file of a media stream of target media content in its corresponding application scenario, where the encapsulation file includes a first application scenario type field, and the first application scenario type field is used to indicate the application scenario corresponding to the media stream; a file decapsulating unit, configured to decapsulate the encapsulation file to obtain the first application scenario type field; an application scenario obtaining unit, configured to determine the application scenario corresponding to the media stream according to the first application scenario type field; and a decoding and rendering determining unit, configured to determine at least one of a decoding mode and a rendering mode of the media stream according to the application scenario corresponding to the media stream.
  • Embodiments of the present application provide a computer-readable storage medium on which a computer program is stored, and the program, when executed by a processor, implements the media file encapsulation method or the media file decapsulation method described in the foregoing embodiments.
  • An embodiment of the present application provides an electronic device, including: at least one processor; and a storage device configured to store at least one program, where the at least one program, when executed by the at least one processor, causes the at least one processor to implement the media file encapsulation method or the media file decapsulation method described in the above embodiments.
  • Embodiments of the present application provide a computer program product, including instructions, which when run on a computer, cause the computer to execute the media file encapsulation method or the media file decapsulation method described in the foregoing embodiments.
  • A first application scenario type field is extended in the encapsulation file, and the first application scenario type field is used to indicate the application scenario corresponding to the media stream, thereby distinguishing, in the encapsulation of media files, the application scenarios corresponding to different media streams.
  • On the one hand, when the encapsulation file is sent to the first device, the first device can distinguish the application scenario of the media stream according to the first application scenario type field in the encapsulation file, and can thus determine, according to the application scenario corresponding to the media stream, which decoding method or rendering method to adopt for the media stream, which can save the computing power and resources of the first device. On the other hand, since the application scenario of the media stream can be determined in the encapsulation stage, even if the first device does not have the capability to decode the media stream, it can still determine the corresponding application scenario without waiting until the media stream is decoded to make the distinction.
  • Figure 1 schematically shows a schematic diagram of three degrees of freedom
  • Figure 2 schematically shows a schematic diagram of three degrees of freedom +
  • Figure 3 schematically shows a schematic diagram of six degrees of freedom
  • FIG. 4 schematically shows a flowchart of a method for encapsulating a media file according to an embodiment of the present application
  • FIG. 5 schematically shows a flowchart of a method for encapsulating a media file according to an embodiment of the present application
  • FIG. 6 schematically shows a flowchart of a method for encapsulating a media file according to an embodiment of the present application
  • FIG. 7 schematically shows a schematic diagram of a top-bottom splicing manner of six-degree-of-freedom media according to an embodiment of the present application
  • FIG. 8 schematically shows a schematic diagram of a left-right splicing manner of six-degree-of-freedom media according to an embodiment of the present application
  • FIG. 9 schematically shows a schematic diagram of a 1/4 resolution stitching manner of a six-degree-of-freedom media depth map according to an embodiment of the present application.
  • FIG. 10 schematically shows a flowchart of a method for encapsulating a media file according to an embodiment of the present application
  • FIG. 11 schematically shows a flowchart of a method for encapsulating a media file according to an embodiment of the present application
  • FIG. 12 schematically shows a flowchart of a method for encapsulating a media file according to an embodiment of the present application
  • FIG. 13 schematically shows a schematic diagram of a first multi-view video top and bottom splicing method according to an embodiment of the present application
  • FIG. 14 schematically shows a schematic diagram of a second multi-view video top and bottom splicing manner according to an embodiment of the present application
  • FIG. 16 schematically shows a block diagram of an apparatus for encapsulating a media file according to an embodiment of the present application
  • FIG. 17 schematically shows a block diagram of an apparatus for decapsulating a media file according to an embodiment of the present application
  • FIG. 18 shows a schematic structural diagram of an electronic device suitable for implementing the embodiments of the present application.
  • Example embodiments will now be described more fully with reference to the accompanying drawings.
  • Example embodiments can be embodied in various forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this application will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art.
  • Point Cloud is a set of discrete points that are randomly distributed in space and express the spatial structure and surface properties of a three-dimensional object or scene.
  • a point cloud refers to a collection of massive three-dimensional points.
  • Each point in the point cloud has at least three-dimensional position information, and may also have color, material, or other information (such as reflectance and other additional attributes) depending on the application scenario.
  • Usually, each point in a point cloud has the same number of additional attributes.
  • A point cloud obtained according to the principle of laser measurement includes three-dimensional coordinates (XYZ) and laser reflection intensity (reflectance).
  • A point cloud obtained according to the principle of photogrammetry includes three-dimensional coordinates (XYZ) and color information (RGB, red, green and blue).
  • Point clouds can be divided into two categories according to their uses: machine perception point clouds and human eye perception point clouds. Machine perception point clouds can be used in scenarios such as autonomous navigation systems, real-time inspection systems, geographic information systems, visual picking robots, and emergency rescue robots; human eye perception point clouds can be used in scenarios such as digital cultural heritage, free-viewpoint broadcasting, 3D immersive communication, and 3D immersive interaction.
  • Point clouds can be divided into three categories according to the acquisition method: static point clouds, dynamic point clouds, and dynamically acquired point clouds. The first type, static point clouds: the object is stationary, and the device that acquires the point cloud is also stationary. The second type, dynamic point clouds: the object is moving, but the device that acquires the point cloud is stationary. The third type, dynamically acquired point clouds: the device that acquires the point cloud is moving.
  • PCC: Point Cloud Compression.
  • A point cloud is a collection of a massive number of points. Such point cloud data not only consumes a large amount of storage memory but is also unfavorable for transmission. Related technologies do not have bandwidth large enough to support transmitting point clouds directly at the network layer without compression, so it is necessary to compress point clouds.
  • G-PCC: Geometry-based Point Cloud Compression, i.e., point cloud compression based on geometric features.
  • G-PCC can be used to compress the first type of static point cloud and the third type of dynamically acquired point cloud.
  • The correspondingly obtained point cloud media can be called point cloud media compressed based on geometric features, abbreviated as G-PCC point cloud media.
  • V-PCC: Video-based Point Cloud Compression, i.e., point cloud compression based on traditional video coding.
  • V-PCC can be used to compress the second type of dynamic point cloud, and the corresponding obtained point cloud media can be called point cloud media compressed based on traditional video coding methods, referred to as V-PCC point cloud media for short.
  • Sample: an encapsulation unit in the media file encapsulation process.
  • a media file consists of many samples. Taking the media file as video media as an example, a sample of the video media is usually a video frame.
  • DoF: Degree of Freedom.
  • In a mechanical system, it refers to the number of independent coordinates; in addition to translational degrees of freedom, there are also rotational and vibrational degrees of freedom.
  • In the embodiments of the present application, the degree of freedom refers to the freedom of movement and content interaction supported when a user views immersive media.
  • 3DoF: three degrees of freedom, referring to the three degrees of freedom of the user's head rotating around the X, Y, and Z axes.
  • Figure 1 schematically shows a schematic diagram of three degrees of freedom. As shown in Figure 1, the user can rotate at a fixed place or point around all three axes: turning the head, nodding the head up and down, or tilting the head from side to side. Through a three-degree-of-freedom experience, the user can immerse themselves in a scene in 360 degrees. If the scene is static, it can be understood as a panoramic picture; if the panoramic picture is dynamic, it is a panoramic video, that is, a VR video. However, three-degree-of-freedom VR video has certain limitations: the user cannot move and cannot choose any place from which to watch.
  • 3DoF+: on the basis of three degrees of freedom, the user also has limited freedom of movement along the X, Y, and Z axes; this can also be called restricted six degrees of freedom, and the corresponding media stream can be called a restricted six-degree-of-freedom media stream.
  • Figure 2 schematically shows a schematic diagram of the three degrees of freedom +.
  • 6DoF: on the basis of three degrees of freedom, the user also has the freedom to move freely along the X, Y, and Z axes, and the corresponding media stream can be called a six-degree-of-freedom media stream.
  • Figure 3 schematically shows a schematic diagram of six degrees of freedom.
  • 6DoF media refers to six-degree-of-freedom video, which means that the video can provide users with a high-degree-of-freedom viewing experience in which the user can freely move the viewpoint along the X, Y, and Z axes of the three-dimensional space, and freely rotate the viewpoint around the X, Y, and Z axes.
  • 6DoF media is a combination of videos from different perspectives in the corresponding space captured by the camera array.
  • The 6DoF media data is expressed as a combination of the following information: texture maps captured by multiple cameras, depth maps corresponding to the texture maps of the multiple cameras, and corresponding 6DoF media content description metadata, where the metadata includes the parameters of the multiple cameras and description information such as the splicing layout and edge protection of the 6DoF media.
  • the texture map information of the multi-camera and the corresponding depth map information are spliced, and the description data obtained by splicing is written into the metadata according to the defined syntax and semantics.
  • the spliced multi-camera depth map and texture map information is encoded by plane video compression, and then transmitted to the terminal for decoding to synthesize the 6DoF virtual viewpoint requested by the user, thereby providing the user with a viewing experience of 6DoF media.
  • Volumetric media is a type of immersive media, which can include volumetric video, for example.
  • Volumetric video is a three-dimensional data representation. Since current mainstream coding is based on two-dimensional video data, the original volumetric video data needs to be converted from a three-dimensional representation to a two-dimensional representation before being encoded, encapsulated at the system layer, and transmitted. In the process of presenting volumetric video content, the two-dimensionally represented data needs to be converted back into three-dimensional data to represent the finally presented volumetric video. How the volumetric video is represented in the 2D plane directly affects the system-layer encapsulation, transmission, and final rendering of the volumetric video content.
  • Atlas: indicates the region information on the 2D (two-dimensional) plane frame, the region information in the 3D (three-dimensional) rendering space, the mapping relationship between the two, and the necessary parameter information required for the mapping.
  • An atlas includes a set of patches and the associated information indicating the region in the three-dimensional space of the volumetric data to which each patch corresponds.
  • a patch is a rectangular area in the atlas, which is associated with the volume information of the three-dimensional space.
  • The component data of the two-dimensional representation of the volumetric video is processed to generate patches: the two-dimensional plane area where the two-dimensional representation of the volumetric video is located is divided into multiple rectangular regions of different sizes according to the position of the volumetric video represented in the geometric component data.
  • One rectangular region is one patch, and a patch contains the information necessary for back-projecting the rectangular region into three-dimensional space; the patches are packed to generate an atlas by placing the patches into a two-dimensional grid while ensuring that the valid parts of the patches do not overlap.
  • The patches generated from a volumetric video can be packed into one or more atlases.
  • the corresponding geometric data, attribute data, and placeholder data are generated based on the atlas data, and the final representation of the volume video on the two-dimensional plane is generated according to the combination of the atlas data, the geometry data, the attribute data, and the placeholder data.
  • The geometry component is required, the placeholder component is conditionally required, and the attribute component is optional.
  • AVS: Audio Video Coding Standard.
  • ISOBMFF: ISO Based Media File Format, a media file format based on ISO (International Organization for Standardization) standards.
  • ISOBMFF is an encapsulation standard for media files; the most typical ISOBMFF file is an MP4 (Moving Picture Experts Group 4) file.
  • Depth map: a way of expressing 3D scene information, in which the gray value of each pixel of the depth map represents the distance of a point in the scene from the camera.
  • The method for encapsulating a media file provided in the embodiments of the present application may be executed by any electronic device, and the present application is not limited in this regard.
  • FIG. 4 schematically shows a flowchart of a method for encapsulating a media file according to an embodiment of the present application. As shown in FIG. 4 , the method provided by this embodiment of the present application may include the following steps.
  • In step S410, a media stream of the target media content in its corresponding application scenario is obtained.
  • the target media content may be any one or a combination of video, audio, image, etc.
  • In the following, video is used as an example for illustration, but the present application is not limited to this.
  • The media stream may include a six-degree-of-freedom (6DoF) media stream and a restricted six-degree-of-freedom (3DoF+) media stream that can be rendered in a three-dimensional space; in the following, a 6DoF media stream is used as an example for illustration.
  • the methods provided in the embodiments of the present application may be applicable to applications such as 6DoF media content recording and broadcasting, on-demand, live broadcasting, communication, program editing, and production.
  • Immersive media can be divided into 3DoF media, 3DoF+ media and 6DoF media according to the degree of freedom that users can support when consuming target media content.
  • the 6DoF media can include multi-view video and point cloud media.
  • Point cloud media can be divided into point cloud media compressed based on traditional video coding methods (i.e., V-PCC) and point cloud media compressed based on geometric features (i.e., G-PCC).
  • Multi-view video is usually shot by a camera array from multiple angles (also called views) of the same scene, forming texture maps that include the texture information (color information, etc.) of the scene and depth maps that include the depth information (spatial distance information, etc.) of the scene.
  • The texture maps and depth maps, together with the mapping information from the 2D plane frame to the 3D presentation space, constitute 6DoF media that can be consumed on the user side.
  • The coding of multi-view video and V-PCC follows one set of rules, while the coding of G-PCC follows another set of rules; the coding standards of the two are different, and accordingly, the decoding processing methods are also different.
  • In addition, multi-view video requires texture maps and depth maps, whereas V-PCC may further require placeholder maps besides these, which is also a difference between them.
  • In step S420, the media stream is encapsulated, and an encapsulation file of the media stream is generated, where the encapsulation file includes a first application scenario type field, and the first application scenario type field is used to indicate the application scenario corresponding to the media stream.
  • the embodiments of the present application can distinguish different application scenarios of 6DoF media.
  • In step S430, the encapsulation file is sent to the first device, so that the first device determines the application scenario corresponding to the media stream according to the first application scenario type field, and determines at least one of a decoding mode and a rendering mode of the media stream according to the application scenario corresponding to the media stream.
  • the first device may be any intermediate node, or may be any user terminal that consumes the media stream, which is not limited in this application.
  • The method for encapsulating a media file provided by the embodiments of the present application extends a first application scenario type field in the encapsulation file when generating the encapsulation file of a media stream in its corresponding application scenario, and indicates, through the first application scenario type field, the application scenario corresponding to the media stream, so that the different application scenarios of different media streams can be distinguished in the encapsulation of media files.
  • On the one hand, when the encapsulation file is sent to the first device, the first device can distinguish the application scenario of the media stream according to the first application scenario type field in the encapsulation file, and can thus determine, according to the application scenario corresponding to the media stream, which decoding mode and/or rendering mode to adopt for the media stream, so that the computing power and resources of the first device can be saved. On the other hand, since the application scenario of the media stream can be determined in the encapsulation stage, even if the first device does not have the capability to decode the media stream, it can still determine the application scenario corresponding to the media stream, without having to wait until the media stream is decoded to make the distinction.
  • FIG. 5 schematically shows a flowchart of a method for encapsulating a media file according to an embodiment of the present application. As shown in FIG. 5 , the method provided by this embodiment of the present application may include the following steps.
  • For step S410 in the embodiment of FIG. 5, reference may be made to the foregoing embodiment.
  • step S420 in the above-mentioned embodiment of FIG. 4 may further include the following steps.
  • In step S421, the first application scenario type field is added to the volumetric visual media header data box of the target media file format data box (for example, the Volumetric Visual Media Header Box exemplified below).
  • In the embodiments of the present application, the extended ISOBMFF data box (as the target media file format data box) is used as an example for illustration, but the present application is not limited to this.
  • In step S422, the value of the first application scenario type field is determined according to the application scenario corresponding to the media stream.
  • The value of the first application scenario type field may be any one of the following: a first value (for example, "0") indicating that the media stream is a multi-view video with non-large-scale atlas information; a second value (for example, "1") indicating that the media stream is a multi-view video with large-scale atlas information; a third value (for example, "2") indicating that the media stream is point cloud media compressed based on traditional video coding methods; or a fourth value (for example, "3") indicating that the media stream is point cloud media compressed based on geometric features.
  • The value of the first application scenario type field is not limited to indicating the above application scenarios; it may indicate more or fewer application scenarios and may be set according to actual requirements.
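  • As a purely illustrative sketch (not the normative definition given later in this application; the box type 'vvhd' and the 8-bit field width are assumptions here), the extension of the volumetric visual media header data box with the application_type field could look roughly as follows in the ISOBMFF syntax description language:
    aligned(8) class VolumetricVisualMediaHeaderBox extends FullBox('vvhd', version = 0, 1) {
        unsigned int(8) application_type;  // first application scenario type field:
                                           // 0: multi-view video (non-large-scale atlas information)
                                           // 1: multi-view video (large-scale atlas information)
                                           // 2: point cloud media compressed based on traditional video coding (V-PCC)
                                           // 3: point cloud media compressed based on geometric features (G-PCC)
    }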
  • For step S430 in the embodiment of FIG. 5, reference may be made to the foregoing embodiments.
  • In this way, the media file encapsulation method provided by the embodiments of the present application can enable the first device that consumes the 6DoF media to perform policy selection in the decapsulation, decoding, and rendering of the 6DoF media in a targeted manner.
  • FIG. 6 schematically shows a flowchart of a method for encapsulating a media file according to an embodiment of the present application. As shown in FIG. 6 , the method provided by this embodiment of the present application may include the following steps.
  • For step S410 in the embodiment of FIG. 6, reference may be made to the foregoing embodiment.
  • In the embodiment of FIG. 6, step S420 in the above embodiment may further include the following step S4221; that is, during encapsulation, it is indicated through the first application scenario type field that the media stream is a multi-view video with large-scale atlas information.
  • In step S4221, the media stream is encapsulated, and an encapsulation file of the media stream is generated, where the encapsulation file includes a first application scenario type field (for example, application_type exemplified below), and the value of the first application scenario type field is a second value indicating that the media stream is a multi-view video with large-scale atlas information.
  • For multi-view video, the mapping information from its 2D plane frame to the 3D rendering space determines the 6DoF experience of the multi-view video, and there are two methods for indicating this mapping relationship.
  • One method defines an atlas to divide the 2D plane area in a more detailed manner, and then indicates the mapping relationship between these sets of small 2D areas and the 3D space; this is called non-large-scale atlas information, and the corresponding multi-view video is a multi-view video with non-large-scale atlas information.
  • The other method is coarser: taking the acquisition device (the camera is used as an example throughout) as the unit, it identifies the depth map and texture map generated by each camera, and restores the mapping relationship of the corresponding 2D depth map and texture map to the 3D space according to the parameters of each camera; this is called large-scale atlas information, and the corresponding multi-view video is a multi-view video with large-scale atlas information. It can be understood that large-scale atlas information and non-large-scale atlas information here are relative terms and do not directly limit specific sizes.
  • the camera parameters are usually divided into external parameters and internal parameters of the camera.
  • the external parameters usually include information such as the position and angle of the camera
  • the internal parameters usually include the position of the optical center of the camera, the focal length and other information.
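  • As a purely hypothetical illustration (this structure and these field names are not defined in this application), such camera parameters could be grouped roughly as follows:
    aligned(8) class HypotheticalCameraParameters {  // hypothetical structure, for illustration only
        // extrinsic parameters: camera position and orientation
        signed int(32) camera_pos_x;
        signed int(32) camera_pos_y;
        signed int(32) camera_pos_z;
        signed int(32) camera_yaw;
        signed int(32) camera_pitch;
        signed int(32) camera_roll;
        // intrinsic parameters: optical center position and focal length
        unsigned int(32) optical_center_x;
        unsigned int(32) optical_center_y;
        unsigned int(32) focal_length_x;
        unsigned int(32) focal_length_y;
    }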
  • That is, the multi-view video in 6DoF media can further include multi-view video with large-scale atlas information and multi-view video with non-large-scale atlas information; the application forms of 6DoF media are therefore various, and when the user side decapsulates, decodes, and renders 6DoF media of different application scenarios, the required operation steps and processing capabilities are different.
  • the difference between large-scale atlas information and non-large-scale atlas information lies in the granularity of mapping and rendering from 2D areas to 3D space.
  • For example, if the large-scale atlas information corresponds to a 6-piece 2D puzzle mapped into 3D space, the non-large-scale atlas information might correspond to a 60-piece puzzle mapped into 3D space. The complexity of the two mapping algorithms is then necessarily different, and the algorithm for large-scale atlas information is simpler than the algorithm for non-large-scale atlas information.
  • If the mapping relationship from 2D areas to 3D space is obtained from the camera parameters, that is, the media stream is a multi-view video with large-scale atlas information, then it is not necessary to further define in the encapsulation file the mapping relationship of smaller 2D areas to the 3D space.
  • the method may further include the following steps.
  • In step S601, if the media stream is encapsulated according to a single track, a large-scale atlas flag (for example, large_scale_atlas_flag exemplified below) is added to the bitstream sample entry of the target media file format data box (for example, the V3CBitstreamSampleEntry exemplified below, but the application is not limited to this).
  • In step S602, if the large-scale atlas flag indicates that the media stream is a multi-view video with large-scale atlas information, an identifier of the number of cameras that collected the media stream (for example, camera_count exemplified below) and an identifier of the number of views corresponding to the cameras included in the current file of the media stream (for example, camera_count_contained exemplified below) are added to the bitstream sample entry.
  • In step S603, the resolutions of the texture maps and depth maps collected from the views corresponding to the cameras included in the current file (for example, camera_resolution_x and camera_resolution_y exemplified below) are added to the bitstream sample entry.
  • the method further includes at least one of the following steps S604-S607.
  • In step S604, the downsampling factor of the depth map collected from the view corresponding to a camera included in the current file (for example, depth_downsample_factor exemplified below) is added to the bitstream sample entry.
  • In step S605, the offset of the upper-left vertex of the texture map collected from the view corresponding to a camera included in the current file, relative to the origin of the plane frame in the large-scale atlas information (for example, texture_vetex_x and texture_vetex_y exemplified below), is added to the bitstream sample entry.
  • In step S606, the offset of the upper-left vertex of the depth map collected from the view corresponding to a camera included in the current file, relative to the origin of the plane frame in the large-scale atlas information (for example, depth_vetex_x and depth_vetex_y exemplified below), is added to the bitstream sample entry.
  • In step S607, the guard band widths of the texture maps and depth maps collected from the views corresponding to the cameras included in the current file (for example, padding_size_texture and padding_size_depth exemplified below) are added to the bitstream sample entry.
  • padding_size_texture and padding_size_depth respectively define the size of the edge protection area of each texture map and each depth map, in order to protect areas with abrupt edge changes when compressing the spliced image.
  • the values of padding_size_texture and padding_size_depth indicate the width of the edge protection area of the texture map and depth map. If padding_size_texture and padding_size_depth are equal to 0, it means that there is no edge protection.
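  • As an illustrative sketch only (the field bit widths and the per-view loop below are assumptions; the syntax actually used in this application is given later), the single-track bitstream sample entry carrying these fields could take roughly the following form:
    aligned(8) class V3CBitstreamSampleEntry() extends VolumetricVisualSampleEntry('v3e1') {
        V3CConfigurationBox config;
        unsigned int(8) large_scale_atlas_flag;            // 1: large-scale atlas information
        if (large_scale_atlas_flag == 1) {
            unsigned int(8) camera_count;                  // number of cameras that collected the media stream
            unsigned int(8) camera_count_contained;        // number of camera views contained in the current file
            unsigned int(8) padding_size_depth;            // guard band width of each depth map
            unsigned int(8) padding_size_texture;          // guard band width of each texture map
            for (i = 0; i < camera_count_contained; i++) {
                unsigned int(32) camera_resolution_x;      // texture/depth map resolution of view i
                unsigned int(32) camera_resolution_y;
                unsigned int(8)  depth_downsample_factor;  // downsampling factor of the depth map of view i
                unsigned int(32) texture_vetex_x;          // offset of the texture map's upper-left vertex
                unsigned int(32) texture_vetex_y;          //   relative to the plane-frame origin
                unsigned int(32) depth_vetex_x;            // offset of the depth map's upper-left vertex
                unsigned int(32) depth_vetex_y;            //   relative to the plane-frame origin
            }
        }
    }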
  • the method may further include the following steps.
  • In step S608, if the media stream is encapsulated according to multiple tracks, a large-scale atlas flag is added to the sample entry of the target media file format data box.
  • In step S609, if the large-scale atlas flag indicates that the media stream is a multi-view video with large-scale atlas information, an identifier of the number of cameras that collected the media stream and an identifier of the number of views corresponding to the cameras included in the current file of the media stream are added to the sample entry.
  • In step S610, the resolutions of the texture maps and depth maps collected from the views corresponding to the cameras included in the current file are added to the sample entry.
  • the method further includes at least one of the following steps S611-S614.
  • In step S611, the downsampling factor of the depth map collected from the view corresponding to a camera included in the current file is added to the sample entry.
  • In step S612, the offset of the upper-left vertex of the texture map collected from the view corresponding to a camera included in the current file, relative to the origin of the plane frame in the large-scale atlas information, is added to the sample entry.
  • In step S613, the offset of the upper-left vertex of the depth map collected from the view corresponding to a camera included in the current file, relative to the origin of the plane frame in the large-scale atlas information, is added to the sample entry.
  • In step S614, the guard band widths of the texture maps and depth maps collected from the views corresponding to the cameras included in the current file are added to the sample entry.
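  • Under the same assumptions as the single-track sketch above, the multi-track sample entry (the V3CSampleEntry with type 'v3c1' shown later in this application) would carry the same set of fields, for example:
    aligned(8) class V3CSampleEntry() extends VolumetricVisualSampleEntry('v3c1') {
        V3CConfigurationBox config;
        unsigned int(8) large_scale_atlas_flag;
        if (large_scale_atlas_flag == 1) {
            // per-view fields as in the single-track case: camera_count, camera_count_contained,
            // padding_size_depth, padding_size_texture, camera_resolution_x/y,
            // depth_downsample_factor, texture_vetex_x/y, depth_vetex_x/y
        }
    }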
  • For step S430 in the embodiment of FIG. 6, reference may be made to the above-mentioned embodiment.
  • A six_dof_stitching_layout field may be used to indicate the stitching manner of the depth maps and texture maps collected from the views corresponding to the cameras in the 6DoF media, that is, to identify the texture map and depth map stitching layout of the 6DoF media.
  • The values of six_dof_stitching_layout and the corresponding 6DoF media stitching layouts are as follows:
    0: top-bottom stitching of depth maps and texture maps
    1: left-right stitching of depth maps and texture maps
    2: depth map 1/4 downsampling stitching
    other: reserved
  • FIG. 7 schematically shows a schematic diagram of a way of splicing up and down media with six degrees of freedom according to an embodiment of the present application.
  • When the 6DoF media stitching mode is top-bottom stitching, as shown in Figure 7, the texture maps collected by the multiple cameras (for example, the view 1 texture map, view 2 texture map, view 3 texture map, and view 4 texture map in Figure 7) are arranged in order in the upper part of the image, and the corresponding depth maps (for example, the view 1 depth map, view 2 depth map, view 3 depth map, and view 4 depth map in Figure 7) are arranged in order in the lower part of the image.
  • Subsequently, the reconstruction module can use the values of camera_resolution_x and camera_resolution_y to calculate the layout position of the texture map and depth map corresponding to each camera, so as to further use the texture map and depth map information of the multiple cameras for reconstruction of the 6DoF media.
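  • Purely for illustration (assuming here that all views share the same resolution and are arranged from left to right, which this application does not mandate), the layout position of the i-th view in the top-bottom stitching mode could be derived roughly as follows:
    for (i = 0; i < camera_count_contained; i++) {
        texture_pos_x[i] = i * camera_resolution_x;  // texture maps occupy the upper part of the image
        texture_pos_y[i] = 0;
        depth_pos_x[i]   = i * camera_resolution_x;  // depth maps occupy the lower part of the image
        depth_pos_y[i]   = camera_resolution_y;
    }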
  • FIG. 8 schematically shows a schematic diagram of a left-right splicing manner of 6-DOF media according to an embodiment of the present application.
  • When the 6DoF media stitching mode is left-right stitching, as shown in Figure 8, the texture maps collected by the multiple cameras (for example, the view 1 texture map, view 2 texture map, view 3 texture map, and view 4 texture map in Figure 8) are arranged in order on the left side of the image, and the corresponding depth maps (for example, the view 1 depth map, view 2 depth map, view 3 depth map, and view 4 depth map in Figure 8) are arranged in order on the right side of the image.
  • FIG. 9 schematically shows a schematic diagram of a 1/4 resolution stitching manner of a six-degree-of-freedom media depth map according to an embodiment of the present application.
  • When the 6DoF media stitching mode is depth map 1/4 downsampling stitching, as shown in Figure 9, the depth maps collected by the multiple cameras (for example, the view 1 depth map, view 2 depth map, view 3 depth map, and view 4 depth map in Figure 9) are down-sampled to 1/4 resolution, and the down-sampled depth maps are then spliced to the lower right of the texture maps (for example, the view 1 texture map, view 2 texture map, view 3 texture map, and view 4 texture map in Figure 9). If the spliced depth maps cannot fill the rectangular area of the final stitched image, the remaining part is filled with an empty image.
  • The media file encapsulation method provided by the embodiments of the present application can not only distinguish the application scenarios of different 6DoF media, but also enable the first device that consumes the 6DoF media to perform policy selection in the decapsulation, decoding, and rendering of the 6DoF media in a targeted manner. Further, for multi-view video applications in 6DoF media, a method for indicating information related to the depth maps and texture maps of multi-view video in the file encapsulation is proposed, so that the packaging and combination of the depth maps and texture maps of different views of the multi-view video is more flexible.
  • The method may further include: generating a target description file of the target media content, where the target description file includes a second application scenario type field, and the second application scenario type field is used to indicate the application scenario corresponding to the media stream; and sending the target description file to the first device, so that the first device determines, according to the second application scenario type field, the target encapsulation file corresponding to the target media stream from the encapsulation files of the media streams.
  • Sending the encapsulation file to the first device so that the first device determines the application scenario corresponding to the media stream according to the first application scenario type field may include: sending the target encapsulation file to the first device, so that the first device determines the target application scenario corresponding to the target media stream according to the first application scenario type field in the target encapsulation file.
  • FIG. 10 schematically shows a flowchart of a method for encapsulating a media file according to an embodiment of the present application. As shown in FIG. 10 , the method provided by this embodiment of the present application may include the following steps.
  • For steps S410-S420 in the embodiment of FIG. 10, reference may be made to the foregoing embodiments; the method may further include the following steps.
  • In step S1010, a target description file of the target media content is generated, where the target description file includes a second application scenario type field (for example, v3cAppType in the following example), and the second application scenario type field indicates the application scenario corresponding to the media stream.
  • the fields at the signaling transmission level can also be expanded.
  • DASH: Dynamic Adaptive Streaming over HTTP (HyperText Transfer Protocol).
  • MPD: Media Presentation Description, a media presentation description file.
  • For example, in the form of DASH MPD signaling (as the target description file), the application scenario type indication of 6DoF media and the indication of large-scale atlas information are defined.
  • In step S1020, the target description file is sent to the first device, so that the first device determines, according to the second application scenario type field, the target encapsulation file of the target media stream from the encapsulation files of the media streams.
  • In step S1030, the target encapsulation file is sent to the first device, so that the first device determines the target application scenario of the target media stream according to the first application scenario type field in the target encapsulation file.
  • The media file encapsulation method provided by the embodiments of the present application can not only identify the application scenario corresponding to the media stream in the encapsulation file through the first application scenario type field, but can also identify the application scenario corresponding to the media stream in the target description file through the second application scenario type field.
  • In this way, the first device can first determine, according to the second application scenario type field in the target description file, what kind of media stream it needs to obtain, so that it can request the corresponding target media stream from the server, thereby reducing the amount of data transmission and ensuring that the requested target media stream matches the actual capabilities of the first device.
  • After receiving the target encapsulation file, the first device can further determine, according to the first application scenario type field in the target encapsulation file, the target application scenario of the target media stream, so as to know which decoding and rendering methods to adopt, which saves computing resources.
  • FIG. 11 schematically shows a flowchart of a method for encapsulating a media file according to an embodiment of the present application. As shown in FIG. 11 , the method provided by this embodiment of the present application may include the following steps.
  • Steps S410-S420 in the embodiment of FIG. 11 may refer to the above-mentioned embodiment.
  • step S1010 in the embodiment of FIG. 10 may further include the following steps.
  • step S1011 the second application scenario type field is added to the target description file of the hypertext transfer protocol-based dynamic adaptive streaming media transmission of the target media content.
  • step S1012 the value of the second application scenario type field is determined according to the application scenario corresponding to the media stream.
  • Steps S1020 and S1030 in the embodiment of FIG. 11 may refer to the above-mentioned embodiments.
  • the method proposed in this embodiment of the present application can be used for 6DoF media application scenario indication, and can include the following steps:
  • mapping from the 2D plane frame to the 3D space is based on the output of the captured camera, that is, the mapping from the 2D plane frame to the 3D space is based on the texture map and
  • the depth map is mapped in units, which is called large-scale atlas information; if it is necessary to further divide the texture map and depth map collected by each camera in a more detailed manner, indicating that the divided 2D small areas are assembled into 3D space
  • the mapping is called non-large-scale atlas information.
  • mapping of the multi-view video from the 2D plane frame to the 3D space is performed in units of the output of the acquisition camera, the relevant information output by different acquisition cameras is indicated in the package file.
  • several descriptive fields may be added at the system layer, which may include field expansion at the file encapsulation level and field expansion at the signaling transmission level, so as to support the above steps in this embodiment of the present application.
  • the following is an example in the form of extended ISOBMFF data box and DASH MPD signaling, which defines the application type indication of 6DoF media and the indication of large-scale atlas, as follows (the extension part is marked with italics).
  • V3CBitstreamSampleEntry()extends VolumetricVisualSampleEntry('v3e1') ⁇ //Because 6DoF media can be encapsulated according to single track or multiple tracks when encapsulating, this structure corresponds to the case of single track.
  • V3CConfigurationBox config
  • V3CSampleEntry extends VolumemetricVisualSampleEntry('v3c1') ⁇ //This structure corresponds to the multi-track case.
  • V3CConfigurationBox config
  • the first application scenario type field application_type indicates the application scenario type of the 6DoF media, and the specific value includes but is not limited to the content shown in Table 2 below:
  • Multi-view video non-large-scale atlas information
  • Multi-view video large-scale atlas information
  • Point cloud media compressed based on traditional video coding methods 3 Compressed point cloud media based on geometric features
  • the large-scale atlas flag large_scale_atlas_flag indicates whether the atlas information is large-scale atlas information, that is, whether the atlas information can be obtained only through relevant information such as camera parameters. Atlas information), when it is equal to 0, it indicates multi-view video (non-large-scale atlas information).
  • the first application scene type field application_type can already indicate whether it is a multi-view video with large-scale atlas information.
  • large_scale_atlas_flag is added to facilitate analysis. Only one is needed for use, but because it is uncertain which field will be used, the information here is redundant.
  • camera_count is used to indicate the number of all cameras that collect 6DoF media, which is called the identifier of the number of cameras that collect the media stream.
  • the value of camera_number ranges from 1 to 255.
  • camera_count_contained is used to indicate the number of viewing angles corresponding to the cameras included in the current file of the 6DoF media, which is referred to as the identification of the number of viewing angles corresponding to the cameras included in the current file.
  • padding_size_depth represents the width of the guard band of the depth map.
  • padding_size_texture The width of the guard band of the texture map.
  • camera_id represents the camera identifier corresponding to each view angle.
  • camera_resolution_x, camera_resolution_y represent the resolution width and height of the texture map and depth map collected by the camera, and represent the resolution in the X and Y directions collected by the corresponding camera, respectively.
  • depth_downsample_factor represents the downsampling factor of the corresponding depth map, and the actual resolution width and height of the depth map are 1/2 depth_downsample_factor of the width and height of the camera acquisition resolution.
  • depth_vetex_x, depth_vetex_y respectively represent the X, Y component values in the offset of the upper left vertex of the corresponding depth map relative to the origin of the plane frame (the upper left vertex of the plane frame).
  • texture_vetex_x, texture_vetex_y respectively represent the X, Y component values in the offset of the upper left vertex of the corresponding texture map relative to the origin of the plane frame (the upper left vertex of the plane frame).
  • the second application scenario type field v3cAppType may be extended in the table shown in Table 3 below in the DASH MPD signaling.
  • the above system description corresponds to the data structure of each area of the plane frame in FIG. 7 .
  • the above system description corresponds to the data structure of each area of the plane frame in FIG. 8 .
  • the atlas information of the multi-view video A is large-scale atlas information.
  • the above system description corresponds to the data structure of each area of the plane frame in FIG. 9 .
  • padding_size_depth and padding_size_texture do not have an absolute value range, and different values have no influence on the method provided by this embodiment of the present application.
  • This solution only indicates the size of padding_size_depth and padding_size_texture.
  • the sizes of padding_size_depth and padding_size_texture are like this, it is determined by an encoding algorithm and has nothing to do with the method provided by this embodiment of the present application.
  • camera_resolution_x and camera_resolution_y are used to calculate the actual resolution width and height of the depth map, which is the resolution of each camera.
  • the multi-view video is shot by multiple cameras, and the resolutions of different cameras can be different.
  • the resolution width and height of the viewing angle are both exemplified as 100 pixels, which are only taken for the convenience of the example, and are not actually limited to this.
  • each area of the multi-view video plane frame can be compared with the texture maps of different cameras, The depth map corresponds. Then, by decoding the camera parameter information in the media stream of the multi-view video, each area of the plane frame can be restored to the 3D rendering area, so as to consume the multi-view video.
  • the server delivers the target description file corresponding to the MPD signaling to the client installed on the first device.
  • the client When the client receives the target description file corresponding to the MPD signaling sent by the server, it requests the target encapsulation file of the target media stream of the corresponding application scenario type according to the capabilities and presentation requirements of the client device. It is assumed that the processing capability of the client of the first device is low, so the client requests the target package file of the multi-view video A.
  • the server sends the target package file of the multi-view video A to the client of the first device.
  • the client of the first device After receiving the target package file of the multi-view video A sent by the server, the client of the first device determines the application scene type of the current 6DoF media file according to the application_type field in the VolumetricVisualMediaHeaderBox data box, and can then process accordingly. Different application scenarios will have different decoding and rendering processing algorithms.
  • the client can use a relatively simple processing algorithm for the multi-view video. to be processed.
  • acquiring the media stream of the target media content in its corresponding application scenario may include: receiving the first package file of the first multi-view video sent by the second device and the first package file sent by the third device. the second encapsulated file of the second multi-view video; decapsulate the first encapsulated file and the second encapsulated file respectively to obtain the first multi-view video and the second multi-view video; decode the first multi-view video respectively a multi-view video and the second multi-view video, obtain a first depth map and a first texture map in the first multi-view video, and a second depth map and a second depth map in the second multi-view video a texture map; obtaining a combined multi-view video according to the first depth map, the second depth map, the first texture map and the second texture map.
  • a first number of cameras may be installed on the second device, a second number of cameras may be installed on the third device, and the second device and the third device use their respective cameras to target the same
  • the scene is subjected to multi-view video capture and shooting, and the first multi-view video and the second multi-view video are obtained.
  • both the first package file and the second package file may include the first application scenario type field, and the first application scenario type field in the first package file and the second package file
  • the value of is used to indicate that the first multi-view video and the second multi-view video are the second values of the multi-view videos of large-scale atlas information.
  • FIG. 12 schematically shows a flowchart of a method for encapsulating a media file according to an embodiment of the present application. As shown in FIG. 12 , the method provided by this embodiment of the present application may include the following steps.
  • step S1210 the first package file of the first multi-view video sent by the second device and the second package file of the second multi-view video sent by the third device are received.
  • step S1220 the first encapsulated file and the second encapsulated file are decapsulated, respectively, to obtain the first multi-view video and the second multi-view video.
  • step S1230 the first multi-view video and the second multi-view video are decoded respectively to obtain a first depth map and a first texture map in the first multi-view video, and the second multi-view video A second depth map and a second texture map in the video.
  • step S1240 a combined multi-view video is obtained according to the first depth map, the second depth map, the first texture map and the second texture map.
  • step S1250 encapsulate and merge the multi-view video, and generate an encapsulation file for merging the multi-view video.
  • the package file includes a first application scene type field, and the first application scene type field is used to indicate that the application scene corresponding to the merged multi-view video is large.
  • the second value of the multi-view video for the scale atlas information is used to indicate that the application scene corresponding to the merged multi-view video is large.
  • step S1260 the package file is sent to the first device, so that the first device obtains the application scene corresponding to the combined multi-view video according to the first application scene type field, and according to the combined multi-view video
  • the application scene corresponding to the video determines the decoding or rendering mode of the combined multi-view video.
  • the second device and the third device are UAV A and UAV B respectively (but the application is not limited to this), and suppose that two cameras are installed on UAV A and UAV B respectively (ie The first number and the second number are both equal to 2, but the application is not limited to this, and can be set according to actual scenarios).
  • UAV A and UAV B are used to capture and shoot multi-view video of the same scene.
  • the first package file corresponding to the first multi-view video is as follows :
  • the above system description corresponds to the data composition of each area of the plane frame in FIG. 13 , and the upper-lower splicing method is described here as an example.
  • the second package file corresponding to the package of the second multi-view video is as follows:
  • the above system description corresponds to the data structure of each area of the plane frame as shown in FIG. 14 .
  • the server decapsulates and decodes the first encapsulated file and the second encapsulated file, and then decapsulates and decodes all the depth maps and texture maps. Merge, and assume that after downsampling the depth map, a merged multi-view video is obtained.
  • the depth map is not as important as the texture map, and the amount of data can be reduced after downsampling.
  • the embodiment of the present application indicates this scenario, but this scenario is limited.
  • the above system description corresponds to the data structure of each area of the plane frame shown in FIG. 9 .
  • the client of the first device When the client of the first device receives the package file of the combined multi-view video sent by the server, it can combine the regions of the combined multi-view video plane frame with the texture maps of different cameras, by parsing the corresponding fields in the package file. The depth map corresponds. Then, by decoding the camera parameter information in the media stream of the merged multi-view video, each area of the plane frame can be restored to the 3D rendering area, thereby consuming the merged multi-view video.
  • the media file encapsulation method provided by the embodiments of the present application for multi-view video applications in 6DoF media, a method for indicating multi-view video depth map and texture map related information in file encapsulation is proposed, so that the multi-view video of different views can be
  • the encapsulation and combination of depth map and texture map is more flexible.
  • Different application scenarios can be supported. As described in the above embodiments, in some scenarios, different devices are shooting, which will be packaged into two files, and the method provided in this embodiment of the present application can associate these two files for combined consumption. Otherwise, in the above embodiment, only two files can be presented separately, and cannot be presented jointly.
  • the method for decapsulating a media file provided in this embodiment of the present application may be performed by any electronic device.
  • the method for decapsulating a media file is applied to an intermediate node or a first device (for example, an immersive system) of an immersive system. player) as an example to illustrate, but the present application is not limited to this.
  • FIG. 15 schematically shows a flowchart of a method for decapsulating a media file according to an embodiment of the present application. As shown in FIG. 15 , the method provided by this embodiment of the present application may include the following steps.
  • step S1510 an encapsulation file of a media stream of the target media content in its corresponding application scenario is received, the encapsulation file includes a first application scenario type field, and the first application scenario type field is used to indicate the The application scenario corresponding to the media stream.
  • the method may further include: receiving a target description file of the target media content, where the target description file includes a second application scenario type field, where the second application scenario type field is used to indicate The application scenario corresponding to the media code stream; according to the second application scenario type field, the target encapsulation file of the target media code stream is determined in the encapsulation file of the media code stream.
  • receiving the encapsulation file of the media stream of the target media content in its corresponding application scenario may include: receiving the target encapsulation file, so as to determine the target encapsulation file according to the first application scenario type field in the target encapsulation file. Describe the target application scenario of the target media stream.
  • step S1520 the encapsulated file is decapsulated to obtain the first application scenario type field.
  • step S1530 an application scenario corresponding to the media stream is determined according to the first application scenario type field.
  • step S1540 at least one of a decoding mode and a rendering mode of the media stream is determined according to an application scenario corresponding to the media stream.
  • the method may further include: parsing The encapsulation file obtains the mapping relationship between the texture map and depth map collected by the camera corresponding to the viewing angle included in the media code stream and the plane frame in the large-scale atlas information; decode the media code stream to obtain camera parameters in the media code stream; according to the mapping relationship and the camera parameters, the multi-view video is presented in a three-dimensional space.
  • the media file encapsulation device provided in this embodiment of the present application may be set on any electronic device.
  • the device is set on the server side of the immersive system as an example for illustration, but the present application is not limited to this.
  • FIG. 16 schematically shows a block diagram of an apparatus for encapsulating a media file according to an embodiment of the present application.
  • the apparatus 1600 for encapsulating a media file provided by this embodiment of the present application may include a media stream obtaining unit 1610 , a media stream encapsulating unit 1620 , and an encapsulating file sending unit 1630 .
  • the media stream acquiring unit 1610 may be configured to acquire the media stream of the target media content in its corresponding application scenario.
  • the media code stream encapsulation unit 1620 may be configured to encapsulate the media code stream, and generate an encapsulation file of the media code stream, where the encapsulation file includes a first application scenario type field, and the first application scenario type field is used to indicate The application scenario corresponding to the media stream.
  • the package file sending unit 1640 may be configured to send the package file to the first device, so that the first device determines the application scenario corresponding to the media stream according to the first application scenario type field, and The application scenario corresponding to the code stream determines at least one of the decoding mode and the rendering mode of the media code stream.
  • the media file encapsulation device provided by the embodiment of the present application extends the first application scenario type field in the encapsulation file when generating the encapsulation file of the media stream under the corresponding application scenario, and indicates the first application scenario type field by the first application scenario type field.
  • the application scenario corresponding to the media stream thereby realizing the distinction between application scenarios corresponding to different media streams in the encapsulation of the media file.
  • the first device when the encapsulated file is sent to the first device, the first device can The first application scene type field in the package file distinguishes the application scene of the media code stream, so that the decoding method or rendering method to be adopted for the media code stream can be determined according to the application scene corresponding to the media code stream, which can save the first The computing power and resources of the device; on the other hand, since the application scenario of the media stream can be determined in the encapsulation stage, even if the first device does not have the decoding capability of the media stream, the application scenario corresponding to the media stream can be determined , instead of waiting until the media stream is decoded to distinguish.
  • the media code stream encapsulation unit 1620 may include: a first application scenario type field adding unit, which may be configured to add the first application in the volume visual media header data box of the target media file format data box Scenario type field; a unit for determining the value of the first application scenario type field, which may be configured to determine the value of the first application scenario type field according to the application scenario corresponding to the media stream.
  • the value of the first application scene type field may include any one of the following: indicating that the media stream is a first value of a multi-view video with non-large-scale atlas information; indicating The media code stream is the second value of the multi-view video of large-scale atlas information; it means that the media code stream is the third value of point cloud media compressed based on the traditional video coding method; it means that the media code stream is Fourth value for point cloud media compressed based on geometric features.
  • the apparatus 1600 for encapsulating a media file may further include: a single-track large-scale atlas identification adding unit, which may It is used to add a large-scale atlas mark in the bit stream sample entry of the target media file format data box if the media code stream is encapsulated according to a single track; the single-track camera angle of view mark adding unit can be used for if the The large-scale atlas identifier indicates that the media code stream is a multi-view video with large-scale atlas information, then the number of cameras that collect the media code stream is added to the bit stream sample entry, and the current number of the media code stream is added.
  • a single-track large-scale atlas identification adding unit which may It is used to add a large-scale atlas mark in the bit stream sample entry of the target media file format data box if the media code stream is encapsulated according to a single track
  • the single-track camera angle of view mark adding unit can be used for if the The large-scale atlas identifier
  • a single-track texture depth map resolution adding unit which can be used to add, in the bitstream sample entry, the texture maps and depths acquired from the viewing angles corresponding to the cameras included in the current file The resolution of the graph.
  • the apparatus 1600 for encapsulating a media file may further include at least one of the following: a single-track downsampling multiplier factor adding unit, which may be configured to add, in the bitstream sample entry, the current file including The downsampling multiplier factor of the depth map collected from the perspective corresponding to the camera; the single-track texture map offset adding unit can be used to add the data from the perspective corresponding to the camera included in the current file in the bitstream sample entry.
  • a single-track downsampling multiplier factor adding unit which may be configured to add, in the bitstream sample entry, the current file including The downsampling multiplier factor of the depth map collected from the perspective corresponding to the camera
  • the single-track texture map offset adding unit can be used to add the data from the perspective corresponding to the camera included in the current file in the bitstream sample entry.
  • the offset of the upper left vertex of the texture map relative to the origin of the plane frame in the large-scale atlas information; the single-track depth map offset adding unit can be used to add the current file in the bitstream sample entry
  • the apparatus 1600 for encapsulating a media file may further include: a multi-track large-scale atlas identification adding unit, which may It is used to add a large-scale atlas mark in the sample entry of the target media file format data box if the media code stream is encapsulated according to multi-tracks; the multi-track camera view point mark adding unit can be used for adding a large-scale atlas mark if the large-scale The atlas identifier indicates that the media code stream is a multi-view video with large-scale atlas information, then the number of cameras that collect the media code stream is added to the sample entry. The number of viewpoints corresponding to the camera is identified; the multi-track texture depth map resolution adding unit can be used to add the resolution of the texture map and depth map collected from the viewpoint corresponding to the camera included in the current file in the sample entry.
  • a multi-track large-scale atlas identification adding unit which may It is used to add a large-scale atlas mark in the sample entry of the target media file format data box
  • the apparatus 1600 for encapsulating a media file may further include at least one of the following: a multi-track downsampling multiplying factor adding unit, which may be used to add the camera included in the current file in the sample entry The downsampling factor of the depth map collected from the corresponding perspective; the multi-track texture map offset adding unit can be used to add the upper left vertex of the texture map corresponding to the camera included in the current file in the sample entry.
  • a multi-track downsampling multiplying factor adding unit which may be used to add the camera included in the current file in the sample entry The downsampling factor of the depth map collected from the corresponding perspective
  • the multi-track texture map offset adding unit can be used to add the upper left vertex of the texture map corresponding to the camera included in the current file in the sample entry.
  • the apparatus 1600 for encapsulating a media file may further include: a target description file generating unit, which may be configured to generate a target description file of the target media content, where the target description file includes a second application scene type field , the second application scenario type field is used to indicate the application scenario corresponding to the media stream; the target description file sending unit can be used to send the target description file to the first device, so that the first device According to the second application scenario type field, a target encapsulation file corresponding to the target media stream is determined in the encapsulation file of the media stream.
  • the package file sending unit 1640 may include: a target package file sending unit, which may be configured to send the target package file to the first device, so that the first device can send the target package file according to the first application in the target package file.
  • the scene type field determines the target application scene corresponding to the target media stream.
  • the target description file generating unit may include: a second application scenario type field adding unit, which may be used for the target description file of the hypertext transfer protocol-based dynamic adaptive streaming media transmission of the target media content
  • the second application scenario type field is added in the second application scenario type field;
  • the second application scenario type field value determination unit may be configured to determine the value of the second application scenario type field according to the application scenario corresponding to the media stream.
  • the media stream obtaining unit 1620 may include: an encapsulation file receiving unit, which may be configured to receive a first encapsulated file of the first multi-view video sent by the second device and a second multi-view video sent by the third device.
  • a second encapsulated file of a video an encapsulated file decapsulation unit, which can be used to decapsulate the first encapsulated file and the second encapsulated file respectively, to obtain the first multi-view video and the second multi-view video; a multi-view video decoding unit, which can be configured to decode the first multi-view video and the second multi-view video respectively, and obtain a first depth map and a first texture map in the first multi-view video, and the The second depth map and the second texture map in the second multi-view video; the multi-view video merging unit can be configured to use the first depth map, the second depth map, the first texture map and the The second texture map is obtained from the merged multi-view video.
  • a first number of cameras may be mounted on the second device, a second number of cameras may be mounted on the third device, and the second device and the third device may utilize their respective The camera can perform multi-view video capture and shooting for the same scene, and obtain the first multi-view video and the second multi-view video.
  • the first encapsulation file and the second encapsulation file may both include the first application scenario type field, and the first application scenario type field in the first encapsulation file and the second encapsulation file
  • the value of can be respectively used to indicate that the first multi-view video and the second multi-view video are the second values of the multi-view videos with large-scale atlas information.
  • the media code stream may include a 6-DOF media code stream and a restricted 6-DOF media code stream.
  • each unit in the device for encapsulating a media file provided by the embodiment of the present application, reference may be made to the content in the foregoing method for encapsulating a media file, and details are not described herein again.
  • the device for decapsulating a media file provided in this embodiment of the present application may be installed in any electronic device.
  • the device is set at an intermediate node or a first device (such as a player) of an immersive system as an example.
  • the present application is not limited to this.
  • FIG. 17 schematically shows a block diagram of an apparatus for decapsulating a media file according to an embodiment of the present application.
  • the apparatus 1700 for decapsulating a media file provided in this embodiment of the present application may include an encapsulated file receiving unit 1710 , a file decapsulating unit 1720 , an application scene obtaining unit 1730 , and a decoding and rendering determining unit 1740 .
  • the encapsulation file receiving unit 1710 may be configured to receive an encapsulation file of a media stream of the target media content in its corresponding application scenario, where the encapsulation file includes a first application scenario type field, and the first application scenario type field.
  • the application scene type field is used to indicate the application scene corresponding to the media stream.
  • the file decapsulation unit 1720 may be configured to decapsulate the encapsulated file to obtain the first application scenario type field.
  • the application scenario obtaining unit 1730 may be configured to determine the application scenario corresponding to the media stream according to the first application scenario type field.
  • the decoding and rendering determining unit 1740 may be configured to determine at least one of a decoding mode and a rendering mode of the media stream according to an application scenario corresponding to the media stream.
  • the device for decapsulating the media file 1700 may further include: an encapsulated file parsing unit, which can be used to parse the encapsulated file to obtain the texture map and depth map collected from the camera corresponding to the viewing angle included in the media stream and the plane frame in the large-scale atlas information. The mapping relationship between them; a media code stream decoding unit, which can be used to decode the media code stream to obtain camera parameters in the media code stream; a multi-view video presentation unit, which can be used for Camera parameters to present the multi-view video in three-dimensional space.
  • an encapsulated file parsing unit which can be used to parse the encapsulated file to obtain the texture map and depth map collected from the camera corresponding to the viewing angle included in the media stream and the plane frame in the large-scale atlas information. The mapping relationship between them; a media code stream decoding unit, which can be used to decode the media code stream to obtain camera parameters in the media code stream; a multi-view video presentation unit, which can be used
  • the apparatus 1700 for decapsulating a media file may further include: a target description file receiving unit, which may be configured to receive a target description file of the target media content, where the target description file includes the second application scenario type field, the second application scenario type field is used to indicate the application scenario corresponding to the media code stream; the target encapsulation file determination unit can be used for encapsulating the media code stream according to the second application scenario type field
  • the target encapsulation file that determines the target media stream in the file.
  • the encapsulation file receiving unit 1710 may include: a target application scenario determination unit, which may be configured to receive the target encapsulation file, so as to determine the target media code stream according to the first application scenario type field in the target encapsulation file. target application scenario.
  • each unit in the apparatus for decapsulating a media file provided by the embodiment of the present application, reference may be made to the content in the above-mentioned method for decapsulating a media file, which will not be repeated here.
  • Embodiments of the present application provide a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, implements the method for encapsulating a media file as described in the foregoing embodiments.
  • Embodiments of the present application provide a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, implements the method for decapsulating a media file as described in the foregoing embodiments.
  • An embodiment of the present application provides an electronic device, including: at least one processor; and a storage device configured to store at least one program, and when the at least one program is executed by the at least one processor, the at least one process
  • the device implements the media file encapsulation method described in the above embodiments.
  • An embodiment of the present application provides an electronic device, including: at least one processor; and a storage device configured to store at least one program, and when the at least one program is executed by the at least one processor, the at least one process
  • the device implements the method for decapsulating a media file as described in the above embodiments.
  • FIG. 18 shows a schematic structural diagram of an electronic device suitable for implementing the embodiments of the present application.
  • the electronic device 1800 includes a central processing unit (CPU, Central Processing Unit) 1801, which can be loaded into a random device according to a program stored in a read-only memory (ROM, Read-Only Memory) 1802 or from a storage part 1808 A program in a memory (RAM, Random Access Memory) 1803 is accessed to execute various appropriate actions and processes. In the RAM 1803, various programs and data necessary for system operation are also stored.
  • the CPU 1801, the ROM 1802, and the RAM 1803 are connected to each other through a bus 1804.
  • An input/output (I/O) interface 1805 is also connected to the bus 1804 .
  • the following components are connected to the I/O interface 1805: an input section 1806 including a keyboard, a mouse, etc.; an output section 1807 including a cathode ray tube (CRT, Cathode Ray Tube), a liquid crystal display (LCD, Liquid Crystal Display), etc., and a speaker, etc. ; a storage part 1808 including a hard disk and the like; and a communication part 1809 including a network interface card such as a LAN (Local Area Network) card, a modem, and the like.
  • the communication section 1809 performs communication processing via a network such as the Internet.
  • Drivers 1810 are also connected to I/O interface 1805 as needed.
  • a removable medium 1811 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, etc., is mounted on the drive 1810 as needed so that a computer program read therefrom is installed into the storage section 1808 as needed.
  • embodiments of the present application include a computer program product comprising a computer program carried on a computer-readable storage medium, the computer program containing program code for performing the method illustrated in the flowchart.
  • the computer program may be downloaded and installed from the network via the communication portion 1809, and/or installed from the removable medium 1811.
  • CPU central processing unit
  • the computer-readable storage medium shown in this application may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the above two.
  • the computer-readable storage medium can be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or a combination of any of the above.
  • Computer readable storage media may include, but are not limited to, electrical connections having at least one wire, portable computer disks, hard disks, random access memory (RAM), read only memory (ROM), erasable programmable Read memory (EPROM (Erasable Programmable Read Only Memory, erasable programmable read only memory) or flash memory), optical fiber, portable compact disk read only memory (CD-ROM), optical storage device, magnetic storage device, or any of the above suitable combination.
  • a computer-readable storage medium can be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code therein. Such propagated data signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • a computer-readable signal medium can also be any computer-readable storage medium other than a computer-readable storage medium that can be sent, propagated, or transmitted for use by or in connection with the instruction execution system, apparatus, or device program of.
  • the program code contained on the computer-readable storage medium can be transmitted by any suitable medium, including but not limited to: wireless, wire, optical cable, RF (Radio Frequency, radio frequency), etc., or any suitable combination of the above.
  • each block in the flowchart or block diagrams may represent a module, segment, or portion of code that contains at least one configurable function for implementing the specified logical function. Execute the instruction.
  • the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • the units involved in the embodiments of the present application may be implemented in software or hardware, and the described units may also be provided in a processor. Among them, the names of these units do not constitute a limitation on the unit itself under certain circumstances.
  • the present application also provides a computer-readable storage medium.
  • the computer-readable storage medium may be included in the electronic device described in the above-mentioned embodiments; in electronic equipment.
  • the above-mentioned computer-readable storage medium carries one or more programs, and when the above-mentioned one or more programs are executed by an electronic device, the electronic device is made to implement the methods described in the following embodiments. For example, the electronic device can implement the various steps shown in FIG. 4 or FIG. 5 or FIG. 6 or FIG. 10 or FIG. 11 or FIG. 12 or FIG. 15 .
  • the exemplary embodiments described herein may be implemented by software, or may be implemented by software combined with necessary hardware. Therefore, the technical solutions according to the embodiments of the present application may be embodied in the form of software products, and the software products may be stored in a non-volatile storage medium (which may be CD-ROM, U disk, mobile hard disk, etc.) or on the network , which includes several instructions to cause a computing device (which may be a personal computer, a server, a touch terminal, or a network device, etc.) to execute the method according to the embodiments of the present application.
  • a computing device which may be a personal computer, a server, a touch terminal, or a network device, etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Databases & Information Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Testing, Inspecting, Measuring Of Stereoscopic Televisions And Televisions (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

本申请提供了一种媒体文件的封装方法、媒体文件的解封装方法及相关设备。该媒体文件的封装方法包括:获取目标媒体内容在其对应的应用场景下的媒体码流;封装所述媒体码流,生成所述媒体码流的封装文件,所述封装文件中包括第一应用场景类型字段,所述第一应用场景类型字段用于指示所述媒体码流对应的应用场景;将所述封装文件发送至第一设备,以便所述第一设备根据所述第一应用场景类型字段确定所述媒体码流对应的应用场景,并根据所述媒体码流对应的应用场景确定所述媒体码流的解码方式和渲染方式中的至少一种。该方案能够在媒体文件的封装中对不同的应用场景进行区分。

Description

媒体文件的封装方法、媒体文件的解封装方法及相关设备
本申请要求于2020年10月14日提交中国专利局、申请号为2020110981907、申请名称为“媒体文件的封装方法、媒体文件的解封装方法及相关设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及数据处理技术领域,具体涉及媒体文件的封装与解封装技术。
背景技术
沉浸式媒体(Immersive Media)是指能为用户带来沉浸式体验的媒体内容,也可以称之为浸入式媒体。广义上来说,通过音视频技术让用户产生身临其境的感觉的媒体内容,就属于沉浸式媒体。例如,当用户戴上VR(Virtual Reality,虚拟现实)头盔之后会有强烈的沉浸在现场的感觉。
沉浸式媒体的应用形式多种多样,用户侧在对不同应用场景的沉浸式媒体进行解封装、解码和渲染时,需要的操作步骤和处理能力各有差别。而相关技术目前无法有效地区分沉浸式媒体所对应的应用场景,这将增加用户侧对于沉浸式媒体的处理难度。
发明内容
本申请实施例提供一种媒体文件的封装方法、媒体文件的解封装方法、媒体文件的封装装置、媒体文件的解封装装置、电子设备和计算机可读存储介质,能够在媒体文件的封装中对不同的应用场景进行区分。
本申请实施例提供一种媒体文件的封装方法,由电子设备执行,所述方法包括:获取目标媒体内容在其对应的应用场景下的媒体码流;封装所述媒体码流,生成所述媒体码流的封装文件,所述封装文件中包括第一应用场景类型字段,所述第一应用场景类型字段用于指示所述媒体码流对应的应用场景;将所述封装文件发送至第一设备,以便所述第一设备根据所述第一应用场景类型字段确定所述媒体码流对应的应用场景,并根据所述媒体码流对应的应用场景确定所述媒体码流的解码方式和渲染方式中的至少一种。
本申请实施例提供一种媒体文件的解封装方法,由电子设备执行,所述方法包括:接收目标媒体内容在其对应的应用场景下的媒体码流的封装文件,所述封装文件中包括第一应用场景类型字段,所述第一应用场景类型字段用于指示所述媒体码流对应的应用场景;解封装所述封装文件,获得所述第一应用场景类型字段;根据所述第一应用场景类型字段,确定所述媒体码流对应的应用场景;根据所述媒体码流对应的应用场景,确定所述媒体码流的解码方式和渲染方式中的至少一种。
本申请实施例提供一种媒体文件的封装装置,所述装置包括:媒体码流获取单元,用于获取目标媒体内容在其对应的应用场景下的媒体码流;媒体码流封装单元,用于封装所述媒体码流,生成所述媒体码流的封装文件,所述封装文件中包括第一应用场景类型字段,所述第一应用场景类型字段用于指示所述媒体码流对应的应用场景;封装文件发送单元,用于将所述封装文件发送至第一设备,以便所述第一设备根据所述第一应用场景类型字段确定所述媒体码流对应的应用场景,并根据所述媒体码流对应的应用场景确定所述媒体码流的解码方式和渲染方式中的至少一种。
本申请实施例提供一种媒体文件的解封装装置,所述装置包括:封装文件接收单元,用于接收目标媒体内容在其对应的应用场景下的媒体码流的封装文件,所述封装文件中包括第一应用场景类型字段,所述第一应用场景类型字段用于指示所述媒体码流对应的应用场景;文件解封装单元,用于解封装所述封装文件,获得所述第一应用场景类型字段;应用场景获得单元,用于根据所述第一应用场景类型字段,确定所述媒体码流对应的应用场景;解码渲染确定单元,用于根据所述媒体码流对应的应用场景,确定所述媒体码流的解码方式和渲染方式中的至少一种。
本申请实施例提供了一种计算机可读存储介质,其上存储有计算机程序,所述程序被处理器执行时实现如上述实施例中所述的媒体文件的封装方法或者媒体文件的解封装方法。
本申请实施例提供了一种电子设备,包括:至少一个处理器;存储装置,配置为存储至少一个程序,当所述至少一个程序被所述至少一个处理器执行时,使得所述至少一个处理器实现如上述实施例中所述的媒体文件的封装方法或者媒体文件的解封装方法。
本申请实施例提供了一种计算机程序产品,包括指令,当其在计算机上运行时,使得计算机执行如上述实施例中所述的媒体文件的封装方法或者媒体文件的解封装方法。
在本申请的一些实施例所提供的技术方案中,通过在生成对应应用场景下的媒体码流的封装文件时,在封装文件中扩展第一应用场景类型字段,通过该第一应用场景类型字段来指示该媒体码流对应的应用场景,由此实现在媒体文件的封装中对不同媒体码流对应的应用场景进行区分,一方面,将该封装文件发送至第一设备时,该第一设备可以根据该封装文件中的第一应用场景类型字段区分该媒体码流的应用场景,从而可以根据该媒体码流对应的应用场景确定对该媒体码流采取何种解码方式或者渲染方式,可以节约第一设备的运算能力和资源;另一方面,由于在封装阶段即可确定媒体码流的应用场景,因此即使第一设备不具备媒体码流的解码能力,也可以确定该媒体码流对应的应用场景,而不需要等到解码该媒体码流之后才能区分。
附图说明
图1示意性示出了三自由度的示意图;
图2示意性示出了三自由度+的示意图;
图3示意性示出了六自由度的示意图;
图4示意性示出了根据本申请的一实施例的媒体文件的封装方法的流程图;
图5示意性示出了根据本申请的一实施例的媒体文件的封装方法的流程图;
图6示意性示出了根据本申请的一实施例的媒体文件的封装方法的流程图;
图7示意性示出了根据本申请的一实施例的六自由度媒体上下拼接方式的示意图;
图8示意性示出了根据本申请的一实施例的六自由度媒体左右拼接方式的示意图;
图9示意性示出了根据本申请的一实施例的六自由度媒体深度图1/4分辨率拼接方式的示意图;
图10示意性示出了根据本申请的一实施例的媒体文件的封装方法的流程图;
图11示意性示出了根据本申请的一实施例的媒体文件的封装方法的流程图;
图12示意性示出了根据本申请的一实施例的媒体文件的封装方法的流程图;
图13示意性示出了根据本申请的一实施例的第一多视角视频上下拼接方式的示意图;
图14示意性示出了根据本申请的一实施例的第二多视角视频上下拼接方式的示意图;
图15示意性示出了根据本申请的一实施例的媒体文件的解封装方法的流程图;
图16示意性示出了根据本申请的一实施例的媒体文件的封装装置的框图;
图17示意性示出了根据本申请的一实施例的媒体文件的解封装装置的框图;
图18示出了适于用来实现本申请实施例的电子设备的结构示意图。
具体实施方式
现在将参考附图更全面地描述示例实施方式。然而,示例实施方式能够以多种形式实施,且不应被理解为限于在此阐述的范例;相反,提供这些实施方式使得本申请将更加全面和完整,并将示例实施方式的构思全面地传达给本领域的技术人员。
此外,所描述的特征、结构或特性可以以任何合适的方式结合在一个或更多实施例中。在下面的描述中,提供许多具体细节从而给出对本申请的实施例的充分理解。然而,本领域技术人员将意识到,可以实践本申请的技术方案而没有特定细节中的一个或更多,或者可以采用其它的方法、组元、装置、步骤等。在其它情况下,不详细示出或描述公知方法、装置、实现或者操作以避免模糊本申请的各方面。
附图中所示的方框图仅仅是功能实体,不一定必须与物理上独立的实体相对应。即,可以采用软件形式来实现这些功能实体,或在至少一个硬件模块或集成电路中实现这些功能实体,或在不同网络和/或处理器装置和/或微控制器装置中实现这些功能实体。
附图中所示的流程图仅是示例性说明,不是必须包括所有的内容和操作/步骤,也不是必须按所描述的顺序执行。例如,有的操作/步骤还可以分解,而有的操作/步骤可以合并或部分合并,因此实际执行的顺序有可能根据实际情况改变。
首先对本申请实施例中涉及的部分术语进行说明。
点云(Point Cloud):点云是空间中一组无规则分布的、表达三维物体或场景的空间结构及表面属性的离散点集。点云是指海量三维点的集合,点云中的每个点至少具有三维位置信息,根据应用场景的不同,还可能具有色彩(颜色)、材质或其他信息(例如反射率等附加属性)。通常,点云中的每个点都具有相同数量的附加属性。例如,根据激光测量原理得到的点云,包括三维坐标(XYZ)和激光反射强度(reflectance);根据摄影测量原理得到的点云,包括三维坐标(XYZ)和颜色信息(RGB,红绿蓝);结合激光测量和摄影测量原理得到点云,包括三维坐标(XYZ)、激光反射强度(reflectance)和颜色信息(RGB)。
其中,按用途可以将点云分为两大类:机器感知点云和人眼感知点云;其中,机器感知点云例如可以用于自主导航系统、实时巡检系统、地理信息系统、视觉分拣机器人、抢险救灾机器人等场景;人眼感知点云例如可以用于数字文化遗产、自由视点广播、三维沉浸通信、三维沉浸交互等场景。
其中,按获取的途径可以将点云分为三类:静态点云、动态点云和动态获取点云;第一类静态点云:即物体是静止的,获取点云的设备也是静止的;第二类动态点云:物体是运动的,但获取点云的设备是静止的;第三类动态获取点云:获取点云的设备是运动的。
PCC:Point Cloud Compression,点云压缩。点云是海量点的集合,这些点云数据不仅会消耗大量的存储内存,而且不利于传输,相关技术中没有这么大的带宽可以支持将点云不经过压缩直接在网络层进行传输,因此对点云进行压缩是很有必要的。
G-PCC:Geometry-based Point Cloud Compression,基于几何特征的点云压缩。G-PCC可以用于对第一类静态点云和第三类动态获取点云进行压缩,相应获得的点云媒体可以称之为基于几何特征进行压缩的点云媒体,简称为G-PCC点云媒体。
V-PCC:Video-based Point Cloud Compression,基于传统视频编码的点云压缩。V-PCC可以用于对第二类动态点云进行压缩,相应获得的点云媒体可以称之为基于传统视频编码方式进行压缩的点云媒体,简称为V-PCC点云媒体。
sample:样本,媒体文件封装过程中的封装单位,一个媒体文件由很多个样本组成。以媒体文件为视频媒体为例,视频媒体的一个样本通常为一个视频帧。
DoF:Degree of Freedom,自由度。力学系统中是指独立坐标的个数,除了平移的自由度外,还有转动及振动自由度。本申请实施例中,自由度是指用户在观看沉浸式媒体时,支持的运动并产生内容交互的自由度。
3DoF:即三自由度,指用户头部围绕XYZ轴旋转的三种自由度。图1示意性示出了三自由度的示意图。如图1所示,就是在某个地方、某一个点在三个轴上都可以旋转,可以转头,也可以上下低头,也可以摆头。通过三自由度的体验,用户能够360度地沉浸在一个现场中。如果是静态的,可以理解为是全景的图片。如果全景的图片是动态,就是全景视频,也就是VR视频。但是三自由度的VR视频有一定局限性,即用户是不能够移动的,不能选择任意的一个地方去看。
3DoF+:即在三自由度的基础上,用户还拥有沿XYZ轴做有限运动的自由度,也可以将其称之为受限六自由度,对应的媒体码流可以称之为受限六自由度媒体码流。图2示意性示出了三自由度+的示意图。
6DoF:即在三自由度的基础上,用户还拥有沿XYZ轴自由运动的自由度,对应的媒体码流可以称之为六自由度媒体码流。图3示意性示出了六自由度的示意图。其中,6DoF媒体是指的6自由度视频,是指视频可以提供用户在三维空间的XYZ轴方向自由移动视点,以及围绕XYX轴自由旋转视点的高自由度观看体验。6DoF媒体是以摄像机阵列采集得到的对应空间不同视角的视频组合。为了便于6DoF媒体的表达、存储、压缩和处理,将6DoF媒体数据表达为以下信息的组合:多摄像机采集的纹理图,多摄像机纹理图所对应的深度图,以及相应的6DoF媒体内容描述元数据,元数据中包括多摄像机的参数,以及6DoF媒体的拼接布局和边缘保护等描述信息。在编码端,把多摄像机的纹理图信息和对应的深度图信息进行拼接处理,并且根据所定义的语法和语义将拼接得到的描述数据写入元数据。通过平面视频压缩方式对拼接后的多摄像机深度图和纹理图信息进行编码,并且传输到终端解码后,进行用户所请求的6DoF虚拟视点的合成,从而提供用户6DoF媒体的观看体 验。
容积媒体:是沉浸式媒体的一种,例如可以包括容积视频。容积视频是三维数据表示,由于目前主流的编码都是基于二维的视频数据,所以对原始的容积视频数据在系统层进行封装、传输等处理之前,需要先将其从三维转化到二维后再进行编码。容积视频的内容呈现的过程中,又需要将二维表示的数据转化成三维数据来表示最终呈现的容积视频。容积视频如何在二维平面表示将会直接作用系统层封装、传输、以及最后容积视频的内容呈现处理。
图集(atlas):指示2D(2-dimension,二维)的平面帧上的区域信息,3D(3-dimension,三维)呈现空间的区域信息,以及二者之间的映射关系和映射所需的必要参数信息。图集包括图块和图块对应到容积数据的三维空间中一个区域的关联信息的集合。图块(patch)是图集中的一个矩形区域,与三维空间的容积信息关联。对容积视频的二维表示的组件数据进行处理生成图块,根据几何组件数据中表示的容积视频的位置,将容积视频的二维表示所在的二维平面区域分割成多个不同大小的矩形区域,一个矩形区域为一个图块,图块包含将该矩形区域反投影到三维空间的必要信息;打包图块生成图集,将图块放入一个二维网格中,并保证各个图块中的有效部分是没有重叠的。一个容积视频生成的图块可以打包成一个或多个图集。基于图集数据生成对应的几何数据、属性数据和占位数据,根据图集数据、几何数据、属性数据、占位数据组合生成容积视频在二维平面的最终表示。其中,几何组件为必选,占位组件为条件必选,属性组件为可选。
AVS:Audio Video Coding Standard,音视频编码标准。
ISOBMFF:ISO Based Media File Format,基于ISO(International Standard Organization,国际标准化组织)标准的媒体文件格式。ISOBMFF是媒体文件的封装标准,最典型的ISOBMFF文件即MP4(Moving Picture Experts Group 4,动态图像专家组4)文件。
深度图(Depth map):作为一种三维场景信息表达方式,深度图的每个像素点的灰度值可用于表征场景中某一点距离摄像机的远近。
本申请实施例提供的媒体文件的封装方法可以由任意的电子设备来执行,在下面的举例说明中,以应用于沉浸式系统的服务器执行该媒体文件的封装为例进行举例说明,但本申请并不限定于此。
图4示意性示出了根据本申请的一实施例的媒体文件的封装方法的流程图。如图4所示,本申请实施例提供的方法可以包括以下步骤。
在步骤S410中,获取目标媒体内容在其对应的应用场景下的媒体码流。
本申请实施例中,目标媒体内容可以是视频、音频、图像等中的任意一种或者多种的组合,在下面的举例说明中,以视频为例进行举例说明,但本申请并不限定于此。
本申请实施例中,所述媒体码流可以包括六自由度(6DoF)媒体码流和受限六自由度(3DoF+)媒体码流等存在3D空间内的可渲染的任意媒体码流,下面的举例说明中以6DoF媒体码流为例进行举例说明。本申请实施例提供的方法可以适用于6DoF媒体内容录播、点播、直播、通信、节目编辑、制作等应用。
沉浸式媒体按照用户在消费目标媒体内容时可支持的自由度,可以分为3DoF媒体、3DoF+媒体以及6DoF媒体。其中6DoF媒体可以包括多视角视频以及点云媒体。
其中,点云媒体从编码方式上又可以分为基于传统视频编码方式进行压缩的点云媒体(即V-PCC)、以及基于几何特征进行压缩的点云媒体(G-PCC)。
多视角视频通常由摄像机阵列从多个角度(也可以称之为视角)对同一场景进行拍摄,形成包括场景的纹理信息(色彩信息等)的纹理图和包括深度信息(空间距离信息等)的深度图,再加上2D平面帧到3D呈现空间的映射信息,即构成了可在用户侧进行消费的6DoF媒体。
由相关技术可知,6DoF媒体的应用形式多种多样,用户在对不同应用场景的6DoF媒体进行解封装、解码和渲染时,需要的操作步骤和处理能力各有差别。
例如,多视角视频和V-PCC的编码是一套规则,G-PCC的编码是一套规则,二者的编码标准都不一样,相应地,解码处理的方式也是有区别的。
再例如,多视角视频和V-PCC的编码标准虽然一样,但是一个是把图片渲染到3D空间,一个是把一堆点渲染到3D空间,所以会有一些差别。另外多视角视频需要纹理图和深度图,V-PCC除了这些还可能需要占位图,这也是一个差别。
在步骤S420中,封装所述媒体码流,生成所述媒体码流的封装文件,所述封装文件中包括第一应用场景类型字段,所述第一应用场景类型字段用于指示所述媒体码流对应的应用场景。
例如,可以针对6DoF媒体的应用,本申请实施例可以区分不同6DoF媒体的应用场景。
由于当前业界对于6DoF媒体统一定义为容积媒体,若不能在文件封装过程中对不同的应用场景进行区分,将会给用户侧的处理带来不必要的麻烦。例如,如果不能够在媒体文件封装过程中区分这些媒体文件对应的不同应用场景,就需要解码媒体码流之后再区分,一方面这会导致浪费运算资源,另一方面,由于有些中间节点例如CDN(Content Delivery Network,内容分发网络)节点是不具备解码能力的,因此将导致解码失败的情况发生。
如上所述,这些不同的应用本身处理方式就不一样,是需要进行区分的,在文件封装过程中区分应用场景的好处是在媒体文件的很高层即可获取这个信息,从而可以节约运算资源,同时让一些不具备解码能力的中间节点例如 CDN节点也可以获取到这个信息。
在步骤S430中,将所述封装文件发送至第一设备,以便所述第一设备根据所述第一应用场景类型字段确定所述媒体码流对应的应用场景,并根据所述媒体码流的应用场景确定所述媒体码流的解码方式和渲染方式中的至少一种。
本申请实施例中,第一设备可以是任意的中间节点,也可以是消费该媒体码流的任意用户终端,本申请对此不做限定。
本申请实施方式提供的媒体文件的封装方法,通过在生成对应的应用场景下的媒体码流的封装文件时,在封装文件中扩展第一应用场景类型字段,通过该第一应用场景类型字段指示该媒体码流对应的应用场景,由此使得在媒体文件的封装中即可对不同媒体码流的不同应用场景进行区分,一方面,将该封装文件发送至第一设备时,该第一设备可以根据该封装文件中的第一应用场景类型字段区分该媒体码流的应用场景,从而可以根据该媒体码流对应的应用场景确定对该媒体码流采取何种解码方式和\或者渲染方式,可以节约第一设备的运算能力和资源;另一方面,由于在封装阶段即可确定媒体码流的应用场景,因此即使第一设备不具备媒体码流的解码能力,也可以确定该媒体码流对应的应用场景,而不需要等到解码该媒体码流之后才能区分。
图5示意性示出了根据本申请的一实施例的媒体文件的封装方法的流程图。如图5所示,本申请实施例提供的方法可以包括以下步骤。
图5实施例中的步骤S410可以参照上述实施例。
在图5实施例中,上述图4实施例中的步骤S420可以进一步包括以下步骤。
在步骤S421中,在目标媒体文件格式数据盒的容积可视媒体头数据盒(例如下文中例举的Volumetric Visual Media Header Box)中添加所述第一应用场景类型字段。
本申请实施例中,为了能够根据例如6DoF媒体的应用场景对媒体文件进行对应标识,可以在系统层添加若干描述性字段,包括文件封装层面的字段扩展。例如,在下面的举例说明中,以扩展ISOBMFF数据盒(作为目标媒体文件格式数据盒)为例进行举例说明,但本申请并不限定于此。
在步骤S422中,根据所述媒体码流对应的应用场景确定所述第一应用场景类型字段的取值。
在示例性实施例中,所述第一应用场景类型字段的取值可以包括以下中的任意一项:表示所述媒体码流为非大尺度图集信息的多视角视频的第一值(例如“0”);表示所述媒体码流为大尺度图集信息的多视角视频的第二值(例如“1”);表示所述媒体码流为基于传统视频编码方式进行压缩的点云媒体的第三值(例如“2”);表示所述媒体码流为基于几何特征进行压缩的 点云媒体的第四值(例如“3”)。
应当理解的是,第一应用场景类型字段的取值并不限于指示上述应用场景,其可以指示更多或者更少的应用场景,可以根据实际需求进行设置。
图5实施例中的步骤S430可以参照上述实施例。
本申请实施方式提供的媒体文件的封装方法,通过区分不同的6DoF媒体的应用场景,可以使消费6DoF媒体的第一设备有针对性地在6DoF媒体的解封装、解码、渲染环节等进行策略选择。
图6示意性示出了根据本申请的一实施例的媒体文件的封装方法的流程图。如图6所示,本申请实施例提供的方法可以包括以下步骤。
图6实施例中的步骤S410可以参照上述实施例。
在图6实施例中,上述实施例中的步骤S420可以进一步包括以下步骤S4221,即在封装时通过第一应用场景类型字段,确定了该媒体码流为大尺度图集信息的多视角视频。
在步骤S4221中,封装媒体码流,生成媒体码流的封装文件,所述封装文件中包括第一应用场景类型字段(例如下文举例的application_type),所述第一应用场景类型字段的取值为表示所述媒体码流为大尺度图集信息的多视角视频的第二值。
对于多视角视频来说,其2D平面帧到3D呈现空间的映射信息决定了多视角视频的6DoF体验。对于这种映射关系的指示,存在两种方法,一种方法定义了图集来对2D平面的区域进行较为细致的划分,进而指示这些2D小区域集合到3D空间的映射关系,这种称之为非大尺度图集信息,对应的多视角视频为非大尺度图集信息的多视角视频。另一种方法则更加粗略,直接从采集设备(均以摄像机为例进行举例说明)的角度,标识每个摄像机生成的深度图和纹理图,并根据每个摄像机参数来还原对应的2D深度图和纹理图在3D空间的映射关系,这种称之为大尺度图集信息,对应的多视角视频为大尺度图集信息的多视角视频。可以理解的是,这里的大尺度图集信息和非大尺度图集信息是相对而言的,并不直接限定具体尺寸。
其中,摄像机参数通常分为摄像机的外参和内参,外参通常包括摄像机拍摄的位置、角度等信息,内参通常包括摄像机的光心位置、焦距长度等信息。
由此可知,6DoF媒体中的多视角视频可以进一步包括大尺度图集信息的多视角视频和非大尺度图集信息的多视角视频,即6DoF媒体的应用形式多种多样,用户在对不同应用场景的6DoF媒体进行解封装、解码和渲染时,需要的操作步骤和处理能力各有差别。
例如,大尺度图集信息和非大尺度图集信息的区别在于2D区域到3D空间的映射和渲染的粒度不同,假设大尺度图集信息是6块2D拼图映射到3D 空间,非大尺度图集信息可能就是60块拼图映射到3D空间。那么这两种映射的算法的复杂度肯定是有区别的,大尺度图集信息的算法会比非大尺度图集信息的算法简单。
特别地,对于多视角视频来说,若其2D区域到3D空间的映射关系是由摄像机参数得到的,即其为大尺度图集信息的多视角视频,那么在封装文件中并不需要定义更小的2D区域到3D空间的映射关系。
当该媒体码流为大尺度图集信息的多视角视频时,则所述方法还可以包括以下步骤。
在步骤S601中,若所述媒体码流按照单轨道封装,则在所述目标媒体文件格式数据盒的位流样本入口(例如下文举例说明的V3CbitstreamSampleEntry,但本申请并不限定于此)中添加大尺度图集标识(例如下文举例的large_scale_atlas_flag)。
在步骤S602中,若所述大尺度图集标识指示所述媒体码流为大尺度图集信息的多视角视频,则在所述位流样本入口中添加采集所述媒体码流的摄像机数量标识(例如下文举例的camera_count)、所述媒体码流的当前文件中包括的摄像机对应的视角数目标识(例如下文举例的camera_count_contained)。
在步骤S603中,在所述位流样本入口中添加所述当前文件中包括的摄像机对应的视角采集的纹理图和深度图的分辨率(例如下文举例的camera_resolution_x和camera_resolution_y)。
继续参考图6,进一步地所述方法还包括以下步骤S604-S607中的至少一项。
在步骤S604中,在所述位流样本入口中添加所述当前文件中包括的摄像机对应的视角采集的深度图的降采样倍数因子(例如下文中举例的depth_downsample_factor)。
在步骤S605中,在所述位流样本入口中添加所述当前文件中包括的摄像机对应的视角采集的纹理图左上顶点相对于所述大尺度图集信息中的平面帧的原点偏移量(例如下文中举例的texture_vetex_x和texture_vetex_y)。
在步骤S606中,在所述位流样本入口中添加所述当前文件中包括的摄像机对应的视角采集的深度图左上顶点相对于所述大尺度图集信息中的平面帧的原点偏移量(例如下文中举例的depth_vetex_x和depth_vetex_y)。
在步骤S607中,在所述位流样本入口中添加所述当前文件中包括的摄像机对应的视角采集的纹理图和深度图的保护带宽度(例如下文中举例的padding_size_texture和padding_size_depth)。
本申请实施例中,padding_size_texture和padding_size_depth分别定义了每个纹理图以及深度图的边缘保护区域的大小,为了保护对于拼接图像进行压缩时的边缘突变区域。padding_size_texture和padding_size_depth的值表示 了纹理图和深度图边缘保护区域的宽度,padding_size_texture和padding_size_depth等于0则表示没有任何的边缘保护。
继续参考图6,当该媒体码流为大尺度图集信息的多视角视频时,则所述方法还可以包括以下步骤。
在步骤S608中,若所述媒体码流按照多轨道封装,则在所述目标媒体文件格式数据盒的样本入口中添加大尺度图集标识。
在步骤S609中,若所述大尺度图集标识指示所述媒体码流为大尺度图集信息的多视角视频,则在所述样本入口中添加采集所述媒体码流的摄像机数量标识、所述媒体码流的当前文件中包括的摄像机对应的视角数目标识。
在步骤S610中,在所述样本入口中添加所述当前文件中包括的摄像机对应的视角采集的纹理图和深度图的分辨率。
继续参考图6,进一步地所述方法还包括以下步骤S611-S614中的至少一项。
在步骤S611中,在所述样本入口中添加所述当前文件中包括的摄像机对应的视角采集的深度图的降采样倍数因子。
在步骤S612中,在所述样本入口中添加所述当前文件中包括的摄像机对应的视角采集的纹理图左上顶点相对于所述大尺度图集信息中的平面帧的原点偏移量。
在步骤S613中,在所述样本入口中添加所述当前文件中包括的摄像机对应的视角采集的深度图左上顶点相对于所述大尺度图集信息中的平面帧的原点偏移量。
在步骤S614中,在所述样本入口中添加所述当前文件中包括的摄像机对应的视角采集的纹理图和深度图的保护带宽度。
图6实施例中的步骤S430可以参照上述实施例。
本申请实施例中,可以采用six_dof_stitching_layout字段来指示6DoF媒体中各个摄像机对应的视角采集的深度图和纹理图的拼接方法,用于标识6DoF媒体的纹理图和深度图拼接布局,具体取值可以参见下表1。
表1 6DoF媒体拼接布局
six_dof_stitching_layout的值 6DoF媒体拼接布局
0 深度图和纹理图的上下拼接
1 深度图和纹理图的左右拼接
2 深度图1/4降采样拼接
其他 保留(reserved)
图7示意性示出了根据本申请的一实施例的六自由度媒体上下拼接方式的示意图。
当six_dof_stitching_layout值为0时,6DoF媒体拼接模式为上下拼接,如图7所示,在上下拼接模式中,多摄像机采集的纹理图(例如图7中的视角1纹理图,视角2纹理图,视角3纹理图,视角4纹理图)按顺序排布于图像的上方,而相对应的深度图(例如图7中的视角1深度图,视角2深度图,视角3深度图,视角4深度图)则按次序排布于图像的下方。
设定拼接后6DoF媒体的分辨率为nWidth×nHeight,重建模块则可利用camera_resolution_x以及camera_resolution_y的值计算出相应的各个摄像机的纹理图以及深度图的布局位置,从而进一步利用多摄像机的纹理图和深度图信息来进行6DoF媒体的重建。
图8示意性示出了根据本申请的一实施例的六自由度媒体左右拼接方式的示意图。
当six_dof_stitching_layout值为1时,6DoF媒体拼接模式为左右拼接,如图8所示,在左右拼接模式中,多摄像机采集的纹理图(例如图8中的视角1纹理图,视角2纹理图,视角3纹理图,视角4纹理图)按顺序排布于图像的左方,而相对应的深度图(例如图8中的视角1深度图,视角2深度图,视角3深度图,视角4深度图)则按次序排布于图像的右方。
图9示意性示出了根据本申请的一实施例的六自由度媒体深度图1/4分辨率拼接方式的示意图。
当six_dof_stitching_layout值为2时,6DoF媒体拼接模式为深度图1/4降采样的拼接,如图9所示,深度图1/4降采样的拼接方式中,深度图(例如图9中的视角1深度图,视角2深度图,视角3深度图,视角4深度图)在1/4分辨率降采样后,拼接于纹理图(例如图9中的视角1纹理图,视角2纹理图,视角3纹理图,视角4纹理图)的右下方。如果深度图的拼接无法填满最终拼接图的矩形区域,则剩余的部分以空白的图像进行填充。
本申请实施方式提供的媒体文件的封装方法,不仅能够区分不同6DoF媒体的应用场景,可以使消费6DoF媒体的第一设备有针对性地在6DoF媒体的解封装、解码、渲染环节进行策略选择。进一步地,对于6DoF媒体中的多视角视频应用,提出一种在文件封装中指示多视角视频深度图、纹理图相关信息的方法,使得多视角视频不同视角的深度图、纹理图的封装组合方式更加灵活。
在示例性实施例中,所述方法还可以包括:生成所述目标媒体内容的目标描述文件,所述目标描述文件中包括第二应用场景类型字段,所述第二应用场景类型字段用于指示所述媒体码流对应的应用场景;发送所述目标描述文件至所述第一设备,以便所述第一设备根据所述第二应用场景类型字段,在所述媒体码流的封装文件中确定目标媒体码流对应的目标封装文件。
相应地,将所述封装文件发送至第一设备,以便所述第一设备根据所述 第一应用场景类型字段确定所述媒体码流对应的应用场景,可以包括:将所述目标封装文件发送至所述第一设备,以便所述第一设备根据所述目标封装文件中的第一应用场景类型字段,确定所述目标媒体码流对应的目标应用场景。
图10示意性示出了根据本申请的一实施例的媒体文件的封装方法的流程图。如图10所示,本申请实施例提供的方法可以包括以下步骤。
图10实施例中的步骤S410-S420可以参照上述实施例,其可以进一步包括以下步骤。
在步骤S1010中,生成所述目标媒体内容的目标描述文件,所述目标描述文件中包括第二应用场景类型字段(例如下述举例的v3cAppType),所述第二应用场景类型字段指示所述媒体码流对应的应用场景。
本申请实施例中,在系统层添加若干描述性字段,除了包括上述文件封装层面的字段扩展以外,还可以针对信令传输层面的字段进行扩展,在下面的实施例中,以支持DASH(Dynamic adaptive streaming over HTTP(HyperText Transfer Protocol,超文本传送协议),基于超文本传送协议的动态自适应流媒体传输)MPD(Media Presentation Description,媒体文件的描述文件)信令(作为目标描述文件)的形式举例说明,定义了6DoF媒体的应用场景类型指示以及大尺度图集指示。
在步骤S1020中,发送所述目标描述文件至所述第一设备,以便所述第一设备根据所述第二应用场景类型字段从所述媒体码流的封装文件中确定目标媒体码流的目标封装文件。
在步骤S1030中,将所述目标封装文件发送至所述第一设备,以便所述第一设备根据所述目标封装文件中的第一应用场景类型字段,确定所述目标媒体码流的目标应用场景。
本申请实施方式提供的媒体文件的封装方法，不仅可以在封装文件中通过第一应用场景类型字段标识媒体码流对应的应用场景，还可以在目标描述文件中通过第二应用场景类型字段标识媒体码流对应的应用场景。这样，第一设备可以首先根据目标描述文件中的第二应用场景类型字段确定其需要获取何种媒体码流，从而向服务器端请求相应的目标媒体码流，减少数据的传输量，并保证所请求的目标媒体码流能够匹配第一设备的实际能力；当第一设备接收到所请求的目标媒体码流后，可以进一步根据封装文件中的第一应用场景类型字段确定目标媒体码流的目标应用场景，从而获知该采取何种解码和渲染方式，节约了运算资源。
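为便于理解上述两级选择流程，下面给出一个简化的示意性代码片段（Python伪实现，其中的函数与数据结构均为本示例的假设，并非真实的MPD解析或文件解析接口）。
def pick_representation(representations, supported_types):
    """第一级（假设的接口）：representations为从目标描述文件（MPD）解析出的Representation列表，
    每项形如{'id': ..., 'v3cAppType': int, 'url': ...}；supported_types为第一设备能够处理的
    应用场景类型集合。返回第一个匹配项，用于向服务器端请求对应的目标封装文件。"""
    for rep in representations:
        if rep['v3cAppType'] in supported_types:
            return rep
    return None

def pick_pipeline(application_type):
    """第二级：根据封装文件中解析出的第一应用场景类型字段application_type，
    选择对应的解码/渲染处理方式（取值语义与下文表2一致）。"""
    pipelines = {
        0: 'multi_view_non_large_scale_atlas',  # 多视角视频（非大尺度图集信息）
        1: 'multi_view_large_scale_atlas',      # 多视角视频（大尺度图集信息），可采用较简单的处理算法
        2: 'video_based_point_cloud',           # 基于传统视频编码方式进行压缩的点云媒体
        3: 'geometry_based_point_cloud',        # 基于几何特征进行压缩的点云媒体
    }
    if application_type not in pipelines:
        raise ValueError('未知的应用场景类型: %s' % application_type)
    return pipelines[application_type]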
图11示意性示出了根据本申请的一实施例的媒体文件的封装方法的流程图。如图11所示,本申请实施例提供的方法可以包括以下步骤。
图11实施例中的步骤S410-S420可以参照上述实施例。
图11实施例中,上述图10实施例中的步骤S1010可以进一步包括以下步骤。
在步骤S1011中,在所述目标媒体内容的基于超文本传送协议的动态自适应流媒体传输的目标描述文件中添加所述第二应用场景类型字段。
在步骤S1012中,根据所述媒体码流对应的应用场景,确定所述第二应用场景类型字段的取值。
图11实施例中的步骤S1020和步骤S1030可以参照上述实施例。
下面对本申请实施例提出的媒体文件的封装方法进行举例说明。以6DoF媒体为例,本申请实施例提出的方法可以用于6DoF媒体应用场景指示,可以包括以下步骤:
1.根据6DoF媒体的应用场景,对媒体文件进行对应标识。
2.特别地，对于多视角视频，判断其2D平面帧到3D空间的映射是否以采集摄像机的输出为单位进行。若2D平面帧到3D空间的映射是以每个摄像机采集的纹理图和深度图为单位进行的，则将对应的图集信息称之为大尺度图集信息；若需要对每个摄像机采集的纹理图和深度图做进一步较为细致的划分，并指示划分后的2D小区域集合到3D空间的映射，则称之为非大尺度图集信息。
3.若多视角视频从2D平面帧到3D空间的映射以采集摄像机的输出为单位进行映射,则在封装文件中指示不同采集摄像机输出的相关信息。
本实施例可以在系统层添加若干描述性字段,可以包括文件封装层面的字段扩展以及信令传输层面的字段扩展,以支持本申请实施例的上述步骤。下面以扩展ISOBMFF数据盒和DASH MPD信令的形式举例,定义了6DoF媒体的应用类型指示以及大尺度图集指示,具体如下(其中扩展部分以斜体字标识)。
一、ISOBMFF数据盒扩展
本部分中使用的数学运算符和优先级参照C语言。除特别说明外,约定编号和计数从0开始。
aligned(8) class V3CBitstreamSampleEntry() extends VolumetricVisualSampleEntry('v3e1') { // 因为6DoF媒体在封装的时候可以按照单轨道或者多轨道来封装，这个结构对应单轨道的情况。
    V3CConfigurationBox config;
    unsigned int(1) large_scale_atlas_flag;
    bit(7) reserved; // 保留字段：字段一般需要对齐到整数个byte，所以需要用保留的bit（位）来补足。
    // 当large_scale_atlas_flag指示所述媒体码流为大尺度图集信息的多视角视频时，此处还包括
    // camera_count、camera_count_contained、padding_size_texture、padding_size_depth，
    // 以及逐摄像机的camera_id、camera_resolution_x/y、depth_downsample_factor、
    // texture_vetex_x/y、depth_vetex_x/y等字段，具体语法从略，字段语义见下文说明。
}
aligned(8) class V3CSampleEntry() extends VolumetricVisualSampleEntry('v3c1') { // 这个结构对应多轨道的情况。
    V3CConfigurationBox config;
    V3CUnitHeaderBox unit_header;
    unsigned int(1) large_scale_atlas_flag;
    bit(7) reserved;
    // 同样地，当large_scale_atlas_flag指示所述媒体码流为大尺度图集信息的多视角视频时，此处还包括
    // 上述摄像机数量标识、视角数目标识、纹理图和深度图分辨率、降采样倍数因子、偏移量和保护带宽度等字段，
    // 具体语法从略，字段语义见下文说明。
}
本申请实施例中,第一应用场景类型字段application_type指示6DoF媒体的应用场景类型,具体取值包含但不仅限于下表2所示内容:
表2
取值 语义
0 多视角视频(非大尺度图集信息)
1 多视角视频(大尺度图集信息)
2 基于传统视频编码方式进行压缩的点云媒体
3 基于几何特征进行压缩的点云媒体
其中,大尺度图集标识large_scale_atlas_flag指示图集信息是否为大尺度图集信息,即图集信息是否仅通过摄像机参数等相关信息即可获取,这里假设large_scale_atlas_flag等于1时指示为多视角视频(大尺度图集信息),等于0时指示为多视角视频(非大尺度图集信息)。
需要说明的是，从上述表2可以看出，第一应用场景类型字段application_type已经能够指示是否为大尺度图集信息的多视角视频。考虑到application_type属于较上层的指示，增加large_scale_atlas_flag可以方便解析。实际使用中二者取其一即可，这里同时给出两个字段属于信息冗余，以便兼容不同字段被采用的情况。
其中，camera_count用于指示采集6DoF媒体的所有摄像机的个数，称之为采集该媒体码流的摄像机数量标识，camera_count的取值为1~255。camera_count_contained用于表示6DoF媒体的当前文件中包括的摄像机对应的视角的数目，称之为当前文件中包括的摄像机对应的视角数目标识。
其中，padding_size_depth表示深度图的保护带宽度，padding_size_texture表示纹理图的保护带宽度。在视频编码的过程中，通常会加一些保护带来提升视频解码的容错率，即在图片帧的边缘填充一些额外的像素。
camera_id表示每个视角对应的摄像机标识符。camera_resolution_x、camera_resolution_y表示摄像机采集的纹理图、深度图的分辨率宽度与高度，分别表示对应摄像机采集的X和Y方向上的分辨率。depth_downsample_factor表示对应深度图的降采样倍数因子，深度图的实际分辨率宽度与高度为摄像机采集分辨率宽度与高度的1/2^depth_downsample_factor。
depth_vetex_x,depth_vetex_y分别表示对应深度图左上顶点相对于平面帧的原点(平面帧的左上顶点)偏移量中的X,Y分量值。
texture_vetex_x,texture_vetex_y分别表示对应纹理图左上顶点相对于平面帧的原点(平面帧的左上顶点)偏移量中的X,Y分量值。
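为便于理解上述字段之间的关系，下面给出一个示意性的Python数据结构（字段名与上文一致，结构本身为本示例的假设，并非真实的文件解析代码），并演示如何由采集分辨率与depth_downsample_factor推算深度图的实际分辨率。
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class CameraRegion:
    camera_id: int                    # 该视角对应的摄像机标识符
    camera_resolution_x: int          # 摄像机采集的纹理图/深度图分辨率宽度
    camera_resolution_y: int          # 分辨率高度
    depth_downsample_factor: int      # 深度图的降采样倍数因子
    texture_vetex: Tuple[int, int]    # 纹理图左上顶点相对平面帧原点的偏移(X, Y)
    depth_vetex: Tuple[int, int]      # 深度图左上顶点相对平面帧原点的偏移(X, Y)

    def depth_resolution(self) -> Tuple[int, int]:
        # 深度图实际分辨率 = 采集分辨率 x 1/2^depth_downsample_factor
        scale = 2 ** self.depth_downsample_factor
        return self.camera_resolution_x // scale, self.camera_resolution_y // scale

@dataclass
class LargeScaleAtlasInfo:
    camera_count: int                 # 采集该媒体码流的所有摄像机个数
    camera_count_contained: int       # 当前文件中包括的摄像机对应的视角数目
    padding_size_texture: int         # 纹理图边缘保护带宽度，0表示没有边缘保护
    padding_size_depth: int           # 深度图边缘保护带宽度
    regions: List[CameraRegion]
例如，对于采集分辨率为100×100、depth_downsample_factor为1的视角，depth_resolution()返回(50, 50)，即深度图的实际分辨率为50×50。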
二、DASH MPD信令扩展
可以在DASH MPD信令如下表3所示的表格中扩展第二应用场景类型字段v3cAppType。
表3—表示元素的语义(Semantics of Representation element)
（表3为Representation元素语义表，其中扩展的v3cAppType属性用于指示该Representation对应媒体码流的应用场景类型。）
对应上述图7实施例,假设服务器端存在1个多视角视频A,且该多视角视频A的图集信息为大尺度图集信息。
此时:application_type=1;
large_scale_atlas_flag=1:camera_count=4;camera_count_contained=4;
padding_size_depth=0;padding_size_texture=0;
{camera_id=1;camera_resolution_x=100;camera_resolution_y=100;
depth_downsample_factor=0;texture_vetex=(0,0);depth_vetex=(0,200)}//视角1纹理图和视角1深度图
{camera_id=2;camera_resolution_x=100;camera_resolution_y=100;
depth_downsample_factor=0;texture_vetex=(100,0);depth_vetex=(100,200)}//视角2纹理图和视角2深度图
{camera_id=3;camera_resolution_x=100;camera_resolution_y=100;
depth_downsample_factor=0;texture_vetex=(0,100);depth_vetex=(0,300)}//视角3纹理图和视角3深度图
{camera_id=4;camera_resolution_x=100;camera_resolution_y=100;
depth_downsample_factor=0;texture_vetex=(100,100);depth_vetex=(100,300)}//视角4纹理图和视角4深度图
以上的系统描述,对应了图7的平面帧各个区域的数据构成。
对应上述图8实施例,假设服务器端存在1个多视角视频A,且该多视角视频A的图集信息为大尺度图集信息。
此时:application_type=1;
large_scale_atlas_flag=1:camera_count=4;camera_count_contained=4;
padding_size_depth=0;padding_size_texture=0;
{camera_id=1;camera_resolution_x=100;camera_resolution_y=100;
depth_downsample_factor=0;texture_vetex=(0,0);depth_vetex=(200,0)}
{camera_id=2;camera_resolution_x=100;camera_resolution_y=100;
depth_downsample_factor=0;texture_vetex=(100,0);depth_vetex=(300,0)}
{camera_id=3;camera_resolution_x=100;camera_resolution_y=100;
depth_downsample_factor=0;texture_vetex=(0,100);depth_vetex=(200,100)}
{camera_id=4;camera_resolution_x=100;camera_resolution_y=100;
depth_downsample_factor=0;texture_vetex=(100,100);depth_vetex=(300,100)}
以上的系统描述,对应了图8的平面帧各个区域的数据构成。
对应上述图9实施例,假设服务器端存在1个多视角视频A,且该多视角视频A的图集信息为大尺度图集信息。
此时:application_type=1;
large_scale_atlas_flag=1:camera_count=4;camera_count_contained=4;
padding_size_depth=0;padding_size_texture=0;
{camera_id=1;camera_resolution_x=100;camera_resolution_y=100;
depth_downsample_factor=1;texture_vetex=(0,0);depth_vetex=(0,200)}
{camera_id=2;camera_resolution_x=100;camera_resolution_y=100;
depth_downsample_factor=1;texture_vetex=(100,0);depth_vetex=(50,200)}
{camera_id=3;camera_resolution_x=100;camera_resolution_y=100;
depth_downsample_factor=1;texture_vetex=(0,100);depth_vetex=(100,200)}
{camera_id=4;camera_resolution_x=100;camera_resolution_y=100;
depth_downsample_factor=1;texture_vetex=(100,100);depth_vetex=(150,200)}
以上的系统描述,对应了图9的平面帧各个区域的数据构成。
需要说明的是，padding_size_depth和padding_size_texture没有绝对的取值范围，不同取值对本申请实施例提供的方法没有影响。本方案只负责指示padding_size_depth和padding_size_texture的大小，其具体取值由编码算法决定，与本申请实施例提供的方法无关。
其中，camera_resolution_x和camera_resolution_y用于计算深度图的实际分辨率宽度和高度，即每个摄像机的采集分辨率。多视角视频由多个摄像机拍摄，不同的摄像机分辨率可以不同，这里将所有视角的分辨率宽度和高度均举例为100像素只是为了便于说明，实际并不限定于此。
可以理解的是,并不限于上述组合方式,本申请实施例提供的方法可以对任意组合进行对应指示。
当第一设备上安装的客户端收到服务器端发送的多视角视频的封装文件后,通过解析封装文件中的对应字段,即可将多视角视频平面帧的各个区域与不同摄像机的纹理图、深度图对应。再通过解码多视角视频的媒体码流中的摄像机参数信息,即可将平面帧的各个区域还原到3D渲染呈现区域,从而消费多视角视频。
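作为示意（Python伪实现，沿用上文示意的LargeScaleAtlasInfo结构，平面帧的切片方式等均为本示例的假设），客户端可以按如下方式利用解析出的偏移量与分辨率，将多视角视频平面帧的各个区域与不同摄像机的纹理图、深度图对应起来。
def split_plane_frame(plane_frame, atlas_info):
    """plane_frame: 解码后的多视角视频平面帧，这里假设其支持二维切片（例如numpy数组，形状为[高, 宽, ...]）。
    atlas_info: 上文示意的LargeScaleAtlasInfo。
    返回 {camera_id: (纹理图区域, 深度图区域)}，即平面帧各区域与各摄像机纹理图、深度图的对应关系。"""
    result = {}
    for region in atlas_info.regions:
        tx, ty = region.texture_vetex
        dx, dy = region.depth_vetex
        tw, th = region.camera_resolution_x, region.camera_resolution_y
        dw, dh = region.depth_resolution()
        result[region.camera_id] = (
            plane_frame[ty:ty + th, tx:tx + tw],   # 该视角的纹理图区域
            plane_frame[dy:dy + dh, dx:dx + dw],   # 该视角的深度图区域
        )
    return result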
对应上述图10实施例进行举例说明。假设对于同一目标媒体内容,服务器端存在3个不同形式的6DoF媒体,分别为多视角视频A(大尺度图集信息)、V-PCC点云媒体B、G-PCC点云媒体C。则服务器端在对这三个媒体码流进行封装时,对VolumetricVisualMediaHeaderBox数据盒中的application_type字段对应赋值。具体而言,多视角视频A:application_type=1;V-PCC点云媒体B:application_type=2;G-PCC点云媒体C:application_type=3。
同时,在MPD文件中描述多视角视频A(大尺度图集信息)、V-PCC点云媒体B、G-PCC点云媒体C三个Representation的应用场景类型,即v3cAppType字段的取值分别为多视角视频A:v3cAppType=1;V-PCC点云媒体B:v3cAppType=2;G-PCC点云媒体C:v3cAppType=3。
然后,服务器将MPD信令对应的目标描述文件下发给第一设备上安装的客户端。
当客户端收到服务器端发送的MPD信令对应的目标描述文件后,根据客户端设备能力和呈现需求,请求对应的应用场景类型的目标媒体码流的目标封装文件。假设第一设备的客户端处理能力较低,因此客户端请求多视角视频A的目标封装文件。
则服务器端将多视角视频A的目标封装文件发送给第一设备的客户端。
第一设备的客户端在收到服务器端发送的多视角视频A的目标封装文件后，根据VolumetricVisualMediaHeaderBox数据盒中的application_type字段，确定当前6DoF媒体文件的应用场景类型，即可对应处理。不同的应用场景类型会有不同的解码和渲染处理算法。
以多视角视频为例,若application_type=1,则说明该多视角视频的图集信息是以摄像机采集的深度图、纹理图为单位,因此客户端可以通过相对简单的处理算法对该多视角视频进行处理。
需要说明的是,在其他实施例中,除了DASH MPD,还可以对类似的信令文件进行相似扩展,在信令文件中指示不同媒体文件的应用场景类型。
在示例性实施例中,获取所述目标媒体内容在其对应的应用场景下的媒体码流,可以包括:接收第二设备发送的第一多视角视频的第一封装文件和第三设备发送的第二多视角视频的第二封装文件;分别解封装所述第一封装文件和所述第二封装文件,获得所述第一多视角视频和所述第二多视角视频;分别解码所述第一多视角视频和所述第二多视角视频,获得所述第一多视角视频中的第一深度图和第一纹理图、以及所述第二多视角视频中的第二深度图和第二纹理图;根据所述第一深度图、所述第二深度图、所述第一纹理图和所述第二纹理图,获得合并多视角视频。
其中,所述第二设备上可以安装有第一数量的摄像机,所述第三设备上可以安装有第二数量的摄像机,所述第二设备和所述第三设备分别利用各自的摄像机针对同一场景进行多视角视频采集拍摄,获得所述第一多视角视频和所述第二多视角视频。
其中,所述第一封装文件和所述第二封装文件中均可以包括所述第一应用场景类型字段,且所述第一封装文件和所述第二封装文件中的第一应用场景类型字段的取值分别用于表示所述第一多视角视频和所述第二多视角视频为大尺度图集信息的多视角视频的第二值。
图12示意性示出了根据本申请的一实施例的媒体文件的封装方法的流程图。如图12所示,本申请实施例提供的方法可以包括以下步骤。
在步骤S1210中,接收第二设备发送的第一多视角视频的第一封装文件和第三设备发送的第二多视角视频的第二封装文件。
在步骤S1220中,分别解封装所述第一封装文件和所述第二封装文件,获得所述第一多视角视频和所述第二多视角视频。
在步骤S1230中,分别解码所述第一多视角视频和所述第二多视角视频,获得所述第一多视角视频中的第一深度图和第一纹理图、以及所述第二多视角视频中的第二深度图和第二纹理图。
在步骤S1240中,根据所述第一深度图、所述第二深度图、所述第一纹理图和所述第二纹理图,获得合并多视角视频。
在步骤S1250中，封装合并多视角视频，生成合并多视角视频的封装文件，封装文件中包括第一应用场景类型字段，第一应用场景类型字段的取值为第二值，用于指示合并多视角视频对应的应用场景为大尺度图集信息的多视角视频。
在步骤S1260中,将所述封装文件发送至第一设备,以便所述第一设备根据所述第一应用场景类型字段获得所述合并多视角视频对应的应用场景,并根据所述合并多视角视频对应的应用场景确定所述合并多视角视频的解码或者渲染方式。
下面结合图13和14对图12实施例提供的方法进行举例说明。假设第二设备和第三设备分别为无人机A和无人机B(但本申请并不限定于此),且假设无人机A和无人机B上各自安装了2个摄像机(即第一数量和第二数量均等于2,但本申请并不限定于此,可以根据实际场景进行设置)。利用无人机A和无人机B对同一场景进行多视角视频采集拍摄,则无人机A在采集制作第一多视角视频的过程中,对应封装第一多视角视频的第一封装文件如下:
application_type=1;
large_scale_atlas_flag=1:camera_count=4;camera_count_contained=2;
padding_size_depth=0;padding_size_texture=0;
{camera_id=1;camera_resolution_x=100;camera_resolution_y=100;
depth_downsample_factor=1;texture_vetex=(0,0);depth_vetex=(0,100)}//视角1纹理图和视角1深度图
{camera_id=2;camera_resolution_x=100;camera_resolution_y=100;
depth_downsample_factor=1;texture_vetex=(100,0);depth_vetex=(100,100)}//视角2纹理图和视角2深度图
以上的系统描述,对应了图13的平面帧各个区域的数据构成,这里以上下拼接方式进行举例说明。
无人机B在采集制作第二多视角视频的过程中,对应封装第二多视角视频的第二封装文件如下:
application_type=1;
large_scale_atlas_flag=1:camera_count=4;camera_count_contained=2;
padding_size_depth=0;padding_size_texture=0;
{camera_id=3;camera_resolution_x=100;camera_resolution_y=100;
depth_downsample_factor=1;texture_vetex=(0,0);depth_vetex=(0,100)}//视角3纹理图和视角3深度图
{camera_id=4;camera_resolution_x=100;camera_resolution_y=100;
depth_downsample_factor=1;texture_vetex=(100,0);depth_vetex=(100,100)}//视角4纹理图和视角4深度图
以上的系统描述,对应了如图14所示的平面帧各个区域的数据构成。
在服务器端，接收到不同无人机拍摄的第一封装文件和第二封装文件后，对第一封装文件和第二封装文件进行解封装和解码，将所有的深度图、纹理图合并（这里假设对深度图进行了降采样），得到合并多视角视频。
深度图的重要性不如纹理图，降采样后可以降低数据量，本申请实施例以这种情景进行指示说明，但并不限定于这种情景。
对合并多视角视频进行封装后,可以获得如下所示的封装文件:
application_type=1;
large_scale_atlas_flag=1:camera_count=4;camera_count_contained=4;
padding_size_depth=0;padding_size_texture=0;
{camera_id=1;camera_resolution_x=100;camera_resolution_y=100;
depth_downsample_factor=1;texture_vetex=(0,0);depth_vetex=(0,200)}
{camera_id=2;camera_resolution_x=100;camera_resolution_y=100;
depth_downsample_factor=1;texture_vetex=(100,0);depth_vetex=(50,200)}
{camera_id=3;camera_resolution_x=100;camera_resolution_y=100;
depth_downsample_factor=1;texture_vetex=(0,100);depth_vetex=(100,200)}
{camera_id=4;camera_resolution_x=100;camera_resolution_y=100;
depth_downsample_factor=1;texture_vetex=(100,100);depth_vetex=(150,200)}
以上的系统描述,对应了上述图9所示的平面帧各个区域的数据构成。
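上述合并封装文件中各视角偏移量的取值，可以通过类似如下的方式计算得到（Python伪实现，按图9的排布方式，即纹理图按每行2个视角的网格排布、深度图统一按1/2降采样后依次拼接在纹理区域下方，函数与参数均为本示例的假设）。
def merge_layout(cameras, cols=2):
    """示意：合并多视角视频时重新计算各视角的texture_vetex与depth_vetex。
    cameras: [{'camera_id': int, 'camera_resolution_x': int, 'camera_resolution_y': int}, ...]，
    这里假设各视角分辨率相同，且深度图的降采样倍数因子均为1。"""
    w = cameras[0]['camera_resolution_x']
    h = cameras[0]['camera_resolution_y']
    rows = (len(cameras) + cols - 1) // cols
    tex_h = rows * h                          # 纹理区域总高度，降采样后的深度图拼接于其下方
    layout = []
    for i, cam in enumerate(cameras):
        r, c = divmod(i, cols)
        layout.append({
            'camera_id': cam['camera_id'],
            'depth_downsample_factor': 1,
            'texture_vetex': (c * w, r * h),
            'depth_vetex': (i * (w // 2), tex_h),
        })
    return layout
用图13和图14中的4个100×100视角依次作为输入调用merge_layout，得到的texture_vetex和depth_vetex与上文合并封装文件中的取值一致。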
当第一设备的客户端接收到服务器端发送的合并多视角视频的封装文件后,通过解析封装文件中的对应字段,即可将合并多视角视频平面帧的各个区域与不同摄像机的纹理图、深度图对应。再通过解码合并多视角视频的媒体码流中的摄像机参数信息,即可将平面帧的各个区域还原到3D渲染呈现区域,从而消费合并多视角视频。
本申请实施方式提供的媒体文件的封装方法，对于6DoF媒体中的多视角视频应用，提出一种在文件封装中指示多视角视频深度图、纹理图相关信息的方法，使得多视角视频不同视角的深度图、纹理图的封装组合方式更加灵活，可以支持不同的应用场景。如上述实施例所述，有些场景由不同的设备分别拍摄并封装成两个文件，本申请实施例提供的方法可以将这两个文件关联起来结合消费；否则在上述实施例中，只能把两个文件分别呈现，无法联合呈现。
本申请实施例提供的媒体文件的解封装方法可以由任意的电子设备来执行,在下面的举例说明中,以该媒体文件的解封装方法应用于沉浸式系统的中间节点或者第一设备(例如播放器端)为例进行举例说明,但本申请并不限定于此。
图15示意性示出了根据本申请的一实施例的媒体文件的解封装方法的流程图。如图15所示,本申请实施例提供的方法可以包括以下步骤。
在步骤S1510中,接收目标媒体内容在其对应的应用场景下的媒体码流的封装文件,所述封装文件中包括第一应用场景类型字段,所述第一应用场景类型字段用于指示所述媒体码流对应的应用场景。
在示例性实施例中,所述方法还可以包括:接收所述目标媒体内容的目标描述文件,所述目标描述文件中包括第二应用场景类型字段,所述第二应用场景类型字段用于指示所述媒体码流对应的应用场景;根据所述第二应用场景类型字段,在所述媒体码流的封装文件中确定目标媒体码流的目标封装文件。
相应地,接收目标媒体内容在其对应的应用场景下的媒体码流的封装文件,可以包括:接收所述目标封装文件,以根据所述目标封装文件中的第一应用场景类型字段,确定所述目标媒体码流的目标应用场景。
在步骤S1520中,解封装所述封装文件,获得所述第一应用场景类型字段。
在步骤S1530中,根据所述第一应用场景类型字段,确定所述媒体码流对应的应用场景。
在步骤S1540中,根据所述媒体码流对应的应用场景,确定所述媒体码流的解码方式和渲染方式中的至少一种。
在示例性实施例中,若所述第一应用场景类型字段的取值为表示所述媒体码流为大尺度图集信息的多视角视频的第二值,则所述方法还可以包括:解析所述封装文件,获得所述媒体码流中包括的摄像机对应视角采集的纹理图和深度图与所述大尺度图集信息中的平面帧之间的映射关系;解码所述媒体码流,获得所述媒体码流中的摄像机参数;根据所述映射关系和所述摄像机参数,在三维空间呈现所述多视角视频。
本申请实施例提供的媒体文件的解封装方法的其他内容可以参照上述其他实施例中的媒体文件的封装方法。
本申请实施例提供的媒体文件的封装装置可以设置于任意的电子设备, 在下面的举例说明中,以设置于沉浸式系统的服务器端为例进行举例说明,但本申请并不限定于此。
图16示意性示出了根据本申请的一实施例的媒体文件的封装装置的框图。如图16所示,本申请实施例提供的媒体文件的封装装置1600可以包括媒体码流获取单元1610、媒体码流封装单元1620以及封装文件发送单元1630。
本申请实施例中，媒体码流获取单元1610可以用于获取目标媒体内容在其对应的应用场景下的媒体码流。媒体码流封装单元1620可以用于封装所述媒体码流，生成所述媒体码流的封装文件，所述封装文件中包括第一应用场景类型字段，所述第一应用场景类型字段用于指示所述媒体码流对应的应用场景。封装文件发送单元1630可以用于将所述封装文件发送至第一设备，以便所述第一设备根据所述第一应用场景类型字段确定所述媒体码流对应的应用场景，并根据所述媒体码流对应的应用场景确定所述媒体码流的解码方式和渲染方式中的至少一种。
本申请实施方式提供的媒体文件的封装装置,通过在生成对应应用场景下的媒体码流的封装文件时,在封装文件中扩展第一应用场景类型字段,通过该第一应用场景类型字段来指示该媒体码流对应的应用场景,由此实现在媒体文件的封装中对不同媒体码流对应的应用场景进行区分,一方面,将该封装文件发送至第一设备时,该第一设备可以根据该封装文件中的第一应用场景类型字段区分该媒体码流的应用场景,从而可以根据该媒体码流对应的应用场景确定对该媒体码流采取何种解码方式或者渲染方式,可以节约第一设备的运算能力和资源;另一方面,由于在封装阶段即可确定媒体码流的应用场景,因此即使第一设备不具备媒体码流的解码能力,也可以确定该媒体码流对应的应用场景,而不需要等到解码该媒体码流之后才能区分。
在示例性实施例中,媒体码流封装单元1620可以包括:第一应用场景类型字段添加单元,可以用于在目标媒体文件格式数据盒的容积可视媒体头数据盒中添加所述第一应用场景类型字段;第一应用场景类型字段取值确定单元,可以用于根据所述媒体码流对应的应用场景,确定所述第一应用场景类型字段的取值。
在示例性实施例中,所述第一应用场景类型字段的取值可以包括以下中的任意一项:表示所述媒体码流为非大尺度图集信息的多视角视频的第一值;表示所述媒体码流为大尺度图集信息的多视角视频的第二值;表示所述媒体码流为基于传统视频编码方式进行压缩的点云媒体的第三值;表示所述媒体码流为基于几何特征进行压缩的点云媒体的第四值。
在示例性实施例中，在所述第一应用场景类型字段的取值等于所述第二值的情况下，媒体文件的封装装置1600还可以包括：单轨道大尺度图集标识添加单元，可以用于若所述媒体码流按照单轨道封装，则在所述目标媒体文件格式数据盒的位流样本入口中添加大尺度图集标识；单轨道摄像机视角标识添加单元，可以用于若所述大尺度图集标识指示所述媒体码流为大尺度图集信息的多视角视频，则在所述位流样本入口中添加采集所述媒体码流的摄像机数量标识、所述媒体码流的当前文件中包括的摄像机对应的视角数目标识；单轨道纹理深度图分辨率添加单元，可以用于在所述位流样本入口中添加所述当前文件中包括的摄像机对应的视角采集的纹理图和深度图的分辨率。
在示例性实施例中,媒体文件的封装装置1600还可以包括以下中的至少一项:单轨道降采样倍数因子添加单元,可以用于在所述位流样本入口中添加所述当前文件中包括的摄像机对应的视角采集的深度图的降采样倍数因子;单轨道纹理图偏移量添加单元,可以用于在所述位流样本入口中添加所述当前文件中包括的摄像机对应的视角采集的纹理图左上顶点相对于所述大尺度图集信息中的平面帧的原点偏移量;单轨道深度图偏移量添加单元,可以用于在所述位流样本入口中添加所述当前文件中包括的摄像机对应的视角采集的深度图左上顶点相对于所述大尺度图集信息中的平面帧的原点偏移量;单轨道保护带宽度添加单元,可以用于在所述位流样本入口中添加所述当前文件中包括的摄像机对应的视角采集的纹理图和深度图的保护带宽度。
在示例性实施例中,在所述第一应用场景类型字段的取值等于所述第二值的情况下,媒体文件的封装装置1600还可以包括:多轨道大尺度图集标识添加单元,可以用于若所述媒体码流按照多轨道封装,则在所述目标媒体文件格式数据盒的样本入口中添加大尺度图集标识;多轨道摄像机视角标识添加单元,可以用于若所述大尺度图集标识指示所述媒体码流为大尺度图集信息的多视角视频,则在所述样本入口中添加采集所述媒体码流的摄像机数量标识、所述媒体码流的当前文件中包括的摄像机对应的视角数目标识;多轨道纹理深度图分辨率添加单元,可以用于在所述样本入口中添加所述当前文件中包括的摄像机对应的视角采集的纹理图和深度图的分辨率。
在示例性实施例中,媒体文件的封装装置1600还可以包括以下中的至少一项:多轨道降采样倍数因子添加单元,可以用于在所述样本入口中添加所述当前文件中包括的摄像机对应的视角采集的深度图的降采样倍数因子;多轨道纹理图偏移量添加单元,可以用于在所述样本入口中添加所述当前文件中包括的摄像机对应的视角采集的纹理图左上顶点相对于所述大尺度图集信息中的平面帧的原点偏移量;多轨道深度图偏移量添加单元,可以用于在所述样本入口中添加所述当前文件中包括的摄像机对应的视角采集的深度图左上顶点相对于所述大尺度图集信息中的平面帧的原点偏移量;多轨道保护带宽度添加单元,可以用于在所述样本入口中添加所述当前文件中包括的摄像机对应的视角采集的纹理图和深度图的保护带宽度。
在示例性实施例中，媒体文件的封装装置1600还可以包括：目标描述文件生成单元，可以用于生成所述目标媒体内容的目标描述文件，所述目标描述文件中包括第二应用场景类型字段，所述第二应用场景类型字段用于指示所述媒体码流对应的应用场景；目标描述文件发送单元，可以用于发送所述目标描述文件至所述第一设备，以便所述第一设备根据所述第二应用场景类型字段，在所述媒体码流的封装文件中确定目标媒体码流对应的目标封装文件。其中，封装文件发送单元1630可以包括：目标封装文件发送单元，可以用于将所述目标封装文件发送至所述第一设备，以便所述第一设备根据所述目标封装文件中的第一应用场景类型字段确定所述目标媒体码流对应的目标应用场景。
在示例性实施例中,目标描述文件生成单元可以包括:第二应用场景类型字段添加单元,可以用于在所述目标媒体内容的基于超文本传送协议的动态自适应流媒体传输的目标描述文件中添加所述第二应用场景类型字段;第二应用场景类型字段取值确定单元,可以用于根据所述媒体码流对应的应用场景,确定所述第二应用场景类型字段的取值。
在示例性实施例中，媒体码流获取单元1610可以包括：封装文件接收单元，可以用于接收第二设备发送的第一多视角视频的第一封装文件和第三设备发送的第二多视角视频的第二封装文件；封装文件解封装单元，可以用于分别解封装所述第一封装文件和所述第二封装文件，获得所述第一多视角视频和所述第二多视角视频；多视角视频解码单元，可以用于分别解码所述第一多视角视频和所述第二多视角视频，获得所述第一多视角视频中的第一深度图和第一纹理图、以及所述第二多视角视频中的第二深度图和第二纹理图；多视角视频合并单元，可以用于根据所述第一深度图、所述第二深度图、所述第一纹理图和所述第二纹理图，获得合并多视角视频。
在示例性实施例中,所述第二设备上可以安装有第一数量的摄像机,所述第三设备上安装有第二数量的摄像机,所述第二设备和所述第三设备分别利用各自的摄像机可以针对同一场景进行多视角视频采集拍摄,获得所述第一多视角视频和所述第二多视角视频。其中,所述第一封装文件和所述第二封装文件中可以均包括所述第一应用场景类型字段,且所述第一封装文件和所述第二封装文件中的第一应用场景类型字段的取值可以分别用于表示所述第一多视角视频和所述第二多视角视频为大尺度图集信息的多视角视频的第二值。
在示例性实施例中,所述媒体码流可以包括六自由度媒体码流和受限六自由度媒体码流。
本申请实施例提供的媒体文件的封装装置中的各个单元的具体实现可以参照上述媒体文件的封装方法中的内容,在此不再赘述。
本申请实施例提供的媒体文件的解封装装置可以设置于任意的电子设备，在下面的举例说明中，以设置于沉浸式系统的中间节点或者第一设备（例如播放器端）为例进行举例说明，但本申请并不限定于此。
图17示意性示出了根据本申请的一实施例的媒体文件的解封装装置的框图。如图17所示,本申请实施例提供的媒体文件的解封装装置1700可以包括封装文件接收单元1710、文件解封装单元1720、应用场景获得单元1730以及解码渲染确定单元1740。
本申请实施例中,封装文件接收单元1710可以用于接收目标媒体内容在其对应的应用场景下的媒体码流的封装文件,所述封装文件中包括第一应用场景类型字段,所述第一应用场景类型字段用于指示所述媒体码流对应的应用场景。文件解封装单元1720可以用于解封装所述封装文件,获得所述第一应用场景类型字段。应用场景获得单元1730可以用于根据所述第一应用场景类型字段,确定所述媒体码流对应的应用场景。解码渲染确定单元1740可以用于根据所述媒体码流对应的应用场景,确定所述媒体码流的解码方式和渲染方式中的至少一种。
在示例性实施例中,在所述第一应用场景类型字段的取值为表示所述媒体码流为大尺度图集信息的多视角视频的第二值的情况下,媒体文件的解封装装置1700还可以包括:封装文件解析单元,可以用于解析所述封装文件,获得所述媒体码流中包括的摄像机对应视角采集的纹理图和深度图与所述大尺度图集信息中的平面帧之间的映射关系;媒体码流解码单元,可以用于解码所述媒体码流,获得所述媒体码流中的摄像机参数;多视角视频呈现单元,可以用于根据所述映射关系和所述摄像机参数,在三维空间呈现所述多视角视频。
在示例性实施例中,媒体文件的解封装装置1700还可以包括:目标描述文件接收单元,可以用于接收所述目标媒体内容的目标描述文件,所述目标描述文件中包括第二应用场景类型字段,所述第二应用场景类型字段用于指示所述媒体码流对应的应用场景;目标封装文件确定单元,可以用于根据所述第二应用场景类型字段,在所述媒体码流的封装文件中确定目标媒体码流的目标封装文件。其中,封装文件接收单元1710可以包括:目标应用场景确定单元,可以用于接收所述目标封装文件,以根据所述目标封装文件中的第一应用场景类型字段,确定所述目标媒体码流的目标应用场景。
本申请实施例提供的媒体文件的解封装装置中的各个单元的具体实现可以参照上述媒体文件的解封装方法中的内容,在此不再赘述。
应当注意,尽管在上文详细描述中提及了用于动作执行的设备的若干单元,但是这种划分并非强制性的。实际上,根据本申请的实施方式,上文描述的两个或更多单元的特征和功能可以在一个单元中具体化。反之,上文描述的一个单元的特征和功能可以进一步划分为由多个单元来具体化。
本申请实施例提供了一种计算机可读存储介质,其上存储有计算机程序,所述程序被处理器执行时实现如上述实施例中所述的媒体文件的封装方法。
本申请实施例提供了一种计算机可读存储介质,其上存储有计算机程序,所述程序被处理器执行时实现如上述实施例中所述的媒体文件的解封装方法。
本申请实施例提供了一种电子设备,包括:至少一个处理器;存储装置,配置为存储至少一个程序,当所述至少一个程序被所述至少一个处理器执行时,使得所述至少一个处理器实现如上述实施例中所述的媒体文件的封装方法。
本申请实施例提供了一种电子设备,包括:至少一个处理器;存储装置,配置为存储至少一个程序,当所述至少一个程序被所述至少一个处理器执行时,使得所述至少一个处理器实现如上述实施例中所述的媒体文件的解封装方法。
图18示出了适于用来实现本申请实施例的电子设备的结构示意图。
需要说明的是,图18示出的电子设备1800仅是一个示例,不应对本申请实施例的功能和使用范围带来任何限制。
如图18所示,电子设备1800包括中央处理单元(CPU,Central Processing Unit)1801,其可以根据存储在只读存储器(ROM,Read-Only Memory)1802中的程序或者从储存部分1808加载到随机访问存储器(RAM,Random Access Memory)1803中的程序而执行各种适当的动作和处理。在RAM 1803中,还存储有系统操作所需的各种程序和数据。CPU 1801、ROM 1802以及RAM 1803通过总线1804彼此相连。输入/输出(input/output,I/O)接口1805也连接至总线1804。
以下部件连接至I/O接口1805:包括键盘、鼠标等的输入部分1806;包括诸如阴极射线管(CRT,Cathode Ray Tube)、液晶显示器(LCD,Liquid Crystal Display)等以及扬声器等的输出部分1807;包括硬盘等的储存部分1808;以及包括诸如LAN(Local Area Network,局域网)卡、调制解调器等的网络接口卡的通信部分1809。通信部分1809经由诸如因特网的网络执行通信处理。驱动器1810也根据需要连接至I/O接口1805。可拆卸介质1811,诸如磁盘、光盘、磁光盘、半导体存储器等等,根据需要安装在驱动器1810上,以便于从其上读出的计算机程序根据需要被安装入储存部分1808。
特别地,根据本申请的实施例,下文参考流程图描述的过程可以被实现为计算机软件程序。例如,本申请的实施例包括一种计算机程序产品,其包括承载在计算机可读存储介质上的计算机程序,该计算机程序包含用于执行流程图所示的方法的程序代码。在这样的实施例中,该计算机程序可以通过通信部分1809从网络上被下载和安装,和/或从可拆卸介质1811被安装。在该计算机程序被中央处理单元(CPU)1801执行时,执行本申请的方法和/ 或装置中限定的各种功能。
需要说明的是，本申请所示的计算机可读存储介质可以是计算机可读信号介质或者计算机可读存储介质或者是上述两者的任意组合。计算机可读存储介质例如可以是——但不限于——电、磁、光、电磁、红外线、或半导体的系统、装置或器件，或者任意以上的组合。计算机可读存储介质的更具体的例子可以包括但不限于：具有至少一个导线的电连接、便携式计算机磁盘、硬盘、随机访问存储器（RAM）、只读存储器（ROM）、可擦式可编程只读存储器（EPROM（Erasable Programmable Read Only Memory，可擦除可编程只读存储器）或闪存）、光纤、便携式紧凑磁盘只读存储器（CD-ROM）、光存储器件、磁存储器件、或者上述的任意合适的组合。在本申请中，计算机可读存储介质可以是任何包含或存储程序的有形介质，该程序可以被指令执行系统、装置或者器件使用或者与其结合使用。而在本申请中，计算机可读的信号介质可以包括在基带中或者作为载波一部分传播的数据信号，其中承载了计算机可读的程序代码。这种传播的数据信号可以采用多种形式，包括但不限于电磁信号、光信号或上述的任意合适的组合。计算机可读的信号介质还可以是计算机可读存储介质以外的任何计算机可读介质，该计算机可读介质可以发送、传播或者传输用于由指令执行系统、装置或者器件使用或者与其结合使用的程序。计算机可读存储介质上包含的程序代码可以用任何适当的介质传输，包括但不限于：无线、电线、光缆、RF（Radio Frequency，射频）等等，或者上述的任意合适的组合。
附图中的流程图和框图,图示了按照本申请各种实施例的方法、装置和计算机程序产品的可能实现的体系架构、功能和操作。在这点上,流程图或框图中的每个方框可以代表一个模块、程序段、或代码的一部分,上述模块、程序段、或代码的一部分包含至少一个用于实现规定的逻辑功能的可执行指令。也应当注意,在有些作为替换的实现中,方框中所标注的功能也可以以不同于附图中所标注的顺序发生。例如,两个接连地表示的方框实际上可以基本并行地执行,它们有时也可以按相反的顺序执行,这依所涉及的功能而定。也要注意的是,框图或流程图中的每个方框、以及框图或流程图中的方框的组合,可以用执行规定的功能或操作的专用的基于硬件的系统来实现,或者可以用专用硬件与计算机指令的组合来实现。
描述于本申请实施例中所涉及到的单元可以通过软件的方式实现,也可以通过硬件的方式来实现,所描述的单元也可以设置在处理器中。其中,这些单元的名称在某种情况下并不构成对该单元本身的限定。
作为另一方面，本申请还提供了一种计算机可读存储介质，该计算机可读存储介质可以是上述实施例中描述的电子设备中所包含的；也可以是单独存在，而未装配入该电子设备中。上述计算机可读存储介质承载有一个或者多个程序，当上述一个或者多个程序被一个该电子设备执行时，使得该电子设备实现如上述实施例中所述的方法。例如，所述的电子设备可以实现如图4或图5或图6或图10或图11或图12或图15所示的各个步骤。
通过以上的实施方式的描述,本领域的技术人员易于理解,这里描述的示例实施方式可以通过软件实现,也可以通过软件结合必要的硬件的方式来实现。因此,根据本申请实施方式的技术方案可以以软件产品的形式体现出来,该软件产品可以存储在一个非易失性存储介质(可以是CD-ROM,U盘,移动硬盘等)中或网络上,包括若干指令以使得一台计算设备(可以是个人计算机、服务器、触控终端、或者网络设备等)执行根据本申请实施方式的方法。
本领域技术人员在考虑说明书及实践这里申请的发明后,将容易想到本申请的其它实施方案。本申请旨在涵盖本申请的任何变型、用途或者适应性变化,这些变型、用途或者适应性变化遵循本申请的一般性原理并包括本申请未申请的本技术领域中的公知常识或惯用技术手段。说明书和实施例仅被视为示例性的,本申请的真正范围和精神由下面的权利要求指出。
应当理解的是,本申请并不局限于上面已经描述并在附图中示出的精确结构,并且可以在不脱离其范围进行各种修改和改变。本申请的范围仅由所附的权利要求来限制。

Claims (19)

  1. 一种媒体文件的封装方法,由电子设备执行,包括:
    获取目标媒体内容在其对应的应用场景下的媒体码流;
    封装所述媒体码流,生成所述媒体码流的封装文件;所述封装文件中包括第一应用场景类型字段,所述第一应用场景类型字段用于指示所述媒体码流对应的应用场景;
    将所述封装文件发送至第一设备,以便所述第一设备根据所述第一应用场景类型字段确定所述媒体码流对应的应用场景,并根据所述媒体码流对应的应用场景确定所述媒体码流的解码方式和渲染方式中的至少一种。
  2. 根据权利要求1所述的媒体文件的封装方法,所述封装所述媒体码流,生成所述媒体码流的封装文件,包括:
    在目标媒体文件格式数据盒的容积可视媒体头数据盒中添加所述第一应用场景类型字段;
    根据所述媒体码流对应的应用场景,确定所述第一应用场景类型字段的取值。
  3. 根据权利要求2所述的媒体文件的封装方法,所述第一应用场景类型字段的取值包括以下中的任意一项:
    表示所述媒体码流为非大尺度图集信息的多视角视频的第一值;
    表示所述媒体码流为大尺度图集信息的多视角视频的第二值;
    表示所述媒体码流为基于传统视频编码方式进行压缩的点云媒体的第三值;
    表示所述媒体码流为基于几何特征进行压缩的点云媒体的第四值。
  4. 根据权利要求3所述的媒体文件的封装方法,在所述第一应用场景类型字段的取值等于所述第二值的情况下,所述方法还包括:
    若所述媒体码流按照单轨道封装,则在所述目标媒体文件格式数据盒的位流样本入口中添加大尺度图集标识;
    若所述大尺度图集标识指示所述媒体码流为大尺度图集信息的多视角视频,则在所述位流样本入口中添加采集所述媒体码流的摄像机数量标识、所述媒体码流的当前文件中包括的摄像机对应的视角数目标识;
    在所述位流样本入口中添加所述当前文件中包括的摄像机对应的视角采集的纹理图和深度图的分辨率。
  5. 根据权利要求4所述的媒体文件的封装方法,所述方法还包括以下的至少一种信息添加方式:
    在所述位流样本入口中添加所述当前文件中包括的摄像机对应的视角采集的深度图的降采样倍数因子;
    在所述位流样本入口中添加所述当前文件中包括的摄像机对应的视角采集的纹理图左上顶点相对于所述大尺度图集信息中的平面帧的原点偏移量;
    在所述位流样本入口中添加所述当前文件中包括的摄像机对应的视角采集的深度图左上顶点相对于所述大尺度图集信息中的平面帧的原点偏移量;
    在所述位流样本入口中添加所述当前文件中包括的摄像机对应的视角采集的纹理图和深度图的保护带宽度。
  6. 根据权利要求3所述的媒体文件的封装方法,在所述第一应用场景类型字段的取值等于所述第二值的情况下,所述方法还包括:
    若所述媒体码流按照多轨道封装,则在所述目标媒体文件格式数据盒的样本入口中添加大尺度图集标识;
    若所述大尺度图集标识指示所述媒体码流为大尺度图集信息的多视角视频,则在所述样本入口中添加采集所述媒体码流的摄像机数量标识、所述媒体码流的当前文件中包括的摄像机对应的视角数目标识;
    在所述样本入口中添加所述当前文件中包括的摄像机对应的视角采集的纹理图和深度图的分辨率。
  7. 根据权利要求1所述的媒体文件的封装方法,所述方法还包括:
    生成所述目标媒体内容的目标描述文件;所述目标描述文件中包括第二应用场景类型字段,所述第二应用场景类型字段用于指示所述媒体码流对应的应用场景;
    发送所述目标描述文件至所述第一设备,以便所述第一设备根据所述第二应用场景类型字段,在所述媒体码流的封装文件中确定目标媒体码流对应的目标封装文件;
    所述将所述封装文件发送至第一设备,以便所述第一设备根据所述第一应用场景类型字段确定所述媒体码流对应的应用场景,包括:
    将所述目标封装文件发送至所述第一设备,以便所述第一设备根据所述目标封装文件中的第一应用场景类型字段确定所述目标媒体码流对应的目标应用场景。
  8. 根据权利要求7所述的媒体文件的封装方法,所述生成所述目标媒体内容的目标描述文件,包括:
    在所述目标媒体内容的基于超文本传送协议的动态自适应流媒体传输的目标描述文件中添加所述第二应用场景类型字段;
    根据所述媒体码流对应的应用场景,确定所述第二应用场景类型字段的取值。
  9. 根据权利要求1所述的媒体文件的封装方法,所述获取目标媒体内容在其对应的应用场景下的媒体码流,包括:
    接收第二设备发送的第一多视角视频的第一封装文件和第三设备发送的第二多视角视频的第二封装文件;
    分别解封装所述第一封装文件和所述第二封装文件,获得所述第一多视角视频和所述第二多视角视频;
    分别解码所述第一多视角视频和所述第二多视角视频,获得所述第一多视角视频中的第一深度图和第一纹理图、以及所述第二多视角视频中的第二深度图和第二纹理图;
    根据所述第一深度图、所述第二深度图、所述第一纹理图和所述第二纹理图,获得合并多视角视频。
  10. 根据权利要求9所述的媒体文件的封装方法,所述第二设备上安装有第一数量的摄像机,所述第三设备上安装有第二数量的摄像机,所述第二设备和所述第三设备分别利用各自的摄像机针对同一场景进行多视角视频采集拍摄,获得所述第一多视角视频和所述第二多视角视频;
    其中,所述第一封装文件和所述第二封装文件中均包括所述第一应用场景类型字段,且所述第一封装文件和所述第二封装文件中的第一应用场景类型字段的取值分别用于表示所述第一多视角视频和所述第二多视角视频为大尺度图集信息的多视角视频的第二值。
  11. 根据权利要求1~10任一项所述的媒体文件的封装方法,所述媒体码流包括六自由度媒体码流和受限六自由度媒体码流。
  12. 一种媒体文件的解封装方法,由电子设备执行,包括:
    接收目标媒体内容在其对应的应用场景下的媒体码流的封装文件,所述封装文件中包括第一应用场景类型字段,所述第一应用场景类型字段用于指示所述媒体码流对应的应用场景;
    解封装所述封装文件,获得所述第一应用场景类型字段;
    根据所述第一应用场景类型字段,确定所述媒体码流对应的应用场景;
    根据所述媒体码流对应的应用场景,确定所述媒体码流的解码方式和渲染方式中的至少一种。
  13. 根据权利要求12所述的媒体文件的解封装方法,在所述第一应用场景类型字段的取值为用于表示所述媒体码流为大尺度图集信息的多视角视频的第二值的情况下,所述方法还包括:
    解析所述封装文件,获得所述媒体码流中包括的摄像机对应视角采集的纹理图和深度图与所述大尺度图集信息中的平面帧之间的映射关系;
    解码所述媒体码流,获得所述媒体码流中的摄像机参数;
    根据所述映射关系和所述摄像机参数,在三维空间呈现所述多视角视频。
  14. 根据权利要求12~13任一项所述的媒体文件的解封装方法,所述方法还包括:
    接收所述目标媒体内容的目标描述文件,所述目标描述文件中包括第二应用场景类型字段,所述第二应用场景类型字段用于指示所述媒体码流对应的应用场景;
    根据所述第二应用场景类型字段,在所述媒体码流的封装文件中确定目标媒体码流的目标封装文件;
    所述接收目标媒体内容在其对应的应用场景下的媒体码流的封装文件,包括:
    接收所述目标封装文件,以根据所述目标封装文件中的第一应用场景类型字段,确定所述目标媒体码流的目标应用场景。
  15. 一种媒体文件的封装装置,包括:
    媒体码流获取单元,用于获取目标媒体内容在其对应的应用场景下的媒体码流;
    媒体码流封装单元,用于封装所述媒体码流,生成所述媒体码流的封装文件,所述封装文件中包括第一应用场景类型字段,所述第一应用场景类型字段用于指示所述媒体码流对应的应用场景;
    封装文件发送单元,用于将所述封装文件发送至第一设备,以便所述第一设备根据所述第一应用场景类型字段确定所述媒体码流对应的应用场景,并根据所述媒体码流对应的应用场景确定所述媒体码流的解码方式和渲染方式中的至少一种。
  16. 一种媒体文件的解封装装置,包括:
    封装文件接收单元,用于接收目标媒体内容在其对应的应用场景下的媒体码流的封装文件,所述封装文件中包括第一应用场景类型字段,所述第一应用场景类型字段用于指示所述媒体码流对应的应用场景;
    文件解封装单元,用于解封装所述封装文件,获得所述第一应用场景类型字段;
    应用场景获得单元,用于根据所述第一应用场景类型字段,确定所述媒体码流对应的应用场景;
    解码渲染确定单元,用于根据所述媒体码流对应的应用场景,确定所述媒体码流的解码方式和渲染方式中的至少一种。
  17. 一种电子设备,包括:
    至少一个处理器;
    存储装置,配置为存储至少一个程序,当所述至少一个程序被所述至少一个处理器执行时,使得所述至少一个处理器实现如权利要求1至11中任一项所述的媒体文件的封装方法或者如权利要求12至14中任一项所述的媒体文件的解封装方法。
  18. 一种计算机可读存储介质,所述计算机可读存储介质存储有计算机程序,所述计算机程序被处理器执行时实现如权利要求1至11中任一项所述的媒体文件的封装方法或者如权利要求12至14中任一项所述的媒体文件的解封装方法。
  19. 一种计算机程序产品,包括指令,当其在计算机上运行时,使得计算机执行如权利要求1至11中任一项所述的媒体文件的封装方法或者如权利要求12至14中任一项所述的媒体文件的解封装方法。
PCT/CN2021/118755 2020-10-14 2021-09-16 媒体文件的封装方法、媒体文件的解封装方法及相关设备 WO2022078148A1 (zh)

Priority Applications (4)

Application Number Priority Date Filing Date Title
JP2022561600A JP7471731B2 (ja) 2020-10-14 2021-09-16 メディアファイルのカプセル化方法、メディアファイルのカプセル化解除方法及び関連機器
EP21879194.5A EP4231609A4 (en) 2020-10-14 2021-09-16 METHOD FOR ENCAPSULATING MEDIA FILES, METHOD FOR ENCAPSULATING MEDIA FILES AND ASSOCIATED APPARATUS
KR1020227037594A KR102661694B1 (ko) 2020-10-14 2021-09-16 미디어 파일 캡슐화 방법, 미디어 파일 캡슐화 해제 방법 및 관련 디바이스
US17/954,134 US20230034937A1 (en) 2020-10-14 2022-09-27 Media file encapsulating method, media file decapsulating method, and related devices

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011098190.7A CN114374675B (zh) 2020-10-14 2020-10-14 媒体文件的封装方法、媒体文件的解封装方法及相关设备
CN202011098190.7 2020-10-14

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/954,134 Continuation US20230034937A1 (en) 2020-10-14 2022-09-27 Media file encapsulating method, media file decapsulating method, and related devices

Publications (1)

Publication Number Publication Date
WO2022078148A1 true WO2022078148A1 (zh) 2022-04-21

Family

ID=81138930

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/118755 WO2022078148A1 (zh) 2020-10-14 2021-09-16 媒体文件的封装方法、媒体文件的解封装方法及相关设备

Country Status (6)

Country Link
US (1) US20230034937A1 (zh)
EP (1) EP4231609A4 (zh)
JP (1) JP7471731B2 (zh)
KR (1) KR102661694B1 (zh)
CN (2) CN116248642A (zh)
WO (1) WO2022078148A1 (zh)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150269736A1 (en) * 2014-03-20 2015-09-24 Nokia Technologies Oy Method, apparatus and computer program product for filtering of media content
CN108833937A (zh) * 2018-05-30 2018-11-16 华为技术有限公司 视频处理方法和装置
WO2020002122A1 (en) * 2018-06-27 2020-01-02 Canon Kabushiki Kaisha Method, device, and computer program for transmitting media content
CN110651482A (zh) * 2017-03-30 2020-01-03 联发科技股份有限公司 发信isobmff的球面区域信息的方法和装置
CN111435991A (zh) * 2019-01-11 2020-07-21 上海交通大学 基于分组的点云码流封装方法和系统

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120212579A1 (en) * 2009-10-20 2012-08-23 Telefonaktiebolaget Lm Ericsson (Publ) Method and Arrangement for Multi-View Video Compression
US9544612B2 (en) 2012-10-04 2017-01-10 Intel Corporation Prediction parameter inheritance for 3D video coding
CN108616751B (zh) * 2016-12-12 2023-05-12 上海交通大学 媒体信息的处理方法、装置及系统
WO2018120294A1 (zh) * 2016-12-30 2018-07-05 华为技术有限公司 一种信息的处理方法及装置
US10559126B2 (en) 2017-10-13 2020-02-11 Samsung Electronics Co., Ltd. 6DoF media consumption architecture using 2D video decoder
WO2019235849A1 (ko) * 2018-06-06 2019-12-12 엘지전자 주식회사 360 비디오 시스템에서 오버레이 미디어 처리 방법 및 그 장치
CN110706355B (zh) * 2018-07-09 2021-05-11 上海交通大学 基于视频内容的指示信息标识方法、系统及存储介质
CN110704673B (zh) * 2018-07-09 2022-09-23 上海交通大学 基于视频内容消费的反馈信息标识方法、系统及存储介质
KR20240010535A (ko) * 2018-08-06 2024-01-23 파나소닉 인텔렉츄얼 프로퍼티 코포레이션 오브 아메리카 3차원 데이터 저장 방법, 3차원 데이터 취득 방법, 3차원 데이터 저장 장치, 및 3차원 데이터 취득 장치
CN110944222B (zh) * 2018-09-21 2021-02-12 上海交通大学 沉浸媒体内容随用户移动变化的方法及系统
EP3905532A4 (en) 2018-12-28 2022-11-30 Sony Group Corporation INFORMATION PROCESSING DEVICE AND INFORMATION PROCESSING METHOD
JPWO2020166612A1 (ja) 2019-02-12 2021-12-16 パナソニック インテレクチュアル プロパティ コーポレーション オブ アメリカPanasonic Intellectual Property Corporation of America 三次元データ多重化方法、三次元データ逆多重化方法、三次元データ多重化装置、及び三次元データ逆多重化装置


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4231609A4

Also Published As

Publication number Publication date
CN114374675A (zh) 2022-04-19
EP4231609A1 (en) 2023-08-23
KR102661694B1 (ko) 2024-04-26
CN116248642A (zh) 2023-06-09
JP2023520736A (ja) 2023-05-18
US20230034937A1 (en) 2023-02-02
JP7471731B2 (ja) 2024-04-22
KR20220160646A (ko) 2022-12-06
EP4231609A4 (en) 2024-03-13
CN114374675B (zh) 2023-02-28

