WO2023207119A1 - Immersive media processing method and apparatus, device, and storage medium - Google Patents


Info

Publication number
WO2023207119A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
information
track
auxiliary information
field
Prior art date
Application number
PCT/CN2022/137788
Other languages
French (fr)
Chinese (zh)
Inventor
胡颖 (Hu Ying)
Original Assignee
腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Co., Ltd.)
Publication of WO2023207119A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N 21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N 21/234 Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs
    • H04N 21/2343 Processing of video elementary streams involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
    • H04N 21/235 Processing of additional data, e.g. scrambling of additional data or processing content descriptors
    • H04N 21/2353 Processing of additional data specifically adapted to content descriptors, e.g. coding, compressing or processing of metadata
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N 21/4402 Processing of video elementary streams involving reformatting operations of video signals for household redistribution, storage or real-time display
    • H04N 21/45 Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
    • H04N 21/462 Content or additional data management, e.g. creating a master electronic program guide from data received from the Internet and a Head-end, controlling the complexity of a video stream by scaling the resolution or bit-rate based on the client capabilities
    • H04N 21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/81 Monomedia components thereof
    • H04N 21/8126 Monomedia components involving additional data, e.g. news, sports, stocks, weather forecasts

Description

  • the present application relates to the field of computer technology, and in particular, to an immersive media processing method, device, computer equipment, and computer-readable storage medium.
  • An immersive media processing method, executed by a media processing device, includes:
  • obtaining the video bit stream of immersive media captured by multiple cameras, where the information collected by any camera through photography includes: video image information and corresponding video auxiliary information; and
  • encapsulating the video bit stream into M video tracks to obtain a media file of the immersive media, where M is a positive integer and, among the M video tracks, the video media header data box of at least one video track includes metadata information used to indicate whether the corresponding video track contains both video image information and corresponding video auxiliary information.
  • An immersive media processing device is provided, and the device includes:
  • a processing unit, used to obtain the video bit stream of immersive media captured by multiple cameras, where the information collected by any camera through photography includes: video image information and corresponding video auxiliary information; and
  • an encapsulating unit, configured to encapsulate the video bit stream into M video tracks to obtain a media file of the immersive media, where M is a positive integer and, among the M video tracks, the video media header data box of at least one video track includes metadata information used to indicate whether the corresponding video track contains both video image information and corresponding video auxiliary information.
  • An immersive media processing method is provided, which is executed by a media processing device, and the method includes:
  • acquiring the target video track in the media file of immersive media, where the immersive media is captured by multiple cameras, and the information collected by any camera includes: video image information and corresponding video auxiliary information; the media file is obtained by encapsulating the video bit stream of the immersive media into M video tracks, where M is a positive integer and, among the M video tracks, the video media header data box of at least one video track includes metadata information used to indicate whether the corresponding video track contains both video image information and corresponding video auxiliary information; and
  • decoding the information in the target video track.
  • An immersive media processing device includes:
  • an acquisition unit, used to acquire the target video track in the media file of immersive media, where the immersive media is captured by multiple cameras, and the information collected by any camera through photography includes: video image information and corresponding video auxiliary information; the media file is obtained by encapsulating the video bit stream of the immersive media into M video tracks, where M is a positive integer and, among the M video tracks, the video media header data box of at least one video track includes metadata information used to indicate whether the corresponding video track contains both video image information and corresponding video auxiliary information; and
  • a decoding unit configured to decode information in the target video track.
  • a media processing device includes an input interface and an output interface.
  • the media processing device further includes:
  • a processor adapted to implement one or more computer-readable instructions; and, a computer-readable storage medium storing one or more computer-readable instructions;
  • the one or more computer-readable instructions are adapted to be loaded by the processor and perform the following steps:
  • obtaining the video bit stream of immersive media captured by multiple cameras, where the information collected by any camera through photography includes: video image information and corresponding video auxiliary information; and
  • encapsulating the video bit stream into M video tracks to obtain a media file of the immersive media, where M is a positive integer and, among the M video tracks, the video media header data box of at least one video track includes metadata information used to indicate whether the corresponding video track contains both video image information and corresponding video auxiliary information.
  • the one or more computer readable instructions are adapted to be loaded by the processor and perform the following steps:
  • acquiring the target video track in the media file of immersive media, where the immersive media is captured by multiple cameras, and the information collected by any camera includes: video image information and corresponding video auxiliary information; the media file is obtained by encapsulating the video bit stream of the immersive media into M video tracks, where M is a positive integer and, among the M video tracks, the video media header data box of at least one video track includes metadata information used to indicate whether the corresponding video track contains both video image information and corresponding video auxiliary information; and decoding the information in the target video track.
  • a computer-readable storage medium is provided, which stores one or more computer-readable instructions, and the one or more computer-readable instructions are suitable for being loaded by a processor to execute any of the above-mentioned immersive media processing methods.
  • a computer program product which includes computer-readable instructions; when the computer-readable instructions are executed by a processor, any one of the above-mentioned immersive media processing methods is implemented.
  • Figure 1a is a schematic diagram of 3DoF provided by an embodiment of the present application.
  • Figure 1b is a schematic diagram of 3DoF+ provided by an embodiment of the present application.
  • Figure 1c is a schematic diagram of 6DoF provided by an embodiment of the present application.
  • Figure 1d is a schematic diagram of free-view video data provided by an embodiment of the present application.
  • Figure 1e is a schematic diagram of an immersive media processing process provided by an embodiment of the present application.
  • Figure 2 is a schematic flowchart of an immersive media processing method provided by an embodiment of the present application.
  • Figure 3 is a schematic flowchart of an immersive media processing method provided by another embodiment of the present application.
  • Figure 4a is a schematic diagram of a single-track encapsulation method for immersive media provided by an embodiment of the present application.
  • Figure 4b is a schematic diagram of the transmission of a media file of immersive media provided by an embodiment of the present application.
  • Figure 5a is a schematic diagram of a multi-track encapsulation method for immersive media provided by an embodiment of the present application.
  • Figure 5b is a schematic diagram of the transmission of a media file of another immersive media provided by an embodiment of the present application.
  • Figure 6a is a schematic diagram of a multi-track encapsulation method for immersive media provided by another embodiment of the present application.
  • Figure 6b is a schematic diagram of the transmission of a media file of immersive media provided by another embodiment of the present application.
  • Figure 7 is a schematic structural diagram of an immersive media processing device provided by an embodiment of the present application.
  • Figure 8 is a schematic structural diagram of an immersive media processing device provided by another embodiment of the present application.
  • Figure 9 is a schematic structural diagram of a media processing device provided by an embodiment of the present application.
  • immersive media refers to media content that can provide consumers with an immersive experience, so that consumers immersed in the media content can obtain visual and auditory experiences as if they were in the real world.
  • Immersive media can be divided into 3DoF media, 3DoF+ media and 6DoF media according to the degree of freedom (DoF) of consumers when consuming media content; 3DoF media can support users to consume corresponding content through 3DoF (i.e. three degrees of freedom).
  • 3DoF+ media can support users to consume corresponding media content through 3DoF+
  • 6DoF media can support users to consume corresponding media content through 6DoF.
  • the so-called 3DoF refers to the three degrees of freedom with which the user's head can rotate around the X-axis, Y-axis and Z-axis while the user is fixed at the center point of a three-dimensional space, as shown in Figure 1a;
  • the so-called 3DoF+ refers to: on the basis of the three degrees of freedom, the user's head can also move with limited freedom along the X-axis, Y-axis and Z-axis, where limited movement refers to translational movement within a limited space, as shown in Figure 1b;
  • the so-called 6DoF refers to: on the basis of the three degrees of freedom, the consumer can also move freely along the X-axis, Y-axis and Z-axis, where free movement refers to free translational movement, as shown in Figure 1c.
  • the degree of freedom mentioned in the embodiments of this application can be understood as the degree of freedom with which the user is supported in moving, and thereby interacting with the content, when watching immersive media.
  • immersive media captured by multiple cameras can be called free-view video; that is to say, free-view video is a common form of 3DoF+ media and 6DoF media, and can also be called multi-view video, multi-viewpoint video or free-viewpoint video.
  • when capturing free-view video, the cameras collect texture information of the scene (which can also be called texture map information or a texture map, and which can be, for example, color information) and depth information (which can also be called depth map information or a depth map, and which can be, for example, spatial distance information); the texture information and depth information from different cameras are then combined according to the user's location information to constitute the free-view video (i.e. 6DoF media or 3DoF+ media) consumed on the user side.
  • free-view video is a combination of videos from different perspectives in space collected by multiple cameras
  • the embodiments of the present application can express free-view video data as a combination of the following information: the texture map information and depth map information collected by the multiple cameras, and the corresponding metadata; the metadata can include the parameters of the multiple cameras, as well as description information such as the splicing layout and edge protection of the free-view video.
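The composition described above (per-camera texture and depth maps plus metadata) can be sketched as plain data structures. This is an illustrative model only, not the patent's actual encoding; all field names here are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class CameraCapture:
    """Per-camera free-view data: a texture map (e.g. color information)
    and a depth map (e.g. spatial distance information)."""
    camera_id: int
    texture_map: bytes
    depth_map: bytes

@dataclass
class FreeViewMetadata:
    """Metadata accompanying the maps: camera parameters plus description
    information such as splicing layout and edge protection."""
    camera_params: List[dict] = field(default_factory=list)
    splicing_layout: str = ""
    edge_protection: int = 0

@dataclass
class FreeViewVideoData:
    """Free-view video data = captures from multiple cameras + metadata."""
    captures: List[CameraCapture] = field(default_factory=list)
    metadata: FreeViewMetadata = field(default_factory=FreeViewMetadata)
```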
  • the processing process for immersive media may mainly include: video acquisition, video encoding, video file encapsulation, video transmission, video file decapsulation, video decoding and video presentation. Specifically:
  • Video capture is used to convert analog video signals into digital videos and save them in the format of digital video files; that is to say, video capture can convert video signals into binary digital information. Since the binary digital information converted from the video signal is a binary data stream, the binary digital information can also be called a code stream or bit stream (Bitstream) of the video signal.
  • Video encoding refers to the process of converting a file in the original video format into a file in another video format through compression technology. That is to say, video encoding can be achieved through standardized video compression algorithms, such as AVS (Audio Video coding Standard, China's national digital audio and video coding standard) encoding technology, AVS3 (the third-generation audio and video codec standard launched by the AVS standard group) encoding technology, HEVC (High Efficiency Video Coding, the international video coding standard HEVC/H.265) encoding technology, VVC (Versatile Video Coding, the international video coding standard VVC/H.266) encoding technology and other algorithms, or through certain compression algorithms designed for specific content.
  • Video file encapsulation refers to storing the encoded and compressed media bit stream (such as video bit stream) in a file in a certain format according to the encapsulation format (or container, or file container).
  • the embodiments of the present application mainly take the media bit stream as a video bit stream as an example; if the immersive media includes not only video frames but also one or more other kinds of media frames, such as audio frames and subtitle frames, then the media bit stream also includes at least one of an audio bit stream and a subtitle bit stream.
  • Common encapsulation formats include AVI format (Audio Video Interleaved, audio and video interleaved format) or ISOBMFF (ISO Based Media File Format, media file format based on ISO (International Standard Organization, International Organization for Standardization) standards).
  • ISOBMFF is the packaging standard for media files.
  • the most typical ISOBMFF file is the MP4 (Moving Picture Experts Group 4) file.
  • the media sample (which may be referred to as a sample for short) can be used as the encapsulation unit in the file encapsulation process; that is, the media code stream obtained by compression and encoding is encapsulated into one or more media tracks, thereby obtaining the corresponding media files (i.e. video files).
  • a media file can be composed of multiple media tracks.
  • a media file can contain one or more video tracks, one or more audio tracks, and one or more subtitle tracks; the so-called video track refers to the media track used to encapsulate the video bit stream, the so-called audio track refers to the media track used to encapsulate the audio bit stream, and the so-called subtitle track refers to the media track used to encapsulate the subtitle bit stream.
  • the media track mentioned here (which may be referred to as a track for short) refers to a collection of media data in the process of encapsulating media files; a media track may include one or more samples, and during the encapsulation process one or more media frames are encapsulated as a sample.
  • a sample in a video track is usually a video frame
  • a sample in an audio track is usually an audio frame
  • a sample in a subtitle track is usually a subtitle frame, etc.
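The track and sample structure described above is carried in ISOBMFF's size-prefixed box layout. As a rough sketch based on ISO/IEC 14496-12 (not part of the patent), a minimal parser for the top-level box sequence of an MP4/ISOBMFF file looks like this:

```python
import io
import struct

def parse_top_level_boxes(data):
    """Walk the top-level ISOBMFF box sequence: each box begins with a
    32-bit big-endian size (which includes this 8-byte header) followed by
    a 4-character type (ISO/IEC 14496-12). 64-bit 'largesize' boxes
    (size == 1) are not handled in this simplified sketch."""
    stream = io.BytesIO(data)
    boxes = []
    while True:
        header = stream.read(8)
        if len(header) < 8:
            break  # end of data
        size, box_type = struct.unpack(">I4s", header)
        # the box payload is everything after the 8-byte header
        boxes.append((box_type.decode("ascii"), stream.read(size - 8)))
    return boxes
```

In a real MP4 file the 'moov' box would in turn contain one 'trak' box per media track, each parsed recursively with the same size-plus-type rule.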
  • Video file transmission means that after the media production device obtains the media file through the above steps, the media file is transferred to the media consumption device.
  • at the encapsulation level, the media file can be considered to include the media resource corresponding to each media track, where the media resource corresponding to a media track is obtained by encapsulating the corresponding media bit stream into that track. For example, when the video bit stream is encapsulated into one or more video tracks during encapsulation to obtain a media file, each resulting video track encapsulates the corresponding information and yields a corresponding video resource; therefore the media file can be considered to include one or more video tracks and the video resource corresponding to each video track.
  • the transmission process of media files can be carried out based on various transmission protocols; the transmission protocols here can include but are not limited to: DASH (Dynamic Adaptive Streaming over HTTP, dynamic adaptive streaming media transmission based on HTTP) protocol, HLS (HTTP Live Streaming, dynamic code rate adaptive transmission) protocol, SMTP (Smart Media Transport Protocol, intelligent media transmission protocol), TCP (Transmission Control Protocol, transmission control protocol), etc.
  • DASH is an adaptive bitrate streaming technology that enables high-quality streaming media to be delivered over the Internet through traditional HTTP servers; in DASH, media segment information is described by the MPD (media presentation description, the media presentation description signaling in DASH).
  • DASH involves the concept of a Representation; the so-called Representation refers to a combination of one or more media components (media fragments).
  • in some cases, a Representation can be understood as the video resource corresponding to a track; for example, after the video bit stream of a certain resolution is encapsulated into a video track to obtain a video file (i.e., a media file) of that resolution, and that video file includes a video resource, then that video file can be seen as a Representation.
  • DASH also involves the concept of an Adaptation Set; the so-called Adaptation Set refers to a collection of one or more video streams, and an Adaptation Set can contain multiple Representations.
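The Representation/Adaptation Set relationship can be illustrated by generating a minimal MPD skeleton. This is a simplified sketch (namespaces, durations, and segment URLs omitted); the ids and bandwidths are hypothetical example values.

```python
import xml.etree.ElementTree as ET

def build_mpd(representations):
    """Build a minimal illustrative DASH MPD: one Period containing one
    AdaptationSet, which in turn contains one Representation per entry.
    Each entry is a (rep_id, bandwidth, width, height) tuple."""
    mpd = ET.Element("MPD", type="static")
    period = ET.SubElement(mpd, "Period")
    # one AdaptationSet groups alternative encodings of the same content
    aset = ET.SubElement(period, "AdaptationSet", mimeType="video/mp4")
    for rep_id, bandwidth, width, height in representations:
        ET.SubElement(aset, "Representation", id=rep_id,
                      bandwidth=str(bandwidth),
                      width=str(width), height=str(height))
    return ET.tostring(mpd, encoding="unicode")

# two Representations (hypothetical 1080p and 720p encodings of one video)
mpd_xml = build_mpd([("v1080", 5000000, 1920, 1080),
                     ("v720", 2500000, 1280, 720)])
```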
  • Video file decapsulation means that after receiving the media file, the media consumption device decapsulates it. It can be understood that the decapsulation process of the media consumption device is the inverse of the encapsulation process performed by the media production device mentioned above; the media consumption device can decapsulate the media file according to the file format requirements used during encapsulation to obtain the corresponding media bit streams (such as video bit streams), etc.
  • Video decoding refers to decoding the media bit stream to restore the video. It should be understood that the decoding process of the media consumption device is inverse to the encoding process performed by the media production device mentioned above.
  • media production equipment refers to equipment used by providers of immersive media (such as content producers of immersive media), which can be such as smartphones, tablets, laptops, desktop computers, smart speakers, Smart watches, smart cars and other terminal equipment.
  • the media production equipment can also be a server.
  • the server here can be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN (Content Delivery Network), and big data and artificial intelligence platforms.
  • the so-called media consumption device refers to a device used by users of immersive media (for example, users who consume media content of immersive media), which can be such as personal computers, smart phones, VR (Virtual Reality, virtual reality) devices (for example, VR helmets, VR glasses, etc.), smart watches, smart cars and other terminal devices; further, the media consumption device can also install and run a client with media consumption capabilities, thereby consuming media through the client.
  • the media production equipment needs to compress and encode the collected free-view video; during the compression and encoding process, the video compression algorithms involved can be implemented with AVS3 encoding technology, HEVC encoding technology, etc.
  • the encoded data stream (such as a video bit stream) needs to be encapsulated to obtain a media file, so that the media file can be transmitted to a media consumption device.
  • on the media consumption device side, reverse operations such as decapsulation and decoding of the media file are required to finally present the video content of the free-view video (that is, the media content).
  • the production process of free-view video content can roughly include the following steps: (1) capture the scene with a camera array to obtain the texture map information and depth map information collected by the multiple cameras in the camera array; (2) splice the texture map information collected by the multiple cameras with the corresponding depth map information (this step is optional); (3) encode the depth map information and texture map information collected by the multiple cameras with planar video compression to obtain the encoded video bit stream; (4) encapsulate the encoded video bit stream and the metadata into the media file of the free-view video (i.e., the video file).
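The four production steps can be sketched as a toy pipeline. This is a heavily simplified illustration, not the patent's method: splicing is modeled as byte concatenation, "encoding" is an identity placeholder standing in for planar video compression, and a dict stands in for the ISOBMFF media file.

```python
def produce_free_view_file(camera_frames):
    """Toy free-view production pipeline. Each element of `camera_frames`
    is a (texture_bytes, depth_bytes) pair from one camera."""
    # (1) capture: camera_frames already holds per-camera texture/depth maps
    # (2) splice (optional): join each camera's texture and depth maps
    spliced = [texture + depth for texture, depth in camera_frames]
    # (3) "encode": placeholder for planar video compression
    bitstream = b"".join(spliced)
    # (4) encapsulate bitstream plus metadata into a stand-in media file
    return {"bitstream": bitstream,
            "metadata": {"camera_count": len(camera_frames)}}
```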
  • for media consumption devices, the free-view video needs to be reconstructed to a virtual viewpoint based on the media file of the free-view video, so as to render and present the video content corresponding to the virtual viewpoint to the user. Specifically, when reconstructing the free-view video to the virtual viewpoint, the media consumption device can perform the following steps: obtain the media file of the free-view video, and decode from it the depth map information and texture map information collected by one or more cameras; then process the decoded free-view video data, combining the depth map information and texture map information according to a free-view video algorithm (the application's own algorithm can be used) to synthesize the virtual viewpoint corresponding to the target perspective the user has switched to.
  • the texture map information structure is used to indicate: the encoded metadata information of the texture map corresponding to the original camera (ie camera) involved in the free-view video data.
  • the syntax of the texture map information structure can be as follows:
  • aligned(8) indicates alignment by 8 bits;
  • texture_padding_size indicates the guard band width of the texture map;
  • texture_top_left_x indicates the x-coordinate of the upper left corner of the texture map corresponding to the original camera in the video frame plane;
  • texture_top_left_y indicates the y-coordinate of the upper left corner of the texture map corresponding to the original camera in the video frame plane;
  • texture_bottom_right_x indicates the x-coordinate of the lower right corner of the texture map corresponding to the original camera in the video frame plane;
  • texture_bottom_right_y indicates the y-coordinate of the lower right corner of the texture map corresponding to the original camera in the video frame plane.
  • the depth map information structure is used to indicate: the encoded metadata information of the depth map corresponding to the original camera (ie camera) involved in the free-view video data.
  • the syntax of the depth map information structure can be as follows:
  • depth_padding_size indicates the guard band width of the depth map
  • depth_top_left_x indicates the x-coordinate of the upper left corner of the depth map corresponding to the corresponding original camera in the video frame plane
  • depth_top_left_y indicates the y-coordinate of the upper left corner of the depth map corresponding to the original camera in the video frame plane;
  • depth_bottom_right_x indicates the x-coordinate of the lower right corner of the depth map corresponding to the original camera in the video frame plane;
  • depth_bottom_right_y indicates the y-coordinate of the lower right corner of the depth map corresponding to the original camera in the video frame plane.
  • depth_downsample_factor indicates the multiple factor of depth map downsampling.
  • the actual resolution width and height of the depth map are 1/2^depth_downsample_factor of the camera acquisition resolution width and height.
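The downsampling relation above can be written as a one-line helper (a sketch: integer division is assumed here for simplicity, and the capture resolutions are example values).

```python
def depth_map_resolution(capture_width, capture_height, depth_downsample_factor):
    """Actual depth-map resolution when width and height are each scaled
    by 1/2**depth_downsample_factor relative to the capture resolution."""
    scale = 2 ** depth_downsample_factor
    return capture_width // scale, capture_height // scale
```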
  • the existing video coding and decoding standards stipulate that the free-view video track should be represented in the media file as a restricted video, and the SchemeTypeBox (scheme type data box) in the RestrictedSchemeInfoBox (restricted scheme information data box) under the sample entry of this type of track should be set to 'as3f'; however, the encoding of free-view video imposes no restrictions on the video codec.
  • existing video coding and decoding standards also define free-view video encapsulation methods, which can include single-track encapsulation and multi-track encapsulation:
  • the single-track encapsulation of free-view video means that a single track represents the free-view video code stream (that is, the video bit stream of free-view video).
  • the single-track encapsulation method is used to directly encapsulate the free-view video data into the media track container, and the data structure is retained in the code stream as it is.
  • Single-track encapsulated free-view video tracks use a VisualSampleEntry (sample entry), defined in ISO/IEC 14496-12.
  • the sample entry can contain an AvsFreeViewInfoBox (free view information data box) and PayloadInfoBox (load information data box).
  • AvsFreeViewInfoBox indicates the camera information corresponding to the perspective in the free-view video track
  • PayloadInfoBox indicates the texture map and depth map information corresponding to the camera.
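As a rough sketch (not from the patent) of how such a nested sample-entry layout serializes, the helper below writes generic size-prefixed ISOBMFF boxes. 'resv' is the restricted-video sample entry type from ISO/IEC 14496-12; 'afvi' and 'plif' are placeholder 4CCs standing in for AvsFreeViewInfoBox and PayloadInfoBox, whose real four-character codes and payload layouts are defined by the AVS spec. A real VisualSampleEntry also carries fixed fields (data_reference_index, width, height, etc.) before its child boxes, omitted here.

```python
import struct

def box(box_type, payload=b""):
    """Serialize one ISOBMFF box: 32-bit big-endian size (including the
    8-byte header) + 4-character type + payload (ISO/IEC 14496-12)."""
    body = box_type.encode("ascii") + payload
    return struct.pack(">I", 4 + len(body)) + body

# Single-track free-view sample entry: child boxes nested inside 'resv'.
# 'afvi'/'plif' are hypothetical 4CCs; payloads are left empty here.
sample_entry = box("resv", box("afvi") + box("plif"))
```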
  • Multi-track encapsulation of free-view video means: multiple media tracks (such as video tracks) represent the free-view video code stream; multi-track encapsulation can encapsulate the texture map and depth map collected by the same camera into one track, or encapsulate into different tracks, there is no limit to this.
  • the multi-track encapsulation of free-view video can contain two types of media tracks: the texture map track (the video track that encapsulates the texture map, i.e. the texture information) and the depth map track (the video track that encapsulates the depth map, i.e. the depth information); in this case, the texture map track and depth map track corresponding to the same camera need to be associated.
  • specifically, the TrackReferenceBox (track reference data box) defined in ISO/IEC 14496-12 can be used to reference or associate the depth map track with the texture map track corresponding to the same camera, with the reference type indicated by the reference_type field.
  • the multi-track encapsulation method maps the free-view video into the containers of the respective media tracks according to the data type; the tracks belonging to the same camera are associated with each other and time-aligned.
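The track association can be sketched by serializing a TrackReferenceBox ('tref') holding one TrackReferenceTypeBox, per ISO/IEC 14496-12. The reference_type 4CC that the AVS free-view spec assigns to the depth-to-texture association is not given in this text, so 'cdtg' below is a placeholder, and the track ID is an example value.

```python
import struct

def track_reference_box(reference_type, track_ids):
    """Serialize a 'tref' box containing one TrackReferenceTypeBox whose
    type is `reference_type` and whose payload is the list of referenced
    32-bit track IDs (ISO/IEC 14496-12 box layout)."""
    ids = b"".join(struct.pack(">I", tid) for tid in track_ids)
    # inner TrackReferenceTypeBox: size + 4CC + track_id array
    ref_box = struct.pack(">I", 8 + len(ids)) + reference_type.encode("ascii") + ids
    # outer TrackReferenceBox wrapping the typed reference
    return struct.pack(">I", 8 + len(ref_box)) + b"tref" + ref_box

# depth map track referencing texture track with ID 1 (placeholder 4CC)
tref = track_reference_box("cdtg", [1])
```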
  • the texture map track and the depth map track of multi-track encapsulated free-view video both use a VisualSampleEntry, which is defined in ISO/IEC 14496-12; the sample entries of the texture map track and the depth map track both contain an AvsFreeViewInfoBox, and the camera parameter information described in the AvsFreeViewInfoBox should not appear repeatedly in a texture map track and its corresponding depth map track.
  • if the texture map track contains texture maps from multiple cameras, its sample entry contains a PayloadInfoBox describing the texture map information; if the depth map track contains depth maps from multiple cameras, its sample entry contains a PayloadInfoBox describing the depth map information.
  • each sample in the texture map track or depth map track corresponds to the texture map or depth map of one or more cameras.
• When a sample contains texture maps or depth maps of multiple cameras, the sample may contain multiple subsamples. Each subsample corresponds to the texture map or depth map of one camera and is described using the SubSampleInformationBox (subsample information data box). Specifically, when the value of the flags field in the SubSampleInformationBox syntax of the existing video coding and decoding standards is 0, the subsample can be described through the codec_specific_parameters field.
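As an illustration, the subsample description could carry the camera association in the 32-bit codec_specific_parameters field. The bit layout below (camera_id in the upper bits, a texture/depth indicator in the lowest bit) is a hypothetical sketch for illustration, not the layout defined by the standard:

```python
# Hypothetical sketch: packing a camera identifier and a texture/depth
# indicator into the 32-bit codec_specific_parameters field of
# SubSampleInformationBox (when flags == 0). The actual bit layout is
# defined by the relevant standard; this one is an assumption.

def pack_codec_specific_parameters(camera_id: int, is_depth: bool) -> int:
    """Pack (camera_id, is_depth) into a 32-bit value: id in bits 1..16,
    depth/texture indicator in bit 0."""
    assert 0 <= camera_id < (1 << 16)
    return (camera_id << 1) | (1 if is_depth else 0)

def unpack_codec_specific_parameters(value: int) -> tuple:
    """Recover (camera_id, is_depth) from the packed value."""
    return value >> 1, bool(value & 1)
```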
• AvsFreeViewGroupBox can include a track group identification field (track_group_id). Different video tracks that contain content from some of the cameras and correspond to the same free-view video should have the same track_group_id value.
• The free-view data box is used to indicate the camera metadata information corresponding to one or more free viewpoints contained in the corresponding track (i.e., the metadata information of the cameras).
  • the syntax of the free-view data box can be as follows:
• texture_in_track indicates whether the corresponding track contains a texture map; when the value of texture_in_track is 1, it indicates that the corresponding track contains a texture map; when the value of texture_in_track is 0, it indicates that the corresponding track does not contain a texture map.
  • depth_in_track indicates whether the corresponding track contains a depth map; when the value of depth_in_track is 1, it indicates that the corresponding track contains a depth map; when the value of depth_in_track is 0, it indicates that the corresponding track does not contain a depth map.
  • camera_count indicates the number of all cameras collecting video.
  • IntCameraInfoStruct() indicates the internal parameter information of the camera
  • ExtCameraInfoStruct() indicates the external parameter information of the camera.
  • codec_independency indicates the codec independence between the texture maps and depth maps corresponding to each camera in the corresponding track. The values involved can be seen in Table 1 below:
  • the load information data box is used to describe the metadata information of the texture map and depth map corresponding to one or more cameras in the corresponding track.
  • the syntax of the payload information data box can be as follows:
  • texture_info_flag indicates whether to express the metadata information describing the texture map; when the value of texture_info_flag is 1, it expresses the metadata information describing the texture map; when the value is 0, it does not express the metadata information describing the texture map.
  • depth_info_flag indicates whether to express the metadata information describing the depth map; when the value of depth_info_flag is 1, the metadata information describing the depth map is expressed; when the value is 0, the metadata information describing the depth map is not expressed.
  • num_cameras indicates the number of cameras in this media track, and camera_id indicates the identity of the camera.
  • TextureInfostruct() describes the metadata information of the texture map
  • DepthInfostruct() describes the metadata information of the depth map.
  • the free-view track group data box is used to associate various video tracks of the same free-view video.
  • the syntax of the free-view track group data box can be as follows:
  • camera_count is used to indicate the number of cameras corresponding to the free-view texture information and depth information contained in the track.
  • camera_id is used to indicate the camera identifier corresponding to each camera.
  • depth_texture_type is used to indicate the type of texture information or depth information captured by the corresponding camera contained in the track. Its value can be referred to the following Table 2:
• Values of depth_texture_type (Table 2):
•   0 — reserved
•   1 — the track contains texture information captured by the corresponding camera
•   2 — the track contains depth information captured by the corresponding camera
•   3 — the track contains both texture and depth information captured by the corresponding camera
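The four depth_texture_type values of Table 2 can be captured in a small helper, for example (a sketch; the enum and helper names are illustrative and not part of the standard):

```python
from enum import IntEnum

class DepthTextureType(IntEnum):
    """Values of depth_texture_type from Table 2 of the free-view
    track group data box."""
    RESERVED = 0           # reserved
    TEXTURE = 1            # texture information captured by the camera
    DEPTH = 2              # depth information captured by the camera
    TEXTURE_AND_DEPTH = 3  # both texture and depth information

def has_texture(t: DepthTextureType) -> bool:
    """True if the track carries texture information for the camera."""
    return t in (DepthTextureType.TEXTURE, DepthTextureType.TEXTURE_AND_DEPTH)

def has_depth(t: DepthTextureType) -> bool:
    """True if the track carries depth information for the camera."""
    return t in (DepthTextureType.DEPTH, DepthTextureType.TEXTURE_AND_DEPTH)
```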
• The existing video coding and decoding standards indicate the depth information and texture information contained in each video track repeatedly: for example, the free-view information data box, the load information data box, and the free-view track group data box all contain fields indicating whether the track contains texture information or depth information. This causes the media consumption device to parse repeatedly during the decapsulation process, wasting resources.
• The existing video coding and decoding standards indicate the codec independence of free-view videos, that is, whether there are codec dependencies between the texture information and depth information in a video track and between the content captured by each camera.
  • the current indication method is limited to free-view application scenarios and does not have compatibility and scalability.
• In view of this, the embodiments of this application propose a file encapsulation and transmission signaling indication method for free-view video applications and other immersive media applications. Through this method, the following can be achieved: (1) organize free-view video media content in a more optimized manner, preventing the media consumption device from repeatedly parsing the relevant fields in other data boxes used to indicate the information in the video track during decapsulation, thereby saving processing resources, improving decapsulation efficiency, and thus improving decoding efficiency; (2) indicate the codec dependencies between the various parts of the video content in a more optimized manner, so that the method is not limited to free-view application scenarios, improving compatibility and scalability; (3) based on the information in (1) and (2), generate corresponding transmission signaling to guide resource selection by media consumption devices such as clients during free-view video consumption.
  • this method can be applied to the storage, transmission, encapsulation, and decapsulation process of any immersive media (such as free-view video).
• Taking a free-view video as an example, the specific implementation steps of this method are as follows:
• Step 1: The media production equipment (such as a server) generates the video bitstream of the free-view video and encapsulates the video bitstream into a media file. According to whether the texture information and depth information in the free-view video are in the same video track, and whether the free-view video is encapsulated into multiple video tracks, the texture information, depth information, and camera information contained in the video tracks of the free-view video are indicated according to the following guidelines.
  • N is a positive integer
  • the video media header data box (VideoMediaHeaderBox) in the video track identifies that the video track is accompanied by video auxiliary information (such as depth information) through metadata information.
• If a video track only contains texture information or only depth information, then, considering that texture information is a kind of video image information and depth information is a kind of video auxiliary information, the texture map track (i.e., the video track encapsulating texture information) can be considered a video image information track, and the depth map track can be considered a video auxiliary information track.
• The video image information track mentioned here can be referred to as a video track for short, which refers to a video track that encapsulates video image information; the video auxiliary information track mentioned here can also be called an auxiliary video media track or a video auxiliary track, which refers to a video track that encapsulates video auxiliary information.
• (1) Video media uses the handler type (handler_type) in the HandlerBox (handler data box), as defined in clause 8.4.3 of the relevant standard. The handler type of video media is 'vide', that is, the corresponding field value in the HandlerBox inside the MediaBox of the video media file is 'vide'; the handler type of auxiliary video media is 'auxv', that is, the corresponding field value in the HandlerBox inside the MediaBox of the auxiliary video media file is 'auxv'.
• The difference between the auxiliary video media track (i.e., the video auxiliary information track) and the video track (i.e., the video image information track) is that the auxiliary video media track is usually not used for visual rendering (for example, when the auxiliary video track contains depth information). Therefore, if a video track only contains texture information or only depth information, the distinction between the video image information track and the video auxiliary information track in the existing technology (distinguished by handler_type), together with the related technology (the 'vdep' track reference) and the identification information (such as metadata information) in (1), can be used to distinguish whether the information contained in the video track is texture information or depth information.
• The free-view track group data box is used to associate the multiple video tracks; further, the free-view track group data box can be used to associate the multiple video tracks that encapsulate video image information (such as texture information).
• The load information (i.e., the information in the aforementioned load information data box).
• The type of free-view track group can also be further distinguished based on the classification indicated by the metadata information in (1) (that is, whether the video track carries both video image information (such as texture information) and video auxiliary information (such as depth information)). The type of free-view track group mentioned here can be used to indicate that each video track associated with the free-view track group only encapsulates video image information (such as texture information), or that each video track associated with the free-view track group encapsulates both video image information (such as texture information) and video auxiliary information (such as depth information).
  • video image information such as texture information
  • video auxiliary information such as depth information
• Step 2: The media production device (such as a server) indicates the codec dependency information in the video track based on the file encapsulation structure involved in Step 1 (i.e., whether multi-track encapsulation is used, and whether a single video track contains both texture information and depth information).
• The codec dependency information may include: an indication of the rules for dividing data sets (i.e., video components) within the video, and an indication of whether codec dependencies exist between different data sets under those rules.
  • the codec dependency information may also include an indication of whether there is codec dependence on the texture information and depth information in the single video track, and so on.
  • Step 3 The media production equipment (such as a server) can provide a complete file download service.
• Alternatively, the media production device slices the media file of the free-view video into media segments suitable for streaming transmission (performed according to the specifications of the existing transmission standards), generates the corresponding transmission signaling, and performs Step 4.
  • the transmission signaling mentioned here may include but is not limited to DASH signaling or SMT signaling; and, the signaling description file of the transmission signaling indicates the following information:
  • Step 4 The media production device (such as a server) transmits the signaling description file of the transmission signaling to the user on the media consumption device side.
  • Step 5 The client in the media consumption device can execute the following logic:
• (A) The client can download the complete file (that is, the complete media file) and then play it locally; based on the indications about texture information, depth information, and codec dependency information in the media file, it determines whether the media file can be partially decoded at different stages, so as to optimize playback performance.
  • (B) Establish streaming transmission with media production equipment (such as servers) according to the application form, and adaptively select the appropriate freedom according to the texture information, depth information and codec dependency information indicated in the signaling description file of the transmission signaling. View video streams for consumption.
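The client-side selection in (B) might look like the following sketch, where the representation dictionaries and their keys ('with_depth_info', 'depth_codec_dependency') are hypothetical stand-ins for the attributes carried in the signaling description file:

```python
def select_representation(reps, want_depth: bool):
    """Adaptively pick a video resource from transmission signaling.

    A client that only renders texture prefers a texture-only resource;
    a client that needs depth (e.g. for view synthesis) prefers one that
    carries depth, ideally without codec dependency so the texture part
    can still be decoded independently.
    """
    if not want_depth:
        # Texture-only playback: avoid downloading depth data if possible.
        for r in reps:
            if not r.get('with_depth_info'):
                return r
    else:
        # Prefer depth-carrying resources whose texture decodes independently.
        independent = [r for r in reps
                       if r.get('with_depth_info')
                       and not r.get('depth_codec_dependency')]
        if independent:
            return independent[0]
        for r in reps:
            if r.get('with_depth_info'):
                return r
    return reps[0] if reps else None
```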
  • the embodiment of this application adds several descriptive fields at the system layer based on the existing technology.
  • the relevant fields are defined to support free-view video applications or other immersive media applications, as follows:
  • the ISOBMFF data box may include a video media header data box (VideoMediaHeaderBox).
• The video track uses a video media header data box, which contains general presentation-related information of the video media that is independent of the coding method. The embodiment of the present application extends the video media header data box to indicate whether depth information or other auxiliary information is also included in the video track.
  • the expanded video media header data box can be as follows:
  • the three fields version, graphicsmode and opcolor are the original fields of the video media header data box.
  • version is used to indicate the version number of the video media header data box;
  • graphicsmode is used to indicate the image combination mode of the current video track.
  • a value of 0 in this field indicates that the image combination mode of the current video track is copy mode.
  • Other values can be determined by corresponding standards.
  • opcolor is an array of values for three colors (red, green, blue), used in conjunction with the image combination mode.
• On this basis, the embodiment of the present application can define the flag bit video_with_depth_info_flag (video depth information flag field) in the video media header data box to indicate whether the current video track contains both texture information (image information) and the depth information corresponding to the texture information. When the value of this field is the first value (such as 1), it indicates that the current video track contains both texture information (image information) and the corresponding depth information; when the value of this field is the second value (such as 0), it indicates that the current video track only contains texture information (image information).
  • texture information is a type of video image information
  • depth information is a type of video auxiliary information
  • video image information is not limited to texture information
  • video auxiliary information is not limited to depth information.
• For example, the video auxiliary information may also include parallax information, etc. Therefore, other flag bits can also be defined in the video media header data box to indicate whether the current video track also contains the corresponding auxiliary information.
• Specifically, the following flags can be defined:
• (1) video_with_parallax_info_flag (video parallax information flag field): when the value of this field is the first value (such as 1), it indicates that the current video track contains both video image information and the disparity information corresponding to the video image information; when the value of this field is the second value (such as 0), it indicates that the current video track contains only video image information.
• (2) video_with_auxl_info_flag (video auxiliary information flag field): indicates whether the current video track contains both video image information and the auxiliary information corresponding to the video image information.
• When the value of this field is the first value (such as 1), the current video track contains both video image information and the video auxiliary information corresponding to the video image information; when the value of this field is the second value (such as 0), the current video track only contains video image information.
  • the video auxiliary information here can be depth information, disparity information, or auxiliary information defined by other applications.
• The video media header data box can be further extended to indicate whether there is codec independence between the data content (information) contained in the video track.
  • the expanded video media header data box can be further as follows:
• depth_codec_dependency (depth codec dependency field): when the value of depth_codec_dependency is the third value (such as 1), it indicates that there is a decoding dependency between the texture information (image information) contained in the current video track and the depth information corresponding to the texture information; when the value of depth_codec_dependency is the fourth value (such as 0), it indicates that there is no such decoding dependency.
• Similarly, if the value of video_with_parallax_info_flag equals 1, that is, the current video track contains both video image information and the disparity information corresponding to the video image information, the flag parallax_codec_dependency (disparity codec dependency field) can be further defined in the video media header data box.
  • parallax_codec_dependency when the value of parallax_codec_dependency is the third value (such as the value 1), it indicates that there is a decoding dependency relationship between the video image information contained in the current video track and the disparity information corresponding to the video image information; when the value of parallax_codec_dependency is the fourth value (such as a value of 0), it indicates that there is no decoding dependency between the video image information contained in the current video track and the disparity information corresponding to the video image information.
• auxl_codec_dependency: when the value of auxl_codec_dependency is the third value (such as 1), it indicates that there is a decoding dependency between the video image information contained in the current video track and the video auxiliary information corresponding to the video image information; when the value of auxl_codec_dependency is the fourth value (such as 0), it indicates that there is no such decoding dependency.
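A decapsulator reading the extended video media header could proceed as in the sketch below. The byte layout (one flags byte carrying the three information flags, followed by one dependency byte per set flag) is an assumption for illustration only; the actual box syntax is defined by the standard:

```python
def parse_extended_vmhd_payload(data: bytes) -> dict:
    """Parse a hypothetical serialization of the extended video media
    header fields: bit 2 = video_with_depth_info_flag,
    bit 1 = video_with_parallax_info_flag, bit 0 = video_with_auxl_info_flag;
    for each flag that is set, one codec-dependency byte follows."""
    flags = data[0]
    info = {
        'video_with_depth_info_flag': bool(flags & 0b100),
        'video_with_parallax_info_flag': bool(flags & 0b010),
        'video_with_auxl_info_flag': bool(flags & 0b001),
    }
    pos = 1
    pairs = (('video_with_depth_info_flag', 'depth_codec_dependency'),
             ('video_with_parallax_info_flag', 'parallax_codec_dependency'),
             ('video_with_auxl_info_flag', 'auxl_codec_dependency'))
    for flag, dep in pairs:
        if info[flag]:
            info[dep] = data[pos]  # 1 = decoding dependency, 0 = none
            pos += 1
    return info
```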
• VideoCompositionDependencyInfoBox (video composition dependency information data box): used to indicate the component type of the video content of the immersive media, and whether there is a codec dependency relationship (i.e., encoding and decoding dependency) between different components of the video content of the immersive media.
• The syntax of the video composition dependency information data box can be as follows:
  • composition_codec_dependency (component codec dependency field) is used to indicate whether there is a codec dependency relationship between different components contained in the video content of the immersive media.
• When the value of this field is the fifth value (such as 0), it indicates that there is no codec dependency between different components of the video content of the immersive media; when the value of this field is the sixth value (such as 1), it indicates that codec dependencies exist between different components of the video content of the immersive media.
  • composition_type is used to indicate the component type of the video content of the immersive media (that is, the type of components contained in the video content).
• When the value of this field is the seventh value (such as 0), it indicates that one component includes the video content shot by one camera, that is, the video content of the immersive media consists of video content shot by different cameras. When the value of this field is the eighth value (such as 1), it indicates that the component type of the video content of the immersive media is the binocular-based component type, i.e., the components in the video content of the immersive media are obtained based on binocular division.
• The value of composition_type is not limited to the seventh and eighth values; other values are possible and can be defined by extension of the corresponding standards.
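Putting the two fields together, a parser sketch might look as follows, assuming a hypothetical two-byte payload (one byte per field); the real box syntax is defined by the standard:

```python
def parse_video_composition_dependency(payload: bytes) -> dict:
    """Parse a hypothetical serialization of VideoCompositionDependencyInfoBox:
    byte 0 = composition_codec_dependency (0 = no codec dependency between
    components, 1 = dependency exists); byte 1 = composition_type
    (0 = one component per camera, 1 = binocular division)."""
    composition_codec_dependency = payload[0]
    composition_type = payload[1]
    type_names = {0: 'per-camera', 1: 'binocular'}
    return {
        'composition_codec_dependency': composition_codec_dependency,
        'composition_type': type_names.get(
            composition_type, 'reserved(%d)' % composition_type),
    }
```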
• TrackReferenceTypeBox (track reference type data box): the reference type field (reference_type) in the track reference type data box is extended in the embodiments of this application as follows:
  • reference_type When the reference type field (reference_type) stores the track index 'vdep', it is used to reference or associate the depth map track to the texture map track corresponding to the corresponding camera, and indicates that the depth map track relies on the corresponding texture map track when decoding.
• When the reference type field (reference_type) stores the track index 'vdpi', it is used to reference or associate the depth map track to the texture map track corresponding to the corresponding camera, and indicates that the depth map track does not rely on the corresponding texture map track when decoding.
  • the texture map track is one of the video image information tracks
  • the depth map track is one of the video auxiliary information tracks
  • the video image information track is not limited to the texture map track
• The video auxiliary information track is not limited to the depth map track; the video auxiliary track may also include a disparity information track, etc. Therefore, the track index between the video image information track and the disparity information track, and the track index between the video image information track and the video auxiliary information track, can also be extended as follows:
  • reference_type When the reference type field (reference_type) is stored as the track index of 'vplx', it is used to reference or associate the disparity information track to the corresponding video image information track, and indicates that the disparity information track relies on the corresponding video image information track when decoding.
  • reference_type When the reference type field (reference_type) stores the track index of 'vpli', it is used to reference or associate the disparity information track to the corresponding video image information track, and indicates that the disparity information track does not depend on the corresponding video image information track when decoding.
  • reference_type When the reference type field (reference_type) stores the track index of 'auxl', it is used to reference or associate the video auxiliary information track to the corresponding video image information track, and indicates that the video auxiliary information track relies on the corresponding video image information track when decoding.
• When the reference type field (reference_type) stores the track index 'auxi', it is used to reference or associate the video auxiliary information track to the corresponding video image information track, and indicates that the video auxiliary information track does not rely on the corresponding video image information track when decoding.
• It should be noted that the decoding dependency indications of the above track index types can be interchanged; for example, the track index 'vdep' could instead be used to indicate codec independence and the track index 'vdpi' to indicate codec dependency, and so on.
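The six reference types listed above can be selected mechanically from the kind of auxiliary information and the decoding dependency, for example (the helper name is illustrative):

```python
def choose_reference_type(aux_kind: str, depends_on_image_track: bool) -> str:
    """Map an auxiliary-track kind and its decoding dependency onto the
    four-character track reference types described above:
    'vdep'/'vdpi' for depth, 'vplx'/'vpli' for parallax,
    'auxl'/'auxi' for generic video auxiliary information."""
    table = {
        'depth':     ('vdep', 'vdpi'),
        'parallax':  ('vplx', 'vpli'),
        'auxiliary': ('auxl', 'auxi'),
    }
    dependent, independent = table[aux_kind]
    return dependent if depends_on_image_track else independent
```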
• The existing technology defines the free-view track group as follows: if a free-view video is encapsulated into multiple video tracks, and one video track only contains the texture information and/or depth information corresponding to some of the cameras, these video tracks should be associated through the free-view track group; different video tracks that contain content from some cameras and correspond to the same free-view video should have the same track_group_id. The embodiment of this application expands the definition of the free-view track group as follows:
• The free-view track group is used to associate video tracks that correspond to different cameras and contain texture information, where the encoding and decoding of these different video tracks are independent of each other; different video tracks that contain content from some cameras and correspond to the same free-view video should have the same track_group_id. That is to say, the embodiment of the present application only associates video tracks encapsulating texture information through free-view track groups; considering that texture information is a type of video image information, this can be further generalized to: only video tracks encapsulating video image information are associated through free-view track groups.
• Specifically, the video track that encapsulates video image information may include a free-view track group data box (referred to as the track group data box for short); the track group data box may include a track group identification field (track_group_id) and a camera identification field (camera_id).
  • the track group identification field is used to store the identification information of the track group corresponding to the corresponding video track
  • the camera identification field is used to store the identification information of the camera corresponding to the corresponding video track.
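Associating the video tracks of the same free-view video by track_group_id can then be as simple as the following sketch, where the in-memory track representation (dicts with 'track_id', 'track_group_id', and 'camera_ids') is hypothetical:

```python
from collections import defaultdict

def group_free_view_tracks(tracks):
    """Associate video tracks by their track_group_id, as the free-view
    track group data box prescribes: tracks with the same track_group_id
    belong to the same free-view video."""
    groups = defaultdict(list)
    for t in tracks:
        groups[t['track_group_id']].append(t)
    return dict(groups)
```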
  • the video auxiliary information descriptor (VideoAuxiliaryInfoDescriptor), which is used to describe the relevant attributes of the video auxiliary information of the video resource.
  • the video auxiliary information descriptor is a SupplementalProperty element, and its @schemeIdUri attribute is "urn:avs:ims:2018:av3f".
• The video auxiliary information descriptor may exist at the adaptation set level or the representation level. When it exists at the adaptation set level, it describes the relevant attributes of the video auxiliary information of all representations in the corresponding adaptation set; when it exists at the representation level, it describes the video auxiliary information of the corresponding representation.
  • the specific attributes of the video auxiliary information descriptor can be shown in Table 3 below:
• In Table 3, 'Elements and Attributes' in the first column is short for 'Elements and Attributes for VideoAuxiliaryInfoDescriptor', that is, the elements and attributes of the video auxiliary information descriptor;
  • VideoAuxiliaryInfo is the video auxiliary information field
  • VideoAuxiliaryInfo@withDepthInfo is the video auxiliary information field with depth information
  • VideoAuxiliaryInfo@withParallaxInfo is the video auxiliary information field with parallax information
  • VideoAuxiliaryInfo@withAuxiliaryInfo is the video auxiliary information field with auxiliary information
• VideoAuxiliaryInfo@depthCodecDependency is the depth codec dependency field of the video auxiliary information; VideoAuxiliaryInfo@parallaxCodecDependency is the parallax codec dependency field of the video auxiliary information; VideoAuxiliaryInfo@auxilliaryCodecDependency is the auxiliary codec dependency field of the video auxiliary information.
• In the second column, Use indicates how each attribute is used: O means the corresponding attribute is Optional, CM means Conditional Mandatory (mandatory under certain conditions), and M means Mandatory. In the third column, Data type indicates the data type of each attribute; avs:ims:2018:av31 and xs:bool are two different data types. Description in the fourth column describes each attribute.
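As an illustration, the MPD element for the video auxiliary information descriptor could be assembled as below; the attribute names follow Table 3, while the builder function itself and the boolean serialization are a hypothetical sketch:

```python
import xml.etree.ElementTree as ET

def build_video_auxiliary_descriptor(with_depth: bool,
                                     depth_dep: bool) -> ET.Element:
    """Build a SupplementalProperty element for the video auxiliary
    information descriptor, whose @schemeIdUri attribute is
    "urn:avs:ims:2018:av3f", carrying the withDepthInfo and (when depth
    is present) depthCodecDependency attributes from Table 3."""
    prop = ET.Element('SupplementalProperty',
                      schemeIdUri='urn:avs:ims:2018:av3f')
    aux = ET.SubElement(prop, 'VideoAuxiliaryInfo')
    aux.set('withDepthInfo', 'true' if with_depth else 'false')
    if with_depth:
        aux.set('depthCodecDependency', 'true' if depth_dep else 'false')
    return prop
```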
  • the free-view camera descriptor (AvsFreeViewCamInfoDescriptor), which is used to describe the camera information corresponding to the video resource.
  • the free-view camera descriptor is a SupplementalProperty element, and its @schemeIdUri attribute is "urn:avs:ims:2018:av31".
• The free-view camera descriptor can exist at the adaptation set level or the representation level. When it exists at the adaptation set level, it describes the camera information corresponding to all representations in the corresponding adaptation set; when it exists at the representation level, it describes the camera information corresponding to the corresponding representation.
  • the two fields AvsFreeViewCam@sourceId (free-view camera source identification field) and AvsFreeViewCam@combinationId (free-view camera combination identification field) are fields extended in the free-view camera descriptor according to the embodiment of the present application. Other fields besides these two fields are the original fields in the free-view camera descriptor.
• In the table for the free-view camera descriptor, 'Elements and Attributes' in the first column is short for 'Elements and Attributes for AvsFreeViewCamInfoDescriptor', that is, the elements and attributes of the free-view camera descriptor; the meanings of the entries in the second column are as explained for Table 3; avs:ims:2018:av3l, xs:unsignedInt, and xs:float in the third column are three different data types.
• In addition, the representation (video resource) or adaptation set corresponding to a video track can also be indicated through the preselection (pre-selection set) tool.
  • video resources corresponding to different cameras of the same video content should exist in the same preselection.
• VideoAuxiliaryInfoDescriptor (video auxiliary information descriptor): used to describe the relevant attributes of the video auxiliary information of a video resource.
  • the definition of the video auxiliary information descriptor can be shown in Table 5 below:
  • descriptor_tag (descriptor identifier field) is used to identify the type of descriptor (i.e., video auxiliary information descriptor);
  • descriptor_length (descriptor length field) is used to identify the length of the descriptor (i.e., video auxiliary information descriptor), in bytes.
  • video_with_depth_info_flag (video auxiliary information field with depth information) is used to indicate whether the current media resource (ie, the corresponding video resource) contains both texture information (image information) and depth information corresponding to the texture information (image information).
  • when the value of this field is the ninth value (such as the value 1), it indicates that the current media resource contains both texture information (image information) and the depth information corresponding to the texture information; when the value of this field is the tenth value (such as the value 0), it indicates that the current media resource only contains texture information (image information).
  • when the current media resource contains both texture information and the corresponding depth information, the video auxiliary information descriptor may also include depth_codec_dependency (the depth codec dependency field of the video auxiliary information); otherwise, the video auxiliary information descriptor may include only_depth_info_flag (the only-depth-information flag field of the video auxiliary information).
  • the depth_codec_dependency field is used to indicate whether there is a decoding dependency between the texture information (image information) contained in the current media resource (that is, the corresponding video resource) and the depth information corresponding to the texture information; when the value of this field is the eleventh value (such as the value 1), it indicates that there is a decoding dependency between the texture information contained in the current media resource and the corresponding depth information; when the value of this field is the twelfth value (such as the value 0), it indicates that there is no such decoding dependency.
  • the only_depth_info_flag field is used to indicate which information the current media resource (i.e., the corresponding video resource) contains; when the value of this field is the thirteenth value (such as the value 1), it indicates that the current media resource only contains depth information; when the value of this field is the fourteenth value (such as the value 0), it indicates that the current media resource only contains video image (texture) information.
  • video_with_parallax_info_flag (video auxiliary information field with parallax information) is used to indicate whether the current media resource (ie, the corresponding video resource) contains both video image information and parallax information corresponding to the video image information.
  • when the value of this field is the ninth value (such as the value 1), it indicates that the current media resource contains both video image information and the disparity information corresponding to the video image information; when the value of this field is the tenth value (such as the value 0), it indicates that the current media resource only contains video image information.
  • when the current media resource contains both video image information and the corresponding disparity information, the video auxiliary information descriptor may also include parallax_codec_dependency (the disparity codec dependency field of the video auxiliary information); otherwise, the video auxiliary information descriptor may include only_parallax_info_flag (the only-parallax-information flag field of the video auxiliary information).
  • the parallax_codec_dependency field is used to indicate whether there is a decoding dependency between the video image information contained in the current media resource (i.e., the corresponding video resource) and the disparity information corresponding to the video image information; when the value of this field is the eleventh value (such as the value 1), it indicates that there is a decoding dependency between the video image information contained in the current media resource and the corresponding disparity information; when the value of this field is the twelfth value (such as the value 0), it indicates that there is no such decoding dependency.
  • the only_parallax_info_flag field is used to indicate which information the current media resource (i.e., the corresponding video resource) contains; when the value of this field is the thirteenth value (such as the value 1), it indicates that the current media resource only contains parallax information; when the value of this field is the fourteenth value (such as the value 0), it indicates that the current media resource only contains video image information.
  • video_with_auxl_info_flag (video auxiliary information field with auxiliary information) is used to indicate whether the current media resource (ie, the corresponding video resource) contains both video image information and video auxiliary information corresponding to the video image information.
  • when the value of this field is the ninth value (such as the value 1), it indicates that the current media resource contains both video image information and the video auxiliary information corresponding to the video image information; when the value of this field is the tenth value (such as the value 0), it indicates that the current media resource only contains video image information.
  • the video auxiliary information here can be depth information, disparity information, or auxiliary information defined by other applications.
  • when the current media resource contains both video image information and the corresponding video auxiliary information, the video auxiliary information descriptor may also include auxl_codec_dependency (the auxiliary codec dependency field of the video auxiliary information); otherwise, the video auxiliary information descriptor may include only_auxiliary_info_flag (the only-auxiliary-information flag field of the video auxiliary information).
  • the auxl_codec_dependency field is used to indicate whether there is a decoding dependency between the video image information contained in the current media resource (i.e., the corresponding video resource) and the video auxiliary information corresponding to the video image information; when the value of this field is the eleventh value (such as the value 1), it indicates that there is a decoding dependency between the video image information contained in the current media resource and the corresponding video auxiliary information; when the value of this field is the twelfth value (such as the value 0), it indicates that there is no such decoding dependency.
  • the only_auxiliary_info_flag field is used to indicate which information the current media resource (i.e., the corresponding video resource) contains; when the value of this field is the thirteenth value (such as the value 1), it indicates that the current media resource only contains video auxiliary information; when the value of this field is the fourteenth value (such as the value 0), it indicates that the current media resource only contains video image information.
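The conditional field layout described above (a presence flag followed by either a codec dependency field or an only-information flag) can be sketched as a tiny serializer and parser. The one-byte-per-field layout and the tag value 0x7F are illustrative assumptions; the real descriptor's bit packing is defined by the signaling specification, not shown here.

```python
# Hedged sketch of the depth branch of the video auxiliary information
# descriptor: one byte per field and the tag value are assumptions.
def build_aux_info_descriptor(*, with_depth: bool,
                              depth_codec_dependency: bool = False,
                              only_depth_info: bool = False) -> bytes:
    """When video_with_depth_info_flag == 1 the descriptor carries
    depth_codec_dependency; otherwise it carries only_depth_info_flag."""
    body = bytearray()
    body.append(1 if with_depth else 0)                  # video_with_depth_info_flag
    if with_depth:
        body.append(1 if depth_codec_dependency else 0)  # depth_codec_dependency
    else:
        body.append(1 if only_depth_info else 0)         # only_depth_info_flag
    DESCRIPTOR_TAG = 0x7F  # placeholder tag value (assumption)
    return bytes([DESCRIPTOR_TAG, len(body)]) + bytes(body)

def parse_aux_info_descriptor(buf: bytes) -> dict:
    tag, length = buf[0], buf[1]                         # descriptor_tag, descriptor_length
    with_depth = buf[2] == 1
    out = {"descriptor_tag": tag, "descriptor_length": length,
           "video_with_depth_info_flag": with_depth}
    if with_depth:
        out["depth_codec_dependency"] = buf[3] == 1
    else:
        out["only_depth_info_flag"] = buf[3] == 1
    return out

d = parse_aux_info_descriptor(
    build_aux_info_descriptor(with_depth=True, depth_codec_dependency=True))
print(d)
```

The parallax and generic auxiliary branches (video_with_parallax_info_flag / parallax_codec_dependency / only_parallax_info_flag, and video_with_auxl_info_flag / auxl_codec_dependency / only_auxiliary_info_flag) follow the same pattern.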
  • the immersive media processing method can be executed by a media production device.
  • the immersive media processing method may include the following steps S201-S202:
  • the immersive media may be free-view video, or other immersive media other than free-view video, which is not limited here.
  • immersive media can be captured by multiple cameras; the information collected by any camera includes: video image information and corresponding video auxiliary information.
  • the video image information collected by any camera may include texture information
  • the video auxiliary information collected by any camera may include depth information; of course, it is understandable that the video image information and video auxiliary information are not limited here.
  • the video auxiliary information collected by any camera may include parallax information, or auxiliary information defined by other applications.
  • S202: Encapsulate the video bit stream into M video tracks to obtain a media file of the immersive media.
  • M is a positive integer.
  • the encapsulation method of immersive media is a single-track encapsulation method, that is, the media production device can encapsulate the video bit stream into a video track to obtain the media file of the immersive media.
  • alternatively, the encapsulation method of the immersive media may be a multi-track encapsulation method, that is, the media production device can encapsulate the video bit stream into multiple video tracks to obtain the media file of the immersive media; in this case, the media production device can encapsulate the video image information and corresponding video auxiliary information collected by the same camera into the same video track, or encapsulate the video image information and corresponding video auxiliary information collected by the same camera into different video tracks, and there is no limit to this.
  • one video track can correspond to one or more cameras, and there is no limit to this; for example, the media production device can choose one video track to encapsulate only at least one of the video image information and corresponding video auxiliary information collected by one camera; it can also choose one video track to encapsulate the video image information and corresponding video auxiliary information collected by at least two cameras; or it can choose one video track to encapsulate the video image information collected by at least two cameras, or the video auxiliary information collected by at least two cameras.
  • media production equipment can also perform the following operations:
  • if M=1 (that is, the video bit stream of the immersive media is encapsulated into a single video track), the media production device can add metadata information to the video media header data box of this only video track. If M>1 (that is, the video bit stream of the immersive media is encapsulated into multiple video tracks), then in the case where the video image information and corresponding video auxiliary information collected by the same camera are encapsulated into the same video track, the media production device can add metadata information to the video media header data box of each of the M video tracks; in the case where the video image information and corresponding video auxiliary information collected by the same camera are encapsulated into different video tracks, considering that the M video tracks can be subdivided into one or more video image information tracks and one or more video auxiliary information tracks, and that the video auxiliary information tracks can be referenced or associated to the corresponding video image information tracks, to save resources the media production device may add metadata information only in the video media header data box of each video image information track among the M video tracks.
  • in summary, the media production device can at least add metadata information in the video media header data box of any video track that encapsulates video image information; that is, if the m-th video track among the M video tracks encapsulates the video image information of at least one camera, the video media header data box of the m-th video track includes metadata information, m ∈ [1, M].
  • the metadata information includes a content information flag field; the media production device can set the value of the content information flag field according to whether the video track contains both video image information and corresponding video auxiliary information. Specifically, if the corresponding video track contains both video image information and corresponding video auxiliary information, the media production device can set the value of the corresponding content information flag field to the first value (such as the value 1); if the corresponding video track only contains video image information, the media production device can set the value of the corresponding content information flag field to the second value (such as the value 0).
  • a content information flag field, used to determine whether the video track contains both video image information and corresponding video auxiliary information, is set in the metadata information, which ensures the processing speed of the media consumption device during the decapsulation process and also saves processing resources.
  • the content information flag field may be any of the following: the video depth information flag field (i.e., the aforementioned video_with_depth_info_flag), the video parallax information flag field (i.e., the aforementioned video_with_parallax_info_flag), or the video auxiliary information flag field (i.e., the aforementioned video_with_auxl_info_flag).
  • when the content information flag field is the video depth information flag field, the video image information includes texture information and the video auxiliary information includes depth information; when the content information flag field is the video disparity information flag field, the video auxiliary information includes disparity information; when the content information flag field is the video auxiliary information flag field, the video auxiliary information includes depth information, disparity information, or auxiliary information defined by other applications.
  • the media production device may also choose to add metadata information in the video media header data box of a video track that encapsulates video auxiliary information; in this case, if the corresponding video track contains both video image information and corresponding video auxiliary information, the media production device can set the value of the corresponding content information flag field to the first value (such as the value 1); if the corresponding video track only contains video auxiliary information, the media production device can set the value of the corresponding content information flag field to the second value (such as the value 0).
  • the media production device may not pay attention to whether the video auxiliary information and video image information are encapsulated in the same video track, and directly choose to add metadata information in the video media header data box of each video track;
  • the information status of the video track can be indicated by three values of the content information flag field: if the corresponding video track contains both video image information and corresponding video auxiliary information, the media production device can set the value of the corresponding content information flag field to the first value (such as the value 1); if the corresponding video track only contains video image information, the media production device can set the value of the corresponding content information flag field to the second value (such as the value 0); if the corresponding video track only contains video auxiliary information, the media production device can set the value of the corresponding content information flag field to another value (such as the value 2), and so on.
  • when the content information flag field is the video depth information flag field, the video disparity information flag field, or the video auxiliary information flag field, it is possible to promptly clarify during the encapsulation process which of depth information, disparity information, and other auxiliary information the video auxiliary information is.
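The flag-setting rules above can be sketched as two small helper functions: the two-valued rule (1 = image plus auxiliary information, 0 = image only) and the three-valued variant where another value (e.g. 2) marks an auxiliary-only track. The function names and the boolean track model are illustrative assumptions; the values follow the text.

```python
# Hedged sketch of setting the content information flag field
# (video_with_depth_info_flag / video_with_parallax_info_flag /
# video_with_auxl_info_flag) in the video media header data box.
def content_info_flag(has_image_info: bool, has_aux_info: bool) -> int:
    """Two-valued rule: first value (1) when the track carries both video
    image information and the corresponding auxiliary information,
    second value (0) when it carries only image information."""
    if has_image_info and has_aux_info:
        return 1
    return 0

def content_info_flag_3state(has_image_info: bool, has_aux_info: bool) -> int:
    """Three-valued variant from the text: another value (here 2) marks a
    track that carries only video auxiliary information."""
    if has_image_info and has_aux_info:
        return 1
    if has_image_info:
        return 0
    return 2

print(content_info_flag(True, True), content_info_flag_3state(False, True))
```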
  • the media production device can also add a video information codec dependency field in the video media header data box of the corresponding video track, and set the value of the video information codec dependency field according to whether there is a decoding dependency between the video image information in the corresponding video track and the corresponding video auxiliary information.
  • the video media header data box in the corresponding video track also includes a video information codec dependency field; when the value of the video information codec dependency field is the third value, it indicates that there is a decoding dependency between the video image information in the corresponding video track and the corresponding video auxiliary information; when the value of the video information codec dependency field is the fourth value, it indicates that there is no decoding dependency between the video image information in the corresponding video track and the corresponding video auxiliary information.
  • if there is a decoding dependency between the video image information in the video track and the corresponding video auxiliary information, the value of the video information codec dependency field can be set to the third value (such as the value 1); if there is no decoding dependency between the video image information in the video track and the corresponding video auxiliary information, the value of the video information codec dependency field can be set to the fourth value (such as the value 0).
  • the specific form of the video information codec dependency field may depend on the specific form of the content information flag field: if the content information flag field is the video depth information flag field, the video information codec dependency field is the depth codec dependency field; if the content information flag field is the video disparity information flag field, the video information codec dependency field is the parallax codec dependency field; if the content information flag field is the video auxiliary information flag field, the video information codec dependency field is the auxiliary codec dependency field.
  • the video information codec dependency field is any of the following: a depth codec dependency field, a parallax codec dependency field, or an auxiliary codec dependency field; where: when the video information codec dependency field is the depth codec dependency field, the video image information includes texture information and the video auxiliary information includes depth information; when the video information codec dependency field is the parallax codec dependency field, the video auxiliary information includes disparity information; when the video information codec dependency field is the auxiliary codec dependency field, the video auxiliary information includes depth information, disparity information, or auxiliary information defined by other applications.
  • the depth codec dependency field is the aforementioned depth_codec_dependency
  • the parallax codec dependency field is the aforementioned parallax_codec_dependency
  • the auxiliary codec dependency field is the aforementioned auxl_codec_dependency.
  • when the video information codec dependency field is the depth codec dependency field, the parallax codec dependency field, or the auxiliary codec dependency field, it can be promptly clarified which of depth information, disparity information, and other auxiliary information the auxiliary information in the video track is.
  • the video component dependency information data box is used to indicate: the component type of the video content of the immersive media, and whether there are codec dependencies between different components of the video content of the immersive media.
  • a video component dependency information data box is defined; through this data box, it is possible to clarify the component type of the video content and whether there is a codec dependency relationship between the components, thereby improving codec efficiency.
  • the media production equipment can add a video component dependency information data box to the video track;
  • if a video track encapsulates the video image information collected by at least two cameras, or encapsulates the video auxiliary information collected by at least two cameras, the media production device can add a video component dependency information data box to that video track. It can be seen that if the m-th video track among the M video tracks encapsulates one or more types of information collected by at least two cameras, the m-th video track includes a video component dependency information data box.
  • when the number of video tracks is 1, there is no need for multiple video component dependency information data boxes to indicate the component types of the video content and the codec dependencies between different components.
  • when the number of video tracks is multiple, if a video track encapsulates multiple types of information collected by at least two cameras, the video component dependency information data box needs to be included in that video track to indicate the codec dependencies between the information of the cameras, thereby improving codec efficiency.
  • the video component dependency information data box includes a component codec dependency field; when the value of the component codec dependency field is the fifth value, it indicates that a codec dependency relationship exists between different components of the video content of the immersive media; when the value of the component codec dependency field is the sixth value, it indicates that there is no codec dependency relationship between different components of the video content of the immersive media.
  • the video component dependency information data box may include a component codec dependency field (i.e., the aforementioned composition_codec_dependency); the media production device may set the value of the component codec dependency field according to whether there is a codec dependency relationship between different components of the video content of the immersive media. Specifically, if there is a codec dependency relationship between different components of the video content of the immersive media, the media production device can set the value of the corresponding component codec dependency field to the fifth value (such as the value 1); if there is no codec dependency relationship between different components of the video content of the immersive media, the media production device can set the value of the corresponding component codec dependency field to the sixth value (such as the value 0).
  • the value of the component codec dependency field can accurately reflect whether there is a codec dependency relationship between different components of the video content of the immersive media.
  • the video component dependency information data box includes a component type field; when the value of the component type field is the seventh value, it indicates that the component type of the video content of the immersive media is a camera-based component type, wherein one component includes the video content captured by one camera; when the value of the component type field is the eighth value, it indicates that the component type of the video content of the immersive media is a binocular-based component type, wherein one component includes left-eye video content or right-eye video content.
  • the component type field is the aforementioned composition_type; the media production device can set the value of the component type field according to the component type of the video content of the immersive media (that is, the division rule for the components of the video content of the immersive media). Specifically, if the component type of the video content of the immersive media is the camera-based component type, that is, one component includes the video content shot by one camera, the media production device can set the value of the corresponding component type field to the seventh value (such as the value 0); if the component type of the video content of the immersive media is the binocular-based component type, that is, one component includes left-eye video content or right-eye video content, the media production device can set the value of the corresponding component type field to the eighth value (such as the value 1).
  • through the value of the component type field in the video component dependency information data box, it is possible to accurately distinguish whether the component type of the video content of the immersive media is the camera-based component type or the binocular-based component type.
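The two fields of the video component dependency information data box described above can be sketched as a small class. The class name, the constants, and the helper method are illustrative assumptions; the field names (composition_codec_dependency, composition_type) and the example values (camera-based = 0, binocular-based = 1, dependency present = 1, absent = 0) follow the text.

```python
# Hedged sketch of the video component dependency information data box; the
# real box's fourCC and bit layout are not specified in this passage.
CAMERA_BASED = 0     # seventh value: one component = video shot by one camera
BINOCULAR_BASED = 1  # eighth value: one component = left-eye or right-eye video

class VideoCompositionDependencyInfo:
    def __init__(self, composition_codec_dependency: int, composition_type: int):
        # fifth value (1): components depend on each other when coding/decoding;
        # sixth value (0): components are independent
        self.composition_codec_dependency = composition_codec_dependency
        # CAMERA_BASED or BINOCULAR_BASED
        self.composition_type = composition_type

    def components_independent(self) -> bool:
        """True when different components can be decoded separately."""
        return self.composition_codec_dependency == 0

box = VideoCompositionDependencyInfo(0, CAMERA_BASED)
print(box.components_independent())
```

A media consumption device could use components_independent() to decide whether a single camera's component can be handed to a decoder on its own.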
  • the video track encapsulating video image information among the M video tracks also includes a track group data box; wherein the track group data box includes a track group identification field and a camera identification field;
  • the track group identification field is used to store the identification information of the track group corresponding to the corresponding video track
  • the camera identification field is used to store the identification information of the camera corresponding to the corresponding video track.
  • the track group data box may include at least a track group identification field and a camera identification field; for any video track that encapsulates video image information, the media production device may store the identification information of the track group corresponding to that video track into the track group identification field in the track group data box in the video track, and store the identification information of the camera corresponding to that video track into the camera identification field in the track group data box in the video track.
  • the media production device can accurately store the identification information of the track group and the identification information of the camera in the corresponding identification fields.
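The track group mechanism above can be sketched as follows: tracks whose track group identification matches belong to the same video content, and the camera identification tells which camera each track carries. The class and function names are illustrative assumptions; only the two fields come from the text.

```python
# Hedged sketch of the track group data box fields (track group identification
# field and camera identification field) and how a reader might group tracks.
class TrackGroupDataBox:
    def __init__(self, track_group_id: int, camera_id: int):
        self.track_group_id = track_group_id  # track group identification field
        self.camera_id = camera_id            # camera identification field

def group_tracks_by_content(tracks):
    """Collect the camera ids of all tracks sharing a track group id,
    i.e. all tracks carrying the same video content."""
    groups = {}
    for t in tracks:
        groups.setdefault(t.track_group_id, []).append(t.camera_id)
    return groups

tracks = [TrackGroupDataBox(100, 1), TrackGroupDataBox(100, 2),
          TrackGroupDataBox(200, 1)]
print(group_tracks_by_content(tracks))
```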
  • the media production device can also add a track reference type data box to each video auxiliary information track, so as to reference or associate the video auxiliary information track with the video image information track corresponding to the same camera, and to indicate whether there is a dependency between the video auxiliary information track and the corresponding video image information track during decoding.
  • the M video tracks include: one or more video image information tracks and one or more video auxiliary information tracks; wherein, a video image information track refers to a video track that encapsulates video image information, and a video auxiliary information track refers to a video track that encapsulates video auxiliary information; any video auxiliary information track includes a track reference type data box, which is used to reference or associate the video auxiliary information track to the video image information track corresponding to the same camera, and is used to indicate whether there is a dependency between the video auxiliary information track and the corresponding video image information track during decoding.
  • the video tracks are divided into tracks that encapsulate at least one type of video image information and tracks that encapsulate at least one type of video auxiliary information, and the decoding dependencies indicated by the track reference type data box corresponding to each video auxiliary information track can improve decoding efficiency.
  • the track reference type data box contains a track identification field for storing identification information of the video image information track associated with the video auxiliary information track.
  • the track reference type data box contains a track identification field (track_ID); for any video auxiliary information track, the media production device can store the identification information of the video image information track associated with that video auxiliary information track into the track identification field in the track reference type data box in the video auxiliary information track.
  • the media production device can accurately store the identification information of the video image information track associated with each video auxiliary track.
  • the track reference type data box contains a reference type field; when the reference type field stores the first track index, it indicates that the video auxiliary information track relies on the corresponding video image information track when decoding; when the reference type field stores the second track index, it indicates that the video auxiliary information track does not rely on the corresponding video image information track when decoding.
  • the media production device can select the first track index or the second track index to store into the reference type field in the track reference type data box in the video auxiliary information track, according to whether the video auxiliary information track relies on the corresponding video image information track when decoding. Specifically, if the video auxiliary information track relies on the corresponding video image information track when decoding, the first track index may be stored in the reference type field in the track reference type data box in the video auxiliary information track; if the video auxiliary information track does not rely on the corresponding video image information track when decoding, the second track index may be stored in the reference type field in the track reference type data box in the video auxiliary information track.
  • whether there is a dependency between the video auxiliary information track and the video image information track can be clarified based on the track index stored in the reference type field, so that decoding can be completed accurately and quickly, improving decoding efficiency.
  • the first track index is any one of the following: a first character string, used to indicate that the video auxiliary information in the video auxiliary information track includes depth information; a second character string, used to indicate that the video auxiliary information in the video auxiliary information track includes parallax information; a third character string, used to indicate that the video auxiliary information in the video auxiliary information track includes depth information, disparity information, or auxiliary information defined by other applications. The second track index is any one of the following: a fourth character string, used to indicate that the video auxiliary information in the video auxiliary information track includes depth information; a fifth character string, used to indicate that the video auxiliary information in the video auxiliary information track includes disparity information; a sixth character string, used to indicate that the video auxiliary information in the video auxiliary information track includes depth information, disparity information, or auxiliary information defined by other applications.
  • the first string is the aforementioned 'vdep'.
  • the second string is the aforementioned 'vplx'.
  • the third string is the aforementioned 'auxl'.
  • the fourth string is the aforementioned 'vdpi'.
  • the fifth string is the aforementioned 'vpli'.
  • the sixth string is the aforementioned 'auxi'.
  • setting a corresponding string for each type of video auxiliary information makes it possible to distinguish whether the video auxiliary information indicated by the first track index or the second track index is depth information, disparity information, or other auxiliary information.
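The correspondence above can be sketched in Python as a reading aid (not part of the specification): the six string values 'vdep', 'vplx', 'auxl', 'vdpi', 'vpli', and 'auxi' are taken from the text, while the function and dictionary names are hypothetical.

```python
# First track indexes: the auxiliary information track DEPENDS on the
# corresponding video image information track when decoding.
DEPENDENT_REFERENCE_TYPES = {
    "vdep": "depth information",
    "vplx": "disparity information",
    "auxl": "depth, disparity, or application-defined auxiliary information",
}

# Second track indexes: the auxiliary information track can be decoded
# INDEPENDENTLY of the corresponding video image information track.
INDEPENDENT_REFERENCE_TYPES = {
    "vdpi": "depth information",
    "vpli": "disparity information",
    "auxi": "depth, disparity, or application-defined auxiliary information",
}

def interpret_reference_type(reference_type: str):
    """Return (depends_on_image_track, kind_of_auxiliary_information)."""
    if reference_type in DEPENDENT_REFERENCE_TYPES:
        return True, DEPENDENT_REFERENCE_TYPES[reference_type]
    if reference_type in INDEPENDENT_REFERENCE_TYPES:
        return False, INDEPENDENT_REFERENCE_TYPES[reference_type]
    raise ValueError(f"unknown reference type: {reference_type!r}")
```

For example, a reference type of 'vdep' would be read as depth information whose track must be decoded together with the referenced image track.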
  • the video bitstream of the immersive media is obtained and encapsulated into M video tracks to obtain the media file of the immersive media; in this process, metadata information is encapsulated in the video media header data box of each video track to provide a unified indication of the information in the corresponding video track. This organizes the media content of the immersive media in a more optimized way and avoids the media consumption device repeatedly parsing, during decapsulation, the relevant fields in other data boxes used to indicate the information status of the video track, thereby saving processing resources, improving decapsulation efficiency, and thus improving decoding efficiency.
  • the method further includes: generating a signaling description file corresponding to the media file; wherein the signaling description file includes at least one video auxiliary information descriptor, and the video auxiliary information descriptor is used to describe the relevant attributes of the video auxiliary information of a video resource.
  • the transmission signaling corresponding to the signaling description file may be DASH signaling or SMT signaling, which is not limited.
  • the media production device may add at least one video auxiliary information descriptor in the signaling description file to describe the relevant attributes of the video auxiliary information of the video resource.
  • when M>1, it indicates that the immersive media is encapsulated in multi-track form, and the video resource or adaptation set corresponding to each video track is indicated through the preselection tool.
  • the preselection tool can effectively and accurately indicate the video resource or adaptation set corresponding to each video track.
  • a signaling description file including at least one video auxiliary information descriptor is generated, so that the relevant attributes of the video auxiliary information can be described in detail and accurately.
  • when the video auxiliary information descriptor is located at the adaptation set level of the signaling description file, it describes the relevant attributes of the video auxiliary information of each video resource in the corresponding adaptation set; when the video auxiliary information descriptor is located at the representation level of the signaling description file, it describes the relevant attributes of the video auxiliary information of the corresponding video resource.
  • the video auxiliary information descriptor can be located at the representation level to describe the relevant attributes of the video auxiliary information of the corresponding video resource.
  • in this case, the number of video auxiliary information descriptors in the signaling description file is the same as the number of video resources (representations) in the media file; alternatively, the video auxiliary information descriptor can be located at the adaptation set level to describe the relevant attributes of the video auxiliary information of each video resource in the corresponding adaptation set.
  • in this case, the number of video auxiliary information descriptors in the signaling description file is the same as the number of adaptation sets in the media file.
  • in this way, the relevant attributes matching each level can be accurately described.
  • the media production device may add a video auxiliary information field in the video auxiliary information descriptor.
  • when the video auxiliary information descriptor is located at the representation level, the media production device can store the relevant attributes of the video auxiliary information of the corresponding video resource into the video auxiliary information field; when the video auxiliary information descriptor is located at the adaptation set level, the media production device can store the relevant attributes of the video auxiliary information of the corresponding adaptation set into the video auxiliary information field.
  • in this way, the relevant attributes of the video auxiliary information of the video resource or of the adaptation set can be stored in the corresponding video auxiliary information field in a timely and accurate manner, improving the accuracy of video auxiliary information storage.
  • the video auxiliary information descriptor includes an auxiliary information indication field; when the value of the auxiliary information indication field is the ninth value, it indicates that the corresponding video resource contains both video image information and corresponding video auxiliary information; when the value of the auxiliary information indication field is the tenth value, it indicates that the corresponding video resource only contains video image information.
  • the media production device can add an auxiliary information indication field in the video auxiliary information descriptor.
  • the media production device may set the value of the auxiliary information indication field according to whether the video resource described by the video auxiliary information descriptor contains both video image information and corresponding video auxiliary information.
  • if the video resource described by the video auxiliary information descriptor contains both video image information and corresponding video auxiliary information, the media production device sets the value of the auxiliary information indication field to the ninth value (such as the value 1); if the video resource described by the video auxiliary information descriptor only contains video image information, the media production device sets the value of the auxiliary information indication field to the tenth value (such as the value 0).
  • through the auxiliary information indication field, it can be determined in time whether the corresponding video resource only contains video image information, or contains both video image information and corresponding video auxiliary information.
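A minimal sketch of how the media production device might set this field, assuming (as the text's examples suggest) the ninth value is 1 and the tenth value is 0; the function name and the dict-based "descriptor" are illustrative, not part of the specification.

```python
def set_auxiliary_info_indication(descriptor: dict,
                                  has_image_info: bool,
                                  has_auxiliary_info: bool) -> dict:
    """Set the auxiliary information indication field of a descriptor."""
    if has_image_info and has_auxiliary_info:
        # Ninth value: the resource carries both image and auxiliary information.
        descriptor["auxiliary_info_indication"] = 1
    else:
        # Tenth value: the resource carries video image information only.
        descriptor["auxiliary_info_indication"] = 0
    return descriptor
```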
  • the auxiliary information indication field is any one of the following: a video auxiliary information field with depth information, a video auxiliary information field with disparity information, or a video auxiliary information field with auxiliary information; wherein: when the auxiliary information indication field is the video auxiliary information field with depth information, the video image information includes texture information and the video auxiliary information includes depth information; when the auxiliary information indication field is the video auxiliary information field with disparity information, the video auxiliary information includes disparity information; when the auxiliary information indication field is the video auxiliary information field with auxiliary information, the video auxiliary information includes depth information, disparity information, or auxiliary information defined by other applications.
  • the video auxiliary information field with depth information is the VideoAuxiliaryInfo@withDepthInfo involved in the aforementioned DASH signaling or the video_with_depth_info_flag involved in SMT signaling
  • the video auxiliary information field with disparity information is the corresponding field involved in the aforementioned DASH signaling.
  • the video auxiliary information field with auxiliary information is VideoAuxiliaryInfo@withAuxiliaryInfo related to the aforementioned DASH signaling or video_with_auxl_info_flag related to the SMT signaling.
  • when the auxiliary information indication field is the video auxiliary information field with depth information, with disparity information, or with auxiliary information, it can be determined in time during the transmission process which of depth information, disparity information, and auxiliary information the video resource contains in addition to its video image information.
  • the media production device may add an auxiliary information codec dependency field in the video auxiliary information descriptor, and set the value of the auxiliary information codec dependency field according to whether there is a decoding dependency relationship between the video image information in the corresponding video resource and the corresponding video auxiliary information.
  • the video auxiliary information descriptor includes an auxiliary information codec dependency field; when the value of the auxiliary information codec dependency field is the eleventh value, it indicates that there is a decoding dependency relationship between the video image information in the corresponding video resource and the corresponding video auxiliary information; when the value of the auxiliary information codec dependency field is the twelfth value, it indicates that there is no decoding dependency relationship between the video image information in the corresponding video resource and the corresponding video auxiliary information.
  • if there is a decoding dependency relationship between the video image information in the video resource described by the video auxiliary information descriptor and the corresponding video auxiliary information, the value of the auxiliary information codec dependency field can be set to the eleventh value (such as the value 1); if there is no such decoding dependency relationship, the value of the auxiliary information codec dependency field can be set to the twelfth value (such as the value 0).
  • by identifying the value of the auxiliary information codec dependency field, it can be quickly determined whether there is a decoding dependency relationship between the video image information in the video resource and the corresponding video auxiliary information, thereby improving decoding efficiency.
  • the specific form of the auxiliary information codec dependency field may depend on the specific form of the auxiliary information indication field: if the auxiliary information indication field is the video auxiliary information field with depth information, the auxiliary information codec dependency field is the depth codec dependency field of the video auxiliary information; if the auxiliary information indication field is the video auxiliary information field with disparity information, the auxiliary information codec dependency field is the disparity codec dependency field of the video auxiliary information; if the auxiliary information indication field is the video auxiliary information field with auxiliary information, the auxiliary information codec dependency field is the auxiliary codec dependency field of the video auxiliary information.
  • the auxiliary information codec dependency field is any one of the following: the depth codec dependency field of the video auxiliary information, the disparity codec dependency field of the video auxiliary information, or the auxiliary codec dependency field of the video auxiliary information; wherein: when the auxiliary information codec dependency field is the depth codec dependency field of the video auxiliary information, the video image information includes texture information and the video auxiliary information includes depth information; when the auxiliary information codec dependency field is the disparity codec dependency field of the video auxiliary information, the video auxiliary information includes disparity information; when the auxiliary information codec dependency field is the auxiliary codec dependency field of the video auxiliary information, the video auxiliary information includes depth information, disparity information, or auxiliary information defined by other applications.
  • the depth codec dependency field of the video auxiliary information is the VideoAuxiliaryInfo@depthCodecDependency involved in the aforementioned DASH signaling or the depth_codec_dependency involved in the SMT signaling
  • the parallax codec dependency field of the video auxiliary information is the VideoAuxiliaryInfo@parallaxCodecDependency involved in the aforementioned DASH signaling.
  • the auxiliary codec dependency field of the video auxiliary information is the VideoAuxiliaryInfo@auxilliaryCodecDependency involved in the aforementioned DASH signaling or the auxl_codec_dependency involved in SMT signaling.
  • when the auxiliary information codec dependency field is the depth codec dependency field, the disparity codec dependency field, or the auxiliary codec dependency field, it can be determined in time during the transmission process which of depth information, disparity information, and auxiliary information the auxiliary information is.
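The pairing between indication fields and codec dependency fields described above can be sketched as a lookup table. The DASH attribute names are the ones given in the text (the disparity pair is omitted because the text does not name the corresponding "with disparity" attribute); the dictionary itself is an illustrative reading aid.

```python
# Hypothetical mapping: the form of the auxiliary information codec dependency
# field follows the form of the auxiliary information indication field.
INDICATION_TO_DEPENDENCY_FIELD = {
    # with-depth indication -> depth codec dependency field
    "VideoAuxiliaryInfo@withDepthInfo": "VideoAuxiliaryInfo@depthCodecDependency",
    # with-auxiliary indication -> auxiliary codec dependency field
    # (attribute spelling as it appears in the text)
    "VideoAuxiliaryInfo@withAuxiliaryInfo": "VideoAuxiliaryInfo@auxilliaryCodecDependency",
}

def dependency_field_for(indication_field: str) -> str:
    """Return the codec dependency field matching an indication field."""
    return INDICATION_TO_DEPENDENCY_FIELD[indication_field]
```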
  • the media production device can also add a video auxiliary identification field to the video auxiliary information descriptor, and set the value of the video auxiliary identification field according to the information type in the corresponding video resource.
  • the video auxiliary information descriptor includes a video auxiliary identification field; when the value of the video auxiliary identification field is the thirteenth value, it indicates that the corresponding video resource only contains video auxiliary information corresponding to video image information; when the value of the video auxiliary identification field is the fourteenth value, it indicates that the corresponding video resource only contains video image information.
  • if the video resource described by the video auxiliary information descriptor only contains video auxiliary information corresponding to video image information, the value of the video auxiliary identification field can be set to the thirteenth value (such as the value 1); if the video resource only contains video image information, the value of the video auxiliary identification field can be set to the fourteenth value (such as the value 0).
  • the value of the video auxiliary identification field can accurately indicate whether the corresponding video resource only contains video auxiliary information or only video image information during the transmission process.
  • the video auxiliary identification field is any one of the following: the depth-information-only flag field of the video auxiliary information, the disparity-information-only flag field of the video auxiliary information, or the auxiliary-information-only flag field of the video auxiliary information; wherein: when the video auxiliary identification field is the depth-information-only flag field of the video auxiliary information, the video image information includes texture information and the video auxiliary information includes depth information; when the video auxiliary identification field is the disparity-information-only flag field of the video auxiliary information, the video auxiliary information includes disparity information; when the video auxiliary identification field is the auxiliary-information-only flag field of the video auxiliary information, the video auxiliary information includes depth information, disparity information, or auxiliary information defined by other applications.
  • the depth-information-only flag field of the video auxiliary information is the VideoAuxiliaryInfo@onlyDepthInfo involved in the aforementioned DASH signaling or the only_depth_info_flag involved in SMT signaling
  • the disparity-information-only flag field of the video auxiliary information is the VideoAuxiliaryInfo@onlyParallaxInfo involved in the aforementioned DASH signaling or the only_parallax_info_flag involved in SMT signaling
  • the auxiliary-information-only flag field of the video auxiliary information is the VideoAuxiliaryInfo@onlyAuxiliaryInfo involved in the aforementioned DASH signaling or the only_auxiliary_info_flag involved in SMT signaling.
  • in this way, the video auxiliary identification field may be the depth-information-only flag field, the disparity-information-only flag field, or the auxiliary-information-only flag field.
  • the signaling description file also includes at least one free-view camera descriptor, and the free-view camera descriptor is used to describe the camera information corresponding to a video resource; wherein the free-view camera descriptor includes a free-view camera source identification field, and the value of the free-view camera source identification field is used to indicate the content identifier to which the corresponding video resource belongs, where video resources corresponding to different cameras under the same video content have the same content identifier; alternatively, the free-view camera descriptor includes a free-view camera combination identification field, and the value of the free-view camera combination identification field is used to indicate the combination identifier to which the corresponding video resource belongs, where video resources corresponding to different cameras under the same video content have the same combination identifier.
  • the transmission signaling is DASH signaling
  • the free-view camera descriptor can be located at the representation level to describe the camera information corresponding to the corresponding video resource.
  • in this case, the number of free-view camera descriptors in the signaling description file is the same as the number of video resources (representations) in the media file; alternatively, the free-view camera descriptor can be located at the adaptation set level to describe the camera information corresponding to each video resource in the corresponding adaptation set.
  • in this case, the number of free-view camera descriptors in the signaling description file is the same as the number of adaptation sets in the media file.
  • the media production device can add a free-view camera source identification field in the free-view camera descriptor, and store the content identifier to which the video resource described by the free-view camera descriptor belongs in the free-view camera source identification field; alternatively, the media production device can add a free-view camera combination identification field in the free-view camera descriptor, and store the combination identifier to which the video resource described by the free-view camera descriptor belongs in the free-view camera combination identification field.
  • the media production device can also add one or more AvsFreeViewCam (free-view camera) fields in the free-view camera descriptor, and store the camera information corresponding to the corresponding representation or adaptation set into AvsFreeViewCam.
  • AvsFreeViewCam@cameraId is the free-view camera identification field
  • other information of each camera corresponding to the video resource can also be added to the free-view camera descriptor, such as the x, y, and z component values of the corresponding camera position, and the x, y, and z component values of the camera focal length.
  • the camera information corresponding to the video resource can be described accurately and comprehensively.
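The shape of the camera information carried by a free-view camera descriptor, as described above, can be sketched with simple dataclasses. The AvsFreeViewCam@cameraId field name is taken from the text; the dataclass and attribute names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class FreeViewCamera:
    """One AvsFreeViewCam entry: identifier, position, and focal length."""
    camera_id: int      # AvsFreeViewCam@cameraId (free-view camera identification field)
    pos_x: float = 0.0  # x/y/z component values of the camera position
    pos_y: float = 0.0
    pos_z: float = 0.0
    focal_x: float = 0.0  # x/y/z component values of the camera focal length
    focal_y: float = 0.0
    focal_z: float = 0.0

@dataclass
class FreeViewCamDescriptor:
    """A free-view camera descriptor for one representation or adaptation set."""
    source_id: int  # content identifier (free-view camera source identification field)
    cameras: list = field(default_factory=list)  # one or more AvsFreeViewCam entries
```

Video resources corresponding to different cameras under the same video content would then share the same `source_id`.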
  • corresponding information is added to the signaling description file to indicate whether each video resource in the media file contains both video image information and video auxiliary information, whether there is a decoding dependency relationship between the various pieces of information in a video resource, and whether there is a decoding dependency relationship between different components of the immersive media video content; this enables the client in the media consumption device, during free-view video consumption, to select the corresponding resources for consumption and to decide whether to perform segmented decoding according to the instructions of the signaling description file, which can effectively improve the playback performance of the client.
  • FIG. 3 is a schematic flowchart of another immersive media processing method provided by an embodiment of the present application.
  • the immersive media processing method can be executed by the media consumption device, or by a client in the media consumption device; for ease of explanation, the following description will take the media consumption device executing the immersive media processing method as an example.
  • the immersive media processing method may include the following steps S301-S302:
  • immersive media is captured by multiple cameras.
  • the information collected by any camera includes: video image information and corresponding video auxiliary information; and the media file is obtained by encapsulating the video bitstream of the immersive media into M video tracks.
  • M is a positive integer.
  • among the M video tracks, there is at least one video track whose video media header data box includes metadata information.
  • the metadata information is used to indicate whether the corresponding video track contains both video image information and corresponding video auxiliary information.
  • the video track encapsulating the video image information among the M video tracks also contains a track group data box.
  • the track group data box includes a track group identification field and a camera identification field; where:
  • the track group identification field is used to store the identification information of the track group corresponding to the corresponding video track
  • the camera identification field is used to store the identification information of the camera corresponding to the corresponding video track.
  • the M video tracks include: one or more video image information tracks and one or more video auxiliary information tracks; any video auxiliary information track includes a track reference type data box.
  • the track reference type data box is used to reference or associate the video auxiliary information track to the video image information track corresponding to the same camera, and to indicate whether there is a dependency between the video auxiliary information track and the corresponding video image information track during decoding.
  • the track reference type data box includes a track identification field, which is used to store identification information of the video image information track associated with the video auxiliary information track.
  • the track reference type data box contains a reference type field; when the reference type field stores the first track index, it indicates that the video auxiliary information track depends on the corresponding video image information track during decoding; when the reference type field stores the second track index, it indicates that the video auxiliary information track does not depend on the corresponding video image information track during decoding.
  • the media consumption device may download the media file of the immersive media from the media production device. After downloading the media file, the media consumption device can decapsulate the media file to obtain M video tracks, and then select a target video track from the M video tracks. Specifically, if the immersive media is a free-view video, the media consumption device can determine a target camera that matches the user's viewing angle and select the video track corresponding to the target camera from the M video tracks as the target video track; the target video track encapsulates one or more of the video image information and video depth information collected by the target camera.
  • the media consumption device can establish streaming transmission with the media production device and obtain, from the media production device, the signaling description file corresponding to the media file of the immersive media; the signaling description file includes at least one video auxiliary information descriptor, and the video auxiliary information descriptor is used to describe the relevant attributes of the video auxiliary information of a video resource.
  • the media consumption device can request the media production device to deliver the target video resource according to the instructions of the signaling description file; after receiving the target video resource, the media consumption device can decapsulate the target video resource to obtain the target video track.
  • the media consumption device can determine the target video resource according to the instructions of the signaling description file and combined with the user's viewing angle, as follows:
  • the media consumption device can select a target camera that matches the user's viewing angle from the multiple cameras based on the user's viewing angle and the camera position information described by each free-view camera descriptor in the signaling description file.
  • then, based on the camera identifier stored in the free-view camera identification field of each free-view camera descriptor, the free-view camera descriptor corresponding to the target camera is determined.
  • next, the video auxiliary information descriptor corresponding to that free-view camera descriptor can be determined from the signaling description file, and based on the value of the auxiliary information indication field in the determined video auxiliary information descriptor, it is determined whether the corresponding video resource contains both video image information and video auxiliary information, so that the target video resource can be determined based on this first determination result.
  • the media consumption device may determine the video resource described by the determined video auxiliary information descriptor as the target video resource.
  • the media consumption device can further determine, based on the value of the auxiliary information codec dependency field in the determined video auxiliary information descriptor, whether there is a decoding dependency relationship between the corresponding video image information and the video auxiliary information, so that the target video resource can be determined based on this second determination result.
  • when the value of the auxiliary information codec dependency field is the eleventh value, it indicates that there is a decoding dependency relationship between the video image information in the corresponding video resource and the corresponding video auxiliary information; when the value of the auxiliary information codec dependency field is the twelfth value, it indicates that there is no decoding dependency relationship between the video image information in the corresponding video resource and the corresponding video auxiliary information.
  • if the second determination result indicates that there is no decoding dependency relationship between the video image information in the corresponding video resource and the corresponding video auxiliary information, the media consumption device may take the video resource described by the determined video auxiliary information descriptor as the target video resource; if the second determination result indicates that there is a decoding dependency relationship between the video image information in the corresponding video resource and the corresponding video auxiliary information, the media consumption device can take the video resource described by the determined video auxiliary information descriptor, as well as other video resources associated with this video resource, as the target video resources.
  • the other video resources here refer to the video resources in which the video auxiliary information corresponding to the video image information of the video resource described by the determined video auxiliary information descriptor is located.
  • the media consumption device can identify, based on the value of the video auxiliary identification field in each video auxiliary information descriptor in the signaling description file, whether each video resource contains video image information or the video auxiliary information corresponding to video image information, thereby determining, according to the identification result, the other video resources associated with the video resource described by the determined video auxiliary information descriptor.
  • when the value of the video auxiliary identification field is the thirteenth value, it indicates that the corresponding video resource only contains video auxiliary information corresponding to video image information; when the value of the video auxiliary identification field is the fourteenth value, it indicates that the corresponding video resource only contains video image information.
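The selection walk-through above can be sketched end to end. This is an illustrative reduction, not the specification: each "resource" is modeled as a dict whose keys stand in for the auxiliary information indication field (`has_both`), the auxiliary information codec dependency field (`codec_dependency`), and the video auxiliary identification field (`only_auxiliary`); the 1/0 value semantics follow the text.

```python
def select_target_resources(resources, target_camera_id):
    """Pick the target video resource(s) for the camera matching the
    user's viewing angle, following the signaling description file."""
    # Step 1: the resource whose free-view camera descriptor matches the
    # target camera (here the first non-auxiliary-only match).
    primary = next(r for r in resources
                   if r["camera_id"] == target_camera_id
                   and not r.get("only_auxiliary"))
    targets = [primary]
    # Step 2: if the resource does not already carry both image and auxiliary
    # information, but its image information has a decoding dependency on
    # separately packaged auxiliary information, also fetch the associated
    # auxiliary-only resources of the same camera.
    if not primary["has_both"] and primary["codec_dependency"]:
        targets += [r for r in resources
                    if r.get("only_auxiliary")
                    and r["camera_id"] == target_camera_id]
    return targets
```

With this sketch, a camera whose image and depth are packaged separately with a decoding dependency yields two target resources, while a camera whose resource already contains both yields one.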
  • the media consumption device can request the media production device to deliver the target video resource in real time based on the obtained signaling description file, and decapsulate the target video resource.
  • in this way, the target video track can be obtained quickly and accurately, and the decapsulation efficiency can be improved.
  • the media consumption device can decode the information in the target video track based on the metadata information in the video media header data box of the target video track.
  • the metadata information may include a content information flag field.
  • when the value of the content information flag field is the first value, it indicates that the corresponding video track contains both video image information and corresponding video auxiliary information; when the value of the content information flag field is the second value, it indicates that the corresponding video track only contains video image information.
  • the media consumption device can first determine, based on the value of the content information flag field in the video media header data box of the target video track, whether the target video track contains both video image information and corresponding video auxiliary information; if it is determined that the target video track only contains video image information, the media consumption device can decode the video image information.
  • the video media header data box in the target video track at this time also includes a video information codec dependency field; when the value of the video information codec dependency field is the third value, it indicates that there is a decoding dependency relationship between the video image information in the corresponding video track and the corresponding video auxiliary information; when the value of the video information codec dependency field is the fourth value, it indicates that there is no decoding dependency relationship between the video image information in the corresponding video track and the corresponding video auxiliary information.
  • the media consumption device can first determine, based on the value of the video information codec dependency field in the video media header data box of the target video track, whether there is a decoding dependency relationship between the video image information in the target video track and the corresponding video auxiliary information, thereby determining whether to perform segmented decoding on the video image information and video auxiliary information in the target video track. Specifically, if there is a decoding dependency, the video image information and video auxiliary information in the target video track can be decoded simultaneously; if there is no decoding dependency, the video image information and video auxiliary information in the target video track can be decoded in segments, that is, texture information is decoded at one stage and depth information at another. Of course, when there is no decoding dependency, the video image information and the video auxiliary information can also be decoded simultaneously, which is not limited here.
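The decoding decision just described can be summarized in a short sketch. The function name is hypothetical, and the concrete numeric values (1 for the first/third values, 0 for the second/fourth values) are assumed by analogy with the example values the text gives for its other flag fields.

```python
def choose_decoding_strategy(content_info_flag: int,
                             codec_dependency: int) -> str:
    """Decide how to decode a target video track from its video media
    header data box fields."""
    if content_info_flag == 0:
        # Second value: the track only contains video image information.
        return "decode image information only"
    if codec_dependency == 1:
        # Third value: decoding dependency exists between image and
        # auxiliary information, so they are decoded together.
        return "decode image and auxiliary information simultaneously"
    # Fourth value: no dependency, so segmented decoding is allowed
    # (e.g. texture at one stage, depth at another).
    return "segmented decoding allowed"
```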
  • when the media consumption device obtains the target video resource through streaming transmission and obtains the target video track from the target video resource, the media consumption device already knows, from the relevant information in the signaling description file, whether the target video resource contains both video image information and video auxiliary information, and, when it contains both, whether there is a decoding dependency between the video image information and the video auxiliary information; therefore, after obtaining the target video track, the media consumption device can directly use the corresponding logic to decode the information in the target video track.
  • the media consumption device can decode the information in the target video track according to the metadata information in the video media header data box of the target video track, and the decoding logic is the same as the above-mentioned decoding logic.
  • the video track may also include a video component dependency information data box.
  • the video component dependency information data box may include a component codec dependency field and a component type field; correspondingly, if the target video track contains the video component dependency information data box, the media consumption device can learn the component type of the video content of the immersive media from the value of the component type field, and can learn from the value of the component codec dependency field whether there is a codec dependency between different components of the video content of the immersive media, so as to determine whether the information corresponding to each camera in the target video track can be decoded independently.
  • when the value of the component codec dependency field is the fifth value, it indicates that there is a codec dependency between different components of the video content of the immersive media; when the value of the component codec dependency field is the sixth value, it indicates that there is no codec dependency between different components of the video content of the immersive media.
  • when the value of the component type field is the seventh value, it indicates that the component type of the video content of the immersive media is a camera-based component type; when the value of the component type field is the eighth value, it indicates that the component type of the video content of the immersive media is a binocular-based component type.
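The interpretation of these two fields can be sketched as follows. The concrete numeric values chosen for the fifth to eighth values (1/0 for the dependency field, 0/1 for the type field, matching the `composition_codec_dependency`/`composition_type` values used in the application scenarios below) are assumptions for illustration, since this description leaves them unspecified.

```python
# Illustrative interpretation of the video component dependency information
# data box. The numeric values below (assumed: fifth value = 1, sixth = 0,
# seventh = 0, eighth = 1) are not fixed by this description.

COMPOSITION_TYPES = {
    0: "camera_based",     # assumed "seventh value": components split per camera
    1: "binocular_based",  # assumed "eighth value": components split per eye
}

def parse_component_dependency_box(box: dict) -> dict:
    """Return the component type and whether components depend on each other."""
    return {
        "composition_type": COMPOSITION_TYPES[box["composition_type"]],
        # assumed "fifth value" = 1 (dependency exists), "sixth value" = 0 (none)
        "components_codec_dependent": box["composition_codec_dependency"] == 1,
    }
```

With camera-based components and no codec dependency, the consumption device may decode the information of each camera independently.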
  • Embodiments of the present application can avoid the media consumption device from repeatedly parsing the relevant fields in other data boxes used to indicate the information in the video track during the decapsulation process, thereby saving processing resources, improving decapsulation efficiency, and thus improving decoding efficiency.
  • the embodiments of this application exemplarily propose the following three application scenarios:
  • the server generates the video bit stream of the free-view video and encapsulates the video bit stream into a media file F0, where the free-view video bit stream is encapsulated in single-track mode, as shown in Figure 4a. In F0:
  • video_with_depth_info_flag = 1, indicating that the video track is a track containing texture information and depth information;
  • depth_codec_dependency = 0, indicating that the depth information of the video track does not depend on the texture information during decoding;
  • composition_type = 0, indicating that the video content of this video track contains video content shot by different cameras;
  • composition_codec_dependency = 0, indicating that the video content captured by different cameras in the video track does not depend on each other during decoding.
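The field values listed for F0 above can be sketched as a plain mapping. The field names follow the scenario text; the dictionary layout and the helper function are illustrative assumptions, not the normative box format of the media file.

```python
# The single-track metadata of this scenario, expressed as a plain mapping
# for illustration (field names follow the scenario text; the container
# layout is an assumption, not the normative file format).

f0_video_media_header = {
    "video_with_depth_info_flag": 1,    # track contains texture + depth info
    "depth_codec_dependency": 0,        # depth decodes independently of texture
    "composition_type": 0,              # content from different cameras
    "composition_codec_dependency": 0,  # per-camera content decodes independently
}

def can_decode_depth_separately(header: dict) -> bool:
    """With no codec dependency, depth may be decoded in its own stage."""
    return (header["video_with_depth_info_flag"] == 1
            and header["depth_codec_dependency"] == 0)
```

For the F0 of this scenario, depth information may therefore be decoded in a separate stage from the texture information.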
  • server S1 further repackages F0 into a multi-video track free-view file F1.
  • for the processing of F1 in multi-track mode, please refer to the relevant descriptions of application scenarios 2 and 3 below.
  • server S2 provides a download service, making the media file F0 of the free-view video available for users to download without any further processing.
  • server S3 supports streaming transmission: it slices the media file F0 of the free-view video into media segments suitable for streaming transmission (in accordance with the specifications of the existing transmission standards), generates the corresponding transmission signaling, and delivers the signaling description file to the user.
  • the signaling description file of the transmission signaling indicates the following information:
  • VideoAuxiliaryInfo@withDepthInfo = 1, indicating that the video resource contains both texture information (image information) and the depth information corresponding to the texture information (image information);
  • VideoAuxiliaryInfo@depthCodecDependency = 0, indicating that there is no decoding dependency between the texture information (image information) contained in the corresponding video resource and the depth information corresponding to the texture information (image information).
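The signaling indication above can be sketched as follows. The `VideoAuxiliaryInfo` attribute names come from the scenario text; the XML element layout is an assumption for illustration only (actual transmission signaling, such as a DASH MPD, organizes descriptors differently).

```python
# Sketch of reading the VideoAuxiliaryInfo descriptor attributes from a
# signaling description file. Attribute names follow the scenario text;
# the XML layout below is an illustrative assumption.
import xml.etree.ElementTree as ET

SIGNALING = """
<Representation id="free_view_stream">
  <VideoAuxiliaryInfo withDepthInfo="1" depthCodecDependency="0"/>
</Representation>
"""

def read_aux_info(xml_text: str) -> dict:
    root = ET.fromstring(xml_text)
    aux = root.find("VideoAuxiliaryInfo")
    return {
        "with_depth_info": aux.get("withDepthInfo") == "1",
        "depth_codec_dependency": aux.get("depthCodecDependency") == "1",
    }
```

Having parsed these attributes before requesting segments, the client already knows whether staged decoding of texture and depth is permitted for the selected stream.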
  • Client C1 downloads the complete file F0 and plays it locally.
  • the client can perform partial decoding at different stages, that is, decoding texture information at a certain stage and decoding depth information at a certain stage to optimize the performance of client playback.
  • Client C2 establishes streaming transmission with server S3, and selects the corresponding free-view video stream for consumption based on the texture information, depth information and codec dependency information indicated in the transmission signaling.
  • the client can perform partial decoding at different stages, that is, decoding texture information at a certain stage and decoding depth information at a certain stage to optimize the performance of client playback.
  • in addition, the server can generate multiple media files, each media file corresponding to one free-view video stream; in some of these media files there is a decoding dependency between the video image information and the video auxiliary information, while in the other media files there is no decoding dependency between the video image information and the video auxiliary information.
  • the server generates the video bit stream of the free-view video and encapsulates the video bit stream into a media file F0, where the free-view video bit stream is encapsulated in multi-track mode, as shown in Figure 5a. In F0:
  • Track1~Track4 all contain track group data boxes (AvsFreeViewGroupBox), and track_group_id is 100, indicating that Track1~Track4 are K tracks of the same free-view video containing data shot by different cameras.
  • server S1 provides a download service, making the media file F0 of the free-view video available for users to download without any further processing.
  • server S2 supports streaming transmission: it slices the media file F0 of the free-view video into media segments suitable for streaming transmission (in accordance with the specifications of the existing transmission standards), generates the corresponding transmission signaling, and delivers the signaling description file to the user.
  • the signaling description file of the transmission signaling indicates the following information:
  • VideoAuxiliaryInfo@withDepthInfo = 1, indicating that the video resource contains both texture information (image information) and the depth information corresponding to the texture information (image information);
  • VideoAuxiliaryInfo@depthCodecDependency = 1, indicating that there is a decoding dependency between the texture information (image information) contained in the corresponding video resource and the depth information corresponding to the texture information (image information);
  • AvsFreeViewCamInfo@sourceId indicates that different video resources belong to the same video content
  • AvsFreeViewCamInfo@cameraId and AvsFreeViewCam@camera_pos_x/y/z indicate the camera identifier and camera location information respectively.
  • Client C1 downloads the complete file F0 and plays it locally.
  • according to the indications in the file about the texture information, depth information, codec dependency information and camera information, the client can partially decode only the texture information and depth information corresponding to a specific camera according to the user's viewing operation (in this case, because there is a decoding dependency between the texture information and the depth information, they cannot be decoded in stages), so as to optimize the performance of client playback.
  • Client C2 establishes streaming transmission with server S2 and, according to the indications about the texture information, depth information, codec dependency information and camera information in the transmission signaling, combined with the user's viewing operation, requests the server to transmit only the video resources containing the texture information and depth information corresponding to a specific camera, and then decodes and consumes the corresponding video resources.
  • the server generates the video bit stream of the free-view video and encapsulates the video bit stream into a media file F0, where the free-view video bit stream is encapsulated in multi-track mode, as shown in Figure 6a. In F0:
  • the handler type in the handler box (handlerbox) in Track11~Track14 is 'auxiliary video track', indicating that these video tracks (i.e., video auxiliary information tracks) are used to assist the consumption of certain video image information tracks (i.e., texture map tracks).
  • from the TrackReferenceTypeBox, the correspondence between the video tracks Track1~Track4 and Track11~Track14 can be known.
  • from the track reference type 'vdpi', it can be known that the video auxiliary information contained in Track11~Track14 is depth information, and that the decoding of the depth information does not depend on the corresponding texture information.
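The track-reference relationship described above can be sketched as follows. The 'vdpi' reference type is taken from the description; the in-memory track layout and helper function are illustrative assumptions, not the normative parsing of a TrackReferenceTypeBox.

```python
# Sketch of resolving which auxiliary (depth) track assists which texture
# track via track references. The 'vdpi' reference type comes from the
# description above (depth information whose decoding does not depend on
# the texture information); the track/record layout is an assumption.

def map_aux_to_texture(tracks: dict) -> dict:
    """Map auxiliary track id -> referenced texture track id for 'vdpi' refs."""
    mapping = {}
    for track_id, track in tracks.items():
        for ref_type, ref_ids in track.get("track_references", {}).items():
            if ref_type == "vdpi":
                for texture_id in ref_ids:
                    mapping[track_id] = texture_id
    return mapping

tracks = {
    "Track11": {"track_references": {"vdpi": ["Track1"]}},
    "Track12": {"track_references": {"vdpi": ["Track2"]}},
    "Track1": {},  # texture track: carries no references itself
}
```

With this mapping the client can fetch, for a chosen camera, the texture track plus exactly the depth track that references it.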
  • Track1~Track4 all contain track group data boxes (AvsFreeViewGroupBox), and track_group_id is 100, indicating that Track1~Track4 are K tracks of the same free-view video containing data shot by different cameras.
  • server S1 provides a download service, making the media file F0 of the free-view video available for users to download without any further processing.
  • server S2 supports streaming transmission: it slices the media file F0 of the free-view video into media segments suitable for streaming transmission (in accordance with the specifications of the existing transmission standards), generates the corresponding transmission signaling, and delivers the signaling description file to the user.
  • the signaling description file of the transmission signaling indicates the following information:
  • VideoAuxiliaryInfo@withDepthInfo = 0, indicating that the corresponding video resource contains only texture information;
  • AvsFreeViewCamInfo@sourceId = 100, indicating that different video resources belong to the same video content;
  • AvsFreeViewCamInfo@cameraId indicates the camera identifier, and AvsFreeViewCam@camera_pos_x/y/z indicate the camera position information;
  • VideoAuxiliaryInfo@onlyDepthInfo = 1, indicating that the corresponding video resource contains only depth information;
  • the media resources corresponding to Track1~Track4 can be associated with the media resources corresponding to Track11~Track14.
  • Client C1 downloads the complete file F0 and plays it locally.
  • the client can only partially decode the texture information and depth information corresponding to a specific camera based on the user's viewing operation.
  • the client can also decode the texture information and depth information in stages to optimize the performance of client playback.
  • Client C2 establishes streaming transmission with server S2; according to the indications about the texture information, depth information, codec dependency information and camera information in the transmission signaling, combined with the user's viewing operation, the client requests the server to transmit only the video resources containing the texture information and depth information corresponding to a specific camera, and after receiving the corresponding video resources, the texture information and depth information can be decoded and consumed in stages.
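The client-side flow of this scenario can be sketched as follows. All names below (`select_resources`, the resource records, the decode-order helper) are illustrative assumptions rather than a real streaming API; the staged order is permitted here because, per the 'vdpi' indication, depth decoding does not depend on texture.

```python
# Sketch of the client-side flow: pick the texture and depth resources for
# the camera the user is viewing, then decode them in stages (allowed here
# because depth does not depend on texture). Names are illustrative only.

def select_resources(camera_id: str, resources: list) -> list:
    """Keep only the resources (texture + depth) tied to one camera."""
    return [r for r in resources if r["camera_id"] == camera_id]

def staged_decode_order(selected: list) -> list:
    """Decode texture first, depth in a later stage (no codec dependency)."""
    textures = [r["name"] for r in selected if r["kind"] == "texture"]
    depths = [r["name"] for r in selected if r["kind"] == "depth"]
    return textures + depths

resources = [
    {"name": "tex_cam1", "camera_id": "cam1", "kind": "texture"},
    {"name": "dep_cam1", "camera_id": "cam1", "kind": "depth"},
    {"name": "tex_cam2", "camera_id": "cam2", "kind": "texture"},
]
```

The client would request from the server only the resources returned by `select_resources`, then feed them to the decoder in the order given by `staged_decode_order`.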
  • an immersive media processing device may run on the above-mentioned media production device; the device may be a computer-readable instruction (including program code) running in the media production device.
  • the immersive media processing device can execute the method shown in Figure 2; referring to Figure 7, the immersive media processing device can run the following units:
  • the processing unit 701 is used to obtain the video bit stream of the immersive media, which is captured by multiple cameras; the information collected by any camera through photography includes: video image information and corresponding video auxiliary information;
  • the encapsulating unit 702 is used to encapsulate the video bit stream into M video tracks to obtain a media file of the immersive media; where M is a positive integer, and among the M video tracks, the video media header data box in at least one video track includes metadata information, the metadata information being used to indicate whether the corresponding video track contains both video image information and corresponding video auxiliary information.
  • an embodiment of the present application also discloses an immersive media processing device.
  • the immersive media processing device may run on the above-mentioned media consumption device; the device may be a computer-readable instruction (including program code) running in the media consumption device.
  • the immersive media processing device can execute the method shown in Figure 3; referring to Figure 8, the immersive media processing device can run the following units:
  • the acquisition unit 801 is used to acquire the target video track in the media file of the immersive media.
  • the immersive media is captured by multiple cameras.
  • the information collected by any camera through photography includes: video image information and corresponding video auxiliary information; and
  • the media file is obtained by encapsulating the video bit stream of the immersive media into M video tracks; where M is a positive integer, and among the M video tracks, the video media header data box in at least one video track includes metadata information, the metadata information being used to indicate whether the corresponding video track contains both video image information and corresponding video auxiliary information;
  • the decoding unit 802 is used to decode the information in the target video track.
  • the media file is produced and encapsulated by a media production device; after the video bit stream is encapsulated into M video tracks, the video resources corresponding to each video track are obtained, and the media file includes the video resources corresponding to the M video tracks;
  • when used to obtain the target video track in the media file of the immersive media, the acquisition unit 801 can be specifically configured to:
  • the signaling description file includes at least one video auxiliary information descriptor, and the video auxiliary information descriptor is used to describe the relevant attributes of the video auxiliary information of the video resource;
  • request the media production device to deliver the target video resource; and decapsulate the target video resource to obtain the target video track.
  • each unit in the immersive media processing device shown in Figure 7 or Figure 8 can be separately or wholly combined into one or several other units, or one (or some) of the units can be further divided into multiple functionally smaller units; this can achieve the same operation without affecting the realization of the technical effects of the embodiments of the present application.
  • the above units are divided based on logical functions.
  • the function of one unit can also be realized by multiple units, or the functions of multiple units can be realized by one unit.
  • the immersive media processing device may also include other units; in practical applications, these functions may also be implemented with the assistance of other units, and may be implemented by multiple units in cooperation.
  • according to another embodiment of the present application, the immersive media processing device shown in Figure 7 or Figure 8 can be constructed, and the immersive media processing method of the embodiments of the present application can be implemented, by running computer-readable instructions capable of executing each step of the corresponding method shown in Figure 2 or Figure 3 on a general-purpose computing device, such as a computer, that includes a central processing unit (CPU), a random access memory (RAM), a read-only memory (ROM) and other processing elements and storage elements.
  • the computer-readable instructions can be recorded on, for example, a computer-readable storage medium, loaded into the above-mentioned computing device through the computer-readable storage medium, and run therein.
  • the embodiment of the present application can obtain the video bit stream of the immersive media and encapsulate the video bit stream into M video tracks to obtain the media file of the immersive media; in this process, metadata information can be encapsulated in the video media header data box of at least one video track to provide a unified indication of the information in the corresponding video track. This organizes the media content of the immersive media in a more optimized way and avoids the media consumption device repeatedly parsing, during the decapsulation process, the relevant fields in other data boxes used to indicate the information in the video track, thereby saving processing resources, improving decapsulation efficiency, and thus improving decoding efficiency.
  • the media processing device at least includes a processor 901, an input interface 902, an output interface 903, and a computer-readable storage medium 904.
  • the processor 901, input interface 902, output interface 903 and computer-readable storage medium 904 in the media processing device can be connected through a bus or other means.
  • the computer-readable storage medium 904 may be stored in the memory of the media processing device; the computer-readable storage medium 904 is used to store computer-readable instructions, the computer-readable instructions include program instructions, and the processor 901 is used to execute the program instructions stored in the computer-readable storage medium 904.
  • the processor 901 (or CPU, Central Processing Unit) is the computing core and control core of the media processing device; it is suitable for implementing one or more computer-readable instructions, and is specifically suitable for loading and executing one or more computer-readable instructions to implement the corresponding method processes or corresponding functions.
  • the processor 901 described in the embodiments of this application can be used to perform the series of immersive media processing steps shown in Figure 2, specifically including: obtaining the video bit stream of the immersive media, where the immersive media is captured by multiple cameras and the information collected by any camera includes video image information and corresponding video auxiliary information; and encapsulating the video bit stream into M video tracks to obtain the media file of the immersive media; where M is a positive integer, and among the M video tracks, the video media header data box in at least one video track includes metadata information, the metadata information being used to indicate whether the corresponding video track contains both video image information and corresponding video auxiliary information, and so on.
  • the processor 901 described in the embodiments of this application can also be used to perform the series of immersive media processing steps shown in Figure 3, specifically including: obtaining the target video track in the media file of the immersive media, where the immersive media is captured by multiple cameras and the information collected by any camera includes video image information and corresponding video auxiliary information, and the media file is obtained by encapsulating the video bit stream of the immersive media into M video tracks; where M is a positive integer, and among the M video tracks, the video media header data box in at least one video track includes metadata information, the metadata information being used to indicate whether the corresponding video track contains both video image information and corresponding video auxiliary information; and decoding the information in the target video track, and so on.
  • Embodiments of the present application also provide a computer-readable storage medium (Memory).
  • the computer-readable storage medium is a memory device in a media processing device and is used to store programs and data. It can be understood that the computer-readable storage medium here may include a built-in storage medium in the media processing device, and of course may also include an extended storage medium supported by the media processing device.
  • the computer-readable storage medium provides storage space that stores an operating system of the media processing device. Furthermore, the storage space also stores one or more computer-readable instructions suitable for being loaded and executed by the processor. These computer-readable instructions may be one or more computer programs (including program codes).
  • the computer-readable storage medium here can be a high-speed RAM memory, or a non-volatile memory such as at least one disk memory; optionally, it can also be at least one computer-readable storage medium located remotely from the aforementioned processor.
  • one or more computer-readable instructions stored in the computer-readable storage medium can be loaded and executed by the processor to implement the corresponding steps of the immersive media processing method embodiment shown in Figure 2.
  • one or more computer-readable instructions stored in the computer-readable storage medium may be loaded and executed by the processor to implement the corresponding steps of the immersive media processing method embodiment shown in Figure 3.
  • the embodiment of the present application can obtain the video bit stream of the immersive media and encapsulate the video bit stream into M video tracks to obtain the media file of the immersive media; in this process, metadata information can be encapsulated in the video media header data box of at least one video track to provide a unified indication of the information in the corresponding video track. This organizes the media content of the immersive media in a more optimized way and avoids the media consumption device repeatedly parsing, during the decapsulation process, the relevant fields in other data boxes used to indicate the information in the video track, thereby saving processing resources, improving decapsulation efficiency, and thus improving decoding efficiency.
  • a computer program product or computer program includes computer readable instructions, and the computer readable instructions are stored in a computer readable storage medium.
  • the processor of the computer device reads the computer-readable instructions from the computer-readable storage medium, and the processor executes the computer-readable instructions, so that the computer device performs the method provided in the various optional implementations of the method embodiment shown in Figure 2 or Figure 3.

Abstract

An immersive media processing method, comprising: obtaining a video bitstream of an immersive media (S201); and encapsulating the video bitstream into M video tracks to obtain a media file of the immersive media (S202), wherein among the M video tracks, a video media header data box in at least one video track comprises metadata information, and the metadata information is used for indicating whether the corresponding video track comprises video image information and corresponding video auxiliary information at the same time.

Description

Immersive media processing method, apparatus, device and storage medium

This application claims priority to the Chinese patent application submitted to the China Patent Office on April 26, 2022, with application number 202210445405.0 and the application name "Immersive media processing method, apparatus, device and storage medium", the entire content of which is incorporated by reference into this application.

Technical Field

This application relates to the field of computer technology, and in particular to an immersive media processing method and apparatus, a computer device, and a computer-readable storage medium.

Background

Although traditional encapsulation technology, after encapsulating immersive media into one or more video tracks, can indicate whether each video track contains video image information and video auxiliary information, it has been found in practice that traditional encapsulation technology suffers from repeated indication of the information in the video tracks.

This causes the media consumption device to repeatedly parse the corresponding fields during the decapsulation process, wasting the processing resources of the media consumption device, and also leads to low decapsulation efficiency and, in turn, low decoding efficiency.

Summary of the Invention
An immersive media processing method is provided, executed by a media processing device, the method including:

obtaining a video bit stream of an immersive media, the immersive media being captured by multiple cameras, where the information collected by any camera through shooting includes: video image information and corresponding video auxiliary information;

encapsulating the video bit stream into M video tracks to obtain a media file of the immersive media; where M is a positive integer, and among the M video tracks, the video media header data box in at least one video track includes metadata information, the metadata information being used to indicate whether the corresponding video track contains both video image information and corresponding video auxiliary information.

An immersive media processing apparatus is provided, the apparatus including:

a processing unit, configured to obtain a video bit stream of an immersive media, the immersive media being captured by multiple cameras, where the information collected by any camera through shooting includes: video image information and corresponding video auxiliary information;

an encapsulating unit, configured to encapsulate the video bit stream into M video tracks to obtain a media file of the immersive media; where M is a positive integer, and among the M video tracks, the video media header data box in at least one video track includes metadata information, the metadata information being used to indicate whether the corresponding video track contains both video image information and corresponding video auxiliary information.

An immersive media processing method is provided, executed by a media processing device, the method including:

obtaining a target video track in a media file of an immersive media, the immersive media being captured by multiple cameras, where the information collected by any camera includes: video image information and corresponding video auxiliary information; and the media file being obtained by encapsulating the video bit stream of the immersive media into M video tracks; where M is a positive integer, and among the M video tracks, the video media header data box in at least one video track includes metadata information, the metadata information being used to indicate whether the corresponding video track contains both video image information and corresponding video auxiliary information;

decoding the information in the target video track.

An immersive media processing apparatus is provided, the apparatus including:

an acquisition unit, configured to obtain a target video track in a media file of an immersive media, the immersive media being captured by multiple cameras, where the information collected by any camera through shooting includes: video image information and corresponding video auxiliary information; and the media file being obtained by encapsulating the video bit stream of the immersive media into M video tracks; where M is a positive integer, and among the M video tracks, the video media header data box in at least one video track includes metadata information, the metadata information being used to indicate whether the corresponding video track contains both video image information and corresponding video auxiliary information;

a decoding unit, configured to decode the information in the target video track.
A media processing device is provided. The media processing device includes an input interface and an output interface, and further includes:

a processor adapted to implement one or more computer-readable instructions; and a computer-readable storage medium storing one or more computer-readable instructions;

wherein the one or more computer-readable instructions are adapted to be loaded by the processor to perform the following steps:

acquiring a video bitstream of immersive media, the immersive media being captured by multiple cameras, where the information collected by any one camera includes video image information and corresponding video auxiliary information;

encapsulating the video bitstream into M video tracks to obtain a media file of the immersive media, where M is a positive integer, and among the M video tracks there is at least one video track whose video media header data box includes metadata information, the metadata information indicating whether the corresponding video track contains both video image information and the corresponding video auxiliary information.

Alternatively, the one or more computer-readable instructions are adapted to be loaded by the processor to perform the following steps:

acquiring a target video track from a media file of immersive media, the immersive media being captured by multiple cameras, where the information collected by any one camera includes video image information and corresponding video auxiliary information; the media file is obtained by encapsulating the video bitstream of the immersive media into M video tracks, where M is a positive integer, and among the M video tracks there is at least one video track whose video media header data box includes metadata information, the metadata information indicating whether the corresponding video track contains both video image information and the corresponding video auxiliary information;

decoding the information in the target video track.
A computer-readable storage medium is provided, storing one or more computer-readable instructions, the one or more computer-readable instructions being adapted to be loaded by a processor to perform any of the immersive media processing methods mentioned above.

A computer program product is provided, including computer-readable instructions; when the computer-readable instructions are executed by a processor, any of the immersive media processing methods mentioned above is implemented.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of this application more clearly, the accompanying drawings required for describing the embodiments are briefly introduced below. Apparently, the accompanying drawings in the following description show merely some embodiments of this application, and a person of ordinary skill in the art may still derive other drawings from these accompanying drawings without creative efforts.
Figure 1a is a schematic diagram of 3DoF provided by an embodiment of this application;

Figure 1b is a schematic diagram of 3DoF+ provided by an embodiment of this application;

Figure 1c is a schematic diagram of 6DoF provided by an embodiment of this application;

Figure 1d is a schematic diagram of free-view video data provided by an embodiment of this application;

Figure 1e is a schematic diagram of an immersive media processing procedure provided by an embodiment of this application;

Figure 2 is a schematic flowchart of an immersive media processing method provided by an embodiment of this application;

Figure 3 is a schematic flowchart of an immersive media processing method provided by another embodiment of this application;

Figure 4a is a schematic diagram of a single-track encapsulation mode for immersive media provided by an embodiment of this application;

Figure 4b is a schematic diagram of transmission of a media file of immersive media provided by an embodiment of this application;

Figure 5a is a schematic diagram of a multi-track encapsulation mode for immersive media provided by an embodiment of this application;

Figure 5b is another schematic diagram of transmission of a media file of immersive media provided by an embodiment of this application;

Figure 6a is a schematic diagram of a multi-track encapsulation mode for immersive media provided by another embodiment of this application;

Figure 6b is a schematic diagram of transmission of a media file of immersive media provided by another embodiment of this application;

Figure 7 is a schematic structural diagram of an immersive media processing apparatus provided by an embodiment of this application;

Figure 8 is a schematic structural diagram of an immersive media processing apparatus provided by another embodiment of this application;

Figure 9 is a schematic structural diagram of a media processing device provided by an embodiment of this application.
Detailed Description
The technical solutions in the embodiments of this application are described below clearly and completely with reference to the accompanying drawings in the embodiments of this application.
In the embodiments of this application, immersive media refers to media content that provides consumers with an immersive experience, so that a consumer immersed in the media content obtains real-world visual, auditory, and other sensory experiences. According to the degree of freedom (DoF) available to the consumer when consuming the media content, immersive media can be divided into 3DoF media, 3DoF+ media, and 6DoF media: 3DoF media supports users consuming the corresponding media content with 3DoF (i.e., three degrees of freedom), 3DoF+ media supports consumption with 3DoF+, and 6DoF media supports consumption with 6DoF. Here, 3DoF refers to the three degrees of freedom with which the user's head can rotate around the X, Y, and Z axes while the user is fixed at the center point of a three-dimensional space, as shown in Figure 1a. 3DoF+ means that, on the basis of the three rotational degrees of freedom, the user's head can additionally make limited movements along the X, Y, and Z axes, where limited movement refers to translational movement within a confined space, as shown in Figure 1b. 6DoF means that, on the basis of the three rotational degrees of freedom, the consumer can additionally move freely along the X, Y, and Z axes, where free movement refers to unrestricted translational movement, as shown in Figure 1c. It should be noted that the degrees of freedom mentioned in the embodiments of this application can be understood as the degrees of freedom with which user movement and content interaction are supported while the user watches the immersive media.
Immersive media that is captured by multiple cameras, contains different viewing angles, and supports 3DoF+ or 6DoF user interaction can be called free-view video; that is, free-view video is a common form of 3DoF+ media and 6DoF media, and may also be called multi-view video, multi-viewpoint video, or free-viewpoint video. Specifically, a camera array composed of multiple cameras can photograph the same three-dimensional scene from multiple angles to form the texture information of the scene (also called texture map information or texture maps, for example color information) and the depth information (also called depth map information or depth maps, for example spatial distance information). The free-view video consumed on the user side (i.e., 6DoF media or 3DoF+ media) can then be composed by combining the texture information and depth information from different cameras according to the user's position information. Since free-view video is a combination of videos of different spatial viewing angles collected by multiple cameras, to facilitate the expression, storage, compression, and processing of free-view video, the embodiments of this application can express free-view video data as a combination of the following information: the texture information collected by the multiple cameras, the depth information corresponding to that texture information, and the corresponding content description metadata of the free-view video (referred to simply as metadata), as shown in Figure 1d. The metadata can include the parameters of the multiple cameras, as well as description information such as the stitching layout and edge protection of the free-view video.
As shown in Figure 1e, the processing of immersive media (such as free-view video) mainly includes: video acquisition, video encoding, video file encapsulation, video transmission, video file decapsulation, video decoding, and video presentation. Specifically:
(1) Video acquisition is used to convert an analog video signal into digital video and save it in the format of a digital video file; in other words, video acquisition converts a video signal into binary digital information. Since the binary digital information converted from the video signal is a binary data stream, it can also be called the code stream or bitstream of the video signal.
(2) Video encoding refers to the process of converting a file in an original video format into a file in another video format through compression techniques. Video encoding can be implemented with standardized video compression algorithms, such as AVS (Audio Video Coding Standard, the Chinese national digital audio/video codec standard) encoding technology, AVS3 (the third-generation audio/video codec standard issued by the AVS working group) encoding technology, HEVC (High Efficiency Video Coding, the international video coding standard HEVC/H.265) encoding technology, or VVC (Versatile Video Coding, the international video coding standard VVC/H.266) encoding technology, or with specific compression algorithms designed for particular content.
(3) Video file encapsulation refers to storing an encoded and compressed media bitstream (such as a video bitstream) in a file in a certain format according to an encapsulation format (also called a container or file container). It should be noted that the embodiments of this application mainly take the case where the media bitstream is a video bitstream as an example; if the immersive media includes not only video frames but also one or more other kinds of media frames such as audio frames and subtitle frames, the media bitstream further includes at least one of an audio bitstream and a subtitle bitstream, and the implementation in that case follows the embodiments of this application analogously. Common encapsulation formats include the AVI format (Audio Video Interleaved) and ISOBMFF (ISO Based Media File Format, the media file format based on ISO (International Organization for Standardization) standards). ISOBMFF is the encapsulation standard for media files; the most typical ISOBMFF file is the MP4 (Moving Picture Experts Group 4) file. By encapsulating the encoded media stream into a file, the media stream can be stored on a device in a structure that is easier to parse; the information added during video file encapsulation usually helps improve the storage, association, and transmission efficiency of the video file on devices and intermediate transmission nodes, as well as the decoding efficiency of the video file on the video consumption terminal.
Specifically, a media sample (referred to simply as a sample) can serve as the encapsulation unit in the file encapsulation process; that is, the compressed and encoded media stream is encapsulated into one or more media tracks to obtain the corresponding media file (i.e., video file). A media file can therefore consist of multiple media tracks; for example, a media file may contain one or more video tracks, one or more audio tracks, and one or more subtitle tracks. A video track is a media track used to encapsulate a video bitstream, an audio track is a media track used to encapsulate an audio bitstream, and a subtitle track is a media track used to encapsulate a subtitle bitstream. A media track (referred to simply as a track) is a collection of media data in the media file encapsulation process; a media track can include one or many samples, and during encapsulation one or more media frames can be taken as one sample. For example, a sample in a video track is usually a video frame, a sample in an audio track is usually an audio frame, and a sample in a subtitle track is usually a subtitle frame.
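The track/sample relationship described above can be pictured with a minimal in-memory model. This is only a conceptual sketch; the class names and fields are illustrative and are not part of ISOBMFF:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Sample:
    """One encapsulation unit, e.g. one encoded video frame."""
    data: bytes

@dataclass
class Track:
    """A collection of samples of one media type (video/audio/subtitle)."""
    handler: str                 # e.g. 'vide', 'soun', 'text'
    samples: List[Sample] = field(default_factory=list)

@dataclass
class MediaFile:
    """A media file composed of one or more media tracks."""
    tracks: List[Track] = field(default_factory=list)

# Encapsulate a video bitstream (here, three placeholder frames, one
# frame per sample) into a single video track of a media file.
frames = [b'frame0', b'frame1', b'frame2']
video_track = Track(handler='vide', samples=[Sample(f) for f in frames])
media_file = MediaFile(tracks=[video_track])
```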
(4) Video file transmission means that, after the media production device obtains the media file through the above steps, it transmits the media file to the media consumption device. It should be noted that the concept of a media track is not involved at the transmission level; at the transmission level, the media file can be regarded as comprising the media resources corresponding to each media track, where the media resource corresponding to a media track is what is obtained after encapsulating the corresponding media bitstream into that track. For example, during encapsulation the video bitstream is encapsulated into one or more video tracks to obtain the media file; at the encapsulation level, what is obtained are video tracks in which the corresponding information is encapsulated, so the media file can be regarded as including those one or more video tracks. At the transmission level, however, what is obtained after encapsulating the video bitstream into one or more video tracks are the video resources corresponding to the individual video tracks, so the media file can be regarded as including the video resources corresponding to each video track.
It should be noted that the transmission of media files can be based on various transmission protocols, including but not limited to: the DASH (Dynamic Adaptive Streaming over HTTP) protocol, the HLS (HTTP Live Streaming) protocol, SMTP (Smart Media Transport Protocol), and TCP (Transmission Control Protocol). DASH is an adaptive bitrate streaming technique that enables high-quality streaming media to be delivered over the Internet through conventional HTTP web servers; in DASH, media segment information is described by the MPD (Media Presentation Description, the media presentation description signaling in DASH). DASH also involves the concept of a Representation: a Representation is a combination of one or more media components (media segments) and can be understood as the video resource corresponding to one track. For example, after a video bitstream of a certain resolution is encapsulated into a video track to obtain a video file (i.e., a media file) of that resolution, that video file includes one video resource and can be regarded as one Representation. Furthermore, DASH involves the concept of an Adaptation Set: an Adaptation Set is a collection of one or more video streams, and one Adaptation Set can contain multiple Representations.
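As an illustration of these DASH concepts, the following sketch builds a minimal MPD with one Adaptation Set containing two Representations, one per encapsulated video track. The Representation ids, bandwidth values, and segment URLs are invented for the example and do not come from the application:

```python
import xml.etree.ElementTree as ET

# One Period -> one AdaptationSet -> two Representations (the same
# content at two bitrates, each backed by one encapsulated video track).
mpd = ET.Element('MPD', xmlns='urn:mpeg:dash:schema:mpd:2011', type='static')
period = ET.SubElement(mpd, 'Period')
aset = ET.SubElement(period, 'AdaptationSet', mimeType='video/mp4')
for rep_id, bandwidth in [('720p', '3000000'), ('1080p', '6000000')]:
    rep = ET.SubElement(aset, 'Representation', id=rep_id, bandwidth=bandwidth)
    ET.SubElement(rep, 'BaseURL').text = f'{rep_id}/stream.mp4'

mpd_text = ET.tostring(mpd, encoding='unicode')
```

A DASH client would pick one Representation from the Adaptation Set based on measured throughput, which is what makes the bitrate adaptive.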
(5) Video file decapsulation means that, after receiving the media file, the media consumption device decapsulates it. The decapsulation process on the media consumption device is the inverse of the encapsulation process performed by the media production device mentioned above; the media consumption device decapsulates the media file according to the file format requirements used at encapsulation time, to obtain the corresponding media bitstream (such as the video bitstream).
(6) Video decoding refers to decoding the media bitstream to restore the video. It should be understood that the decoding process on the media consumption device is the inverse of the encoding process performed by the media production device mentioned above.
In the embodiments of this application, a media production device refers to a device used by an immersive media provider (for example, a content producer of immersive media); it can be a terminal device such as a smartphone, tablet computer, laptop, desktop computer, smart speaker, smartwatch, or smart in-vehicle device. Alternatively, the media production device can be a server; the server here can be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network), and big data and artificial intelligence platforms. A media consumption device refers to a device used by a user of immersive media (for example, a user who consumes the media content of immersive media); it can be a terminal device such as a personal computer, smartphone, VR (Virtual Reality) device (for example a VR headset or VR glasses), smartwatch, or smart in-vehicle device. Further, the media consumption device can install and run a client with media consumption capabilities, so that media is consumed through that client.
Based on the description of immersive media in Figure 1e above, when the immersive media is free-view video, after the free-view video is acquired, the media production device needs to compress and encode it; the video compression algorithms involved in the compression and encoding process can be implemented with AVS3 encoding technology, HEVC encoding technology, and so on. After the free-view video is compressed and encoded, the encoded data stream (such as the video bitstream) needs to be encapsulated to obtain a media file, so that the media file can be transmitted to the media consumption device. Correspondingly, on the media consumption device side, inverse operations such as decapsulation and decoding of the media file are required before the video content (i.e., media content) of the free-view video can finally be presented.
Further, for the media production device, the production of the video content of the free-view video roughly includes the following steps: ① acquisition through a camera array, to obtain the texture map information and depth map information collected by the multiple cameras in the array; ② stitching of the texture map information collected by the multiple cameras with the corresponding depth map information (this step is optional); ③ encoding of the depth map information and texture map information collected by the multiple cameras using planar video compression, to obtain the encoded video bitstream; ④ encapsulation of the encoded video bitstream and the metadata into the media file (i.e., video file) of the free-view video. For the media consumption device, it needs to reconstruct the free-view video to a virtual viewpoint (also called a virtual viewing angle) according to the media file of the free-view video, so as to render and present to the user the video content corresponding to that virtual viewpoint. Specifically, when reconstructing the free-view video to a virtual viewpoint, the media consumption device can perform the following steps: obtain the media file of the free-view video, and, according to the target viewing angle and the metadata in the media file, decode from the media file the depth map information and texture map information collected by one or more cameras; then process the decoded free-view video data and, according to a free-view video algorithm (the application's own algorithm can be used), combine the depth map information and texture map information to synthesize the virtual viewpoint corresponding to the target viewing angle to which the user has switched.
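The four production-side steps can be sketched as a pipeline of stub functions. The function names and the trivial byte-level "processing" are purely illustrative; real acquisition, stitching, and planar video compression are of course far more involved:

```python
def acquire(num_cameras: int):
    """Step 1: each camera yields a (texture map, depth map) pair."""
    return [(f'tex{i}'.encode(), f'dep{i}'.encode())
            for i in range(num_cameras)]

def stitch(views):
    """Step 2 (optional): stitch per-camera maps into one large frame."""
    return b''.join(tex + dep for tex, dep in views)

def encode(frame: bytes) -> bytes:
    """Step 3: planar video compression (stubbed as identity here)."""
    return frame

def encapsulate(bitstream: bytes, metadata: dict) -> dict:
    """Step 4: wrap the encoded bitstream and metadata into a media file."""
    return {'metadata': metadata, 'bitstream': bitstream}

media_file = encapsulate(encode(stitch(acquire(4))),
                         metadata={'num_cameras': 4})
```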
Conventional video codec standards define the texture map information structure and the depth map information structure involved in free-view video data, as follows:
(1) Texture map information structure:

The texture map information structure is used to indicate the coding metadata information of the texture map corresponding to an original camera (i.e., a camera) involved in the free-view video data. Specifically, the syntax of the texture map information structure (TextureInfostruct) can be as follows:
Figure PCTCN2022137788-appb-000001 (the TextureInfostruct syntax is embedded as an image in the original filing)
The semantics of the syntax of the above texture map information structure are as follows: aligned(8) indicates 8-bit alignment; texture_padding_size indicates the guard band width of the texture map; texture_top_left_x indicates the x coordinate of the top-left corner, in the video frame plane, of the texture map corresponding to the respective original camera; texture_top_left_y indicates the y coordinate of the top-left corner of that texture map in the video frame; texture_bottom_right_x indicates the x coordinate of the bottom-right corner of that texture map in the video frame; texture_bottom_right_y indicates the y coordinate of the bottom-right corner of that texture map in the video frame.
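As an illustration of how the corner fields locate one camera's texture map inside the packed video frame, the following sketch computes the size of the described region. The dataclass mirrors the fields listed above; the helper function and its name are ours, not part of the standard:

```python
from dataclasses import dataclass

@dataclass
class TextureInfo:
    """Fields of TextureInfostruct as described in the semantics above."""
    texture_padding_size: int    # guard band width of the texture map
    texture_top_left_x: int      # top-left corner in the video frame plane
    texture_top_left_y: int
    texture_bottom_right_x: int  # bottom-right corner in the video frame
    texture_bottom_right_y: int

def region_size(info: TextureInfo):
    """Width/height of the rectangle the corner coordinates describe."""
    return (info.texture_bottom_right_x - info.texture_top_left_x,
            info.texture_bottom_right_y - info.texture_top_left_y)

# A texture map occupying the top-left 1920x1080 area of the frame.
info = TextureInfo(0, 0, 0, 1920, 1080)
```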
(2) Depth map information structure:

The depth map information structure is used to indicate the coding metadata information of the depth map corresponding to an original camera (i.e., a camera) involved in the free-view video data. Specifically, the syntax of the depth map information structure (DepthInfostruct) can be as follows:
Figure PCTCN2022137788-appb-000002 (the DepthInfostruct syntax is embedded as an image in the original filing)
The semantics of the syntax of the above depth map information structure are as follows: depth_padding_size indicates the guard band width of the depth map; depth_top_left_x indicates the x coordinate of the top-left corner, in the video frame plane, of the depth map corresponding to the respective original camera; depth_top_left_y indicates the y coordinate of the top-left corner of that depth map in the video frame plane; depth_bottom_right_x indicates the x coordinate of the bottom-right corner of that depth map in the video frame plane; depth_bottom_right_y indicates the y coordinate of the bottom-right corner of that depth map in the video frame plane; depth_downsample_factor indicates the downsampling factor of the depth map: the actual width and height of the depth map are 1/2^depth_downsample_factor of the width and height of the camera's acquisition resolution.
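The effect of depth_downsample_factor on the stored depth map resolution follows directly from the semantics above: each dimension is scaled by 1/2^factor. A small sketch (the helper name is ours):

```python
def depth_map_resolution(capture_w: int, capture_h: int,
                         depth_downsample_factor: int):
    """Actual depth map size: capture resolution scaled by 1/2**factor
    in each dimension, per the depth_downsample_factor semantics."""
    scale = 2 ** depth_downsample_factor
    return capture_w // scale, capture_h // scale

# A camera capturing at 1920x1080 with factor 1 stores a 960x540 depth map;
# factor 0 means the depth map keeps the full acquisition resolution.
half_res = depth_map_resolution(1920, 1080, 1)
full_res = depth_map_resolution(1920, 1080, 0)
```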
In addition, existing video codec standards stipulate that a free-view video track shall be represented in the media file as restricted video, and that scheme_type (the scheme type) of the SchemeTypeBox (scheme type data box) in the RestrictedSchemeInfoBox (restricted scheme information data box) under the sample entry of this type of track shall be set to 'as3f'; the encoding of free-view video, however, places no restriction on the video codec. Besides this, existing video codec standards define two encapsulation modes for free-view video: single-track encapsulation and multi-track encapsulation:
(1) Single-track encapsulation of free-view video:

Single-track encapsulation of free-view video means that the free-view video code stream (i.e., the video bitstream of the free-view video) is represented by a single track. With single-track encapsulation, the data of the free-view video is encapsulated directly into the media track container, and the data structure is kept in the code stream as-is. A single-track-encapsulated free-view video track uses one VisualSampleEntry (sample entry), as defined in ISO/IEC 14496-12. The sample entry can contain one AvsFreeViewInfoBox (free-view information data box) and one PayloadInfoBox (payload information data box). The AvsFreeViewInfoBox indicates the camera information corresponding to the viewing angles in the free-view video track, and the PayloadInfoBox indicates the texture map and depth map information corresponding to the cameras.
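The box nesting just described can be pictured with a small nested-dictionary model. Box names are taken from the text above; the inner fields ('cameras', 'camera_id', 'payloads', 'type') are invented placeholders, and the layout is a conceptual sketch rather than byte-accurate ISOBMFF:

```python
# Sample entry of a single-track free-view video track: one
# VisualSampleEntry holding camera info (AvsFreeViewInfoBox) and
# per-camera texture/depth payload info (PayloadInfoBox).
sample_entry = {
    'VisualSampleEntry': {
        'AvsFreeViewInfoBox': {
            'cameras': [{'camera_id': 0}, {'camera_id': 1}],
        },
        'PayloadInfoBox': {
            'payloads': [{'camera_id': 0, 'type': 'texture+depth'},
                         {'camera_id': 1, 'type': 'texture+depth'}],
        },
    },
}

entry = sample_entry['VisualSampleEntry']
camera_ids = [c['camera_id'] for c in entry['AvsFreeViewInfoBox']['cameras']]
```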
(2) Multi-track encapsulation of free-view video:

Multi-track encapsulation of free-view video means that the free-view video code stream is represented by multiple media tracks (such as video tracks). Multi-track encapsulation may encapsulate the texture map and depth map collected by the same camera into one track, or into different tracks; this is not limited. On this basis, when the texture maps and depth maps are encapsulated into different tracks, the multi-track encapsulation of the free-view video can contain two types of media tracks: texture map tracks (video tracks encapsulating texture maps, i.e., texture information) and depth map tracks (video tracks encapsulating depth maps, i.e., depth information). In this case, the texture map track and the depth map track corresponding to the same camera need to be associated; specifically, the TrackReferenceBox (track reference data box) defined in ISO/IEC 14496-12 can be used to reference, or associate, a depth map track to the texture map track corresponding to the same camera, with reference_type (the reference type field) taking the value 'vdep'. Thus, the multi-track encapsulation mode maps the free-view video into the containers of the respective media tracks according to data type; tracks belonging to the same camera are associated with each other and are time-aligned.
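The 'vdep' association can be sketched as follows: each depth map track carries a track reference of type 'vdep' pointing at the track ID of its camera's texture map track. The track objects here are illustrative stand-ins for real ISOBMFF boxes:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class VideoTrack:
    track_id: int
    kind: str                               # 'texture' or 'depth'
    camera_id: int
    # TrackReferenceBox content: reference_type -> referenced track IDs
    track_references: Dict[str, List[int]] = field(default_factory=dict)

# Texture and depth tracks of the same camera, linked via 'vdep'.
texture = VideoTrack(track_id=1, kind='texture', camera_id=0)
depth = VideoTrack(track_id=2, kind='depth', camera_id=0,
                   track_references={'vdep': [texture.track_id]})

def texture_tracks_for(depth_track, tracks):
    """Resolve a depth track's 'vdep' reference to its texture track(s)."""
    ref_ids = depth_track.track_references.get('vdep', [])
    return [t for t in tracks if t.track_id in ref_ids]
```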
Both the texture-map tracks and depth-map tracks of a multi-track encapsulated free-view video use a VisualSampleEntry, as defined in ISO/IEC 14496-12. The sample entries of the texture-map and depth-map tracks contain an AvsFreeViewInfoBox; the camera parameter information described in the AvsFreeViewInfoBox shall not appear repeatedly in a texture-map track and its corresponding depth-map track. When a texture-map track contains texture maps of multiple cameras, its sample entry contains a PayloadInfoBox describing the texture-map information; when a depth-map track contains depth maps of multiple cameras, its sample entry contains a PayloadInfoBox describing the depth-map information. In addition, each sample in a texture-map or depth-map track corresponds to the texture maps or depth maps of one or more cameras. The camera information corresponding to the samples in a track is described by the AvsFreeViewInfoBox in the VisualSampleEntry. When each sample contains texture maps or depth maps of multiple cameras, each sample may contain multiple subsamples, each subsample corresponding to the texture map or depth map of one camera and described using a SubSampleInformationBox (subsample information data box). Specifically, when the flags field in the SubSampleInformationBox syntax of the existing video coding standards takes the value 0, the subsample may be described through the codec_specific_parameters field.
Furthermore, if a free-view video is encapsulated into multiple video tracks, and a single video track contains the texture information and/or depth information of only some of the cameras, these video tracks shall be associated through a free-view track group; that is, each of these video tracks shall contain an AvsFreeViewGroupBox (free-view track group data box). The AvsFreeViewGroupBox may include a track group identifier field (track_group_id); different video tracks that each contain some of the cameras and correspond to the same free-view video shall have the same track_group_id value.
The above-mentioned free-view information data box, payload information data box, free-view track group data box, and other data boxes are further described below:
① Free-view information data box (AvsFreeViewInfoBox):
The free-view information data box indicates the camera metadata information (i.e., the metadata information of the cameras) corresponding to one or more free views contained in the corresponding track. Specifically, the syntax of the free-view information data box may be as follows:
Box Type: 'afvi'
Container: SchemeInformationBox or SampleEntry
Mandatory: No
Quantity: Zero or one
[Syntax figure PCTCN2022137788-appb-000003 not reproduced in text]
The semantics of the syntax of the free-view information data box are as follows: texture_in_track indicates whether the corresponding track contains texture maps; a value of 1 indicates that the corresponding track contains texture maps, and a value of 0 indicates that it does not. depth_in_track indicates whether the corresponding track contains depth maps; a value of 1 indicates that the corresponding track contains depth maps, and a value of 0 indicates that it does not. camera_count indicates the number of all cameras capturing the video. IntCameraInfoStruct() indicates the intrinsic parameter information of a camera, and ExtCameraInfoStruct() indicates the extrinsic parameter information of a camera. codec_independency indicates the codec independence between the texture maps and depth maps corresponding to the cameras in the corresponding track; its possible values are shown in Table 1 below:
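The syntax figure for this box is not reproduced in the text; based on the semantics just listed, it might be reconstructed as the following ISOBMFF-style sketch, in which the field widths, the reserved bits, and the FullBox base are assumptions:

```
// Hedged reconstruction of AvsFreeViewInfoBox ('afvi') from its semantics.
aligned(8) class AvsFreeViewInfoBox extends FullBox('afvi', 0, 0) {
    unsigned int(1) texture_in_track;    // 1: track contains texture maps
    unsigned int(1) depth_in_track;      // 1: track contains depth maps
    bit(6)          reserved;
    unsigned int(8) codec_independency;  // see Table 1
    unsigned int(8) camera_count;        // number of capturing cameras
    for (i = 0; i < camera_count; i++) {
        IntCameraInfoStruct();           // intrinsic camera parameters
        ExtCameraInfoStruct();           // extrinsic camera parameters
    }
}
```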
Table 1
[Table 1 figure PCTCN2022137788-appb-000004 not reproduced in text]
② Payload information data box (PayloadInfoBox):
The payload information data box describes the metadata information of the texture maps and depth maps corresponding to one or more cameras in the corresponding track. Specifically, the syntax of the payload information data box may be as follows:
Box Type: 'plin'
Container: SampleEntry
Mandatory: No
Quantity: Zero or one
[Syntax figure PCTCN2022137788-appb-000005 not reproduced in text]
The semantics of the syntax of the payload information data box are as follows: texture_info_flag indicates whether metadata information describing the texture maps is expressed; a value of 1 means it is expressed, and a value of 0 means it is not. depth_info_flag indicates whether metadata information describing the depth maps is expressed; a value of 1 means it is expressed, and a value of 0 means it is not. num_cameras indicates the number of cameras of the media track, and camera_id indicates the identifier of a camera. TextureInfostruct() describes the metadata information of a texture map, and DepthInfostruct() describes the metadata information of a depth map.
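As with the previous box, the syntax figure is not reproduced; a possible reconstruction from the semantics, with assumed field widths, reserved bits, and loop structure, is:

```
// Hedged reconstruction of PayloadInfoBox ('plin') from its semantics.
aligned(8) class PayloadInfoBox extends FullBox('plin', 0, 0) {
    unsigned int(1) texture_info_flag;  // 1: texture-map metadata expressed
    unsigned int(1) depth_info_flag;    // 1: depth-map metadata expressed
    bit(6)          reserved;
    unsigned int(8) num_cameras;        // cameras in this media track
    for (i = 0; i < num_cameras; i++) {
        unsigned int(32) camera_id;     // identifier of the camera
        if (texture_info_flag == 1)
            TextureInfostruct();        // texture-map metadata
        if (depth_info_flag == 1)
            DepthInfostruct();          // depth-map metadata
    }
}
```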
③ Free-view track group data box (AvsFreeViewGroupBox):
The free-view track group data box associates the video tracks of the same free-view video. Specifically, the syntax of the free-view track group data box may be as follows:
[Syntax figure PCTCN2022137788-appb-000006 not reproduced in text]
The semantics of the syntax of the free-view track group data box are as follows: camera_count indicates the number of cameras corresponding to the free-view texture information and depth information contained in the track. camera_id indicates the camera identifier of each camera. depth_texture_type indicates the type of texture information or depth information, captured by the corresponding camera, contained in the track; its values are shown in Table 2 below:
Table 2
depth_texture_type value    Meaning
0                           Reserved
1                           Contains texture information captured by the corresponding camera
2                           Contains depth information captured by the corresponding camera
3                           Contains both texture and depth information captured by the corresponding camera
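Based on the semantics and Table 2 above, the missing syntax figure for this box might be sketched as follows; the 'afvg' grouping type and the field widths are hypothetical, and track_group_id is inherited from the standard TrackGroupTypeBox of ISO/IEC 14496-12:

```
// Hedged reconstruction of AvsFreeViewGroupBox from its semantics.
aligned(8) class AvsFreeViewGroupBox extends TrackGroupTypeBox('afvg') {
    // track_group_id is inherited from TrackGroupTypeBox
    unsigned int(8) camera_count;            // cameras covered by this track
    for (i = 0; i < camera_count; i++) {
        unsigned int(32) camera_id;          // identifier of the camera
        unsigned int(8)  depth_texture_type; // see Table 2
    }
}
```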
As can be seen from the above description, the existing video coding standards suffer from redundant indication of the depth information and texture information contained in each video track: the free-view information data box, the payload information data box, and the free-view track group data box each contain fields indicating whether a track contains texture or depth information. This causes the media consumption device to parse the same information repeatedly during decapsulation, wasting resources. Moreover, although the existing video coding standards indicate the codec independence of free-view video, i.e., whether codec dependencies exist between the texture information and depth information within a video track and between the contents captured by the individual cameras, the current indication method is limited to free-view application scenarios and lacks compatibility and scalability. In addition, there is no signaling-level indication of video auxiliary information.
On this basis, for free-view video applications and other immersive media applications, the embodiments of this application propose a file encapsulation and transmission signaling indication method, which achieves the following: ① the media content of a free-view video is organized in a more optimized manner, so that during decapsulation the media consumption device does not repeatedly parse the fields in other data boxes that indicate what information a video track contains, thereby saving processing resources, improving decapsulation efficiency, and in turn improving decoding efficiency; ② the codec dependencies between the parts of the video content are indicated in a more optimized manner, so that the method is not limited to free-view application scenarios and has better compatibility and scalability; ③ based on the information in ① and ②, corresponding transmission signaling is generated to guide the resource selection of media consumption devices, such as clients, during free-view video consumption.
In specific implementations, the method can be applied in the storage, transmission, encapsulation, and decapsulation of any immersive media (such as free-view video). Taking free-view video as an example, the specific implementation steps of the method are as follows:
Step 1: A media production device (such as a server) generates the video bitstream of a free-view video and encapsulates the video bitstream into a media file. Depending on whether the texture information and depth information of the free-view video are in the same video track, and on whether the free-view video is encapsulated into multiple video tracks, the texture information, depth information, and camera information contained in the video tracks of the free-view video are indicated according to the following rules.
(1) If a video track contains both texture information and depth information, the texture information and depth information must correspond to the same N cameras (N being a positive integer); that is, the texture maps and depth maps captured by N cameras are encapsulated in one video track. In this case, the video media header data box (VideoMediaHeaderBox) in the video track identifies, through metadata information, that the video track carries video auxiliary information (such as depth information).
(2) If a video track contains only texture information or only depth information, then, considering that texture information is a kind of video image information and depth information is a kind of video auxiliary information, a texture-map track (i.e., a video track encapsulating texture information) can be regarded as a video image information track, and a depth-map track can be regarded as a video auxiliary information track. A video image information track, referred to simply as a video track, is a video track that encapsulates video image information; a video auxiliary information track, also called an auxiliary video media track or video auxiliary track, is a video track that encapsulates video auxiliary information. Further, the existing video file encapsulation technology stipulates the following. ① Video media uses the handler_type (handler type field) in the HandlerBox (handler data box), as defined in clause 8.4.3 of the relevant standard. ② The handler type of video media is 'vide', i.e., the corresponding field in the HandlerBox within the MediaBox (media data box) of a video media file takes the value 'vide'; the handler type of auxiliary video media is 'auxv', i.e., the corresponding field in the HandlerBox within the MediaBox of an auxiliary video media file takes the value 'auxv'. ③ An auxiliary video media track (i.e., a video auxiliary information track) and a video track (i.e., a video image information track) are coded in the same way but identified with different handler types, and the auxiliary video media track is usually not used for visual presentation (for example, when the auxiliary video track contains depth information). Therefore, if a video track contains only texture information or only depth information, the existing distinction between video image information tracks and video auxiliary information tracks (distinguished by handler_type), together with the existing association mechanism (the 'vdep' track reference) and the identification information (such as the metadata information) in (1), is sufficient to distinguish whether the information contained in the video track is texture information or depth information.
(3) When a free-view video is encapsulated into multiple video tracks, the free-view track group data box is used to associate the multiple video tracks; further, the free-view track group data box may be used to associate the multiple video tracks that encapsulate video image information (such as texture information). At the same time, the payload information (i.e., the information in the aforementioned payload information data box) is included in the free-view track group data box. Optionally, the type of the free-view track group may be further distinguished according to the classification indicated by the metadata information in (1) (i.e., whether a video track carries both video image information (such as texture information) and video auxiliary information (such as depth information)). For example, the type of the free-view track group may indicate that each video track associated with the free-view track group encapsulates only video image information (such as texture information), or that each video track associated with the free-view track group encapsulates both video image information (such as texture information) and video auxiliary information (such as depth information).
Step 2: On the basis of the file encapsulation structure involved in Step 1 (i.e., whether multi-track encapsulation is used, and whether a single video track contains both texture information and depth information), the media production device (such as a server) indicates codec dependency information in the video tracks. The codec dependency information may include: an indication of the rule by which the data sets within the video (i.e., the video components) are divided, and an indication of whether codec dependencies exist between the different data sets under that rule. In addition, when a single video track contains both texture information and depth information, the codec dependency information may further include an indication of whether a codec dependency exists between the texture information and the depth information in the single video track, and so on.
Step 3: The media production device (such as a server) may provide a download service for complete files; in this case, the media production device need not process the media file of the free-view video, and instead provides the complete media file for download by the media consumption device. Of course, if the media production device also supports streaming, it slices the media file of the free-view video into media segments suitable for streaming (following the specifications of the existing transmission standards), generates the corresponding transmission signaling, and performs Step 4. For example, the transmission signaling mentioned here may include, but is not limited to, DASH signaling or SMT signaling; and the signaling description file of the transmission signaling indicates the following information:
(a) the data types (video data/video auxiliary data) contained in the media resources (i.e., video resources), and the index and association relationships between the corresponding media resources;
(b) the codec dependency information within or between the media resources.
Step 4: The media production device (such as a server) transmits the signaling description file of the transmission signaling to the user on the media consumption device side.
Step 5: The client in the media consumption device may execute the following logic:
(A) Depending on the application form, the client may download the complete file (i.e., the complete media file) and play it locally, and, based on the indications of texture information, depth information, and codec dependency information in the media file, determine whether the media file can be partially decoded at different stages, so as to optimize playback performance on the client.
(B) Depending on the application form, the client may establish streaming with the media production device (such as a server) and, based on the texture information, depth information, and codec dependency information indicated in the signaling description file of the transmission signaling, adaptively select a suitable free-view video stream for consumption.
To support the implementation steps of the above method proposed in the embodiments of this application, the embodiments of this application add several descriptive fields at the system layer on the basis of the existing technology. In the following, taking the extension of the existing ISOBMFF data boxes, DASH signaling, and SMT signaling as examples, the relevant fields are defined to support free-view video applications or other immersive media applications, as follows:
I. ISOBMFF data box extensions
1. The ISOBMFF data boxes may include a video media header data box (VideoMediaHeaderBox). A video track uses a video media header data box, which contains general, codec-independent presentation-related information of the video media. The embodiments of this application may extend the video media header data box to indicate, in the video media header data box, whether a video track also contains depth information or other auxiliary information. Specifically, the extended video media header data box may be as follows:
[Syntax figure PCTCN2022137788-appb-000007 not reproduced in text]
Among these, the three fields version, graphicsmode, and opcolor are original fields of the video media header data box. version indicates the version number of the video media header data box. graphicsmode indicates the image composition mode of the current video track; a value of 0 indicates that the image composition mode of the current video track is the copy mode, and other values may be additionally defined by the corresponding standards. opcolor is an array of values for three colors (red, green, blue), used in conjunction with the image composition mode.
Based on the extended video media header data box above, the embodiments of this application may define in the video media header data box a flag bit, video_with_depth_info_flag (video depth information flag field), to indicate whether the current video track contains both texture information (image information) and the depth information corresponding to the texture information (image information). When this field takes a first value (such as 1), it indicates that the current video track contains both texture information (image information) and the corresponding depth information; when this field takes a second value (such as 0), it indicates that the current video track contains only texture information (image information).
By extension, since texture information is a kind of video image information and depth information is a kind of video auxiliary information, while video image information is not limited to texture information and video auxiliary information is not limited to depth information (it may, for example, also include parallax information), other flag bits may also be defined in the video media header data box to indicate whether the current video track also contains the corresponding auxiliary information. For example, one or more of the following flag bits may be defined:
① video_with_parallax_info_flag (video parallax information flag field) indicates whether the current video track contains both video image information and the parallax information corresponding to the video image information. When this field takes a first value (such as 1), it indicates that the current video track contains both video image information and the corresponding parallax information; when this field takes a second value (such as 0), it indicates that the current video track contains only video image information.
② video_with_auxl_info_flag indicates whether the current video track contains both video image information and the auxiliary information corresponding to the video image information. When this field takes a first value (such as 1), the current video track contains both video image information and the corresponding video auxiliary information; when this field takes a second value (such as 0), the current video track contains only video image information. The video auxiliary information here may be depth information, parallax information, or auxiliary information defined by other applications.
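The extended VideoMediaHeaderBox figure referenced above is not reproduced in the text; combining the standard 'vmhd' fields with the three flag bits just described, it might look as follows (the flag ordering and reserved-bit padding are assumptions):

```
// Hedged reconstruction: VideoMediaHeaderBox extended with the three
// auxiliary-information flag bits described above.
aligned(8) class VideoMediaHeaderBox extends FullBox('vmhd', version = 0, 1) {
    template unsigned int(16)    graphicsmode = 0;  // image composition mode
    template unsigned int(16)[3] opcolor = {0, 0, 0};
    unsigned int(1) video_with_depth_info_flag;     // texture + depth in track
    unsigned int(1) video_with_parallax_info_flag;  // image + parallax in track
    unsigned int(1) video_with_auxl_info_flag;      // image + generic aux info
    bit(5)          reserved;
}
```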
2. The video media header data box may be further extended to indicate, in the video media header data box, whether codec independence exists between the data contents (information) contained in the video track. Specifically, taking the first value as 1 and the second value as 0 as an example, the further extended video media header data box may be as follows:
[Syntax figure PCTCN2022137788-appb-000008 not reproduced in text]
The semantics of the syntax of the above extended video media header data box are roughly as follows:
Where video_with_depth_info_flag is defined, if the value of this field equals 1, i.e., the current video track contains both texture information (image information) and the corresponding depth information, a flag bit depth_codec_dependency (depth codec dependency field) may further be defined in the video media header data box. When depth_codec_dependency takes a third value (such as 1), it indicates that a decoding dependency exists between the texture information (image information) contained in the current video track and the corresponding depth information; when depth_codec_dependency takes a fourth value (such as 0), it indicates that no such decoding dependency exists.
Where video_with_parallax_info_flag is defined, if the value of this field equals 1, i.e., the current video track contains both video image information and the corresponding parallax information, a flag bit parallax_codec_dependency (parallax codec dependency field) may further be defined in the video media header data box. When parallax_codec_dependency takes a third value (such as 1), it indicates that a decoding dependency exists between the video image information contained in the current video track and the corresponding parallax information; when parallax_codec_dependency takes a fourth value (such as 0), it indicates that no such decoding dependency exists.
Where video_with_auxl_info_flag is defined, if the value of this field equals 1, i.e., the current video track contains both video image information and the corresponding video auxiliary information, a flag bit auxl_codec_dependency (auxiliary codec dependency field) may further be defined in the video media header data box. When auxl_codec_dependency takes a third value (such as 1), it indicates that a decoding dependency exists between the video image information contained in the current video track and the corresponding video auxiliary information; when auxl_codec_dependency takes a fourth value (such as 0), it indicates that no such decoding dependency exists.
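Putting the flag bits and the conditional dependency bits together, the further extended box might be sketched as follows (again with assumed field widths and padding):

```
// Hedged reconstruction: each *_codec_dependency bit is present only
// when the corresponding *_info_flag equals 1.
aligned(8) class VideoMediaHeaderBox extends FullBox('vmhd', version = 0, 1) {
    template unsigned int(16)    graphicsmode = 0;
    template unsigned int(16)[3] opcolor = {0, 0, 0};
    unsigned int(1) video_with_depth_info_flag;
    unsigned int(1) video_with_parallax_info_flag;
    unsigned int(1) video_with_auxl_info_flag;
    bit(5)          reserved;
    if (video_with_depth_info_flag == 1)
        unsigned int(1) depth_codec_dependency;    // 1: depth depends on texture
    if (video_with_parallax_info_flag == 1)
        unsigned int(1) parallax_codec_dependency; // 1: parallax depends on image
    if (video_with_auxl_info_flag == 1)
        unsigned int(1) auxl_codec_dependency;     // 1: aux depends on image
}
```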
3. Define a video composition dependency information data box (VideoCompositionDependencyInfoBox), which indicates the composition type of the video content of the immersive media, and whether codec dependencies (i.e., dependencies in encoding and decoding) exist between different compositions of that video content. Specifically, the video composition dependency information data box may be as follows:
Figure PCTCN2022137788-appb-000009
Here, composition_codec_dependency (composition codec dependency field) indicates whether codec dependencies exist between the different compositions contained in the video content of the immersive media. When this field takes the fifth value (e.g., 0), it indicates that no codec dependency exists between the different compositions of the video content; when it takes the sixth value (e.g., 1), it indicates that codec dependencies exist between them.
composition_type (composition type field) indicates the composition type of the video content of the immersive media (i.e., the type of the compositions the video content contains). When this field takes the seventh value (e.g., 0), the composition type is camera-based, indicating that the compositions of the video content are divided by camera: one composition comprises the video content captured by one camera, i.e., the video content of the immersive media comprises video content captured by different cameras. When this field takes the eighth value (e.g., 1), the composition type is binocular, indicating that the compositions are divided by eye: one composition comprises the left-eye or right-eye video content, i.e., the video content of the immersive media comprises left-eye and right-eye video content. It should be understood that the value of composition_type is not limited to the seventh and eighth values; other values are possible and may be extended by the corresponding standards.
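Since the box syntax itself is only available as an image in this text, the following is a hypothetical sketch of how a reader of the box might map the two fields to their meanings (using 0/1 for the example values given above):

```python
def interpret_composition_box(composition_codec_dependency: int,
                              composition_type: int) -> tuple[str, str]:
    """Map the two VideoCompositionDependencyInfoBox fields to their
    described meanings; composition_type values beyond 0/1 are treated
    as reserved for extension by the corresponding standards."""
    dependency = ("codec-dependent" if composition_codec_dependency == 1
                  else "codec-independent")
    kind = {0: "camera-based", 1: "binocular"}.get(composition_type, "extended")
    return dependency, kind
```

For example, a box with composition_codec_dependency = 0 and composition_type = 1 describes independently codable left-eye and right-eye compositions.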
4. Extend the track reference type data box (TrackReferenceTypeBox). When the texture information and depth information of a free-view video are encapsulated into different video tracks, the current technology defines a reference type field (reference_type) that, by storing the track index 'vdep', references or associates a depth map track to the texture map track of the same camera; however, this track index only expresses a dependency between the two at presentation time and does not indicate whether a dependency exists between them at decoding time. The embodiments of this application therefore make the following extensions:
When the reference type field (reference_type) stores the track index 'vdep', it references or associates the depth map track to the texture map track of the corresponding camera, and indicates that the depth map track depends on the corresponding texture map track at decoding time.
When the reference type field (reference_type) stores the track index 'vdpi', it references or associates the depth map track to the texture map track of the corresponding camera, and indicates that the depth map track does not depend on the corresponding texture map track at decoding time.
Correspondingly, a texture map track is one kind of video image information track and a depth map track is one kind of video auxiliary information track; video image information tracks are not limited to texture map tracks, and video auxiliary information tracks are not limited to depth map tracks (for example, a video auxiliary track may also be a parallax information track). The track indexes between video image information tracks and parallax information tracks, and between video image information tracks and video auxiliary information tracks, can therefore also be extended as follows:
When the reference type field (reference_type) stores the track index 'vplx', it references or associates the parallax information track to the corresponding video image information track, and indicates that the parallax information track depends on the corresponding video image information track at decoding time.
When the reference type field (reference_type) stores the track index 'vpli', it references or associates the parallax information track to the corresponding video image information track, and indicates that the parallax information track does not depend on the corresponding video image information track at decoding time.
When the reference type field (reference_type) stores the track index 'auxl', it references or associates the video auxiliary information track to the corresponding video image information track, and indicates that the video auxiliary information track depends on the corresponding video image information track at decoding time.
When the reference type field (reference_type) stores the track index 'auxi', it references or associates the video auxiliary information track to the corresponding video image information track, and indicates that the video auxiliary information track does not depend on the corresponding video image information track at decoding time.
It should be noted that, by way of extension, the decoding dependency indications of the above track index types may be interchanged; for example, the track index 'vdep' could instead indicate codec independence and the track index 'vdpi' codec dependence, and so on.
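The six reference_type values above can be collected into a lookup table. This is a hypothetical sketch assuming the default (non-interchanged) assignment described in the text: each entry records which auxiliary track kind the index attaches to a video image track, and whether a decoding dependency is implied.

```python
# reference_type -> (auxiliary track kind, decode-time dependency on the
# referenced video image / texture track)
TRACK_REFERENCE_SEMANTICS = {
    "vdep": ("depth map", True),
    "vdpi": ("depth map", False),
    "vplx": ("parallax information", True),
    "vpli": ("parallax information", False),
    "auxl": ("video auxiliary information", True),
    "auxi": ("video auxiliary information", False),
}

def must_decode_referenced_track_first(reference_type: str) -> bool:
    """True when the auxiliary track cannot be decoded without first
    decoding the video image track it references."""
    return TRACK_REFERENCE_SEMANTICS[reference_type][1]
```

A player could use such a table to decide whether a referenced texture track must be scheduled for decoding before its auxiliary track, or whether the two can be decoded in parallel.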
5. Extend the free-view track group and the free-view track group data box (AvsFreeViewGroupBox). The existing technology defines the free-view track group as follows: if a free-view video is encapsulated into multiple video tracks and each video track contains only the texture information and/or depth information of some of the cameras, these video tracks shall be associated through a free-view track group, and the different video tracks that contain some of the cameras and correspond to the same free-view video shall have the same track_group_id. The embodiments of this application extend this definition as follows:
The free-view track group associates video tracks that correspond to different cameras and contain texture information, and these different video tracks are codec-independent of one another; moreover, the different video tracks that contain some of the cameras and correspond to the same free-view video shall have the same track_group_id. That is, the embodiments of this application only associate, through the free-view track group, the video tracks in which texture information is encapsulated; since texture information is one kind of video image information, this can be further generalized to: only video tracks in which video image information is encapsulated are associated through the free-view track group. In other words, in the embodiments of this application, only a video track that encapsulates video image information (such as texture information) may include the free-view track group data box. The free-view track group data box (track group data box for short) may include a track group identification field (track_group_id) and a camera identification field (camera_id), where the track group identification field stores the identification information of the track group to which the corresponding video track belongs, and the camera identification field stores the identification information of the camera corresponding to that video track.
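As a minimal sketch of the grouping rule (the dict-based track model and key names here are illustrative assumptions, not the file-format layout), a reader could reassemble the cameras of one free-view video by collecting camera_ids per track_group_id, considering only tracks that carry video image information, since only those tracks contain the group data box:

```python
def group_free_view_tracks(tracks: list[dict]) -> dict[int, list[int]]:
    """Collect camera_ids per track_group_id over the video image
    (e.g. texture) tracks; auxiliary-only tracks carry no group box."""
    groups: dict[int, list[int]] = {}
    for track in tracks:
        if track.get("has_image_info"):
            groups.setdefault(track["track_group_id"], []).append(track["camera_id"])
    return groups
```

All tracks of the same free-view video end up under one key, matching the requirement that they share the same track_group_id.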
II. DASH signaling extension
1. Define a video auxiliary information descriptor (VideoAuxiliaryInfoDescriptor) that describes the attributes of the video auxiliary information of a video resource. The video auxiliary information descriptor is a SupplementalProperty element whose @schemeIdUri attribute is "urn:avs:ims:2018:av3f". The descriptor may exist at the adaptation set level or at the representation level. When it exists at the adaptation set level, it describes the video-auxiliary-information attributes of all representations within the corresponding adaptation set; when it exists at the representation level, it describes those of the corresponding representation. The specific attributes of the video auxiliary information descriptor are shown in Table 3 below:
Table 3
Figure PCTCN2022137788-appb-000010
Figure PCTCN2022137788-appb-000011
In Table 3 above, the terms in the first column are as follows: the full name of "Elements and Attributes" is "Elements and Attributes for VideoAuxiliaryInfoDescriptor", i.e., the elements and attributes of the video auxiliary information descriptor. VideoAuxiliaryInfo is the video auxiliary information field; VideoAuxiliaryInfo@withDepthInfo is the with-depth-information field; VideoAuxiliaryInfo@withParallaxInfo is the with-parallax-information field; VideoAuxiliaryInfo@withAuxiliaryInfo is the with-auxiliary-information field; VideoAuxiliaryInfo@depthCodecDependency is the depth codec dependency field; VideoAuxiliaryInfo@parallaxCodecDependency is the parallax codec dependency field; VideoAuxiliaryInfo@auxilliaryCodecDependency is the auxiliary codec dependency field; VideoAuxiliaryInfo@onlyDepthInfo is the depth-only information flag field; VideoAuxiliaryInfo@onlyParallaxInfo is the parallax-only information flag field; and VideoAuxiliaryInfo@onlyAuxiliaryInfo is the auxiliary-only information flag field. In the second column, "Use" indicates the role of each attribute: O means Optional, CM means Conditional Mandatory, and M means Mandatory. The third column gives the data type; avs:ims:2018:av31 and xs:bool are two different data types. The fourth column, "Description", describes each attribute.
2. Extend the free-view camera descriptor (AvsFreeViewCamInfoDescriptor), which describes the camera information corresponding to a video resource. The free-view camera descriptor is a SupplementalProperty element whose @schemeIdUri attribute is "urn:avs:ims:2018:av31". The descriptor may exist at the adaptation set level or at the representation level. When it exists at the adaptation set level, it describes the camera information corresponding to all representations within the corresponding adaptation set; when it exists at the representation level, it describes the camera information corresponding to that representation.
Specifically, the attributes of the extended free-view camera descriptor are roughly as shown in Table 4 below:
Table 4
Figure PCTCN2022137788-appb-000012
Figure PCTCN2022137788-appb-000013
In Table 4 above, the two fields AvsFreeViewCam@sourceId (free-view camera source identification field) and AvsFreeViewCam@combinationId (free-view camera combination identification field) are the fields that the embodiments of this application add to the free-view camera descriptor; the remaining fields are the descriptor's original fields. The full name of "Elements and Attributes" in the first column is "Elements and Attributes for AvsFreeViewCamInfoDescriptor", i.e., the elements and attributes of the free-view camera descriptor. For the meanings of the terms in the second column, see the explanations given for Table 3; avs:ims:2018:av3l, xs:unsignedInt, and xs:float in the third column are three different data types.
Optionally, when immersive media (such as a free-view video) is encapsulated in multi-track form, the representation (video resource) or adaptation set corresponding to a video track may also be indicated through the preselection tool; that is, the video resources corresponding to different cameras of the same video content shall exist in the same preselection.
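A minimal sketch of emitting the DASH descriptor as a SupplementalProperty element follows. Only the schemeIdUri value comes from the text; the attribute names are taken from the Table 3 field names, and their exact serialized casing and value encoding ("true"/"false") are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

def make_video_auxiliary_descriptor(with_depth_info: bool,
                                    depth_codec_dependency: bool) -> ET.Element:
    """Build a SupplementalProperty element for the (hypothetically
    serialized) VideoAuxiliaryInfoDescriptor."""
    el = ET.Element("SupplementalProperty",
                    schemeIdUri="urn:avs:ims:2018:av3f")
    el.set("withDepthInfo", "true" if with_depth_info else "false")
    if with_depth_info:
        # The dependency attribute is conditional on the content flag,
        # mirroring the CM (Conditional Mandatory) use in Table 3.
        el.set("depthCodecDependency",
               "true" if depth_codec_dependency else "false")
    return el
```

Such an element would be placed under the AdaptationSet or Representation it describes, matching the two levels at which the descriptor may appear.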
III. SMT signaling extension
Define a video auxiliary information descriptor (VideoAuxiliaryInfoDescriptor), which describes the attributes of the video auxiliary information of a video resource. The definition of the video auxiliary information descriptor is shown in Table 5 below:
Table 5
Figure PCTCN2022137788-appb-000014
Figure PCTCN2022137788-appb-000015
Here, descriptor_tag (descriptor identifier field) identifies the type of the descriptor (i.e., the video auxiliary information descriptor), and descriptor_length (descriptor length field) identifies the length of the descriptor in bytes.
video_with_depth_info_flag (video-with-depth-information field) indicates whether the current media resource (i.e., the corresponding video resource) contains both texture (image) information and the depth information corresponding to it. When this field takes the ninth value (e.g., 1), the current media resource contains both texture (image) information and its corresponding depth information; when it takes the tenth value (e.g., 0), the current media resource contains only texture (image) information.
When video_with_depth_info_flag takes the ninth value (e.g., 1), i.e., the video resource contains both texture information and the corresponding depth information, the video auxiliary information descriptor may further include depth_codec_dependency (the depth codec dependency field of the video auxiliary information); otherwise, the descriptor may include only_depth_info_flag (the depth-only information flag field of the video auxiliary information). The depth_codec_dependency field indicates whether a decoding dependency exists between the texture (image) information contained in the current media resource and its corresponding depth information: the eleventh value (e.g., 1) indicates that a decoding dependency exists, and the twelfth value (e.g., 0) indicates that none exists. The only_depth_info_flag field indicates which image information the current media resource contains: the thirteenth value (e.g., 1) indicates that the current media resource contains only depth information, and the fourteenth value (e.g., 0) indicates that it contains only video image (texture) information.
video_with_parallax_info_flag (video-with-parallax-information field) indicates whether the current media resource (i.e., the corresponding video resource) contains both video image information and the parallax information corresponding to it. When this field takes the ninth value (e.g., 1), the current media resource contains both video image information and its corresponding parallax information; when it takes the tenth value (e.g., 0), the current media resource contains only video image information.
When video_with_parallax_info_flag takes the ninth value (e.g., 1), i.e., the video resource contains both video image information and the corresponding parallax information, the video auxiliary information descriptor may further include parallax_codec_dependency (the parallax codec dependency field of the video auxiliary information); otherwise, the descriptor may include only_parallax_info_flag (the parallax-only information flag field of the video auxiliary information). The parallax_codec_dependency field indicates whether a decoding dependency exists between the video image information contained in the current media resource and its corresponding parallax information: the eleventh value (e.g., 1) indicates that a decoding dependency exists, and the twelfth value (e.g., 0) indicates that none exists. The only_parallax_info_flag field indicates which image information the current media resource contains: the thirteenth value (e.g., 1) indicates that the current media resource contains only parallax information, and the fourteenth value (e.g., 0) indicates that it contains only video image information.
video_with_auxl_info_flag (video-with-auxiliary-information field) indicates whether the current media resource (i.e., the corresponding video resource) contains both video image information and the video auxiliary information corresponding to it. When this field takes the ninth value (e.g., 1), the current media resource contains both video image information and its corresponding video auxiliary information; when it takes the tenth value (e.g., 0), the current media resource contains only video image information. The video auxiliary information here may be depth information, parallax information, or auxiliary information defined by other applications.
When video_with_auxl_info_flag takes the ninth value (e.g., 1), i.e., the video resource contains both video image information and the corresponding video auxiliary information, the video auxiliary information descriptor may further include auxl_codec_dependency (the auxiliary codec dependency field of the video auxiliary information); otherwise, the descriptor may include only_auxiliary_info_flag (the auxiliary-only information flag field of the video auxiliary information). The auxl_codec_dependency field indicates whether a decoding dependency exists between the video image information contained in the current media resource and its corresponding video auxiliary information: the eleventh value (e.g., 1) indicates that a decoding dependency exists, and the twelfth value (e.g., 0) indicates that none exists. The only_auxiliary_info_flag field indicates which image information the current media resource contains: the thirteenth value (e.g., 1) indicates that the current media resource contains only video auxiliary information, and the fourteenth value (e.g., 0) indicates that it contains only video image information.
Based on the above description, the embodiments of this application propose an immersive media processing method, which may be executed by a media production device. Referring to Figure 2, the immersive media processing method may include the following steps S201-S202:
S201: Obtain a video bit stream of immersive media.
In the embodiments of this application, the immersive media may be a free-view video, or immersive media other than a free-view video; this is not limited here. The immersive media may be captured by multiple cameras, and the information captured by any camera includes video image information and the corresponding video auxiliary information. When the immersive media is a free-view video, the video image information captured by any camera may include texture information, and the video auxiliary information captured by any camera may include depth information. It should be understood that the video image information and the video auxiliary information are not limited to these; for example, when the immersive media is another kind of immersive media, the video auxiliary information captured by any camera may include parallax information, or auxiliary information defined by other applications.
S202: Encapsulate the video bit stream into M video tracks to obtain a media file of the immersive media.
Here, M is a positive integer. When M=1, the encapsulation mode of the immersive media is single-track encapsulation; that is, the media production device encapsulates the video bitstream into a single video track to obtain the media file of the immersive media. When M>1, the encapsulation mode of the immersive media is multi-track encapsulation; that is, the media production device encapsulates the video bitstream into multiple video tracks to obtain the media file of the immersive media. In this case, the media production device may encapsulate the video image information and the corresponding video auxiliary information collected by the same camera into the same video track, or into different video tracks; this is not limited here. Further, when the encapsulation mode is multi-track encapsulation, one video track may correspond to one or more cameras, which is likewise not limited. For example, the media production device may choose to have one video track encapsulate only at least one of the video image information and the corresponding video auxiliary information collected by a single camera; or have one video track encapsulate the video image information and corresponding video auxiliary information collected by at least two cameras; or have one video track encapsulate the video image information collected by at least two cameras, or the video auxiliary information collected by at least two cameras.
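The packing options above can be made concrete with a short sketch. This is purely illustrative: the tuple-based in-memory representation of "stream units" and tracks is an assumption for the example, not something mandated by the text.

```python
# Hypothetical sketch of the encapsulation choices described above.
# A stream unit is modeled as (camera_id, kind), where kind is
# "image" (video image information) or "aux" (video auxiliary
# information); a track is simply a list of such units.

def single_track(units):
    """M = 1: the whole bitstream goes into one video track."""
    return [list(units)]

def multi_track_split(units):
    """M > 1, split packing: the image information and the auxiliary
    information of each camera go into separate tracks (one image
    track and one aux track per camera)."""
    tracks = {}
    for camera_id, kind in units:
        tracks.setdefault((camera_id, kind), []).append((camera_id, kind))
    return list(tracks.values())

units = [(1, "image"), (1, "aux"), (2, "image"), (2, "aux")]
assert len(single_track(units)) == 1        # one track holds everything
assert len(multi_track_split(units)) == 4   # two cameras -> four tracks
```

Other layouts permitted by the text (e.g., one track per camera holding both kinds of information, or one track covering several cameras) follow the same pattern with a different grouping key.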
During the encapsulation process, the media production device may also perform the following operations:
(1) According to whether the video image information and the video auxiliary information are in the same video track, add metadata information to the video media header box of at least one of the M video tracks, so as to indicate whether the corresponding video track contains both video image information and the corresponding video auxiliary information. That is, among the M video tracks there is at least one video track whose video media header box includes metadata information, and this metadata information is used to indicate whether the corresponding video track contains both video image information and the corresponding video auxiliary information.
Specifically, if M=1 (that is, the video bitstream of the immersive media is encapsulated into one video track), the media production device may add the metadata information to the video media header box of this single video track. If M>1 (that is, the video bitstream of the immersive media is encapsulated into multiple video tracks), then when the video image information and the corresponding video auxiliary information collected by the same camera are encapsulated into the same video track, the media production device may add the metadata information to the video media header box of each of the M video tracks. When the video image information and the corresponding video auxiliary information collected by the same camera are encapsulated into different video tracks, the M video tracks can be subdivided into one or more video image information tracks and one or more video auxiliary information tracks, and a video auxiliary information track can be referenced or associated to the corresponding video image information track; therefore, to save resources, the media production device may add the metadata information only to the video media header box of each video image information track among the M video tracks.
It can thus be seen that, in some embodiments, the media production device may add the metadata information at least to the video media header box of any video track that encapsulates video image information; that is, if the m-th video track among the M video tracks encapsulates the video image information of at least one camera, the video media header box of the m-th video track includes the metadata information, where m∈[1, M].
In this way, based on the metadata information in the video media header box of each video track, it can be directly determined whether the video track contains both video image information and corresponding video auxiliary information. This avoids the media consumption device repeatedly parsing other boxes during decapsulation to determine whether they contain video image information and corresponding video auxiliary information, saving processing resources.
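The placement rule described above can be sketched as follows. The set-of-kinds representation of a track is an assumption made for the example; the rule itself (single track gets the metadata; otherwise at least every image-information track gets it) comes from the text.

```python
def tracks_needing_metadata(tracks):
    """Return the indices of the tracks whose video media header box
    should carry the metadata information, per the placement rule
    above. A track is modeled as the set of information kinds it
    encapsulates: {"image"}, {"aux"}, or {"image", "aux"}."""
    if len(tracks) == 1:
        return [0]  # M = 1: the only track carries the metadata
    # M > 1: at least every track that encapsulates image information
    return [i for i, kinds in enumerate(tracks) if "image" in kinds]

# Single-track case:
assert tracks_needing_metadata([{"image", "aux"}]) == [0]
# Split multi-track case: only the image information tracks qualify:
assert tracks_needing_metadata([{"image"}, {"aux"}, {"image"}]) == [0, 2]
```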
In some embodiments, the metadata information includes a content information flag field, and the media production device may set the value of the content information flag field according to whether the video track contains both video image information and corresponding video auxiliary information. Specifically, if the corresponding video track contains both video image information and corresponding video auxiliary information, the media production device may set the value of the content information flag field to a first value (for example, 1); if the corresponding video track contains only video image information, the media production device may set the value of the content information flag field to a second value (for example, 0).
In this embodiment, to quickly determine whether a video track contains both video image information and corresponding video auxiliary information, a content information flag field for that purpose is defined in the metadata information. This ensures the processing speed of the media consumption device during decapsulation and also saves processing resources.
In some embodiments, the content information flag field may be any of the following: a video depth information flag field (the aforementioned video_with_depth_info_flag), a video parallax information flag field (the aforementioned video_with_parallax_info_flag), or a video auxiliary information flag field (the aforementioned video_with_auxl_info_flag). When the content information flag field is the video depth information flag field, the video image information includes texture information and the video auxiliary information includes depth information; when the content information flag field is the video parallax information flag field, the video auxiliary information includes parallax information; and when the content information flag field is the video auxiliary information flag field, the video auxiliary information includes depth information, parallax information, or auxiliary information defined by other applications.
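A minimal sketch of this flag-setting logic is shown below. The field names are the ones quoted in the text; the dict-based representation of the video media header box is an assumption made purely for illustration.

```python
# Hypothetical sketch: choosing the content information flag field
# according to the kind of auxiliary information, and setting its
# value per the rule above (first value 1 = both kinds of information
# present; second value 0 = image information only).

AUX_KIND_TO_FLAG = {
    "depth": "video_with_depth_info_flag",
    "parallax": "video_with_parallax_info_flag",
    "generic": "video_with_auxl_info_flag",
}

def set_content_info_flag(media_header: dict, aux_kind: str,
                          has_image: bool, has_aux: bool) -> dict:
    field = AUX_KIND_TO_FLAG[aux_kind]
    media_header[field] = 1 if (has_image and has_aux) else 0
    return media_header

# A track carrying texture + depth information:
hdr = set_content_info_flag({}, "depth", has_image=True, has_aux=True)
assert hdr["video_with_depth_info_flag"] == 1
# A track carrying image information only:
hdr = set_content_info_flag({}, "parallax", has_image=True, has_aux=False)
assert hdr["video_with_parallax_info_flag"] == 0
```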
It should be noted that the above only illustrates, by way of example, how the media production device adds the metadata information; it is not limited thereto. For example, in some embodiments, the media production device may instead choose to add the metadata information to the video media header box of a video track that encapsulates video auxiliary information. In this case, if the corresponding video track contains both video image information and corresponding video auxiliary information, the media production device may set the value of the content information flag field to the first value (e.g., 1); if the corresponding video track contains only video auxiliary information, the media production device may set the value to the second value (e.g., 0). As another example, in some embodiments the media production device may disregard whether the video auxiliary information and the video image information are encapsulated into the same video track, and simply add the metadata information to the video media header box of every video track. In this case, three values of the content information flag field can indicate what the video track carries: if the corresponding video track contains both video image information and corresponding video auxiliary information, the field may be set to the first value (e.g., 1); if it contains only video image information, to the second value (e.g., 0); and if it contains only video auxiliary information, to another value (e.g., 2); and so on.
In this embodiment, by identifying whether the content information flag field is the video depth information flag field, the video parallax information flag field, or the video auxiliary information flag field, it can be promptly determined during the encapsulation process which of depth information, parallax information, and application-defined auxiliary information the auxiliary information is.
(2) For a video track containing the metadata information, if the metadata information indicates that the corresponding video track contains both video image information and corresponding video auxiliary information (e.g., the content information flag field takes the first value), the media production device may further add a video information codec dependency field to the video media header box of the corresponding video track, and may set the value of this field according to whether a decoding dependency exists between the video image information and the corresponding video auxiliary information in that track.
In some embodiments, if the metadata information indicates that the corresponding video track contains both video image information and corresponding video auxiliary information, the video media header box of the corresponding video track further includes a video information codec dependency field. When the value of the video information codec dependency field is a third value, it indicates that a decoding dependency exists between the video image information in the corresponding video track and the corresponding video auxiliary information; when the value is a fourth value, it indicates that no decoding dependency exists between them.
Specifically, if a decoding dependency exists between the video image information in the video track and the corresponding video auxiliary information, the value of the video information codec dependency field may be set to the third value (e.g., 1); if no such dependency exists, the value may be set to the fourth value (e.g., 0).
In this embodiment, by identifying the video information codec dependency field in the video track, it can be quickly determined whether a decoding dependency exists between the video image information in the track and the corresponding video auxiliary information, which improves the efficiency of the encoding and decoding process.
As can be seen from the foregoing, the specific form of the video information codec dependency field may depend on the specific form of the content information flag field: if the content information flag field is the video depth information flag field, the video information codec dependency field is a depth codec dependency field; if it is the video parallax information flag field, the video information codec dependency field is a parallax codec dependency field; and if it is the video auxiliary information flag field, the video information codec dependency field is an auxiliary codec dependency field.
Therefore, in some embodiments, the video information codec dependency field is any of the following: a depth codec dependency field, a parallax codec dependency field, or an auxiliary codec dependency field. When the video information codec dependency field is the depth codec dependency field, the video image information includes texture information and the video auxiliary information includes depth information; when it is the parallax codec dependency field, the video auxiliary information includes parallax information; and when it is the auxiliary codec dependency field, the video auxiliary information includes depth information, parallax information, or auxiliary information defined by other applications.
Here, the depth codec dependency field is the aforementioned depth_codec_dependency, the parallax codec dependency field is the aforementioned parallax_codec_dependency, and the auxiliary codec dependency field is the aforementioned auxl_codec_dependency.
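The pairing between each content information flag field and its codec dependency field can be sketched as below. The field names are those quoted in the text; treating the video media header box as a dict is, again, an assumption for illustration only.

```python
# Hypothetical sketch: each content information flag field has a
# matching codec-dependency field (names as quoted in the text).

FLAG_TO_DEPENDENCY = {
    "video_with_depth_info_flag": "depth_codec_dependency",
    "video_with_parallax_info_flag": "parallax_codec_dependency",
    "video_with_auxl_info_flag": "auxl_codec_dependency",
}

def set_codec_dependency(media_header: dict, flag_field: str,
                         dependent: bool) -> dict:
    # The dependency field is only added when the flag field took the
    # first value (1), i.e. the track carries both kinds of information.
    assert media_header.get(flag_field) == 1
    dep_field = FLAG_TO_DEPENDENCY[flag_field]
    # Third value (1): decoding dependency exists; fourth value (0): none.
    media_header[dep_field] = 1 if dependent else 0
    return media_header

hdr = {"video_with_depth_info_flag": 1}
hdr = set_codec_dependency(hdr, "video_with_depth_info_flag", dependent=True)
assert hdr["depth_codec_dependency"] == 1
```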
In this embodiment, by identifying whether the video information codec dependency field is the depth codec dependency field, the parallax codec dependency field, or the auxiliary codec dependency field, it can be promptly determined which of depth information, parallax information, and application-defined auxiliary information the auxiliary information in the video track is.
(3) According to the component type of the video content of the immersive media, and according to whether a codec dependency exists between different components of that video content, add a video component dependency information box to at least one of the M video tracks.
It can thus be seen that, in some embodiments, at least one of the M video tracks includes a video component dependency information box, where the video component dependency information box is used to indicate the component type of the video content of the immersive media and whether a codec dependency exists between different components of that video content.
The above embodiment defines the video component dependency information box. Through this box, the component type of the video content, and whether a codec dependency exists between components, can be made explicit, thereby improving codec efficiency.
In some embodiments, when M=1, the single video track includes the video component dependency information box; when M>1, if the m-th video track among the M video tracks encapsulates one or more kinds of information collected by at least two cameras, the m-th video track includes the video component dependency information box.
Specifically, if M=1 (that is, the video bitstream of the immersive media is encapsulated into one video track), the media production device may add the video component dependency information box to this single video track. If M>1 (that is, the video bitstream of the immersive media is encapsulated into multiple video tracks), then when a video track corresponds to at least two cameras, the media production device may add the video component dependency information box to each of the M video tracks that encapsulates one or more kinds of information collected by at least two cameras. For example, if a video track encapsulates the video image information and video auxiliary information collected by at least two cameras, the media production device may add the video component dependency information box to that track; likewise, if a video track encapsulates the video image information collected by at least two cameras, or the video auxiliary information collected by at least two cameras, the media production device may add the box to that track. It follows that if the m-th video track among the M video tracks encapsulates one or more kinds of information collected by at least two cameras, the m-th video track includes the video component dependency information box.
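The placement rule for this box can be sketched as follows; the list-of-units representation of a track is an assumption for the example.

```python
def tracks_needing_composition_box(tracks):
    """Indices of the tracks that should carry the video component
    dependency information box, per the rule above. A track is
    modeled as a list of (camera_id, kind) units."""
    if len(tracks) == 1:
        return [0]  # M = 1: the single track gets the box
    out = []
    for i, units in enumerate(tracks):
        cameras = {camera_id for camera_id, _ in units}
        if len(cameras) >= 2:  # track covers at least two cameras
            out.append(i)
    return out

# Track 0 mixes two cameras' image info and needs the box;
# track 1 holds one camera's auxiliary info and does not:
assert tracks_needing_composition_box(
    [[(1, "image"), (2, "image")], [(3, "aux")]]) == [0]
# Single-track case:
assert tracks_needing_composition_box([[(1, "image"), (1, "aux")]]) == [0]
```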
In this embodiment, when the number of video tracks is 1, multiple video component dependency information boxes are not needed to indicate the component type of the video content and the codec dependencies between different components. When there are multiple video tracks, if a video track encapsulates one or more kinds of information collected by at least two cameras, that track needs the video component dependency information box to indicate the codec dependency between the components from those cameras, so as to improve codec efficiency.
In some embodiments, the video component dependency information box includes a component codec dependency field. When the value of the component codec dependency field is a fifth value, it indicates that a codec dependency exists between different components of the video content of the immersive media; when the value is a sixth value, it indicates that no codec dependency exists between different components of the video content of the immersive media.
Here, the video component dependency information box may include the component codec dependency field (the aforementioned composition_codec_dependency), and the media production device may set the value of this field according to whether a codec dependency exists between different components of the video content of the immersive media. Specifically, if a codec dependency exists between different components, the media production device may set the value of the component codec dependency field to the fifth value (e.g., 1); if no such dependency exists, it may set the value to the sixth value (e.g., 0).
In this embodiment, the value of the component codec dependency field accurately reflects whether a codec dependency exists between different components of the video content of the immersive media.
In some embodiments, the video component dependency information box includes a component type field. When the value of the component type field is a seventh value, it indicates that the component type of the video content of the immersive media is a camera-based component type, in which one component includes the video content captured by one camera. When the value of the component type field is an eighth value, it indicates that the component type of the video content of the immersive media is an eye-based (binocular) component type, in which one component includes left-eye video content or right-eye video content.
Here, the component type field is the aforementioned composition_type, and the media production device may set the value of the component type field according to the component type of the video content of the immersive media (that is, the rule by which the video content is divided into components). Specifically, if the component type is the camera-based component type, i.e., one component includes the video content captured by one camera, the media production device may set the value of the component type field to the seventh value (e.g., 0); if the component type is the eye-based component type, i.e., one component includes left-eye video content or right-eye video content, it may set the value to the eighth value (e.g., 1).
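The two fields of this box, with the example values given above, can be sketched as a small structure. The dataclass form is illustrative only; the field names composition_codec_dependency and composition_type are the ones quoted in the text.

```python
# Hypothetical sketch of the video component dependency information
# box fields named in the text.
from dataclasses import dataclass

@dataclass
class VideoCompositionDependencyInfo:
    # fifth value 1: codec dependency exists between components;
    # sixth value 0: no dependency between components
    composition_codec_dependency: int
    # seventh value 0: camera-based components (one component = one
    # camera's content); eighth value 1: eye-based components (one
    # component = left-eye or right-eye content)
    composition_type: int

# Components divided per camera, with a codec dependency between them:
box = VideoCompositionDependencyInfo(composition_codec_dependency=1,
                                     composition_type=0)
assert (box.composition_codec_dependency, box.composition_type) == (1, 0)
```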
In this embodiment, the value of the component type field in the video component dependency information box accurately distinguishes whether the component type of the video content of the immersive media is the camera-based component type or the eye-based component type.
(4) When M>1 (that is, the video bitstream of the immersive media is encapsulated into multiple video tracks), add a track group box to each of the M video tracks that encapsulates video image information, so as to associate with one another all the video tracks among the M video tracks that encapsulate video image information.
In some embodiments, when M>1, each video track among the M video tracks that encapsulates video image information further contains a track group box. The track group box includes a track group identifier field and a camera identifier field: the track group identifier field is used to store the identification information of the track group corresponding to the video track, and the camera identifier field is used to store the identification information of the camera corresponding to the video track.
The track group box may include at least the track group identifier field and the camera identifier field. For any video track that encapsulates video image information, the media production device may store the identification information of the track group corresponding to that video track in the track group identifier field of the track group box in that track, and store the identification information of the camera corresponding to that video track in the camera identifier field of the same box.
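A minimal sketch of these two fields follows. The names track_group_id and camera_id are illustrative assumptions for the identifier fields described in the text, and the dataclass form is not normative.

```python
# Hypothetical sketch of the track group box fields described above.
from dataclasses import dataclass

@dataclass
class TrackGroupBox:
    track_group_id: int  # identification info of the track group
    camera_id: int       # identification info of the corresponding camera

# Three image information tracks, one per camera, associated with
# each other by sharing one track group identifier:
boxes = [TrackGroupBox(track_group_id=100, camera_id=c) for c in (1, 2, 3)]
assert len({b.track_group_id for b in boxes}) == 1   # same group
assert {b.camera_id for b in boxes} == {1, 2, 3}     # distinct cameras
```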
In this embodiment, through the track group box in the video track, the media production device can accurately store the identification information of the track group and the identification information of the camera in the corresponding identifier fields.
(5) When the M video tracks include one or more video image information tracks and one or more video auxiliary information tracks, the media production device may further add a track reference type box to each video auxiliary information track, so as to reference or associate the video auxiliary information track to the video image information track corresponding to the same camera, and to indicate whether a decoding dependency exists between the video auxiliary information track and the corresponding video image information track.
In one embodiment, when M>1, if the video image information and the video auxiliary information of the same camera are encapsulated into different video tracks, the M video tracks include one or more video image information tracks and one or more video auxiliary information tracks. A video image information track is a video track that encapsulates video image information, and a video auxiliary information track is a video track that encapsulates video auxiliary information. Each video auxiliary information track includes a track reference type box, which is used to reference or associate the video auxiliary information track to the video image information track corresponding to the same camera, and to indicate whether a decoding dependency exists between the video auxiliary information track and the corresponding video image information track.
In this embodiment, the M video tracks are divided into at least one track encapsulating video image information and at least one track encapsulating video auxiliary information, and the decoding dependency indicated by the track reference type box of each video auxiliary information track can improve decoding efficiency.
In some embodiments, the track reference type box contains a track identifier field, and the track identifier field is used to store the identification information of the video image information track associated with the video auxiliary information track.
Here, the track reference type box contains a track identifier field (track_ID). For any video auxiliary information track, the media production device may store the identification information of the video image information track associated with that video auxiliary information track in the track identifier field of the track reference type box in that video auxiliary information track.
In this embodiment, through the track identifier field in the track reference type box, the media production device can accurately store the identification information of the video image information track associated with each video auxiliary information track.
In some embodiments, the track reference type box contains a reference type field. When the reference type field stores a first track index, it indicates that the video auxiliary information track depends on the corresponding video image information track when decoded; when the reference type field stores a second track index, it indicates that the video auxiliary information track does not depend on the corresponding video image information track when decoded.
Since the track reference type box contains the reference type field, for any video auxiliary information track the media production device may, according to whether that track depends on the corresponding video image information track when decoded, choose to store either the first track index or the second track index in the reference type field of the track reference type box in that video auxiliary information track. Specifically, if the video auxiliary information track depends on the corresponding video image information track when decoded, the first track index may be stored in the reference type field of the track reference type box in that track; if the video auxiliary information track does not depend on the corresponding video image information track when decoded, the second track index may be stored in that field instead.
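The reference mechanism above can be sketched as follows. Note that this passage refers to the first and second track indexes only as strings without spelling out the concrete codes, so the four-character strings in the sketch are placeholders, not normative values; the dataclass form is likewise an assumption for illustration.

```python
# Hypothetical sketch of the track reference type box usage described
# above: a video auxiliary information track references its image
# information track by track_ID, and the reference type field encodes
# whether a decoding dependency exists.
from dataclasses import dataclass
from typing import List

@dataclass
class TrackReferenceTypeBox:
    reference_type: str   # first or second track index (placeholder codes)
    track_IDs: List[int]  # IDs of the referenced image information tracks

def make_reference(image_track_id: int,
                   depends_on_image: bool) -> TrackReferenceTypeBox:
    # "depN"/"indN" are placeholders: the text only says a first track
    # index (dependent) or a second track index (independent) is stored.
    ref_type = "depN" if depends_on_image else "indN"
    return TrackReferenceTypeBox(reference_type=ref_type,
                                 track_IDs=[image_track_id])

ref = make_reference(7, depends_on_image=True)
assert ref.track_IDs == [7] and ref.reference_type == "depN"
```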
In this embodiment, during decoding, the media consumption device can determine from the track index stored in the reference type field whether a dependency exists between the video auxiliary information track and the video image information track, so that decoding can be completed accurately and quickly, which improves decoding efficiency.
In some embodiments, the first track index is any one of the following: a first character string, which indicates that the video auxiliary information in the video auxiliary information track includes depth information; a second character string, which indicates that the video auxiliary information in the video auxiliary information track includes parallax information; or a third character string, which indicates that the video auxiliary information in the video auxiliary information track includes depth information, parallax information, or auxiliary information defined by other applications. The second track index is any one of the following: a fourth character string, which indicates that the video auxiliary information in the video auxiliary information track includes depth information; a fifth character string, which indicates that the video auxiliary information in the video auxiliary information track includes parallax information; or a sixth character string, which indicates that the video auxiliary information in the video auxiliary information track includes depth information, parallax information, or auxiliary information defined by other applications.
Here, the first character string is the aforementioned 'vdep', the second character string is the aforementioned 'vplx', and the third character string is the aforementioned 'auxl'. The fourth character string is the aforementioned 'vdpi', the fifth character string is the aforementioned 'vpli', and the sixth character string is the aforementioned 'auxi'.
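Under the six four-character codes just listed, the reference type field can be mapped to a (depends-on-image-track, auxiliary-information-kind) pair. The following is a minimal sketch of that mapping; the helper name, return shape, and kind labels are illustrative and not taken from any real parser.

```python
# Sketch: interpreting the reference type field of the track reference type
# data box. The six codes come from the description above; everything else
# (function name, return shape) is illustrative.

FIRST_TRACK_INDEXES = {   # decoding depends on the video image information track
    "vdep": "depth",
    "vplx": "parallax",
    "auxl": "depth/parallax/other-app-defined",
}
SECOND_TRACK_INDEXES = {  # decoding does not depend on the image track
    "vdpi": "depth",
    "vpli": "parallax",
    "auxi": "depth/parallax/other-app-defined",
}

def interpret_reference_type(code):
    """Return (depends_on_image_track, aux_info_kind) for a reference type code."""
    if code in FIRST_TRACK_INDEXES:
        return True, FIRST_TRACK_INDEXES[code]
    if code in SECOND_TRACK_INDEXES:
        return False, SECOND_TRACK_INDEXES[code]
    raise ValueError(f"unknown reference type: {code!r}")
```

For example, a track carrying 'vdep' must be decoded together with its referenced image track, while a track carrying 'vdpi' carries depth information that decodes on its own.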
In this embodiment, for the first track index and the second track index, a distinct character string is defined for each type of video auxiliary information, which effectively distinguishes whether the video auxiliary information indicated by the first or second track index is depth information, parallax information, or application-defined auxiliary information.
In the above immersive media processing method, the video bitstream of the immersive media is obtained and encapsulated into M video tracks to obtain the media file of the immersive media. In this process, metadata information can be encapsulated in the video media header data box of at least one video track to provide a unified indication of the information contained in the corresponding video track. This makes it possible to organize the media content of the immersive media in a more optimized way and saves the media consumption device from repeatedly parsing, during decapsulation, the fields in other data boxes that indicate the information contained in the video track, thereby saving processing resources, improving decapsulation efficiency, and in turn improving decoding efficiency.
In some embodiments, after the video bitstream is encapsulated into the M video tracks, a video resource corresponding to each video track is obtained, and the media file includes the video resources corresponding to the M video tracks. If the transmission mode of the media file is streaming, the method further includes: generating a signaling description file corresponding to the media file, where the signaling description file includes at least one video auxiliary information descriptor used to describe the relevant attributes of the video auxiliary information of a video resource.
The transmission signaling corresponding to the signaling description file may be DASH signaling or SMT signaling, which is not limited here.
Specifically, the media production device may add at least one video auxiliary information descriptor to the signaling description file to describe the relevant attributes of the video auxiliary information of the video resources.
Further, when M>1, the immersive media is encapsulated in multi-track form, and the video resource or adaptation set corresponding to each video track is indicated by a preselection tool.
In this way, when there are multiple video tracks, the preselection tool can effectively and accurately indicate the video resource or adaptation set corresponding to each video track.
In this embodiment, if it is determined that the transmission mode of the media file is streaming, a signaling description file including at least one video auxiliary information descriptor is generated, so that the attributes associated with the video auxiliary information can be described in detail and accurately.
In some embodiments, when a video auxiliary information descriptor is located at the adaptation set level of the signaling description file, it describes the relevant attributes of the video auxiliary information of each video resource in the corresponding adaptation set; when a video auxiliary information descriptor is located at the representation level of the signaling description file, it describes the relevant attributes of the video auxiliary information of the corresponding video resource.
It should be noted that, when the transmission signaling is DASH signaling, as follows from the aforementioned extensions to DASH signaling, a video auxiliary information descriptor may be located at the representation level to describe the relevant attributes of the video auxiliary information of the corresponding video resource; in this case, the number of video auxiliary information descriptors in the signaling description file equals the number of video resources (representations) in the media file. Alternatively, a video auxiliary information descriptor may be located at the adaptation set level to describe the relevant attributes of the video auxiliary information of each video resource in the corresponding adaptation set; in this case, the number of video auxiliary information descriptors in the signaling description file equals the number of adaptation sets in the media file.
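The relationship between descriptor placement and descriptor count can be sketched as follows. The model below (a plain dict of adaptation sets to representation lists, and the helper name) is purely illustrative and not from any real DASH library.

```python
# Hypothetical MPD model: one descriptor per representation when placed at
# the representation level, one per adaptation set when placed at the
# adaptation set level. Names here are illustrative only.

def attach_descriptors(adaptation_sets, level):
    """Attach one VideoAuxiliaryInfo descriptor per unit at the given level.

    Returns a list of (scope_id, descriptor) pairs.
    """
    descriptors = []
    for aset_id, representations in adaptation_sets.items():
        if level == "adaptation_set":
            # One descriptor covering every representation in this set.
            descriptors.append((aset_id, {"descriptor": "VideoAuxiliaryInfo"}))
        elif level == "representation":
            # One descriptor per individual video resource.
            for rep_id in representations:
                descriptors.append((rep_id, {"descriptor": "VideoAuxiliaryInfo"}))
        else:
            raise ValueError("level must be 'representation' or 'adaptation_set'")
    return descriptors

mpd = {"as0": ["rep0", "rep1"], "as1": ["rep2"]}
rep_level = attach_descriptors(mpd, "representation")   # 3 descriptors, one per representation
set_level = attach_descriptors(mpd, "adaptation_set")   # 2 descriptors, one per adaptation set
```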
In this embodiment, by determining the level at which a video auxiliary information descriptor is placed in the signaling description file, the relevant attributes matching that level can be described accurately.
In some embodiments, the media production device may add a video auxiliary information field to the video auxiliary information descriptor. When the video auxiliary information descriptor is located at the representation level, the media production device may store the relevant attributes of the video auxiliary information of the corresponding video resource in that field; when the descriptor is located at the adaptation set level, the media production device may store the relevant attributes of the video auxiliary information of the corresponding adaptation set in that field.
In this embodiment, by means of the video auxiliary information field in the video auxiliary information descriptor, the relevant attributes of the video auxiliary information of a video resource or of an adaptation set can be stored in the corresponding field promptly and accurately, which improves the accuracy of storing video auxiliary information.
In some embodiments, the video auxiliary information descriptor includes an auxiliary information indication field. When the value of the auxiliary information indication field is a ninth value, it indicates that the corresponding video resource contains both video image information and the corresponding video auxiliary information; when the value of the auxiliary information indication field is a tenth value, it indicates that the corresponding video resource contains only video image information.
Here, the media production device may add the auxiliary information indication field to the video auxiliary information descriptor, and may set its value according to whether the video resource described by the descriptor contains both video image information and the corresponding video auxiliary information.
Specifically, if the video resource described by the video auxiliary information descriptor contains both video image information and the corresponding video auxiliary information, the media production device sets the auxiliary information indication field to the ninth value (for example, 1); if the video resource contains only video image information, the media production device sets the field to the tenth value (for example, 0).
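A minimal sketch of setting this field, assuming the example values 1 and 0 mentioned above; the dict-based resource model and helper name are illustrative, not a real MPD API.

```python
# Sketch: deriving the auxiliary information indication field value.
# NINTH_VALUE / TENTH_VALUE follow the example values (1 and 0) above;
# the "resource" dict shape is purely illustrative.

NINTH_VALUE = 1   # resource carries video image info AND video auxiliary info
TENTH_VALUE = 0   # resource carries video image info only

def auxiliary_info_indication(resource):
    """Return the value to store in the auxiliary information indication field."""
    has_image = resource.get("video_image_info", False)
    has_aux = resource.get("video_auxiliary_info", False)
    if has_image and has_aux:
        return NINTH_VALUE
    return TENTH_VALUE

# A texture+depth resource vs. a texture-only resource:
both = auxiliary_info_indication(
    {"video_image_info": True, "video_auxiliary_info": True})    # → 1
image_only = auxiliary_info_indication(
    {"video_image_info": True, "video_auxiliary_info": False})   # → 0
```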
In this embodiment, the value of the auxiliary information indication field makes it possible to determine promptly whether the corresponding video resource contains only video image information, or both video image information and the corresponding video auxiliary information.
In some embodiments, the auxiliary information indication field is any one of the following: a video auxiliary information field with depth information, a video auxiliary information field with parallax information, or a video auxiliary information field with auxiliary information. When the auxiliary information indication field is the video auxiliary information field with depth information, the video image information includes texture information and the video auxiliary information includes depth information; when the auxiliary information indication field is the video auxiliary information field with parallax information, the video auxiliary information includes parallax information; when the auxiliary information indication field is the video auxiliary information field with auxiliary information, the video auxiliary information includes depth information, parallax information, or auxiliary information defined by other applications.
Here, the video auxiliary information field with depth information is VideoAuxiliaryInfo@withDepthInfo in the aforementioned DASH signaling or video_with_depth_info_flag in SMT signaling; the video auxiliary information field with parallax information is VideoAuxiliaryInfo@withParallaxInfo in the aforementioned DASH signaling or video_with_parallax_info_flag in the aforementioned SMT signaling; and the video auxiliary information field with auxiliary information is VideoAuxiliaryInfo@withAuxiliaryInfo in the aforementioned DASH signaling or video_with_auxl_info_flag in SMT signaling.
In this embodiment, by identifying whether the auxiliary information indication field is the video auxiliary information field with depth information, the video auxiliary information field with parallax information, or the video auxiliary information field with auxiliary information, it can be promptly determined during transmission that, in addition to the video image information of the video resource, the resource also contains one of depth information, parallax information, and auxiliary information.
Further, if the video resource described by the video auxiliary information descriptor contains both video image information and the corresponding video auxiliary information, that is, the auxiliary information indication field takes the ninth value, the media production device may add an auxiliary information codec dependency field to the video auxiliary information descriptor and set its value according to whether a decoding dependency exists between the video image information in the corresponding video resource and the corresponding video auxiliary information.
In some embodiments, if the video resource described by the video auxiliary information descriptor contains both video image information and the corresponding video auxiliary information, the video auxiliary information descriptor includes an auxiliary information codec dependency field. When the value of the auxiliary information codec dependency field is an eleventh value, it indicates that a decoding dependency exists between the video image information in the corresponding video resource and the corresponding video auxiliary information; when the value of the field is a twelfth value, it indicates that no such decoding dependency exists.
Specifically, if a decoding dependency exists between the video image information in the video resource described by the descriptor and the corresponding video auxiliary information, the auxiliary information codec dependency field may be set to the eleventh value (for example, 1); if no such dependency exists, the field may be set to the twelfth value (for example, 0).
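On the consumption side, this field tells a client which elementary streams it must hand to the decoder when it wants the auxiliary information. The sketch below assumes, as in the track case described earlier, that the dependency runs from the auxiliary information to the image information; the function and its inputs are illustrative, not from any real player API.

```python
# Sketch: using the auxiliary information codec dependency field on the
# client. Example values follow the text above (1 = dependency, 0 = none);
# the stream labels and function name are illustrative.

ELEVENTH_VALUE = 1  # auxiliary info decoding depends on the image bitstream
TWELFTH_VALUE = 0   # auxiliary info decodes independently

def streams_to_feed_decoder(want_aux_info, codec_dependency):
    """Return which streams of the resource must be handed to the decoder."""
    if not want_aux_info:
        # The image information is always consumable on its own.
        return ["video_image"]
    if codec_dependency == ELEVENTH_VALUE:
        # Dependency present: both must be decoded together.
        return ["video_image", "video_auxiliary"]
    # No dependency: the auxiliary information can be decoded by itself.
    return ["video_auxiliary"]
```

This is the decision that lets a client skip segmented decoding when no dependency is signaled.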
In this embodiment, by identifying the value of the auxiliary information codec dependency field, it can be quickly determined whether a decoding dependency exists between the video image information and the corresponding video auxiliary information in the video resource, which improves decoding efficiency.
As can be seen from the foregoing, the specific form of the auxiliary information codec dependency field may depend on the specific form of the auxiliary information indication field. If the auxiliary information indication field is the video auxiliary information field with depth information, the auxiliary information codec dependency field is the depth codec dependency field of the video auxiliary information; if the auxiliary information indication field is the video auxiliary information field with parallax information, the auxiliary information codec dependency field is the parallax codec dependency field of the video auxiliary information; if the auxiliary information indication field is the video auxiliary information field with auxiliary information, the auxiliary information codec dependency field is the auxiliary codec dependency field of the video auxiliary information.
Therefore, in some embodiments, the auxiliary information codec dependency field is any one of the following: the depth codec dependency field of the video auxiliary information, the parallax codec dependency field of the video auxiliary information, or the auxiliary codec dependency field of the video auxiliary information. When the auxiliary information codec dependency field is the depth codec dependency field of the video auxiliary information, the video image information includes texture information and the video auxiliary information includes depth information; when it is the parallax codec dependency field of the video auxiliary information, the video auxiliary information includes parallax information; when it is the auxiliary codec dependency field of the video auxiliary information, the video auxiliary information includes depth information, parallax information, or auxiliary information defined by other applications.
Here, the depth codec dependency field of the video auxiliary information is VideoAuxiliaryInfo@depthCodecDependency in the aforementioned DASH signaling or depth_codec_dependency in SMT signaling; the parallax codec dependency field of the video auxiliary information is VideoAuxiliaryInfo@parallaxCodecDependency in the aforementioned DASH signaling or parallax_codec_dependency in the aforementioned SMT signaling; and the auxiliary codec dependency field of the video auxiliary information is VideoAuxiliaryInfo@auxilliaryCodecDependency in DASH signaling or auxl_codec_dependency in SMT signaling.
In this embodiment, by identifying whether the auxiliary information codec dependency field is the depth codec dependency field, the parallax codec dependency field, or the auxiliary codec dependency field, it can be promptly determined during transmission whether the auxiliary information is depth information, parallax information, or application-defined auxiliary information.
In some embodiments, the media production device may also add a video auxiliary identification field to the video auxiliary information descriptor and set its value according to the type of information in the corresponding video resource.
In some embodiments, the video auxiliary information descriptor includes a video auxiliary identification field. When the value of the video auxiliary identification field is a thirteenth value, it indicates that the corresponding video resource contains only the video auxiliary information corresponding to the video image information; when the value of the field is a fourteenth value, it indicates that the corresponding video resource contains only video image information.
Specifically, if the video resource described by the video auxiliary information descriptor contains only the video auxiliary information corresponding to the video image information, the video auxiliary identification field may be set to the thirteenth value (for example, 1); if the video resource contains only video image information, the field may be set to the fourteenth value (for example, 0).
In this embodiment, the value of the video auxiliary identification field can accurately indicate during transmission whether the corresponding video resource contains only video auxiliary information or only video image information.
In some embodiments, the video auxiliary identification field is any one of the following: the depth-information-only flag field of the video auxiliary information, the parallax-information-only flag field of the video auxiliary information, or the auxiliary-information-only flag field of the video auxiliary information. When the video auxiliary identification field is the depth-information-only flag field, the video image information includes texture information and the video auxiliary information includes depth information; when it is the parallax-information-only flag field, the video auxiliary information includes parallax information; when it is the auxiliary-information-only flag field, the video auxiliary information includes depth information, parallax information, or auxiliary information defined by other applications.
Here, the depth-information-only flag field of the video auxiliary information is VideoAuxiliaryInfo@onlyDepthInfo in the aforementioned DASH signaling or only_depth_info_flag in SMT signaling; the parallax-information-only flag field is VideoAuxiliaryInfo@onlyParallaxInfo in the aforementioned DASH signaling or only_parallax_info_flag in SMT signaling; and the auxiliary-information-only flag field is VideoAuxiliaryInfo@onlyAuxiliaryInfo in the aforementioned DASH signaling or only_auxiliary_info_flag in the aforementioned SMT signaling.
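The DASH attribute / SMT flag pairs named in this description can be summarized in one lookup table. The dict below only restates the field names already given in the text (including the 'auxilliaryCodecDependency' spelling as written); the dict layout and semantic keys are illustrative.

```python
# Summary sketch: DASH signaling attribute vs. SMT signaling flag for each
# semantic field named in this description. Only the names are from the
# text; the table structure itself is illustrative.

FIELD_NAME_MAP = {
    # semantic field: (DASH signaling attribute, SMT signaling flag)
    "with_depth_info":     ("VideoAuxiliaryInfo@withDepthInfo",     "video_with_depth_info_flag"),
    "with_parallax_info":  ("VideoAuxiliaryInfo@withParallaxInfo",  "video_with_parallax_info_flag"),
    "with_auxiliary_info": ("VideoAuxiliaryInfo@withAuxiliaryInfo", "video_with_auxl_info_flag"),
    "depth_codec_dependency":     ("VideoAuxiliaryInfo@depthCodecDependency",     "depth_codec_dependency"),
    "parallax_codec_dependency":  ("VideoAuxiliaryInfo@parallaxCodecDependency",  "parallax_codec_dependency"),
    "auxiliary_codec_dependency": ("VideoAuxiliaryInfo@auxilliaryCodecDependency", "auxl_codec_dependency"),
    "only_depth_info":     ("VideoAuxiliaryInfo@onlyDepthInfo",     "only_depth_info_flag"),
    "only_parallax_info":  ("VideoAuxiliaryInfo@onlyParallaxInfo",  "only_parallax_info_flag"),
    "only_auxiliary_info": ("VideoAuxiliaryInfo@onlyAuxiliaryInfo", "only_auxiliary_info_flag"),
}
```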
In this embodiment, depending on whether the video auxiliary identification field is the depth-information-only flag field, the parallax-information-only flag field, or the auxiliary-information-only flag field, it is constrained that only one type of auxiliary information is involved in the transmission.
In some embodiments, the signaling description file further includes at least one free-view camera descriptor used to describe the camera information corresponding to a video resource. The free-view camera descriptor may include a free-view camera source identification field, whose value indicates the content identifier to which the corresponding video resource belongs, where the video resources corresponding to different cameras of the same video content share the same content identifier; alternatively, the free-view camera descriptor may include a free-view camera combination identification field, whose value indicates the combination identifier to which the corresponding video resource belongs, where the video resources corresponding to different cameras of the same video content share the same combination identifier.
Here, when the transmission signaling is DASH signaling, as follows from the aforementioned extensions to DASH signaling, a free-view camera descriptor may be located at the representation level to describe the camera information corresponding to the respective video resource; in this case, the number of free-view camera descriptors in the signaling description file equals the number of video resources (representations) in the media file. Alternatively, a free-view camera descriptor may be located at the adaptation set level to describe the camera information corresponding to each video resource in the corresponding adaptation set; in this case, the number of free-view camera descriptors in the signaling description file equals the number of adaptation sets in the media file.
Specifically, the media production device may add a free-view camera source identification field to the free-view camera descriptor and store in it the content identifier to which the video resource described by the descriptor belongs; alternatively, the media production device may add a free-view camera combination identification field to the free-view camera descriptor and store in it the combination identifier to which the video resource described by the descriptor belongs. In addition, the media production device may add one or more AvsFreeViewCam (free-view camera) fields to the free-view camera descriptor and store in them the camera information corresponding to the respective representation or adaptation set. It may also add AvsFreeViewCam@cameraId (a free-view camera identification field) to the free-view camera descriptor and store in it the camera identifier of each camera corresponding to the video resource described by the descriptor. Furthermore, according to the fields mentioned in the aforementioned Table 4, other information about each camera corresponding to the video resource may also be added to the free-view camera descriptor, such as the x, y, and z component values of the camera position and the x, y, and z component values of the camera focal length.
In this embodiment, by extending the free-view camera descriptor with the free-view camera source identification field and the free-view camera combination identification field, the camera information corresponding to a video resource can be described accurately and comprehensively.
Therefore, when the transmission mode of the media file is streaming, adding corresponding information to the signaling description file to indicate whether each video resource in the media file contains both video image information and video auxiliary information, whether decoding dependencies exist between the various kinds of information in a video resource, and whether decoding dependencies exist between different components of the video content of the immersive media, enables the client on the media consumption device, during free-view video consumption, to select the appropriate resources for consumption according to the indications in the signaling file and to decide whether to perform segmented decoding, which can effectively improve the playback performance of the client.
Please refer to Figure 3, which is a schematic flowchart of another immersive media processing method provided by an embodiment of the present application. This immersive media processing method may be executed by a media consumption device, or by a client on the media consumption device; for ease of description, the following takes the media consumption device executing the method as an example. As shown in Figure 3, the immersive media processing method may include the following steps S301-S302:
S301: Obtain a target video track in a media file of immersive media.
其中，沉浸媒体通过多个相机拍摄得到，任一相机通过拍摄所采集到的信息包括：视频图像信息和相应的视频辅助信息；且媒体文件是通过将沉浸媒体的视频位流封装到M个视频轨道中得到的。M为正整数，在M个视频轨道中，存在至少一个视频轨道中的视频媒体头数据盒包括元数据信息，该元数据信息用于指示相应视频轨道是否同时包含视频图像信息和相应的视频辅助信息。可选的，在M>1的情况下，M个视频轨道中封装了视频图像信息的视频轨道还包含轨道组数据盒，该轨道组数据盒包括轨道组标识字段和相机标识字段；其中，轨道组标识字段用于存储相应视频轨道所对应的轨道组的标识信息，相机标识字段用于存储相应视频轨道对应的相机的标识信息。The immersive media is captured by multiple cameras, and the information collected by any camera includes video image information and corresponding video auxiliary information; the media file is obtained by encapsulating the video bitstream of the immersive media into M video tracks, where M is a positive integer. Among the M video tracks, at least one video track has a video media header data box that includes metadata information, and the metadata information indicates whether the corresponding video track contains both video image information and the corresponding video auxiliary information. Optionally, when M>1, each video track that encapsulates video image information among the M video tracks further contains a track group data box, which includes a track group identification field and a camera identification field; the track group identification field stores the identification information of the track group corresponding to the video track, and the camera identification field stores the identification information of the camera corresponding to the video track.
在M>1的情况下，若同一个相机的视频图像信息和视频辅助信息被封装到不同的视频轨道，则M个视频轨道包括：一个或多个视频图像信息轨道和一个或多个视频辅助信息轨道；任一个视频辅助信息轨道包括轨道参考类型数据盒，轨道参考类型数据盒用于将视频辅助信息轨道参考或关联到相同相机对应的视频图像信息轨道，且用于指示视频辅助信息轨道和对应的视频图像信息轨道，在解码时是否存在依赖关系。具体的，轨道参考类型数据盒包含轨道标识字段，该轨道标识字段用于存储与视频辅助信息轨道所关联的视频图像信息轨道的标识信息。另外，轨道参考类型数据盒包含参考类型字段；当参考类型字段存储第一轨道索引时，指示视频辅助信息轨道在解码时，依赖对应的视频图像信息轨道；当参考类型字段存储第二轨道索引时，指示视频辅助信息轨道在解码时，不依赖对应的视频图像信息轨道。When M>1, if the video image information and the video auxiliary information of the same camera are encapsulated into different video tracks, the M video tracks include one or more video image information tracks and one or more video auxiliary information tracks. Any video auxiliary information track includes a track reference type data box, which references or associates the video auxiliary information track with the video image information track corresponding to the same camera, and indicates whether a decoding dependency exists between the video auxiliary information track and the corresponding video image information track. Specifically, the track reference type data box contains a track identification field, which stores the identification information of the video image information track associated with the video auxiliary information track. In addition, the track reference type data box contains a reference type field; when the reference type field stores the first track index, it indicates that the video auxiliary information track depends on the corresponding video image information track during decoding; when the reference type field stores the second track index, it indicates that the video auxiliary information track does not depend on the corresponding video image information track during decoding.
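The reference-type decision described above can be sketched as follows. This is only an illustrative sketch: the two track index tag values used here are assumptions for the example (only 'vdpi' appears later in this document, as the independent case), and the actual values are defined by the encapsulation standard.

```python
# Illustrative sketch of the track-reference decision described above.
# The two tag strings are placeholders for the "first" and "second"
# track index values; the real values come from the encapsulation standard.

REF_DEPENDENT = "vdpd"    # assumed tag: aux track depends on the image track
REF_INDEPENDENT = "vdpi"  # assumed tag: aux track decodes independently

def aux_track_depends_on_image(reference_type: str) -> bool:
    """True if the video auxiliary information track depends on its
    associated video image information track during decoding."""
    if reference_type == REF_DEPENDENT:
        return True
    if reference_type == REF_INDEPENDENT:
        return False
    raise ValueError(f"unknown reference type: {reference_type!r}")
```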
在步骤S301的一些实施例中,若媒体制作设备支持完整文件的下载服务,则媒体消费设备可从媒体制作设备处下载沉浸媒体的媒体文件。在下载媒体文件后,媒体消费设备可对媒体文件进行解封装处理,得到M个视频轨道。然后,媒体消费设备可从M个视频轨道中选取目标视频轨道。具体的,若沉浸媒体是自由视角视频,则媒体消费设备可确定与用户角度相匹配的目标相机,并从M个视频轨道中与目标相机对应的视频轨道作为目标视频轨道,目标视频轨道中封装有目标相机所采集到的视频图像信息和视频深度信息中的一种或多种。In some embodiments of step S301, if the media production device supports a complete file download service, the media consumption device may download the media file of the immersive media from the media production device. After downloading the media file, the media consumption device can decapsulate the media file to obtain M video tracks. The media consumption device can then select a target video track from the M video tracks. Specifically, if the immersive media is a free-view video, the media consumption device can determine a target camera that matches the user's angle, and select the video track corresponding to the target camera from the M video tracks as the target video track, and encapsulate it in the target video track. There is one or more of video image information and video depth information collected by the target camera.
在步骤S301的一些实施例中，若媒体制作设备支持流化传输，则媒体消费设备可与媒体制作设备建立流化传输，并从媒体制作设备处获取沉浸媒体的媒体文件对应的信令描述文件，该信令描述文件中包括至少一个视频辅助信息描述子，视频辅助信息描述子用于描述视频资源的视频辅助信息的相关属性。然后，媒体消费设备可根据信令描述文件的指示，请求媒体制作设备下发目标视频资源；在接收到目标视频资源后，媒体消费设备可对目标视频资源进行解封装处理，得到目标视频轨道。其中，当沉浸媒体为自由视角视频时，媒体消费设备可根据信令描述文件的指示，并结合用户角度来确定目标视频资源，具体如下：In some embodiments of step S301, if the media production device supports streaming transmission, the media consumption device can establish a streaming session with the media production device and obtain from it the signaling description file corresponding to the media file of the immersive media. The signaling description file includes at least one video auxiliary information descriptor, which describes the relevant attributes of the video auxiliary information of a video resource. Then, the media consumption device can request the media production device to deliver the target video resource according to the indications in the signaling description file; after receiving the target video resource, the media consumption device can decapsulate the target video resource to obtain the target video track. When the immersive media is a free-view video, the media consumption device can determine the target video resource according to the indications in the signaling description file combined with the user's viewing angle, as follows:
媒体消费设备可根据用户角度以及信令描述文件中的各个自由视角相机描述子所描述的相机位置信息，从多个相机中选择出与用户角度相匹配的目标相机，并基于信令描述文件中的各个自由视角相机描述子中的自由视角相机标识字段所存储的相机标识符，确定出目标相机对应的自由视角相机描述子。然后，可从信令描述文件中确定与自由视角相机描述子对应的视频辅助信息描述子，并根据确定出的视频辅助信息描述子中的辅助信息指示字段的取值，确定相应视频资源中是否同时包含视频图像信息和视频辅助信息，从而可根据第一确定结果确定目标视频资源。The media consumption device can select, from the multiple cameras, a target camera that matches the user's viewing angle based on the user's angle and the camera position information described by each free-view camera descriptor in the signaling description file, and then determine the free-view camera descriptor corresponding to the target camera based on the camera identifier stored in the free-view camera identification field of each free-view camera descriptor. Next, the video auxiliary information descriptor corresponding to that free-view camera descriptor can be determined from the signaling description file, and based on the value of the auxiliary information indication field in the determined video auxiliary information descriptor, it can be determined whether the corresponding video resource contains both video image information and video auxiliary information, so that the target video resource can be determined based on this first determination result.
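The camera-matching step above can be sketched as a simple nearest-position heuristic. Both the heuristic and the field layout (`camera_id`, `x`, `y`, `z`) are assumptions for illustration; the actual matching policy is left to the client implementation.

```python
import math

def pick_target_camera(user_pos, cameras):
    """Choose the camera whose position is closest to the user viewpoint.

    user_pos: (x, y, z) tuple for the user's viewing position.
    cameras:  list of dicts with assumed keys "camera_id", "x", "y", "z",
              mirroring the camera identifier and position information
              carried by the free-view camera descriptors.
    """
    def dist(cam):
        return math.dist(user_pos, (cam["x"], cam["y"], cam["z"]))
    return min(cameras, key=dist)["camera_id"]
```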
由前述可知，当辅助信息指示字段的取值为第九数值时，指示相应视频资源同时包含视频图像信息和相应的视频辅助信息；当辅助信息指示字段的取值为第十数值时，指示相应视频资源仅包含视频图像信息。那么相应的，若第一确定结果指示相应视频资源中同时包含视频图像信息和视频辅助信息，则媒体消费设备可将确定出的视频辅助信息描述子所描述的视频资源确定为目标视频资源。若第一确定结果指示相应视频资源中仅包含视频图像信息，则由前述可知，此情况下的视频辅助信息描述子包括辅助信息编解码依赖性字段，那么媒体消费设备可进一步根据确定出的视频辅助信息描述子中的辅助信息编解码依赖性字段的取值，确定相应的视频图像信息和视频辅助信息是否存在解码依赖关系，从而可根据第二确定结果确定目标视频资源。As described above, when the auxiliary information indication field takes the ninth value, it indicates that the corresponding video resource contains both video image information and the corresponding video auxiliary information; when it takes the tenth value, it indicates that the corresponding video resource contains only video image information. Accordingly, if the first determination result indicates that the corresponding video resource contains both video image information and video auxiliary information, the media consumption device may determine the video resource described by the determined video auxiliary information descriptor as the target video resource. If the first determination result indicates that the corresponding video resource contains only video image information, then, as described above, the video auxiliary information descriptor in this case includes the auxiliary information codec dependency field, and the media consumption device can further determine, based on the value of that field, whether a decoding dependency exists between the corresponding video image information and video auxiliary information, so that the target video resource can be determined based on this second determination result.
由前述可知，当辅助信息编解码依赖性字段的取值为第十一数值时，指示相应视频资源中的视频图像信息和相应的视频辅助信息之间存在解码依赖关系；当辅助信息编解码依赖性字段的取值为第十二数值时，指示相应视频资源中的视频图像信息和相应的视频辅助信息之间不存在解码依赖关系。那么相应的，若第二确定结果指示相应视频资源中的视频图像信息和相应的视频辅助信息之间不存在解码依赖关系，则媒体消费设备可将确定出的视频辅助信息描述子所描述的视频资源作为目标视频资源；若第二确定结果指示相应视频资源中的视频图像信息和相应的视频辅助信息之间存在解码依赖关系，则媒体消费设备可将确定出的视频辅助信息描述子所描述的视频资源，以及与该视频资源关联的其他视频资源，均作为目标视频资源。此处的其他视频资源是指：与确定出的视频辅助信息描述子所描述的视频资源中的视频图像信息相对应的视频辅助信息所在的视频资源。As described above, when the auxiliary information codec dependency field takes the eleventh value, it indicates that a decoding dependency exists between the video image information in the corresponding video resource and the corresponding video auxiliary information; when it takes the twelfth value, it indicates that no such decoding dependency exists. Accordingly, if the second determination result indicates that no decoding dependency exists between the video image information in the corresponding video resource and the corresponding video auxiliary information, the media consumption device may take the video resource described by the determined video auxiliary information descriptor as the target video resource; if the second determination result indicates that such a decoding dependency exists, the media consumption device may take both the video resource described by the determined video auxiliary information descriptor and the other video resources associated with it as the target video resources. The other video resources here refer to the video resources that carry the video auxiliary information corresponding to the video image information in the video resource described by the determined video auxiliary information descriptor.
其中，媒体消费设备可根据信令描述文件中的各个视频辅助信息描述子中的辅助信息标识字段的取值，来识别各个视频资源包含的是视频图像信息，还是视频图像信息对应的视频辅助信息，从而根据识别结果确定与确定出的视频辅助信息描述子所描述的视频资源关联的其他视频资源。其中，当辅助信息标识字段的取值为第十三数值时，指示相应视频资源仅包含视频图像信息对应的视频辅助信息；当辅助信息标识字段的取值为第十四数值时，指示相应视频资源仅包含视频图像信息。The media consumption device can identify, based on the value of the auxiliary information identification field in each video auxiliary information descriptor in the signaling description file, whether each video resource contains video image information or the video auxiliary information corresponding to video image information, and thereby determine the other video resources associated with the video resource described by the determined video auxiliary information descriptor. When the auxiliary information identification field takes the thirteenth value, it indicates that the corresponding video resource contains only the video auxiliary information corresponding to video image information; when it takes the fourteenth value, it indicates that the corresponding video resource contains only video image information.
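The resource-selection flow described in the last few paragraphs can be sketched as follows. The field names mirror the descriptor attributes in the text (with/without depth info, codec dependency), but the dict layout and the helper signature are assumptions, not the normative signaling format.

```python
def select_target_resources(descriptor, associated_depth_resource=None):
    """Pick the resources a client should request for one target camera.

    descriptor: assumed dict for one video auxiliary information descriptor,
        with keys "resource_id", "with_depth_info" (1/0) and, when
        with_depth_info == 0, "depth_codec_dependency" (1/0).
    associated_depth_resource: id of the separate resource that carries the
        corresponding auxiliary (depth) information, if any.
    """
    targets = [descriptor["resource_id"]]
    if descriptor["with_depth_info"] == 1:
        # Image and auxiliary information travel in one resource.
        return targets
    if descriptor.get("depth_codec_dependency") == 1 and associated_depth_resource:
        # Depth lives in a separate resource and decoding depends on it:
        # request the associated resource as well.
        targets.append(associated_depth_resource)
    return targets
```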
在本实施例中，媒体消费设备通过与媒体制作设备建立流化传输，能够根据获取得到的信令描述文件，实时请求媒体制作设备下发目标视频资源，并通过对目标视频资源进行解封装处理，能够迅速且准确地得到目标视频轨道，提高了解封装的效率。In this embodiment, by establishing streaming transmission with the media production device, the media consumption device can request the media production device in real time to deliver the target video resource based on the obtained signaling description file, and by decapsulating the target video resource, it can obtain the target video track quickly and accurately, improving decapsulation efficiency.
S302,对目标视频轨道中的信息进行解码处理。S302. Decode the information in the target video track.
在一些实施例中，若媒体消费设备是通过下载整个媒体文件的方式，获取到目标视频轨道的，则媒体消费设备可根据目标视频轨道的视频媒体头数据盒中的元数据信息，对目标视频轨道中的信息进行解码处理。具体的，元数据信息可包括内容信息标志字段，由前述可知，当内容信息标志字段的取值为第一数值时，指示相应视频轨道同时包含视频图像信息和相应的视频辅助信息；当内容信息标志字段的取值为第二数值时，指示相应视频轨道仅包含视频图像信息。因此，媒体消费设备可先根据目标视频轨道中的视频媒体头数据盒中的内容信息标志字段的取值，确定目标视频轨道中是否同时包含视频图像信息和相应的视频辅助信息；若确定目标视频轨道仅包含视频图像信息，则媒体消费设备可对该视频图像信息进行解码处理。In some embodiments, if the media consumption device obtains the target video track by downloading the entire media file, the media consumption device can decode the information in the target video track based on the metadata information in the video media header data box of the target video track. Specifically, the metadata information may include a content information flag field. As described above, when the content information flag field takes the first value, it indicates that the corresponding video track contains both video image information and the corresponding video auxiliary information; when it takes the second value, it indicates that the corresponding video track contains only video image information. Therefore, the media consumption device can first determine, based on the value of the content information flag field in the video media header data box of the target video track, whether the target video track contains both video image information and the corresponding video auxiliary information; if the target video track contains only video image information, the media consumption device can decode that video image information.
若确定目标视频轨道同时包含视频图像信息和对应的视频辅助信息，则由前述可知，此时的目标视频轨道中的视频媒体头数据盒还包括视频信息编解码依赖性字段，且当视频信息编解码依赖性字段的取值为第三数值时，指示相应视频轨道中的视频图像信息和相应的视频辅助信息之间存在解码依赖关系，当视频信息编解码依赖性字段的取值为第四数值时，指示相应视频轨道中的视频图像信息和相应的视频辅助信息之间不存在解码依赖关系。媒体消费设备可先根据目标视频轨道中的视频媒体头数据盒中的视频信息编解码依赖性字段的取值，确定目标视频轨道中的视频图像信息和相应的视频辅助信息之间是否存在解码依赖关系，从而确定是否对目标视频轨道中的视频图像信息和视频辅助信息进行分段解码处理。具体的，若存在解码依赖关系，则可对目标视频轨道中的视频图像信息和视频辅助信息进行同时解码处理；若不存在解码依赖关系，则可对目标视频轨道中的视频图像信息和视频辅助信息进行分段解码处理，即某一阶段解码纹理信息、某一阶段解码深度信息。当然，当不存在解码依赖关系时，也可对视频图像信息和视频辅助信息进行同时解码处理，对此不作限定。If it is determined that the target video track contains both video image information and the corresponding video auxiliary information, then, as described above, the video media header data box in the target video track further includes a video information codec dependency field. When this field takes the third value, it indicates that a decoding dependency exists between the video image information in the corresponding video track and the corresponding video auxiliary information; when it takes the fourth value, it indicates that no such decoding dependency exists. The media consumption device can first determine, based on the value of the video information codec dependency field in the video media header data box of the target video track, whether a decoding dependency exists between the video image information and the corresponding video auxiliary information in the target video track, and thereby decide whether to decode the video image information and the video auxiliary information in the target video track in stages. Specifically, if a decoding dependency exists, the video image information and the video auxiliary information in the target video track can be decoded jointly; if no decoding dependency exists, they can be decoded in stages, that is, texture information is decoded at one stage and depth information at another. Of course, when no decoding dependency exists, the video image information and the video auxiliary information can also be decoded jointly, which is not limited here.
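The decode-scheduling decision above can be summarized in a small sketch. The flag names follow the fields discussed in this document (`video_with_depth_info_flag`, `depth_codec_dependency`), but the function and its return strings are illustrative, not part of the specification.

```python
def plan_decoding(video_with_depth_info_flag, depth_codec_dependency):
    """Return a coarse decoding plan for one video track.

    video_with_depth_info_flag: 1 if the track carries both image (texture)
        and auxiliary (depth) information, 0 if image only.
    depth_codec_dependency: 1 if depth decoding depends on texture.
    """
    if video_with_depth_info_flag == 0:
        return "decode image only"
    if depth_codec_dependency == 1:
        # Texture and depth must be decoded together.
        return "decode image and depth jointly"
    # Independent coding: the client may decode texture and depth in
    # separate stages (joint decoding also remains possible).
    return "decode image and depth in stages"
```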
在另一些实施例中，若媒体消费设备是通过流化传输的方式获取目标视频资源，并根据目标视频资源获取到目标视频轨道的，由于媒体消费设备在获取目标视频资源时，已根据信令描述文件中的相关信息知道了目标视频资源是否同时包含视频图像信息和视频辅助信息，以及知道了在目标视频资源同时包含视频图像信息和视频辅助信息的情况下，该视频图像信息和视频辅助信息之间是否存在解码依赖关系，因此，媒体消费设备在获取到目标视频轨道后，可直接采用相应逻辑对目标视频轨道中的信息进行解码处理。当然在其他实施例中，媒体消费设备可根据目标视频轨道的视频媒体头数据盒中的元数据信息，对目标视频轨道中的信息进行解码处理，其解码逻辑与上述的解码逻辑相同。In other embodiments, if the media consumption device obtains the target video resource through streaming transmission and obtains the target video track from the target video resource, then, when requesting the target video resource, the media consumption device already knows from the relevant information in the signaling description file whether the target video resource contains both video image information and video auxiliary information and, if so, whether a decoding dependency exists between them. Therefore, after obtaining the target video track, the media consumption device can directly apply the corresponding logic to decode the information in the target video track. Of course, in other embodiments, the media consumption device may decode the information in the target video track based on the metadata information in the video media header data box of the target video track, using the same decoding logic as described above.
可选的，由前述可知，当一个视频轨道对应至少两个相机时，视频轨道中还可包括视频成分依赖信息数据盒，该视频成分依赖信息数据盒可包括成分编解码依赖性字段和成分类型字段；那么相应的，若目标视频轨道中包含该视频成分依赖信息数据盒，则媒体消费设备还可根据该视频成分依赖信息数据盒中的成分类型字段的取值获知沉浸媒体的视频内容的成分类型，以及根据成分编解码依赖性字段的取值来获知沉浸媒体的视频内容的不同成分之间是否存在编解码依赖关系，从而结合获知结果确定是否可对目标视频轨道中的各个相机对应的信息进行独立解码。其中，当成分编解码依赖性字段的取值为第五数值时，指示沉浸媒体的视频内容的不同成分之间存在编解码依赖关系；当成分编解码依赖性字段的取值为第六数值时，指示沉浸媒体的视频内容的不同成分之间不存在编解码依赖关系。当成分类型字段的取值为第七数值时，指示沉浸媒体的视频内容的成分类型为基于相机的成分类型；当成分类型字段的取值为第八数值时，指示所述沉浸媒体的视频内容的成分类型为基于双目的成分类型。Optionally, as described above, when one video track corresponds to at least two cameras, the video track may further include a video composition dependency information data box, which may include a composition codec dependency field and a composition type field. Accordingly, if the target video track contains this data box, the media consumption device can learn the composition type of the video content of the immersive media from the value of the composition type field, and learn from the value of the composition codec dependency field whether a codec dependency exists between different compositions of the video content of the immersive media, and can then combine these results to determine whether the information corresponding to each camera in the target video track can be decoded independently. When the composition codec dependency field takes the fifth value, it indicates that a codec dependency exists between different compositions of the video content of the immersive media; when it takes the sixth value, it indicates that no such dependency exists. When the composition type field takes the seventh value, it indicates that the composition type of the video content of the immersive media is a camera-based composition type; when it takes the eighth value, it indicates that the composition type is a binocular-based composition type.
本申请实施例可避免媒体消费设备在解封装过程中重复解析其他数据盒中用于指示视频轨道内的信息情况的相关字段,从而节省处理资源,提升解封装效率,进而提升解码效率。Embodiments of the present application can avoid the media consumption device from repeatedly parsing the relevant fields in other data boxes used to indicate the information in the video track during the decapsulation process, thereby saving processing resources, improving decapsulation efficiency, and thus improving decoding efficiency.
基于上述描述，以沉浸媒体为自由视角视频为例，且以媒体制作设备为服务器，且由媒体消费设备中的客户端进行媒体消费为例，本申请实施例示例性地提出了如下三种应用场景：Based on the above description, taking the immersive media being a free-view video as an example, with the media production device being a server and media consumption being performed by the client in the media consumption device, the embodiments of this application exemplarily propose the following three application scenarios:
(一)应用场景1:单轨封装:(1) Application scenario 1: Monorail packaging:
1、服务器：生成自由视角视频的视频位流，并将该视频位流封装为媒体文件F0，且自由视角视频位流以单轨模式进行封装，如图4a。1. Server: generates the video bitstream of the free-view video and encapsulates it into a media file F0, and the free-view video bitstream is encapsulated in single-track mode, as shown in Figure 4a, where:
video_with_depth_info_flag=1,表明该视频轨道为包含纹理信息和深度信息的轨道;video_with_depth_info_flag=1, indicating that the video track is a track containing texture information and depth information;
depth_codec_dependency=0,表明该视频轨道的深度信息在解码时不依赖纹理信息;depth_codec_dependency=0, indicating that the depth information of the video track does not depend on texture information during decoding;
composition_type=0,表明该视频轨道的视频内容包含不同相机拍摄的视频内容;composition_type=0, indicating that the video content of this video track contains video content shot by different cameras;
composition_codec_dependency=0,表明该视频轨道中不同相机拍摄的视频内容之间在解码时并不相互依赖。composition_codec_dependency=0, indicating that the video content captured by different cameras in the video track does not depend on each other during decoding.
2、服务器,如图4b所示:2. Server, as shown in Figure 4b:
服务器S1根据1中的信息,进一步将F0重新封装为多视频轨道的自由视角文件F1,对多轨模式的处理可参见后续应用场景2和应用场景3的相关描述。Based on the information in 1, server S1 further repackages F0 into a multi-video track free-view file F1. For the processing of multi-track mode, please refer to the relevant descriptions of subsequent application scenarios 2 and 3.
服务器S2提供下载服务,对自由视角视频的媒体文件F0不做处理,供用户下载。The server S2 provides a download service and does not process the media file F0 of the free-angle video for users to download.
服务器S3支持流化传输，将自由视角视频的媒体文件F0切片为适合流化传输的媒体片段（按照现有传输标准的规范进行），并生成相应的传输信令，并将传输信令的信令描述文件传输给用户。其中，传输信令的信令描述文件中指示以下信息：Server S3 supports streaming transmission; it slices the media file F0 of the free-view video into media segments suitable for streaming (in accordance with existing transmission standards), generates the corresponding transmission signaling, and transmits the signaling description file of the transmission signaling to the user. The signaling description file of the transmission signaling indicates the following information:
VideoAuxiliaryInfo@withDepthInfo=1,表明视频资源同时包含纹理信息(图像信息)以及纹理信息(图像信息)对应的深度信息;VideoAuxiliaryInfo@withDepthInfo=1, indicating that the video resource contains both texture information (image information) and depth information corresponding to the texture information (image information);
VideoAuxiliaryInfo@depthCodecDependency=0,表明相应视频资源中包含的纹理信息(图像信息)和纹理信息(图像信息)对应的深度信息之间不存在解码依赖关系。VideoAuxiliaryInfo@depthCodecDependency=0, indicating that there is no decoding dependency between the texture information (image information) contained in the corresponding video resource and the depth information corresponding to the texture information (image information).
3、客户端:3. Client:
a)客户端C1下载完整文件F0后本地播放。根据文件中关于纹理、深度信息以及编解码依赖信息的指示,客户端可以在不同的阶段进行部分解码,即某一阶段解码纹理信息、某一阶段解码深度信息,以优化客户端播放的性能。a) Client C1 downloads the complete file F0 and plays it locally. According to the instructions in the file about texture, depth information and codec dependency information, the client can perform partial decoding at different stages, that is, decoding texture information at a certain stage and decoding depth information at a certain stage to optimize the performance of client playback.
b)客户端C2与服务器S3建立流化传输，根据传输信令中指示的纹理信息、深度信息以及编解码依赖信息，选择相应的自由视角视频流进行消费。同时，客户端可以在不同的阶段进行部分解码，即某一阶段解码纹理信息、某一阶段解码深度信息，以优化客户端播放的性能。其中，服务器可生成多个媒体文件，一个媒体文件对应一个自由视角视频流，一部分媒体文件中的视频图像信息和视频辅助信息之间存在解码依赖，另一部分媒体文件中的视频图像信息和视频辅助信息之间不存在解码依赖。b) Client C2 establishes streaming transmission with server S3 and, based on the texture information, depth information, and codec dependency information indicated in the transmission signaling, selects the corresponding free-view video stream for consumption. Meanwhile, the client can perform partial decoding at different stages, that is, decoding texture information at one stage and depth information at another, to optimize the client's playback performance. The server may generate multiple media files, each corresponding to one free-view video stream; in some of the media files a decoding dependency exists between the video image information and the video auxiliary information, while in the others no such decoding dependency exists.
(二)应用场景2:多轨封装,且每个视频轨道包含一个相机拍摄的纹理信息和深度信息:(2) Application scenario 2: Multi-track packaging, and each video track contains texture information and depth information captured by a camera:
1、服务器：生成自由视角视频的视频位流并将该视频位流封装为媒体文件F0，且自由视角视频位流以多轨模式进行封装，如图5a。1. Server: generates the video bitstream of the free-view video and encapsulates it into a media file F0, and the free-view video bitstream is encapsulated in multi-track mode, as shown in Figure 5a, where:
其中Track1~Track4中的video_with_depth_info_flag=1,表明这些视频轨道均为包含深度信息的视频轨道;depth_codec_dependency=1,表明这些视频轨道的深度信息均在解码时依赖纹理信息。Among them, video_with_depth_info_flag=1 in Track1~Track4 indicates that these video tracks are all video tracks containing depth information; depth_codec_dependency=1 indicates that the depth information of these video tracks depends on texture information during decoding.
并且,Track1~Track4中都包含轨道组数据盒(AvsFreeViewGroupBox),且track_group_id均为100,说明Track1~Track4为同一个自由视角视频的包含不同相机拍摄数据的K个轨道。Moreover, Track1~Track4 all contain track group data boxes (AvsFreeViewGroupBox), and track_group_id is 100, indicating that Track1~Track4 are K tracks of the same free-view video containing different camera shooting data.
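The track grouping described above (Track1~Track4 sharing track_group_id 100) can be sketched as a simple grouping pass over the parsed track metadata. The dict layout is an assumption for illustration; in practice the fields come from each track's AvsFreeViewGroupBox.

```python
from collections import defaultdict

def group_free_view_tracks(tracks):
    """Group tracks of the same free-view video by track_group_id.

    tracks: list of assumed dicts with keys "track_group_id" and
        "camera_id", mirroring the track group data box fields.
    Returns a mapping track_group_id -> list of camera_ids.
    """
    groups = defaultdict(list)
    for t in tracks:
        groups[t["track_group_id"]].append(t["camera_id"])
    return dict(groups)
```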
2、服务器,如图5b所示:2. Server, as shown in Figure 5b:
服务器S1提供下载服务,对自由视角视频的媒体文件F0不做处理,供用户下载。The server S1 provides a download service and does not process the media file F0 of the free-angle video for users to download.
服务器S2支持流化传输，将自由视角视频的媒体文件F0切片为适合流化传输的媒体片段（按照现有传输标准的规范进行），并生成相应的传输信令，并将传输信令的信令描述文件传输给用户。其中，传输信令的信令描述文件中指示以下信息：Server S2 supports streaming transmission; it slices the media file F0 of the free-view video into media segments suitable for streaming (in accordance with existing transmission standards), generates the corresponding transmission signaling, and transmits the signaling description file of the transmission signaling to the user. The signaling description file of the transmission signaling indicates the following information:
VideoAuxiliaryInfo@withDepthInfo=1,表明视频资源同时包含纹理信息(图像信息)以及纹理信息(图像信息)对应的深度信息;VideoAuxiliaryInfo@withDepthInfo=1, indicating that the video resource contains both texture information (image information) and depth information corresponding to the texture information (image information);
VideoAuxiliaryInfo@depthCodecDependency=1,表明相应视频资源中包含的纹理信息(图像信息)和纹理信息(图像信息)对应的深度信息之间存在解码依赖关系;VideoAuxiliaryInfo@depthCodecDependency=1, indicating that there is a decoding dependency between the texture information (image information) contained in the corresponding video resource and the depth information corresponding to the texture information (image information);
同时,AvsFreeViewCamInfo@sourceId指示不同视频资源属于同一个视频内容,AvsFreeViewCamInfo@cameraId和AvsFreeViewCam@camera_pos_x/y/z,分别指示相机标识符以及相机位置信息。At the same time, AvsFreeViewCamInfo@sourceId indicates that different video resources belong to the same video content, and AvsFreeViewCamInfo@cameraId and AvsFreeViewCam@camera_pos_x/y/z indicate the camera identifier and camera location information respectively.
3、客户端:3. Client:
a)客户端C1下载完整文件F0后本地播放。根据文件中关于纹理、深度信息、编解码依赖信息以及相机信息的指示，客户端可以根据用户的观看操作，仅对特定相机对应的纹理信息和深度信息进行部分解码（此时由于纹理深度信息相互依赖，因此无法分阶段解码），以优化客户端播放的性能。a) Client C1 downloads the complete file F0 and plays it locally. Based on the indications in the file about texture information, depth information, codec dependency information, and camera information, the client can, according to the user's viewing operation, partially decode only the texture information and depth information corresponding to a specific camera (in this case, since the texture and depth information depend on each other, they cannot be decoded in stages), to optimize the client's playback performance.
b)客户端C2与服务器S2建立流化传输，根据传输信令中关于纹理、深度信息、编解码依赖信息以及相机信息的指示，结合用户的观看操作，请求服务器仅对包含特定相机对应的纹理信息和深度信息的视频资源进行传输，并在接收到相应视频资源后进行解码消费。b) Client C2 establishes streaming transmission with server S2 and, based on the indications about texture information, depth information, codec dependency information, and camera information in the transmission signaling, combined with the user's viewing operation, requests the server to transmit only the video resources containing the texture information and depth information corresponding to the specific camera, and decodes and consumes the corresponding video resources after receiving them.
(三)应用场景3:多轨封装,且每个视频轨道包含一个相机拍摄的纹理信息或者深度信息:(3) Application scenario 3: Multi-track packaging, and each video track contains texture information or depth information captured by a camera:
1、服务器：生成自由视角视频的视频位流并将该视频位流封装为媒体文件F0，且自由视角视频位流以多轨模式进行封装，如图6a。1. Server: generates the video bitstream of the free-view video and encapsulates it into a media file F0, and the free-view video bitstream is encapsulated in multi-track mode, as shown in Figure 6a, where:
其中Track1~Track4中的video_with_depth_info_flag=0，表明这些视频轨道均只包含纹理信息。Track11~Track14中handler box中的处理类型为辅助视频轨道，表明这些视频轨道（即视频辅助信息轨道）均用于辅助某个视频图像信息轨道（即纹理图轨道）进行消费。通过TrackReferenceTypeBox中的信息，可以知道Track1~Track4、Track11~Track14这些视频轨道之间的对应关系。且通过轨道索引类型'vdpi'可以知道Track11~Track14包含的视频辅助信息为深度信息，且深度信息的解码不依赖于对应的纹理信息。Here, video_with_depth_info_flag=0 in Track1~Track4 indicates that these video tracks contain only texture information. The handler type in the handler box of Track11~Track14 is auxiliary video track, indicating that these video tracks (i.e., video auxiliary information tracks) are used to assist the consumption of a certain video image information track (i.e., texture map track). Through the information in the TrackReferenceTypeBox, the correspondence between the tracks Track1~Track4 and Track11~Track14 can be determined. Moreover, the track index type 'vdpi' indicates that the video auxiliary information contained in Track11~Track14 is depth information, and that decoding the depth information does not depend on the corresponding texture information.
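The texture-to-depth track correspondence established through the TrackReferenceTypeBox can be sketched as follows. The flat dict layout is an assumption for illustration; in practice the track_id and referenced track_id come from parsing each auxiliary track's reference box, and 'vdpi' is the track reference type named in the text.

```python
def map_depth_to_texture(aux_tracks):
    """Map each depth (auxiliary) track to its referenced texture track.

    aux_tracks: list of assumed dicts with keys "track_id",
        "ref_track_id" (the associated image track), and "ref_type"
        (the track reference type from the TrackReferenceTypeBox).
    Only 'vdpi'-type references are kept, i.e. independently decodable
    depth tracks as in this scenario.
    """
    return {t["track_id"]: t["ref_track_id"]
            for t in aux_tracks
            if t.get("ref_type") == "vdpi"}
```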
并且,Track1~Track4中都包含轨道组数据盒(AvsFreeViewGroupBox),且track_group_id均为100,说明Track1~Track4为同一个自由视角视频的包含不同相机拍摄数据的K个轨道。Moreover, Track1~Track4 all contain track group data boxes (AvsFreeViewGroupBox), and track_group_id is 100, indicating that Track1~Track4 are K tracks of the same free-view video containing different camera shooting data.
2. Server, as shown in Figure 6b:
Server S1 provides a download service; it makes the media file F0 of the free-view video available for users to download without any processing.
Server S2 supports streaming transmission: it slices the media file F0 of the free-view video into media segments suitable for streaming (following the specifications of the existing transport standards), generates the corresponding transport signaling, and delivers the signaling description file of the transport signaling to the user. The signaling description file indicates the following information:
Media resources corresponding to Track1~Track4:
VideoAuxiliaryInfo@withDepthInfo=0, indicating that the corresponding video resources contain only texture information;
AvsFreeViewCamInfo@sourceId=100, indicating that the different video resources belong to the same video content;
AvsFreeViewCamInfo@cameraId and AvsFreeViewCam@camera_pos_x/y/z indicate the camera identifier and the camera position information.
Media resources corresponding to Track11~Track14:
VideoAuxiliaryInfo@onlyDepthInfo=1, indicating that the corresponding video resources contain only depth information;
Meanwhile, through the @dependencyId attribute defined in the prior art, the media resources corresponding to Track1~Track4 can be associated with the media resources corresponding to Track11~Track14.
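As a rough sketch of the signaling entries server S2 would generate, the following models each media resource as a flat attribute dictionary. The attribute names follow this document; the dictionary layout and the helper function are illustrative assumptions, not the actual signaling file format.

```python
# Hypothetical model of the signaling description entries for scenario 3.
representations = []
for cam in range(1, 5):
    representations.append({
        "id": f"texture_cam{cam}",
        "VideoAuxiliaryInfo@withDepthInfo": 0,   # texture only
        "AvsFreeViewCamInfo@sourceId": 100,       # same video content
        "AvsFreeViewCamInfo@cameraId": cam,
    })
    representations.append({
        "id": f"depth_cam{cam}",
        "VideoAuxiliaryInfo@onlyDepthInfo": 1,    # depth only
        "@dependencyId": f"texture_cam{cam}",     # associates depth with its texture
    })

def resources_for_camera(cam_id):
    """Find the texture resource for a camera plus every resource depending on it."""
    tex = next(r for r in representations
               if r.get("AvsFreeViewCamInfo@cameraId") == cam_id)
    deps = [r for r in representations if r.get("@dependencyId") == tex["id"]]
    return [tex] + deps

print([r["id"] for r in resources_for_camera(3)])  # ['texture_cam3', 'depth_cam3']
```

The point of the @dependencyId association is visible in `resources_for_camera`: given only a camera identifier, the client can resolve both the texture resource and its paired depth resource without parsing the media file itself.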
3. Client:
a) Client C1 downloads the complete file F0 and plays it locally. Based on the indications in the file regarding texture information, depth information, codec dependency information, and camera information, the client can partially decode only the texture information and depth information corresponding to specific cameras according to the user's viewing operations. Moreover, since the texture information and the corresponding depth information are coded independently, the client can also decode the texture information and the depth information in stages, so as to optimize playback performance on the client side.
b) Client C2 establishes streaming transmission with server S3. Based on the indications in the transport signaling regarding texture information, depth information, codec dependency information, and camera information, and in combination with the user's viewing operations, it requests the server to transmit only the video resources containing the texture information and depth information corresponding to specific cameras; after receiving those video resources, it can decode and consume the texture information and depth information in stages.
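The per-camera selection and staged decoding described for the clients can be sketched as below. All names are hypothetical stand-ins for the client's real request and decode APIs; only the two-stage structure (texture first, depth afterwards, exploiting their codec independence) reflects the document.

```python
# Illustrative client flow for scenario 3: select only the resources of the
# camera being viewed, then consume texture and depth in separate stages.
def consume_camera(camera_id, resources):
    # Stage 0: keep only the resources tied to the viewed camera
    # (possible because the signaling exposes per-camera information).
    wanted = [r for r in resources if r["camera_id"] == camera_id]
    # Stage 1: decode texture first, so an image can be presented immediately.
    texture = [r["payload"] for r in wanted if r["kind"] == "texture"]
    # Stage 2: decode depth afterwards; no codec dependency on the texture.
    depth = [r["payload"] for r in wanted if r["kind"] == "depth"]
    return texture, depth

resources = [
    {"camera_id": c, "kind": k, "payload": f"{k}{c}"}
    for c in (1, 2) for k in ("texture", "depth")
]
tex, dep = consume_camera(2, resources)
print(tex, dep)  # ['texture2'] ['depth2']
```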
It should be understood that, although the steps in the flowcharts involved in the above embodiments are displayed sequentially as indicated by the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, the execution of these steps is not subject to a strict order restriction, and they may be executed in other orders. Moreover, at least some of the steps in the flowcharts involved in the above embodiments may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different moments; the execution order of these sub-steps or stages is not necessarily sequential either, and they may be executed in turn or alternately with other steps, or with at least part of the sub-steps or stages of other steps.
Based on the above description of the immersive media processing method shown in Figure 2, an embodiment of the present application further discloses an immersive media processing apparatus. The immersive media processing apparatus may be a computer-readable instruction (including program code) running on the above-mentioned media production device. The apparatus may execute the method shown in Figure 2; referring to Figure 7, the immersive media processing apparatus may run the following units:
a processing unit 701, configured to acquire a video bitstream of immersive media, the immersive media being captured by multiple cameras, the information collected by any camera through capture including video image information and corresponding video auxiliary information;
an encapsulation unit 702, configured to encapsulate the video bitstream into M video tracks to obtain a media file of the immersive media, where M is a positive integer and, among the M video tracks, the video media header data box of at least one video track includes metadata information used to indicate whether the corresponding video track contains both video image information and corresponding video auxiliary information.
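A minimal sketch of how the two units of Figure 7 could fit together is given below. The class and method names are assumptions for illustration; only the division of labor (acquire per-camera bitstream, then encapsulate with a per-track flag in the video media header box) mirrors the document.

```python
# Hypothetical sketch of the production-side apparatus of Figure 7.
class ProcessingUnit:                     # unit 701
    def acquire_bitstream(self, cameras):
        # Each camera contributes video image info plus video auxiliary info.
        return [{"camera": c, "image": f"img{c}", "aux": f"depth{c}"}
                for c in cameras]

class EncapsulationUnit:                  # unit 702
    def encapsulate(self, bitstream, split_tracks=False):
        tracks = []
        for item in bitstream:
            if split_tracks:
                # Image and auxiliary info go into separate tracks.
                tracks.append({"vmhd": {"with_aux_info": 0},
                               "data": [item["image"]]})
                tracks.append({"handler": "auxv", "data": [item["aux"]]})
            else:
                # One track carries both kinds; flag it in the media header box.
                tracks.append({"vmhd": {"with_aux_info": 1},
                               "data": [item["image"], item["aux"]]})
        return tracks

media_file = EncapsulationUnit().encapsulate(
    ProcessingUnit().acquire_bitstream([1, 2]))
print(len(media_file), media_file[0]["vmhd"]["with_aux_info"])  # 2 1
```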
Based on the above description of the immersive media processing method shown in Figure 3, an embodiment of the present application further discloses an immersive media processing apparatus. The immersive media processing apparatus may be a computer-readable instruction (including program code) running on the above-mentioned media consumption device. The apparatus may execute the method shown in Figure 3; referring to Figure 8, the immersive media processing apparatus may run the following units:
an acquisition unit 801, configured to acquire a target video track in a media file of immersive media, the immersive media being captured by multiple cameras, the information collected by any camera through capture including video image information and corresponding video auxiliary information, and the media file being obtained by encapsulating the video bitstream of the immersive media into M video tracks, where M is a positive integer and, among the M video tracks, the video media header data box of at least one video track includes metadata information used to indicate whether the corresponding video track contains both video image information and corresponding video auxiliary information;
a decoding unit 802, configured to decode the information in the target video track.
In some embodiments, the media file is produced and encapsulated by a media production device; after the video bitstream is encapsulated into the M video tracks, a video resource corresponding to each video track is obtained, and the media file includes the video resources corresponding to the M video tracks.
Correspondingly, when acquiring the target video track in the media file of the immersive media, the acquisition unit 801 may be specifically configured to:
establish streaming transmission with the media production device;
obtain, from the media production device, a signaling description file corresponding to the media file of the immersive media, the signaling description file including at least one video auxiliary information descriptor, the video auxiliary information descriptor being used to describe relevant attributes of the video auxiliary information of a video resource;
request, according to the indications of the signaling description file, the media production device to deliver a target video resource, and decapsulate the target video resource to obtain the target video track.
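The streaming path of acquisition unit 801 can be sketched as three steps: fetch the signaling description file, use the video auxiliary information descriptors to choose a target resource, then decapsulate it into a track. The function names and the descriptor dictionary shape below are illustrative assumptions.

```python
# Hedged sketch of the consumption-side acquisition flow (unit 801).
def fetch_signaling():
    # Stand-in for receiving the signaling description file from the producer.
    return [
        {"resource": "R1", "VideoAuxiliaryInfo": {"withDepthInfo": 1}},
        {"resource": "R2", "VideoAuxiliaryInfo": {"withDepthInfo": 0}},
    ]

def pick_target(signaling, want_depth):
    # Choose a resource whose descriptor matches the client's needs.
    for entry in signaling:
        if entry["VideoAuxiliaryInfo"]["withDepthInfo"] == int(want_depth):
            return entry["resource"]
    return None

def decapsulate(resource):
    # Stand-in for unpacking the delivered video resource into its track.
    return {"track": f"track_of_{resource}"}

target = pick_target(fetch_signaling(), want_depth=True)
track = decapsulate(target)
print(target, track["track"])  # R1 track_of_R1
```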
According to another embodiment of the present application, the units in the immersive media processing apparatus shown in Figure 7 or Figure 8 may be separately or entirely combined into one or several additional units, or one or some of the units may be further split into multiple functionally smaller units; this can achieve the same operations without affecting the realization of the technical effects of the embodiments of the present application. The above units are divided based on logical functions; in practical applications, the function of one unit may also be realized by multiple units, or the functions of multiple units may be realized by one unit. In other embodiments of the present application, the immersive media processing apparatus may also include other units; in practical applications, these functions may also be realized with the assistance of other units, and may be realized by multiple units in cooperation.
According to another embodiment of the present application, the immersive media processing apparatus shown in Figure 7 or Figure 8 may be constructed, and the immersive media processing method of the embodiments of the present application may be implemented, by running computer-readable instructions (including program code) capable of executing the steps of the corresponding method shown in Figure 2 or Figure 3 on a general-purpose computing device, such as a computer, that includes processing elements and storage elements such as a central processing unit (CPU), a random access memory (RAM), and a read-only memory (ROM). The computer-readable instructions may be recorded on, for example, a computer-readable storage medium, loaded into the above computing device through the computer-readable storage medium, and run therein.
In the embodiments of the present application, a video bitstream of immersive media can be acquired and encapsulated into M video tracks to obtain a media file of the immersive media. In this process, metadata information can be encapsulated in the video media header data box of at least one video track to give a unified indication of the information contained in the corresponding video track. This makes it possible to organize the media content of immersive media in a more optimized way and prevents the media consumption device from repeatedly parsing, during decapsulation, the fields in other data boxes that indicate the information contained in a video track, thereby saving processing resources, improving decapsulation efficiency, and in turn improving decoding efficiency.
Based on the description of the above method embodiments and apparatus embodiments, an embodiment of the present application further provides a media processing device, which may be the above-mentioned media production device or media consumption device. Referring to Figure 9, the media processing device includes at least a processor 901, an input interface 902, an output interface 903, and a computer-readable storage medium 904, which may be connected by a bus or in other ways. The computer-readable storage medium 904 may be stored in a memory of the media processing device and is used to store computer-readable instructions, which include program instructions; the processor 901 is used to execute the program instructions stored in the computer-readable storage medium 904. The processor 901 (or CPU, Central Processing Unit) is the computing core and control core of the media processing device; it is adapted to implement one or more computer-readable instructions, and is specifically adapted to load and execute one or more computer-readable instructions so as to implement the corresponding method flows or corresponding functions.
In one embodiment, when the media processing device is a media production device, the processor 901 described in the embodiments of the present application may be used to perform a series of immersive media processing operations shown in Figure 2, specifically including: acquiring a video bitstream of immersive media, the immersive media being captured by multiple cameras, the information collected by any camera through capture including video image information and corresponding video auxiliary information; and encapsulating the video bitstream into M video tracks to obtain a media file of the immersive media, where M is a positive integer and, among the M video tracks, the video media header data box of at least one video track includes metadata information used to indicate whether the corresponding video track contains both video image information and corresponding video auxiliary information; and so on.
In another embodiment, when the media processing device is a media consumption device, the processor 901 described in the embodiments of the present application may be used to perform a series of immersive media processing operations shown in Figure 3, specifically including: acquiring a target video track in a media file of immersive media, the immersive media being captured by multiple cameras, the information collected by any camera through capture including video image information and corresponding video auxiliary information, and the media file being obtained by encapsulating the video bitstream of the immersive media into M video tracks, where M is a positive integer and, among the M video tracks, the video media header data box of at least one video track includes metadata information used to indicate whether the corresponding video track contains both video image information and corresponding video auxiliary information; and decoding the information in the target video track; and so on.
An embodiment of the present application further provides a computer-readable storage medium (memory), which is a memory device in the media processing device used to store programs and data. It can be understood that the computer-readable storage medium here may include a built-in storage medium in the media processing device, and may of course also include an extended storage medium supported by the media processing device. The computer-readable storage medium provides storage space, which stores the operating system of the media processing device. Furthermore, one or more computer-readable instructions suitable for being loaded and executed by the processor are also stored in the storage space; these computer-readable instructions may be one or more computer programs (including program code). It should be noted that the computer-readable storage medium here may be a high-speed RAM memory, or a non-volatile memory such as at least one magnetic disk memory; optionally, it may also be at least one computer-readable storage medium located remotely from the aforementioned processor.
In one embodiment, one or more computer-readable instructions stored in the computer-readable storage medium may be loaded and executed by the processor to implement the corresponding steps of the method in the above embodiment of the immersive media processing method shown in Figure 2. In another embodiment, one or more computer-readable instructions stored in the computer-readable storage medium may be loaded and executed by the processor to implement the corresponding steps of the method in the above embodiment of the immersive media processing method shown in Figure 3.
In the embodiments of the present application, a video bitstream of immersive media can be acquired and encapsulated into M video tracks to obtain a media file of the immersive media. In this process, metadata information can be encapsulated in the video media header data box of at least one video track to give a unified indication of the information contained in the corresponding video track. This makes it possible to organize the media content of immersive media in a more optimized way and prevents the media consumption device from repeatedly parsing, during decapsulation, the fields in other data boxes that indicate the information contained in a video track, thereby saving processing resources, improving decapsulation efficiency, and in turn improving decoding efficiency.
It should be noted that, according to one aspect of the present application, a computer program product or computer program is also provided, which includes computer-readable instructions stored in a computer-readable storage medium. The processor of a computer device reads the computer-readable instructions from the computer-readable storage medium, and the processor executes the computer-readable instructions, causing the computer device to perform the methods provided in the various optional implementations of the method embodiments shown in Figure 2 or Figure 3.
Furthermore, it should be understood that the above disclosure is merely a preferred embodiment of the present application and certainly cannot be used to limit the scope of the rights of the present application; therefore, equivalent changes made according to the claims of the present application still fall within the scope covered by the present application.
The technical features of the above embodiments may be combined in any way. For brevity of description, not all possible combinations of the technical features of the above embodiments are described; however, as long as there is no contradiction in the combination of these technical features, the combination should be regarded as falling within the scope of this specification.
The above embodiments merely express several implementations of the present application, and their descriptions are relatively specific and detailed, but they should not therefore be construed as limiting the scope of the patent application. It should be noted that, for a person of ordinary skill in the art, several modifications and improvements can be made without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent application should be determined by the appended claims.

Claims (31)

  1. An immersive media processing method, performed by a media processing device, comprising:
    acquiring a video bitstream of immersive media, the immersive media being captured by multiple cameras, the information collected by any camera through capture comprising video image information and corresponding video auxiliary information;
    encapsulating the video bitstream into M video tracks to obtain a media file of the immersive media, wherein M is a positive integer and, among the M video tracks, a video media header data box of at least one video track comprises metadata information used to indicate whether the corresponding video track contains both video image information and corresponding video auxiliary information.
  2. The method of claim 1, wherein, if an m-th video track among the M video tracks encapsulates video image information of at least one camera, the video media header data box of the m-th video track comprises metadata information, m∈[1,M].
  3. The method of claim 2, wherein the metadata information comprises a content information flag field;
    when the value of the content information flag field is a first value, it indicates that the corresponding video track contains both video image information and corresponding video auxiliary information;
    when the value of the content information flag field is a second value, it indicates that the corresponding video track contains only video image information.
  4. The method of claim 3, wherein the content information flag field is any one of the following: a video depth information flag field, a video disparity information flag field, or a video auxiliary information flag field, wherein:
    when the content information flag field is a video depth information flag field, the video image information comprises texture information and the video auxiliary information comprises depth information;
    when the content information flag field is a video disparity information flag field, the video auxiliary information comprises disparity information;
    when the content information flag field is a video auxiliary information flag field, the video auxiliary information comprises depth information, disparity information, or auxiliary information defined by other applications.
  5. The method of any one of claims 1-4, wherein, if the metadata information indicates that the corresponding video track contains both video image information and corresponding video auxiliary information, the video media header data box in the corresponding video track further comprises a video information codec dependency field;
    when the value of the video information codec dependency field is a third value, it indicates that a decoding dependency exists between the video image information in the corresponding video track and the corresponding video auxiliary information;
    when the value of the video information codec dependency field is a fourth value, it indicates that no decoding dependency exists between the video image information in the corresponding video track and the corresponding video auxiliary information.
  6. The method of claim 5, wherein the video information codec dependency field is any one of the following: a depth codec dependency field, a disparity codec dependency field, or an auxiliary codec dependency field, wherein:
    when the video information codec dependency field is a depth codec dependency field, the video image information comprises texture information and the video auxiliary information comprises depth information;
    when the video information codec dependency field is a disparity codec dependency field, the video auxiliary information comprises disparity information;
    when the video information codec dependency field is an auxiliary codec dependency field, the video auxiliary information comprises depth information, disparity information, or auxiliary information defined by other applications.
  7. The method of any one of claims 1-4, wherein, among the M video tracks, at least one video track comprises a video component dependency information data box;
    wherein the video component dependency information data box is used to indicate: the component type of the video content of the immersive media, and whether a codec dependency exists between different components of the video content of the immersive media.
  8. The method of claim 7, wherein, when M=1, the M video tracks comprise the video component dependency information data box;
    when M>1, if an m-th video track among the M video tracks encapsulates one or more kinds of information collected by at least two cameras, the m-th video track comprises the video component dependency information data box.
  9. The method of claim 7, wherein the video component dependency information data box comprises a component codec dependency field;
    when the value of the component codec dependency field is a fifth value, it indicates that a codec dependency exists between different components of the video content of the immersive media;
    when the value of the component codec dependency field is a sixth value, it indicates that no codec dependency exists between different components of the video content of the immersive media.
  10. The method of claim 7, wherein the video component dependency information data box comprises a component type field;
    when the value of the component type field is a seventh value, it indicates that the component type of the video content of the immersive media is a camera-based component type, wherein one component comprises the video content captured by one camera;
    when the value of the component type field is an eighth value, it indicates that the component type of the video content of the immersive media is a binocular-based component type, wherein one component comprises left-eye video content or right-eye video content.
  11. The method of any one of claims 1-4, wherein, in the case of M>1, the video tracks among the M video tracks that encapsulate video image information further contain a track group data box;
    wherein the track group data box comprises a track group identifier field and a camera identifier field; the track group identifier field is used to store identification information of the track group corresponding to the respective video track, and the camera identifier field is used to store identification information of the camera corresponding to the respective video track.
  12. The method of any one of claims 1-4, wherein, in the case of M>1, if the video image information and the video auxiliary information of the same camera are encapsulated into different video tracks, the M video tracks comprise: one or more video image information tracks and one or more video auxiliary information tracks;
    wherein a video image information track is a video track that encapsulates video image information, and a video auxiliary information track is a video track that encapsulates video auxiliary information;
    any video auxiliary information track comprises a track reference type data box, the track reference type data box being used to reference or associate the video auxiliary information track to the video image information track corresponding to the same camera, and to indicate whether a dependency exists between the video auxiliary information track and the corresponding video image information track during decoding.
13. The method according to claim 12, wherein the track reference type data box comprises a track identification field, the track identification field being used to store identification information of the video image information track associated with the video auxiliary information track.
14. The method according to claim 12, wherein the track reference type data box comprises a reference type field;
    when the reference type field stores a first track index, it indicates that the video auxiliary information track depends on the corresponding video image information track during decoding;
    when the reference type field stores a second track index, it indicates that the video auxiliary information track does not depend on the corresponding video image information track during decoding.
15. The method according to claim 14, wherein the first track index is any one of the following:
    a first character string, used to indicate that the video auxiliary information in the video auxiliary information track comprises depth information;
    a second character string, used to indicate that the video auxiliary information in the video auxiliary information track comprises disparity information;
    a third character string, used to indicate that the video auxiliary information in the video auxiliary information track comprises depth information, disparity information, or auxiliary information defined by another application;
    and the second track index is any one of the following:
    a fourth character string, used to indicate that the video auxiliary information in the video auxiliary information track comprises depth information;
    a fifth character string, used to indicate that the video auxiliary information in the video auxiliary information track comprises disparity information;
    a sixth character string, used to indicate that the video auxiliary information in the video auxiliary information track comprises depth information, disparity information, or auxiliary information defined by another application.
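Claims 14 and 15 leave the six track-index character strings abstract. As an illustration only, a parser might map hypothetical four-character reference-type codes to a (decode-dependency, auxiliary-information kind) pair as sketched below; none of these codes appear in the claims, and the split into "1…"/"0…" prefixes is an invented naming convention.

```python
# Hypothetical reference-type codes -> (decode-dependent?, kind of aux info).
# The claims only speak of "first" .. "sixth" character strings.
REFERENCE_TYPES = {
    "1dep": (True,  "depth"),       # first string
    "1dis": (True,  "disparity"),   # second string
    "1aux": (True,  "auxiliary"),   # third string
    "0dep": (False, "depth"),       # fourth string
    "0dis": (False, "disparity"),   # fifth string
    "0aux": (False, "auxiliary"),   # sixth string
}

def decoding_dependency(reference_type: str) -> bool:
    """True if the auxiliary track must be decoded together with
    its associated video image information track (claim 14)."""
    depends, _kind = REFERENCE_TYPES[reference_type]
    return depends
```

A player could use `decoding_dependency` to decide whether the auxiliary track can be skipped without fetching the image track's decoder state.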
16. The method according to any one of claims 1 to 4, wherein after the video bitstream is encapsulated into the M video tracks, a video resource corresponding to each video track is obtained, and the media file comprises the video resources corresponding to the M video tracks;
    if the media file is transmitted by streaming, the method further comprises: generating a signaling description file corresponding to the media file;
    wherein the signaling description file comprises at least one video auxiliary information descriptor, the video auxiliary information descriptor being used to describe attributes related to the video auxiliary information of a video resource.
17. The method according to claim 16, wherein when a video auxiliary information descriptor is located at the adaptation set level of the signaling description file, it describes the attributes related to the video auxiliary information of each video resource in the corresponding adaptation set;
    when a video auxiliary information descriptor is located at the representation level of the signaling description file, it describes the attributes related to the video auxiliary information of the corresponding video resource.
18. The method according to claim 16, wherein the video auxiliary information descriptor comprises a video auxiliary information field;
    when the video auxiliary information descriptor is located at the representation level, the video auxiliary information field is used to store the attributes related to the video auxiliary information of the corresponding video resource;
    when the video auxiliary information descriptor is located at the adaptation set level, the video auxiliary information field is used to store the attributes related to the video auxiliary information of the corresponding adaptation set.
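Claims 17 and 18 place the same descriptor at two levels of the signaling description file with different scopes. A minimal sketch of that placement, using DASH-like element names (`MPD`, `AdaptationSet`, `Representation`) as assumptions and an invented `schemeIdUri` and attribute name, might look like this:

```python
import xml.etree.ElementTree as ET

def add_aux_descriptor(parent: ET.Element, aux_info: str) -> ET.Element:
    """Attach a video-auxiliary-information descriptor to either an
    AdaptationSet or a Representation element. The schemeIdUri and the
    'auxInfo' attribute are hypothetical placeholders."""
    desc = ET.SubElement(parent, "Descriptor")
    desc.set("schemeIdUri", "urn:example:video-aux-info")  # hypothetical
    desc.set("auxInfo", aux_info)
    return desc

mpd = ET.Element("MPD")
aset = ET.SubElement(mpd, "AdaptationSet")
rep = ET.SubElement(aset, "Representation", id="r0")

# Adaptation-set level: the attributes apply to every resource inside the set.
add_aux_descriptor(aset, "depth")
# Representation level: the attributes apply to that one video resource only.
add_aux_descriptor(rep, "disparity")
```

The scope rule of claim 17 falls out of the tree position alone: a client walking the document applies the set-level descriptor to all child representations unless a representation-level one overrides it.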
19. The method according to claim 16, wherein the video auxiliary information descriptor comprises an auxiliary information indication field;
    when the value of the auxiliary information indication field is a ninth value, it indicates that the corresponding video resource contains both video image information and corresponding video auxiliary information;
    when the value of the auxiliary information indication field is a tenth value, it indicates that the corresponding video resource contains only video image information.
20. The method according to claim 19, wherein the auxiliary information indication field is any one of the following: a video auxiliary information field with depth information, a video auxiliary information field with disparity information, or a video auxiliary information field with auxiliary information; wherein:
    when the auxiliary information indication field is the video auxiliary information field with depth information, the video image information comprises texture information and the video auxiliary information comprises depth information;
    when the auxiliary information indication field is the video auxiliary information field with disparity information, the video auxiliary information comprises disparity information;
    when the auxiliary information indication field is the video auxiliary information field with auxiliary information, the video auxiliary information comprises depth information, disparity information, or auxiliary information defined by another application.
21. The method according to claim 16, wherein if the video resource described by the video auxiliary information descriptor contains both video image information and corresponding video auxiliary information, the video auxiliary information descriptor comprises an auxiliary information codec dependency field;
    when the value of the auxiliary information codec dependency field is an eleventh value, it indicates that a decoding dependency exists between the video image information in the corresponding video resource and the corresponding video auxiliary information;
    when the value of the auxiliary information codec dependency field is a twelfth value, it indicates that no decoding dependency exists between the video image information in the corresponding video resource and the corresponding video auxiliary information.
22. The method according to claim 21, wherein the auxiliary information codec dependency field is any one of the following: a depth codec dependency field of the video auxiliary information, a disparity codec dependency field of the video auxiliary information, or an auxiliary codec dependency field of the video auxiliary information; wherein:
    when the auxiliary information codec dependency field is the depth codec dependency field of the video auxiliary information, the video image information comprises texture information and the video auxiliary information comprises depth information;
    when the auxiliary information codec dependency field is the disparity codec dependency field of the video auxiliary information, the video auxiliary information comprises disparity information;
    when the auxiliary information codec dependency field is the auxiliary codec dependency field of the video auxiliary information, the video auxiliary information comprises depth information, disparity information, or auxiliary information defined by another application.
23. The method according to claim 16, wherein the video auxiliary information descriptor comprises a video auxiliary identification field;
    when the value of the video auxiliary identification field is a thirteenth value, it indicates that the corresponding video resource contains only the video auxiliary information corresponding to video image information;
    when the value of the video auxiliary identification field is a fourteenth value, it indicates that the corresponding video resource contains only video image information.
24. The method according to claim 23, wherein the video auxiliary identification field is any one of the following: a depth-information-only flag field of the video auxiliary information, a disparity-information-only flag field of the video auxiliary information, or an auxiliary-information-only flag field of the video auxiliary information; wherein:
    when the video auxiliary identification field is the depth-information-only flag field of the video auxiliary information, the video image information comprises texture information and the video auxiliary information comprises depth information;
    when the video auxiliary identification field is the disparity-information-only flag field of the video auxiliary information, the video auxiliary information comprises disparity information;
    when the video auxiliary identification field is the auxiliary-information-only flag field of the video auxiliary information, the video auxiliary information comprises depth information, disparity information, or auxiliary information defined by another application.
25. The method according to claim 16, wherein the signaling description file further comprises at least one free-view camera descriptor, the free-view camera descriptor being used to describe camera information corresponding to a video resource;
    wherein the free-view camera descriptor comprises a free-view camera source identification field, the value of the free-view camera source identification field being used to indicate the content identifier to which the corresponding video resource belongs, and the video resources corresponding to different cameras under the same video content having the same content identifier;
    or, the free-view camera descriptor comprises a free-view camera combination identification field, the value of the free-view camera combination identification field being used to indicate the combination identifier to which the corresponding video resource belongs, and the video resources corresponding to different cameras under the same video content having the same combination identifier.
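The grouping rule in claim 25 (all cameras of one piece of content share one source or combination identifier) can be sketched in a few lines; the dictionary layout of a "video resource" below is an illustrative stand-in for the real descriptor, not part of the claim.

```python
from collections import defaultdict

def group_by_content(resources):
    """Group video resources by the identifier carried in the free-view
    camera descriptor: per claim 25, resources of different cameras
    under the same video content share one identifier."""
    groups = defaultdict(list)
    for res in resources:
        groups[res["source_id"]].append(res["camera_id"])
    return dict(groups)

# Hypothetical resources: two cameras of content A, one camera of content B.
resources = [
    {"source_id": "contentA", "camera_id": 0},
    {"source_id": "contentA", "camera_id": 1},
    {"source_id": "contentB", "camera_id": 0},
]
```

A free-view client would use the resulting groups to know which camera streams can be combined into one navigable scene.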
26. The method according to claim 16, wherein when M > 1, indicating that the immersive media is encapsulated in multi-track form, the video resource or adaptation set corresponding to each video track is indicated by a preselection tool.
27. A method for processing immersive media, performed by a media processing device, comprising:
    obtaining a target video track in a media file of immersive media, the immersive media being captured by a plurality of cameras, wherein the information collected by any camera through capturing comprises video image information and corresponding video auxiliary information; and the media file is obtained by encapsulating a video bitstream of the immersive media into M video tracks, where M is a positive integer, and among the M video tracks there is at least one video track whose video media header data box comprises metadata information, the metadata information being used to indicate whether the corresponding video track contains both video image information and corresponding video auxiliary information;
    decoding the information in the target video track.
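The selection step in claim 27 hinges on the metadata flag in the video media header. A minimal client-side sketch, with a plain dict standing in for the real media header data box (the field name `contains_image_and_aux` is an assumption), could read:

```python
def select_target_track(tracks, want_auxiliary: bool):
    """Pick a target track per claim 27: inspect the flag in each track's
    video media header that says whether the track carries both video
    image information and video auxiliary information."""
    for track in tracks:
        has_both = track["media_header"]["contains_image_and_aux"]
        if has_both == want_auxiliary:
            return track
    return None

# Hypothetical two-track file: track 1 image-only, track 2 image + auxiliary.
tracks = [
    {"id": 1, "media_header": {"contains_image_and_aux": False}},
    {"id": 2, "media_header": {"contains_image_and_aux": True}},
]
```

A player that needs depth or disparity data would pass `want_auxiliary=True` and hand the returned track to its decoder; one that only renders texture would pass `False`.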
28. The method according to claim 27, wherein the media file is produced and encapsulated by a media production device; after the video bitstream is encapsulated into the M video tracks, a video resource corresponding to each video track is obtained, and the media file comprises the video resources corresponding to the M video tracks; and the obtaining a target video track in a media file of immersive media comprises:
    establishing streaming transmission with the media production device;
    obtaining, from the media production device, a signaling description file corresponding to the media file of the immersive media, the signaling description file comprising at least one video auxiliary information descriptor, the video auxiliary information descriptor being used to describe attributes related to the video auxiliary information of a video resource;
    requesting, according to the indication of the signaling description file, the media production device to deliver a target video resource; and decapsulating the target video resource to obtain the target video track.
29. An apparatus for processing immersive media, comprising:
    a processing unit, configured to obtain a video bitstream of immersive media, the immersive media being captured by a plurality of cameras, wherein the information collected by any camera through capturing comprises video image information and corresponding video auxiliary information;
    an encapsulation unit, configured to encapsulate the video bitstream into M video tracks to obtain a media file of the immersive media, where M is a positive integer, and among the M video tracks there is at least one video track whose video media header data box comprises metadata information, the metadata information being used to indicate whether the corresponding video track contains both video image information and corresponding video auxiliary information.
30. An apparatus for processing immersive media, comprising:
    an obtaining unit, configured to obtain a target video track in a media file of immersive media, the immersive media being captured by a plurality of cameras, wherein the information collected by any camera through capturing comprises video image information and corresponding video auxiliary information; and the media file is obtained by encapsulating a video bitstream of the immersive media into M video tracks, where M is a positive integer, and among the M video tracks there is at least one video track whose video media header data box comprises metadata information, the metadata information being used to indicate whether the corresponding video track contains both video image information and corresponding video auxiliary information;
    a decoding unit, configured to decode the information in the target video track.
31. A media processing device, comprising an input interface and an output interface, and further comprising:
    a processor, adapted to execute one or more computer-readable instructions; and a computer-readable storage medium storing one or more computer-readable instructions;
    wherein the one or more computer-readable instructions are adapted to be loaded by the processor to perform the method for processing immersive media according to any one of claims 1 to 26; or the one or more instructions are adapted to be loaded by the processor to perform the method for processing immersive media according to claim 27 or 28.
PCT/CN2022/137788 2022-04-26 2022-12-09 Immersive media processing method and apparatus, device, and storage medium WO2023207119A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210445405.0 2022-04-26
CN202210445405.0A CN114697631B (en) 2022-04-26 2022-04-26 Immersion medium processing method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2023207119A1 true WO2023207119A1 (en) 2023-11-02

Family

ID=82144412

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/137788 WO2023207119A1 (en) 2022-04-26 2022-12-09 Immersive media processing method and apparatus, device, and storage medium

Country Status (2)

Country Link
CN (2) CN116962756A (en)
WO (1) WO2023207119A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116962756A (en) * 2022-04-26 2023-10-27 腾讯科技(深圳)有限公司 Method, device, equipment and storage medium for processing immersion medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018058773A1 (en) * 2016-09-30 2018-04-05 华为技术有限公司 Video data processing method and apparatus
CN108696471A (en) * 2017-02-21 2018-10-23 科通环宇(北京)科技有限公司 Code stream packaging method, code stream, coding/decoding method and device based on AVS2
WO2020070379A1 (en) * 2018-10-03 2020-04-09 Nokia Technologies Oy Method and apparatus for storage and signaling of compressed point clouds
WO2021102953A1 (en) * 2019-11-29 2021-06-03 Zte Corporation Multi-view video processing method and apparatus
CN114697631A (en) * 2022-04-26 2022-07-01 腾讯科技(深圳)有限公司 Immersion medium processing method, device, equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2989798B1 (en) * 2013-04-23 2018-10-03 Dolby Laboratories Licensing Corporation Transmitting display management metadata over hdmi
WO2021064293A1 (en) * 2019-10-02 2021-04-08 Nokia Technologies Oy Method and apparatus for storage and signaling of sub-sample entry descriptions


Also Published As

Publication number Publication date
CN116962756A (en) 2023-10-27
CN114697631A (en) 2022-07-01
CN114697631B (en) 2023-03-21


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22939932

Country of ref document: EP

Kind code of ref document: A1